Seqminer2: an efficient tool to query and retrieve genotypes for statistical genetics analyses from biobank scale sequence dataset

Lina Yang; Shuang Jiang; Bibo Jiang; Dajiang J Liu; Xiaowei Zhan

doi:10.1093/bioinformatics/btaa628

Bioinformatics ◽

10.1093/bioinformatics/btaa628 ◽

2020 ◽

Vol 36 (19) ◽

pp. 4951-4954

Author(s):

Lina Yang ◽

Shuang Jiang ◽

Bibo Jiang ◽

Dajiang J Liu ◽

Xiaowei Zhan

Keyword(s):

Genetic Variants ◽

Method Development ◽

State Of The Art ◽

R Package ◽

Supplementary Information ◽

Efficient Tool ◽

File Formats ◽

Software Prototyping ◽

Novel Variant ◽

Scale Sequence

Abstract

Summary: Here, we present a highly efficient R-package seqminer2 for querying and retrieving sequence variants from biobank scale datasets of millions of individuals and hundreds of millions of genetic variants. Seqminer2 implements a novel variant-based index for querying VCF/BCF files. It improves the speed of query and retrieval by several orders of magnitude compared to the state-of-the-art tools based upon tabix. It also reimplements support for the BGEN and PLINK formats, which improves speed over alternative implementations. The improved efficiency and comprehensive support for popular file formats will facilitate method development, software prototyping and data analysis of biobank scale sequence datasets in R.

Availability and implementation: The seqminer2 R package is available from https://github.com/zhanxw/seqminer. Scripts used for the benchmarks are available in https://github.com/yang-lina/seqminer/blob/master/seqminer2%20benchmark%20script.txt.

Supplementary information: Supplementary data are available at Bioinformatics online.

CPS analysis: self-contained validation of biomedical data clustering

Bioinformatics ◽

10.1093/bioinformatics/btaa165 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3516-3521 ◽

Cited By ~ 1

Author(s):

Lixiang Zhang ◽

Lin Lin ◽

Jia Li

Keyword(s):

Data Clustering ◽

State Of The Art ◽

R Package ◽

Research Community ◽

Supplementary Information ◽

Biomedical Data ◽

Data Generation ◽

Supplementary Data ◽

Point Set ◽

Class Labels

Abstract

Motivation: Cluster analysis is widely used to identify interesting subgroups in biomedical data. Since true class labels are unknown in the unsupervised setting, it is challenging to validate any cluster obtained computationally, an important problem barely addressed by the research community.

Results: We have developed a toolkit called covering point set (CPS) analysis to quantify uncertainty at the levels of individual clusters and overall partitions. Functions have been developed to effectively visualize the inherent variation in any cluster for data of high dimension, and provide a more comprehensive view of potentially interesting subgroups in the data. Applying it to three usage scenarios for biomedical data, we demonstrate that CPS analysis is more effective for evaluating uncertainty of clusters compared to state-of-the-art measurements. We also showcase how to use CPS analysis to select data generation technologies or visualization methods.

Availability and implementation: The method is implemented in an R package called OTclust, available on CRAN.

Supplementary information: Supplementary data are available at Bioinformatics online.

VINYL: Variant prIoritizatioN by survivaL analysis

Bioinformatics ◽

10.1093/bioinformatics/btaa1067 ◽

2020 ◽

Author(s):

Matteo Chiara ◽

Pietro Mandreoli ◽

Marco Antonio Tangaro ◽

Anna Maria D’Erchia ◽

Sandro Sorrentino ◽

...

Keyword(s):

Genetic Variants ◽

Functional Annotation ◽

State Of The Art ◽

Clinical Applications ◽

Automated System ◽

Supplementary Information ◽

Variant Prioritization ◽

Sequencing Technologies ◽

Pathological Conditions ◽

Equivalent State

Abstract

Motivation: Clinical applications of genome re-sequencing technologies typically generate large amounts of data that need to be carefully annotated and interpreted to identify genetic variants potentially associated with pathological conditions. In this context, accurate and reproducible methods for the functional annotation and prioritization of genetic variants are of fundamental importance.

Results: In this paper, we present VINYL, a flexible and fully automated system for the functional annotation and prioritization of genetic variants. Extensive analyses of both real and simulated datasets suggest that VINYL can identify clinically relevant genetic variants in a more accurate manner compared to equivalent state of the art methods, allowing a more rapid and effective prioritization of genetic variants in different experimental settings. As such we believe that VINYL can establish itself as a valuable tool to assist healthcare operators and researchers in clinical genomics investigations.

Availability: VINYL is available at http://beaconlab.it/VINYL and https://github.com/matteo14c/VINYL.

Supplementary information: Supplementary data are available at Bioinformatics online.

EWASex: an efficient R-package to predict sex in epigenome-wide association studies

Bioinformatics ◽

10.1093/bioinformatics/btaa949 ◽

2020 ◽

Author(s):

Jesper Beltoft Lund ◽

Weilong Li ◽

Afsaneh Mohammadnejad ◽

Shuxia Li ◽

Jan Baumbach ◽

...

Keyword(s):

X Chromosome ◽

State Of The Art ◽

Association Studies ◽

R Package ◽

Important Variable ◽

Supplementary Information ◽

Sex Estimation ◽

Current State ◽

Powerful Approach ◽

Small Set

Abstract

Summary: Epigenome-Wide Association Study (EWAS) has become a powerful approach to identify epigenetic variations associated with diseases or health traits. Sex is an important variable to include in EWAS to ensure unbiased data processing and statistical analysis. We introduce the R-package EWASex, which allows for fast and highly accurate sex-estimation using DNA methylation data on a small set of CpG sites located on the X-chromosome under stable X-chromosome inactivation in females.

Results: We demonstrate that EWASex outperforms the current state of the art tools by using different EWAS datasets. With EWASex, we offer an efficient way to predict and to verify sex that can be easily implemented in any EWAS using blood samples or even other tissue types. It comes with pre-trained weights to work without prior sex labels and without requiring access to RAW data, which is a necessity for all currently available methods.

Availability and implementation: The EWASex R-package along with tutorials, documentation and source code are available at https://github.com/Silver-Hawk/EWASex.

Supplementary information: Supplementary data are available at Bioinformatics online.

Nubeam-dedup: a fast and RAM-efficient tool to de-duplicate sequencing reads without mapping

Bioinformatics ◽

10.1093/bioinformatics/btaa112 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3254-3256 ◽

Cited By ~ 2

Author(s):

Hang Dai ◽

Yongtao Guan

Keyword(s):

Hash Function ◽

Reference Genome ◽

State Of The Art ◽

Source Code ◽

Supplementary Information ◽

Supplementary Data ◽

Efficient Tool ◽

Cpu Time ◽

Products Of Matrices

Abstract

Summary: We present Nubeam-dedup, a fast and RAM-efficient tool to de-duplicate sequencing reads without a reference genome. Nubeam-dedup represents nucleotides by matrices, transforms reads into products of matrices, and on that basis assigns a unique number to each read. Thus, duplicate reads can be efficiently removed by using a collisionless hash function. Compared with other state-of-the-art reference-free tools, Nubeam-dedup uses 50–70% of CPU time and 10–15% of RAM.

Availability and implementation: Source code in C++ and manual are available at https://github.com/daihang16/nubeamdedup and https://haplotype.org.

Supplementary information: Supplementary data are available at Bioinformatics online.
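The matrix-product idea in the Nubeam-dedup summary can be illustrated with a minimal Python sketch. It is not the Nubeam-dedup implementation: the 2x2 matrices, the nucleotide encoding and the names `read_key` and `dedup` are assumptions chosen for illustration, and the sketch uses the product matrix itself as the key rather than deriving a single number from it. The matrices L and R below generate a free monoid under multiplication, so two different reads never yield the same product.

```python
# Illustrative sketch only: these matrices are NOT the ones used by Nubeam-dedup.
# L and R generate a free monoid, so every distinct word over {L, R}
# has a distinct matrix product.
L = ((1, 0), (1, 1))
R = ((1, 1), (0, 1))
IDENTITY = ((1, 0), (0, 1))


def matmul(a, b):
    """Multiply two 2x2 integer matrices stored as nested tuples."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


# Hypothetical encoding: each nucleotide is a fixed two-letter word over {L, R}.
NUCLEOTIDE = {
    "A": matmul(L, L),
    "C": matmul(L, R),
    "G": matmul(R, L),
    "T": matmul(R, R),
}


def read_key(read):
    """Return the product of the nucleotide matrices, in read order."""
    product = IDENTITY
    for base in read:
        product = matmul(product, NUCLEOTIDE[base])
    return product


def dedup(reads):
    """Keep the first occurrence of each read, comparing reads by matrix key."""
    seen = set()
    unique = []
    for read in reads:
        key = read_key(read)
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique
```

Python integers do not overflow, so the keys stay exact for reads of any length; a production tool would need a fixed-width representation, which this sketch does not attempt.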

A statistical simulator scDesign for rational scRNA-seq experimental design

Bioinformatics ◽

10.1093/bioinformatics/btz321 ◽

2019 ◽

Vol 35 (14) ◽

pp. i41-i50 ◽

Cited By ~ 9

Author(s):

Wei Vivian Li ◽

Jingyi Jessica Li

Keyword(s):

Biological Sciences ◽

Gene Expression ◽

Experimental Design ◽

Method Development ◽

Cell Types ◽

R Package ◽

Computational Method ◽

Supplementary Information ◽

Reduction Methods ◽

Sequencing Platforms

Abstract

Motivation: Single-cell RNA sequencing (scRNA-seq) has revolutionized biological sciences by revealing genome-wide gene expression levels within individual cells. However, a critical challenge faced by researchers is how to optimize the choices of sequencing platforms, sequencing depths and cell numbers in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information.

Results: Here we present a flexible and robust simulator, scDesign, the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings. In an evaluation based on 17 cell types and 6 different protocols, scDesign outperformed four state-of-the-art scRNA-seq simulation methods and led to rational experimental design. In addition, scDesign demonstrates reproducibility across biological replicates and independent studies. We also discuss the performance of multiple differential expression and dimension reduction methods based on the protocol-dependent scRNA-seq data generated by scDesign. scDesign is expected to be an effective bioinformatic tool that assists rational scRNA-seq experimental design and comparison of scRNA-seq computational methods based on specific research goals.

Availability and implementation: We have implemented our method in the R package scDesign, which is freely available at https://github.com/Vivianstats/scDesign.

Supplementary information: Supplementary data are available at Bioinformatics online.

iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics

Bioinformatics ◽

10.1093/bioinformatics/btz961 ◽

2020 ◽

Vol 36 (8) ◽

pp. 2611-2613 ◽

Cited By ~ 5

Author(s):

Thang V Pham ◽

Alex A Henneman ◽

Connie R Jimenez

Keyword(s):

Open Source ◽

State Of The Art ◽

R Package ◽

Protein Quantification ◽

Supplementary Information ◽

Label Free ◽

Software Suite ◽

Data Independent Acquisition ◽

Free Data ◽

Acquisition Mode

Abstract

Summary: We present an R package called iq to enable accurate protein quantification for label-free data-independent acquisition (DIA) mass spectrometry-based proteomics, a recently developed global approach with superior quantitative consistency. We implement the popular maximal peptide ratio extraction module of the MaxLFQ algorithm, so far only applicable to data-dependent acquisition mode using the software suite MaxQuant. Moreover, our implementation shows, for each protein separately, the validity of quantification over all samples. Hence, iq exports a state-of-the-art protein quantification algorithm to the emerging DIA mode in an open-source implementation.

Availability and implementation: The open-source R package is available on CRAN, https://github.com/tvpham/iq/releases and oncoproteomics.nl/iq.

Supplementary information: Supplementary data are available at Bioinformatics online.

Ultra-fast scalable estimation of single-cell differentiation potency from scRNA-Seq data

Bioinformatics ◽

10.1093/bioinformatics/btaa987 ◽

2020 ◽

Author(s):

Andrew E Teschendorff ◽

Alok K Maity ◽

Xue Hu ◽

Chen Weiyan ◽

Matthias Lechner

Keyword(s):

Single Cell ◽

State Of The Art ◽

Computational Cost ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Rna Seq ◽

Current State ◽

Multipotent Cells ◽

Comparable Accuracy

Abstract

Motivation: An important task in the analysis of single-cell RNA-Seq data is the estimation of differentiation potency, as this can help identify stem-or-multipotent cells in non-temporal studies or in tissues where differentiation hierarchies are not well established. A key challenge in the estimation of single-cell potency is the need for a fast and accurate algorithm, scalable to large scRNA-Seq studies profiling millions of cells.

Results: Here, we present a single-cell potency measure, called Correlation of Connectome and Transcriptome (CCAT), which can return accurate single-cell potency estimates of a million cells in minutes, a 100-fold improvement over current state-of-the-art methods. We benchmark CCAT against 8 other single-cell potency models and across 28 scRNA-Seq studies, encompassing over 2 million cells, demonstrating accuracy comparable to the current state-of-the-art, at a significantly reduced computational cost, and with increased robustness to dropouts.

Availability and implementation: CCAT is part of the SCENT R-package, freely available from https://github.com/aet21/SCENT.

Supplementary information: Supplementary data are available at Bioinformatics online.

ExperimentSubset: an R package to manage subsets of Bioconductor Experiment objects

Bioinformatics ◽

10.1093/bioinformatics/btab179 ◽

2021 ◽

Author(s):

Irzam Sarfraz ◽

Muhammad Asif ◽

Joshua D Campbell

Keyword(s):

Single Cell ◽

R Package ◽

Poor Quality ◽

Data Matrix ◽

Supplementary Information ◽

Data Provenance ◽

Rna Seq ◽

Efficient Management ◽

The Matrix ◽

The Relationship

Abstract

Motivation: R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying down-stream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance.

Results: To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, the ExperimentSubset is a flexible container for the efficient management of subsets.

Availability and implementation: The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset.

Supplementary information: Supplementary data are available at Bioinformatics online.
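The contrast between copying an assay and recording a subset against its parent can be sketched in a few lines of Python. This is a conceptual illustration only, not the ExperimentSubset API (which is an R/Bioconductor package): the class `SubsetContainer` and its methods `create_subset`, `get` and `lineage` are hypothetical names, and storing a subset as row and column indices is an assumption about one way such a container could work.

```python
class SubsetContainer:
    """Toy container: subsets are stored as indices into a parent, not as copies."""

    def __init__(self, assays):
        # Full assays: name -> matrix held as a list of rows.
        self.assays = dict(assays)
        # Subsets: name -> (parent name, row indices, column indices).
        self.subsets = {}

    def create_subset(self, name, parent, rows, cols):
        """Record a subset of `parent` without copying any assay values."""
        if parent not in self.assays and parent not in self.subsets:
            raise KeyError(f"unknown parent: {parent}")
        self.subsets[name] = (parent, list(rows), list(cols))

    def get(self, name):
        """Materialize an assay or subset on demand by walking up to its parent."""
        if name in self.assays:
            return self.assays[name]
        parent, rows, cols = self.subsets[name]
        data = self.get(parent)
        return [[data[r][c] for c in cols] for r in rows]

    def lineage(self, name):
        """Return the provenance chain from a subset back to its root assay."""
        chain = [name]
        while name in self.subsets:
            name = self.subsets[name][0]
            chain.append(name)
        return chain


counts = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
container = SubsetContainer({"counts": counts})
container.create_subset("qc", "counts", rows=[0, 2], cols=[1, 2])
container.create_subset("hv", "qc", rows=[1], cols=[0])
```

Here `container.lineage("hv")` walks from the nested subset back to the root assay, which is the provenance property the abstract describes; the trade-off in this sketch is that every retrieval recomputes the subset from its parent.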

BloodGen3Module: Blood transcriptional module repertoire analysis and visualization using R

Bioinformatics ◽

10.1093/bioinformatics/btab121 ◽

2021 ◽

Author(s):

Darawan Rinchai ◽

Jessica Roelands ◽

Mohammed Toufiq ◽

Wouter Hendrickx ◽

Matthew C Altman ◽

...

Keyword(s):

Transcript Abundance ◽

R Package ◽

Supplementary Information ◽

Illustrative Case ◽

Bioinformatic Tools ◽

Transcriptional Module ◽

Wide Range ◽

Downstream Analysis ◽

Computing Module ◽

Parallel Workflow

Abstract

Motivation: We previously described the construction and characterization of generic and reusable blood transcriptional module repertoires. More recently we released a third iteration (“BloodGen3” module repertoire) that comprises 382 functionally annotated gene sets (modules) and encompasses 14,168 transcripts. Custom bioinformatic tools are needed to support downstream analysis, visualization and interpretation relying on such fixed module repertoires.

Results: We have developed and describe here an R package, BloodGen3Module. The functions of our package permit group comparison analyses to be performed at the module level, and the results to be displayed as annotated fingerprint grid plots. A parallel workflow for computing module repertoire changes for individual samples rather than groups of samples is also available; these results are displayed as fingerprint heatmaps. An illustrative case is used to demonstrate the steps involved in generating blood transcriptome repertoire fingerprints of septic patients. Taken together, this resource could facilitate the analysis and interpretation of changes in blood transcript abundance observed across a wide range of pathological and physiological states.

Availability: The BloodGen3Module package and documentation are freely available from GitHub: https://github.com/Drinchai/BloodGen3Module.

Supplementary information: Supplementary data are available at Bioinformatics online.

CorGAT: a tool for the functional annotation of SARS-CoV-2 genomes

Bioinformatics ◽

10.1093/bioinformatics/btaa1047 ◽

2020 ◽

Author(s):

Matteo Chiara ◽

Federico Zambelli ◽

Marco Antonio Tangaro ◽

Pietro Mandreoli ◽

David S Horner ◽

...

Keyword(s):

Functional Annotation ◽

Ad Hoc ◽

State Of The Art ◽

Supplementary Information ◽

Genomic Sequences ◽

Supplementary Data ◽

Evolutionary Patterns ◽

Genomic Variants ◽

Art Methods ◽

Available Resources

Abstract

Summary: While over 200 000 genomic sequences are currently available through dedicated repositories, ad hoc methods for the functional annotation of SARS-CoV-2 genomes do not harness all currently available resources for the annotation of functionally relevant genomic sites. Here, we present CorGAT, a novel tool for the functional annotation of SARS-CoV-2 genomic variants. By comparisons with other state of the art methods we demonstrate that, by providing a more comprehensive and rich annotation, our method can facilitate the identification of evolutionary patterns in the genome of SARS-CoV-2.

Availability and implementation: Galaxy: http://corgat.cloud.ba.infn.it/galaxy; software: https://github.com/matteo14c/CorGAT/tree/Revision_V1; docker: https://hub.docker.com/r/laniakeacloud/galaxy_corgat.

Supplementary information: Supplementary data are available at Bioinformatics online.
