Fast alignment and preprocessing of chromatin profiles with Chromap

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Haowen Zhang ◽  
Li Song ◽  
Xiaotao Wang ◽  
Haoyu Cheng ◽  
Chenfei Wang ◽  
...  

Abstract As the sequencing depth of chromatin studies grows ever deeper for sensitive profiling of regulatory elements or chromatin spatial structures, aligning and preprocessing these sequencing data have become the bottleneck of analysis. Here we present Chromap, an ultrafast method for aligning and preprocessing high-throughput chromatin profiles. Chromap is comparable to BWA-MEM and Bowtie2 in alignment accuracy and is over 10 times faster than both traditional workflows on bulk ChIP-seq/Hi-C profiles and the 10x Genomics CellRanger v2.0.0 pipeline on single-cell ATAC-seq profiles.

Author(s):  
Mingxuan Gao ◽  
Mingyi Ling ◽  
Xinwei Tang ◽  
Shun Wang ◽  
Xu Xiao ◽  
...  

Abstract With the development of single-cell RNA sequencing (scRNA-seq) technology, it has become possible to perform large-scale transcript profiling for tens of thousands of cells in a single experiment. Many analysis pipelines have been developed for data generated from different high-throughput scRNA-seq platforms, which presents users with the challenge of choosing a workflow that is efficient, robust, and reliable for a specific sequencing platform. Moreover, as the amount of public scRNA-seq data has increased rapidly, integrated analysis of scRNA-seq data from different sources has become increasingly popular. However, it remains unclear whether such integrated analysis would be biased if the data were processed by different upstream pipelines. In this study, we encapsulated seven existing high-throughput scRNA-seq data processing pipelines with Nextflow, a general integrative workflow management framework, and evaluated their performance in terms of running time, computational resource consumption, and data analysis consistency using eight public datasets generated from five different high-throughput scRNA-seq platforms. Our work provides a useful guideline for the selection of scRNA-seq data processing pipelines based on their performance on different real datasets. In addition, these guidelines can serve as a performance evaluation framework for future developments in high-throughput scRNA-seq data processing.
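The running-time and resource comparison the study describes can be sketched with a small harness. This is an illustrative stand-in, not the authors' Nextflow setup: the `benchmark` function and the stub command are assumptions for demonstration.

```python
import subprocess
import sys
import time

def benchmark(cmd):
    """Run one pipeline command and record wall-clock time and exit status.

    A minimal sketch of the kind of measurement the study reports
    (running time, resource consumption); a real harness would also
    track peak memory and CPU time per pipeline stage.
    """
    start = time.perf_counter()
    proc = subprocess.run(cmd, capture_output=True)
    return {"seconds": time.perf_counter() - start,
            "returncode": proc.returncode}

# Stand-in for invoking an actual scRNA-seq pipeline executable.
result = benchmark([sys.executable, "-c", "print('pipeline stub')"])
```

Running each pipeline under an identical wrapper like this is what makes the cross-pipeline timings comparable.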


GigaScience ◽  
2020 ◽  
Vol 9 (8) ◽  
Author(s):  
Andre Macedo ◽  
Alisson M Gontijo

ABSTRACT Background The human body is made up of hundreds—perhaps thousands—of cell types and states, most of which are currently inaccessible genetically. Intersectional genetic approaches can increase the number of genetically accessible cells, but the scope and safety of these approaches have not been systematically assessed. A typical intersectional method acts like an “AND” logic gate by converting the input of 2 or more active, yet unspecific, regulatory elements (REs) into a single cell type specific synthetic output. Results Here, we systematically assessed the intersectional genetics landscape of the human genome using a subset of cells from a large RE usage atlas (Functional ANnoTation Of the Mammalian genome 5 consortium, FANTOM5) obtained by cap analysis of gene expression sequencing (CAGE-seq). We developed the heuristics and algorithms to retrieve and quality-rank “AND” gate intersections. Of the 154 primary cell types surveyed, >90% can be distinguished from each other with as few as 3 to 4 active REs, with quantifiable safety and robustness. We call these minimal intersections of active REs with cell-type diagnostic potential “versatile entry codes” (VEnCodes). Each of the 158 cancer cell types surveyed could also be distinguished from the healthy primary cell types with small VEnCodes, most of which were robust to intra- and interindividual variation. Methods for the cross-validation of CAGE-seq–derived VEnCodes and for the extraction of VEnCodes from pooled single-cell sequencing data are also presented. Conclusions Our work provides a systematic view of the intersectional genetics landscape in humans and demonstrates the potential of these approaches for future gene delivery technologies.
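The idea of ranking minimal “AND” gate intersections can be sketched as a greedy set-cover over RE activity data. This is a simplified illustration, not the authors' quality-ranking algorithm: the function name, the greedy criterion, and the data layout are assumptions.

```python
def greedy_vencode(activity, target, max_res=4):
    """Greedily pick regulatory elements (REs) active in `target` whose
    joint activity excludes all other cell types, mimicking an "AND" gate.

    activity: dict mapping cell type -> set of active RE ids.
    Returns a list of REs active only (jointly) in `target`, or None if
    max_res REs do not suffice.
    """
    others = {ct for ct in activity if ct != target}
    chosen = []
    candidates = set(activity[target])
    while others and len(chosen) < max_res:
        if not candidates:
            return None
        # Pick the RE that eliminates the most remaining cell types.
        best = max(candidates,
                   key=lambda r: sum(r not in activity[ct] for ct in others))
        chosen.append(best)
        others = {ct for ct in others if best in activity[ct]}
        candidates.discard(best)
    return chosen if not others else None

# Toy atlas: cell type "A" is distinguishable with two REs.
atlas = {"A": {1, 2, 3}, "B": {1, 2}, "C": {2, 3}, "D": {3, 4}}
code = greedy_vencode(atlas, "A")
```

The abstract's finding that 3 to 4 REs usually suffice corresponds to `max_res=4` succeeding for most primary cell types.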




2020 ◽  
Author(s):  
Li Lin ◽  
Minfang Song ◽  
Yong Jiang ◽  
Xiaojing Zhao ◽  
Haopeng Wang ◽  
...  

ABSTRACT Normalization with respect to sequencing depth is a crucial step in single-cell RNA sequencing preprocessing. Most methods normalize data using the whole transcriptome, based on the assumption that the majority of the transcriptome remains constant, and are unable to detect drastic changes in the transcriptome. Here, we develop an algorithm, ISnorm, based on a small fraction of constantly expressed genes as internal spike-ins to normalize single-cell RNA sequencing data. We demonstrate that the transcriptome of single cells may undergo drastic changes in several case study datasets and that accounting for such heterogeneity by ISnorm improves the performance of downstream analyses.
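The internal spike-in idea can be sketched as follows: derive per-cell size factors from a small set of presumed-constant genes rather than the whole transcriptome. This is a minimal sketch of the concept, not the ISnorm algorithm itself (which selects the constant gene set data-adaptively); the function names and the choice of a simple mean are assumptions.

```python
import numpy as np

def spikein_size_factors(counts, spikein_idx):
    """Per-cell size factors from a set of presumed-constant genes.

    counts: genes x cells count matrix.
    spikein_idx: row indices of the internal spike-in genes.
    Only the spike-in set, not the whole transcriptome, drives the
    scaling, so global shifts in the rest of the transcriptome do not
    distort the factors.
    """
    sub = counts[spikein_idx, :].astype(float)
    per_cell = sub.mean(axis=0)            # average spike-in signal per cell
    return per_cell / per_cell.mean()      # center factors around 1

def normalize(counts, factors):
    return counts / factors                # broadcast over cells (columns)

# Toy matrix: genes 0-1 are the constant "spike-ins"; gene 2 varies.
counts = np.array([[10, 20], [30, 60], [5, 100]])
factors = spikein_size_factors(counts, [0, 1])
```

After scaling, the spike-in genes have equal normalized expression across cells, while genuine changes in other genes are preserved.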


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yasemin Guenay-Greunke ◽  
David A. Bohan ◽  
Michael Traugott ◽  
Corinna Wallinger

Abstract High-throughput sequencing platforms are increasingly being used for targeted amplicon sequencing because they enable cost-effective sequencing of large sample sets. For meaningful interpretation of targeted amplicon sequencing data and comparison between studies, it is critical that bioinformatic analyses do not introduce artefacts and rely on detailed protocols to ensure that all methods are properly performed and documented. The analysis of large sample sets and the use of predefined indexes create challenges, such as adjusting the sequencing depth across samples and taking sequencing errors or index hopping into account. However, the potential biases these factors introduce into high-throughput amplicon sequencing data sets, and how they may be overcome, have rarely been addressed. Using the example of a nested metabarcoding analysis of 1920 carabid beetle regurgitates to assess plant feeding, we investigated: (i) the variation in sequencing depth of individually tagged samples and the effect of library preparation on the data output; (ii) the influence of sequencing errors within index regions and their consequences for demultiplexing; and (iii) the effect of index hopping. Our results demonstrate that, despite library quantification, large variation in read counts and sequencing depth occurred among samples, and that the sequencing error rate allowed for in bioinformatic software is essential for accurate adapter/primer trimming and demultiplexing. Moreover, setting an index hopping threshold to avoid incorrect assignment of samples is highly recommended.
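Error-tolerant demultiplexing of the kind the abstract discusses can be sketched with a Hamming-distance match against the expected index sequences. This is an illustrative sketch, not the authors' pipeline; the function names and the one-mismatch default are assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def demultiplex(read_index, sample_indexes, max_mismatch=1):
    """Assign a read to a sample by its index sequence.

    Tolerates sequencing errors up to max_mismatch within the index
    region; reads matching zero samples or more than one sample
    (ambiguous) are dropped rather than guessed.
    """
    hits = [name for name, idx in sample_indexes.items()
            if hamming(read_index, idx) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None

indexes = {"s1": "ACGT", "s2": "TGCA"}
```

A per-sample read-count threshold on unexpected index combinations (the index-hopping threshold the abstract recommends) would then be applied downstream of this assignment step.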




2017 ◽  
Vol 7 (1) ◽  
Author(s):  
Simone Rizzetto ◽  
Auda A. Eltahla ◽  
Peijie Lin ◽  
Rowena Bull ◽  
Andrew R. Lloyd ◽  
...  

Gene ◽  
2014 ◽  
Vol 545 (1) ◽  
pp. 80-87 ◽  
Author(s):  
Sachin Pundhir ◽  
Tine Dahlbæk Hannibal ◽  
Claus Heiner Bang-Berthelsen ◽  
Anne-Marie Karin Wegener ◽  
Flemming Pociot ◽  
...  

2019 ◽  
Author(s):  
Austin N. Southard Smith ◽  
Alan J. Simmons ◽  
Bob Chen ◽  
Angela L. Jones ◽  
Marisol A. Ramirez Solano ◽  
...  

Abstract The increasing demands of single-cell RNA-sequencing (scRNA-seq) experiments, such as the number of experiments and the number of cells queried per experiment, necessitate higher sequencing depth coupled with high data quality. New high-throughput sequencers, such as the Illumina NovaSeq 6000, enable this demand to be met in a cost-effective manner. However, current scRNA-seq library designs present compatibility challenges with newer sequencing technologies, such as index hopping, and their ability to generate high-quality data has yet to be systematically evaluated. Here, we engineered a new dual-indexed library structure, called TruDrop, on top of the inDrop scRNA-seq platform to solve these compatibility challenges, such that TruDrop libraries and standard Illumina libraries can be sequenced alongside each other on the NovaSeq. We overcame the index-hopping issue, demonstrated significant improvements in base-calling accuracy, and provided an example of multiplexing twenty-four scRNA-seq libraries simultaneously. We showed favorable comparisons in transcriptional diversity of TruDrop relative to prior library structures. Our approach enables cost-effective, high-throughput generation of sequencing data with high quality, which should enable more routine use of scRNA-seq technologies.
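Why dual indexing defeats index hopping can be sketched directly: a read is kept only when both of its indexes agree on the same sample, so a hopped index lands on an unexpected combination and is discarded. This is a conceptual sketch of dual-index demultiplexing in general, not the TruDrop implementation; the function names and one-mismatch tolerance are assumptions.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def assign_dual_index(i7, i5, samples, max_mismatch=1):
    """Dual-index demultiplexing.

    samples: dict mapping sample name -> (i7_sequence, i5_sequence).
    A read is assigned only when its i7 and i5 indexes resolve to the
    SAME sample; indexes matching different samples indicate an
    index-hopped read and are rejected.
    """
    def match(observed, pos):
        hits = [name for name, pair in samples.items()
                if hamming(observed, pair[pos]) <= max_mismatch]
        return hits[0] if len(hits) == 1 else None

    s7, s5 = match(i7, 0), match(i5, 1)
    if s7 is None or s5 is None or s7 != s5:
        return None  # unresolvable or hopped read
    return s7

libraries = {"a": ("AAAA", "CCCC"), "b": ("GGGG", "TTTT")}
```

With single indexing, the hopped read in the second case below would have been silently misassigned; the dual-index check turns it into a detectable rejection.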


2020 ◽  
Author(s):  
Jared Brown ◽  
Zijian Ni ◽  
Chitrasen Mohanty ◽  
Rhonda Bacher ◽  
Christina Kendziorski

Abstract
Motivation: Normalization to remove technical or experimental artifacts is critical in the analysis of single-cell RNA-sequencing experiments, even those for which unique molecular identifiers (UMIs) are available. The majority of methods for normalizing single-cell RNA-sequencing data adjust average expression for sequencing depth, but allow the variance and other properties of the gene-specific expression distribution to be non-constant in depth, which often results in reduced power and increased false discoveries in downstream analyses. This problem is exacerbated by the high proportion of zeros present in most datasets.
Results: To address this, we present Dino, a normalization method based on a flexible negative-binomial mixture model of gene expression. As demonstrated in both simulated and case study datasets, by normalizing the entire gene expression distribution, Dino is robust to shallow sequencing depth, sample heterogeneity, and varying zero proportions, leading to improved performance in downstream analyses in a number of settings.
Availability and implementation: The R package, Dino, is available on GitHub at https://github.com/JBrownBiostat/
Contact: [email protected], [email protected]
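The motivation above — that depth scaling equalizes means but not variances — can be demonstrated with a toy simulation. This illustrates the problem Dino addresses, not Dino's negative-binomial mixture model itself; the depths, rate, and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# One gene measured across cells at two sequencing depths; the true
# expression rate is identical, only the depth differs.
shallow_depth, deep_depth = 1_000, 10_000
rate = 2e-3
shallow = rng.poisson(rate * shallow_depth, size=5_000)
deep = rng.poisson(rate * deep_depth, size=5_000)

# Naive normalization: divide each cell's count by its depth.
shallow_norm = shallow / shallow_depth
deep_norm = deep / deep_depth

# Means agree after scaling, but the shallow group keeps a larger
# variance: Var(count/depth) = rate/depth, so counting noise shrinks
# only as 1/depth and is NOT removed by mean adjustment.
print("means:", shallow_norm.mean(), deep_norm.mean())
print("variances:", shallow_norm.var(), deep_norm.var())
```

A depth-dependent variance like this is exactly the non-constant distributional property that, per the abstract, inflates false discoveries downstream when only average expression is adjusted.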

