Holistic optimization of an RNA-seq workflow for multi-threaded environments

Abstract Summary For many next-generation sequencing pipelines, the most computationally intensive step is the alignment of reads to a reference sequence. As a result, alignment software such as the Burrows-Wheeler Aligner is optimized for speed and is often executed in parallel on the cloud. However, other, less demanding steps can also be optimized to significantly increase speed, especially when many threads are used. We demonstrate this using a unique molecular identifier RNA-sequencing pipeline consisting of three steps: split, align, and merge. Optimization of all three steps yields a 40% increase in speed when executed using a single thread. However, when executed using 16 threads, we observe a 4-fold improvement over the original parallel implementation and more than an 8-fold improvement over the original single-threaded implementation. In contrast, optimizing only the alignment step results in just a 13% improvement over the original parallel workflow using 16 threads. Availability and implementation Code (MIT license), supporting scripts and Dockerfiles are available at https://github.com/BioDepot/LINCS_RNAseq_cpp and Docker images at https://hub.docker.com/r/biodepot/rnaseq-umi-cpp/ Supplementary information Supplementary data are available at Bioinformatics online.
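
The split, align, and merge structure described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy reference table, the function names, and the exact-lookup "aligner" are all hypothetical stand-ins for a real aligner such as BWA. Python threads are used only to show the shape of the workflow; they do not by themselves give the CPU-bound speedups reported above.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy reference: read sequence -> gene name (stands in for a real aligner).
TOY_REFERENCE = {"ACGT": "geneA", "TTGA": "geneB"}

def split_reads(reads, n_chunks):
    """Split step: distribute (umi, sequence) reads round-robin into n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, read in enumerate(reads):
        chunks[i % n_chunks].append(read)
    return chunks

def align_chunk(chunk):
    """Align step (placeholder): exact dictionary lookup instead of a real alignment."""
    hits = []
    for umi, seq in chunk:
        gene = TOY_REFERENCE.get(seq)
        if gene is not None:
            hits.append((gene, umi))
    return hits

def merge_counts(per_chunk_hits):
    """Merge step: count distinct UMIs per gene across all chunks."""
    umis = defaultdict(set)
    for hits in per_chunk_hits:
        for gene, umi in hits:
            umis[gene].add(umi)
    return {gene: len(seen) for gene, seen in umis.items()}

def run_pipeline(reads, threads=4):
    """Run split, align (one task per chunk), and merge in sequence."""
    chunks = split_reads(reads, threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_chunk_hits = list(pool.map(align_chunk, chunks))
    return merge_counts(per_chunk_hits)
```

Because duplicate UMIs are collapsed only in the merge step, the result does not depend on how the reads were distributed across chunks.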


Holistic optimization of an RNA-seq workflow for multi-threaded environments

10.1101/345819 ◽

2018 ◽

Author(s):

Ling-Hong Hung ◽

Wes Lloyd ◽

Radhika Agumbe Sridhar ◽

Saranya Devi Athmalingam Ravishankar ◽

Yuguang Xiong ◽

...

Keyword(s):

Next Generation Sequencing ◽

Reference Sequence ◽

Rna Seq ◽

Implementation ◽

Computationally Intensive ◽

Holistic Optimization ◽

Parallel Workflow ◽

Unique Molecular Identifier ◽

Generation Sequencing ◽

Alignment Step

Abstract Summary For many next-generation sequencing pipelines, the most computationally intensive step is the alignment of reads to a reference sequence. As a result, alignment software such as the Burrows-Wheeler Aligner (BWA) is optimized for speed and is often executed in parallel on the cloud. However, there are other less demanding steps that can also be optimized to significantly increase the speed, especially when using many threads. We demonstrate this using a unique molecular identifier (UMI) RNA sequencing pipeline consisting of 3 steps: split, align, and merge. Optimization of all three steps yields a 40% increase in speed when executed using a single thread. However, when executed using 16 threads, we observe a 4-fold improvement over the original parallel implementation and more than an 8-fold improvement over the original single-threaded implementation. In contrast, optimizing only the alignment step results in just a 13% improvement over the original parallel workflow using 16 threads.


BioSeqZip: a collapser of NGS redundant reads for the optimization of sequence analysis

Bioinformatics ◽

10.1093/bioinformatics/btaa051 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2705-2711 ◽

Cited By ~ 2

Author(s):

Gianvito Urgese ◽

Emanuele Parisi ◽

Orazio Scicolone ◽

Santa Di Cataldo ◽

Elisa Ficarra

Keyword(s):

Sequence Analysis ◽

Supplementary Information ◽

Sorting Algorithm ◽

Rna Seq ◽

Compact Sets ◽

Analysis Pipeline ◽

Alignment Algorithms ◽

External Sorting ◽

Computational Resources ◽

Generation Sequencing

Abstract Motivation High-throughput next-generation sequencing can generate huge sequence files, whose analysis requires alignment algorithms that are typically very demanding in terms of memory and computational resources. This is a significant issue, especially for machines with limited hardware capabilities. As the redundancy of the sequences typically increases with coverage, collapsing such files into compact sets of non-redundant reads has the 2-fold advantage of reducing file size and speeding up the alignment by avoiding mapping the same sequence multiple times. Method BioSeqZip generates compact and sorted lists of alignment-ready non-redundant sequences, keeping track of their occurrences in the raw files as well as of their quality score information. By exploiting a memory-constrained external sorting algorithm, it can be executed on either single- or multi-sample datasets even on computers with medium computational capabilities. On request, it can even re-expand the compacted files to their original state. Results Our extensive experiments on RNA-Seq data show that BioSeqZip considerably brings down the computational costs of a standard sequence analysis pipeline, with particular benefits for the alignment procedures that typically have the highest requirements in terms of memory and execution time. In our tests, BioSeqZip was able to compact 2.7 billion reads into 963 million unique tags, reducing the size of sequence files by up to 70% and speeding up the alignment by at least 50%. Availability and implementation BioSeqZip is available at https://github.com/bioinformatics-polito/BioSeqZip. Supplementary information Supplementary data are available at Bioinformatics online.
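
The idea of collapsing redundant reads while tracking their occurrences, and re-expanding them on request, can be illustrated with a short sketch. It assumes the reads fit in memory and ignores quality scores, so it is not BioSeqZip's memory-constrained external sort; the function names are hypothetical.

```python
from collections import Counter

def collapse_reads(reads):
    """Return a sorted list of (sequence, occurrence_count) for the unique reads."""
    counts = Counter(reads)
    return sorted(counts.items())

def expand_reads(collapsed):
    """Re-expand collapsed (sequence, count) pairs into one entry per original read.

    The original read order is not recovered, only the multiset of sequences.
    """
    return [seq for seq, n in collapsed for _ in range(n)]
```

An aligner would then need to map each unique sequence only once, with the stored counts used afterwards to restore per-read abundances.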


CoCo: RNA-seq read assignment correction for nested genes and multimapped reads

Bioinformatics ◽

10.1093/bioinformatics/btz433 ◽

2019 ◽

Vol 35 (23) ◽

pp. 5039-5047 ◽

Cited By ~ 6

Author(s):

Gabrielle Deschamps-Francoeur ◽

Vincent Boivin ◽

Sherif Abou Elela ◽

Michelle S Scott

Keyword(s):

Supplementary Information ◽

Rna Seq ◽

Non Coding Rna ◽

Abundance Estimates ◽

Gene Coverage ◽

Nested Genes ◽

Quantification Accuracy ◽

Whole Transcriptome Analysis ◽

Whole Transcriptome ◽

Generation Sequencing

Abstract Motivation Next-generation sequencing techniques revolutionized the study of RNA expression by permitting whole transcriptome analysis. However, sequencing reads generated from nested and multi-copy genes are often either misassigned or discarded, which greatly reduces both quantification accuracy and gene coverage. Results Here we present count corrector (CoCo), a read assignment pipeline that takes into account the multitude of overlapping and repetitive genes in the transcriptome of higher eukaryotes. CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences. CoCo salvages over 15% of discarded aligned RNA-seq reads and significantly changes the abundance estimates for both coding and non-coding RNA as validated by PCR and bedgraph comparisons. Availability and implementation The CoCo software is an open source package written in Python and available from http://gitlabscottgroup.med.usherbrooke.ca/scott-group/coco. Supplementary information Supplementary data are available at Bioinformatics online.
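
Proportional distribution of multimapped reads can be sketched as below. The abstract does not state what the proportions are based on; this example assumes they follow each gene's uniquely mapped read count, with an equal split when no unique reads exist. Both that rule and the function name are assumptions for illustration, not CoCo's code.

```python
def distribute_multimapped(unique_counts, multimapped_groups):
    """Add multimapped reads to per-gene counts in proportion to unique counts.

    unique_counts: dict mapping gene -> number of uniquely mapped reads.
    multimapped_groups: iterable of (genes, n_reads), where genes is a tuple of
        the genes a group of n_reads multimapped reads aligns to.
    """
    counts = {gene: float(c) for gene, c in unique_counts.items()}
    for genes, n_reads in multimapped_groups:
        total_unique = sum(unique_counts.get(g, 0) for g in genes)
        for g in genes:
            if total_unique > 0:
                share = unique_counts.get(g, 0) / total_unique
            else:
                # Assumed fallback: split evenly when there is no unique evidence.
                share = 1 / len(genes)
            counts[g] = counts.get(g, 0.0) + n_reads * share
    return counts
```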


Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression

Bioinformatics ◽

10.1093/bioinformatics/bty936 ◽

2018 ◽

Vol 35 (12) ◽

pp. 2066-2074 ◽

Cited By ~ 11

Author(s):

Yuansheng Liu ◽

Zuguo Yu ◽

Marcel E Dinger ◽

Jinyan Li

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Reference Sequence ◽

Supplementary Information ◽

The Novel ◽

Rna Seq ◽

File Size ◽

Sequencing Technologies ◽

Efficient Storage ◽

Merging Process

Abstract Motivation Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to reduce the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has been proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, current focus is on the novel construction of de novo assembled contigs. Results We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroup those that have the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix–prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20–80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. These results are mainly attributed to exploiting the redundancy of the repetitive substrings in the long contigs. Availability and implementation https://github.com/yuansliu/minicom Supplementary information Supplementary data are available at Bioinformatics online.
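
For readers unfamiliar with (w, k)-minimizers, the sketch below picks the smallest k-mer in every window of w consecutive k-mers. It assumes plain lexicographic ordering with ties broken by the leftmost position; minicom's actual ordering, parameters, and indexing scheme are not described in the abstract, so this is only a generic illustration with a hypothetical function name.

```python
def minimizers(seq, w, k):
    """Return sorted (position, k-mer) pairs chosen as (w, k)-minimizers of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over w consecutive k-mers and keep the smallest one.
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        selected.add((start + offset, window[offset]))
    return sorted(selected)
```

Two sequences that share a long enough substring share at least one minimizer, which is why minimizers are commonly used to group reads or detect overlaps without comparing every k-mer.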


SnakeChunks: modular blocks to build Snakemake workflows for reproducible NGS analyses

10.1101/165191 ◽

2017 ◽

Author(s):

Claire Rioualen ◽

Lucie Charbonnier-Khamvongsa ◽

Jacques van Helden

Keyword(s):

Next Generation Sequencing ◽

Life Sciences ◽

Supplementary Information ◽

Supplementary Data ◽

Rna Seq ◽

Genome Wide ◽

Domains Of Life ◽

Supplementary Material ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Abstract Summary Next-Generation Sequencing (NGS) is becoming a routine approach for most domains of life sciences, yet there is a crucial need to improve the automation of processing for the huge amounts of data generated and to ensure reproducible results. We present SnakeChunks, a collection of Snakemake rules that enable users to compose modular and user-configurable workflows, and show its usage with analyses of transcriptome (RNA-seq) and genome-wide location (ChIP-seq) data. Availability The code is freely available (github.com/SnakeChunks/SnakeChunks), and documented with tutorials and illustrative demos (snakechunks.readthedocs.io). Supplementary information Supplementary data are available at Bioinformatics online.


pyBedGraph: a python package for fast operations on 1D genomic signal tracks

Bioinformatics ◽

10.1093/bioinformatics/btaa061 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3234-3235

Author(s):

Henry B Zhang ◽

Minji Kim ◽

Jeffrey H Chuang ◽

Yijun Ruan

Keyword(s):

Chromatin Accessibility ◽

Genomic Research ◽

Supplementary Information ◽

Summary Statistics ◽

Rna Seq ◽

Binary Format ◽

Modern Genomic ◽

Binding Intensity ◽

Python Package ◽

Generation Sequencing

Abstract Motivation Modern genomic research is driven by next-generation sequencing experiments such as ChIP-seq and ChIA-PET that generate coverage files for transcription factor binding, as well as DHS and ATAC-seq that yield coverage files for chromatin accessibility. Such files are in a bedGraph text format or a bigWig binary format. Obtaining summary statistics in a given region is a fundamental task in analyzing protein binding intensity or chromatin accessibility. However, the existing Python package for operating on coverage files is not optimized for speed. Results We developed pyBedGraph, a Python package to quickly obtain summary statistics for a given interval in a bedGraph or a bigWig file. When tested on 12 ChIP-seq, ATAC-seq, RNA-seq and ChIA-PET datasets, pyBedGraph is on average 260 times faster than the existing program pyBigWig. On average, pyBedGraph can look up the exact mean signal of 1 million regions in ∼0.26 s and can compute their approximate means in <0.12 s on a conventional laptop. Availability and implementation pyBedGraph is publicly available at https://github.com/TheJacksonLaboratory/pyBedGraph under the MIT license. Supplementary information Supplementary data are available at Bioinformatics online.
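
The task of computing the exact mean signal over many regions can be illustrated with a prefix-sum sketch. This is not pyBedGraph's API or necessarily its algorithm; the function names are hypothetical, and the example assumes bedGraph-style 0-based, half-open intervals on a single chromosome small enough to expand per base.

```python
from itertools import accumulate

def build_prefix_sums(intervals, chrom_length):
    """Expand (start, end, value) intervals into per-base signal and prefix sums.

    prefix[i] holds the sum of the signal over positions [0, i).
    """
    signal = [0.0] * chrom_length
    for start, end, value in intervals:
        for pos in range(start, end):
            signal[pos] = value
    return [0.0] + list(accumulate(signal))

def mean_signal(prefix, start, end):
    """Exact mean of the signal over the half-open region [start, end)."""
    return (prefix[end] - prefix[start]) / (end - start)
```

After the one-time preprocessing pass, each region query costs two lookups and a division regardless of region length, which is the kind of trade-off that makes bulk lookups fast.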


BAMixChecker: an automated checkup tool for matched sample pairs in NGS cohort

Bioinformatics ◽

10.1093/bioinformatics/btz479 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4806-4808 ◽

Cited By ~ 2

Author(s):

Hein Chun ◽

Sangwoo Kim

Keyword(s):

Genomic Analysis ◽

Supplementary Information ◽

Nucleotide Polymorphisms ◽

Rna Seq ◽

Sequencing Data ◽

Single Nucleotide ◽

Frequent Problem ◽

Generation Sequencing ◽

User Intervention ◽

Genotype Concordance

Abstract Summary Mislabeling in the process of next generation sequencing is a frequent problem that can cause an entire genomic analysis to fail, and a regular cohort-level checkup is needed to ensure that it has not occurred. We developed a new, automated tool (BAMixChecker) that accurately detects sample mismatches from a given BAM file cohort with minimal user intervention. BAMixChecker uses a flexible, data-specific set of single-nucleotide polymorphisms and detects orphan (unpaired) and swapped (mispaired) samples based on genotype-concordance score and entropy-based file name analysis. BAMixChecker shows ∼100% accuracy in real WES, RNA-Seq and targeted sequencing data cohorts, even for small panels (<50 genes). BAMixChecker provides an HTML-style report that graphically outlines the sample matching status in tables and heatmaps, with which users can quickly inspect any mismatch events. Availability and implementation BAMixChecker is available at https://github.com/heinc1010/BAMixChecker Supplementary information Supplementary data are available at Bioinformatics online.
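
A genotype-concordance comparison between two samples can be sketched as the fraction of shared SNP sites with identical genotype calls. The abstract does not define BAMixChecker's score, so this formula, the dictionary representation, and the function name are assumptions chosen only to make the concept concrete.

```python
def genotype_concordance(sample_a, sample_b):
    """Fraction of SNP sites called in both samples that have the same genotype.

    Each sample is a dict mapping a site identifier to a genotype string.
    Returns 0.0 when the samples share no called sites.
    """
    shared = [site for site in sample_a if site in sample_b]
    if not shared:
        return 0.0
    same = sum(1 for site in shared if sample_a[site] == sample_b[site])
    return same / len(shared)
```

Under this definition, samples from the same individual would be expected to score near 1.0, so an unexpectedly low score within a labeled pair, or a high score across pairs, hints at a swap.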


Improved representation of sequence Bloom trees

Bioinformatics ◽

10.1093/bioinformatics/btz662 ◽

2019 ◽

Cited By ~ 2

Author(s):

Robert S Harris ◽

Paul Medvedev

Keyword(s):

Supplementary Information ◽

Supplementary File ◽

Biological Databases ◽

Rna Seq ◽

End User ◽

Indexing Methods ◽

Source Program ◽

Supplementary Text ◽

Free Open Source ◽

Generation Sequencing

Abstract Motivation Algorithmic solutions to index and search biological databases are a fundamental part of bioinformatics, providing underlying components to many end-user tools. Inexpensive next generation sequencing has filled publicly available databases such as the Sequence Read Archive beyond the capacity of traditional indexing methods. Recently, the Sequence Bloom Tree (SBT) and its derivatives were proposed as a way to efficiently index such data for queries about transcript presence. Results We build on the SBT framework to construct the HowDe-SBT data structure, which uses a novel partitioning of information to reduce the construction and query time as well as the size of the index. Compared to previous SBT methods on real RNA-seq data, HowDe-SBT can construct the index in less than 36% of the time, and with 39% less space, and can answer small-batch queries at least five times faster. We also develop a theoretical framework in which we can analyze and bound the space and query performance of HowDe-SBT compared to other SBT methods. Availability and implementation HowDe-SBT is available as a free open source program on https://github.com/medvedevgroup/HowDeSBT. Supplementary information Supplementary text and figures are available as a single Supplementary file.
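
The building block behind a Sequence Bloom Tree is a Bloom filter over the k-mers of one sequencing experiment. The sketch below shows a single such filter and a presence query that asks whether at least a fraction `theta` of the query's k-mers are found. It is a generic illustration, not HowDe-SBT's data structure: the class, the parameter names, and the hashing scheme are assumptions, and the tree that organizes many filters is omitted.

```python
import hashlib

class BloomFilter:
    """A small Bloom filter; uses one byte per bit for clarity, not compactness."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def experiment_matches(bloom, query, k, theta):
    """True if at least a fraction theta of the query's k-mers hit the filter."""
    query_kmers = kmers(query, k)
    hits = sum(1 for km in query_kmers if km in bloom)
    return hits >= theta * len(query_kmers)
```

A Bloom filter never reports a stored k-mer as absent, but it can report an absent k-mer as present, so a positive answer is approximate while a negative answer for a single k-mer is exact.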


BBKNN: fast batch alignment of single cell transcriptomes

Bioinformatics ◽

10.1093/bioinformatics/btz625 ◽

2019 ◽

Cited By ~ 33

Author(s):

Krzysztof Polański ◽

Matthew D Young ◽

Zhichao Miao ◽

Kerstin B Meyer ◽

Sarah A Teichmann ◽

...

Keyword(s):

Data Integration ◽

Single Cell ◽

Large Scale ◽

Supplementary Information ◽

Supplementary Data ◽

Rna Seq ◽

Batch Effects ◽

Integration Algorithm ◽

Data Explosion ◽

Computationally Intensive

Abstract Motivation Increasing numbers of large scale single cell RNA-Seq projects are leading to a data explosion, which can only be fully exploited through data integration. A number of methods have been developed to combine diverse datasets by removing technical batch effects, but most are computationally intensive. To overcome the challenge of enormous datasets, we have developed BBKNN, an extremely fast graph-based data integration algorithm. We illustrate the power of BBKNN on large scale mouse atlasing data, and favourably benchmark its run time against a number of competing methods. Availability and implementation BBKNN is available at https://github.com/Teichlab/bbknn, along with documentation and multiple example notebooks, and can be installed from pip. Supplementary information Supplementary data are available at Bioinformatics online.
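
A batch-balanced neighbor search can be sketched as follows: for each cell, the k nearest neighbors are found separately within every batch and then combined, so that each batch contributes equally to the cell's neighborhood. This brute-force version is an illustration only, not the BBKNN package's API; the function name, the Euclidean metric, and the exclusion of the cell itself are assumptions. It requires Python 3.8 or later for `math.dist`.

```python
import math

def batch_balanced_neighbors(points, batches, k):
    """Map each point index to its k nearest neighbors from every batch.

    points: list of coordinate tuples (e.g. reduced-dimension embeddings).
    batches: list of batch labels, one per point.
    """
    by_batch = {}
    for idx, label in enumerate(batches):
        by_batch.setdefault(label, []).append(idx)

    neighbors = {}
    for i, point in enumerate(points):
        found = []
        for label, members in by_batch.items():
            # Rank the other cells of this batch by distance to the current cell.
            candidates = [j for j in members if j != i]
            candidates.sort(key=lambda j: math.dist(point, points[j]))
            found.extend(candidates[:k])
        neighbors[i] = found
    return neighbors
```

The resulting neighbor lists define a graph in which every cell is linked to cells from all batches, which is what lets downstream graph-based clustering or embedding mix batches.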


Transcriptomic Profiling of Skeletal Muscle Reveals Candidate Genes Influencing Muscle Growth and Associated Lipid Composition in Portuguese Local Pig Breeds

Animals ◽

10.3390/ani11051423 ◽

2021 ◽

Vol 11 (5) ◽

pp. 1423

Author(s):

André Albuquerque ◽

Cristina Óvilo ◽

Yolanda Núñez ◽

Rita Benítez ◽

Adrián López-Garcia ◽

...

Keyword(s):

Candidate Genes ◽

Muscle Growth ◽

Rna Seq ◽

Next Generation Sequencing Technology ◽

Pig Breeds ◽

Nfkb Activation ◽

Main Gene ◽

Slow Type ◽

Longissimus Lumborum ◽

Generation Sequencing

Gene expression is one of the main factors influencing meat quality, as it modulates fatty acid metabolism, composition, and deposition rates in muscle tissue. This study aimed to explore the transcriptomics of the Longissimus lumborum muscle in two local pig breeds with distinct genetic backgrounds using next-generation sequencing technology and Real-Time qPCR. RNA-seq yielded 49 differentially expressed genes between breeds, 34 overexpressed in the Alentejano (AL) and 15 in the Bísaro (BI) breed. Specific slow type myosin heavy chain components were associated with AL (MYH7) and BI (MYH3) pigs, while an overexpression of MAP3K14 in AL may be associated with their lower loin proportion, induced insulin resistance, and increased inflammatory response via NFkB activation. Overexpression of RUFY1 in AL pigs may explain the higher intramuscular fat (IMF) content via higher GLUT4 recruitment and consequently higher glucose uptake that can be stored as fat. Several candidate genes for lipid metabolism, excluded in the RNA-seq analysis due to low counts, such as ACLY, ADIPOQ, ELOVL6, LEP and ME1, were identified by qPCR as main gene factors defining the processes that influence meat composition and quality. These results agree with the fatter profile of the AL pig breed, and adiponectin resistance can be postulated as responsible for the overexpression of MAP3K14's coding product NIK, failing to restore insulin sensitivity.
