BioSeqZip: a collapser of NGS redundant reads for the optimization of sequence analysis

Gianvito Urgese; Emanuele Parisi; Orazio Scicolone; Santa Di Cataldo; Elisa Ficarra

doi:10.1093/bioinformatics/btaa051

Bioinformatics ◽

10.1093/bioinformatics/btaa051 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2705-2711 ◽

Cited By ~ 2

Author(s):

Gianvito Urgese ◽

Emanuele Parisi ◽

Orazio Scicolone ◽

Santa Di Cataldo ◽

Elisa Ficarra

Keyword(s):

Sequence Analysis ◽

Supplementary Information ◽

Sorting Algorithm ◽

Rna Seq ◽

Compact Sets ◽

Analysis Pipeline ◽

Alignment Algorithms ◽

External Sorting ◽

Computational Resources ◽

Generation Sequencing

Abstract

Motivation: High-throughput next-generation sequencing can generate huge sequence files, whose analysis requires alignment algorithms that are typically very demanding in terms of memory and computational resources. This is a significant issue, especially for machines with limited hardware capabilities. As the redundancy of the sequences typically increases with coverage, collapsing such files into compact sets of non-redundant reads has the twofold advantage of reducing file size and speeding up the alignment, because the same sequence is no longer mapped multiple times.

Method: BioSeqZip generates compact and sorted lists of alignment-ready non-redundant sequences, keeping track of their occurrences in the raw files as well as of their quality score information. By exploiting a memory-constrained external sorting algorithm, it can be executed on either single- or multi-sample datasets even on computers with medium computational capabilities. On request, it can even re-expand the compacted files to their original state.

Results: Our extensive experiments on RNA-Seq data show that BioSeqZip considerably brings down the computational costs of a standard sequence analysis pipeline, with particular benefits for the alignment procedures that typically have the highest requirements in terms of memory and execution time. In our tests, BioSeqZip compacted 2.7 billion reads into 963 million unique tags, reducing the size of sequence files by up to 70% and speeding up the alignment by at least 50%.

Availability and implementation: BioSeqZip is available at https://github.com/bioinformatics-polito/BioSeqZip.

Supplementary information: Supplementary data are available at Bioinformatics online.
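The collapsing idea described in this abstract can be illustrated with a minimal Python sketch. It is not BioSeqZip's implementation: the function names (`sorted_chunks`, `collapse_reads`, `expand_reads`) and the `chunk_size` parameter are hypothetical, the sorted runs are kept in memory as stand-ins for the on-disk runs of a real external sort, and quality scores are ignored.

```python
import heapq
from itertools import groupby, islice


def sorted_chunks(reads, chunk_size):
    """Yield sorted lists of at most chunk_size reads (stand-ins for on-disk runs)."""
    it = iter(reads)
    while True:
        chunk = sorted(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def collapse_reads(reads, chunk_size=2):
    """Return (sequence, count) pairs in sorted order, one per distinct read."""
    runs = list(sorted_chunks(reads, chunk_size))
    merged = heapq.merge(*runs)  # k-way merge of the sorted runs
    # Identical reads are adjacent after the merge, so groupby can count them.
    return [(seq, sum(1 for _ in group)) for seq, group in groupby(merged)]


def expand_reads(collapsed):
    """Rebuild the multiset of reads (the original read order is not restored)."""
    return [seq for seq, count in collapsed for _ in range(count)]


reads = ["ACGT", "TTGA", "ACGT", "GGCC", "ACGT", "TTGA"]
collapsed = collapse_reads(reads)
print(collapsed)  # [('ACGT', 3), ('GGCC', 1), ('TTGA', 2)]
```

Only the three distinct sequences would need to be aligned here instead of six reads, and the stored counts let downstream steps recover per-sequence abundance.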

CoCo: RNA-seq read assignment correction for nested genes and multimapped reads

Bioinformatics ◽

10.1093/bioinformatics/btz433 ◽

2019 ◽

Vol 35 (23) ◽

pp. 5039-5047 ◽

Cited By ~ 6

Author(s):

Gabrielle Deschamps-Francoeur ◽

Vincent Boivin ◽

Sherif Abou Elela ◽

Michelle S Scott

Keyword(s):

Supplementary Information ◽

Rna Seq ◽

Non Coding Rna ◽

Abundance Estimates ◽

Gene Coverage ◽

Nested Genes ◽

Quantification Accuracy ◽

Whole Transcriptome Analysis ◽

Whole Transcriptome ◽

Generation Sequencing

Abstract

Motivation: Next-generation sequencing techniques revolutionized the study of RNA expression by permitting whole transcriptome analysis. However, sequencing reads generated from nested and multi-copy genes are often either misassigned or discarded, which greatly reduces both quantification accuracy and gene coverage.

Results: Here we present count corrector (CoCo), a read assignment pipeline that takes into account the multitude of overlapping and repetitive genes in the transcriptome of higher eukaryotes. CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences. CoCo salvages over 15% of discarded aligned RNA-seq reads and significantly changes the abundance estimates for both coding and non-coding RNA as validated by PCR and bedgraph comparisons.

Availability and implementation: The CoCo software is an open source package written in Python and available from http://gitlabscottgroup.med.usherbrooke.ca/scott-group/coco.

Supplementary information: Supplementary data are available at Bioinformatics online.
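As a rough illustration of proportional distribution of multimapped reads, the Python sketch below splits each multimapped read among its candidate genes. It is not CoCo's code: the function name `distribute_multimapped`, the choice of uniquely mapped counts as weights, and the equal-split fallback are all assumptions, since the abstract does not state how the proportions are computed.

```python
def distribute_multimapped(unique_counts, multimapped):
    """Split each multimapped read among its candidate genes.

    unique_counts: dict mapping gene -> number of uniquely mapped reads.
    multimapped:   iterable of gene tuples, one tuple per multimapped read.
    Weights are the unique counts (an assumption of this sketch); if every
    candidate has zero unique reads, the read is split equally.
    """
    counts = {gene: float(c) for gene, c in unique_counts.items()}
    for genes in multimapped:
        total = sum(unique_counts.get(g, 0) for g in genes)
        for g in genes:
            if total > 0:
                share = unique_counts.get(g, 0) / total
            else:
                share = 1 / len(genes)
            counts[g] = counts.get(g, 0.0) + share
    return counts


unique_counts = {"geneA": 30, "geneB": 10, "geneC": 0}
multimapped = [("geneA", "geneB"), ("geneA", "geneB"), ("geneC", "geneD")]
corrected = distribute_multimapped(unique_counts, multimapped)
print(corrected)
```

Every multimapped read contributes a total weight of one, so no read is discarded and the corrected counts sum to the number of unique plus multimapped reads.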

Holistic optimization of an RNA-seq workflow for multi-threaded environments

Bioinformatics ◽

10.1093/bioinformatics/btz169 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4173-4175 ◽

Cited By ~ 3

Author(s):

Ling-Hong Hung ◽

Wes Lloyd ◽

Radhika Agumbe Sridhar ◽

Saranya Devi Athmalingam Ravishankar ◽

Yuguang Xiong ◽

...

Keyword(s):

Parallel Implementation ◽

Reference Sequence ◽

Supplementary Information ◽

Rna Seq ◽

Computationally Intensive ◽

Holistic Optimization ◽

Parallel Workflow ◽

Unique Molecular Identifier ◽

Generation Sequencing ◽

Alignment Step

Abstract

Summary: For many next-generation sequencing pipelines, the most computationally intensive step is the alignment of reads to a reference sequence. As a result, alignment software such as the Burrows-Wheeler Aligner is optimized for speed and is often executed in parallel on the cloud. However, there are other less demanding steps that can also be optimized to significantly increase the speed, especially when using many threads. We demonstrate this using a unique molecular identifier RNA-sequencing pipeline consisting of three steps: split, align, and merge. Optimization of all three steps yields a 40% increase in speed when executed using a single thread. However, when executed using 16 threads, we observe a 4-fold improvement over the original parallel implementation and more than an 8-fold improvement over the original single-threaded implementation. In contrast, optimizing only the alignment step results in just a 13% improvement over the original parallel workflow using 16 threads.

Availability and implementation: Code (M.I.T. license), supporting scripts and Dockerfiles are available at https://github.com/BioDepot/LINCS_RNAseq_cpp and Docker images at https://hub.docker.com/r/biodepot/rnaseq-umi-cpp/

Supplementary information: Supplementary data are available at Bioinformatics online.

SnakeChunks: modular blocks to build Snakemake workflows for reproducible NGS analyses

10.1101/165191 ◽

2017 ◽

Author(s):

Claire Rioualen ◽

Lucie Charbonnier-Khamvongsa ◽

Jacques van Helden

Keyword(s):

Next Generation Sequencing ◽

Life Sciences ◽

Supplementary Information ◽

Supplementary Data ◽

Rna Seq ◽

Genome Wide ◽

Domains Of Life ◽

Supplementary Material ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Abstract

Summary: Next-Generation Sequencing (NGS) is becoming a routine approach for most domains of life sciences, yet there is a crucial need to improve the automation of processing for the huge amounts of data generated and to ensure reproducible results. We present SnakeChunks, a collection of Snakemake rules that enable users to compose modular and user-configurable workflows, and show its usage with analyses of transcriptome (RNA-seq) and genome-wide location (ChIP-seq) data.

Availability: The code is freely available (github.com/SnakeChunks/SnakeChunks), and documented with tutorials and illustrative demos (snakechunks.readthedocs.io).

Supplementary information: Supplementary data are available at Bioinformatics online.

pyBedGraph: a python package for fast operations on 1D genomic signal tracks

Bioinformatics ◽

10.1093/bioinformatics/btaa061 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3234-3235

Author(s):

Henry B Zhang ◽

Minji Kim ◽

Jeffrey H Chuang ◽

Yijun Ruan

Keyword(s):

Chromatin Accessibility ◽

Genomic Research ◽

Supplementary Information ◽

Summary Statistics ◽

Rna Seq ◽

Binary Format ◽

Modern Genomic ◽

Binding Intensity ◽

Python Package ◽

Generation Sequencing

Abstract

Motivation: Modern genomic research is driven by next-generation sequencing experiments such as ChIP-seq and ChIA-PET that generate coverage files for transcription factor binding, as well as DHS and ATAC-seq that yield coverage files for chromatin accessibility. Such files are in a bedGraph text format or a bigWig binary format. Obtaining summary statistics in a given region is a fundamental task in analyzing protein binding intensity or chromatin accessibility. However, the existing Python package for operating on coverage files is not optimized for speed.

Results: We developed pyBedGraph, a Python package to quickly obtain summary statistics for a given interval in a bedGraph or a bigWig file. When tested on 12 ChIP-seq, ATAC-seq, RNA-seq and ChIA-PET datasets, pyBedGraph is on average 260 times faster than the existing program pyBigWig. On average, pyBedGraph can look up the exact mean signal of 1 million regions in ∼0.26 s and can compute their approximate means in <0.12 s on a conventional laptop.

Availability and implementation: pyBedGraph is publicly available at https://github.com/TheJacksonLaboratory/pyBedGraph under the MIT license.

Supplementary information: Supplementary data are available at Bioinformatics online.
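To make the "mean signal over an interval" task concrete, here is a small Python sketch based on prefix sums. It does not use or reproduce the pyBedGraph API or its internal algorithm; the helper names `build_prefix_sums` and `mean_signal` are hypothetical, and intervals are assumed to follow the bedGraph convention of 0-based, half-open coordinates on a single chromosome.

```python
from itertools import accumulate


def build_prefix_sums(intervals, chrom_length):
    """Expand (start, end, value) intervals to per-base signal and prefix-sum it."""
    signal = [0.0] * chrom_length
    for start, end, value in intervals:
        for pos in range(start, end):
            signal[pos] = value
    # prefix[i] holds the sum of signal[0:i], so prefix has chrom_length + 1 entries.
    return [0.0] + list(accumulate(signal))


def mean_signal(prefix, start, end):
    """Exact mean over the half-open region [start, end) in constant time."""
    return (prefix[end] - prefix[start]) / (end - start)


# Three bedGraph-style intervals on a toy chromosome of length 10;
# positions 6 and 7 are uncovered and count as zero signal.
intervals = [(0, 4, 1.0), (4, 6, 3.0), (8, 10, 2.0)]
prefix = build_prefix_sums(intervals, 10)
print(mean_signal(prefix, 2, 6))   # 2.0
print(mean_signal(prefix, 6, 10))  # 1.0
```

After the one-off preprocessing pass, each query costs two array lookups, which is one simple way to answer many region queries quickly.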

BAMixChecker: an automated checkup tool for matched sample pairs in NGS cohort

Bioinformatics ◽

10.1093/bioinformatics/btz479 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4806-4808 ◽

Cited By ~ 2

Author(s):

Hein Chun ◽

Sangwoo Kim

Keyword(s):

Genomic Analysis ◽

Supplementary Information ◽

Nucleotide Polymorphisms ◽

Rna Seq ◽

Sequencing Data ◽

Single Nucleotide ◽

Frequent Problem ◽

Generation Sequencing ◽

User Intervention ◽

Genotype Concordance

Abstract

Summary: Mislabeling in the process of next generation sequencing is a frequent problem that can cause an entire genomic analysis to fail, and a regular cohort-level checkup is needed to ensure that it has not occurred. We developed a new, automated tool (BAMixChecker) that accurately detects sample mismatches from a given BAM file cohort with minimal user intervention. BAMixChecker uses a flexible, data-specific set of single-nucleotide polymorphisms and detects orphan (unpaired) and swapped (mispaired) samples based on genotype-concordance score and entropy-based file name analysis. BAMixChecker shows ∼100% accuracy in real WES, RNA-Seq and targeted sequencing data cohorts, even for small panels (<50 genes). BAMixChecker provides an HTML-style report that graphically outlines the sample matching status in tables and heatmaps, with which users can quickly inspect any mismatch events.

Availability and implementation: BAMixChecker is available at https://github.com/heinc1010/BAMixChecker

Supplementary information: Supplementary data are available at Bioinformatics online.
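A genotype-concordance score can be sketched in a few lines of Python. This is only an illustration of the general idea, not BAMixChecker's scoring: the function name `concordance`, the genotype encoding, and the definition of the score as the fraction of matching genotypes at shared SNP sites are assumptions, because the abstract does not define the score.

```python
def concordance(genotypes_a, genotypes_b):
    """Fraction of shared SNP sites at which two samples have the same genotype.

    Each argument maps a site label (for example "chr1:1000") to a genotype
    string such as "0/0", "0/1" or "1/1". Returns 0.0 if no site is shared.
    """
    shared = [site for site in genotypes_a if site in genotypes_b]
    if not shared:
        return 0.0
    matches = sum(genotypes_a[site] == genotypes_b[site] for site in shared)
    return matches / len(shared)


# Hypothetical genotype calls for three samples at four SNP sites.
tumor = {"chr1:1000": "0/1", "chr2:500": "1/1", "chr3:42": "0/0", "chr4:7": "0/1"}
normal = {"chr1:1000": "0/1", "chr2:500": "1/1", "chr3:42": "0/0", "chr4:7": "0/1"}
other = {"chr1:1000": "0/0", "chr2:500": "0/1", "chr3:42": "0/0", "chr4:7": "1/1"}

print(concordance(tumor, normal))  # 1.0
print(concordance(tumor, other))   # 0.25
```

Samples from the same individual are expected to score close to 1, while a low score between files labeled as a pair would hint at a swapped or orphan sample.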

RaNA-Seq: Interactive RNA-Seq analysis from FASTQ files to functional analysis

Bioinformatics ◽

10.1093/bioinformatics/btz854 ◽

2019 ◽

Author(s):

Carlos Prieto ◽

David Barrios

Keyword(s):

Rapid Analysis ◽

Supplementary Information ◽

Interactive Graphics ◽

Rna Seq ◽

Web Interface ◽

Cloud Platform ◽

Analysis Pipeline ◽

Quality Control Metrics ◽

Full Analysis ◽

Functional Analyses

Abstract

Summary: RaNA-Seq is a cloud platform for the rapid analysis and visualization of RNA-Seq data. It performs a full analysis in minutes by quantifying FASTQ files, calculating quality control metrics, running differential expression analyses and enabling the explanation of results with functional analyses. Our analysis pipeline applies generally accepted and reproducible protocols that can be applied with two simple steps in its web interface. Analysis results are presented as interactive graphics and reports, ready for their interpretation and publication.

Availability: RaNA-Seq web service is freely available online at https://ranaseq.eu

Supplementary information: Supplementary data are available at Bioinformatics online.

LSTrAP-Kingdom: an automated pipeline to generate annotated gene expression atlases for kingdoms of life

Bioinformatics ◽

10.1093/bioinformatics/btab168 ◽

2021 ◽

Author(s):

William Goh ◽

Marek Mutwil

Keyword(s):

Gene Expression ◽

Large Scale ◽

Supplementary Information ◽

Expression Data ◽

Supplementary Data ◽

Rna Seq ◽

Analysis Pipeline ◽

Study Gene Expression ◽

Automated Pipeline ◽

Bacteria And Fungi

Abstract

Motivation: There are now more than two million RNA sequencing experiments for plants, animals, bacteria and fungi publicly available, allowing us to study gene expression within and across species and kingdoms. However, the tools allowing the download, quality control and annotation of this data for more than one species at a time are currently missing.

Results: To remedy this, we present the Large-Scale Transcriptomic Analysis Pipeline in Kingdom of Life (LSTrAP-Kingdom) pipeline, which we used to process 134,521 RNA-seq samples, achieving ∼12,000 processed samples per day. Our pipeline generated quality-controlled, annotated gene expression matrices that rival the manually curated gene expression data in identifying functionally-related genes.

Availability: LSTrAP-Kingdom is available from: https://github.com/wirriamm/plants-pipeline and is fully implemented in Python and Bash.

Supplementary information: Supplementary data are available at Bioinformatics online.

iRAP - an integrated RNA-seq Analysis Pipeline

10.1101/005991 ◽

2014 ◽

Cited By ~ 23

Author(s):

Nuno A. Fonseca ◽

Robert Petryszak ◽

John Marioni ◽

Alvis Brazma

Keyword(s):

Enrichment Analysis ◽

Transcriptome Profiling ◽

Transcript Level ◽

Gene Set Enrichment Analysis ◽

Rna Seq ◽

Gene Set Enrichment ◽

Analysis Pipeline ◽

Advantages And Disadvantages ◽

Computational Resources ◽

Whole Transcriptome

RNA-sequencing (RNA-Seq) has become the technology of choice for whole-transcriptome profiling. However, processing the millions of sequence reads generated requires considerable bioinformatics skills and computational resources. At each step of the processing pipeline many tools are available, each with specific advantages and disadvantages. While using a specific combination of tools might be desirable, integrating the different tools can be time consuming, often due to specificities in the formats of input/output files required by the different programs. Here we present iRAP, an integrated RNA-seq analysis pipeline that allows the user to select and apply their preferred combination of existing tools for mapping reads, quantifying expression, and testing for differential expression. iRAP also includes multiple tools for gene set enrichment analysis and generates web browsable reports of the results obtained in the different stages of the pipeline. Depending upon the application, iRAP can be used to quantify expression at the gene, exon or transcript level. iRAP is aimed at a broad group of users with basic bioinformatics training and requires little experience with the command line. Despite this, it also provides more advanced users with the ability to customise the options used by their chosen tools.

SAMSA2: A standalone metatranscriptome analysis pipeline

10.1101/195826 ◽

2017 ◽

Cited By ~ 1

Author(s):

Samuel T Westreich ◽

Michelle L Treiber ◽

David A Mills ◽

Ian Korf ◽

Danielle G Lemay

Keyword(s):

Sequence Analysis ◽

Large Volume ◽

High Throughput Sequencing ◽

Sequence Data ◽

Rna Seq ◽

Analysis Pipeline ◽

Input And Output ◽

Cluster Environment ◽

Reference Databases ◽

Computationally Intensive

Abstract

Background: Complex microbial communities are an area of rapid growth in biology. Metatranscriptomics allows one to investigate the gene activity in an environmental sample via high-throughput sequencing. Metatranscriptomic experiments are computationally intensive because the experiments generate a large volume of sequence data and the sequences must be compared with many references.

Results: Here we present SAMSA2, an upgrade to the original Simple Annotation of Metatranscriptomes by Sequence Analysis (SAMSA) pipeline that has been redesigned for use on a supercomputing cluster. SAMSA2 is faster due to the use of the DIAMOND aligner, and more flexible and reproducible because it uses local databases. SAMSA2 is available with detailed documentation, and example input and output files along with examples of master scripts for full pipeline execution.

Conclusions: Using publicly available example data, we demonstrate that SAMSA2 is a rapid and efficient metatranscriptome pipeline for analyzing large paired-end RNA-seq datasets in a supercomputing cluster environment. SAMSA2 provides simplified output that can be examined directly or used for further analyses, and its reference databases may be upgraded, altered or customized to fit the specifics of any experiment.

Improved representation of sequence Bloom trees

Bioinformatics ◽

10.1093/bioinformatics/btz662 ◽

2019 ◽

Cited By ~ 2

Author(s):

Robert S Harris ◽

Paul Medvedev

Keyword(s):

Supplementary Information ◽

Supplementary File ◽

Biological Databases ◽

Rna Seq ◽

End User ◽

Indexing Methods ◽

Source Program ◽

Supplementary Text ◽

Free Open Source ◽

Generation Sequencing

Abstract

Motivation: Algorithmic solutions to index and search biological databases are a fundamental part of bioinformatics, providing underlying components to many end-user tools. Inexpensive next generation sequencing has filled publicly available databases such as the Sequence Read Archive beyond the capacity of traditional indexing methods. Recently, the Sequence Bloom Tree (SBT) and its derivatives were proposed as a way to efficiently index such data for queries about transcript presence.

Results: We build on the SBT framework to construct the HowDe-SBT data structure, which uses a novel partitioning of information to reduce the construction and query time as well as the size of the index. Compared to previous SBT methods on real RNA-seq data, HowDe-SBT can construct the index in less than 36% of the time, and with 39% less space, and can answer small-batch queries at least five times faster. We also develop a theoretical framework in which we can analyze and bound the space and query performance of HowDe-SBT compared to other SBT methods.

Availability and implementation: HowDe-SBT is available as a free open source program on https://github.com/medvedevgroup/HowDeSBT.

Supplementary information: Supplementary text and figures are available as a single Supplementary file.
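The building block behind Sequence Bloom Trees, a Bloom filter over the k-mers of one experiment, can be sketched as follows. This Python example covers only a single leaf-level filter and a presence query; it does not implement the tree itself or HowDe-SBT's partitioning of information. The class and function names, the filter size, the number of hash functions, and the use of SHA-256 are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit position, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fraction_present(bloom, query, k):
    """Fraction of the query's k-mers that the filter reports as present."""
    query_kmers = kmers(query, k)
    return sum(kmer in bloom for kmer in query_kmers) / len(query_kmers)


k = 5
experiment = BloomFilter()
for kmer in kmers("ACGTACGTTAGCCGATAGGCTA", k):
    experiment.add(kmer)

# A query taken from the indexed sequence is fully covered.
print(fraction_present(experiment, "GTTAGCCGATAG", k))  # 1.0
```

In an SBT-style query, a transcript is typically reported as present in an experiment when this fraction reaches a chosen threshold, and internal tree nodes let whole subtrees be skipped when the threshold cannot be met.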
