CoCo: RNA-seq read assignment correction for nested genes and multimapped reads

2019 ◽  
Vol 35 (23) ◽  
pp. 5039-5047 ◽  
Author(s):  
Gabrielle Deschamps-Francoeur ◽  
Vincent Boivin ◽  
Sherif Abou Elela ◽  
Michelle S Scott

Abstract Motivation Next-generation sequencing techniques revolutionized the study of RNA expression by permitting whole transcriptome analysis. However, sequencing reads generated from nested and multi-copy genes are often either misassigned or discarded, which greatly reduces both quantification accuracy and gene coverage. Results Here we present count corrector (CoCo), a read assignment pipeline that takes into account the multitude of overlapping and repetitive genes in the transcriptome of higher eukaryotes. CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences. CoCo salvages over 15% of discarded aligned RNA-seq reads and significantly changes the abundance estimates for both coding and non-coding RNA as validated by PCR and bedgraph comparisons. Availability and implementation The CoCo software is an open source package written in Python and available from http://gitlabscottgroup.med.usherbrooke.ca/scott-group/coco. Supplementary information Supplementary data are available at Bioinformatics online.
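The proportional distribution of multimapped reads mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each multimapped read is shared among its candidate genes in proportion to their uniquely mapped read counts, with an even split when no candidate has unique reads. That weighting, the function name `distribute_multimapped`, and the gene names are all hypothetical; this is not CoCo's actual implementation.

```python
from collections import defaultdict


def distribute_multimapped(unique_counts, multimapped_reads):
    """Split each multimapped read among its candidate genes.

    unique_counts: dict mapping gene -> number of uniquely mapped reads.
    multimapped_reads: list of lists; each inner list holds the genes
        that one read aligns to.
    Assumption (not taken from the paper): weights are proportional to
    the unique counts, with an even split if all candidates have zero.
    """
    totals = defaultdict(float)
    for gene, count in unique_counts.items():
        totals[gene] += count
    for genes in multimapped_reads:
        candidates = sorted(set(genes))
        weight_sum = sum(unique_counts.get(g, 0) for g in candidates)
        for g in candidates:
            if weight_sum > 0:
                share = unique_counts.get(g, 0) / weight_sum
            else:
                share = 1.0 / len(candidates)
            totals[g] += share
    return dict(totals)


# Hypothetical gene names and counts, for illustration only.
unique = {"SNORD_A": 30, "SNORD_B": 10, "SNORD_C": 0}
multi = [["SNORD_A", "SNORD_B"]] * 4 + [["SNORD_C", "SNORD_D"]] * 2
counts = distribute_multimapped(unique, multi)
```

Each read contributes exactly one count in total, so the sum of the corrected counts equals the number of unique reads plus the number of multimapped reads, and no aligned read is discarded.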

2018 ◽  
Author(s):  
Gabrielle Deschamps-Francoeur ◽  
Vincent Boivin ◽  
Sherif Abou Elela ◽  
Michelle S Scott

Abstract Motivation Next-generation sequencing techniques revolutionized the study of RNA expression by permitting whole transcriptome analysis. However, sequencing reads generated from nested and multi-copy genes are often either misassigned or discarded, which greatly reduces both quantification accuracy and gene coverage. Results Here we present CoCo, a read assignment pipeline that takes into account the multitude of overlapping and repetitive genes in the transcriptome of higher eukaryotes. CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences. CoCo salvages over 15% of discarded aligned RNA-seq reads and significantly changes the abundance estimates for both coding and non-coding RNA as validated by PCR and bed-graph comparisons. Availability The CoCo software is an open source package written in Python and available from http://gitlabscottgroup.med.usherbrooke.ca/scott-group/[email protected]


2021 ◽  
Author(s):  
Xiaopeng An ◽  
Yue Zhang ◽  
Fu Li ◽  
Zhanhang Wang ◽  
Shaohua Yang ◽  
...  

Abstract Background The estrous cycle is a characteristic of females after sexual maturity and includes the estrus (ES) and diestrus (DS) stages. The estrous cycle is important in female physiology, and its disorders may lead to disease. In recent years, the effects of non-coding RNAs and mRNAs on the estrous cycle have attracted growing attention; however, a whole transcriptome analysis covering both non-coding RNAs and mRNAs has not been reported. Results Here we report a whole transcriptome analysis of goat ovary in the estrus and diestrus periods. Estrus synchronization was conducted to induce the estrus phase, and on day 32 the goats naturally shifted into the diestrus stage. Ovary RNA from the estrus and diestrus stages was collected separately for RNA sequencing. The circular RNA, microRNA, long non-coding RNA and mRNA databases of goat ovary were then acquired, and differential expression between the estrus and diestrus stages was screened to construct circRNA-miRNA-mRNA/lncRNA and lncRNA-miRNA/mRNA networks, thus providing potential pathways involved in the regulation of the estrous cycle. Differentially expressed mRNAs, such as MMP9, TIMP1, 3BHSD and PTGIS, and differentially expressed microRNAs, such as miR-21-3p, miR-202-3p and miR-223-3p, which play key roles in estrous cycle regulation, were extracted from the network. Conclusions Our data provide the miRNA, circRNA, lncRNA and mRNA databases of goat ovary and the differential expression profile of each between ES and DS. Networks among differentially expressed miRNAs, circRNAs, lncRNAs and mRNAs were constructed to provide valuable resources for the study of the estrous cycle and related diseases.


2020 ◽  
Vol 36 (9) ◽  
pp. 2705-2711 ◽  
Author(s):  
Gianvito Urgese ◽  
Emanuele Parisi ◽  
Orazio Scicolone ◽  
Santa Di Cataldo ◽  
Elisa Ficarra

Abstract Motivation High-throughput next-generation sequencing can generate huge sequence files, whose analysis requires alignment algorithms that are typically very demanding in terms of memory and computational resources. This is a significant issue, especially for machines with limited hardware capabilities. As the redundancy of the sequences typically increases with coverage, collapsing such files into compact sets of non-redundant reads has the 2-fold advantage of reducing file size and speeding up the alignment, because the same sequence is not mapped multiple times. Method BioSeqZip generates compact and sorted lists of alignment-ready non-redundant sequences, keeping track of their occurrences in the raw files as well as of their quality score information. By exploiting a memory-constrained external sorting algorithm, it can be executed on either single- or multi-sample datasets even on computers with medium computational capabilities. On request, it can even re-expand the compacted files to their original state. Results Our extensive experiments on RNA-Seq data show that BioSeqZip considerably brings down the computational costs of a standard sequence analysis pipeline, with particular benefits for the alignment procedures that typically have the highest requirements in terms of memory and execution time. In our tests, BioSeqZip was able to compact 2.7 billion reads into 963 million unique tags, reducing the size of sequence files by up to 70% and speeding up the alignment by at least 50%. Availability and implementation BioSeqZip is available at https://github.com/bioinformatics-polito/BioSeqZip. Supplementary information Supplementary data are available at Bioinformatics online.
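The idea of collapsing redundant reads while tracking their occurrences can be sketched in a few lines. This is an in-memory illustration only: it does not use the memory-constrained external sorting described in the abstract, it ignores quality scores, and the function names `collapse_reads` and `expand_reads` are hypothetical rather than part of BioSeqZip.

```python
from collections import Counter


def collapse_reads(reads):
    """Return sorted (sequence, count) pairs for the unique sequences."""
    return sorted(Counter(reads).items())


def expand_reads(collapsed):
    """Rebuild the multiset of reads from (sequence, count) pairs.

    The original read order is not preserved in this sketch.
    """
    return [seq for seq, count in collapsed for _ in range(count)]


# Toy reads, for illustration only.
reads = ["ACGT", "TTGA", "ACGT", "ACGT", "GGCC", "TTGA"]
collapsed = collapse_reads(reads)
restored = expand_reads(collapsed)
```

An aligner would then only need to process the three unique sequences in `collapsed` instead of all six reads, and the stored counts allow the per-read results to be recovered afterwards.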


2019 ◽  
Vol 35 (20) ◽  
pp. 4173-4175 ◽  
Author(s):  
Ling-Hong Hung ◽  
Wes Lloyd ◽  
Radhika Agumbe Sridhar ◽  
Saranya Devi Athmalingam Ravishankar ◽  
Yuguang Xiong ◽  
...  

Abstract Summary For many next-generation sequencing pipelines, the most computationally intensive step is the alignment of reads to a reference sequence. As a result, alignment software such as the Burrows-Wheeler Aligner is optimized for speed and is often executed in parallel on the cloud. However, there are other less demanding steps that can also be optimized to significantly increase the speed, especially when using many threads. We demonstrate this using a unique molecular identifier RNA-sequencing pipeline consisting of three steps: split, align, and merge. Optimization of all three steps yields a 40% increase in speed when executed using a single thread. However, when executed using 16 threads, we observe a 4-fold improvement over the original parallel implementation and more than an 8-fold improvement over the original single-threaded implementation. In contrast, optimizing only the alignment step results in just a 13% improvement over the original parallel workflow using 16 threads. Availability and implementation Code (MIT license), supporting scripts and Dockerfiles are available at https://github.com/BioDepot/LINCS_RNAseq_cpp and Docker images at https://hub.docker.com/r/biodepot/rnaseq-umi-cpp/. Supplementary information Supplementary data are available at Bioinformatics online.
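The split, align, and merge structure described in the abstract can be sketched as follows. This is a structural illustration only: the `align` function is a placeholder that assumes each read already carries its gene, whereas the real pipeline aligns reads with software such as the Burrows-Wheeler Aligner, and a Python thread pool is not expected to reproduce the reported speedups. All function names and the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(reads, n_chunks):
    """Split the read list into n_chunks roughly equal parts."""
    return [reads[i::n_chunks] for i in range(n_chunks)]


def align(chunk):
    """Placeholder for alignment: the gene is already attached here.

    Returns the set of (gene, umi) pairs seen in this chunk.
    """
    return {(gene, umi) for gene, umi in chunk}


def merge(partial_results):
    """Union the per-chunk results and count distinct UMIs per gene."""
    seen = set().union(*partial_results)
    counts = {}
    for gene, _umi in seen:
        counts[gene] = counts.get(gene, 0) + 1
    return counts


# Toy (gene, UMI) reads, for illustration only.
reads = [("G1", "AAC"), ("G1", "AAC"), ("G1", "TTG"),
         ("G2", "CCA"), ("G2", "CCA")]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(align, split(reads, 2)))
umi_counts = merge(partials)
```

Because the merge step takes a set union before counting, a gene and UMI pair that appears in several chunks is still counted once, which is why the split and merge steps matter for correctness as well as for speed.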

