TraRECo: a greedy approach based de novo transcriptome assembler with read error correction using consensus matrix

BMC Genomics ◽  
2018 ◽  
Vol 19 (1) ◽  
Author(s):  
Seokhyun Yoon ◽  
Daeseung Kim ◽  
Keunsoo Kang ◽  
Woong June Park
2015 ◽  
Author(s):  
Matthew D MacManes

Motivation: The correction of sequencing errors contained in Illumina reads derived from genomic DNA is a common pre-processing step in many de novo genome assembly pipelines, and has been shown to improve the quality of the resulting assemblies. In contrast, the correction of errors in transcriptome sequence data is much less common, but can potentially yield similar improvements in mapping and assembly quality. This manuscript evaluates the ability of several popular read-correction tools to correct the sequence errors common in transcriptome-derived Illumina reads. Results: I evaluated the efficacy of correction of transcriptome-derived sequencing reads using several metrics across a variety of sequencing depths. This evaluation demonstrates a complex relationship between the quality of the correction, depth of sequencing, and hardware availability, which results in variable recommendations depending on the goals of the experiment, tolerance for false positives, and depth of coverage. Overall, read error correction is an important step in read quality control, and should become a standard part of analytical pipelines. Availability: Results are non-deterministically repeatable using AMI:ami-3dae4956 (MacManes EC 2015) and the Makefile available here: https://goo.gl/oVIuE0


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Nathan LaPierre ◽  
Rob Egan ◽  
Wei Wang ◽  
Zhong Wang

Abstract Background Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently, Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area. Results Here we developed a Convolutional Neural Network (CNN) based method, called MiniScrub, for identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments to minimize their interference in the downstream assembly process. MiniScrub first generates read-to-read overlaps via MiniMap2, then encodes the overlaps into images, and finally builds CNN models to predict low-quality segments. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality, and improves read error correction in the metagenome setting. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. Conclusions MiniScrub is able to robustly improve read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub.


2017 ◽  
Author(s):  
Seokhyun Yoon ◽  
Daeseung Kim ◽  
Keunsoo Kang ◽  
Woong June Park

Abstract Background Challenges in developing a good de novo transcriptome assembler include how to deal with read errors and sequence repeats. Almost all de novo assemblers utilize a de Bruijn graph, whose complexity grows linearly with data size but which suffers from errors and repeats. Although one can correct errors by inspecting the topological structure of the graph, this is not an easy task when there are too many branches. There are two research directions: improving either graph reliability or path search precision. We focused on improving the reliability. Results We present TraRECo, a greedy approach to de novo assembly employing error-aware graph construction. The idea is similar to the overlap-layout-consensus approach used for genome assembly, but differs in that the consensus is formed throughout the entire graph construction step. Basically, we built contigs by direct read alignment within a distance margin and performed junction search to construct splicing graphs. While doing so, however, a contig of length l was represented by a 4×l matrix (called the consensus matrix), of which each element was the base count of the reads aligned so far. A representative sequence is obtained by taking the majority in each column of the consensus matrix, and is used for further read alignment. Once splicing graphs were obtained, we used IsoLasso to find paths with noticeable read depth. The experiments using real and simulated reads showed that the method provides considerable improvements in sensitivity and reasonably better performance when both sensitivity and precision are compared. This could be achieved by allowing more erroneous reads to participate in graph construction, which, in turn, improved the quality of the depth information used in the subsequent path search step. The results for simulated reads also showed that challenges remain, since a non-negligible percentage of highly abundant transcripts were not recovered by the assemblers we considered. Conclusion De novo assembly is mainly used to explore not-yet-discovered isoforms and must be able to represent as many reads as possible in an efficient way. In this sense, TraRECo provides a potential alternative for improving graph reliability, even though the computational burden can be much higher than that of a single k-mer de Bruijn graph approach.
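The consensus-matrix idea described in this abstract can be illustrated with a minimal sketch. It is not TraRECo's implementation: the function names, the gap-free alignment at a known offset, the tie-breaking rule, and the use of "N" for uncovered columns are all assumptions made for illustration.

```python
# Illustrative sketch only; names and alignment model are hypothetical,
# not taken from the TraRECo source code.
BASES = "ACGT"
BASE_INDEX = {base: row for row, base in enumerate(BASES)}


def new_consensus_matrix(length):
    """Return a 4 x length matrix of zero base counts (rows: A, C, G, T)."""
    return [[0] * length for _ in BASES]


def add_read(matrix, read, offset):
    """Add the base counts of a read aligned without gaps at column `offset`."""
    for i, base in enumerate(read):
        row = BASE_INDEX.get(base)
        if row is not None:  # ambiguous bases such as N are not counted
            matrix[row][offset + i] += 1


def representative(matrix):
    """Take the majority base in each column; uncovered columns become 'N'.

    Ties resolve to the first base in A, C, G, T order (an arbitrary choice).
    """
    calls = []
    for col in range(len(matrix[0])):
        counts = [matrix[row][col] for row in range(4)]
        best = max(range(4), key=lambda r: counts[r])
        calls.append(BASES[best] if counts[best] > 0 else "N")
    return "".join(calls)


# Four reads over a contig of length 8; the third read has an error at column 2.
matrix = new_consensus_matrix(8)
for read, offset in [("ACGTAC", 0), ("ACGTAC", 0), ("ACCTAC", 0), ("GTACGG", 2)]:
    add_read(matrix, read, offset)

consensus = representative(matrix)
print(consensus)
```

In this toy case the erroneous read still contributes its counts to every column, while the majority vote keeps the representative sequence unaffected by the single mismatch, which is the property the abstract attributes to the consensus matrix.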


Author(s):  
Felix Kallenborn ◽  
Andreas Hildebrandt ◽  
Bertil Schmidt

Abstract Motivation Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates, since they break reads into independent k-mers, or do not scale efficiently to large amounts of sequencing reads and complex genomes. Results We present CARE, an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections, which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly, it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. Availability and implementation CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. Supplementary information Supplementary data are available at Bioinformatics online.
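The general minhashing concept mentioned in this abstract can be sketched as follows. This is a generic illustration of estimating k-mer set similarity between two reads, not CARE's algorithm or API: the function names, the choice of k, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions.

```python
# Generic minhash sketch; all names and parameters here are hypothetical
# and do not reflect how CARE implements its similarity search.
import hashlib


def kmers(read, k):
    """Return the set of k-mers in a read (the read must be at least k long)."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def hash_kmer(kmer, seed):
    """Hash a k-mer to a 64-bit integer; the seed selects the hash function."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(read, k=8, num_hashes=16):
    """For each hash function, keep the minimum hash value over all k-mers."""
    kmer_set = kmers(read, k)
    return [min(hash_kmer(km, seed) for km in kmer_set)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature entries, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


read_a = "ACGTACGTTGCATGCAAGCT"
read_b = "ACGTACGTTGCATGCAAGCT"  # identical to read_a
read_c = "CCCCCCCCCCCCCCCCCCCC"  # shares no 8-mer with read_a

sig_a = minhash_signature(read_a)
same = estimated_similarity(sig_a, minhash_signature(read_b))
different = estimated_similarity(sig_a, minhash_signature(read_c))
print(same, different)
```

Because signatures are short and fixed in length, reads with matching signature entries can be grouped as alignment candidates without comparing every pair of reads directly, which is the kind of efficient similarity search the abstract refers to.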


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Álvaro Figueroa ◽  
Antonio Brante ◽  
Leyla Cárdenas

Abstract The polychaete Boccardia wellingtonensis is a poecilogonous species that produces different larval types. Females may lay Type I capsules, in which only planktotrophic larvae are present, or Type III capsules that contain planktotrophic and adelphophagic larvae as well as nurse eggs. While planktotrophic larvae do not feed during encapsulation, adelphophagic larvae develop by feeding on nurse eggs and on other larvae inside the capsules and hatch at the juvenile stage. Previous work has not found differences in morphology between the two larval types; thus, the factors explaining the contrasting feeding abilities of larvae of this species are still unknown. In this paper, we use a transcriptomic approach to study the cellular and genetic mechanisms underlying the different larval trophic modes of B. wellingtonensis. By using approximately 624 million high-quality reads, we assemble the de novo transcriptome with 133,314 contigs, coding 32,390 putative proteins. We identify 5221 genes that are up-regulated in larval stages compared to their expression in adult individuals. The genetic expression profile differed between larval trophic modes, with genes involved in lipid metabolism and chaetogenesis overexpressed in planktotrophic larvae. In contrast, up-regulated genes in adelphophagic larvae were associated with DNA replication and mRNA synthesis.


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Daniel Stribling ◽  
Peter L. Chang ◽  
Justin E. Dalton ◽  
Christopher A. Conow ◽  
Malcolm Rosenthal ◽  
...  

Abstract Objectives Arachnids have fascinating and unique biology, particularly for questions on sex differences and behavior, creating the potential for development of powerful emerging models in this group. Recent advances in genomic techniques have paved the way for a significant increase in the breadth of genomic studies in non-model organisms. One growing area of research is comparative transcriptomics. When phylogenetic relationships to model organisms are known, comparative genomic studies provide context for analysis of homologous genes and pathways. The goal of this study was to lay the groundwork for comparative transcriptomics of sex differences in the brain of wolf spiders, a non-model organism of the phylum Euarthropoda, by generating transcriptomes and analyzing gene expression. Data description To examine sex-differential gene expression, short read transcript sequencing and de novo transcriptome assembly were performed. Messenger RNA was isolated from brain tissue of male and female subadult and mature wolf spiders (Schizocosa ocreata). The raw data consist of sequences for the two different life stages in each sex. Computational analyses on these data include de novo transcriptome assembly and differential expression analyses. Sample-specific and combined transcriptomes, gene annotations, and differential expression results are described in this data note and are available from publicly available databases.

