DEEP-LONG: A Fast and Accurate Aligner for Long RNA-Seq

2020 ◽  
Author(s):  
Li Hou ◽  
Yadong Wang

Abstract Background: In recent years, advances in sequencing technology have led to the wide use of long reads in many studies, including transcriptomics. Long reads offer clear advantages over short reads, but aligning them also differs from aligning short reads. Many tools can already process long RNA-Seq data, yet several problems remain to be solved. Results: We developed Deep-Long, a fast and accurate tool for processing long RNA-Seq data. Deep-Long handles difficulties arising from complicated gene structures and sequencing errors well, and performs especially well on alternative splicing and small exons. When the sequencing error rate is low, Deep-Long rapidly produces highly accurate results; as the error rate rises, Deep-Long takes more time but remains faster and more accurate than most other tools. Conclusions: Deep-Long is a useful tool for aligning long RNA-Seq reads to a genome and can identify more exons and splice sites.

2020 ◽  
Vol 98 (Supplement_4) ◽  
pp. 81-82
Author(s):  
Joaquim Casellas ◽  
Melani Martín de Hijas-Villalba ◽  
Marta Vázquez-Gómez ◽  
Samir Id Lahoucine

Abstract Current European regulations for autochthonous livestock breeds place special emphasis on pedigree completeness, which in most cases requires laboratory paternity testing with genetic markers. This entails significant economic expenditure for breed societies and precludes other investments in breeding programs, such as genomic evaluation. Within this context, we developed paternity testing through low-coverage whole-genome data in order to reuse these data for genomic evaluation at no extra cost. Simulations relied on diploid genomes composed of 30 chromosomes (100 cM each) with 3,000,000 SNP per chromosome. Each population evolved during 1,000 non-overlapping generations with effective size 100, mutation rate 10^-4, and recombination by Kosambi's function. Only those populations with 1,000,000 ± 10% polymorphic SNP per chromosome in generation 1,000 were retained for further analyses and expanded to the required number of parents and offspring. Individuals were sequenced at 0.01, 0.05, 0.1, 0.5 and 1X depth, with 100, 500, 1,000 or 10,000 base-pair reads, assuming a random sequencing error rate per SNP between 10^-2 and 10^-5. Assuming known population allele frequencies and sequencing error rate, 0.05X depth sufficed to corroborate the true father (85.0%) and to discard other candidates (96.3%). Those percentages increased to 99.6% and 99.9%, respectively, with 0.1X depth (read length = 10,000 bp; smaller read lengths slightly improved the results because they increase the number of sequenced SNP). Results were highly sensitive to biases in allele frequencies and robust to inaccuracies in the sequencing error rate. Low-coverage whole-genome sequencing data could subsequently be integrated into genomic BLUP equations by appropriately constructing the genomic relationship matrix. This approach increased the correlation between simulated and predicted breeding values by 1.21% (h² = 0.25; 100 parents and 900 offspring; 0.1X depth with 10,000 bp reads). Although small, this increase opens the door to genomic evaluation in local livestock breeds.
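The abstract does not spell out the likelihood model behind "corroborating the true father" from low-coverage reads, known allele frequencies and a sequencing error rate. The sketch below is an illustrative assumption of how such a test can be framed, as a per-SNP likelihood ratio over read counts that is summed across sites; the function names, the simple binomial read model and the toy data are not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a paternity log-likelihood-ratio
# test from low-coverage read counts, assuming known population allele
# frequencies and a per-base sequencing error rate.
import math

def read_lik(n_alt, n_ref, dosage, err):
    """P(read counts | genotype with `dosage` copies of the alt allele)."""
    p_alt = (dosage / 2.0) * (1.0 - err) + (1.0 - dosage / 2.0) * err
    return (p_alt ** n_alt) * ((1.0 - p_alt) ** n_ref)

def genotype_posterior(n_alt, n_ref, alt_freq, err):
    """Posterior over genotypes (0, 1, 2 alt copies) under a HWE prior."""
    p, q = 1.0 - alt_freq, alt_freq
    prior = [p * p, 2 * p * q, q * q]
    lik = [prior[g] * read_lik(n_alt, n_ref, g, err) for g in range(3)]
    tot = sum(lik)
    return [x / tot for x in lik]

def paternity_log_lr(offspring, candidate, alt_freq, err):
    """log10 LR: candidate is the father vs. unrelated, for one SNP.
    `offspring` and `candidate` are (n_alt, n_ref) read counts."""
    q = alt_freq

    def off_lik(t):
        # t = probability the paternal allele is the alt allele;
        # the maternal allele is drawn from the population (mother unobserved).
        lik = 0.0
        for pat_alt in (0, 1):
            p_pat = t if pat_alt else 1.0 - t
            for mat_alt in (0, 1):
                p_mat = q if mat_alt else 1.0 - q
                lik += p_pat * p_mat * read_lik(*offspring, pat_alt + mat_alt, err)
        return lik

    # H1: paternal allele transmitted by the candidate (Mendelian segregation).
    post_f = genotype_posterior(*candidate, alt_freq, err)
    h1 = sum(post_f[g] * off_lik(g / 2.0) for g in range(3))
    # H0: paternal allele drawn from the population.
    h0 = off_lik(q)
    return math.log10(h1 / h0)

# Toy usage: sum the per-SNP log10 LR over all sequenced SNPs; a clearly
# positive total supports paternity, a clearly negative one excludes it.
snps = [((1, 0), (2, 0), 0.30), ((0, 1), (0, 1), 0.10), ((1, 1), (1, 0), 0.45)]
total = sum(paternity_log_lr(o, f, q, err=1e-3) for o, f, q in snps)
print(f"cumulative log10 LR = {total:.3f}")
```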


2021 ◽  
Author(s):  
Ridvan Eksi ◽  
Daiyao Yi ◽  
Hongyang Li ◽  
Bradley Godfrey ◽  
Lisa R. Mathew ◽  
...  

Abstract Studying isoform expression at the microscopic level has always been a challenging task. A classical example is the kidney, where the glomerular and tubulo-interstitial compartments carry out drastically different physiological functions, and presumably their isoform expression differs as well. We aimed to develop an experimental and computational pipeline for identifying isoforms at the level of microscopic structures. We microdissected glomerular and tubulo-interstitial compartments from healthy human kidney tissues from two cohorts. The two compartments were sequenced separately with the PacBio RS II platform. These transcripts were then validated against transcripts from the same samples obtained with the traditional Illumina RNA-Seq protocol, against distinct Illumina RNA-Seq short reads from European Renal cDNA Bank (ERCB) samples, and against the annotated GENCODE transcript list, thereby identifying novel transcripts. We identified 14,739 and 14,259 annotated transcripts, and 17,268 and 13,118 potentially novel transcripts, in the glomerular and tubulo-interstitial compartments, respectively. Of note, relying solely on either short or long reads would have resulted in many erroneous identifications. We identified distinct pathways involved in the glomerular and tubulo-interstitial compartments at the isoform level. We demonstrated the possibility of microdissecting a tissue and incorporating both long- and short-read sequencing to identify isoforms for each compartment.
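The abstract does not describe how long-read transcripts are labelled "annotated" versus "potentially novel" against GENCODE. A minimal sketch of one common convention, exact intron-chain matching against a reference GTF, is shown below; the file names and the matching rule are assumptions, not the authors' pipeline, and mono-exon transcripts would need an additional overlap rule.

```python
# Illustrative intron-chain comparison of long-read transcripts vs. GENCODE.
from collections import defaultdict

def exon_chains_from_gtf(path):
    """(chrom, strand, transcript_id) -> sorted list of exon (start, end)."""
    exons = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            f = line.rstrip("\n").split("\t")
            if len(f) < 9 or f[2] != "exon":
                continue
            attrs = dict(kv.strip().split(" ", 1)
                         for kv in f[8].strip(";").split(";") if kv.strip())
            tid = attrs["transcript_id"].strip('"')
            exons[(f[0], f[6], tid)].append((int(f[3]), int(f[4])))
    return {key: sorted(v) for key, v in exons.items()}

def intron_chain(chrom, strand, exons):
    """Hashable signature: chromosome, strand and the ordered intron coordinates."""
    introns = tuple((exons[i][1] + 1, exons[i + 1][0] - 1)
                    for i in range(len(exons) - 1))
    return (chrom, strand, introns)

def classify(query_gtf, reference_gtf):
    ref = {intron_chain(c, s, ex)
           for (c, s, _), ex in exon_chains_from_gtf(reference_gtf).items()}
    for (chrom, strand, tid), ex in exon_chains_from_gtf(query_gtf).items():
        status = "annotated" if intron_chain(chrom, strand, ex) in ref else "potentially_novel"
        print(tid, status, sep="\t")

# classify("pacbio_transcripts.gtf", "gencode.annotation.gtf")  # hypothetical paths
```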


2020 ◽  
Author(s):  
Quang Tran ◽  
Vinhthuy Phan

Abstract Background: Most current metagenomic classifiers and profilers employ short reads to classify, bin and profile the microbial genomes present in metagenomic samples. Many of these methods rely on identifying genomic regions that are unique to each genome in order to differentiate them, and for this purpose short read lengths might be suboptimal. Longer reads might improve classification and profiling performance; however, longer reads produced by current technology tend to have a higher sequencing error rate than short reads. It is not clear whether the trade-off between longer length and higher sequencing error will increase or decrease classification and profiling performance. Results: We compared the performance of popular metagenomic classifiers on short reads and on longer sequences assembled from those same short reads. Using a number of popular assemblers to produce the longer sequences, we found that most classifiers made fewer predictions with longer reads and achieved higher classification performance on synthetic metagenomic data. Specifically, across most classifiers we observed a significant increase in precision while recall remained the same, resulting in higher overall classification performance. On real metagenomic data we observed a similar trend of fewer predictions, suggesting the same behaviour of higher precision at unchanged recall with longer reads. Conclusions: This finding has two main implications. First, it suggests that species classification in metagenomic environments can achieve higher overall performance simply by assembling short reads, gaining precision while maintaining recall comparable to that of shorter reads. Second, it suggests that long-read technologies are worth considering for species classification in metagenomic applications. Current long-read technologies tend to have higher sequencing error rates and are more expensive than short-read technologies, so the trade-offs between their pros and cons should be investigated.
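The "fewer predictions, higher precision, same recall" pattern follows directly from how precision and recall are defined when a classifier leaves some sequences unassigned. A minimal sketch of that evaluation, with toy read labels assumed for illustration (not the paper's scripts or data):

```python
# Precision counts only the sequences the classifier assigned; recall is
# measured against all sequences with known truth, including unassigned ones.
def precision_recall(truth, calls):
    """truth: read_id -> true taxon; calls: read_id -> predicted taxon
    (reads missing from `calls` are unclassified)."""
    predicted = len(calls)
    correct = sum(1 for rid, taxon in calls.items() if truth.get(rid) == taxon)
    precision = correct / predicted if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

truth = {"r1": "E. coli", "r2": "E. coli", "r3": "B. subtilis", "r4": "B. subtilis"}
short_read_calls = {"r1": "E. coli", "r2": "S. enterica", "r3": "B. subtilis", "r4": "B. cereus"}
long_read_calls = {"r1": "E. coli", "r3": "B. subtilis"}  # fewer calls, all correct
print(precision_recall(truth, short_read_calls))  # (0.5, 0.5)
print(precision_recall(truth, long_read_calls))   # (1.0, 0.5)
```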


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Nicholas Stoler ◽  
Anton Nekrutenko

Abstract Sequencing technology has achieved great advances in the past decade. Previous studies have characterized the quality of specific instruments under controlled conditions. Here, we developed a method able to retroactively determine the error rate of most public sequencing datasets. To do this, we utilized the overlaps between reads that are a feature of many sequencing libraries. With this method, we surveyed 1,943 different datasets from seven different Illumina sequencing instruments. We show that among public datasets, the more expensive platforms such as HiSeq and NovaSeq have a lower error rate and less variation, but we also discovered great variation within each platform, with the accuracy of a sequencing experiment depending greatly on the experimenter. We show the importance of sequence context, especially the phenomenon whereby preceding bases bias the following bases toward the same identity, and we show differences in the patterns of sequence bias between instruments. Contrary to expectations based on the underlying chemistry, HiSeq X Ten and NovaSeq 6000 share notable exceptions to the preceding-base bias. Our results demonstrate the importance of the specific circumstances of every sequencing experiment and of evaluating the quality of each one.
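The core of the overlap idea can be sketched compactly: where the two mates of a short fragment cover the same bases, any disagreement implies a sequencing error in one of the two reads. The code below is a simplified stand-in, not the authors' tool, and assumes a known fragment length, no indels, and at most one erroneous mate at each disagreeing position.

```python
# Estimate an error rate from the overlapping region of a read pair.
def revcomp(seq):
    return seq.translate(str.maketrans("ACGTN", "TACGN"))[::-1]

def overlap_error_rate(r1, r2, fragment_len):
    """r1, r2: mate sequences as sequenced; returns (mismatches, bases, rate)."""
    r2_fwd = revcomp(r2)                      # put both mates on the same strand
    overlap = len(r1) + len(r2) - fragment_len
    if overlap <= 0:
        return 0, 0, 0.0                      # mates do not overlap
    a = r1[-overlap:]                         # 3' end of R1
    b = r2_fwd[:overlap]                      # 5' end of R2 on the forward strand
    mism = sum(1 for x, y in zip(a, b) if x != y and "N" not in (x, y))
    bases = 2 * overlap                       # both reads contribute sequenced bases
    return mism, bases, mism / bases

# Toy pair from a 15 bp fragment with one error in the R2 overlap:
r1 = "ACGTACGTAC"                  # forward read, fragment positions 1-10
r2 = "CGTACGTACT"                  # reverse-strand read of positions 6-15, one error
print(overlap_error_rate(r1, r2, fragment_len=15))  # -> (1, 10, 0.1)
```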


2019 ◽  
Author(s):  
Marc A Sze ◽  
Patrick D Schloss

Abstract PCR amplification of 16S rRNA genes is a critical yet underappreciated step in the generation of sequence data used to describe the taxonomic composition of microbial communities. Numerous factors in the design of PCR can impact the sequencing error rate, the abundance of chimeric sequences, and the degree to which the fragments in the product represent their abundance in the original sample (i.e., bias). We compared the performance of high-fidelity polymerases and varying numbers of rounds of amplification when amplifying a mock community and human stool samples. Although it was impossible to derive specific recommendations, we did observe general trends. Namely, using a polymerase with the highest possible fidelity and minimizing the number of rounds of PCR reduced the sequencing error rate, the fraction of chimeric sequences, and bias. Evidence of bias at the sequence level was subtle and could not be ascribed to the fragments' fraction of bases that were guanines or cytosines. When analyzing mock community data, the amount by which the community deviated from the expected composition increased with the number of rounds of PCR. This bias was inconsistent for human stool samples. Overall, the results underscore the difficulty of comparing sequence data generated by different PCR protocols. However, they also indicate that the variation among human stool samples is generally larger than that introduced by the choice of polymerase or number of rounds of PCR. Importance A steep decline in sequencing costs drove an explosion in studies characterizing microbial communities from diverse environments. Although a significant amount of effort has gone into understanding the error profiles of DNA sequencers, little has been done to understand the downstream effects of the PCR amplification protocol. We quantified the effects of the choice of polymerase and the number of PCR cycles on the quality of downstream data. We found that these choices can have a profound impact on the way that a microbial community is represented in the sequence data. The effects are relatively small compared to the variation among human stool samples; however, care should be taken to use polymerases with the highest possible fidelity and to minimize the number of rounds of PCR. These results also underscore that it is not possible to directly compare sequence data generated under different PCR conditions.
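A deviation-from-expected-composition measure of the kind used for mock communities can be computed very simply. The metric and toy counts below are assumptions for illustration (the abstract does not name the exact statistic used):

```python
# Quantify how far a sequenced mock community drifts from its expected
# composition, e.g. to compare polymerases or numbers of PCR cycles.
def relative_abundance(counts):
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

def total_deviation(observed_counts, expected_freqs):
    """Sum of absolute differences between observed and expected fractions
    (equals twice the Bray-Curtis dissimilarity on relative abundances)."""
    obs = relative_abundance(observed_counts)
    taxa = set(obs) | set(expected_freqs)
    return sum(abs(obs.get(t, 0.0) - expected_freqs.get(t, 0.0)) for t in taxa)

expected = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}    # even mock community
after_15_cycles = {"A": 260, "B": 240, "C": 255, "D": 245}  # toy read counts
after_35_cycles = {"A": 400, "B": 150, "C": 300, "D": 150}
print(total_deviation(after_15_cycles, expected))  # 0.03, small deviation
print(total_deviation(after_35_cycles, expected))  # 0.40, larger deviation
```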


2020 ◽  
Author(s):  
Jose M. Haro-Moreno ◽  
Mario López-Pérez ◽  
Francisco Rodríguez-Valera

ABSTRACT Background: Third-generation sequencing has made few inroads into metagenomics because of its high error rate and the dependence of assembly on bioinformatics designed for short reads. However, second-generation metagenomics (mostly Illumina) suffers from limitations, particularly in the assembly of microbes with high microdiversity and in retrieving the flexible (adaptive) compartment of prokaryotic genomes. Results: Here we used different third-generation techniques to study the metagenome of a well-known marine sample from the mixed epipelagic water column of the winter Mediterranean. We compared Oxford Nanopore and PacBio latest-generation technologies with the classical approach of Illumina short reads followed by assembly. PacBio Sequel II CCS appears particularly suitable for cellular metagenomics owing to its low error rate. Long reads allow efficient direct retrieval of complete genes (473M/Tb) and operons before assembly, facilitating annotation and compensating for the limitations of short reads and short-read assemblies. metaSPAdes was the most appropriate assembly program when used in combination with short reads. Assemblies of the long reads also allow the reconstruction of much more complete metagenome-assembled genomes (MAGs), even from microbes with high microdiversity. The flexible genome of the reconstructed MAGs is much more complete and allows more adaptive genes to be rescued. Conclusions: For most applications of metagenomics, from community structure analysis to ecosystem functioning, long reads should be applied whenever possible. Particularly for in silico screening of biotechnologically useful genes or for population genomics, long-read metagenomics currently appears to be a very fruitful approach and can be used on raw reads, before a computationally demanding (and potentially artefactual) assembly step.
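The 473M/Tb figure is a yield normalization: complete gene calls divided by terabases of raw reads. A minimal sketch of that bookkeeping follows; it assumes Prodigal-style GFF gene calls in which a "partial=00" attribute marks genes complete at both ends, and the file names are illustrative, not the authors' pipeline.

```python
# Count complete gene calls on raw long reads and normalize per terabase.
def count_complete_genes(gff_path):
    n = 0
    with open(gff_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 9 and fields[2] == "CDS" and "partial=00" in fields[8]:
                n += 1
    return n

def total_bases(fasta_path):
    with open(fasta_path) as fh:
        return sum(len(line.strip()) for line in fh if not line.startswith(">"))

def genes_per_terabase(gff_path, reads_fasta):
    return count_complete_genes(gff_path) / (total_bases(reads_fasta) / 1e12)

# print(f"{genes_per_terabase('ccs_reads_genes.gff', 'ccs_reads.fasta'):.3g} genes/Tb")
```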


mSphere ◽  
2019 ◽  
Vol 4 (3) ◽  
Author(s):  
Marc A. Sze ◽  
Patrick D. Schloss

ABSTRACT PCR amplification of 16S rRNA genes is a critical yet underappreciated step in the generation of sequence data to describe the taxonomic composition of microbial communities. Numerous factors in the design of PCR can impact the sequencing error rate, the abundance of chimeric sequences, and the degree to which the fragments in the product represent their abundance in the original sample (i.e., bias). We compared the performance of high fidelity polymerases and various numbers of rounds of amplification when amplifying a mock community and human stool samples. Although it was impossible to derive specific recommendations, we did observe general trends. Namely, using a polymerase with the highest possible fidelity and minimizing the number of rounds of PCR reduced the sequencing error rate, fraction of chimeric sequences, and bias. Evidence of bias at the sequence level was subtle and could not be ascribed to the fragments’ fraction of bases that were guanines or cytosines. When analyzing mock community data, the amount that the community deviated from the expected composition increased with the number of rounds of PCR. This bias was inconsistent for human stool samples. Overall, the results underscore the difficulty of comparing sequence data that are generated by different PCR protocols. However, the results indicate that the variation in human stool samples is generally larger than that introduced by the choice of polymerase or number of rounds of PCR. IMPORTANCE A steep decline in sequencing costs drove an explosion in studies characterizing microbial communities from diverse environments. Although a significant amount of effort has gone into understanding the error profiles of DNA sequencers, little has been done to understand the downstream effects of the PCR amplification protocol. We quantified the effects of the choice of polymerase and number of PCR cycles on the quality of downstream data. We found that these choices can have a profound impact on the way that a microbial community is represented in the sequence data. The effects are relatively small compared to the variation in human stool samples; however, care should be taken to use polymerases with the highest possible fidelity and to minimize the number of rounds of PCR. These results also underscore that it is not possible to directly compare sequence data generated under different PCR conditions.


2019 ◽  
Author(s):  
Tiffany M. Delhomme ◽  
Patrice H. Avogbe ◽  
Aurélie Gabriel ◽  
Nicolas Alcala ◽  
Noemie Leblay ◽  
...  

ABSTRACT The emergence of Next-Generation Sequencing (NGS) has revolutionized the way genome sequences are obtained, with the promise of providing a comprehensive characterization of DNA variation. Nevertheless, detecting somatic mutations remains a difficult problem, in particular when trying to identify low-abundance mutations such as subclonal mutations, tumour-derived alterations in body fluids, or somatic mutations in histologically normal tissue. The main challenge is to precisely distinguish between sequencing artefacts and true mutations, particularly when the latter are so rare that they reach abundance levels similar to those of artefacts. Here, we present needlestack, a highly sensitive variant caller that learns the level of systematic sequencing error directly from the data in order to accurately call mutations. Needlestack is based on the idea that the sequencing error rate can be dynamically estimated by analyzing multiple samples together. We show that the sequencing error rate varies across alterations, illustrating the need to estimate it precisely. We evaluate the performance of needlestack for various types of variation and show that it is robust across positions and outperforms existing state-of-the-art methods for low-abundance mutations. Needlestack, along with its source code, is freely available on the GitHub platform: https://github.com/IARCbioinfo/needlestack.
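To make the "estimate the error rate jointly across samples, then flag outliers" idea concrete, here is a deliberately simplified stand-in. It is not needlestack's actual model (needlestack fits a robust negative-binomial regression across samples); instead it uses a robust pooled error-rate estimate and a per-sample binomial test, with toy counts for illustration.

```python
# Simplified multi-sample calling at a single genomic position.
from scipy.stats import binomtest

def call_position(alt_counts, depths, alpha=1e-3):
    """alt_counts[i], depths[i]: ALT reads and coverage for sample i at one site."""
    # Robust pooled error-rate estimate: median per-sample ALT fraction, so that
    # a few true mutation carriers do not inflate the estimate.
    fracs = sorted(a / d for a, d in zip(alt_counts, depths) if d > 0)
    err = max(fracs[len(fracs) // 2], 1e-6)
    calls = []
    for i, (a, d) in enumerate(zip(alt_counts, depths)):
        p = binomtest(a, d, err, alternative="greater").pvalue if d > 0 else 1.0
        calls.append((i, a, d, p, p < alpha))
    return err, calls

# Toy site: eight samples share a low background error rate; sample 5 carries a
# low-abundance mutation well above it.
alt = [1, 0, 2, 1, 0, 25, 1, 2]
dp  = [900, 850, 1000, 950, 800, 1000, 920, 880]
err, calls = call_position(alt, dp)
print(f"estimated error rate = {err:.2e}")
for i, a, d, p, is_variant in calls:
    print(f"sample {i}: {a}/{d} p={p:.2e} variant={is_variant}")
```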


2017 ◽  
Author(s):  
Pierre Morisse ◽  
Thierry Lecroq ◽  
Arnaud Lefebvre

Abstract Motivation: The recent rise of long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore makes it possible to tackle assembly problems for larger and more complex genomes than short-read technologies allowed. However, these long reads are very noisy, with error rates of around 10 to 15% for Pacific Biosciences and up to 30% for Oxford Nanopore. The error correction problem has been tackled either by self-correcting the long reads or by using complementary short reads in a hybrid approach, but most methods focus only on Pacific Biosciences data and do not apply to Oxford Nanopore reads. Moreover, even though recent Oxford Nanopore chemistries promise to lower the error rate below 15%, it remains higher in practice, and correcting such noisy long reads is still an issue. Results: We present HG-CoLoR, a hybrid error correction method built on a seed-and-extend approach: short reads are aligned to the long reads, and the seeds are then extended by traversing a variable-order de Bruijn graph built from the short reads. Our experiments show that HG-CoLoR efficiently corrects Oxford Nanopore long reads with error rates as high as 44%. When compared to other state-of-the-art long-read error correction methods able to deal with Oxford Nanopore data, our experiments also show that HG-CoLoR provides the best trade-off between runtime and quality of results, and is the only method able to scale efficiently to eukaryotic genomes. Availability and implementation: HG-CoLoR is implemented in C++, supported on Linux platforms, and freely available at https://github.com/morispi/HG-CoLoR. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
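The bridging step can be illustrated with a much-reduced sketch: HG-CoLoR itself traverses a variable-order de Bruijn graph after aligning the short reads to the long reads, whereas the code below uses a fixed-order graph and a plain breadth-first search purely to show the "connect two seeds through a short-read graph" idea; the toy reads and k are assumptions for illustration.

```python
# Bridge two seed k-mers anchored on a noisy long read through a de Bruijn
# graph built from accurate short reads.
from collections import deque

def build_dbg(short_reads, k):
    """Map each (k-1)-mer to the set of k-mers that extend it by one base."""
    graph = {}
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph.setdefault(kmer[:-1], set()).add(kmer)
    return graph

def bridge(graph, left_seed, right_seed, k, max_len=200):
    """Return a sequence that starts with left_seed and ends with right_seed,
    found by walking the graph, or None if no bridge exists within max_len."""
    queue = deque([left_seed])
    seen = {left_seed}
    while queue:
        path = queue.popleft()
        if path.endswith(right_seed) and len(path) > len(left_seed):
            return path
        if len(path) > max_len:
            continue
        for kmer in graph.get(path[-(k - 1):], ()):
            nxt = path + kmer[-1]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None

# Toy data: the short reads spell out the true sequence between the two seeds.
short_reads = ["ACGTTGCA", "GTTGCACC", "TGCACCGT"]
graph = build_dbg(short_reads, k=4)
print(bridge(graph, left_seed="ACGT", right_seed="CCGT", k=4))  # ACGTTGCACCGT
```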

