Correcting values of DNA sequence similarity for errors in sequencing

Mapping Intimacies ◽

10.1101/237990 ◽

2017 ◽

Author(s):

Timothy J. Hackmann

Keyword(s):

Ribosomal DNA ◽

DNA Sequences ◽

Sequence Similarity ◽

Single Equation ◽

Error Rates ◽

Sequencing Error ◽

Original Sequence ◽

Sequencing Errors ◽

Correct Sequence ◽

Similarity Thresholds

Abstract The similarity between two DNA sequences is one of the most important measures in bioinformatics, but errors introduced during sequencing make values of similarity lower than they should be. Here we develop a method to correct raw sequence similarity for sequencing errors and estimate the original sequence similarity. Our method is simple and consists of a single equation with terms for 1) raw sequence similarity and 2) error rates (e.g., from Phred quality scores). We show the importance of this correction for 16S ribosomal DNA sequences from bacterial communities, where 97% similarity is a frequent threshold for clustering sequences for analysis. At that threshold and a typical error rate of 0.2%, correcting for error increases similarity by 0.36 percentage points. This result shows that, if uncorrected, sequencing error would increase similarity thresholds and generate false clusters for analysis. Our method could be used to adjust thresholds for cluster-based analyses. Alternatively, because it requires no clustering to correct sequence similarity, it could usher in a new age of analyzing ribosomal DNA sequences without clustering.
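
As a rough illustration of the kind of correction this abstract describes, the sketch below converts Phred quality scores to error probabilities using the standard relation p = 10^(-Q/10), then rescales a raw similarity value. The correction model and all function names are assumptions made for illustration, not the author's equation: the sketch assumes errors are independent between the two reads and ignores the rare case where errors create a spurious match, so it need not reproduce the 0.36-point figure quoted above.

```python
def phred_to_error(q: float) -> float:
    """Standard Phred relation: error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_error_rate(quals: list[float]) -> float:
    """Average per-base error probability over one read's Phred scores."""
    return sum(phred_to_error(q) for q in quals) / len(quals)


def corrected_similarity(raw: float, e1: float, e2: float) -> float:
    """Estimate original similarity from raw similarity.

    Assumed model (hypothetical, not the paper's equation): an originally
    matching position still matches only if neither read has an error
    there, so raw = original * (1 - e1) * (1 - e2).
    """
    return min(1.0, raw / ((1.0 - e1) * (1.0 - e2)))


if __name__ == "__main__":
    # 97% raw similarity, 0.2% error rate in each read.
    print(corrected_similarity(0.97, 0.002, 0.002))
```

Under this simplified model the corrected value is always at least the raw value, which matches the direction of the effect described in the abstract.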

Minimizer-space de Bruijn graphs

10.1101/2021.06.09.447586 ◽

2021 ◽

Author(s):

Barış Ekim ◽

Bonnie Berger ◽

Rayan Chikhi

Keyword(s):

Human Genome ◽

DNA Sequences ◽

Graphical Representation ◽

Error Rates ◽

Sequencing Error ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Human Genome Assembly ◽

Long Read ◽

Metagenome Assembly

DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where the minimizers rather than DNA nucleotides are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers, which are k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three use cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low-memory and highly contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we newly allow a graphical representation of a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.
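
The projection into minimizer space can be sketched in a few lines. The sketch below is a toy illustration, not the rust-mdbg implementation: the selection rule (keep an l-mer when its hash falls below a density threshold), the BLAKE2 hash, and every function name are assumptions chosen for illustration, since the abstract does not specify them.

```python
import hashlib


def lmer_hash(lmer: str) -> int:
    """Deterministic 64-bit hash of an l-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(lmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def select_minimizers(seq: str, l: int, density: float) -> list[str]:
    """Project a sequence into an ordered list of minimizer tokens.

    Assumed rule: an l-mer is kept when its hash is below density * 2^64,
    so roughly a `density` fraction of positions is selected.
    """
    threshold = int(density * 2**64)
    return [
        seq[i:i + l]
        for i in range(len(seq) - l + 1)
        if lmer_hash(seq[i:i + l]) < threshold
    ]


def k_min_mers(minimizers: list[str], k: int) -> list[tuple[str, ...]]:
    """Enumerate k consecutive minimizers: k-mers over the minimizer alphabet."""
    return [tuple(minimizers[i:i + k]) for i in range(len(minimizers) - k + 1)]


if __name__ == "__main__":
    tokens = select_minimizers("ACGTACGTTGCA", l=3, density=0.5)
    print(k_min_mers(tokens, k=2))
```

Each tuple would then serve as a node in a de Bruijn graph, in the way nucleotide k-mers do in a conventional one; the graph is smaller because only a fraction of positions contribute tokens.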

A Vector Representation of DNA Sequences Using Locality Sensitive Hashing

10.1101/726729 ◽

2019 ◽

Cited By ~ 1

Author(s):

Lizhen Shi ◽

Bo Chen

Keyword(s):

Natural Language ◽

Language Processing ◽

DNA Sequences ◽

Genomic Sequence ◽

Sequence Data ◽

Error Rates ◽

Locality Sensitive Hashing ◽

Alternative Methods ◽

Sequencing Error ◽

Training Time

Abstract Drawing on the analogy between natural language and "genomic sequence language", we explored the applicability of word embeddings from natural language processing (NLP) to representing DNA reads in metagenomics studies. Here, a k-mer is the equivalent of a word in NLP, and k-mers have been widely used in analyzing sequence data. However, directly replacing word embeddings with k-mer embeddings is problematic for two reasons. First, the number of k-mers is many times the number of words in NLP, making the model too big to be useful. Second, sequencing errors create many rare k-mers (noise), making the model hard to train. In this work, we leverage Locality Sensitive Hashing (LSH) to overcome these challenges. We then adopt the skip-gram with negative sampling model to learn k-mer embeddings. Experiments on labeled metagenomic datasets demonstrated that LSH can not only reduce training time and the memory required to store the model, but also achieve higher accuracy than alternative methods. Finally, we demonstrate that the trained low-dimensional k-mer embeddings can potentially be used for accurate metagenomic read clustering and taxonomy prediction, and that this method is robust to reads with high sequencing error rates (12-22%).
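
To make the role of LSH concrete, the sketch below maps k-mers to a fixed number of buckets so that k-mers differing at few positions tend to share a bucket. It uses one common LSH family, random hyperplanes over a one-hot encoding; the abstract does not say which scheme the authors used, so the scheme, the parameters, and the function names here are all assumptions.

```python
import random

BASES = "ACGT"


def one_hot(kmer: str) -> list[float]:
    """Encode a k-mer as a 4k-dimensional one-hot vector."""
    vec: list[float] = []
    for base in kmer:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def make_hyperplanes(k: int, n_bits: int, seed: int = 0) -> list[list[float]]:
    """Draw n_bits random Gaussian hyperplanes in 4k dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(4 * k)] for _ in range(n_bits)]


def lsh_bucket(kmer: str, hyperplanes: list[list[float]]) -> int:
    """Bucket id: one bit per hyperplane, set by the sign of the dot product."""
    vec = one_hot(kmer)
    bucket = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bucket = (bucket << 1) | (1 if dot >= 0 else 0)
    return bucket


if __name__ == "__main__":
    planes = make_hyperplanes(k=8, n_bits=10)
    for kmer in ("ACGTACGT", "ACGTACGA", "TTTTGGGG"):
        print(kmer, lsh_bucket(kmer, planes))
```

With `n_bits` hyperplanes the vocabulary shrinks from up to 4^k distinct k-mers to at most 2^n_bits buckets, and the bucket ids rather than raw k-mers would be the tokens fed to an embedding model such as skip-gram.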

From White to Black, from Darkness to Light: Species Delimitation and UNITE Species Hypothesis Testing in the Russula Albonigra Species Complex.

10.21203/rs.3.rs-118250/v1 ◽

2020 ◽

Author(s):

Ruben De Lange ◽

Slavomír Adamčík ◽

Katarína Adamčíkova ◽

Pieter Asselman ◽

Jan Borovička ◽

...

Keyword(s):

Species Delimitation ◽

Species Complex ◽

Sequence Data ◽

Sequence Similarity ◽

Perfect Match ◽

Distribution Area ◽

Traditional Concept ◽

European Species ◽

Correct Sequence ◽

Similarity Thresholds

Abstract Russula albonigra is considered a well-known species, morphologically delimited by the context of the basidiomata that is blackening without intermediate reddening, and the menthol-cooling taste of the lamellae. It is supposed to have a broad ecological amplitude and a large distribution area. A thorough molecular analysis based on four nuclear markers (ITS, LSU, RPB2 and TEF1-α) shows this traditional concept of R. albonigra s.l. represents a species complex consisting of at least five European, three North-American and one Chinese species. Morphological study shows traditional characters used to delimit R. albonigra are not always reliable. Therefore, a new delimitation of the R. albonigra lineage is proposed and a key to the described European species of R. subg. Compactae is presented. A lectotype and an epitype are designated for R. albonigra and three new European species are described: R. ambusta, R. nigrifacta and R. ustulata. UNITE species hypotheses at different thresholds were tested against the taxonomic data. The species hypotheses at the similarity threshold 0.5% give a perfect match to the phylogenetically defined species within the R. albonigra lineage. Publicly available sequence data can contribute to species delimitation and expand knowledge on ecology and distribution, but the pitfalls are short and low quality sequences. The importance of updating public taxonomic data and using correct sequence similarity thresholds is emphasised.

Long-read sequencing technology indicates genome-wide effects of non-B DNA on polymerization speed and error rate

10.1101/237461 ◽

2017 ◽

Author(s):

Wilfried M. Guiblet ◽

Marzia A. Cremona ◽

Monika Cechova ◽

Robert S. Harris ◽

Iva Kejnovska ◽

...

Keyword(s):

Human Genome ◽

Single Molecule ◽

Tandem Repeats ◽

Neurological Diseases ◽

Error Rates ◽

Polymerization Kinetics ◽

Sequencing Error ◽

DNA Polymerization ◽

Sequencing Errors ◽

Genome Wide

Abstract DNA conformation may deviate from the classical B-form in ~13% of the human genome. Non-B DNA regulates many cellular processes; however, its effects on DNA polymerization speed and accuracy have not been investigated genome-wide. Such an inquiry is critical for understanding neurological diseases and cancer genome instability. Here we present the first simultaneous examination of DNA polymerization kinetics and errors in the human genome sequenced with Single-Molecule-Real-Time technology. We show that polymerization speed differs between non-B and B-DNA: it decelerates at G-quadruplexes and fluctuates periodically at disease-causing tandem repeats. Analyzing polymerization kinetics profiles, we predict and validate experimentally non-B DNA formation for a novel motif. We demonstrate that several non-B motifs affect sequencing errors (e.g., G-quadruplexes increase error rates) and that sequencing errors are positively associated with polymerase slowdown. Finally, we show that highly divergent G4 motifs have pronounced polymerization slowdown and high sequencing error rates, suggesting similar mechanisms for sequencing errors and germline mutations.

PB-Motif—A Method for Identifying Gene/Pseudogene Rearrangements With Long Reads: An Application to CYP21A2 Genotyping

Frontiers in Genetics ◽

10.3389/fgene.2021.716586 ◽

2021 ◽

Vol 12 ◽

Author(s):

Zachary Stephens ◽

Dragana Milosevic ◽

Benjamin Kipp ◽

Stefan Grebe ◽

Ravishankar K. Iyer ◽

...

Keyword(s):

Phase Variation ◽

Variant Calling ◽

Error Rates ◽

Clinical Samples ◽

Sequencing Error ◽

Carrier Status ◽

Sequencing Errors ◽

Sequencing Technologies ◽

Long Reads ◽

Genomic Regions

Long-read sequencing technologies have the potential to accurately detect and phase variation in genomic regions that are difficult to fully characterize with conventional short-read methods. These difficult-to-sequence regions include several clinically relevant genes with highly homologous pseudogenes, many of which are prone to gene conversions or other types of complex structural rearrangements. We present PB-Motif, a new method for identifying rearrangements between two highly homologous genomic regions using PacBio long reads. PB-Motif leverages clustering and filtering techniques to efficiently report rearrangements in the presence of sequencing errors and other systematic artifacts. Supporting reads for each high-confidence rearrangement can then be used for copy number estimation and phased variant calling. First, we demonstrate PB-Motif's accuracy with simulated sequence rearrangements of PMS2 and its pseudogene PMS2CL using simulated reads sweeping over a range of sequencing error rates. We then apply PB-Motif to 26 clinical samples, characterizing CYP21A2 and its pseudogene CYP21A1P as part of a diagnostic assay for congenital adrenal hyperplasia. We successfully identify damaging variation and patient carrier status concordant with clinical diagnosis obtained from multiplex ligation-dependent amplification (MLPA) and Sanger sequencing. The source code is available at: github.com/zstephens/pb-motif.

Inferring an Original Sequence from Erroneous Copies: Two Approaches

Asia-Pacific Biotech News ◽

10.1142/s0219030303000284 ◽

2003 ◽

Vol 07 (03) ◽

pp. 107-114 ◽

Cited By ~ 3

Author(s):

Jonathan M. Keith ◽

Peter Adams ◽

Darryn Bryant ◽

Keith R. Mitchelson ◽

Duncan A. E. Cochran ◽

...

Keyword(s):

Sequence Alignment ◽

DNA Sequences ◽

Multiple Sequence Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Original Sequence ◽

Multiple Sequence Alignments ◽

Sequencing Errors ◽

The Cost ◽

New Algorithms

This paper considers the problem of inferring an original sequence from a number of erroneous copies. The problem arises in DNA sequencing, particularly in the context of emerging technologies that provide high throughput or other advantages at the cost of an increased number of errors. We describe and compare two approaches that have recently been developed by the authors. The first approach searches for a sequence known as a Steiner string; the second searches for the most probable original sequence with respect to a simple Bayesian model of sequencing errors. We present the results of extensive tests in which erroneous copies of real DNA sequences were simulated and the algorithms were used to infer the original sequences. The results are used to compare the two approaches to each other and to a third, more conventional, approach based on multiple sequence alignment. We find that the Bayesian approach is superior to the Steiner approach, which in turn is superior to the alignment approach. The two new algorithms can also be used to construct multiple sequence alignments. We show that the two methods produce alignments of approximately equal quality, and conclude that the Steiner approach is better for this purpose because it is faster. Both methods produce better alignments than a well-known multiple sequence alignment package, for the cases tested.

Ribosomal DNA Sequences of Bifidobacteria: Implications for Sequence-based Identification of the Human Colonic Flora

Microbial Ecology in Health and Disease ◽

10.3402/mehd.v6i1.8088 ◽

1993 ◽

Vol 6 (1) ◽

Author(s):

R. Frothingham ◽

A. J. Duncan ◽

K. H. Wilson

Keyword(s):

Ribosomal DNA ◽

DNA Sequences ◽

Colonic Flora

Estimating sequencing error rates using families

BioData Mining ◽

10.1186/s13040-021-00259-6 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Kelley Paskov ◽

Jae-Yoon Jung ◽

Brianna Chrisman ◽

Nate T. Stockham ◽

Peter Washington ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Exome Sequencing ◽

Genome Sequencing ◽

Variant Calling ◽

Error Rates ◽

Sequencing Error ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Platform ◽

Whole Exome

Abstract Background: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. Results: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.
Conclusion: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
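
The observable signal this method starts from, a Mendelian error, can be illustrated with a small check on a mother-father-child trio. The sketch below only flags inconsistent genotypes; it does not implement the authors' estimation of precision and recall. It assumes autosomal, biallelic sites with genotypes coded as alternate-allele counts (0, 1 or 2), and the function names are hypothetical.

```python
def transmissible(genotype: int) -> set[int]:
    """Alleles (0 = ref, 1 = alt) that a parent with this alt count can pass on."""
    alleles = set()
    if genotype < 2:
        alleles.add(0)
    if genotype > 0:
        alleles.add(1)
    return alleles


def is_mendelian_error(mother: int, father: int, child: int) -> bool:
    """True if the child's genotype cannot arise from one allele per parent."""
    possible = {
        m + f
        for m in transmissible(mother)
        for f in transmissible(father)
    }
    return child not in possible


def mendelian_error_fraction(trios: list[tuple[int, int, int]]) -> float:
    """Fraction of (mother, father, child) sites that violate Mendelian inheritance."""
    errors = sum(is_mendelian_error(m, f, c) for m, f, c in trios)
    return errors / len(trios)


if __name__ == "__main__":
    sites = [(0, 0, 0), (0, 1, 1), (2, 0, 1), (0, 0, 1), (2, 2, 0)]
    print(mendelian_error_fraction(sites))
```

Only some sequencing errors surface this way (an erroneous call that still fits the parents' genotypes goes unnoticed), which is why the abstract speaks of observing a fraction of the errors.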

Meiotic Recombination Between Paralogous RBCSB Genes on Sister Chromatids of Arabidopsis thaliana

Genetics ◽

10.1093/genetics/166.2.947 ◽

2004 ◽

Vol 166 (2) ◽

pp. 947-957 ◽

Cited By ~ 1

Author(s):

John G Jelesko ◽

Kristy Carter ◽

Whitney Thompson ◽

Yuki Kinoshita ◽

Wilhelm Gruissem

Keyword(s):

Gene Cluster ◽

DNA Sequences ◽

Meiotic Recombination ◽

Sequence Similarity ◽

Specific Gene ◽

High Sequence Similarity ◽

Paralogous Genes ◽

Chimeric Genes ◽

Unequal Recombination ◽

Sister Chromatids

Abstract Paralogous genes organized as a gene cluster can rapidly evolve by recombination between misaligned paralogs during meiosis, leading to duplications, deletions, and novel chimeric genes. To model unequal recombination within a specific gene cluster, we utilized a synthetic RBCSB gene cluster to isolate recombinant chimeric genes resulting from meiotic recombination between paralogous genes on sister chromatids. Several F1 populations hemizygous for the synthRBCSB1 gene cluster gave rise to Luc+ F2 plants at frequencies ranging from 1 to 3 × 10⁻⁶. A nonuniform distribution of recombination resolution sites resulted in the biased formation of recombinant RBCS3B/1B::LUC genes with nonchimeric exons. The positioning of approximately half of the mapped resolution sites was effectively modeled by the fractional length of identical DNA sequences. In contrast, the other mapped resolution sites fit an alternative model in which recombination resolution was stimulated by an abrupt transition from a region of relatively high sequence similarity to a region of low sequence similarity. Thus, unequal recombination between paralogous RBCSB genes on sister chromatids created an allelic series of novel chimeric genes that effectively resulted in the diversification rather than the homogenization of the synthRBCSB1 gene cluster.

Phylogenetic Analysis of Trichaptum Based on Nuclear 18S, 5.8S and ITS Ribosomal DNA Sequences

Mycologia ◽

10.2307/3761129 ◽

1997 ◽

Vol 89 (5) ◽

pp. 727 ◽

Cited By ~ 11

Author(s):

Kwan S. Ko ◽

Soon G. Hong ◽

Hack S. Jung

Keyword(s):

Phylogenetic Analysis ◽

Ribosomal DNA ◽

DNA Sequences
