sequencing errors
Recently Published Documents


TOTAL DOCUMENTS: 286 (FIVE YEARS 124)
H-INDEX: 34 (FIVE YEARS 6)

2022 ◽  
Author(s):  
David Pellow ◽  
Abhinav Dutta ◽  
Ron Shamir

As sequencing datasets keep growing larger, time and memory efficiency of read mapping are becoming more critical. Many clever algorithms and data structures were used to develop mapping tools for next-generation sequencing, and in the last few years also for third-generation long reads. A key idea in mapping algorithms is to sketch sequences with their minimizers. Recently, syncmers were introduced as an alternative sketching method that is more robust to mutations and sequencing errors. Here we introduce parameterized syncmer schemes, and provide a theoretical analysis for multi-parameter schemes. By combining these schemes with downsampling or minimizers we can achieve any desired compression and window guarantee. We introduced syncmer schemes into the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long read data from a variety of genomes, the syncmer-based algorithms reduced unmapped reads by 20-60% at high compression while using less memory. The advantage of syncmer-based mapping was even more pronounced at lower sequence identity. At sequence identity of 65-75% and medium compression, syncmer mappers had 50-60% fewer unmapped reads, and ∼10% fewer of the reads that did map were incorrectly mapped. We conclude that syncmer schemes improve mapping under higher error and mutation rates. This situation happens, for example, when the high error rate of long reads is compounded by a high mutation rate in a cancer tumor, or due to differences between strains of viruses or bacteria.
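The difference between the two sketching ideas named in this abstract can be illustrated with a minimal sketch. It is not the code used in minimap2 or Winnowmap2, and it does not implement the parameterized schemes of the paper: it assumes plain lexicographic ordering (real mappers typically use a hash), leftmost tie-breaking, and the standard definitions of window minimizers and closed syncmers. The function names and the parameters `k`, `w`, and `s` are illustrative.

```python
def minimizers(seq, k, w):
    """Start positions of the smallest k-mer in every window of w consecutive k-mers.

    Assumption: lexicographic order, ties broken by leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # The choice depends on the neighbouring k-mers in the window.
        selected.add(min(window, key=lambda i: (kmers[i], i)))
    return sorted(selected)


def closed_syncmers(seq, k, s):
    """Start positions of k-mers whose smallest s-mer is their first or last s-mer.

    The choice depends only on the k-mer itself, not on its neighbours.
    """
    positions = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        smallest = smers.index(min(smers))  # leftmost smallest s-mer
        if smallest in (0, k - s):
            positions.append(i)
    return positions


if __name__ == "__main__":
    read = "ACGTACGT"
    print(minimizers(read, k=4, w=2))
    print(closed_syncmers(read, k=4, s=2))
```

Because a syncmer is selected from the content of the k-mer alone, a mutation outside that k-mer cannot change whether it is selected, whereas a minimizer can be displaced by a change anywhere in its window.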


2021 ◽  
Author(s):  
Marcos A. Caraballo-Ortiz ◽  
Sayaka Miura ◽  
Maxwell Sanderford ◽  
Tenzin Dolker ◽  
Qiqing Tao ◽  
...  

Motivation: Building reliable phylogenies from very large collections of sequences with a limited number of phylogenetically informative sites is challenging because sequencing errors and recurrent/backward mutations interfere with the phylogenetic signal, confounding true evolutionary relationships. Massive global efforts of sequencing genomes and reconstructing the phylogeny of SARS-CoV-2 strains exemplify these difficulties since there are only hundreds of phylogenetically informative sites and millions of genomes. For such datasets, we set out to develop a method for building the phylogenetic tree of genomic haplotypes consisting of positions harboring common variants to improve the signal-to-noise ratio for more accurate phylogenetic inference of resolvable phylogenetic features. Results: We present the TopHap approach that determines spatiotemporally common haplotypes of common variants and builds their phylogeny at a fraction of the computational time of traditional methods. To assess topological robustness, we develop a bootstrap resampling strategy that resamples genomes spatiotemporally. The application of TopHap to build a phylogeny of 68,057 genomes (68KG) produced an evolutionary tree of major SARS-CoV-2 haplotypes. This phylogeny is concordant with the mutation tree inferred using the co-occurrence pattern of mutations and recovers key phylogenetic relationships from more traditional analyses. We also evaluated alternative roots of the SARS-CoV-2 phylogeny and found that the earliest sampled genomes in 2019 likely evolved by four mutations of the most recent common ancestor of all SARS-CoV-2 genomes. An application of TopHap to more than 1 million genomes reconstructed the most comprehensive evolutionary relationships of major variants, which confirmed the 68KG phylogeny and provided evolutionary origins of major variants of concern. Availability: TopHap is available on the web at https://github.com/SayakaMiura/TopHap.


2021 ◽  
Vol 1 (1) ◽  
Author(s):  
Pengfan Zhang ◽  
Stijn Spaepen ◽  
Yang Bai ◽  
Stephane Hacquard ◽  
Ruben Garrido-Oter

Synthetic microbial communities (SynComs) constitute an emerging and powerful tool in biological, biomedical, and biotechnological research. Despite recent advances in algorithms for the analysis of culture-independent amplicon sequencing data from microbial communities, there is a lack of tools specifically designed for analyzing SynCom data, where reference sequences for each strain are available. Here we present Rbec, a tool designed for the analysis of SynCom data that accurately corrects PCR and sequencing errors in amplicon sequences and identifies intra-strain polymorphic variation. Extensive evaluation using mock bacterial and fungal communities shows that our tool outperforms current methods for samples of varying complexity, diversity, and sequencing depth. Furthermore, Rbec also allows accurate detection of contaminants in SynCom experiments.


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Masachika Ikegami ◽  
Shinji Kohsaka ◽  
Takeshi Hirose ◽  
Toshihide Ueno ◽  
Satoshi Inoue ◽  
...  

The clinical sequencing of tumors is usually performed on formalin-fixed, paraffin-embedded samples and results in many sequencing errors. We identified that most of these errors are detected in chimeric reads caused by single-strand DNA molecules with microhomology. During the end-repair step of library preparation, mutations are introduced by the mis-annealing of two single-strand DNA molecules comprising homologous sequences. The mutated bases are distributed unevenly near the ends in the individual reads. Our filtering pipeline, MicroSEC, focuses on the uneven distribution of mutations in each read and removes the sequencing errors in formalin-fixed, paraffin-embedded samples without over-eliminating the mutations that are also detected in fresh frozen samples. Amplicon-based sequencing using 97 mutations confirmed that the sensitivity and specificity of MicroSEC were 97% (95% confidence interval: 82–100%) and 96% (95% confidence interval: 88–99%), respectively. Our pipeline will increase the reliability of clinical sequencing and advance cancer research using formalin-fixed, paraffin-embedded samples.


2021 ◽  
Author(s):  
Sumit Tarafder ◽  
Mazharul Islam ◽  
Swakkhar Shatabda ◽  
Atif Rahman

Motivation: Advances in sequencing technologies have led to sequencing of genomes of a multitude of organisms. However, draft genomes of many of these organisms contain a large number of gaps due to repeats in genomes, low sequencing coverage and limitations in sequencing technologies. Although there exist several tools for filling gaps, many of these do not utilize all information relevant to gap filling. Results: Here, we present a probabilistic method for filling gaps in draft genome assemblies using second-generation reads based on a generative model for sequencing that takes into account information on insert sizes and sequencing errors. Our method is based on the expectation-maximization (EM) algorithm, unlike the graph-based methods adopted in the literature. Experiments on real biological datasets show that this novel approach can fill up large portions of gaps with a small number of errors and misassemblies compared to other state-of-the-art gap filling tools. Availability and Implementation: The method is implemented using C++ in a software named "Filling Gaps by Iterative Read Distribution (Figbird)", which is available at: https://github.com/SumitTarafder/Figbird.


2021 ◽  
Author(s):  
Peter W Schafran ◽  
Fay-Wei W Li ◽  
Carl Rothfels

Inferring the true biological sequences from amplicon mixtures remains a difficult bioinformatic problem. The traditional approach is to cluster sequencing reads by similarity thresholds and treat the consensus sequence of each cluster as an "operational taxonomic unit" (OTU). Recently, this approach has been improved upon by model-based methods that correct PCR and sequencing errors in order to infer "amplicon sequence variants" (ASVs). To date, ASV approaches have been used primarily in metagenomics, but they are also useful for identifying allelic or paralogous variants and for determining homeologs in polyploid organisms. To facilitate the usage of ASV methods among polyploidy researchers, we incorporated ASV inference alongside OTU clustering in PURC v2.0, a major update to PURC (Pipeline for Untangling Reticulate Complexes). In addition to preserving original PURC functions, PURC v2.0 allows users to process PacBio CCS/HiFi reads through DADA2 to generate and annotate ASVs for multiplexed data, with outputs including separate alignments for each locus ready for phylogenetic inference. In addition, PURC v2.0 features faster demultiplexing than the original version and has been updated to be compatible with Python 3. In this chapter we present results indicating that PURC v2.0 (using the ASV approach) is more likely to infer the correct biological sequences in comparison to the earlier OTU-based PURC, and describe how to prepare sequencing data, run PURC v2.0 under several different modes, and interpret the output. We expect that PURC v2.0 will provide biologists with a method for generating multi-locus "moderate data" datasets that are large enough to be phylogenetically informative and small enough for manual curation.


2021 ◽  
Author(s):  
Hagay Enav ◽  
Ruth E. Ley

In the human gut microbiome, specific strains emerge due to within-host evolution and can occasionally be transferred to or from other hosts. Phenotypic variance among such strains can have implications for strain transmission and interaction with the host. Surveilling strains of the same species, within and between individuals, can further our knowledge about the way in which microbial diversity is generated and maintained in host populations. Existing methods to estimate the biological relatedness of similar strains usually rely on either detection of single nucleotide polymorphisms (SNP), which may include sequencing errors, or on the analysis of pangenomes, which can be limited by the requirement for extensive gene databases. To complement existing methods, we developed SynTracker. This strain-comparison tool is based on synteny comparisons between strains, or the comparison of the arrangement of sequence blocks in two homologous genomic regions in pairs of metagenomic assemblies or genomes. Our method is executed in a species-specific manner, has a low sensitivity to SNPs, does not require a pre-existing database, and can correctly resolve strains using complete or draft genomes and metagenomic samples using <5% of the genome length. When applied to metagenomic datasets, we detected person-specific strains with an average sensitivity of 97% and specificity of 99%, and strain-sharing events in mother-infant pairs. SynTracker can be used to study the population structure of specific microbial species between and within environments, to identify evolutionary trajectories in longitudinal datasets, and to further understanding of strain sharing networks.


2021 ◽  
Author(s):  
Miguel Mendez Sandin ◽  
Sarah Romac ◽  
Fabrice Not

Ribosomal DNA (rDNA) genes are known to be valuable markers for the barcoding of eukaryotic life and its phylogenetic classification at various taxonomic levels. The large-scale exploration of environmental microbial diversity through metabarcoding approaches has been focused mainly on the hypervariable regions V4 and V9 of the 18S rDNA gene. Yet, the accurate interpretation of such environmental surveys is hampered by technical (e.g., PCR and sequencing errors) and biological biases (e.g., intra-genomic variability). Here we explored the intra-genomic diversity of Nassellaria and Spumellaria specimens (Radiolaria) by comparing Sanger sequencing with two different high-throughput sequencing platforms: Illumina and Oxford Nanopore Technologies (MinION). Our analysis determined that intra-genomic variability of Nassellaria and Spumellaria is generally low, yet in some Spumellaria specimens we found two different copies of the V4 with a similarity lower than 97%. Of the different sequencing methods, Illumina showed the highest number of contaminations (i.e., environmental DNA, cross-contamination, tag-jumping), revealed by its high sequencing depth, and MinION showed the highest sequencing error rate (~14%). Yet the long reads produced by MinION (~2900 bp) allowed accurate phylogenetic reconstruction studies. These results highlight the requirement for a careful interpretation of Illumina-based metabarcoding studies, in particular regarding low-abundance amplicons, and open future perspectives towards full environmental rDNA metabarcoding surveys.


NAR Cancer ◽  
2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Paul Little ◽  
Heejoon Jo ◽  
Alan Hoyle ◽  
Angela Mazul ◽  
Xiaobei Zhao ◽  
...  

Despite years of progress, mutation detection in cancer samples continues to require significant manual review as a final step. Expert review is particularly challenging in cases where tumors are sequenced without matched normal control DNA. Attempts have been made to call somatic point mutations without a matched normal sample by removing well-known germline variants, utilizing unmatched normal controls, and constructing decision rules to classify sequencing errors and private germline variants. With budgetary constraints related to computational and sequencing costs, finding the appropriate number of controls is a crucial step to identifying somatic variants. Our approach utilizes public databases for canonical somatic variants as well as germline variants and leverages information gathered about nearby positions in the normal controls. Our cohort of targeted capture panel sequencing of tumor and normal samples, with varying tumor types and demographics, served as a benchmark for our tumor-only variant calling pipeline, letting us observe the relationship between our ability to correctly classify variants and the number of unmatched normals. With our benchmarked samples, approximately ten normal controls were needed to maintain 94% sensitivity, 99% specificity and 76% positive predictive value, far outperforming comparable methods. Our approach, called UNMASC, also serves as a supplement to traditional tumor with matched normal variant calling workflows and can potentially extend to other concerns arising from analyzing next generation sequencing data.
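For readers less familiar with the three benchmark metrics quoted in this abstract, the following minimal sketch shows how sensitivity, specificity, and positive predictive value are computed from confusion-matrix counts. The function name and the counts in the usage example are invented purely to exercise the formulas; they are not taken from the UNMASC study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    return {
        # Fraction of true somatic variants that were called.
        "sensitivity": tp / (tp + fn),
        # Fraction of non-somatic candidates that were correctly rejected.
        "specificity": tn / (tn + fp),
        # Fraction of called variants that are truly somatic.
        "ppv": tp / (tp + fp),
    }


if __name__ == "__main__":
    # Toy counts, chosen only for illustration.
    print(classification_metrics(tp=8, fp=2, tn=88, fn=2))
```

Positive predictive value depends on how rare true positives are among all candidates, which is why it can be much lower than specificity when most candidate sites are not somatic.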


PLoS ONE ◽  
2021 ◽  
Vol 16 (10) ◽  
pp. e0257521
Author(s):  
Clara Delahaye ◽  
Jacques Nicolas

Oxford Nanopore Technologies’ (ONT) long read sequencers offer access to longer DNA fragments than previous sequencer generations, at the cost of a higher error rate. While many papers have studied read correction methods, few have addressed the detailed characterization of observed errors, a task complicated by frequent changes in chemistry and software in ONT technology. The MinION sequencer is now more stable and this paper proposes an up-to-date view of its error landscape, using the most mature flowcell and basecaller. We studied Nanopore sequencing error biases on both bacterial and human DNA reads. We found that, although Nanopore sequencing is expected not to suffer from GC bias, it is a crucial parameter with respect to errors. In particular, low-GC reads have fewer errors than high-GC reads (about 6% and 8% respectively). The error profile for homopolymeric regions or regions with short repeats, the source of about half of all sequencing errors, also depends on the GC rate and mainly shows deletions, although there are some reads with long insertions. Another interesting finding is that the quality measure, although over-estimated, offers valuable information to predict the error rate as well as the abundance of reads. We supplemented this study with an analysis of a rapeseed RNA read set and showed a higher level of errors with a higher level of deletion in these data. Finally, we have implemented an open source pipeline for long-term monitoring of the error profile, which enables users to easily compute various analyses presented in this work, including for future developments of the sequencing device. Overall, we hope this work will provide a basis for the design of better error-correction methods.
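The comparison of error rates between low-GC and high-GC reads can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes each read already comes with an error count (for example from an alignment), and the 0.5 GC threshold, the function names, and the toy reads are all hypothetical.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def error_rate_by_gc(reads, threshold=0.5):
    """Pooled per-base error rate for low-GC and high-GC reads.

    reads: iterable of (sequence, n_errors) pairs.
    threshold: assumed cut-off separating "low" from "high" GC content.
    """
    errors = defaultdict(int)
    bases = defaultdict(int)
    for seq, n_errors in reads:
        if not seq:
            continue  # skip empty reads to avoid dividing by zero
        label = "low" if gc_fraction(seq) < threshold else "high"
        errors[label] += n_errors
        bases[label] += len(seq)
    return {label: errors[label] / bases[label] for label in bases}


if __name__ == "__main__":
    # Toy reads with made-up error counts.
    toy = [("ATATATATAT", 1), ("GCGCGCGCGC", 2), ("ATGCATATAA", 0)]
    print(error_rate_by_gc(toy))
```

Pooling errors over bases rather than averaging per-read rates keeps long reads from being under-weighted, which matters for long-read data with highly variable read lengths.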

