AmpliCI: a high-resolution model-based approach for denoising Illumina amplicon data

Author(s):  
Xiyu Peng ◽  
Karin S Dorman

Abstract Motivation Next-generation amplicon sequencing is a powerful tool for investigating microbial communities. A main challenge is to distinguish true biological variants from errors caused by amplification and sequencing. In traditional analyses, such errors are eliminated by clustering reads within a sequence similarity threshold, usually 97%, and constructing operational taxonomic units, but the arbitrary threshold leads to low resolution and high false-positive rates. Recently developed ‘denoising’ methods have proven able to resolve single-nucleotide amplicon variants, but they still miss low-frequency sequences, especially those near more frequent sequences, because they ignore the sequencing quality information. Results We introduce AmpliCI, a reference-free, model-based method for rapidly resolving the number, abundance and identity of error-free sequences in massive Illumina amplicon datasets. AmpliCI considers the quality information and allows the data, not an arbitrary threshold or an external database, to drive conclusions. AmpliCI estimates a finite mixture model, using a greedy strategy to gradually select error-free sequences and approximately maximize the likelihood. AmpliCI has better performance than three popular denoising methods, with acceptable computation time and memory usage. Availability and implementation Source code is available at https://github.com/DormanLab/AmpliCI. Supplementary information Supplementary materials are available at Bioinformatics online.
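
To illustrate why quality scores carry information that abundance alone does not, here is a minimal sketch. It is not AmpliCI's actual model: it only converts Phred scores to error probabilities with the standard formula p = 10^(-Q/10) and scores a read against a candidate error-free sequence, assuming independent positions and errors spread uniformly over the three other bases. The function names are hypothetical.

```python
import math
from typing import Sequence


def phred_to_error_prob(q: int) -> float:
    """Standard Phred definition: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def read_log_likelihood(read: str, quals: Sequence[int], candidate: str) -> float:
    """Log-likelihood of `read` given that `candidate` is the true sequence.

    Assumptions (illustrative only): positions are independent and an
    error is equally likely to produce any of the three other bases.
    """
    if not (len(read) == len(quals) == len(candidate)):
        raise ValueError("read, quals and candidate must have equal length")
    total = 0.0
    for base, q, true_base in zip(read, quals, candidate):
        # Cap at 0.75 (the error rate of a random base) so log(1 - p) is defined.
        p = min(phred_to_error_prob(q), 0.75)
        total += math.log(1 - p) if base == true_base else math.log(p / 3)
    return total
```

Under this toy model, a mismatch at a high-quality position lowers the likelihood far more than the same mismatch at a low-quality position, which is the kind of evidence that can help separate a rare true variant from a sequencing error.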

2020 ◽  
Author(s):  
Xiyu Peng ◽  
Karin Dorman

Abstract Motivation Next-generation amplicon sequencing is a powerful tool for investigating microbial communities. One main challenge is to distinguish true biological variants from errors caused by PCR and sequencing. In the traditional analysis pipeline, such errors are eliminated by clustering reads within a sequence similarity threshold, usually 97%, and constructing operational taxonomic units, but the arbitrary threshold leads to low resolution and high false positive rates. Recently developed “denoising” methods have proven able to resolve single-nucleotide amplicon variants, but they still miss low frequency sequences, especially those near abundant variants, because they ignore the sequencing quality information. Results We introduce AmpliCI, a reference-free, model-based method for rapidly resolving the number, abundance and identity of error-free sequences in massive Illumina amplicon datasets. AmpliCI takes into account quality information and allows the data, not an arbitrary threshold or an external database, to drive conclusions. AmpliCI estimates a finite mixture model, using a greedy strategy to gradually select error-free sequences and approximately maximize the likelihood. We show that AmpliCI is superior to three popular denoising methods, with acceptable computation time and memory usage. Availability Source code available at https://github.com/DormanLab/AmpliCI
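
The threshold-based clustering that the abstract contrasts with denoising can be sketched as follows. This is a simplified illustration, assuming pre-aligned reads of equal length and a greedy first-match rule; real OTU tools use pairwise alignment and abundance sorting, and the function names here are hypothetical.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clusters(reads: List[str], threshold: float = 0.97) -> Dict[str, List[str]]:
    """Assign each read to the first centroid it matches at >= threshold.

    A read that matches no existing centroid becomes a new centroid.
    """
    clusters: Dict[str, List[str]] = {}
    for read in reads:
        for centroid in clusters:
            if identity(read, centroid) >= threshold:
                clusters[centroid].append(read)
                break
        else:
            clusters[read] = [read]
    return clusters
```

With 100-base reads, a sequence that differs from a centroid at two positions (98% identity) is absorbed into that centroid's cluster, so a genuine close variant is indistinguishable from an error. This is the loss of resolution the abstract describes.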


2020 ◽  
Vol 36 (20) ◽  
pp. 4991-4999
Author(s):  
Matej Lexa ◽  
Pavel Jedlicka ◽  
Ivan Vanat ◽  
Michal Cervenansky ◽  
Eduard Kejnovsky

Abstract Motivation Transposable elements (TEs) in eukaryotes often get inserted into one another, forming sequences that become a complex mixture of full-length elements and their fragments. The reconstruction of full-length elements and the order in which they have been inserted is important for genome and transposon evolution studies. However, the accumulation of mutations and genome rearrangements over evolutionary time makes this process error-prone and decreases the efficiency of software aiming to recover all nested full-length TEs. Results We created software that uses a greedy recursive algorithm to mine increasingly fragmented copies of full-length LTR retrotransposons in assembled genomes and other sequence data. The software, called TE-greedy-nester, considers not only sequence similarity but also the structure of elements. This new tool was tested on a set of natural and synthetic sequences and its accuracy was compared to similar software. We found TE-greedy-nester to be superior in a number of parameters, namely computation time and full-length TE recovery in highly nested regions. Availability and implementation http://gitlab.fi.muni.cz/lexa/nested. Supplementary information Supplementary data are available at Bioinformatics online.


Biology ◽  
2021 ◽  
Vol 10 (7) ◽  
pp. 569
Author(s):  
Chakriya Sansupa ◽  
Sara Fareed Mohamed Wahdan ◽  
Terd Disayathanoowat ◽  
Witoon Purahong

This study aims to estimate the proportion and diversity of soil bacteria derived from eDNA-based and culture-based methods. Specifically, we used Illumina MiSeq to sequence and characterize the bacterial communities from (i) DNA extracted directly from forest soil and (ii) DNA extracted from a mixture of bacterial colonies obtained by enrichment cultures on agar plates of the same forest soil samples. The amplicon sequencing of enrichment cultures allowed us to rapidly screen a culturable community in an environmental sample. In comparison with an eDNA community (based on a 97% sequence similarity threshold), we demonstrated that enrichment cultures could capture both rare and abundant bacterial taxa in forest soil samples. Enrichment culture and eDNA communities shared 2% of the OTUs detected in the total community, whereas 88% of the enrichment culture community (15% of the total community) could not be detected by eDNA. The enrichment culture-based methods observed 17% of the bacteria in the total community. FAPROTAX functional prediction showed that the rare and unique taxa, which were detected with the enrichment cultures, have the potential to perform important functions in soil systems. We suggest that enrichment culture-based amplicon sequencing could be a beneficial approach to evaluate a cultured bacterial community. Combining this approach with the eDNA method could provide more comprehensive information on a bacterial community. We expect that more unique cultured taxa could be detected if further studies used both selective and non-selective culture media to enrich bacteria at the first step.


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i857-i865
Author(s):  
Derrick Blakely ◽  
Eamon Collins ◽  
Ritambhara Singh ◽  
Andrew Norton ◽  
Jack Lanchantin ◽  
...  

Abstract Motivation Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size. Results In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Availability and implementation Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK Supplementary information Supplementary data are available at Bioinformatics online.
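
The idea of decomposing the kernel into independent counts, one per choice of positions, can be illustrated with a naive exact computation. This is a sketch under assumptions and not the FastSK implementation or its Monte Carlo approximation: it takes windows of length g, keeps k informative positions (so g - k positions are treated as mismatch or gap positions), and sums one independent dot product of counts per position set. The function name and parameter names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_kernel(x: str, y: str, g: int, k: int) -> int:
    """Naive gapped k-mer kernel between strings x and y.

    For every way of keeping k of the g window positions, count the
    resulting k-mers in each string and add the dot product of the
    two count vectors. Each position set is handled independently.
    """
    if not 0 < k <= g:
        raise ValueError("require 0 < k <= g")
    total = 0
    for kept in combinations(range(g), k):
        counts_x = Counter(
            tuple(x[i + j] for j in kept) for i in range(len(x) - g + 1)
        )
        counts_y = Counter(
            tuple(y[i + j] for j in kept) for i in range(len(y) - g + 1)
        )
        total += sum(n * counts_y[key] for key, n in counts_x.items())
    return total
```

Because the loop visits every one of the C(g, k) position sets, the cost of this exact version grows quickly with g. Sampling position sets instead of enumerating them is one way to read the approximation the abstract mentions, though the details of FastSK's estimator are not given here.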


Author(s):  
Damien Jacot ◽  
Trestan Pillonel ◽  
Gilbert Greub ◽  
Claire Bertelli

Although many laboratories worldwide have developed their sequencing capacities in response to the need for SARS-CoV-2 genome-based surveillance of variants, only a few have reported quality criteria to ensure sequence quality before lineage assignment and submission to public databases. Hence, we aimed here to provide simple quality control criteria for SARS-CoV-2 sequencing to prevent erroneous interpretation of low quality or contaminated data. We retrospectively investigated 647 SARS-CoV-2 genomes obtained over ten tiled-amplicon sequencing runs. We extracted 26 potentially relevant metrics covering the entire workflow from sample selection to bioinformatics analysis. Based on data distribution, critical values were established for eleven selected metrics to prompt further quality investigations for problematic samples, in particular those with a low viral RNA quantity. Low frequency variants (<70% of supporting reads) can result from PCR amplification errors, sample cross contaminations or the presence of distinct SARS-CoV-2 genomes in the sample sequenced. The number and the prevalence of low frequency variants can be used as a robust quality criterion to identify possible sequencing errors or contaminations. Overall, we propose eleven metrics with fixed cutoff values as a simple tool to evaluate the quality of SARS-CoV-2 genomes, including cycle thresholds, mean depth, the proportion of the genome covered at least 10x and the number of low frequency variants combined with mutation prevalence data.
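
Three of the metrics named above can be computed from per-position depths and per-variant allele frequencies, as in the following sketch. It assumes those two lists are already available from an upstream pipeline; the class and function names are hypothetical, and only the 10x depth and 70% frequency definitions come from the abstract. The study's critical values for flagging samples are not stated here, so none are applied.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class QCMetrics:
    mean_depth: float
    fraction_covered: float       # share of positions with depth >= min_depth
    low_frequency_variants: int   # variants supported by < low_freq_cutoff of reads


def summarize_sample(
    depths: Sequence[int],
    variant_frequencies: Sequence[float],
    min_depth: int = 10,
    low_freq_cutoff: float = 0.70,
) -> QCMetrics:
    """Compute simple per-sample QC metrics from depths and allele frequencies."""
    if not depths:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / len(depths)
    fraction_covered = sum(1 for d in depths if d >= min_depth) / len(depths)
    low_freq = sum(1 for f in variant_frequencies if f < low_freq_cutoff)
    return QCMetrics(mean_depth, fraction_covered, low_freq)
```

A laboratory would then compare these values against its own validated cutoffs before lineage assignment or database submission.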


2018 ◽  
Author(s):  
Estelle Couradeau ◽  
Joelle Sasse ◽  
Danielle Goudeau ◽  
Nandita Nath ◽  
Terry C. Hazen ◽  
...  

Abstract The ability to link soil microbial diversity to soil processes requires technologies that differentiate active subpopulations of microbes from so-called relic DNA and dormant cells. Measures of microbial activity based on various techniques including DNA labelling have suggested that most cells in soils are inactive, a finding that has been difficult to reconcile with observed high levels of bulk soil activities. We hypothesized that measures of in situ DNA synthesis may be missing the soil microbes that are metabolically active but not replicating, and we therefore applied BONCAT (Bioorthogonal Non-Canonical Amino Acid Tagging), a proxy for activity that does not rely on cell division, to measure translationally active cells in soils. We compared the active population of two soil depths from Oak Ridge (TN) incubated under the same conditions for up to seven days. Depending on the soil, a maximum of 25–70% of the cells were active, accounting for 3–4 million cells per gram of soil type, which is an order of magnitude higher than previous estimates. The BONCAT positive cell fraction was recovered by fluorescence activated cell sorting (FACS) and identified by 16S rDNA amplicon sequencing. The diversity of the active fraction was a selected subset of the bulk soil community. Excitingly, some of the same members of the community were recruited at both depths independently from their abundance rank. On average, 86% of sequence reads recovered from the active community shared >97% sequence similarity with cultured isolates from the field site. Our observations are in line with a recent report that, of the few taxa that are both abundant and ubiquitous in soil, 45% are also cultured – and indeed some of these ubiquitous microorganisms were found to be translationally active. The use of BONCAT on soil microbiomes provides evidence that a large portion of the soil microbes can be active simultaneously.
We conclude that BONCAT coupled to FACS and sequencing is effective for interrogating the active fraction of soil microbiomes in situ and provides new perspectives to link metabolic capacity to overall soil ecological traits and processes.


Author(s):  
Frédéric Lemoine ◽  
Luc Blassel ◽  
Jakub Voznica ◽  
Olivier Gascuel

Abstract Motivation The first cases of the COVID-19 pandemic emerged in December 2019. Until the end of February 2020, the number of available genomes was below 1,000, and their multiple alignment was easily achieved using standard approaches. Subsequently, the availability of genomes has grown dramatically. Moreover, some genomes are of low quality with sequencing/assembly errors, making accurate re-alignment of all genomes nearly impossible on a daily basis. A more efficient, yet accurate approach was clearly required to pursue all subsequent bioinformatics analyses of this crucial data. Results hCoV-19 genomes are highly conserved, with very few indels and no recombination. This makes the profile HMM approach particularly well suited to align new genomes, add them to an existing alignment and filter problematic ones. Using a core of ∼2,500 high quality genomes, we estimated a profile using HMMER, and implemented this profile in COVID-Align, a user-friendly interface to be used online or as standalone via Docker. The alignment of 1,000 genomes requires less than 20 min on our cluster. Moreover, COVID-Align provides summary statistics, which can be used to determine the sequencing quality and evolutionary novelty of input genomes (e.g. number of new mutations and indels). Availability https://covalign.pasteur.cloud, hub.docker.com/r/evolbioinfo/[email protected], [email protected] Supplementary information Supplementary information is available at Bioinformatics online.

