Assessing alignment-based taxonomic classification of ancient microbial DNA

10.7287/peerj.preprints.27166 ◽

2018 ◽

Author(s):

Raphael Eisenhofer ◽

Laura Susan Weyrich

Keyword(s):

Ancient DNA ◽

DNA Sequences ◽

Random Sequence ◽

Taxonomic Classification ◽

Metagenomic Data ◽

Data Sets ◽

Protein Alignments ◽

Microbial DNA ◽

DNA Characteristics

The field of paleomicrobiology—the study of ancient microorganisms—is rapidly growing due to recent methodological and technological advancements. It is now possible to obtain vast quantities of DNA data from ancient specimens in a high-throughput manner and use this information to investigate the dynamics and evolution of past microbial communities. However, we still know very little about how the characteristics of ancient DNA influence our ability to accurately assign microbial taxonomies (i.e. identify species) within ancient metagenomic samples. Here, we use both simulated and published metagenomic data sets to investigate how ancient DNA characteristics affect alignment-based taxonomic classification. We find that nucleotide-to-nucleotide, rather than nucleotide-to-protein, alignments are preferable when assigning taxonomies to DNA fragment lengths routinely identified within ancient specimens (<60 bp). We determine that deamination (a form of ancient DNA damage) and random sequence substitutions corresponding to ~100,000 years of genomic divergence minimally impact alignment-based classification. We also test four different reference databases and find that database choice can significantly bias the results of alignment-based taxonomic classification in ancient metagenomic studies. Finally, we perform a reanalysis of previously published ancient dental calculus data, increasing the number of microbial DNA sequences assigned taxonomically by an average of 64.2-fold and identifying microbial species previously unidentified in the original study. Overall, this study enhances our understanding of how ancient DNA characteristics influence alignment-based taxonomic classification of ancient microorganisms and provides recommendations for future paleomicrobiological studies.

Assessing alignment-based taxonomic classification of ancient microbial DNA

PeerJ ◽

10.7717/peerj.6594 ◽

2019 ◽

Vol 7 ◽

pp. e6594 ◽

Cited By ~ 5

Author(s):

Raphael Eisenhofer ◽

Laura Susan Weyrich

Keyword(s):

Ancient DNA ◽

DNA Sequences ◽

Random Sequence ◽

Taxonomic Classification ◽

Metagenomic Data ◽

Data Sets ◽

Protein Alignments ◽

Microbial DNA ◽

DNA Characteristics

The field of palaeomicrobiology—the study of ancient microorganisms—is rapidly growing due to recent methodological and technological advancements. It is now possible to obtain vast quantities of DNA data from ancient specimens in a high-throughput manner and use this information to investigate the dynamics and evolution of past microbial communities. However, we still know very little about how the characteristics of ancient DNA influence our ability to accurately assign microbial taxonomies (i.e. identify species) within ancient metagenomic samples. Here, we use both simulated and published metagenomic data sets to investigate how ancient DNA characteristics affect alignment-based taxonomic classification. We find that nucleotide-to-nucleotide, rather than nucleotide-to-protein, alignments are preferable when assigning taxonomies to short DNA fragment lengths routinely identified within ancient specimens (<60 bp). We determine that deamination (a form of ancient DNA damage) and random sequence substitutions corresponding to ∼100,000 years of genomic divergence minimally impact alignment-based classification. We also test four different reference databases and find that database choice can significantly bias the results of alignment-based taxonomic classification in ancient metagenomic studies. Finally, we perform a reanalysis of previously published ancient dental calculus data, increasing the number of microbial DNA sequences assigned taxonomically by an average of 64.2-fold and identifying microbial species previously unidentified in the original study. Overall, this study enhances our understanding of how ancient DNA characteristics influence alignment-based taxonomic classification of ancient microorganisms and provides recommendations for future palaeomicrobiological studies.
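The deamination test described in the abstract can be pictured with a small simulation. The sketch below is not the authors' pipeline: it assumes the commonly reported damage pattern (C-to-T changes concentrated at the 5' end, G-to-A at the 3' end), and the function names and the `rate_at_end` and `decay` parameters are hypothetical values chosen only for illustration.

```python
import random


def deaminate(read, rate_at_end=0.3, decay=0.5, rng=None):
    """Return a copy of `read` with simulated deamination damage.

    Assumption: the C->T probability decays geometrically with distance
    from the 5' end, and the G->A probability decays with distance from
    the 3' end. Both parameters are illustrative, not measured values.
    """
    rng = rng or random.Random()
    bases = list(read)
    n = len(bases)
    for i, base in enumerate(bases):
        p_five_prime = rate_at_end * decay ** i
        p_three_prime = rate_at_end * decay ** (n - 1 - i)
        if base == "C" and rng.random() < p_five_prime:
            bases[i] = "T"
        elif base == "G" and rng.random() < p_three_prime:
            bases[i] = "A"
    return "".join(bases)


def identity(a, b):
    """Fraction of positions at which two equal-length reads agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    original = "CCGTACGGTTCAGCATCGGATCCAGTTACGGC"
    damaged = deaminate(original, rng=random.Random(1))
    print(damaged, round(identity(original, damaged), 3))
```

Because the simulated damage is confined to a few terminal positions, most of a read still matches its reference, which is consistent with the abstract's finding that deamination has only a minimal effect on alignment-based classification.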

Bayesian Classification of Microbial Communities Based on 16S rRNA Metagenomic Data

10.1101/340653 ◽

2018 ◽

Cited By ~ 1

Author(s):

Arghavan Bahadorinejad ◽

Ivan Ivanov ◽

Johanna W Lampe ◽

Meredith AJ Hullar ◽

Robert S Chapkin ◽

...

Keyword(s):

16S rRNA ◽

Sample Size ◽

Microbial Communities ◽

State Of The Art ◽

Metagenomic Data ◽

Data Sets ◽

Sequencing Data ◽

Sample Data

Abstract: We propose a Bayesian method for the classification of 16S rRNA metagenomic profiles of bacterial abundance, by introducing a Poisson-Dirichlet-Multinomial hierarchical model for the sequencing data, constructing a prior distribution from sample data, calculating the posterior distribution in closed form, and deriving an Optimal Bayesian Classifier (OBC). The proposed algorithm is compared to state-of-the-art classification methods for 16S rRNA metagenomic data, including Random Forests and the phylogeny-based Metaphyl algorithm, for varying sample size, classification difficulty, and dimensionality (number of OTUs), using both synthetic and real metagenomic data sets. The results demonstrate that the proposed OBC method, with either noninformative or constructed priors, is competitive with or superior to the other methods. In particular, when the ratio of sample size to dimensionality is small, the proposed method can vastly outperform the others. Author summary: Recent studies have highlighted the interplay between host genetics, gut microbes, and colorectal tumor initiation/progression. The characterization of microbial communities using metagenomic profiling has therefore received renewed interest. In this paper, we propose a method for classification, i.e., prediction of different outcomes, based on 16S rRNA metagenomic data. The proposed method employs a Bayesian approach, which is suitable for data sets with a small ratio of the number of available instances to the dimensionality. Results using both synthetic and real metagenomic data show that the proposed method can outperform other state-of-the-art metagenomic classification algorithms.
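A closed-form Bayesian classifier over count data can be sketched as follows. This is a deliberate simplification, not the paper's OBC: it drops the Poisson layer, uses a plain Dirichlet-multinomial model with a symmetric prior and equal class priors, and all function names are hypothetical. Each class gets a Dirichlet posterior with parameters `alpha` (prior plus summed training counts), and a new OTU count vector `x` with total `N` is scored by the log marginal likelihood, up to a class-independent multinomial coefficient: log Γ(A) − log Γ(A + N) + Σ_k [log Γ(α_k + x_k) − log Γ(α_k)], where A = Σ_k α_k.

```python
from math import lgamma


def log_dirichlet_multinomial(x, alpha):
    """Log marginal likelihood of counts x under Dirichlet parameters alpha.

    The multinomial coefficient is omitted because it is identical for
    every class and does not change which class scores highest.
    """
    total_alpha = sum(alpha)
    total_count = sum(x)
    score = lgamma(total_alpha) - lgamma(total_alpha + total_count)
    for a, k in zip(alpha, x):
        score += lgamma(a + k) - lgamma(a)
    return score


def fit(samples_by_class, prior=1.0):
    """Posterior Dirichlet parameters per class: prior plus summed counts."""
    params = {}
    for label, samples in samples_by_class.items():
        dim = len(samples[0])
        params[label] = [prior + sum(s[j] for s in samples) for j in range(dim)]
    return params


def classify(x, params):
    """Pick the class with the highest marginal likelihood (equal class priors assumed)."""
    return max(params, key=lambda label: log_dirichlet_multinomial(x, params[label]))


if __name__ == "__main__":
    # Toy OTU count profiles for two hypothetical phenotypes.
    training = {
        "case": [[8, 1, 1], [9, 1, 0]],
        "control": [[1, 1, 8], [0, 2, 8]],
    }
    model = fit(training)
    print(classify([7, 2, 1], model))
```

With few training samples per class, the prior keeps every `alpha` entry strictly positive, which is one reason Bayesian approaches of this kind remain usable when the sample size is small relative to the number of OTUs.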

Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT

Genome Biology ◽

10.1186/s13059-019-1817-x ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 26

Author(s):

F. A. Bastiaan von Meijenfeldt ◽

Ksenia Arkhipova ◽

Diego D. Cambuy ◽

Felipe H. Coutinho ◽

Bas E. Dutilh

Keyword(s):

DNA Sequences ◽

De Novo ◽

Taxonomic Classification ◽

Classification Method ◽

Reference Database ◽

Annotation Tool ◽

Multiple Signals

Abstract: Current-day metagenomics analyses increasingly involve de novo taxonomic classification of long DNA sequences and metagenome-assembled genomes. Here, we show that the conventional best-hit approach often leads to classifications that are too specific, especially when the sequences represent novel deep lineages. We present a classification method that integrates multiple signals to classify sequences (Contig Annotation Tool, CAT) and metagenome-assembled genomes (Bin Annotation Tool, BAT). Classifications are automatically made at low taxonomic ranks if closely related organisms are present in the reference database and at higher ranks otherwise. The result is a high classification precision even for sequences from considerably unknown organisms.
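The contrast between a best-hit assignment and a rank-adaptive one can be illustrated with a generic lowest-common-ancestor rule. The sketch below is not the CAT or BAT implementation; the `score_fraction` cutoff, the function names, and the toy lineages are assumptions made for illustration. Hits scoring close to the top hit are kept, and the query is assigned to the deepest rank on which all kept hits agree.

```python
def lowest_common_ancestor(lineages):
    """Longest shared prefix of root-first lineages (lists of taxon names)."""
    if not lineages:
        return []
    shared = []
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            shared.append(taxa_at_rank[0])
        else:
            break
    return shared


def classify_by_hits(hits, score_fraction=0.9):
    """Assign a query to the LCA of all hits scoring near the top hit.

    hits: list of (score, lineage) pairs from any aligner.
    score_fraction: hypothetical cutoff; hits below this fraction of the
    best score are ignored.
    """
    if not hits:
        return []
    top = max(score for score, _ in hits)
    kept = [lineage for score, lineage in hits if score >= score_fraction * top]
    return lowest_common_ancestor(kept)


if __name__ == "__main__":
    hits = [
        (100, ["Bacteria", "Proteobacteria", "Escherichia"]),
        (95, ["Bacteria", "Proteobacteria", "Salmonella"]),
        (40, ["Bacteria", "Firmicutes", "Bacillus"]),
    ]
    print(classify_by_hits(hits))
```

A best-hit rule would report the single top-scoring genus here, whereas the LCA rule stops at the rank where the near-equal hits still agree, which is the kind of conservative behaviour the abstract describes for sequences from poorly represented lineages.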

Bagging Approach for Medical Plants Recognition Based on Their DNA Sequences

International Journal of Social Ecology and Sustainable Development ◽

10.4018/ijsesd.2018100103 ◽

2018 ◽

Vol 9 (4) ◽

pp. 45-60

Author(s):

Mohamed Elhadi Rahmani ◽

Abdelmalek Amine ◽

Reda Mohamed Hamou

Keyword(s):

DNA Sequences ◽

Majority Vote ◽

Data Sets ◽

Data Set ◽

Drug Production ◽

Medical Plants

Many drugs in modern medicine originate from plants, and the first step in drug production is the recognition of the plants needed for this purpose. This article presents a bagging approach for medical plant recognition based on DNA sequences. The authors developed a system that recognizes the DNA sequences of 14 medical plants. Rather than applying one algorithm to the full 14-class data set, they divided it into two-class sub-data sets and applied the same algorithm to each, reducing the 14-plant classification problem to a set of binary sub-problems. To construct the subsets, the authors extracted all possible pairs of the 14 classes, giving each class more chances to be predicted correctly. This approach also allows the similarity between the DNA sequences of one plant and those of every other plant to be studied. The authors report that accuracy nearly doubled (from 45% to almost 80%). A new sequence is classified by majority vote.
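The pairwise decomposition with majority voting can be sketched as below. With 14 classes, taking all pairs yields 14 × 13 / 2 = 91 binary sub-problems. The abstract does not name the base algorithm, so the k-mer overlap scorer used here is purely a hypothetical stand-in, as are the function names and toy sequences.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def overlap(a, b):
    """Number of k-mer occurrences shared by two profiles."""
    return sum((a & b).values())


def train_pairwise(train):
    """Build one two-class model per pair of labels.

    train: dict mapping label -> list of sequences.
    Each "model" is just the pair of summed k-mer profiles (a stand-in
    for whatever binary classifier is actually used).
    """
    profiles = {
        label: sum((kmer_counts(s) for s in seqs), Counter())
        for label, seqs in train.items()
    }
    return {
        (a, b): (profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


def predict(seq, models):
    """Let every pairwise model vote, then return the majority label."""
    query = kmer_counts(seq)
    votes = Counter()
    for (a, b), (profile_a, profile_b) in models.items():
        # Ties go to the first label of the pair in this toy scorer.
        winner = a if overlap(query, profile_a) >= overlap(query, profile_b) else b
        votes[winner] += 1
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    train = {
        "plantA": ["AAAAAAAAAA", "AAAAACAAAA"],
        "plantB": ["CCCCCCCCCC"],
        "plantC": ["GTGTGTGTGT"],
    }
    models = train_pairwise(train)
    print(len(models), predict("GTGTGT", models))
```

Each class takes part in 13 of the 91 pairwise contests, which is what the abstract means by giving every class more chances to be predicted correctly.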

metaxa2: improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data

Molecular Ecology Resources ◽

10.1111/1755-0998.12399 ◽

2015 ◽

Vol 15 (6) ◽

pp. 1403-1414 ◽

Cited By ~ 188

Author(s):

Johan Bengtsson-Palme ◽

Martin Hartmann ◽

Karl Martin Eriksson ◽

Chandan Pal ◽

Kaisa Thorell ◽

...

Keyword(s):

Large Subunit ◽

Taxonomic Classification ◽

Metagenomic Data ◽

Large Subunit rRNA

Contig annotation tool CAT robustly classifies assembled metagenomic contigs and long sequences

10.1101/072868 ◽

2016 ◽

Cited By ~ 13

Author(s):

Diego D. Cambuy ◽

Felipe H. Coutinho ◽

Bas E. Dutilh

Keyword(s):

Single Molecule ◽

DNA Sequences ◽

Taxonomic Classification ◽

Annotation Tool ◽

Single Molecule Sequencing ◽

Short Read ◽

Long Read ◽

Micro Organisms ◽

Taxonomic Annotation

Abstract: In modern-day metagenomics, there is an increasing need for robust taxonomic annotation of long DNA sequences from unknown micro-organisms. Long metagenomic sequences may be derived from assembly of short-read metagenomes, or from long-read single molecule sequencing. Here we introduce CAT, a pipeline for robust taxonomic classification of long DNA sequences. We show that CAT correctly classifies contigs at different taxonomic levels, even in simulated metagenomic datasets that are very distantly related to the sequences in the database. CAT is implemented in Python and the required scripts can be freely downloaded from GitHub.

TaxaTarget: Fast, Sensitive, and Precise Classification of Microeukaryotes in Metagenomic Data

10.21203/rs.3.rs-1186624/v1 ◽

2021 ◽

Author(s):

Seth Commichaux ◽

Kiran Javkar ◽

Harihara Subrahmaniam Muralidharan ◽

Padmini Ramachandran ◽

Andrea Ottesen ◽

...

Keyword(s):

State Of The Art ◽

Marker Gene ◽

Taxonomic Classification ◽

Discriminatory Power ◽

Training Data ◽

Metagenomic Data ◽

Marker Genes ◽

Amino Acid Region ◽

Database Structure

Abstract. Background: Microbial eukaryotes are nearly ubiquitous in microbiomes on Earth and contribute to many integral ecological functions. Metagenomics is a proven tool for studying the microbial diversity, functions, and ecology of microbiomes, but has been underutilized for microeukaryotes due to the computational challenges they present. For taxonomic classification, the use of a eukaryotic marker gene database can improve the computational efficiency, precision and sensitivity. However, state-of-the-art tools which use marker gene databases implement universal thresholds for classification rather than dynamically learning the thresholds from the database structure, impacting the accuracy of the classification process. Results: Here we introduce taxaTarget, a method for the taxonomic classification of microeukaryotes in metagenomic data. Using a database of eukaryotic marker genes and a supervised learning approach for training, we learned the discriminatory power and classification thresholds for each 20 amino acid region of each marker gene in our database. This approach provided improved sensitivity and precision compared to other state-of-the-art approaches, with rapid runtimes and low memory usage. Additionally, taxaTarget was better able to detect the presence of multiple closely related species as well as species with no representative sequences in the database. One of the greatest challenges faced during the development of taxaTarget was the general sparsity of available sequences for microeukaryotes. Several algorithms were implemented, including threshold padding, which effectively handled the missing training data and reduced classification errors. Using taxaTarget on metagenomes from human fecal microbiomes, a broader range of genera were detected, including multiple parasites that the other tested tools missed. Conclusion: Data-driven methods for learning classification thresholds from the structure of an input database can provide granular information about the discriminatory power of the sequences and improve the sensitivity and precision of classification. These methods will help facilitate a more comprehensive analysis of metagenomic data and expand our knowledge about the diverse eukaryotes in microbial communities.

Deep learning models for bacteria taxonomic classification of metagenomic data

BMC Bioinformatics ◽

10.1186/s12859-018-2182-6 ◽

2018 ◽

Vol 19 (S7) ◽

Cited By ~ 29

Author(s):

Antonino Fiannaca ◽

Laura La Paglia ◽

Massimo La Rosa ◽

Giosuè Lo Bosco ◽

Giovanni Renda ◽

...

Keyword(s):

Deep Learning ◽

Taxonomic Classification ◽

Metagenomic Data ◽

Learning Models

k-SLAM: accurate and ultra-fast taxonomic classification and gene identification for large metagenomic data sets

Nucleic Acids Research ◽

10.1093/nar/gkw1248 ◽

2016 ◽

pp. gkw1248 ◽

Cited By ~ 8

Author(s):

David Ainsworth ◽

Michael J.E. Sternberg ◽

Come Raczy ◽

Sarah A. Butcher

Keyword(s):

Taxonomic Classification ◽

Metagenomic Data ◽

Gene Identification ◽

Data Sets
