BERTax: taxonomic classification of DNA sequences with Deep Neural Networks

Mapping Intimacies ◽

10.1101/2021.07.09.451778 ◽

2021 ◽

Author(s):

Florian Mock ◽

Fleming Kretschmer ◽

Anton Kriese ◽

Sebastian Böcker ◽

Manja Marz

Keyword(s):

Language Processing ◽

DNA Sequences ◽

Information Gain ◽

Taxonomic Classification ◽

Training Data ◽

Misclassification Rate ◽

Genomic Sequences ◽

Similar Species ◽

Common Task

Taxonomic classification, i.e., the identification and assignment to groups of biological organisms with the same origin and characteristics, is a common task in genetics. Nowadays, taxonomic classification is mainly based on genome similarity search to large genome databases. In this process, the classification quality depends heavily on the database since representative relatives have to be known already. Many genomic sequences cannot be classified at all or only with a high misclassification rate. Here we present BERTax, a program that uses a deep neural network to precisely classify the superkingdom, phylum, and genus of DNA sequences taxonomically without the need for a known representative relative from a database. For this, BERTax uses the natural language processing model BERT trained to represent DNA. We show BERTax to be at least on par with the state-of-the-art approaches when taxonomically similar species are part of the training data. In case of an entirely novel organism, however, BERTax clearly outperforms any existing approach. Finally, we show that BERTax can also be combined with database approaches to further increase the prediction quality. Since BERTax is not based on homologous entries in databases, it allows precise taxonomic classification of a broader range of genomic sequences. This leads to a higher number of correctly classified sequences and thus increases the overall information gain.


Higher-order Markov models for metagenomic sequence classification

Bioinformatics ◽

10.1093/bioinformatics/btaa562 ◽

2020 ◽

Vol 36 (14) ◽

pp. 4130-4136

Author(s):

David J Burks ◽

Rajeev K Azad

Keyword(s):

DNA Sequences ◽

Markov Models ◽

Fragment Size ◽

Higher Order ◽

Training Data ◽

Supplementary Information ◽

Local Alignment ◽

Metagenomic Sequence ◽

Higher Order Models

Motivation: Alignment-free, stochastic models derived from k-mer distributions representing reference genome sequences have a rich history in the classification of DNA sequences. In particular, the variants of Markov models have previously been used extensively. Higher-order Markov models have been used with caution, perhaps sparingly, primarily because of the lack of enough training data and computational power. Advances in sequencing technology and computation have enabled exploitation of the predictive power of higher-order models. We, therefore, revisited higher-order Markov models and assessed their performance in classifying metagenomic sequences. Results: Comparative assessment of higher-order models (HOMs, 9th order or higher) with interpolated Markov model, interpolated context model and lower-order models (8th order or lower) was performed on metagenomic datasets constructed using sequenced prokaryotic genomes. Our results show that HOMs outperform other models in classifying metagenomic fragments as short as 100 nt at all taxonomic ranks, and at lower ranks when the fragment size was increased to 250 nt. HOMs were also found to be significantly more accurate than local alignment, which is widely relied upon for taxonomic classification of metagenomic sequences. A novel software implementation written in C++ performs classification faster than the existing Markovian metagenomic classifiers and can therefore be used as a standalone classifier or in conjunction with existing taxonomic classifiers for more robust classification of metagenomic sequences. Availability and implementation: The software has been made available at https://github.com/djburks/SMM. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
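To make the idea of a fixed-order Markov classifier concrete, the following is a minimal Python sketch and not the authors' C++ implementation. It assumes one model per taxon, add-one smoothing, and classification by highest log-likelihood; the function names, the `pseudocount` parameter and the toy training sequences are all hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(sequences, order):
    """Count (context -> next base) transitions for a fixed-order Markov model."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous symbols such as N.
            if base in ALPHABET and set(context) <= set(ALPHABET):
                counts[context][base] += 1
    return counts

def log_likelihood(seq, counts, order, pseudocount=1.0):
    """Log-probability of seq under the model, with additive smoothing (an assumption)."""
    total = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        if base not in ALPHABET:
            continue
        ctx_counts = counts.get(context, {})
        numerator = ctx_counts.get(base, 0) + pseudocount
        denominator = sum(ctx_counts.values()) + pseudocount * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total

def classify(fragment, models, order):
    """Return the label whose model assigns the fragment the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(fragment, models[label], order))

# Toy usage with two artificial "taxa"; a real higher-order model would use order >= 9.
ORDER = 2
models = {
    "taxon_ac": train_markov(["AC" * 50], ORDER),
    "taxon_gt": train_markov(["GT" * 50], ORDER),
}
predicted = classify("ACACACACAC", models, ORDER)
```

The number of possible contexts grows as 4^order, which is one way to see why the abstract links higher-order models to the availability of training data and computational power.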


Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT

Genome Biology ◽

10.1186/s13059-019-1817-x ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 26

Author(s):

F. A. Bastiaan von Meijenfeldt ◽

Ksenia Arkhipova ◽

Diego D. Cambuy ◽

Felipe H. Coutinho ◽

Bas E. Dutilh

Keyword(s):

DNA Sequences ◽

De Novo ◽

Taxonomic Classification ◽

Classification Method ◽

Reference Database ◽

Annotation Tool ◽

Multiple Signals

Current-day metagenomics analyses increasingly involve de novo taxonomic classification of long DNA sequences and metagenome-assembled genomes. Here, we show that the conventional best-hit approach often leads to classifications that are too specific, especially when the sequences represent novel deep lineages. We present a classification method that integrates multiple signals to classify sequences (Contig Annotation Tool, CAT) and metagenome-assembled genomes (Bin Annotation Tool, BAT). Classifications are automatically made at low taxonomic ranks if closely related organisms are present in the reference database and at higher ranks otherwise. The result is a high classification precision even for sequences from considerably unknown organisms.
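One way to picture how a classification can settle at a low rank when support is strong and at a higher rank otherwise is the sketch below. It is an illustration under stated assumptions, not the CAT/BAT code: each signal is assumed to be a (lineage, score) pair, and the deepest taxon whose summed score exceeds a chosen fraction of the total is reported. The function name, the `min_fraction` parameter and the example lineages are hypothetical.

```python
from collections import defaultdict

def classify_by_support(hits, min_fraction=0.5):
    """Return the deepest lineage prefix supported by more than min_fraction of the score.

    hits: list of (lineage, score) pairs, each lineage ordered from root to leaf.
    With min_fraction >= 0.5 at most one prefix per depth can qualify, so the
    qualifying prefixes are nested and the longest one is unambiguous.
    """
    total = sum(score for _, score in hits)
    if total <= 0:
        return []
    support = defaultdict(float)
    for lineage, score in hits:
        # Every ancestor of the hit taxon receives the hit's score.
        for depth in range(1, len(lineage) + 1):
            support[tuple(lineage[:depth])] += score
    qualifying = [prefix for prefix, s in support.items() if s / total > min_fraction]
    return list(max(qualifying, key=len)) if qualifying else []

# Conflicting genus-level signals: the result backs off to the phylum.
conflicting = [
    (["Bacteria", "Proteobacteria", "Escherichia"], 100.0),
    (["Bacteria", "Proteobacteria", "Salmonella"], 90.0),
    (["Bacteria", "Firmicutes", "Bacillus"], 20.0),
]
backed_off = classify_by_support(conflicting)

# Consistent signals: the genus is retained.
consistent = [
    (["Bacteria", "Proteobacteria", "Escherichia"], 100.0),
    (["Bacteria", "Proteobacteria", "Escherichia"], 80.0),
]
specific = classify_by_support(consistent)
```

A plain best-hit rule would report Escherichia in the first example even though nearly half of the evidence points elsewhere, which is the over-specific behaviour the abstract describes.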


Contig annotation tool CAT robustly classifies assembled metagenomic contigs and long sequences

10.1101/072868 ◽

2016 ◽

Cited By ~ 13

Author(s):

Diego D. Cambuy ◽

Felipe H. Coutinho ◽

Bas E. Dutilh

Keyword(s):

Single Molecule ◽

DNA Sequences ◽

Taxonomic Classification ◽

Annotation Tool ◽

Single Molecule Sequencing ◽

Short Read ◽

Long Read ◽

Micro Organisms ◽

Taxonomic Annotation

In modern-day metagenomics, there is an increasing need for robust taxonomic annotation of long DNA sequences from unknown micro-organisms. Long metagenomic sequences may be derived from assembly of short-read metagenomes, or from long-read single molecule sequencing. Here we introduce CAT, a pipeline for robust taxonomic classification of long DNA sequences. We show that CAT correctly classifies contigs at different taxonomic levels, even in simulated metagenomic datasets that are very distantly related to the sequences in the database. CAT is implemented in Python and the required scripts can be freely downloaded from GitHub.


TaxaTarget: Fast, Sensitive, and Precise Classification of Microeukaryotes in Metagenomic Data

10.21203/rs.3.rs-1186624/v1 ◽

2021 ◽

Author(s):

Seth Commichaux ◽

Kiran Javkar ◽

Harihara Subrahmaniam Muralidharan ◽

Padmini Ramachandran ◽

Andrea Ottesen ◽

...

Keyword(s):

State Of The Art ◽

Marker Gene ◽

Taxonomic Classification ◽

Discriminatory Power ◽

Training Data ◽

Metagenomic Data ◽

Marker Genes ◽

Amino Acid Region ◽

Database Structure

Background: Microbial eukaryotes are nearly ubiquitous in microbiomes on Earth and contribute to many integral ecological functions. Metagenomics is a proven tool for studying the microbial diversity, functions, and ecology of microbiomes, but has been underutilized for microeukaryotes due to the computational challenges they present. For taxonomic classification, the use of a eukaryotic marker gene database can improve the computational efficiency, precision and sensitivity. However, state-of-the-art tools which use marker gene databases implement universal thresholds for classification rather than dynamically learning the thresholds from the database structure, impacting the accuracy of the classification process. Results: Here we introduce taxaTarget, a method for the taxonomic classification of microeukaryotes in metagenomic data. Using a database of eukaryotic marker genes and a supervised learning approach for training, we learned the discriminatory power and classification thresholds for each 20 amino acid region of each marker gene in our database. This approach provided improved sensitivity and precision compared to other state-of-the-art approaches, with rapid runtimes and low memory usage. Additionally, taxaTarget was better able to detect the presence of multiple closely related species as well as species with no representative sequences in the database. One of the greatest challenges faced during the development of taxaTarget was the general sparsity of available sequences for microeukaryotes. Several algorithms were implemented, including threshold padding, which effectively handled the missing training data and reduced classification errors. Using taxaTarget on metagenomes from human fecal microbiomes, a broader range of genera were detected, including multiple parasites that the other tested tools missed. Conclusion: Data-driven methods for learning classification thresholds from the structure of an input database can provide granular information about the discriminatory power of the sequences and improve the sensitivity and precision of classification. These methods will help facilitate a more comprehensive analysis of metagenomic data and expand our knowledge about the diverse eukaryotes in microbial communities.


Multi-feature fusion framework for sarcasm identification on twitter data: A machine learning based approach

PLoS ONE ◽

10.1371/journal.pone.0252918 ◽

2021 ◽

Vol 16 (6) ◽

pp. e0252918

Author(s):

Christopher Ifeanyi Eke ◽

Azah Anir Norman ◽

Liyana Shuib

Keyword(s):

Language Processing ◽

Feature Fusion ◽

Contextual Information ◽

Training Data ◽

Lexical Feature ◽

Twitter Data ◽

Stage Classification ◽

BoW Technique ◽

Fusion Framework

Sarcasm is the main reason behind the faulty classification of tweets. It poses a challenge in natural language processing (NLP) because it hampers the detection of people’s actual sentiment. Various feature engineering techniques are being investigated for the automatic detection of sarcasm. However, most related techniques have concentrated only on the content-based features of sarcastic expressions, leaving the contextual information aside. This leads to a loss of the semantics of words in the sarcastic expression. Another drawback is the sparsity of the training data: due to the word limit of microblogs, the feature vectors constructed by BoW for each sample contain null features. To address these problems, a Multi-feature Fusion Framework is proposed using two classification stages. The first-stage classification is constructed with the lexical feature only, extracted using the BoW technique, and trained using five standard classifiers (SVM, DT, KNN, LR, and RF) to predict the sarcastic tendency. In stage two, the constructed lexical sarcastic tendency feature is fused with eight other proposed features for modelling context to obtain a final prediction. The effectiveness of the developed framework is tested with various experimental analyses to assess the classifiers’ performance. The evaluation shows that our classification models based on the developed novel feature fusion obtained a precision of 0.947 using a Random Forest classifier. Finally, the obtained results were compared with the results of three baseline approaches. The comparison outcome shows the significance of the proposed framework.


Taxonomic identification from metagenomic and metabarcoding data using any genetic marker

10.1101/253377 ◽

2018 ◽

Author(s):

Johan Bengtsson-Palme ◽

Rodney T. Richardson ◽

Marco Meola ◽

Christian Wurzbacher ◽

Émilie D. Tremblay ◽

...

Keyword(s):

Genetic Marker ◽

DNA Sequences ◽

Sequence Data ◽

Taxonomic Diversity ◽

Taxonomic Classification ◽

Taxonomic Identification ◽

Link Type

Correct taxonomic identification of DNA sequences is central to studies of biodiversity using both shotgun metagenomic and metabarcoding approaches. However, there is no genetic marker that gives sufficient performance across all the biological kingdoms, hampering studies of taxonomic diversity in many groups of organisms. We here present a major update to Metaxa2 (http://microbiology.se/software/metaxa2/) that enables the use of any genetic marker for taxonomic classification of metagenome and amplicon sequence data.


Assessing alignment-based taxonomic classification of ancient microbial DNA

10.7287/peerj.preprints.27166v1 ◽

2018 ◽

Author(s):

Raphael Eisenhofer ◽

Laura Susan Weyrich

Keyword(s):

Ancient DNA ◽

DNA Sequences ◽

Random Sequence ◽

Taxonomic Classification ◽

Metagenomic Data ◽

Data Sets ◽

Protein Alignments ◽

Microbial DNA ◽

DNA Characteristics

The field of paleomicrobiology—the study of ancient microorganisms—is rapidly growing due to recent methodological and technological advancements. It is now possible to obtain vast quantities of DNA data from ancient specimens in a high-throughput manner and use this information to investigate the dynamics and evolution of past microbial communities. However, we still know very little about how the characteristics of ancient DNA influence our ability to accurately assign microbial taxonomies (i.e. identify species) within ancient metagenomic samples. Here, we use both simulated and published metagenomic data sets to investigate how ancient DNA characteristics affect alignment-based taxonomic classification. We find that nucleotide-to-nucleotide, rather than nucleotide-to-protein, alignments are preferable when assigning taxonomies to DNA fragment lengths routinely identified within ancient specimens (<60 bp). We determine that deamination (a form of ancient DNA damage) and random sequence substitutions corresponding to ~100,000 years of genomic divergence minimally impact alignment-based classification. We also test four different reference databases and find that database choice can significantly bias the results of alignment-based taxonomic classification in ancient metagenomic studies. Finally, we perform a reanalysis of previously published ancient dental calculus data, increasing the number of microbial DNA sequences assigned taxonomically by an average of 64.2-fold and identifying microbial species previously unidentified in the original study. Overall, this study enhances our understanding of how ancient DNA characteristics influence alignment-based taxonomic classification of ancient microorganisms and provides recommendations for future paleomicrobiological studies.
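As an illustration of how deamination damage might be simulated before alignment-based classification, here is a minimal Python sketch. It is not the simulation procedure used in the study: it models only cytosine-to-thymine substitutions whose probability decays geometrically with distance from the 5' end, and the function name and the `rate_at_end` and `decay` parameters are hypothetical.

```python
import random

def deaminate(read, rate_at_end=0.3, decay=0.5, rng=None):
    """Return a copy of read with simulated C-to-T deamination damage.

    The substitution probability is rate_at_end at position 0 and is multiplied
    by decay for each further position (an assumed, simplified damage profile).
    """
    rng = rng or random.Random()
    damaged = []
    for position, base in enumerate(read):
        probability = rate_at_end * (decay ** position)
        if base == "C" and rng.random() < probability:
            damaged.append("T")
        else:
            damaged.append(base)
    return "".join(damaged)

# Seeding the generator makes a simulation repeatable.
rng = random.Random(42)
original = "CCACGTTCGACC"
simulated = deaminate(original, rng=rng)
```

Damaged reads produced this way could then be passed to the same aligner and reference database as the undamaged reads, so that the assigned taxa can be compared.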


The Modified DNA Identification Classification on Fuzzy Relation

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.48-49.1275 ◽

2011 ◽

Vol 48-49 ◽

pp. 1275-1281 ◽

Cited By ~ 3

Author(s):

Yu Jen Hu ◽

Yuh Hua Hu ◽

Jyh Bin Ke

Keyword(s):

Machine Learning ◽

DNA Sequences ◽

Fuzzy Relation ◽

Fuzzy Cluster ◽

Training Data ◽

DNA Identification ◽

Modified DNA

We propose a method for categorizing DNA sequence matrices using FCM (fuzzy cluster means). FCM avoids the errors caused by dimensionality reduction. It further achieves comprehensive machine learning. In our experiment, the training data consist of 40 artificial samples, and we verify the proposed method with 182 natural DNA sequences. The results show that the proposed method improved the accuracy of gene classification from 76% to 93%.


Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT

10.1101/530188 ◽

2019 ◽

Cited By ~ 9

Author(s):

F.A. Bastiaan von Meijenfeldt ◽

Ksenia Arkhipova ◽

Diego D. Cambuy ◽

Felipe H. Coutinho ◽

Bas E. Dutilh

Keyword(s):

DNA Sequences ◽

Real World ◽

Taxonomic Classification ◽

Reference Database ◽

Annotation Tool ◽

High Quality

Current-day metagenomics increasingly requires taxonomic classification of long DNA sequences and metagenome-assembled genomes (MAGs) of unknown microorganisms. We show that the standard best-hit approach often leads to classifications that are too specific. We present tools to classify high-quality metagenomic contigs (Contig Annotation Tool, CAT) and MAGs (Bin Annotation Tool, BAT) and thoroughly benchmark them with simulated metagenomic sequences that are classified against a reference database where related sequences are increasingly removed, thereby simulating increasingly unknown queries. We find that the query sequences are correctly classified at low taxonomic ranks if closely related organisms are present in the reference database, while classifications are made higher in the taxonomy when closely related organisms are absent, thus avoiding spurious classification specificity. In a real-world challenge, we apply BAT to over 900 MAGs from a recent rumen metagenomics study and classify 97% consistently with prior phylogeny-based classifications, but in a fully automated fashion.


ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels

10.1101/394932 ◽

2018 ◽

Author(s):

Gurjit S. Randhawa ◽

Kathleen A. Hill ◽

Lila Kari

Keyword(s):

Software Tool ◽

Digital Signal ◽

Taxonomic Classification ◽

Genomic Sequences ◽

Mitochondrial Genomes ◽

Genomic Signatures ◽

Alignment Free ◽

Benchmark Datasets ◽

Similar Accuracy

Background: Although methods and software tools abound for the comparison, analysis, identification, and taxonomic classification of the enormous amount of genomic sequences that are continuously being produced, taxonomic classification remains challenging. The difficulty lies within both the magnitude of the dataset and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods. Results: We combine supervised Machine Learning with Digital Signal Processing to design ML-DSP, an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7,396 full mitochondrial genomes from the kingdom to genus levels, with 98% classification accuracy. Compared with the alignment-based classification tool MEGA7 (with sequences aligned with either MUSCLE, or CLUSTALW), ML-DSP has similar accuracy scores while being significantly faster on two small benchmark datasets (2,250 to 67,600 times faster for 41 mammalian mitochondrial genomes). ML-DSP also successfully scales to accurately classify a large dataset of 4,322 complete vertebrate mtDNA genomes, a task which MEGA7 with MUSCLE or CLUSTALW did not complete after several hours, and had to be terminated. ML-DSP also outperforms the alignment-free tool FFP (Feature Frequency Profiles) in terms of both accuracy and time, being three times faster for the vertebrate mtDNA genomes dataset. Conclusions: We provide empirical evidence that ML-DSP distinguishes complete genome sequences at all taxonomic levels. Ultrafast and accurate taxonomic classification of genomic sequences is predicted to be highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures, in identifying mechanistic determinants of genomic signatures, and in evaluating genome integrity.
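The combination of digital signal processing with a supervised classifier can be sketched in a few lines of Python. This is an illustration under assumptions rather than the ML-DSP tool itself: the purine/pyrimidine numeric mapping, the fixed signal length, the normalised Pearson distance and the nearest-reference rule are choices made here for brevity, and all names are hypothetical. The discrete Fourier transform is written out with the standard library, so it is only suitable for short toy sequences.

```python
import cmath
import math

# Assumed numeric representation: purines -> -1, pyrimidines -> +1.
PURINE_PYRIMIDINE = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}

def to_signal(seq, length):
    """Map a DNA string to a numeric signal, truncated or zero-padded to length."""
    values = [PURINE_PYRIMIDINE.get(base, 0.0) for base in seq.upper()[:length]]
    return values + [0.0] * (length - len(values))

def magnitude_spectrum(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) evaluation)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]

def pearson_distance(u, v):
    """Distance (1 - r) / 2 in [0, 1], where r is the Pearson correlation of u and v."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.5  # A constant spectrum is treated as uncorrelated.
    return (1.0 - cov / (norm_u * norm_v)) / 2.0

def nearest_label(query, references, length=64):
    """Assign the label of the reference whose spectrum is closest to the query's."""
    query_spectrum = magnitude_spectrum(to_signal(query, length))
    return min(
        references,
        key=lambda label: pearson_distance(
            query_spectrum, magnitude_spectrum(to_signal(references[label], length))
        ),
    )

# Toy references with different periodicities, and a query with one substitution.
references = {"period_2": "AC" * 32, "period_4": "AACC" * 16}
query = "CC" + "AC" * 31
predicted = nearest_label(query, references)
```

Because only spectra are compared, no alignment between the query and the references is needed; a practical implementation would use a fast Fourier transform and a trained classifier over the pairwise distances rather than a single nearest reference.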
