Scalable classification of organisms into a taxonomy using hierarchical supervised learners

2020 ◽  
Vol 18 (05) ◽  
pp. 2050026
Author(s):  
Gihad N. Sohsah ◽  
Ali Reza Ibrahimzada ◽  
Huzeyfe Ayaz ◽  
Ali Cakmak

Accurately identifying organisms based on their partially available genetic material is an important task in exploring the phylogenetic diversity of an environment. Specific fragments in the DNA sequence of a living organism have been defined as DNA barcodes and can be used as markers to identify species efficiently and effectively. The existing DNA barcode-based classification approaches suffer from three major issues: (i) most of them assume that the classification is done within a given taxonomic class and/or that input sequences are pre-aligned, (ii) high-performing classifiers, such as SVM, cannot scale to large taxonomies due to high memory requirements, and (iii) mutations and noise in input DNA sequences greatly reduce the taxonomic classification score. To address these issues, we propose a multi-level hierarchical classifier framework that automatically assigns taxonomy labels to DNA sequences. We use an alignment-free approach, the spectrum kernel method, for feature extraction. We build a proof-of-concept hierarchical classifier with two levels and evaluate it on real DNA sequence data from the Barcode of Life Data Systems. We demonstrate that the proposed framework provides a higher F1-score than regular classifiers. In addition, the hierarchical framework scales better to large datasets, enabling researchers to employ classifiers with high classification performance and high memory requirements on large datasets. Furthermore, we show that the proposed framework is more robust to mutations and noise in sequence data than non-hierarchical classifiers.
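The alignment-free feature extraction mentioned in the abstract above can be illustrated with a minimal sketch. The spectrum kernel represents a sequence by the counts of its length-k substrings (k-mers), and the kernel value for two sequences is the dot product of those count vectors. The function names, the choice k = 3, and the rule of skipping k-mers that contain non-ACGT characters are assumptions made for this illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import product


def kmer_spectrum(seq, k=3, alphabet="ACGT"):
    """Count every length-k substring of seq that uses only alphabet letters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
        if all(ch in alphabet for ch in kmer):
            counts[kmer] += 1
    return counts


def spectrum_vector(seq, k=3, alphabet="ACGT"):
    """Dense feature vector with one entry per possible k-mer (4**k for DNA)."""
    counts = kmer_spectrum(seq, k, alphabet)
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def spectrum_kernel(a, b, k=3):
    """Dot product of the two k-mer count vectors, computed sparsely."""
    ca, cb = kmer_spectrum(a, k), kmer_spectrum(b, k)
    return sum(n * cb[kmer] for kmer, n in ca.items())
```

Because the features depend only on k-mer counts, no alignment of the input sequences is needed, and the dense vector from `spectrum_vector` can be passed to any standard classifier.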

2020 ◽  
Author(s):  
Gihad N. Sohsah ◽  
Ali Reza Ibrahimzada ◽  
Huzeyfe Ayaz ◽  
Ali Cakmak

The taxonomy of living organisms is of major importance because it makes the study of vastly heterogeneous living things easier. In addition, various fields of applied biology (e.g., agriculture) depend on the classification of living creatures. Specific fragments of the DNA sequence of a living organism have been defined as DNA barcodes and can be used as markers to identify species efficiently and effectively. The existing DNA barcode-based classification approaches suffer from three major issues: (i) most of them assume that the classification is done within a given taxonomic class and/or that input sequences are prealigned, (ii) high-performing classifiers, such as SVM, cannot scale to large taxonomies due to high memory requirements, and (iii) mutations and noise in input DNA sequences greatly reduce the taxonomic classification accuracy. To address these issues, we propose a multi-level hierarchical classifier framework that automatically assigns taxonomy labels to DNA sequences. We use an alignment-free approach, the spectrum kernel method, for feature extraction. We build a proof-of-concept hierarchical classifier with two levels and evaluate it on real DNA sequence data from BOLD Systems. We demonstrate that the proposed framework provides higher accuracy than regular classifiers. In addition, the hierarchical framework scales better to large datasets, enabling researchers to employ classifiers with high accuracy and high memory requirements on large datasets. Furthermore, we show that the proposed framework is more robust to mutations and noise in sequence data than non-hierarchical classifiers.
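The two-level idea described above can be sketched as follows: one classifier assigns a coarse taxon, and a separate, smaller classifier trained only on that taxon's members assigns the fine label. This is a toy illustration under stated assumptions, not the authors' implementation. A nearest-centroid rule stands in for the SVM-style learners mentioned in the abstract, the features are normalised 2-mer frequencies, and the sequences and labels (`A-rich`, `a1`, and so on) are made up.

```python
from collections import Counter, defaultdict
from itertools import product
from math import dist

KMERS = ["".join(p) for p in product("ACGT", repeat=2)]


def features(seq):
    """Normalised 2-mer frequencies (a tiny spectrum-style feature vector)."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[m] for m in KMERS) or 1
    return [counts[m] / total for m in KMERS]


class NearestCentroid:
    """Stand-in base learner (assumption): predicts the closest class mean."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for x, label in zip(X, y):
            groups[label].append(x)
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda lab: dist(x, self.centroids[lab]))


class TwoLevelClassifier:
    """Level 1 picks a coarse taxon; level 2 picks a label within that taxon."""

    def fit(self, seqs, coarse, fine):
        X = [features(s) for s in seqs]
        self.top = NearestCentroid().fit(X, coarse)
        self.children = {}
        for group in set(coarse):
            idx = [i for i, c in enumerate(coarse) if c == group]
            # Each child model only ever sees the samples of its own taxon.
            self.children[group] = NearestCentroid().fit(
                [X[i] for i in idx], [fine[i] for i in idx])
        return self

    def predict(self, seq):
        x = features(seq)
        group = self.top.predict(x)
        return group, self.children[group].predict(x)


# Made-up toy data, for illustration only.
SEQS = ["AAAAAAAATA", "AAAACAAACA", "GGGGGGGGTG", "GGGGCGGGCG"]
COARSE = ["A-rich", "A-rich", "G-rich", "G-rich"]
FINE = ["a1", "a2", "g1", "g2"]

model = TwoLevelClassifier().fit(SEQS, COARSE, FINE)
print(model.predict("AAAAATAAAA"))  # -> ('A-rich', 'a1')
```

Each child model is trained on a fraction of the data and a fraction of the labels, which is one plausible reading of why a hierarchical arrangement lets memory-hungry classifiers be applied to larger taxonomies.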


Genetics ◽  
1993 ◽  
Vol 134 (4) ◽  
pp. 1195-1204
Author(s):  
S Tarès ◽  
J M Cornuet ◽  
P Abad

Abstract An AluI family of highly reiterated nontranscribed sequences has been found in the genome of the honeybee Apis mellifera. This repeated sequence is shown to be present at approximately 23,000 copies per haploid genome, constituting about 2% of the total genomic DNA. The nucleotide sequence of 10 monomers was determined. The consensus sequence is 176 nucleotides long and has an A + T content of 58%. There are clusters of both direct and inverted repeats. Internal subrepeating units ranging from 11 to 17 nucleotides are observed, suggesting that it could have evolved from a shorter sequence. DNA sequence data reveal that this repeat class is unusually homogeneous compared to the other class of invertebrate highly reiterated DNA sequences. The average pairwise sequence divergence between the repeats is 2.5%. In spite of this unusual homogeneity, divergence has been found in the repeated sequence hybridization ladder between four different honeybee subspecies. Therefore, the AluI highly reiterated sequences provide a new probe for fingerprinting in A. m. mellifera.


Zootaxa ◽  
2012 ◽  
Vol 3361 (1) ◽  
pp. 56-62 ◽  
Author(s):  
JOSEFINA CURIEL ◽  
JUAN J. MORRONE

Insect life stages are known imperfectly in many cases, and classifications are usually based on adult morphology. This is unfortunate as information on other life stages may be useful for biomonitoring. The major impediment to using elmid (Coleoptera) larvae for freshwater biomonitoring is the lack of larval descriptions and illustrations. Reliable molecular protocols may be used to associate larvae and adults. After adults of seven species of Mexican Macrelmis were identified morphologically, seven larval specimens were associated to them based on two gene fragments: Cox1 and Cob. The phylogenetic analysis allowed identifying the larval specimens as Macrelmis leonilae, M. scutellaris, M. species 7, M. species 10, and M. species 11. Two species based on adults associated uncertainly with one larva, and one larva did not match with any adult. Adult/larval association in elmids using DNA sequence data seems to be promising in terms of speed and reliability.


Jurnal MIPA ◽  
2015 ◽  
Vol 4 (2) ◽  
pp. 131
Author(s):  
Muzakir Rahalus ◽  
Maureen Kumaunang ◽  
Audy Wuntu ◽  
Julius Pontoh

DNA barcoding is a method of identifying living organisms using a short DNA sequence (± 500 base pairs). The purpose of this study was to obtain a DNA barcode for edelweis (Anaphalis javanica) and to analyze the similarity of its matK gene with that of its closest relatives. Total DNA of edelweis was successfully isolated using a modified version of the manual procedure of the InnuPrep Plant DNA Kit. The partial matK gene was isolated by Polymerase Chain Reaction (PCR) using the forward primer matK-1RKIM-f and the reverse primer matK-3FKIM-r. Sequence analysis yielded an edelweis DNA barcode 843 bp in size. Similarity analysis showed that the closest relative is A. margaritaceae, with 99.86% similarity in the BOLD System and 100% in NCBI.


2021 ◽  
Vol 12 ◽  
Author(s):  
Ying Zhang ◽  
Yupei Zhou ◽  
Wei Sun ◽  
Lili Zhao ◽  
D. Pavlic-Zupanc ◽  
...  

The genus Botryosphaeria includes more than 200 epithets, but only the type species, Botryosphaeria dothidea, and a dozen or more other species have been identified based on DNA sequence data. The taxonomic status of the other species remains unconfirmed because they lack either morphological information or DNA sequence data. In this study, types or authentic specimens of 16 “Botryosphaeria” species are reassessed to clarify their identity and phylogenetic position. nuDNA sequences of four regions, ITS, LSU, tef1-α and tub2, are analyzed and considered in combination with morphological characteristics. Based on the multigene phylogeny and morphological characters, Botryosphaeria cruenta and Botryosphaeria hamamelidis are transferred to Neofusicoccum. The generic status of Botryosphaeria aterrima and Botryosphaeria mirabile is confirmed in Botryosphaeria. Botryosphaeria berengeriana var. weigeliae and B. berengeriana var. acerina are treated as synonyms of B. dothidea. Botryosphaeria mucosa is transferred to Neodeightonia as Neodeightonia mucosa, and Botryosphaeria ferruginea to Nothophoma as Nothophoma ferruginea. Botryosphaeria foliicola is reduced to synonymy with Phyllachorella micheliae. Botryosphaeria abuensis, Botryosphaeria aesculi, Botryosphaeria dasylirii, and Botryosphaeria wisteriae are tentatively kept in Botryosphaeria sensu stricto until further phylogenetic analysis is carried out on verified specimens. The ordinal status of Botryosphaeria apocyni, Botryosphaeria gaubae, and Botryosphaeria smilacinina cannot be determined, and these species are tentatively accommodated in Dothideomycetes incertae sedis. The study demonstrates the significance of a polyphasic approach in characterizing type specimens, including the importance of using DNA sequence data.


Parasitology ◽  
2022 ◽  
pp. 1-16
Author(s):  
Anna Faltýnková ◽  
Olena Kudlai ◽  
Camila Pantoja ◽  
Galina Yakovleva ◽  
Daria Lebedeva

Abstract DNA sequence data have become an integral part of species characterization and identification. Still, specimens associated with a particular DNA sequence must be identified by traditional morphology-based analysis, and correct linking of sequence and identification must be ensured. Only a small part of the DNA sequences of the genus Diplostomum (Diplostomidae) is based on adult isolates, which are essential for accurate identification. In this study, we provide species identification with the aid of morphological and molecular (cox1, ITS-5.8S-ITS2 and 28S) characterization of adults of Diplostomum baeri Dubois, 1937 from naturally infected Larus canus Linnaeus in Karelia, Russia. Furthermore, we reveal that the DNA sequences of our isolates of D. baeri are identical with those of the lineage Diplostomum sp. clade Q, while other sequences labelled as the ‘D. baeri’ complex do not represent lineages of D. baeri. Our new material of cercariae from Radix balthica (Linnaeus) in Ireland is also linked to Diplostomum sp. clade Q. We reveal that D. baeri is widely distributed in Europe; lymnaeid snails (Radix auricularia (Linnaeus), R. balthica) are used as first intermediate hosts; metacercariae occur in the eye lens of cyprinid fishes. In light of the convoluted taxonomy of D. baeri and other Diplostomum spp., we extend the recommendations of Blasco-Costa et al. (2016, Systematic Parasitology 93, 295–306) for the ‘best practice’ in molecular approaches to trematode systematics. The current study is another step in elucidating the species spectrum of Diplostomum based on integrative taxonomy with well-described morphology of adults linked to sequences.


Author(s):  
Jeremy Charlier ◽  
Robert Nadon ◽  
Vladimir Makarenkov

Abstract Motivation: Off-target predictions are crucial in gene editing research. Recently, significant progress has been made in the prediction of off-target mutations, particularly with CRISPR-Cas9 data, thanks to the use of deep learning. CRISPR-Cas9 is a gene editing technique which allows manipulation of DNA fragments. The encoding of sgRNA-DNA (single guide RNA-DNA) sequences for deep neural networks, however, has a strong impact on the prediction accuracy. We propose a novel encoding of sgRNA-DNA sequences that aggregates sequence data with no loss of information. Results: In our experiments, we compare the proposed sgRNA-DNA sequence encoding applied in a deep learning prediction framework with state-of-the-art encoding and prediction methods. We demonstrate the superior accuracy of our approach in a simulation study involving Feedforward Neural Networks (FNNs), Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) as well as the traditional Random Forest (RF), Naive Bayes (NB) and Logistic Regression (LR) classifiers. We highlight the quality of our results by building several FNNs, CNNs and RNNs with various layer depths and performing predictions on two popular gene editing data sets, CRISPOR and GUIDE-seq. In all our experiments, the new encoding led to more accurate off-target prediction results, improving the area under the Receiver Operating Characteristic (ROC) curve by up to 35%. Availability: The code and data used in this study are available at: https://github.com/dagrate/dl-offtarget


2016 ◽  
Vol 36 (1) ◽  
Author(s):  
Paola Berchialla

We introduce a Bayesian hierarchical model for mitochondrial DNA sequence data, which is fitted via acceptance-rejection algorithms. The model incorporates parametric models of population history explicitly as well as a mutational process allowing for a simultaneous parameter estimation whose importance has become increasingly clear in many recent studies. The model is applied to a sample of DNA sequences from the Italian population.


Author(s):  
Dmitry Schigel ◽  
Thomas Jeppesen ◽  
Robert Finn ◽  
Guy Cochrane ◽  
Urmas Kõljalg ◽  
...  

The Global Biodiversity Information Facility (GBIF) was established by governments in 2001, largely through the initiative and leadership of the natural history collections community, following the 1999 recommendation by a working group under the Megascience Forum (predecessor of the Global Science Forum) of the Organization for Economic Cooperation and Development (OECD). Over 20 years, GBIF has helped develop standards and convened a global community of data-publishing institutions, aggregating over one billion specimen occurrence records freely and openly available for use in research and policy making. These GBIF-mediated data range from vouchered museum specimens to observation records generated by humans and machines. New data are being generated from integrated remote sensing, ecological sampling, and molecular sequencing that have strong geospatial components but lack traditional vouchers. GBIF is working with partners to develop best practices for bringing these data into the GBIF architecture. Following discussions during the second Global Biodiversity Information Conference in 2018, GBIF and the European Bioinformatics Institute (EMBL-EBI), supported by ELIXIR, have extended collaboration to share species occurrence records known only from their genetic material. When these data providers contribute coordinates along with the sequences to the European Nucleotide Archive (ENA), the records will appear on GBIF maps and in spatial searches. This collaboration enables significant new molecular data streams to become discoverable through GBIF.org: by mid-March 2019, over 7.8m individual occurrence records via the ENA, and over 13.2m records as standardized Darwin Core sampling-event datasets via MGnify, a resource that provides taxonomic and functional annotations on sequences derived from environmental sequencing projects.
Sequence-based occurrence records published by ENA and MGnify boost the representation of microbial diversity, which was underrepresented at GBIF. The ELIXIR-ENA-MGnify-GBIF partnership is working on further refinement of the dynamic data linkages, frequency of updates and other improvements. The API-based tool that connects GBIF data infrastructures is open to new data contributors and to indexes of molecular occurrences. Indexing of these data streams depends on the presence of a name (of any rank) with the sequence. Under the current Codes of nomenclature, animals, fungi, plants, and algae cannot be described based exclusively on sequence data. Yet a significant volume of biodiversity data is represented only by DNA sequences. Barcoding and sequence clustering procedures vary among taxa and research communities, but clusters can be related to a taxon with a Latin name. Many DNA similarity clusters do not contain a sequence from a formally described taxon; however, these sequence clusters provide provisional molecular names for nomenclatural communication. In the best cases, curated libraries of reference sequences, their metadata, clusters, alignments, and links to individuals and physical material become de facto naming conventions for certain taxonomic groups, and co-exist with Latin names. Integration of molecular names into the taxonomic backbone of GBIF started with Fungi and UNITE, a data management and identification environment for fungal ITS barcodes with 87,000+ fungal species hypotheses demarcating 800,000+ sequence specimens as of March 2019. Checklist publication of all names in UNITE through GBIF.org, including Linnaean names and stable, DOI-trackable molecular sequence-based ‘species hypotheses’, enables indexing of fungal metabarcoding data worldwide, such as BIOWIDE.
As names are currently essential to indexing the world’s occurrence data, GBIF will develop similar linkages with names in the Barcode of Life Data System (BOLD) and in SILVA, a resource for high-quality ribosomal RNA sequence data and taxonomy, and welcomes other reference systems to this development. Expanding the molecular data streams (Fig. 1) allows GBIF to address spatial, temporal and taxonomic gaps and biases, and to support large-scale data-intensive research openly and worldwide.


2021 ◽  
Vol 22 (3) ◽  
pp. 505
Author(s):  
SONIA GIULIETTI ◽  
TIZIANA ROMAGNOLI ◽  
ALESSANDRA CAMPANELLI ◽  
CECILIA TOTTI ◽  
STEFANO ACCORONI

The ecology and seasonality of Pseudo-nitzschia species and their contribution to the phytoplankton community were analysed for the first time at the coastal station of the LTER-Senigallia-Susak transect (north-western Adriatic Sea) from 1988 to 2020. Species composition was addressed using DNA sequence data obtained from 106 monoclonal strains isolated from January 2018 to January 2020. The mean annual cycle of total phytoplankton in the study period (Feb 1988–Jan 2020) showed maximum abundances in winter followed by other peaks in spring and autumn. Diatoms were the main contributors in terms of abundance during the winter and the spring blooms. The autumn peak was due to phytoflagellates and diatoms. In summer phytoflagellates dominated the community, followed by diatoms and dinoflagellates, which in this season reached their annual maximum. Pseudo-nitzschia spp. represented on average 0.4–17.6% of the diatom community, but during their blooms they could reach up to 90% of the total diatom abundance, with 10⁶ cells l⁻¹. By LM, six different taxa were recognized: Pseudo-nitzschia cf. delicatissima and P. cf. pseudodelicatissima were the most abundant, followed by P. cf. fraudulenta, P. pungens, P. multistriata and P. cf. galaxiae. P. cf. fraudulenta and P. pungens were indicator taxa of winter. P. cf. delicatissima and P. cf. pseudodelicatissima were spring and summer taxa, respectively. P. galaxiae showed maximum abundances in autumn. DNA sequences revealed the presence of two species belonging to the ‘P. seriata group’ (i.e. P. fraudulenta and P. pungens) and four species belonging to the ‘P. delicatissima group’ (P. calliantha and P. mannii within the P. pseudodelicatissima species complex, and P. delicatissima and P. cf. arenysensis within the P. delicatissima species complex). The presence of several cryptic and pseudo-cryptic species highlights the need to combine LM observations with DNA sequence data when the ecology of Pseudo-nitzschia is investigated.

