Scalable Classification of Organisms into a Taxonomy Using Hierarchical Supervised Learners

The taxonomy of living organisms is of major importance because it makes the study of vastly heterogeneous living things easier. In addition, various fields of applied biology (e.g., agriculture) depend on the classification of living creatures. Specific fragments of an organism's DNA sequence have been defined as DNA barcodes and can be used as markers to identify species efficiently and effectively. Existing DNA barcode-based classification approaches suffer from three major issues: (i) most of them assume that classification is done within a given taxonomic class and/or that input sequences are prealigned; (ii) high-performing classifiers, such as SVMs, cannot scale to large taxonomies because of their high memory requirements; and (iii) mutations and noise in input DNA sequences greatly reduce taxonomic classification accuracy. To address these issues, we propose a multi-level hierarchical classifier framework that automatically assigns taxonomy labels to DNA sequences. We use an alignment-free approach, the spectrum kernel method, for feature extraction. We built a proof-of-concept hierarchical classifier with two levels and evaluated it on real DNA sequence data from BOLD Systems. We demonstrate that the proposed framework provides higher accuracy than regular, non-hierarchical classifiers. In addition, the hierarchical framework scales better to large datasets, enabling researchers to employ classifiers with high accuracy and high memory requirements on such datasets. Furthermore, we show that the proposed framework is more robust to mutations and noise in sequence data than non-hierarchical classifiers.
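The alignment-free feature extraction mentioned in the abstract can be illustrated with a minimal sketch of a spectrum kernel: each sequence is represented by the counts of its overlapping k-mers, and two sequences are compared by the inner product of those count vectors. The function names and the choice k=3 are assumptions made for illustration; the abstract does not state the authors' parameters or implementation.

```python
from collections import Counter


def kmer_spectrum(seq, k=3):
    """Count every overlapping k-mer in seq; no alignment is needed."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """Spectrum kernel value: inner product of the two k-mer count vectors."""
    spec_a = kmer_spectrum(seq_a, k)
    spec_b = kmer_spectrum(seq_b, k)
    # Counter returns 0 for k-mers that are absent from spec_b.
    return sum(count * spec_b[kmer] for kmer, count in spec_a.items())


if __name__ == "__main__":
    # Toy sequences, not real barcode data.
    print(kmer_spectrum("ACGACG"))
    print(spectrum_kernel("ACGACG", "ACGT"))
```

Because the features depend only on k-mer content, sequences of different lengths can be compared directly, which is why no prealignment is required.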


Scalable classification of organisms into a taxonomy using hierarchical supervised learners

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720020500262 ◽

2020 ◽

Vol 18 (05) ◽

pp. 2050026

Author(s):

Gihad N. Sohsah ◽

Ali Reza Ibrahimzada ◽

Huzeyfe Ayaz ◽

Ali Cakmak

Keyword(s):

DNA Sequence ◽

DNA Sequences ◽

Kernel Method ◽

Sequence Data ◽

Genetic Material ◽

DNA Barcode ◽

Living Organism ◽

Large Datasets ◽

Data Systems ◽

Hierarchical Classifier

Accurately identifying organisms based on their partially available genetic material is an important task in exploring the phylogenetic diversity of an environment. Specific fragments of an organism's DNA sequence have been defined as DNA barcodes and can be used as markers to identify species efficiently and effectively. Existing DNA barcode-based classification approaches suffer from three major issues: (i) most of them assume that classification is done within a given taxonomic class and/or that input sequences are pre-aligned; (ii) high-performing classifiers, such as SVMs, cannot scale to large taxonomies because of their high memory requirements; and (iii) mutations and noise in input DNA sequences greatly reduce the taxonomic classification score. To address these issues, we propose a multi-level hierarchical classifier framework that automatically assigns taxonomy labels to DNA sequences. We use an alignment-free approach, the spectrum kernel method, for feature extraction. We built a proof-of-concept hierarchical classifier with two levels and evaluated it on real DNA sequence data from the Barcode of Life Data Systems (BOLD). We demonstrate that the proposed framework provides a higher F1-score than regular, non-hierarchical classifiers. In addition, the hierarchical framework scales better to large datasets, enabling researchers to employ classifiers with high classification performance and high memory requirements on such datasets. Furthermore, we show that the proposed framework is more robust to mutations and noise in sequence data than non-hierarchical classifiers.
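The two-level idea can be sketched as follows: a top-level learner assigns a coarse taxonomic group, and a separate learner trained only on that group's sequences assigns the species. This is a minimal illustration under stated assumptions, not the authors' implementation: a nearest-centroid rule on k-mer counts stands in for the SVM-style classifiers named in the abstract, and the class names, the value k=2, and the toy labels are all hypothetical.

```python
from collections import Counter, defaultdict


def kmer_counts(seq, k=2):
    """Alignment-free features: counts of overlapping k-mers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class CentroidClassifier:
    """Stand-in learner (an assumption, not the paper's): nearest mean k-mer profile."""

    def fit(self, seqs, labels):
        sums, sizes = defaultdict(Counter), Counter(labels)
        for seq, label in zip(seqs, labels):
            sums[label].update(kmer_counts(seq))
        self.centroids = {
            label: {kmer: total / sizes[label] for kmer, total in counts.items()}
            for label, counts in sums.items()
        }
        return self

    def predict(self, seq):
        feats = kmer_counts(seq)

        def sq_dist(centroid):
            keys = set(feats) | set(centroid)
            return sum((feats.get(key, 0) - centroid.get(key, 0)) ** 2 for key in keys)

        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


class TwoLevelClassifier:
    """Level 1 picks a coarse group; level 2 picks a species within that group."""

    def fit(self, seqs, groups, species):
        self.top = CentroidClassifier().fit(seqs, groups)
        by_group = defaultdict(lambda: ([], []))
        for seq, group, name in zip(seqs, groups, species):
            by_group[group][0].append(seq)
            by_group[group][1].append(name)
        # Each leaf model is trained only on its own group's sequences.
        self.leaf = {g: CentroidClassifier().fit(s, n) for g, (s, n) in by_group.items()}
        return self

    def predict(self, seq):
        group = self.top.predict(seq)
        return group, self.leaf[group].predict(seq)


if __name__ == "__main__":
    # Toy sequences and hypothetical labels, not BOLD data.
    seqs = ["AAAAAAAT", "AAAAAATT", "CCCCCCCG", "CCCCCCGG"]
    model = TwoLevelClassifier().fit(seqs, ["g1", "g1", "g2", "g2"], ["s1", "s2", "s3", "s4"])
    print(model.predict("CCCCCGGG"))
```

Each second-level model only has to separate the species inside one group, so no single model needs to hold the whole taxonomy in memory; that is the scaling property the abstract attributes to the hierarchical framework.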


Systematic Approach to Establish DNA Barcode of Medicinally Important Plants in Nepal

Journal of Balkumari College ◽

10.3126/jbkc.v8i0.29311 ◽

2019 ◽

Vol 8 ◽

pp. 57-61

Author(s):

Sunil Bhandari ◽

Jay Bhandari ◽

Sanjay Lama

Keyword(s):

DNA Sequences ◽

Sequence Data ◽

DNA Barcode ◽

Sequence Information ◽

Terrestrial Plants ◽

Coding Region ◽

Global Database ◽

High Efficient ◽

Living Organisms

DNA barcoding is an emerging tool for species identification that uses internationally agreed protocols and regions of DNA to create a global database of living organisms. Initiatives are under way to generate DNA barcodes for all groups of living organisms and to make these genomic identities publicly available in order to understand, conserve, and utilize the world's biodiversity. Most terrestrial plants are characterized using two coding regions within the chloroplast genome: the more conserved rbcL gene and the more polymorphic matK gene. To create high-quality databases, each plant is characterized not only with the rbcL and matK DNA sequences; additional sequence information from the internal transcribed spacer (ITS) region makes the characterization more efficient. The quality of a barcode depends on various factors, such as efficient primers, the purity of the DNA templates, and the quality of the PCR amplicon from which the sequence data are derived. The protocol described here generated highly efficient PCR amplicons, which will help minimize erroneous DNA sequence information in the data from which bioinformatics procedures generate barcodes. The primer pairs used to amplify the matK, rbcL, and ITS sequences (MatK-413f-1 and MatK-1227r-1, rbcL-1F and rbcL-724R, and ITS1 and ITS4, respectively) each showed a strong amplification success rate of 80% in the tested medicinal plants of Nepal. This study proposes that these primer sets and amplification conditions will help, in part, in developing DNA barcodes for medicinally important plants of Nepal and in conserving their identity as native species.


Bit Reduction based Compression Algorithm for DNA Sequences

International Journal of Scientific Research in Science Engineering and Technology ◽

10.32628/ijsrset218529 ◽

2021 ◽

pp. 270-277

Author(s):

Rosario Gilmary ◽

Murugesan G

Keyword(s):

DNA Sequence ◽

Data Storage ◽

Compression Ratio ◽

DNA Sequences ◽

Genomic Data ◽

Living Organism ◽

Biological Sequence ◽

Compression Algorithms ◽

Living Organisms ◽

Abundant Data

Deoxyribonucleic acid (DNA) is the fundamental unit that bears the genetic instructions of a living organism. It is used in the growth and functioning of all known living organisms. Current DNA sequencing equipment produces enormous volumes of genomic data. Nucleotide databases such as GenBank grow two to three times larger annually. The increase in genomic data outstrips the increase in storage capacity. Such massive amounts of genomic data require efficient storage, fast transfer, and good performance. Compression algorithms are used to reduce the storage space and storage cost of this abundant data. Typical compression approaches perform poorly on these sequences, so novel compression algorithms have been introduced to achieve better compression ratios. Performance is compared in terms of compression ratio, which is based on the size of the compressed file, and compression/decompression time, the time taken to compress or decompress the sequence. In the proposed work, the input DNA sequence is compressed by reconstructing it in different formats: the sequence is first subjected to bit reduction, and the binary output is then converted to hexadecimal format and encoded. The compression ratio of the biological sequence is thereby improved.
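The bit-reduction and hexadecimal steps described in the abstract can be sketched roughly as below. The 2-bit code per base (A=00, C=01, G=10, T=11), the zero padding, and the function names are assumptions chosen for illustration; the abstract does not give the actual mapping, and the final encoding step it mentions is left out because it is not specified.

```python
# Assumed 2-bit code per base; the abstract does not give the actual mapping.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def compress(seq):
    """Return (hex string, base count) for a sequence made of A, C, G, T."""
    bits = "".join(BASE_TO_BITS[base] for base in seq.upper())
    bits += "0" * (-len(bits) % 4)  # pad with zeros to a whole hex digit
    hex_text = "".join(format(int(bits[i:i + 4], 2), "x") for i in range(0, len(bits), 4))
    return hex_text, len(seq)


def decompress(hex_text, n_bases):
    """Invert compress(); n_bases tells us how much of the tail is padding."""
    bits = "".join(format(int(digit, 16), "04b") for digit in hex_text)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, 2 * n_bases, 2))


if __name__ == "__main__":
    packed, n = compress("ACGTACG")  # toy sequence
    print(packed, n, decompress(packed, n))
```

In this sketch each hexadecimal digit carries two bases. If the same bits were written as raw bytes rather than hex text, each byte would hold four bases, compared with one base per byte in plain ASCII text.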


Characterization of an unusually conserved AluI highly reiterated DNA sequence family from the honeybee, Apis mellifera.

Genetics ◽

10.1093/genetics/134.4.1195 ◽

1993 ◽

Vol 134 (4) ◽

pp. 1195-1204

Author(s):

S Tarès ◽

J M Cornuet ◽

P Abad

Keyword(s):

Apis Mellifera ◽

DNA Sequence ◽

DNA Sequences ◽

Sequence Data ◽

Sequence Divergence ◽

Repeated Sequence ◽

Consensus Sequences ◽

DNA Sequence Data ◽

Repeat Class ◽

Honeybee Subspecies

An AluI family of highly reiterated nontranscribed sequences has been found in the genome of the honeybee Apis mellifera. This repeated sequence is shown to be present at approximately 23,000 copies per haploid genome, constituting about 2% of the total genomic DNA. The nucleotide sequence of 10 monomers was determined. The consensus sequence is 176 nucleotides long and has an A + T content of 58%. There are clusters of both direct and inverted repeats. Internal subrepeating units ranging from 11 to 17 nucleotides are observed, suggesting that it could have evolved from a shorter sequence. DNA sequence data reveal that this repeat class is unusually homogeneous compared to other classes of invertebrate highly reiterated DNA sequences. The average pairwise sequence divergence between the repeats is 2.5%. In spite of this unusual homogeneity, divergence has been found in the repeated sequence hybridization ladder between four different honeybee subspecies. Therefore, the AluI highly reiterated sequences provide a new probe for fingerprinting in A. m. mellifera.
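The figures quoted in the abstract can be cross-checked with a short calculation, and the notion of average pairwise divergence can be made concrete. In the sketch below, the three numeric inputs are the rounded values from the abstract; the implied genome size is only what those rounded values suggest, not a measured quantity. The divergence function, its name, and the toy sequences are assumptions for illustration and do not reproduce the authors' procedure or data.

```python
from itertools import combinations

copies_per_haploid_genome = 23_000  # "approximately 23,000 copies" (abstract)
monomer_length_nt = 176             # consensus length (abstract)
genome_fraction = 0.02              # "about 2%" (abstract)

# Total repeat DNA, and the haploid genome size implied by the 2% figure.
repeat_dna_nt = copies_per_haploid_genome * monomer_length_nt
implied_genome_nt = repeat_dna_nt / genome_fraction


def mean_pairwise_divergence(seqs):
    """Mean fraction of differing positions over all pairs of equal-length sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = [sum(a != b for a, b in zip(x, y)) / len(x) for x, y in pairs]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    print(repeat_dna_nt)       # total nucleotides in the repeat family
    print(implied_genome_nt)   # roughly 2 x 10^8 nucleotides
    print(mean_pairwise_divergence(["ACGT", "ACGA", "ACGT"]))  # toy sequences
```

With these rounded inputs the repeat family accounts for about 4 million nucleotides, and a divergence of 2.5% over a 176-nucleotide monomer corresponds to between four and five differing positions per pair on average.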


Traditional Infrageneric Classification of Gymnopilus Is Not Supported by Ribosomal DNA Sequence Data

Mycologia ◽

10.2307/3761920 ◽

2003 ◽

Vol 95 (6) ◽

pp. 1204 ◽

Cited By ~ 5

Author(s):

Laura Guzman-Davalos ◽

Gregory M. Mueller ◽

Joaquin Cifuentes ◽

Andrew N. Miller ◽

Anne Santerre

Keyword(s):

DNA Sequence ◽

Ribosomal DNA ◽

Sequence Data ◽

Infrageneric Classification ◽

DNA Sequence Data ◽

Ribosomal DNA Sequence


The genomic basis of medicine

Oxford Textbook of Medicine ◽

10.1093/med/9780199204854.003.40202 ◽

2010 ◽

pp. 136-151

Author(s):

Paweł Stankiewicz ◽

James R. Lupski

Keyword(s):

DNA Sequence ◽

DNA Content ◽

Sequence Data ◽

DNA Sequence Data ◽

Enormous Amount ◽

Living Organisms ◽

Genomic Basis ◽

Genetic Bases

During the last two decades it has become possible to determine the entire DNA content of living organisms—the genome. The completion of the human reference DNA sequence has provided an enormous amount of DNA sequence data and has extended our view of the genetic bases of disease....


Association of larvae and adults of Mexican species of Macrelmis (Coleoptera: Elmidae): a preliminary analysis using DNA sequences

Zootaxa ◽

10.11646/zootaxa.3361.1.5 ◽

2012 ◽

Vol 3361 (1) ◽

pp. 56-62 ◽

Cited By ~ 7

Author(s):

JOSEFINA CURIEL ◽

JUAN J. MORRONE

Keyword(s):

DNA Sequence ◽

DNA Sequences ◽

Preliminary Analysis ◽

Sequence Data ◽

Strong Association ◽

Life Stages ◽

Adult Morphology ◽

DNA Sequence Data ◽

Mexican Species ◽

Major Impediment

Insect life stages are known imperfectly in many cases, and classifications are usually based on adult morphology. This is unfortunate, as information on other life stages may be useful for biomonitoring. The major impediment to using elmid (Coleoptera) larvae for freshwater biomonitoring is the lack of larval descriptions and illustrations. Reliable molecular protocols may be used to associate larvae and adults. After adults of seven species of Mexican Macrelmis were identified morphologically, seven larval specimens were associated with them based on two gene fragments: Cox1 and Cob. The phylogenetic analysis allowed identifying the larval specimens as Macrelmis leonilae, M. scutellaris, M. species 7, M. species 10, and M. species 11. Two species based on adults associated uncertainly with one larva, and one larva did not match with any adult. Adult/larval association in elmids using DNA sequence data seems to be promising in terms of speed and reliability.


Discovery and Classification of Ecological Diversity in the Bacterial World: The Role of DNA Sequence Data

International Journal of Systematic Bacteriology ◽

10.1099/00207713-47-4-1145 ◽

1997 ◽

Vol 47 (4) ◽

pp. 1145-1156 ◽

Cited By ~ 251

Author(s):

T. Palys ◽

L. K. Nakamura ◽

F. M. Cohan

Keyword(s):

DNA Sequence ◽

Sequence Data ◽

Ecological Diversity ◽

DNA Sequence Data


Barcode DNA Edelweis (Anaphalis javanica) Berdasarkan Gen matK [DNA Barcode of Edelweiss (Anaphalis javanica) Based on the matK Gene]

Jurnal MIPA ◽

10.35799/jm.4.2.2015.9037 ◽

2015 ◽

Vol 4 (2) ◽

pp. 131

Author(s):

Muzakir Rahalus ◽

Maureen Kumaunang ◽

Audy Wuntu ◽

Julius Pontoh

Keyword(s):

Polymerase Chain Reaction ◽

DNA Sequences ◽

DNA Barcode ◽

Living Organism ◽

Chain Reaction ◽

Base Pairs ◽

Barcode DNA ◽

Reverse Primer ◽

Total DNA ◽

Polymerase Chain

A DNA barcode is a method for identifying living organisms using a short DNA sequence (approximately 500 base pairs). The purpose of this study was to obtain a DNA barcode for edelweiss (Anaphalis javanica) and to analyze the similarity of its matK gene with that of its closest relatives. Total DNA of edelweiss was successfully isolated using a modified version of the manual procedure for the InnuPrep Plant DNA Kit. A partial matK gene was isolated by polymerase chain reaction (PCR) using the forward primer matK-1RKIM-f and the reverse primer matK-3FKIM-r. Sequence analysis yielded an edelweiss DNA barcode 843 bp in size. Similarity analysis showed that the closest relative is A. margaritaceae, with 99.86% similarity in the BOLD System and 100% in NCBI.


Two new asexual genera and six new asexual species in the family Microthyriaceae (Dothideomycetes, Ascomycota) from China

MycoKeys ◽

10.3897/mycokeys.85.70829 ◽

2021 ◽

Vol 85 ◽

pp. 1-30

Author(s):

Min Qiao ◽

Hua Zheng ◽

Ji-Shu Guo ◽

Rafael F. Castañeda-Ruiz ◽

Jian-Ping Xu ◽

...

Keyword(s):

New Taxa ◽

DNA Sequences ◽

Sequence Data ◽

Southern China ◽

Phylogenetic Analyses ◽

Large Subunit ◽

Aquatic Hyphomycetes ◽

Internal Transcribed Spacers ◽

The Family

The family Microthyriaceae is represented by relatively few mycelial cultures and DNA sequences; as a result, the taxonomy and classification of this group of organisms remain poorly understood. During an investigation of the diversity of aquatic hyphomycetes from southern China, several isolates were collected. These isolates were cultured and sequenced, and a BLAST search of their LSU sequences against data in GenBank revealed that the most closely related taxa are in the genus Microthyrium. Phylogenetic analyses, based on the combined sequence data from the internal transcribed spacers (ITS) and the large subunit (LSU), revealed that these isolates represent eight new taxa in Microthyriaceae, including two new genera, Antidactylaria gen. nov. and Isthmomyces gen. nov., and six new species, Antidactylaria minifimbriata sp. nov., Isthmomyces oxysporus sp. nov., I. dissimilis sp. nov., I. macrosporus sp. nov., Triscelophorus anisopterioideus sp. nov. and T. sinensis sp. nov. These new taxa are described, illustrated for their morphologies and compared with similar taxa. In addition, two new combinations are proposed in this family.
