Alignment-free classification of COI DNA barcode data with the Python package Alfie

2020 ◽  
Vol 4 ◽  
Author(s):  
Cameron M. Nugent ◽  
Sarah J. Adamowicz

Characterization of biodiversity from environmental DNA samples and bulk metabarcoding data is hampered by off-target sequences that can confound conclusions about a taxonomic group of interest. Existing methods for isolation of target sequences rely on alignment to existing reference barcodes, but this can bias results against novel genetic variants. Effectively parsing targeted DNA barcode data from off-target noise improves the quality of biodiversity estimates and biological conclusions by limiting subsequent analyses to a relevant subset of available data. Here, we present Alfie, a Python package for the alignment-free classification of cytochrome c oxidase subunit I (COI) DNA barcode sequences to taxonomic kingdoms. The package determines k-mer frequencies of DNA sequences, and the frequencies serve as input for a neural network classifier that was trained and tested using ~58,000 publicly available COI sequences. The classifier was designed and optimized through a series of tests that allowed for the optimal set of DNA k-mer features and optimal machine learning algorithm to be selected. The neural network classifier rapidly assigns COI sequences of varying lengths to kingdoms with greater than 99% accuracy and is shown to generalize effectively and make accurate predictions about data from previously unseen taxonomic classes. The package contains an application programming interface that allows the Alfie package’s functionality to be extended to different DNA sequence classification tasks to suit a user’s need, including classification of different genes and barcodes, and classification to different taxonomic levels. Alfie is free and publicly available through GitHub (https://github.com/CNuge/alfie) and the Python package index (https://pypi.org/project/alfie/).
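The k-mer frequency step described in the abstract can be illustrated with a minimal sketch. This is not Alfie's own API: the function name `kmer_frequencies`, the default `k=4`, the lexicographic A/C/G/T feature order, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return relative k-mer frequencies of a DNA sequence.

    Hypothetical helper for illustration only (not part of Alfie).
    Features are ordered lexicographically over the alphabet ACGT,
    so the output always has 4**k entries regardless of sequence length.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    # Slide a window of width k across the sequence.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGTNACGT", k=2)
    print(len(features), round(sum(features), 6))
```

Because the feature vector has a fixed length of 4**k, sequences of different lengths map to inputs of the same size, which is what lets a single classifier handle them without alignment.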


2016 ◽  
Vol 2016 ◽  
pp. 1-15 ◽  
Author(s):  
I. Jasmine Selvakumari Jeya ◽  
S. N. Deepa

A real coded genetic algorithm based radial basis function neural network classifier is proposed for the classification of healthy and cancer-affected lung images. The Real Coded Genetic Algorithm (RCGA) is proposed to overcome the Hamming cliff problem encountered with the Binary Coded Genetic Algorithm (BCGA). A Radial Basis Function Neural Network (RBFNN) is chosen as the classifier model because of its Gaussian kernel function and its effective learning process, which avoids the local and global minima problem and enables faster convergence. This paper focuses specifically on tuning the weights and bias of the RBFNN classifier with the proposed RCGA. The operators used in RCGA enable the algorithm to compute weight and bias values such that the minimum Mean Square Error (MSE) is obtained. Using both healthy and cancer lung images from the Lung Image Database Consortium (LIDC) database and a real-time database, the proposed RCGA-based RBFNN classifier is shown to classify healthy lung tissues and cancer-affected lung nodules effectively. The classification accuracy obtained with the proposed approach is higher than that of classifiers proposed earlier in the literature.
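The idea of tuning RBFNN weights and bias with a real-coded genetic algorithm can be sketched as follows. This is only an illustrative sketch, not the authors' method: the abstract does not name the operators, so tournament selection, arithmetic crossover, Gaussian mutation, and elitism are assumptions, as are the fixed centers and width, all function names, and the toy XOR data.

```python
import math
import random


def rbf_output(x, centers, width, params):
    """Gaussian RBF network output; the last entry of params is the bias."""
    acts = [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2 * width ** 2))
        for c in centers
    ]
    return sum(w * a for w, a in zip(params, acts)) + params[-1]


def mse(params, data, centers, width):
    """Mean square error of the network over (input, target) pairs."""
    errors = [(rbf_output(x, centers, width, params) - y) ** 2 for x, y in data]
    return sum(errors) / len(errors)


def rcga(data, centers, width, pop_size=30, generations=100, seed=0):
    """Evolve real-valued weight/bias vectors to minimise MSE.

    Chromosomes are plain lists of floats, so no binary encoding
    (and hence no Hamming cliff) is involved.
    """
    rng = random.Random(seed)
    n_genes = len(centers) + 1  # one weight per center plus a bias

    def fitness(p):
        return mse(p, data, centers, width)

    pop = [[rng.uniform(-1, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness)
        history.append(fitness(pop[0]))
        next_pop = [pop[0][:]]  # elitism: keep the best individual unchanged
        while len(next_pop) < pop_size:
            # Tournament selection of two parents (assumed operator).
            a = min(rng.sample(pop, 3), key=fitness)
            b = min(rng.sample(pop, 3), key=fitness)
            # Arithmetic crossover on real-valued genes (assumed operator).
            alpha = rng.random()
            child = [alpha * ga + (1 - alpha) * gb for ga, gb in zip(a, b)]
            # Gaussian mutation with probability 0.2 per gene (assumed operator).
            child = [g + rng.gauss(0, 0.1) if rng.random() < 0.2 else g for g in child]
            next_pop.append(child)
        pop = next_pop
    best = min(pop, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    # Toy XOR data standing in for image features (illustrative only).
    xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    centers = [x for x, _ in xor]
    best, history = rcga(xor, centers, width=0.5)
    print(history[0], history[-1])
```

With elitism the best MSE recorded in `history` never increases from one generation to the next, which makes convergence easy to inspect.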


2006 ◽  
Vol 46 (3) ◽  
pp. 1479-1490 ◽  
Author(s):  
Joji M. Otaki ◽  
Akihito Mori ◽  
Yoshimasa Itoh ◽  
Takashi Nakayama ◽  
Haruhiko Yamamoto

2017 ◽  
Vol 1 (4) ◽  
pp. 271-277 ◽  
Author(s):  
Abdullah Caliskan ◽  
Mehmet Emin Yuksel

In this study, a deep neural network classifier is proposed for the classification of coronary artery disease (CAD) medical data sets. The proposed classifier is tested on reference CAD data sets from the literature and its classification performance is compared with that of popular representative classification methods. Experimental results show that the deep neural network classifier offers much better accuracy, sensitivity, and specificity rates than the other methods. The proposed method presents itself as an easily accessible and cost-effective alternative to existing methods used for the diagnosis of CAD, and it can be applied to check easily whether a given subject under examination has at least one occluded coronary artery.
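The three rates named in the abstract follow the standard confusion-matrix definitions, which can be sketched as below. The function name `binary_metrics`, the 1 = diseased / 0 = healthy label convention, and the example labels are assumptions for illustration and are not taken from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels.

    Assumed convention: 1 = positive (e.g. CAD present), 0 = negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Sensitivity: fraction of true positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives that were correctly cleared.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity


if __name__ == "__main__":
    # Made-up labels purely to exercise the function.
    print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Reporting sensitivity and specificity alongside accuracy matters for diagnostic data sets, since accuracy alone can look high when one class dominates.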

