scholarly journals MT-MAG: Accurate and interpretable machine learning based taxonomic assignment of metagenome-assembled genomes, with a partial classification option

2022 ◽  
Author(s):  
Wanxin Li ◽  
Lila Kari ◽  
Yaoliang Yu ◽  
Laura A Hug

We propose MT-MAG, a novel machine learning-based taxonomic assignment tool for hierarchically-structured local classification of metagenome-assembled genomes (MAGs). MT-MAG is capable of classifying large and diverse real metagenomic datasets, having analyzed for this study a total of 240 Gbp of data in the training set, and 7 Gbp of data in the test set. MT-MAG is, to the best of our knowledge, the first machine learning method for taxonomic assignment of metagenomic data that offers a "partial classification" option. MT-MAG outputs complete or a partial classification paths, and interpretable numerical classification confidences of its classifications, at all taxonomic ranks. MT-MAG is able to completely classify 48% more sequences than DeepMicrobes to the Species level (the only comparable taxonomic rank for DeepMicrobes), and it outperforms DeepMicrobes by an average of 33% in weighted accuracy, and by 89% in constrained accuracy.

Author(s):  
K Sooknunan ◽  
M Lochner ◽  
Bruce A Bassett ◽  
H V Peiris ◽  
R Fender ◽  
...  

Abstract With the advent of powerful telescopes such as the Square Kilometer Array and the Vera C. Rubin Observatory, we are entering an era of multiwavelength transient astronomy that will lead to a dramatic increase in data volume. Machine learning techniques are well suited to address this data challenge and rapidly classify newly detected transients. We present a multiwavelength classification algorithm consisting of three steps: (1) interpolation and augmentation of the data using Gaussian processes; (2) feature extraction using wavelets; (3) classification with random forests. Augmentation provides improved performance at test time by balancing the classes and adding diversity into the training set. In the first application of machine learning to the classification of real radio transient data, we apply our technique to the Green Bank Interferometer and other radio light curves. We find we are able to accurately classify most of the eleven classes of radio variables and transients after just eight hours of observations, achieving an overall test accuracy of 78%. We fully investigate the impact of the small sample size of 82 publicly available light curves and use data augmentation techniques to mitigate the effect. We also show that on a significantly larger simulated representative training set that the algorithm achieves an overall accuracy of 97%, illustrating that the method is likely to provide excellent performance on future surveys. Finally, we demonstrate the effectiveness of simultaneous multiwavelength observations by showing how incorporating just one optical data point into the analysis improves the accuracy of the worst performing class by 19%.


2021 ◽  
Author(s):  
Marco Aceves-Fernandez

Abstract Dealing with electroencephalogram signals (EEG) are often not easy. The lack of predicability and complexity of such non-stationary, noisy and high dimensional signals is challenging. Cross Recurrence Plots (CRP) have been used extensively to deal with detecting subtle changes in signals even when the noise is embedded in the signal. In this contribution, a total of 121 children performed visual attention experiments and a proposed methodology using CRP and a Welch Power Spectral Distribution have been used to classify then between those who have ADHD and the control group. Additional tools were presented to determine to which extent the proposed methodology is able to classify accurately and avoid misclassifications, thus demonstrating that this methodology is feasible to classify EEG signals from subjects with ADHD. Lastly, the results were compared with a baseline machine learning method to prove experimentally that this methodology is consistent and the results repeatable.


Forests ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 298 ◽  
Author(s):  
Dercilio Junior Verly Lopes ◽  
Greg W. Burgreen ◽  
Edward D. Entsminger

This technical note determines the feasibility of using an InceptionV4_ResNetV2 convolutional neural network (CNN) to correctly identify hardwood species from macroscopic images. The method is composed of a commodity smartphone fitted with a 14× macro lens for photography. The end-grains of ten different North American hardwood species were photographed to create a dataset of 1869 images. The stratified 5-fold cross-validation machine-learning method was used, in which the number of testing samples varied from 341 to 342. Data augmentation was performed on-the-fly for each training set by rotating, zooming, and flipping images. It was found that the CNN could correctly identify hardwood species based on macroscopic images of its end-grain with an adjusted accuracy of 92.60%. With the current growing of machine-learning field, this model can then be readily deployed in a mobile application for field wood identification.


2015 ◽  
Vol 9s1 ◽  
pp. CMC.S18746 ◽  
Author(s):  
Amparo Alonso-Betanzos ◽  
Verónica Bolón-Canedo ◽  
Guy R. Heyndrickx ◽  
Peter L.M. Kerkhof

Background Heart failure (HF) manifests as at least two subtypes. The current paradigm distinguishes the two by using both the metric ejection fraction (EF) and a constraint for end-diastolic volume. About half of all HF patients exhibit preserved EF. In contrast, the classical type of HF shows a reduced EF. Common practice sets the cut-off point often at or near EF = 50%, thus defining a linear divider. However, a rationale for this safe choice is lacking, while the assumption regarding applicability of strict linearity has not been justified. Additionally, some studies opt for eliminating patients from consideration for HF if 40 < EF < 50% (gray zone). Thus, there is a need for documented classification guidelines, solving gray zone ambiguity and formulating crisp delineation of transitions between phenotypes. Methods Machine learning (ML) models are applied to classify HF subtypes within the ventricular volume domain, rather than by the single use of EF. Various ML models, both unsupervised and supervised, are employed to establish a foundation for classification. Data regarding 48 HF patients are employed as training set for subsequent classification of Monte Carlo–generated surrogate HF patients ( n = 403). Next, we map consequences when EF cut-off differs from 50% (as proposed for women) and analyze HF candidates not covered by current rules. Results The training set yields best results for the Support Vector Machine method (test error 4.06%), covers the gray zone, and other clinically relevant HF candidates. End-systolic volume (ESV) emerges as a logical discriminator rather than EF as in the prevailing paradigm. Conclusions Selected ML models offer promise for classifying HF patients (including the gray zone), when driven by ventricular volume data. ML analysis indicates that ESV has a role in the development of guidelines to parse HF subtypes. The documented curvilinear relationship between EF and ESV suggests that the assumption concerning a linear EF divider may not be of general utility over the complete clinically relevant range.


2021 ◽  
Vol 13 (16) ◽  
pp. 3176
Author(s):  
Beata Hejmanowska ◽  
Piotr Kramarczyk ◽  
Ewa Głowienka ◽  
Sławomir Mikrut

The study presents the analysis of the possible use of limited number of the Sentinel-2 and Sentinel-1 to check if crop declarations that the EU farmers submit to receive subsidies are true. The declarations used in the research were randomly divided into two independent sets (training and test). Based on the training set, supervised classification of both single images and their combinations was performed using random forest algorithm in SNAP (ESA) and our own Python scripts. A comparative accuracy analysis was performed on the basis of two forms of confusion matrix (full confusion matrix commonly used in remote sensing and binary confusion matrix used in machine learning) and various accuracy metrics (overall accuracy, accuracy, specificity, sensitivity, etc.). The highest overall accuracy (81%) was obtained in the simultaneous classification of multitemporal images (three Sentinel-2 and one Sentinel-1). An unexpectedly high accuracy (79%) was achieved in the classification of one Sentinel-2 image at the end of May 2018. Noteworthy is the fact that the accuracy of the random forest method trained on the entire training set is equal 80% while using the sampling method ca. 50%. Based on the analysis of various accuracy metrics, it can be concluded that the metrics used in machine learning, for example: specificity and accuracy, are always higher then the overall accuracy. These metrics should be used with caution, because unlike the overall accuracy, to calculate these metrics, not only true positives but also false positives are used as positive results, giving the impression of higher accuracy. Correct calculation of overall accuracy values is essential for comparative analyzes. Reporting the mean accuracy value for the classes as overall accuracy gives a false impression of high accuracy. In our case, the difference was 10–16% for the validation data, and 25–45% for the test data.


Sign in / Sign up

Export Citation Format

Share Document