MT-MAG: Accurate and interpretable machine learning based taxonomic assignment of metagenome-assembled genomes, with a partial classification option

We propose MT-MAG, a novel machine learning-based taxonomic assignment tool for hierarchically-structured local classification of metagenome-assembled genomes (MAGs). MT-MAG is capable of classifying large and diverse real metagenomic datasets, having analyzed for this study a total of 240 Gbp of data in the training set, and 7 Gbp of data in the test set. MT-MAG is, to the best of our knowledge, the first machine learning method for taxonomic assignment of metagenomic data that offers a "partial classification" option. MT-MAG outputs complete or partial classification paths, and interpretable numerical confidences for its classifications, at all taxonomic ranks. MT-MAG is able to completely classify 48% more sequences than DeepMicrobes to the Species level (the only comparable taxonomic rank for DeepMicrobes), and it outperforms DeepMicrobes by an average of 33% in weighted accuracy, and by 89% in constrained accuracy.
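
The "partial classification" option can be pictured as a top-down walk of the taxonomy that stops at the first rank where the best child's confidence falls below a threshold. The sketch below is only an illustration of that idea, not MT-MAG's implementation: the function name `classify_path`, the 0.8 threshold, the toy taxonomy and every confidence value are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-node classifier: given a parent taxon, return the
# confidence assigned to each child taxon for the genome being classified.
Confidences = Dict[str, float]


def classify_path(
    child_confidences: Callable[[str], Confidences],
    root: str = "root",
    threshold: float = 0.8,
) -> Tuple[List[Tuple[str, float]], bool]:
    """Descend the taxonomy, stopping when confidence drops below threshold.

    Returns the (taxon, confidence) path and a flag that is True only when
    the walk reached a taxon with no children (a complete classification).
    """
    path: List[Tuple[str, float]] = []
    node = root
    while True:
        scores = child_confidences(node)
        if not scores:  # no children left: complete classification
            return path, True
        best = max(scores, key=lambda taxon: scores[taxon])
        if scores[best] < threshold:
            return path, False  # stop early: partial classification
        path.append((best, scores[best]))
        node = best


# Toy taxonomy with made-up confidences (an assumption, not real output).
TOY = {
    "root": {"Bacteria": 0.99, "Archaea": 0.01},
    "Bacteria": {"Firmicutes": 0.91, "Proteobacteria": 0.09},
    "Firmicutes": {"Bacilli": 0.55, "Clostridia": 0.45},
}

path, complete = classify_path(lambda node: TOY.get(node, {}))
```

With these toy numbers the walk stops after the phylum, because neither class reaches the threshold, so the result is a partial path with a confidence attached to each assigned rank.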

Classification of multiwavelength transients with Machine Learning

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa3873 ◽

2020 ◽

Author(s):

K Sooknunan ◽

M Lochner ◽

Bruce A Bassett ◽

H V Peiris ◽

R Fender ◽

...

Keyword(s):

Machine Learning ◽

Small Sample ◽

Light Curves ◽

Machine Learning Techniques ◽

Optical Data ◽

Test Time ◽

Test Accuracy ◽

Training Set ◽

The Impact

Abstract With the advent of powerful telescopes such as the Square Kilometer Array and the Vera C. Rubin Observatory, we are entering an era of multiwavelength transient astronomy that will lead to a dramatic increase in data volume. Machine learning techniques are well suited to address this data challenge and rapidly classify newly detected transients. We present a multiwavelength classification algorithm consisting of three steps: (1) interpolation and augmentation of the data using Gaussian processes; (2) feature extraction using wavelets; (3) classification with random forests. Augmentation provides improved performance at test time by balancing the classes and adding diversity into the training set. In the first application of machine learning to the classification of real radio transient data, we apply our technique to the Green Bank Interferometer and other radio light curves. We find we are able to accurately classify most of the eleven classes of radio variables and transients after just eight hours of observations, achieving an overall test accuracy of 78%. We fully investigate the impact of the small sample size of 82 publicly available light curves and use data augmentation techniques to mitigate the effect. We also show that, on a significantly larger simulated representative training set, the algorithm achieves an overall accuracy of 97%, illustrating that the method is likely to provide excellent performance on future surveys. Finally, we demonstrate the effectiveness of simultaneous multiwavelength observations by showing how incorporating just one optical data point into the analysis improves the accuracy of the worst performing class by 19%.
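
Step (2) of the pipeline, wavelet feature extraction, can be illustrated with a single-level Haar transform of an evenly sampled light curve. The abstract does not name the wavelet family or the decomposition depth, so the Haar wavelet, the function name `haar_dwt` and the flux values below are assumptions chosen only to keep the sketch free of dependencies.

```python
import math
from typing import List, Tuple


def haar_dwt(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar discrete wavelet transform.

    Returns (approximation, detail) coefficients. The input length must be
    even, which an interpolated light curve can be resampled to satisfy.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / scale for a, b in pairs]  # local averages
    detail = [(a - b) / scale for a, b in pairs]  # local differences
    return approx, detail


# Made-up, evenly sampled flux values standing in for an interpolated curve.
flux = [1.0, 1.0, 2.0, 4.0, 3.0, 3.0, 1.0, 0.0]
approx, detail = haar_dwt(flux)

# Concatenated coefficients form a fixed-length feature vector that a
# classifier such as a random forest could consume.
features = approx + detail
```

Because the transform is orthonormal, the feature vector carries the same total energy as the input samples, so no information is lost at this stage.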

Classification of Chroma Reconstruction Method by Machine Learning Method

2020 IEEE International Conference on Consumer Electronics - Taiwan (ICCE-Taiwan) ◽

10.1109/icce-taiwan49838.2020.9258255 ◽

2020 ◽

Author(s):

Meng-Hsuan Kuo ◽

Yu-Chen Shen ◽

Yih-Shyh Chiou ◽

Shih-Lun Chen ◽

Ting-Lan Lin

Keyword(s):

Machine Learning ◽

Reconstruction Method ◽

Machine Learning Method ◽

Learning Method

STATISTICAL INDICATORS-BASED MACHINE LEARNING METHOD FOR CLASSIFICATION OF VIBRATION SIGNALS

10.26678/abcm.cobem2021.cob2021-0579 ◽

2021 ◽

Author(s):

Gisele de Fátima Lima Camargo ◽

Eurípedes Nóbrega

Keyword(s):

Machine Learning ◽

Machine Learning Method ◽

Learning Method ◽

Vibration Signals ◽

Statistical Indicators

Methodology Proposal of ADHD Classification of Children Based on Cross Recurrence Plots

10.21203/rs.3.rs-163507/v1 ◽

2021 ◽

Author(s):

Marco Aceves-Fernandez

Keyword(s):

Machine Learning ◽

Spectral Distribution ◽

High Dimensional ◽

Control Group ◽

Machine Learning Method ◽

Learning Method ◽

Recurrence Plots ◽

Eeg Signals ◽

Power Spectral

Abstract Dealing with electroencephalogram (EEG) signals is often not easy. The lack of predictability and the complexity of such non-stationary, noisy and high-dimensional signals are challenging. Cross Recurrence Plots (CRP) have been used extensively to detect subtle changes in signals even when noise is embedded in the signal. In this contribution, a total of 121 children performed visual attention experiments, and a proposed methodology using CRP and a Welch Power Spectral Distribution was used to classify them as either having ADHD or belonging to the control group. Additional tools were presented to determine to what extent the proposed methodology is able to classify accurately and avoid misclassifications, thus demonstrating that this methodology is feasible for classifying EEG signals from subjects with ADHD. Lastly, the results were compared with a baseline machine learning method to show experimentally that this methodology is consistent and the results are repeatable.

Machine-learning with a small training set for classification of quantitative phase images of cancer cells (Conference Presentation)

AI and Optical Data Sciences ◽

10.1117/12.2550957 ◽

2020 ◽

Author(s):

Natan T. Shaked

Keyword(s):

Machine Learning ◽

Cancer Cells ◽

Training Set ◽

Quantitative Phase ◽

Phase Images

North American Hardwoods Identification Using Machine-Learning

Forests ◽

10.3390/f11030298 ◽

2020 ◽

Vol 11 (3) ◽

pp. 298 ◽

Cited By ~ 2

Author(s):

Dercilio Junior Verly Lopes ◽

Greg W. Burgreen ◽

Edward D. Entsminger

Keyword(s):

Machine Learning ◽

North American ◽

Mobile Application ◽

Cross Validation ◽

Data Augmentation ◽

Technical Note ◽

Machine Learning Method ◽

Training Set ◽

Hardwood Species ◽

Fold Cross Validation

This technical note determines the feasibility of using an InceptionV4_ResNetV2 convolutional neural network (CNN) to correctly identify hardwood species from macroscopic images. The method uses a commodity smartphone fitted with a 14× macro lens for photography. The end-grains of ten different North American hardwood species were photographed to create a dataset of 1869 images. The stratified 5-fold cross-validation machine-learning method was used, in which the number of testing samples varied from 341 to 342. Data augmentation was performed on-the-fly for each training set by rotating, zooming, and flipping images. It was found that the CNN could correctly identify hardwood species based on macroscopic images of their end-grain with an adjusted accuracy of 92.60%. Given the current growth of the machine-learning field, this model can be readily deployed in a mobile application for field wood identification.
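
Stratified k-fold cross-validation keeps the class proportions roughly equal in every fold. The sketch below shows one simple way to build such folds with the standard library. It is not the authors' code: the function name `stratified_folds`, the seed and the label counts (10 species with 12 images each) are hypothetical stand-ins for the real dataset.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_folds(labels: Sequence[str], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(k)]
    offset = 0  # carry the round-robin position across classes
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds


# Hypothetical labels: 10 species with 12 images each (not the real data).
labels = [f"species_{i}" for i in range(10) for _ in range(12)]
folds = stratified_folds(labels, k=5)

# Each fold serves once as the test set; the rest form the training set,
# which is where on-the-fly augmentation would be applied.
splits: List[Tuple[List[int], List[int]]] = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    splits.append((train_idx, test_idx))
```

Carrying the round-robin offset from one class to the next keeps the fold sizes balanced even when a class size is not a multiple of k.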

Exploring Guidelines for Classification of Major Heart Failure Subtypes by Using Machine Learning

Clinical Medicine Insights Cardiology ◽

10.4137/cmc.s18746 ◽

2015 ◽

Vol 9s1 ◽

pp. CMC.S18746 ◽

Cited By ~ 11

Author(s):

Amparo Alonso-Betanzos ◽

Verónica Bolón-Canedo ◽

Guy R. Heyndrickx ◽

Peter L.M. Kerkhof

Keyword(s):

Machine Learning ◽

Heart Failure ◽

Ventricular Volume ◽

Classical Type ◽

Curvilinear Relationship ◽

Support Vector ◽

Gray Zone ◽

Training Set ◽

Machine Method

Background Heart failure (HF) manifests as at least two subtypes. The current paradigm distinguishes the two by using both the metric ejection fraction (EF) and a constraint for end-diastolic volume. About half of all HF patients exhibit preserved EF. In contrast, the classical type of HF shows a reduced EF. Common practice often sets the cut-off point at or near EF = 50%, thus defining a linear divider. However, a rationale for this safe choice is lacking, while the assumption regarding applicability of strict linearity has not been justified. Additionally, some studies opt for eliminating patients from consideration for HF if 40 < EF < 50% (gray zone). Thus, there is a need for documented classification guidelines, solving gray zone ambiguity and formulating crisp delineation of transitions between phenotypes. Methods Machine learning (ML) models are applied to classify HF subtypes within the ventricular volume domain, rather than by the single use of EF. Various ML models, both unsupervised and supervised, are employed to establish a foundation for classification. Data regarding 48 HF patients are employed as the training set for subsequent classification of Monte Carlo–generated surrogate HF patients (n = 403). Next, we map consequences when the EF cut-off differs from 50% (as proposed for women) and analyze HF candidates not covered by current rules. Results The training set yields the best results for the Support Vector Machine method (test error 4.06%) and covers the gray zone and other clinically relevant HF candidates. End-systolic volume (ESV) emerges as a logical discriminator rather than EF as in the prevailing paradigm. Conclusions Selected ML models offer promise for classifying HF patients (including the gray zone), when driven by ventricular volume data. ML analysis indicates that ESV has a role in the development of guidelines to parse HF subtypes. The documented curvilinear relationship between EF and ESV suggests that the assumption concerning a linear EF divider may not be of general utility over the complete clinically relevant range.
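
To make the "linear divider" concrete: under the standard definition EF = (EDV − ESV) / EDV, which the abstract uses but does not spell out, a fixed EF cut-off corresponds to a straight line in the (ESV, EDV) volume plane. The sketch below only illustrates that arithmetic. The function names and the volumes are hypothetical, and nothing here reproduces the paper's ML models.

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """EF = (EDV - ESV) / EDV, with both volumes in millilitres."""
    if edv_ml <= 0 or not 0 <= esv_ml <= edv_ml:
        raise ValueError("need EDV > 0 and 0 <= ESV <= EDV")
    return (edv_ml - esv_ml) / edv_ml


def below_cutoff(edv_ml: float, esv_ml: float, cutoff: float = 0.5) -> bool:
    """True when EF falls below the cut-off.

    In the (ESV, EDV) plane, EF = cutoff is the straight line
    EDV = ESV / (1 - cutoff), so a fixed EF threshold acts as a linear
    divider in the volume domain.
    """
    return ejection_fraction(edv_ml, esv_ml) < cutoff


# Illustrative volumes only (not patient data).
ef_a = ejection_fraction(120.0, 48.0)   # EF of 0.60
ef_b = ejection_fraction(200.0, 130.0)  # EF of 0.35
```

With the 50% cut-off the dividing line is EDV = 2 × ESV. Moving the cut-off changes the slope of the line but not its straightness, which is the assumption the abstract questions.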

Reliable Crops Classification Using Limited Number of Sentinel-2 and Sentinel-1 Images

Remote Sensing ◽

10.3390/rs13163176 ◽

2021 ◽

Vol 13 (16) ◽

pp. 3176

Author(s):

Beata Hejmanowska ◽

Piotr Kramarczyk ◽

Ewa Głowienka ◽

Sławomir Mikrut

Keyword(s):

Machine Learning ◽

Random Forest ◽

Confusion Matrix ◽

High Accuracy ◽

Training Set ◽

Validation Data ◽

Comparative Accuracy ◽

The Difference ◽

Sentinel 2

The study analyzes whether a limited number of Sentinel-2 and Sentinel-1 images can be used to check if the crop declarations that EU farmers submit to receive subsidies are true. The declarations used in the research were randomly divided into two independent sets (training and test). Based on the training set, supervised classification of both single images and their combinations was performed using the random forest algorithm in SNAP (ESA) and our own Python scripts. A comparative accuracy analysis was performed on the basis of two forms of confusion matrix (full confusion matrix commonly used in remote sensing and binary confusion matrix used in machine learning) and various accuracy metrics (overall accuracy, accuracy, specificity, sensitivity, etc.). The highest overall accuracy (81%) was obtained in the simultaneous classification of multitemporal images (three Sentinel-2 and one Sentinel-1). An unexpectedly high accuracy (79%) was achieved in the classification of one Sentinel-2 image at the end of May 2018. Noteworthy is the fact that the accuracy of the random forest method trained on the entire training set equals 80%, while with the sampling method it is ca. 50%. Based on the analysis of various accuracy metrics, it can be concluded that the metrics used in machine learning, for example specificity and accuracy, are always higher than the overall accuracy. These metrics should be used with caution, because unlike the overall accuracy, to calculate these metrics, not only true positives but also false positives are used as positive results, giving the impression of higher accuracy. Correct calculation of overall accuracy values is essential for comparative analyses. Reporting the mean accuracy value for the classes as overall accuracy gives a false impression of high accuracy. In our case, the difference was 10–16% for the validation data, and 25–45% for the test data.
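
The gap between the two kinds of accuracy can be reproduced on a small example. Overall accuracy counts only the diagonal of the full confusion matrix, whereas the per-class binary (one-vs-rest) accuracy also counts true negatives as correct. The sketch below is only an illustration: the function names are hypothetical and the 3-class matrix is made up, with rows as true classes and columns as predicted classes.

```python
from typing import List

Matrix = List[List[int]]


def overall_accuracy(cm: Matrix) -> float:
    """Share of samples on the diagonal of the full confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def binary_accuracy(cm: Matrix, cls: int) -> float:
    """One-vs-rest accuracy for one class: (TP + TN) / all samples."""
    total = sum(sum(row) for row in cm)
    tp = cm[cls][cls]
    fn = sum(cm[cls]) - tp                 # true cls, predicted other
    fp = sum(row[cls] for row in cm) - tp  # true other, predicted cls
    tn = total - tp - fn - fp
    return (tp + tn) / total


# Made-up confusion matrix (rows: true class, columns: predicted class).
cm = [
    [40, 5, 5],
    [5, 40, 5],
    [5, 5, 40],
]

oa = overall_accuracy(cm)
per_class = [binary_accuracy(cm, c) for c in range(len(cm))]
mean_binary = sum(per_class) / len(per_class)
```

Here the overall accuracy is 120/150, while every per-class binary accuracy is 130/150. In single-label classification each error can count against a given class at most once, so the binary accuracy of a class can never fall below the overall accuracy, and averaging it over classes overstates performance.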
