Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data

Detection Sensitivity ◽

Prediction Errors ◽

Metagenomic Sequencing ◽

Learning Approaches ◽

Microbial Abundance ◽

Classification Models ◽

Analytical Approaches

Abstract Background: The advent of metagenomic sequencing provides microbial abundance patterns that can be leveraged for sample origin prediction. Supervised machine learning classification approaches have been reported to predict sample origin accurately when the origin has been previously sampled. Using metagenomic datasets provided by the 2019 CAMDA challenge, we evaluated the influence of technical, analytical and machine learning approaches for result interpretation and source prediction of new origins. Results: Comparison between 16S rRNA amplicon and shotgun sequencing approaches as well as metagenomic analytical tools showed differences in measured microbial abundance of the same samples, especially for organisms present at low abundance. Shotgun sequence data analyzed using Kraken2 and Bracken taxonomic annotation had higher detection sensitivity than did other methods. As classification models are limited to labeling previously trained origins, we proposed an alternative approach using Lasso-regularized multivariate regression to predict geographic coordinates for comparison. In both models, the prediction errors were much higher in Leave-1-city-out than in 10-fold cross validation; the former realistically forecasted that samples from new origins are harder to predict accurately than samples from pre-trained origins. The challenge was further confirmed using mystery samples obtained from new origins. Overall, prediction performances between regression and classification models, as measured by mean squared error, were comparable on mystery samples. Due to higher prediction errors for samples from new origins, we provided an additional strategy based on prediction ambiguity to infer whether a sample is from a new origin for practical applications. Lastly, we showed increased prediction error when data from a different sequencing protocol were included as training data.
Conclusions: Here we highlighted the capacity of predicting sample origin accurately with pre-trained origins and the challenge of predicting new origins through both regression and classification models. Overall, the work provided a summary evaluation of sequencing techniques, protocol, taxonomic analytical approaches, and machine learning approaches to inform future designs in metagenomic prediction of sample origin.

Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data

Biology Direct ◽

10.1186/s13062-020-00287-y ◽

2020 ◽

Vol 15 (1) ◽

Author(s):

Julie Chih-yu Chen ◽

Andrea D. Tyler

Keyword(s):

Machine Learning ◽

Prediction Error ◽

Detection Sensitivity ◽

Metagenomic Sequencing ◽

Learning Approaches ◽

Microbial Abundance ◽

Classification Models ◽

The Impact ◽

Analytical Approaches

Abstract Background: The advent of metagenomic sequencing provides microbial abundance patterns that can be leveraged for sample origin prediction. Supervised machine learning classification approaches have been reported to predict sample origin accurately when the origin has been previously sampled. Using metagenomic datasets provided by the 2019 CAMDA challenge, we evaluated the influence of variable technical, analytical and machine learning approaches for result interpretation and novel source prediction. Results: Comparison between 16S rRNA amplicon and shotgun sequencing approaches as well as metagenomic analytical tools showed differences in normalized microbial abundance, especially for organisms present at low abundance. Shotgun sequence data analyzed using Kraken2 and Bracken for taxonomic annotation had higher detection sensitivity. As classification models are limited to labeling pre-trained origins, we took an alternative approach using Lasso-regularized multivariate regression to predict geographic coordinates for comparison. In both models, the prediction errors were much higher in Leave-1-city-out than in 10-fold cross validation; the former realistically forecasted the increased difficulty in accurately predicting samples from new origins. This challenge was further confirmed when applying the model to a set of samples obtained from new origins. Overall, the prediction performance of the regression and classification models, as measured by mean squared error, was comparable on mystery samples. Due to higher prediction error rates for samples from new origins, we provided an additional strategy based on prediction ambiguity to infer whether a sample is from a new origin. Lastly, we report increased prediction error when data from different sequencing protocols were included as training data.
Conclusions: Herein, we highlight the capacity of predicting sample origin accurately with pre-trained origins and the challenge of predicting new origins through both regression and classification models. Overall, this work provides a summary of the impact of sequencing technique, protocol, taxonomic analytical approaches, and machine learning approaches on the use of metagenomics for prediction of sample origin.
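
The contrast between Leave-1-city-out and 10-fold cross validation described in this abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`leave_one_group_out`, `k_fold`) and the toy city labels are hypothetical and only show why holding out a whole city mimics prediction for a new origin, while ordinary k-fold splits let the test city also appear in training.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx), holding out one whole group per fold."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for g in sorted(by_group):
        test = by_group[g]
        train = [i for i in range(len(groups)) if groups[i] != g]
        yield g, train, test


def k_fold(n_samples, k):
    """Yield (train_idx, test_idx) for k folds assigned by index modulo k, ignoring groups."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


# Hypothetical city label for each of six samples (illustrative only).
cities = ["A", "A", "B", "B", "C", "C"]

logo_folds = list(leave_one_group_out(cities))
kfold_folds = list(k_fold(len(cities), 3))

for city, train, test in logo_folds:
    # The held-out city never appears among the training samples.
    print(city, "train:", train, "test:", test)
```

With the group-wise split, every test fold contains a city the model has never seen, which is the harder and more realistic setting the abstract reports higher errors for.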

Comparison of Implicit vs. Explicit Regime Identification in Machine Learning Methods for Solar Irradiance Prediction

Energies ◽

10.3390/en13030689 ◽

2020 ◽

Vol 13 (3) ◽

pp. 689 ◽

Cited By ~ 6

Author(s):

Tyler McCandless ◽

Susan Dettling ◽

Sue Ellen Haupt

Keyword(s):

Machine Learning ◽

Solar Power ◽

Network Models ◽

Machine Learning Techniques ◽

Validation Dataset ◽

Prediction Errors ◽

Learning Approaches ◽

Power Prediction ◽

Neural Network Models

This work compares the solar power forecasting performance of tree-based methods that include implicit regime-based models to explicit regime separation methods that utilize both unsupervised and supervised machine learning techniques. Previous studies have shown an improvement utilizing a regime-based machine learning approach in a climate with diverse cloud conditions. This study compares the machine learning approaches for solar power prediction at the Shagaya Renewable Energy Park in Kuwait, which is in an arid desert climate characterized by abundant sunshine. The regime-dependent artificial neural network models undergo a comprehensive parameter and hyperparameter tuning analysis to minimize the prediction errors on a test dataset. The final results that compare the different methods are computed on an independent validation dataset. The results show that the tree-based method, the regression model tree approach, performs better than the explicit regime-dependent approach. These results appear to be a function of the predominantly sunny conditions, which limit the ability of an unsupervised technique to separate regimes for which the relationship between the predictors and the predictand would differ for the supervised learning technique.

Comparing Statistical and Machine Learning Classifiers: Alternatives for Predictive Modeling in Human Factors Research

Human Factors: The Journal of the Human Factors and Ergonomics Society ◽

10.1518/hfes.45.3.408.27248 ◽

2003 ◽

Vol 45 (3) ◽

pp. 408-423 ◽

Cited By ~ 6

Author(s):

Brian Carnahan ◽

Gérard Meyer ◽

Lois-Ann Kuntz

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Discriminant Analysis ◽

Human Factors ◽

Predictive Accuracy ◽

Performance Outcomes ◽

Learning Approaches ◽

Classification Models ◽

Human Factors Research

Multivariate classification models play an increasingly important role in human factors research. In the past, these models have been based primarily on discriminant analysis and logistic regression. Models developed from machine learning research offer the human factors professional a viable alternative to these traditional statistical classification methods. To illustrate this point, two machine learning approaches - genetic programming and decision tree induction - were used to construct classification models designed to predict whether or not a student truck driver would pass his or her commercial driver license (CDL) examination. The models were developed and validated using the curriculum scores and CDL exam performances of 37 student truck drivers who had completed a 320-hr driver training course. Results indicated that the machine learning classification models were superior to discriminant analysis and logistic regression in terms of predictive accuracy. Actual or potential applications of this research include the creation of models that more accurately predict human performance outcomes.

A Comparison of Traditional Machine Learning Approaches for Supervised Feedback Classification in Bahasa Indonesia

International Journal of New Media Technology ◽

10.31937/ijnmt.v1i1.1485 ◽

2020 ◽

Vol 7 (1) ◽

pp. 28-32

Author(s):

Andre Rusli ◽

Alethea Suryadibrata ◽

Samiaji Bintang Nusantara ◽

Julio Christian Young

Keyword(s):

Machine Learning ◽

Language Processing ◽

Text Classification ◽

Weighted Average ◽

Learning Approaches ◽

K Nearest Neighbors ◽

Logistic Regression ◽

Learning Machine

The advancement of machine learning and natural language processing techniques holds essential opportunities to improve existing software engineering activities, including the requirements engineering activity. Instead of manually reading all submitted user feedback to understand the evolving requirements of their product, developers could use the help of an automatic text classification program to reduce the required effort. Many supervised machine learning approaches have already been used in many fields of text classification and show promising results in terms of performance. This paper aims to implement NLP techniques for basic text preprocessing, followed by traditional (non-deep learning) machine learning classification algorithms: Logistic Regression, Decision Tree, Multinomial Naïve Bayes, K-Nearest Neighbors, Linear SVC, and Random Forest. Finally, the performance of each algorithm in classifying the feedback in our dataset into several categories is evaluated using three F1 Score metrics: the macro-, micro-, and weighted-average F1 Score. Results show that, generally, Logistic Regression is the most suitable classifier in most cases, followed by Linear SVC. However, the performance gap is not large, and with different configurations and requirements, other classifiers could perform equally well or even better.
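
The three F1 averages named in this abstract differ only in how per-class results are combined. The following is a minimal sketch using the standard definitions, not the paper's code; the function name `f1_averages` and the toy feedback labels are hypothetical.

```python
from collections import Counter


def f1_averages(y_true, y_pred):
    """Return macro-, micro-, and weighted-average F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    return {
        # Macro: unweighted mean of per-class F1.
        "macro": sum(per_class.values()) / len(labels),
        # Micro: F1 computed from counts pooled over all classes.
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # Weighted: per-class F1 weighted by the number of true samples per class.
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
    }


# Hypothetical feedback categories (illustrative only).
y_true = ["bug", "bug", "bug", "feature", "other"]
y_pred = ["bug", "bug", "feature", "feature", "bug"]
scores = f1_averages(y_true, y_pred)
print(scores)
```

On imbalanced data the macro average is pulled down by rare classes that are predicted poorly (here, "other"), whereas the micro average equals plain accuracy in the single-label case.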

Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition

10.26434/chemrxiv.5513581.v1 ◽

2017 ◽

Author(s):

Sabrina Jaeger ◽

Simone Fulle ◽

Samo Turk

Keyword(s):

Machine Learning ◽

Language Processing ◽

Learning Approach ◽

Learning Approaches ◽

Unsupervised Machine Learning ◽

Feature Representations ◽

Machine Learning Approach ◽

The Individual ◽

Vector Representations

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similar to the Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up the vectors of the individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can thus also be easily used for proteins with low sequence similarities.
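
The step of encoding a compound by summing its substructure vectors can be sketched as follows. This is an illustration under stated assumptions, not the Mol2vec library's API: the function `compound_vector`, the substructure identifiers, and the two-dimensional toy embeddings are all hypothetical, and skipping unseen substructures is a simplification chosen for this sketch.

```python
def compound_vector(substructure_ids, embeddings, dim):
    """Encode a compound as the element-wise sum of its substructure vectors."""
    total = [0.0] * dim
    for sid in substructure_ids:
        vec = embeddings.get(sid)
        if vec is None:
            # Assumption for this sketch: substructures without an embedding are skipped.
            continue
        for j in range(dim):
            total[j] += vec[j]
    return total


# Hypothetical pre-trained substructure embeddings (toy 2-dimensional vectors).
toy_embeddings = {
    "s1": [1.0, 0.0],
    "s2": [0.5, 0.5],
    "s3": [0.0, 2.0],
}

# A compound described by the substructures it contains; repeats count each time.
encoded = compound_vector(["s1", "s2", "s2"], toy_embeddings, dim=2)
print(encoded)  # [2.0, 1.0]
```

The resulting fixed-length vector is dense regardless of how many substructures the compound has, which is what lets it be passed directly to a supervised model.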

Machine Learning Classification and Regression Approaches for Optical Network Traffic Prediction

Electronics ◽

10.3390/electronics10131578 ◽

2021 ◽

Vol 10 (13) ◽

pp. 1578

Author(s):

Daniel Szostak ◽

Adam Włodarczyk ◽

Krzysztof Walkowiak

Keyword(s):

Machine Learning ◽

Optical Networks ◽

Network Traffic ◽

Optical Network ◽

Optimization Methods ◽

Traffic Prediction ◽

Classification And Regression ◽

Network Technologies

Rapid growth of network traffic creates the need to develop new network technologies. Artificial intelligence provides suitable tools to improve currently used network optimization methods. In this paper, we propose a procedure for network traffic prediction. Based on the characteristics of optical networks (and other network technologies), we focus on the prediction of fixed bitrate levels called traffic levels. We develop and evaluate two approaches based on different supervised machine learning (ML) methods: classification and regression. We examine four different ML models with various selected features. The tested datasets are based on real traffic patterns provided by the Seattle Internet Exchange Point (SIX). The results are analyzed using a new quality metric, which allows researchers to find the best forecasting algorithm in terms of network resource usage and operational costs. Our research shows that regression provides better results than classification for all analyzed datasets. Additionally, the final choice of the most appropriate ML algorithm and model should depend on the network operator's expectations.

Prediction of activity and selectivity profiles of human Carbonic Anhydrase inhibitors using machine learning classification models

Journal of Cheminformatics ◽

10.1186/s13321-021-00499-y ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Annachiara Tinivella ◽

Luca Pinzi ◽

Giulio Rastelli

Keyword(s):

Machine Learning ◽

Carbonic Anhydrase ◽

A Priori ◽

Selective Inhibition ◽

Great Promise ◽

Classification Models ◽

Central Interest ◽

Human Carbonic Anhydrase ◽

In Silico Models

Abstract The development of selective inhibitors of the clinically relevant human Carbonic Anhydrase (hCA) isoforms IX and XII has become a major topic in drug research, due to their deregulation in several types of cancer. Indeed, the selective inhibition of these two isoforms, especially with respect to the homeostatic isoform II, holds great promise to develop anticancer drugs with limited side effects. Therefore, the development of in silico models able to predict the activity and selectivity against the desired isoform(s) is of central interest. In this work, we have developed a series of machine learning classification models, trained on high confidence data extracted from ChEMBL, able to predict the activity and selectivity profiles of ligands for human Carbonic Anhydrase isoforms II, IX and XII. The training datasets were built with a procedure that made use of flexible bioactivity thresholds to obtain well-balanced active and inactive classes. We used multiple algorithms and sampling sizes to finally select activity models able to classify active or inactive molecules with excellent performance. Remarkably, the results reported herein turned out to be better than those obtained by models built with the classic approach of selecting an a priori activity threshold. The sequential application of such validated models enables virtual screening to be performed in a fast and more reliable way to predict the activity and selectivity profiles against the investigated isoforms.

Source allocation of per- and polyfluoroalkyl substances (PFAS) with supervised machine learning: Classification performance and the role of feature selection in an expanded dataset

Chemosphere ◽

10.1016/j.chemosphere.2021.130124 ◽

2021 ◽

Vol 275 ◽

pp. 130124

Author(s):

Tohren C.G. Kibbey ◽

Rafal Jabrzemski ◽

Denis M. O’Carroll

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Classification Performance ◽

Polyfluoroalkyl Substances ◽

Source Allocation

Supervised learning for the detection of negation and of its scope in French and Brazilian Portuguese biomedical corpora

Natural Language Engineering ◽

10.1017/s1351324920000352 ◽

2020 ◽

pp. 1-21 ◽

Cited By ~ 2

Author(s):

Clément Dalloux ◽

Vincent Claveau ◽

Natalia Grabar ◽

Lucas Emanuel Silva Oliveira ◽

Claudia Maria Cabral Moro ◽

...

Keyword(s):

Machine Learning ◽

Information Extraction ◽

State Of The Art ◽

Automatic Detection ◽

Brazilian Portuguese ◽

Biomedical Domain ◽

Learning Approaches ◽

Cross Domain ◽

Automatic Methods

Abstract Automatic detection of negated content is often a prerequisite in information extraction systems in various domains. In the biomedical domain especially, this task is important because negation plays an important role. In this work, two main contributions are proposed. First, we work with languages which have been poorly addressed up to now: Brazilian Portuguese and French. Thus, we developed new corpora for these two languages which have been manually annotated for marking up the negation cues and their scope. Second, we propose automatic methods based on supervised machine learning approaches for the automatic detection of negation marks and of their scopes. The methods prove to be robust in both languages (Brazilian Portuguese and French) and in cross-domain (general and biomedical languages) contexts. The approach is also validated on English data from the state of the art: it yields very good results and outperforms other existing approaches. In addition, the application is accessible and usable online. We assume that, through these contributions (new annotated corpora, application accessible online, and cross-domain robustness), the reproducibility of the results and the robustness of the NLP applications will be augmented.

Evaluating disaster-related tweet credibility using content-based and user-based features

Information Discovery and Delivery ◽

10.1108/idd-04-2020-0044 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Nasser Assery ◽

Yuan (Dorothy) Xiaohong ◽

Qu Xiuli ◽

Roy Kaushik ◽

Sultan Almalki

Keyword(s):

Machine Learning ◽

Unsupervised Learning ◽

Supervised Learning ◽

Emergency Response ◽

Learning Model ◽

Performance Comparison ◽

Learning Methods ◽

Content Type ◽

Machine Learning Classification

Purpose: This study aims to propose an unsupervised learning model to evaluate the credibility of disaster-related Twitter data and present a performance comparison with commonly used supervised machine learning models. Design/methodology/approach: First, historical tweets on two recent hurricane events are collected via the Twitter API. Then a credibility scoring system is implemented in which the tweet features are analyzed to give a credibility score and credibility label to the tweet. After that, supervised machine learning classification is implemented using various classification algorithms and their performances are compared. Findings: The proposed unsupervised learning model could enhance emergency response by providing a fast way to determine the credibility of disaster-related tweets. Additionally, the comparison of the supervised classification models reveals that the Random Forest classifier performs significantly better than the SVM and Logistic Regression classifiers in classifying the credibility of disaster-related tweets. Originality/value: In this paper, an unsupervised 10-point scoring model is proposed to evaluate the tweets’ credibility based on user-based and content-based features. This technique could be used to evaluate the credibility of disaster-related tweets on future hurricanes and has the potential to enhance emergency response during critical events. The comparative study of different supervised learning methods has revealed effective supervised learning methods for evaluating the credibility of Twitter data.