Machine Learning Applications for Mass Spectrometry-Based Metabolomics

The metabolome of an organism depends on environmental factors and intracellular regulation and provides information about its physiological condition. Metabolomics helps to understand disease progression in clinical settings or estimate metabolite overproduction for metabolic engineering. The most popular analytical metabolomics platform is mass spectrometry (MS). However, MS metabolome data analysis is complicated, since metabolites interact nonlinearly, and the data structures themselves are complex. Machine learning methods have become immensely popular for statistical analysis due to the inherent nonlinear data representation and the ability to process large and heterogeneous data rapidly. In this review, we address recent developments in using machine learning for processing MS spectra and show how machine learning generates new biological insights. In particular, supervised machine learning has great potential in metabolomics research because of the ability to supply quantitative predictions. We review here commonly used tools, such as random forest, support vector machines, artificial neural networks, and genetic algorithms. During processing steps, supervised machine learning methods assist with peak picking, normalization, and missing data imputation. For knowledge-driven analysis, machine learning contributes to biomarker detection, classification and regression, biochemical pathway identification, and carbon flux determination. Of particular relevance is the combination of different omics data to identify the contributions of the various regulatory levels. Our overview of recent publications also highlights that data quality determines analysis quality and adds to the challenge of choosing the right model for the data. Machine learning methods applied to MS-based metabolomics ease data analysis and can support clinical decisions, guide metabolic engineering, and stimulate fundamental biological discoveries.

Machine Learning for Mass Spectrometry Data Analysis in Proteomics

Current Proteomics ◽

10.2174/1570164617999201023145304 ◽

2020 ◽

Vol 17 ◽

Author(s):

Juntao Li ◽

Kanglei Zhou ◽

Bingyu Mu

Keyword(s):

Machine Learning ◽

Mass Spectrometry ◽

Support Vector Machine ◽

Data Analysis ◽

Rapid Development ◽

Point Of View ◽

Mass Spectrometry Data ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods

With the rapid development of high-throughput techniques, mass spectrometry has been widely used for large-scale protein analysis. To search for existing proteins, discover biomarkers, and diagnose and prognose diseases, machine learning methods are applied in mass spectrometry data analysis. This paper reviews the applications of five kinds of machine learning methods to mass spectrometry data analysis from an algorithmic point of view, including support vector machine, decision tree, random forest, naive Bayesian classifier and deep learning.

Classification models using circulating neutrophil transcripts can detect unruptured intracranial aneurysm

Journal of Translational Medicine ◽

10.1186/s12967-020-02550-2 ◽

2020 ◽

Vol 18 (1) ◽

Author(s):

Kerry E. Poppenberg ◽

Vincent M. Tutino ◽

Lu Li ◽

Muhammad Waqas ◽

Armond June ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Prediction Models ◽

Model Performance ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Training Cohort ◽

Network Analyses ◽

Machine Learning Methods

Abstract Background: Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict the presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods: Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results: Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes.
In our data, demographics and comorbidities did not affect model performance. Conclusions: We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate the effect of covariates.
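
The hold-out evaluation described in this abstract can be illustrated with a minimal Python sketch. It assumes a seeded random split into 94 training and 40 testing samples and computes the area under the ROC curve as the fraction of (case, control) pairs that the scores rank correctly. The function names `split_cohort` and `roc_auc` and the toy scores are hypothetical; they are not taken from the study, which does not describe its implementation here.

```python
import random


def split_cohort(n_samples, n_train, seed=0):
    """Randomly split sample indices into a training and a testing cohort."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    return indices[:n_train], indices[n_train:]


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Cohort sizes follow the abstract (134 patients, 94 for training).
train_idx, test_idx = split_cohort(134, 94)

# Toy labels and scores, invented purely for illustration.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

Any feature selection step, such as LASSO, would be fitted on the training indices only, so that the testing cohort stays untouched until the final evaluation.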

Hydraulic Flow Unit Classification and Prediction Using Machine Learning Techniques: A Case Study from the Nam Con Son Basin, Offshore Vietnam

Energies ◽

10.3390/en14227714 ◽

2021 ◽

Vol 14 (22) ◽

pp. 7714

Author(s):

Ha Quang Man ◽

Doan Huy Hien ◽

Kieu Duy Thong ◽

Bui Viet Dung ◽

Nguyen Minh Hoa ◽

...

Keyword(s):

Machine Learning ◽

Machine Learning Algorithms ◽

Flow Unit ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Log Data ◽

Hydraulic Flow ◽

Core Data ◽

Machine Learning Methods

The study area is the Miocene reservoir of the Nam Con Son Basin, offshore Vietnam. In the study we used unsupervised learning to automatically cluster hydraulic flow units (HU) based on flow zone indicators (FZI) in a core plug dataset. Then we applied supervised learning to predict HU by combining core and well log data. We tested several machine learning algorithms. In the first phase, we derived hydraulic flow unit clustering of porosity and permeability of core data using unsupervised machine learning methods such as Ward’s method, K-means, Self-Organizing Map (SOM) and Fuzzy C-means (FCM). Then we applied supervised machine learning methods including Artificial Neural Networks (ANN), Support Vector Machines (SVM), Boosted Tree (BT) and Random Forest (RF). We combined both core and log data to predict HU logs for the full well section of the wells without core data. We used four wells with six logs (GR, DT, NPHI, LLD, LSS and RHOB) and 578 cores from the Miocene reservoir to train, validate and test the models. Our goal was to show that the correct combination of core and well log data would provide reservoir engineers with a tool for HU classification and estimation of permeability in a continuous geological profile. Our research showed that machine learning effectively boosts the prediction of permeability, reduces uncertainty in reservoir modeling, and improves project economics.
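
As a rough illustration of the first phase, the Python sketch below computes the flow zone indicator from core porosity and permeability and groups the samples with a small one-dimensional k-means. It assumes the commonly used definition FZI = RQI / φz, with RQI = 0.0314·sqrt(k/φ) (k in millidarcies, φ as a fraction) and φz = φ/(1 − φ); the abstract does not state which formula the authors used. The function names, the deterministic initialisation, and the six core samples are hypothetical.

```python
import math


def flow_zone_indicator(permeability_md, porosity):
    """FZI from permeability (mD) and fractional porosity (assumed definition)."""
    rqi = 0.0314 * math.sqrt(permeability_md / porosity)  # reservoir quality index
    phi_z = porosity / (1.0 - porosity)                   # normalised porosity
    return rqi / phi_z


def kmeans_1d(values, k, iterations=50):
    """Tiny 1-D k-means; initial centres are spread over the sorted values."""
    ordered = sorted(values)
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign every value to its nearest centre.
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        # Move each centre to the mean of its members (empty clusters stay put).
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centres[j] = sum(members) / len(members)
    return labels, centres


# Invented core samples: (permeability in mD, fractional porosity).
cores = [(1.0, 0.10), (2.0, 0.12), (100.0, 0.20),
         (150.0, 0.22), (1.5, 0.11), (120.0, 0.21)]
log_fzi = [math.log(flow_zone_indicator(k, phi)) for k, phi in cores]
unit_labels, unit_centres = kmeans_1d(log_fzi, k=2)
```

Clustering on the logarithm of FZI is a design choice made here because FZI values often span orders of magnitude; the resulting unit labels would then serve as targets for the supervised models trained on well logs.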

Supervised machine learning methods in psychology: A practical introduction with annotated R code

10.31234/osf.io/s72vu ◽

2019 ◽

Author(s):

Hannes Rosenbusch ◽

Felix Soldner ◽

Anthony M Evans ◽

Marcel Zeelenberg

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Psychological Research ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Comprehensive Overview ◽

K Nearest Neighbors ◽

Machine Learning Methods ◽

Out Of Sample

Machine learning methods for pattern detection and prediction are increasingly prevalent in psychological research. We provide a comprehensive overview of machine learning, its applications, and how to implement models for research. We review fundamental concepts of machine learning, such as prediction accuracy and out-of-sample evaluation, and summarize four standard prediction algorithms: linear regressions, ridge regressions, decision trees, and random forests (plus k-nearest neighbors, Naïve Bayes classifiers, and support vector machines in the supplementary material). This selection provides a set of powerful models that are implemented regularly in machine learning projects. We demonstrate each method with examples and annotated R code, and discuss best practices for determining sample sizes; comparing model performances; tuning prediction models; preregistering prediction models; and reporting results. Finally, we discuss the value of machine learning methods in maintaining psychology’s status as a predictive science.
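
The idea of out-of-sample evaluation mentioned in this abstract can be sketched in a few lines of Python. The paper itself provides annotated R code, which is not reproduced here; this is an independent illustration that assumes a one-predictor linear regression and a simple k-fold split in which every k-th observation is held out. The names `fit_line` and `k_fold_mse` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_mse(xs, ys, k):
    """Mean squared prediction error averaged over k held-out folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        # Fit on the training part only, then score on unseen observations.
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Invented data: a noisy linear relationship.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.3, 16.9, 19.2]
cv_error = k_fold_mse(xs, ys, k=5)
```

Because each fold's error is computed on observations the model never saw during fitting, the average is an estimate of predictive accuracy rather than of in-sample fit.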

An evaluation of machine learning classifiers for next-generation, continuous-ethogram smart trackers

Movement Ecology ◽

10.1186/s40462-021-00245-x ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

Hui Yu ◽

Jian Deng ◽

Ran Nathan ◽

Max Kröschel ◽

Sasha Pekarsky ◽

...

Keyword(s):

Machine Learning ◽

Feature Reduction ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods ◽

Behavioural Research ◽

Intermittent Sampling

Abstract Background: Our understanding of movement patterns and behaviours of wildlife has advanced greatly through the use of improved tracking technologies, including application of accelerometry (ACC) across a wide range of taxa. However, most ACC studies either use intermittent sampling that hinders continuity or continuous data logging relying on tracker retrieval for data downloading, which is not applicable for long-term study. To allow long-term, fine-scale behavioural research, we evaluated a range of machine learning methods for their suitability for continuous on-board classification of ACC data into behaviour categories prior to data transmission. Methods: We tested six supervised machine learning methods, including linear discriminant analysis (LDA), decision tree (DT), support vector machine (SVM), artificial neural network (ANN), random forest (RF) and extreme gradient boosting (XGBoost) to classify behaviour using ACC data from three bird species (white stork Ciconia ciconia, griffon vulture Gyps fulvus and common crane Grus grus) and two mammals (dairy cow Bos taurus and roe deer Capreolus capreolus). Results: Using a range of quality criteria, SVM, ANN, RF and XGBoost performed well in determining behaviour from ACC data, and their good performance appeared little affected when greatly reducing the number of input features for model training. On-board runtime and storage-requirement tests showed that notably ANN, RF and XGBoost would make suitable on-board classifiers. Conclusions: Our identification of feature reduction in combination with ANN, RF and XGBoost as suitable methods for on-board behavioural classification of continuous ACC data has considerable potential to benefit movement ecology and behavioural research, wildlife conservation and livestock husbandry.

Evaluation of Supervised Learning Models in Predicting Greenhouse Energy Demand and Production for Intelligent and Sustainable Operations

Energies ◽

10.3390/en14196297 ◽

2021 ◽

Vol 14 (19) ◽

pp. 6297

Author(s):

Laila Ouazzani Chahidi ◽

Marco Fossa ◽

Antonella Priarone ◽

Abdellah Mechaqrane

Keyword(s):

Machine Learning ◽

Intelligent Control ◽

Energy Demand ◽

Well Being ◽

Supervised Machine Learning ◽

Support Vector ◽

Photovoltaic Module ◽

Learning Methods ◽

Sustainable Operations ◽

Machine Learning Methods

Plants need a specific environment to grow and reproduce well. However, climatic conditions are not stable and can affect plant well-being and, consequently, harvest quality. Greenhouse cultivation is therefore a suitable agricultural technique for creating and controlling an indoor microclimate adequate for plant growth. The relevance of greenhouse control is widely recognized. The prediction of greenhouse variables using artificial intelligence methods is of great interest for intelligent control and the potential reduction in energetic and financial losses. However, studies in this context are still limited, and several machine learning methods have not been sufficiently explored. The aim of this study is to predict the air conditioning electrical consumption and photovoltaic module electrical production at the smart Agro-Manufacturing Laboratory (SamLab) greenhouse, located in Albenga, north-western Italy. Different supervised machine learning methods were compared, namely, Artificial Neural Networks (ANNs), Gaussian Process Regression (GPR), Support Vector Machine (SVM) and boosted trees. We evaluated the performance of the models based on three statistical indicators: the coefficient of correlation (R), the normalized root mean square error (nRMSE) and the normalized mean absolute error (nMAE). The results show good agreement between the measured and predicted values for all models, with a correlation coefficient R > 0.9 on the validation set. The good performance of the models supports the value of this approach, which can be used to further improve greenhouse efficiency through intelligent control.
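
The three indicators named in this abstract can be computed as in the following Python sketch. The abstract does not say how the errors are normalized, so dividing by the range of the measured values is an assumption made here (normalizing by the mean is another common convention). The function names and the sample values are hypothetical.

```python
import math


def pearson_r(measured, predicted):
    """Coefficient of correlation R between measured and predicted values."""
    mean_m = sum(measured) / len(measured)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_m * var_p)


def nrmse(measured, predicted):
    """Root mean square error divided by the measured range (assumed convention)."""
    mse = sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)
    return math.sqrt(mse) / (max(measured) - min(measured))


def nmae(measured, predicted):
    """Mean absolute error divided by the measured range (assumed convention)."""
    mae = sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)
    return mae / (max(measured) - min(measured))


# Invented hourly consumption values (kWh), for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
scores = (pearson_r(measured, predicted),
          nrmse(measured, predicted),
          nmae(measured, predicted))
```

The toy data also shows why R alone is not sufficient: a prediction with a constant offset is perfectly correlated with the measurements while still carrying a non-zero nRMSE and nMAE.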

Classification Models using Circulating Neutrophil Transcripts Can Detect Unruptured Intracranial Aneurysm

10.21203/rs.3.rs-17161/v2 ◽

2020 ◽

Author(s):

Kerry E Poppenberg ◽

Vincent M Tutino ◽

Lu Li ◽

Muhammad Waqas ◽

Armond June ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Prediction Models ◽

Model Performance ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Training Cohort ◽

Network Analyses ◽

Machine Learning Methods

Abstract Background: Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods: Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n=94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n=40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results: Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC)=0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. 
In our data, demographics and comorbidities did not affect model performance. Conclusions: We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate effect of covariates.

Machine learning for the extragalactic astronomy educational manual

Proceedings of the International Astronomical Union ◽

10.1017/s1743921321000132 ◽

2019 ◽

Vol 15 (S367) ◽

pp. 461-463

Author(s):

Maksym Vasylenko ◽

Daria Dobrycheva

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Programming Language ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods ◽

Python Programming Language ◽

Python Programming

Abstract We evaluated a new approach to the automated morphological classification of large galaxy samples based on supervised machine learning techniques (Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest Neighbours) and Deep Learning using the Python programming language. A representative sample of ∼315000 SDSS DR9 galaxies at z < 0.1 and stellar magnitudes r < 17.7 mag was considered as a target sample of galaxies with indeterminate morphological types. Classical machine learning methods were used for binary morphological classification of galaxies into early and late types (96.4% with Support Vector Machine). Deep machine learning methods were used to classify images of galaxies into five visual types (completely rounded, rounded in-between, smooth cigar-shaped, edge-on, and spiral) with the Xception architecture (94% accuracy for four classes and 88% for cigar-like galaxies). These results created a basis for an educational manual on the processing of large data sets in the Python programming language, which is intended for students at Ukrainian universities.

Benchmark Study of Supervised Machine Learning Methods for a Ship Speed-Power Prediction at Sea

10.1115/omae2021-62395 ◽

2021 ◽

Author(s):

Xiao Lang ◽

Da Wu ◽

Wengang Mao

Keyword(s):

Machine Learning ◽

Supervised Machine Learning ◽

Support Vector ◽

Statistical Regression ◽

Learning Methods ◽

Benchmark Study ◽

Machine Learning Methods ◽

Extreme Gradient Boosting ◽

Ship Performance ◽

Ship Speed

Abstract The development and evaluation of energy efficiency measures to reduce air emissions from shipping depend strongly on a reliable description of a ship’s performance when sailing at sea. Normally, model tests and semi-empirical formulas are used to model a ship’s performance, but they are either expensive or lack accuracy. Nowadays, many ship performance-related parameters are recorded during a ship’s sailing, and different data-driven machine learning methods have been applied for ship speed-power modelling. This paper compares different supervised machine learning algorithms, i.e., eXtreme Gradient Boosting (XGBoost), neural network, support vector machine, and some statistical regression methods, for ship speed-power modelling. A worldwide sailing chemical tanker with full-scale measurements is employed as the case study vessel. A general data pre-processing method for the machine learning is presented. The machine learning models are trained using measurement data including ship operation profiles and encountered metocean conditions. Through the benchmark study, the pros and cons of different machine learning methods for the ship’s speed-power performance modelling are identified. The accuracy of models based on the various algorithms for ship performance during individual voyages is also investigated.

Prediction of the Development of Gestational Diabetes Mellitus in Pregnant Women Using Machine Learning Methods

Microsystems Electronics and Acoustics ◽

10.20535/2523-4455.mea.228845 ◽

2021 ◽

Vol 26 (2) ◽

Author(s):

Marko Romanovych Basarab ◽

Ekateryna Olehivna Ivanko ◽

Vishwesh Kulkarni

Keyword(s):

Diabetes Mellitus ◽

Machine Learning ◽

Gestational Diabetes ◽

Gestational Diabetes Mellitus ◽

Supervised Machine Learning ◽

Support Vector ◽

Pima Indians ◽

Learning Methods ◽

Machine Learning Methods ◽

Extreme Gradient Boosting

The paper is devoted to the application of machine learning methods to the prediction of the development of gestational diabetes mellitus in early pregnancy. Based on two publicly available databases, the study assesses the influence of such features as body mass index, thickness of triceps skin folds, ultrasound measurements of maternal visceral fat, first measured fasting glucose, and others as predictors of gestational diabetes mellitus. The supervised machine learning methods based on decision trees, support vector machines, logistic regression, k-nearest neighbors classifier, ensemble learning, Naive Bayes classifier, and neural networks were implemented to determine the best classification models for computerized gestational diabetes mellitus prediction. The accuracy of the different classifiers was determined and compared. The support vector machine classifier demonstrated the highest accuracy (83.0% of total correctly prognosed cases, 87.9% for healthy class, and 78.1% for gestational diabetes mellitus) in predicting the development of gestational diabetes based on features from the Pima Indians Diabetes Database. The extreme gradient boosting classifier performed best, compared with the other supervised machine learning methods, for the Visceral Adipose Tissue Measurements during Pregnancy Database. It showed 87.9% of total correctly prognosed cases, 82.2% for healthy class, and 93.6% for gestational diabetes mellitus.
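
The abstract reports an overall percentage of correctly classified cases alongside a separate percentage for each class. A minimal Python sketch of how such figures can be derived from true and predicted labels follows; the function name `accuracy_by_class` and the eight toy labels are hypothetical and do not reproduce the paper's data or code.

```python
def accuracy_by_class(true_labels, predicted_labels):
    """Return overall accuracy and the fraction classified correctly per class."""
    pairs = list(zip(true_labels, predicted_labels))
    overall = sum(t == p for t, p in pairs) / len(pairs)
    per_class = {}
    for cls in sorted(set(true_labels)):
        # Predictions made for the samples that truly belong to this class.
        predictions = [p for t, p in pairs if t == cls]
        per_class[cls] = sum(p == cls for p in predictions) / len(predictions)
    return overall, per_class


# Toy labels: 0 = healthy, 1 = gestational diabetes mellitus.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]
predicted_labels = [0, 0, 0, 1, 1, 1, 0, 0]
overall, per_class = accuracy_by_class(true_labels, predicted_labels)
```

Reporting the per-class values next to the overall figure matters when the classes are imbalanced, because a classifier can reach a high overall accuracy while missing most cases of the rarer class.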
