Prediction of the Development of Gestational Diabetes Mellitus in Pregnant Women Using Machine Learning Methods

The paper is devoted to the application of machine learning methods to the prediction of the development of gestational diabetes mellitus in early pregnancy. Based on two publicly available databases, the study assesses the influence of features such as body mass index, triceps skinfold thickness, ultrasound measurements of maternal visceral fat, first measured fasting glucose, and others as predictors of gestational diabetes mellitus. Supervised machine learning methods based on decision trees, support vector machines, logistic regression, k-nearest neighbors, ensemble learning, Naive Bayes, and neural networks were implemented to determine the best classification models for computerized prediction of gestational diabetes mellitus. The accuracy of the different classifiers was determined and compared. The support vector machine classifier demonstrated the highest accuracy (83.0% of all cases correctly predicted, 87.9% for the healthy class, and 78.1% for gestational diabetes mellitus) in predicting the development of gestational diabetes based on features from the Pima Indians Diabetes Database. The extreme gradient boosting classifier performed best, compared to the other supervised machine learning methods, on the Visceral Adipose Tissue Measurements during Pregnancy Database (87.9% of all cases correctly predicted, 82.2% for the healthy class, and 93.6% for gestational diabetes mellitus).
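The three figures reported for each classifier (overall, healthy class, gestational diabetes class) can be reproduced from true and predicted labels. The sketch below assumes that the per-class figures are the fraction of each true class that was classified correctly; the abstract does not spell out the definition, and the labels and data are hypothetical.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return (overall accuracy, {label: fraction of that true class predicted correctly})."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    totals = Counter(y_true)  # number of samples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    overall = sum(hits.values()) / len(y_true)
    by_class = {label: hits[label] / n for label, n in totals.items()}
    return overall, by_class


# Hypothetical labels: 5 healthy and 5 gestational diabetes (gdm) cases.
y_true = ["healthy"] * 5 + ["gdm"] * 5
y_pred = ["healthy", "healthy", "healthy", "healthy", "gdm",
          "gdm", "gdm", "gdm", "healthy", "healthy"]

overall, by_class = per_class_accuracy(y_true, y_pred)
print(overall, by_class)
```

Reporting the per-class values alongside the overall figure matters here because a classifier can score well overall while missing many cases of the class of clinical interest.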


Benchmark Study of Supervised Machine Learning Methods for a Ship Speed-Power Prediction at Sea

10.1115/omae2021-62395, 2021

Author(s): Xiao Lang, Da Wu, Wengang Mao

Keyword(s): Machine Learning, Supervised Machine Learning, Support Vector, Statistical Regression, Learning Methods, Benchmark Study, Machine Learning Methods, Extreme Gradient Boosting, Ship Performance, Ship Speed

The development and evaluation of energy efficiency measures to reduce air emissions from shipping depend strongly on a reliable description of a ship’s performance when sailing at sea. Normally, model tests and semi-empirical formulas are used to model a ship’s performance, but they are either expensive or lack accuracy. Nowadays, many ship performance-related parameters are recorded during a ship’s sailing, and different data-driven machine learning methods have been applied to ship speed-power modelling. This paper compares different supervised machine learning algorithms, i.e., eXtreme Gradient Boosting (XGBoost), neural network, support vector machine, and some statistical regression methods, for ship speed-power modelling. A worldwide sailing chemical tanker with full-scale measurements is employed as the case study vessel. A general data pre-processing method for machine learning is presented. The machine learning models are trained using measurement data including ship operation profiles and encountered metocean conditions. Through the benchmark study, the pros and cons of different machine learning methods for modelling the ship’s speed-power performance are identified. The accuracy of models based on the various algorithms for ship performance during individual voyages is also investigated.


Prediction of Hanwoo Cattle Phenotypes from Genotypes Using Machine Learning Methods

Animals, 10.3390/ani11072066, 2021, Vol 11 (7), pp. 2066

Author(s): Swati Srivastava, Bryan Irvine Lopez, Himansu Kumar, Myoungjin Jang, Han-Ha Chai, ...

Keyword(s): Machine Learning, Support Vector, Learning Methods, Eye Muscle, Important Species, Machine Learning Methods, Extreme Gradient Boosting, Boosting Method, Predictive Correlation, Hanwoo Cattle

Hanwoo was originally raised for draft purposes, but the increase in local demand for red meat turned that purpose into full-scale meat-type cattle rearing; it is now considered one of the most economically important species and a vital food source for Koreans. The application of genomic selection in Hanwoo breeding programs in recent years was expected to lead to higher genetic progress. However, better statistical methods that can improve the genomic prediction accuracy are required. Hence, this study aimed to compare the predictive performance of three machine learning methods, namely, random forest (RF), extreme gradient boosting (XGB), and support vector machine (SVM), when predicting carcass weight (CWT), marbling score (MS), backfat thickness (BFT), and eye muscle area (EMA). Phenotypic and genotypic data (53,866 SNPs) from 7324 commercial Hanwoo cattle that were slaughtered at the age of around 30 months were used. The boosting method XGB showed the highest predictive correlation for CWT and MS, followed by genomic best linear unbiased prediction (GBLUP), SVM, and RF. Meanwhile, the best predictive correlation for BFT and EMA was delivered by GBLUP, followed by SVM, RF, and XGB. Although XGB presented the highest predictive correlations for some traits, we did not find an advantage of XGB or any other machine learning method over GBLUP according to the mean squared error of prediction. Thus, we still recommend the use of GBLUP in the prediction of genomic breeding values for carcass traits in Hanwoo cattle.


Classification models using circulating neutrophil transcripts can detect unruptured intracranial aneurysm

Journal of Translational Medicine, 10.1186/s12967-020-02550-2, 2020, Vol 18 (1)

Author(s): Kerry E. Poppenberg, Vincent M. Tutino, Lu Li, Muhammad Waqas, Armond June, ...

Keyword(s): Machine Learning, Random Forest, Prediction Models, Model Performance, Supervised Machine Learning, Support Vector, Learning Methods, Training Cohort, Network Analyses, Machine Learning Methods

Background: Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict the presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods: Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results: Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under the ROC curve (AUC) of 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance. Conclusions: We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate the effect of covariates.
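The area under the ROC curve used to assess the models can be computed directly from labels and classifier scores. A minimal sketch follows, using the rank-based interpretation of AUC (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one). The function name, labels, and scores are hypothetical, and this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical test cohort: 1 = aneurysm, 0 = control.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the cohort size, which is acceptable for a test set of a few dozen samples; a sort-based rank computation would be preferable for large data.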


Hydraulic Flow Unit Classification and Prediction Using Machine Learning Techniques: A Case Study from the Nam Con Son Basin, Offshore Vietnam

Energies, 10.3390/en14227714, 2021, Vol 14 (22), pp. 7714

Author(s): Ha Quang Man, Doan Huy Hien, Kieu Duy Thong, Bui Viet Dung, Nguyen Minh Hoa, ...

Keyword(s): Machine Learning, Machine Learning Algorithms, Flow Unit, Supervised Machine Learning, Support Vector, Learning Methods, Log Data, Hydraulic Flow, Core Data, Machine Learning Methods

The test study area is the Miocene reservoir of the Nam Con Son Basin, offshore Vietnam. In the study we used unsupervised learning to automatically cluster hydraulic flow units (HU) based on flow zone indicators (FZI) in a core plug dataset. Then we applied supervised learning to predict HU by combining core and well log data. We tested several machine learning algorithms. In the first phase, we derived hydraulic flow unit clustering of porosity and permeability of core data using unsupervised machine learning methods such as Ward’s method, K-means, Self-Organizing Map (SOM) and Fuzzy C-means (FCM). Then we applied supervised machine learning methods including Artificial Neural Networks (ANN), Support Vector Machines (SVM), Boosted Tree (BT) and Random Forest (RF). We combined both core and log data to predict HU logs for the full well section of the wells without core data. We used four wells with six logs (GR, DT, NPHI, LLD, LSS and RHOB) and 578 cores from the Miocene reservoir to train, validate and test the models. Our goal was to show that the correct combination of core and well log data would provide reservoir engineers with a tool for HU classification and estimation of permeability in a continuous geological profile. Our research showed that machine learning effectively boosts the prediction of permeability, reduces uncertainty in reservoir modeling, and improves project economics.
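The abstract does not state how the flow zone indicator is computed from core porosity and permeability. The sketch below assumes the widely used definition, in which the reservoir quality index is RQI = 0.0314 * sqrt(k / phi) (k in millidarcies, phi as a fraction), the normalized porosity is phi_z = phi / (1 - phi), and FZI = RQI / phi_z. The function name and the sample values are hypothetical.

```python
import math


def flow_zone_indicator(permeability_md, porosity):
    """FZI in micrometres from permeability (mD) and fractional porosity.

    Assumes the common definition FZI = RQI / phi_z, which the abstract does not confirm.
    """
    if not 0.0 < porosity < 1.0:
        raise ValueError("porosity must be a fraction strictly between 0 and 1")
    if permeability_md <= 0.0:
        raise ValueError("permeability must be positive")
    rqi = 0.0314 * math.sqrt(permeability_md / porosity)  # reservoir quality index
    phi_z = porosity / (1.0 - porosity)                   # pore-to-grain volume ratio
    return rqi / phi_z


# Hypothetical core plug: 100 mD permeability, 20% porosity.
fzi = flow_zone_indicator(100.0, 0.20)
print(round(fzi, 4))
```

Because FZI values typically span orders of magnitude, clustering is often done on log(FZI) rather than on the raw value; whether the study does so is not stated.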


Using Machine Learning Methods To Identify Coal Pay Zones from Drilling and Logging-While-Drilling (LWD) Data

SPE Journal, 10.2118/198288-pa, 2020, Vol 25 (03), pp. 1241-1258, Cited By ~ 2

Author(s): Ruizhi Zhong, Raymond L. Johnson, Zhongwei Chen

Keyword(s): Machine Learning, Learning Algorithms, Machine Learning Algorithms, Training Data, Support Vector, Learning Methods, Well Completion, Machine Learning Methods, Extreme Gradient Boosting, Logging While Drilling

Accurate coal identification is critical in coal seam gas (CSG) (also known as coalbed methane or CBM) developments because it determines well completion design and directly affects gas production. Density logging using radioactive source tools is the primary tool for coal identification, adding well trips to condition the hole and additional well costs for logging runs. In this paper, machine learning methods are applied to identify coals from drilling and logging-while-drilling (LWD) data to reduce overall well costs. Machine learning algorithms include logistic regression (LR), support vector machine (SVM), artificial neural network (ANN), random forest (RF), and extreme gradient boosting (XGBoost). The precision, recall, and F1 score are used as evaluation metrics. Because coal identification is an imbalanced data problem, the performance on the minority class (i.e., coals) is limited. To enhance the performance on coal prediction, two data manipulation techniques [naive random oversampling (NROS) technique and synthetic minority oversampling technique (SMOTE)] are separately coupled with machine learning algorithms. Case studies are performed with data from six wells in the Surat Basin, Australia. For the first set of experiments (single-well experiments), both the training data and test data are in the same well. The machine learning methods can identify coal pay zones for sections with poor or missing logs. It is found that rate of penetration (ROP) is the most important feature. The second set of experiments (multiple-well experiments) uses the training data from multiple nearby wells, which can predict coal pay zones in a new well. The most important feature is gamma ray. After placing slotted casings, all wells have coal identification rates greater than 90%, and three wells have coal identification rates greater than 99%. This indicates that machine learning methods (either XGBoost or ANN/RF with NROS/SMOTE) can be an effective way to identify coal pay zones and reduce coring or logging costs in CSG developments.
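Naive random oversampling and the precision, recall, and F1 metrics named above can be illustrated in a few lines. The sketch assumes a binary problem in which minority rows are duplicated at random until the two classes are the same size; the feature names, labels, and values are hypothetical, and the paper's exact procedure may differ.

```python
import random


def naive_random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    if not minority:
        raise ValueError("no samples with the minority label")
    majority_count = len(y) - len(minority)
    n_extra = max(0, majority_count - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]  # sampled with replacement
    return list(X) + [X[i] for i in extra], list(y) + [y[i] for i in extra]


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical training rows of [rate of penetration, gamma ray]; coal is the minority class.
X = [[10.0 + i, 80.0 - i] for i in range(8)] + [[30.0, 20.0], [32.0, 18.0]]
y = ["rock"] * 8 + ["coal"] * 2
X_bal, y_bal = naive_random_oversample(X, y, minority_label="coal")

# Hypothetical test-set predictions, scored on the coal class.
scores = precision_recall_f1(
    ["coal", "coal", "coal", "rock", "rock", "rock"],
    ["coal", "coal", "rock", "coal", "rock", "rock"],
    positive="coal",
)
print(len(X_bal), y_bal.count("coal"), scores)
```

Oversampling should be applied to the training split only; duplicating rows before the train/test split would leak copies of test samples into training and inflate the reported metrics.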


Machine Learning Applications for Mass Spectrometry-Based Metabolomics

Metabolites, 10.3390/metabo10060243, 2020, Vol 10 (6), pp. 243, Cited By ~ 7

Author(s): Ulf W. Liebal, An N. T. Phan, Malvika Sudhakar, Karthik Raman, Lars M. Blank

Keyword(s): Machine Learning, Mass Spectrometry, Data Analysis, Metabolic Engineering, Data Representation, Heterogeneous Data, Supervised Machine Learning, Support Vector, Learning Methods, Machine Learning Methods

The metabolome of an organism depends on environmental factors and intracellular regulation and provides information about the physiological conditions. Metabolomics helps to understand disease progression in clinical settings or estimate metabolite overproduction for metabolic engineering. The most popular analytical metabolomics platform is mass spectrometry (MS). However, MS metabolome data analysis is complicated, since metabolites interact nonlinearly, and the data structures themselves are complex. Machine learning methods have become immensely popular for statistical analysis due to the inherent nonlinear data representation and the ability to process large and heterogeneous data rapidly. In this review, we address recent developments in using machine learning for processing MS spectra and show how machine learning generates new biological insights. In particular, supervised machine learning has great potential in metabolomics research because of the ability to supply quantitative predictions. We review here commonly used tools, such as random forest, support vector machines, artificial neural networks, and genetic algorithms. During processing steps, the supervised machine learning methods help with peak picking, normalization, and missing data imputation. For knowledge-driven analysis, machine learning contributes to biomarker detection, classification and regression, biochemical pathway identification, and carbon flux determination. Of particular relevance is the combination of different omics data to identify the contributions of the various regulatory levels. Our overview of the recent publications also highlights that data quality determines analysis quality, but also adds to the challenge of choosing the right model for the data. Machine learning methods applied to MS-based metabolomics ease data analysis and can support clinical decisions, guide metabolic engineering, and stimulate fundamental biological discoveries.


Prediction of Liver Weight Recovery by an Integrated Metabolomics and Machine Learning Approach After 2/3 Partial Hepatectomy

Frontiers in Pharmacology, 10.3389/fphar.2021.760474, 2021, Vol 12

Author(s): Runbin Sun, Haokai Zhao, Shuzhen Huang, Ran Zhang, Zhenyao Lu, ...

Keyword(s): Neural Network, Machine Learning, Random Forest, Liver Regeneration, Partial Hepatectomy, Support Vector, Learning Methods, Machine Learning Methods, Liver Index, Extreme Gradient Boosting

The liver has the ability to regenerate itself in mammals, but the mechanism has not been fully explained. Here we used a GC/MS-based metabolomic method to profile the dynamic endogenous metabolic change in the serum of C57BL/6J mice at different times after 2/3 partial hepatectomy (PHx), and nine machine learning methods, including Least Absolute Shrinkage and Selection Operator Regression (LASSO), Partial Least Squares Regression (PLS), Principal Components Regression (PCR), k-Nearest Neighbors (KNN), Support Vector Machines (SVM), Random Forest (RF), eXtreme Gradient Boosting (xgbDART), Neural Network (NNET) and Bayesian Regularized Neural Network (BRNN), were used for regression between the liver index and metabolomic data at different stages of liver regeneration. We found that the tree-based random forest method had the minimum average Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) and the maximum R square (R2), and was time-saving. Furthermore, variable importance in projection (VIP) analysis of the RF method was performed, and the metabolites with the top 20 VIP ranks were selected as the most critical metabolites contributing to the model. Ornithine, phenylalanine, 2-hydroxybutyric acid, lysine, etc. were chosen as the most important metabolites, which had strong correlations with the liver index. Further pathway analysis found that arginine biosynthesis; pantothenate and CoA biosynthesis; galactose metabolism; and valine, leucine and isoleucine degradation were the most influenced pathways. In summary, several amino acid metabolic pathways and the glucose metabolism pathway were dynamically changed during liver regeneration. The RF method showed advantages over the other machine learning methods used for predicting the liver index after PHx, and a metabolic clock containing four metabolites was established to predict the liver index during liver regeneration.


Supervised machine learning methods in psychology: A practical introduction with annotated R code

10.31234/osf.io/s72vu, 2019

Author(s): Hannes Rosenbusch, Felix Soldner, Anthony M Evans, Marcel Zeelenberg

Keyword(s): Machine Learning, Prediction Models, Psychological Research, Supervised Machine Learning, Support Vector, Learning Methods, Comprehensive Overview, K Nearest Neighbors, Machine Learning Methods, Out Of Sample

Machine learning methods for pattern detection and prediction are increasingly prevalent in psychological research. We provide a comprehensive overview of machine learning, its applications, and how to implement models for research. We review fundamental concepts of machine learning, such as prediction accuracy and out-of-sample evaluation, and summarize four standard prediction algorithms: linear regressions, ridge regressions, decision trees, and random forests (plus k-nearest neighbors, Naïve Bayes classifiers, and support vector machines in the supplementary material). This selection provides a set of powerful models that are implemented regularly in machine learning projects. We demonstrate each method with examples and annotated R code, and discuss best practices for determining sample sizes; comparing model performances; tuning prediction models; preregistering prediction models; and reporting results. Finally, we discuss the value of machine learning methods in maintaining psychology’s status as a predictive science.
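Out-of-sample evaluation, one of the fundamental concepts reviewed, can be illustrated with k-fold cross-validation. The paper's own examples are in annotated R; the sketch below is an independent Python illustration, with hypothetical function names and data, that scores the simplest possible model (predicting the training-set mean) only on samples it was not fitted to.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled, non-overlapping folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # every index lands in exactly one fold
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cv_mse_mean_predictor(y, k=5, seed=0):
    """Out-of-sample mean squared error of a model that predicts the training mean."""
    squared_errors = []
    for train, test in k_fold_indices(len(y), k, seed):
        prediction = sum(y[j] for j in train) / len(train)  # fitted on training folds only
        squared_errors.extend((y[j] - prediction) ** 2 for j in test)
    return sum(squared_errors) / len(squared_errors)


# Hypothetical outcome scores for ten participants.
outcome = [2.0, 3.5, 3.0, 4.0, 2.5, 3.0, 4.5, 3.5, 2.0, 3.0]
print(cv_mse_mean_predictor(outcome, k=5))
```

The mean predictor is a useful baseline: a regression or random forest model is only worth reporting if its cross-validated error is clearly lower than this value.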


An evaluation of machine learning classifiers for next-generation, continuous-ethogram smart trackers

Movement Ecology, 10.1186/s40462-021-00245-x, 2021, Vol 9 (1)

Author(s): Hui Yu, Jian Deng, Ran Nathan, Max Kröschel, Sasha Pekarsky, ...

Keyword(s): Machine Learning, Feature Reduction, Supervised Machine Learning, Support Vector, Learning Methods, Machine Learning Methods, Behavioural Research, Intermittent Sampling

Background: Our understanding of movement patterns and behaviours of wildlife has advanced greatly through the use of improved tracking technologies, including application of accelerometry (ACC) across a wide range of taxa. However, most ACC studies use either intermittent sampling, which hinders continuity, or continuous data logging that relies on tracker retrieval for data downloading, which is not applicable for long-term study. To allow long-term, fine-scale behavioural research, we evaluated a range of machine learning methods for their suitability for continuous on-board classification of ACC data into behaviour categories prior to data transmission. Methods: We tested six supervised machine learning methods, namely linear discriminant analysis (LDA), decision tree (DT), support vector machine (SVM), artificial neural network (ANN), random forest (RF) and extreme gradient boosting (XGBoost), to classify behaviour using ACC data from three bird species (white stork Ciconia ciconia, griffon vulture Gyps fulvus and common crane Grus grus) and two mammals (dairy cow Bos taurus and roe deer Capreolus capreolus). Results: Using a range of quality criteria, SVM, ANN, RF and XGBoost performed well in determining behaviour from ACC data, and their good performance appeared little affected when the number of input features for model training was greatly reduced. On-board runtime and storage-requirement tests showed that notably ANN, RF and XGBoost would make suitable on-board classifiers. Conclusions: Our identification of feature reduction in combination with ANN, RF and XGBoost as suitable methods for on-board behavioural classification of continuous ACC data has considerable potential to benefit movement ecology and behavioural research, wildlife conservation and livestock husbandry.


Evaluation of Supervised Learning Models in Predicting Greenhouse Energy Demand and Production for Intelligent and Sustainable Operations

Energies, 10.3390/en14196297, 2021, Vol 14 (19), pp. 6297

Author(s): Laila Ouazzani Chahidi, Marco Fossa, Antonella Priarone, Abdellah Mechaqrane

Keyword(s): Machine Learning, Intelligent Control, Energy Demand, Well Being, Supervised Machine Learning, Support Vector, Photovoltaic Module, Learning Methods, Sustainable Operations, Machine Learning Methods

Plants need a specific environment to grow and reproduce in fine fettle. Nevertheless, climatic conditions are not stable and can impact their well-being and, consequently, harvest quality. Thus, greenhouse cultivation is one of the suitable agricultural techniques for creating and controlling the inside microclimate to be adequate for plant growth. The relevance of greenhouse control is widely recognized. The prediction of greenhouse variables using artificial intelligence methods is of great interest for intelligent control and the potential reduction in energetic and financial losses. However, the studies carried out in this context are still more or less limited and several machine learning methods have not been sufficiently exploited. The aim of this study is to predict the air conditioning electrical consumption and photovoltaic module electrical production at the smart Agro-Manufacturing Laboratory (SamLab) greenhouse, located in Albenga, north-western Italy. Different supervised machine learning methods were compared, namely, Artificial Neural Networks (ANNs), Gaussian Process Regression (GPR), Support Vector Machine (SVM) and Boosting trees. We evaluated the performance of the models based on three statistical indicators: the coefficient of correlation (R), the normalized root mean square error (nRMSE) and the normalized mean absolute error (nMAE). The results show good agreement between the measured and predicted values for all models, with a correlation coefficient R > 0.9, considering the validation set. The good performance of the models affirms the importance of this approach and that it can be used to further improve greenhouse efficiency through its intelligent control.
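The three statistical indicators (R, nRMSE, nMAE) can be computed from measured and predicted series as sketched below. The abstract does not say how the errors are normalized; this sketch assumes division by the range of the measured values, and dividing by their mean is another common convention. The function name and data are hypothetical.

```python
import math


def regression_indicators(measured, predicted):
    """Return (R, nRMSE, nMAE), with errors normalized by the range of the measured values."""
    n = len(measured)
    if n < 2 or n != len(predicted):
        raise ValueError("need two equal-length series with at least two points")
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_m == 0.0 or var_p == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    r = cov / math.sqrt(var_m * var_p)  # Pearson correlation coefficient
    rmse = math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)
    mae = sum(abs(m - p) for m, p in zip(measured, predicted)) / n
    span = max(measured) - min(measured)  # assumed normalization constant
    return r, rmse / span, mae / span


# Hypothetical hourly consumption (kWh): predictions carry a constant +0.5 offset.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 3.5, 4.5, 5.5]

r, nrmse, nmae = regression_indicators(measured, predicted)
print(r, nrmse, nmae)
```

The example shows why R alone is not sufficient: a constant offset leaves the correlation at 1 while the normalized errors are clearly non-zero, which is a reason to report all three indicators together.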
