A machine learning based approach to clinopyroxene thermobarometry: model optimisation and distribution for use in Earth Sciences

2021 ◽  
Author(s):  
Corin Jorgenson ◽  
Oliver Higgins ◽  
Maurizio Petrelli ◽  
Florence Bégué ◽  
Luca Caricchi

Thermobarometry is a fundamental tool to quantitatively interrogate magma plumbing systems and broaden our appreciation of volcanic processes. Developments in random-forest-based machine learning lend themselves to a more data-driven approach to clinopyroxene thermobarometry, including allowing users to access and filter large experimental datasets that can be tailored to individual applications in Earth Sciences. Here we present a methodological assessment of random forest thermobarometry using the open-source R package “extraTrees”, investigating model performance, tuning hyperparameters, and evaluating different methods for calculating uncertainties. We determine that deviating from the default hyperparameters of the “extraTrees” package makes little difference to overall model performance (<0.2 kbar and <3 °C difference in mean SEE). However, accuracy is strongly affected by how the final pressure or temperature (PT) value is selected from the voting distribution of trees in the random forest (mean, median, or mode), an aspect that has so far gone unexamined in machine learning thermobarometry. Using the mean value leads to larger residuals between experimental and predicted PT, whereas using the median produces smaller residuals. Additionally, this work provides two comprehensive R scripts with which users can apply the random forest methodology to natural datasets. The first script permits modification and filtering of the model calibration dataset. The second script contains pre-made models into which users can rapidly input their data to recover pressure and temperature estimates. These scripts are open source and can be accessed at https://github.com/corinjorgenson/RandomForest-cpx-thermobarometer.
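
The aggregation choice discussed above can be illustrated with a minimal Python sketch. It uses scikit-learn's ExtraTreesRegressor as a stand-in for the R "extraTrees" package and synthetic data in place of the experimental calibration set; it only shows how per-tree predictions can be summarized with the median rather than the default mean.

```python
# Minimal sketch (not the authors' R workflow): aggregate per-tree votes with
# the median instead of the default mean, using scikit-learn's ExtraTreesRegressor
# as a stand-in for the R "extraTrees" package. Data here are synthetic.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                                    # e.g. clinopyroxene compositional features
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=500)     # e.g. pressure (kbar), synthetic target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = ExtraTreesRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)

# Default prediction = mean over trees; here we also take the per-tree median.
per_tree = np.stack([t.predict(X_test) for t in model.estimators_])
mean_pred = per_tree.mean(axis=0)          # equivalent to model.predict(X_test)
median_pred = np.median(per_tree, axis=0)

print("mean-vote MAE:  ", np.mean(np.abs(mean_pred - y_test)))
print("median-vote MAE:", np.mean(np.abs(median_pred - y_test)))
```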

Author(s):  
Dessislava Pachamanova ◽  
Vera Tilson ◽  
Keely Dwyer-Matzky

This case discusses a process improvement project aimed at maximizing the use of hospital capacity as the flu season looms. Dr. Erin Kelly heads the observation unit and turns to predictive models to improve the assignment of patients to her unit. The case covers three major themes: (1) data analytics life cycle and interface of predictive and prescriptive analytics in the context of process improvement, (2) design and ethical application of machine learning models, and (3) effecting organizational change to operationalize the findings of the analysis. Realistic data, R code, and Excel models are provided. The rich context of the case allows for discussing change management in a healthcare organization, analytics problem framing and model mapping, service process capacity analysis and Little’s law, data summaries and visualizations, interpretable machine learning algorithms, evaluations of predictive model performance, algorithmic bias, and dealing with dirty data. The case is appropriate for use in courses in machine learning, business analytics, operations management, and operations research, both at the advanced undergraduate level and at the master’s/MBA level.
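
Since the case lists Little's law among its topics, a small illustrative calculation may help; the numbers below are hypothetical and are not taken from the case materials.

```python
# Little's law, L = lambda * W: a toy illustration with hypothetical numbers
# for an observation unit (not data from the case).
arrival_rate = 12          # lambda: patients admitted per day (assumed)
avg_length_of_stay = 1.5   # W: average time in the unit, in days (assumed)

avg_patients_in_unit = arrival_rate * avg_length_of_stay  # L, the average census
print(f"Average census implied by Little's law: {avg_patients_in_unit:.1f} patients")

# Rearranged: the arrival rate a fixed bed count can sustain.
beds_available = 20
max_sustainable_arrivals = beds_available / avg_length_of_stay
print(f"Max sustainable arrival rate: {max_sustainable_arrivals:.1f} patients/day")
```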


2020 ◽  
Vol 18 (1) ◽  
Author(s):  
Kerry E. Poppenberg ◽  
Vincent M. Tutino ◽  
Lu Li ◽  
Muhammad Waqas ◽  
Armond June ◽  
...  

Abstract Background Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict the presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance using receiver operating characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under the ROC curve (AUC) of 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance. Conclusions We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate the effects of covariates.
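
A minimal sketch of the described pipeline, with synthetic data standing in for the neutrophil transcriptomes and scikit-learn standing in for the study's actual implementation: LASSO-based feature selection on a training cohort, then a random forest evaluated by ROC AUC on the held-out cohort.

```python
# Sketch on synthetic data: LASSO feature selection in the training cohort,
# then a random-forest classifier assessed by ROC AUC on the held-out cohort.
# Shapes and parameters are illustrative, not the study's.
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(134, 1000))                        # transcript expression matrix (toy)
y = (X[:, :5].sum(axis=1) + rng.normal(size=134)) > 0   # IA status (synthetic label)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=94, stratify=y, random_state=1)

# LASSO keeps only transcripts with non-zero coefficients.
selector = SelectFromModel(LassoCV(cv=5, random_state=1)).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_tr_sel, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te_sel)[:, 1])
print(f"{X_tr_sel.shape[1]} transcripts selected; test AUC = {auc:.2f}")
```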


Sensors ◽  
2021 ◽  
Vol 21 (17) ◽  
pp. 5896
Author(s):  
Eddi Miller ◽  
Vladyslav Borysenko ◽  
Moritz Heusinger ◽  
Niklas Niedner ◽  
Bastian Engelmann ◽  
...  

Changeover times are an important element when evaluating the Overall Equipment Effectiveness (OEE) of a production machine. The article presents a machine learning (ML) approach, based on an external sensor setup, to automatically detect changeovers in a shopfloor environment. Door status, coolant flow, power consumption, and operator indoor GPS data of a milling machine were used in the ML approach. As ML methods, Decision Trees, Support Vector Machines, (Balanced) Random Forest algorithms, and Neural Networks were chosen and their performance compared. The best results were achieved with the Random Forest model (97% F1 score, 99.72% AUC score). It was also found that model performance is optimal when only a binary classification into a changeover phase and a production phase is considered, rather than splitting the changeover process into further subphases.
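
The binary changeover-versus-production classification can be sketched as follows; the sensor features, labels, and model settings are synthetic placeholders, not the authors' setup.

```python
# Illustrative binary changeover-vs-production classifier on synthetic sensor
# features (door status, coolant flow, power, operator position), scored with
# the F1 and ROC-AUC metrics reported in the article.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([
    rng.integers(0, 2, n),          # machine door open/closed
    rng.normal(10, 2, n),           # coolant flow
    rng.normal(5, 1, n),            # power consumption (kW)
    rng.normal(0, 1, n),            # operator distance to machine (indoor GPS proxy)
])
y = ((X[:, 0] == 1) & (X[:, 2] < 5)).astype(int)   # 1 = changeover phase (toy rule)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)
clf = RandomForestClassifier(n_estimators=300, random_state=2).fit(X_tr, y_tr)

print("F1 :", f1_score(y_te, clf.predict(X_te)))
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```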


2020 ◽  
Author(s):  
Nicola Bodini ◽  
Mike Optis

Abstract. The extrapolation of wind speeds measured at a meteorological mast to wind turbine hub heights is a key component in a bankable wind farm energy assessment and a significant source of uncertainty. Industry-standard methods for extrapolation include the power law and the logarithmic profile. The emergence of machine-learning applications in wind energy has led to several studies demonstrating substantial improvements in vertical extrapolation accuracy of machine-learning methods over these conventional power law and logarithmic profile methods. In all cases, these studies assess relative model performance at a measurement site where, critically, the machine-learning algorithm requires knowledge of the hub-height wind speeds in order to train the model. This prior knowledge provides fundamental advantages to the site-specific machine-learning model over the power law and log profile, which, by contrast, are not tuned to hub-height measurements but rather generalize to any site. Furthermore, there is no practical benefit in applying a machine-learning model at a site where hub-height winds are already known; rather, its performance at nearby locations (i.e., across a wind farm site) without hub-height measurements is of most practical interest. To compare machine-learning-based extrapolation to standard approaches more fairly and practically, we implemented a round-robin extrapolation model comparison, in which a random forest machine-learning model is trained and evaluated at different sites and then compared against the power law and logarithmic profile. We consider 20 months of lidar and sonic anemometer data collected at four sites located 50 to 100 km apart in the central United States. We find that the random forest outperforms the standard extrapolation approaches, especially when surface measurements are incorporated as inputs to capture the influence of atmospheric stability. When compared at a single site (the traditional comparison approach), the machine-learning improvement in mean absolute error was 28 % and 23 % over the power law and logarithmic profile, respectively. Using the round-robin approach proposed here, this improvement drops to 19 % and 14 %, respectively. These latter values better represent practical model performance, and we conclude that round-robin validation should be the standard for machine-learning-based wind speed extrapolation methods.
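
A schematic of the round-robin comparison follows, with synthetic wind-speed data, an assumed pair of measurement/hub heights, and a fixed shear exponent for the power-law baseline; the paper's actual data, features, and baselines differ.

```python
# Round-robin sketch: train a random forest on all sites but one, evaluate at the
# held-out site, and compare against a fixed-exponent power-law extrapolation.
# Heights, shear exponents, and features are assumed, not the paper's lidar data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
Z_LOW, Z_HUB = 40.0, 100.0        # measurement and hub heights (m), assumed

def make_site(n=2000, alpha=0.14):
    u_low = rng.weibull(2.0, n) * 8.0          # low-level wind speed (m/s)
    stab = rng.normal(size=n)                  # stand-in stability proxy
    u_hub = u_low * (Z_HUB / Z_LOW) ** (alpha + 0.03 * stab) + rng.normal(0, 0.3, n)
    return np.column_stack([u_low, stab]), u_hub

sites = {name: make_site(alpha=a) for name, a in zip("ABCD", (0.12, 0.14, 0.16, 0.18))}

for test_name, (X_te, y_te) in sites.items():          # rotate the held-out site
    X_tr = np.vstack([X for n, (X, _) in sites.items() if n != test_name])
    y_tr = np.hstack([y for n, (_, y) in sites.items() if n != test_name])
    rf = RandomForestRegressor(n_estimators=200, random_state=3).fit(X_tr, y_tr)

    mae_rf = mean_absolute_error(y_te, rf.predict(X_te))
    mae_pl = mean_absolute_error(y_te, X_te[:, 0] * (Z_HUB / Z_LOW) ** 0.14)
    print(f"site {test_name}: RF MAE = {mae_rf:.2f} m/s, power law MAE = {mae_pl:.2f} m/s")
```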


2018 ◽  
Vol 49 (14) ◽  
pp. 2330-2341 ◽  
Author(s):  
Rahel Pearson ◽  
Derek Pisner ◽  
Björn Meyer ◽  
Jason Shumake ◽  
Christopher G. Beevers

Abstract Background Some Internet interventions are regarded as effective treatments for adult depression, but less is known about who responds to this form of treatment. Method An elastic net and random forest were trained to predict depression symptoms and related disability after an 8-week course of an Internet intervention, Deprexis, involving adults (N = 283) from across the USA. Candidate predictors included psychopathology, demographics, treatment expectancies, treatment usage, and environmental context obtained from population databases. Model performance was evaluated using predictive $R^2$ ($R^2_{\mathrm{pred}}$), the expected variance explained in a new sample, estimated by 10 repetitions of 10-fold cross-validation. Results An ensemble model was created by averaging the predictions of the elastic net and random forest. Model performance was compared with a benchmark linear autoregressive model that predicted each outcome using only its baseline. The ensemble predicted more variance in post-treatment depression (8.0% gain, 95% CI 0.8–15; total $R^2_{\mathrm{pred}} = 0.25$), disability (5.0% gain, 95% CI −0.3 to 10; total $R^2_{\mathrm{pred}} = 0.25$), and well-being (11.6% gain, 95% CI 4.9–19; total $R^2_{\mathrm{pred}} = 0.29$) than the benchmark model. Important predictors included comorbid psychopathology, particularly total psychopathology and dysthymia, low symptom-related disability, treatment credibility, lower access to therapists, and time spent using certain Deprexis modules. Conclusion A number of variables predict symptom improvement following an Internet intervention, but each of these variables makes relatively small contributions. Machine learning ensembles may be a promising statistical approach for identifying the cumulative contribution of many weak predictors to psychosocial depression treatment response.
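
The ensembling and evaluation idea can be sketched as follows; data are synthetic, the number of repetitions is reduced, and scikit-learn estimators stand in for the models actually used in the study.

```python
# Sketch of the ensembling idea: average elastic-net and random-forest predictions
# and estimate out-of-sample R^2 via repeated 10-fold cross-validation.
# Synthetic data; repetitions reduced to keep the example quick.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)
X = rng.normal(size=(283, 25))                      # baseline predictors (toy)
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=283)  # post-treatment outcome (toy)

cv = RepeatedKFold(n_splits=10, n_repeats=2, random_state=4)
scores = []
for tr, te in cv.split(X):
    enet = ElasticNetCV(cv=5, random_state=4).fit(X[tr], y[tr])
    rf = RandomForestRegressor(n_estimators=200, random_state=4).fit(X[tr], y[tr])
    ensemble_pred = (enet.predict(X[te]) + rf.predict(X[te])) / 2   # simple average
    scores.append(r2_score(y[te], ensemble_pred))

print(f"predictive R^2 (mean over folds): {np.mean(scores):.2f}")
```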


2020 ◽  
Vol 2 (Supplement_2) ◽  
pp. ii12-ii12
Author(s):  
Xuguang Chen ◽  
Vishwa Parekh ◽  
Luke Peng ◽  
Michael Chan ◽  
Michael Soike ◽  
...  

Abstract PURPOSE This study aims to test whether MRI radiomic signatures can distinguish radiation necrosis (RN) from tumor progression (TP) in a multi-institution dataset using machine learning. METHODS Brain metastases treated with SRS were followed by serial MRI, and those showing evidence of RN or TP underwent pathologic confirmation. Radiomic features were extracted from T1 post-contrast (T1c) and T2 fluid-attenuated inversion recovery (T2 FLAIR) MRI. The high-dimensional radiomic feature space was visualized in two dimensions using t-distributed stochastic neighbor embedding (t-SNE). Cases from 2 institutions were combined and randomly assigned to training (2/3) and testing (1/3) cohorts. Backward elimination was used for feature selection, followed by a random forest algorithm for predictive modeling. RESULTS A total of 135 individual lesions (37 RN and 98 TP) were included. The majority (72.6%) received single-fraction SRS to a median dose of 18 Gy. Clear clustering of cases by institutional origin was observed on t-SNE analysis. 21 T1c and 4 FLAIR features were excluded from subsequent modeling due to significant correlation with institutional origin. Backward elimination yielded 6 T1c and 6 FLAIR features for model construction. A random forest model based on the 6 FLAIR features (cluster shade, neighborhood gray tone difference matrix (NGTDM) coarseness, NGTDM texture strength, run length nonuniformity, run percentage, and short run high gray-level emphasis) achieved a sensitivity of 76% and specificity of 70% in the training cohort (AUC 0.74, 95% CI 0.60–0.88), and a sensitivity of 67% and specificity of 83% in the testing cohort (AUC 0.75, 95% CI 0.59–0.93). Addition of the T1c features resulted in overfitting to the training cohort (AUC 1.00) but did not improve model performance on the testing cohort (AUC 0.69, 95% CI 0.51–0.87). CONCLUSION MRI radiomics-based machine learning can distinguish RN from TP after brain SRS in a heterogeneous image dataset.
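
A hedged sketch of two of the steps described above, using synthetic "radiomic" features: t-SNE to inspect clustering by institution, then a random forest on a small feature subset scored by ROC AUC. Feature values, subset sizes, and labels are made up.

```python
# Synthetic illustration of the abstract's workflow: (1) t-SNE embedding to check
# whether cases cluster by institution (batch effect), (2) random forest on a
# reduced feature set evaluated by ROC AUC on a held-out third of the data.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
n = 135
institution = rng.integers(0, 2, n)                         # 0/1: the two centers
X = rng.normal(size=(n, 30)) + institution[:, None] * 0.8   # features with a batch effect
y = (X[:, 0] + rng.normal(size=n) > 1.0).astype(int)        # 1 = radiation necrosis (toy)

embedding = TSNE(n_components=2, random_state=5).fit_transform(X)
print("2-D t-SNE embedding shape:", embedding.shape)   # plot vs. `institution` to spot clustering

X_tr, X_te, y_tr, y_te = train_test_split(X[:, :6], y, test_size=1/3,
                                          stratify=y, random_state=5)
rf = RandomForestClassifier(n_estimators=300, random_state=5).fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```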


2022 ◽  
Vol 12 ◽  
Author(s):  
Shaowu Lin ◽  
Yafei Wu ◽  
Ya Fang

Background Depression is highly prevalent and considered the most common psychiatric disorder among home-based elderly people, yet studies on forecasting depression risk in the elderly remain limited. In an effort to improve the accuracy of depression forecasting, machine learning (ML) approaches have been recommended in addition to more traditional regression approaches. Methods A prospective study was conducted in home-based elderly Chinese, using baseline (2011) and follow-up (2013) data of the China Health and Retirement Longitudinal Study (CHARLS), a nationally representative cohort study. We compared four algorithms: three regression-based models (logistic regression, lasso, and ridge) and one ML method (random forest). Model performance was assessed using repeated nested 10-fold cross-validation. As the main measure of predictive performance, we used the area under the receiver operating characteristic curve (AUC). Results The mean AUCs of the four predictive models, logistic regression, lasso, ridge, and random forest, were 0.795, 0.794, 0.794, and 0.769, respectively. The main determinants were life satisfaction, self-reported memory, cognitive ability, ADL (activities of daily living) impairment, and CESD-10 score. Life satisfaction increased the odds of a future depression by 128.6% (logistic), 13.8% (lasso), and 13.2% (ridge), and cognitive ability was the most important predictor in the random forest. Conclusions The three regression-based models and the one ML algorithm performed equally well in differentiating between a future depression case and a non-depression case in home-based elderly. When choosing a model, however, other considerations, such as ease of use, might in some instances lead to one model being prioritized over another.
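
A simplified sketch of the model comparison on synthetic data follows; the study's repeated nested 10-fold cross-validation is reduced here to repeated (non-nested) 10-fold cross-validation, and the CHARLS predictors are replaced by random features.

```python
# Sketch: compare logistic regression, lasso, ridge, and random forest by
# cross-validated AUC on synthetic data. Nested hyperparameter tuning is omitted.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(3000, 20))                              # baseline predictors (toy)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=3000)) > 1    # future depression (toy label)

models = {
    "logistic":      LogisticRegression(max_iter=1000),
    "lasso":         LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
    "ridge":         LogisticRegression(penalty="l2", C=0.1, max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=6),
}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=6)
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    print(f"{name:>13}: mean AUC = {auc:.3f}")
```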


2020 ◽  
Vol 18 (4) ◽  
pp. 11-27
Author(s):  
Petr I. Zhukov ◽  
Anton I. Glushchenko ◽  
Andrey V. Fomin

The scope of this research is the prediction of the surface temperature that a cast billet will have in the rolling mill after the heating process. The main problem is that such a prediction is needed before the billet actually leaves the furnace. In many cases, the boundary value problem of heat transfer, in particular the differential equations of transient heat conduction, is used to solve this problem. In this research, however, an alternative data-driven approach is proposed, based on a model of the dependence of the billet temperature on the retrospective record of its heating in the continuous furnace. Such a model is developed by analyzing data from the furnace control system. Data from the real furnace were collected and stored in a data warehouse, and their exploratory analysis was conducted. All data were split into training, testing, and validation subsets. As part of this research, the regression model previously developed by the authors was also validated. It turned out to be overfitted (the error on the test set was significantly higher than that on the training set). To overcome this disadvantage, an alternative method of developing the required data-based model is proposed by the authors, based on boosting and bagging algorithms from the field of machine learning. As a result of the experiments with bagging and boosting, the required model structure was chosen to be a random forest with a special class of regression trees known as DART (Dropout Adaptive Regression Trees). Based on a significant number of experiments with this model, two confidence intervals for the temperature prediction were found: 68 % and 95 %. The mean temperature prediction error was estimated as ~9 °C for both the test and validation sets.
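
The prediction-interval idea can be sketched as follows. Scikit-learn's GradientBoostingRegressor is used here as a stand-in (DART-style trees, as chosen by the authors, are available in libraries such as XGBoost or LightGBM rather than scikit-learn), and the data are synthetic rather than furnace records.

```python
# Hedged sketch on synthetic data: a boosted tree ensemble with empirical
# 68 % / 95 % prediction intervals taken from held-out absolute residuals.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(5000, 10))                                   # heating-history features (toy)
y = 1150 + 30 * X[:, 0] - 20 * X[:, 1] + rng.normal(0, 9, 5000)   # billet surface temperature, °C

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=7)
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                  random_state=7).fit(X_tr, y_tr)

residuals = np.abs(y_val - model.predict(X_val))
print("mean absolute error, °C:     ", residuals.mean())
print("68 % interval half-width, °C:", np.quantile(residuals, 0.68))
print("95 % interval half-width, °C:", np.quantile(residuals, 0.95))
```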


2020 ◽  
Vol 79 (Suppl 1) ◽  
pp. 1282.2-1283
Author(s):  
Y. Zhao ◽  
R. Mu ◽  
X. LI ◽  
H. Sun ◽  
C. MI ◽  
...  

Background: Flare, i.e., relapse from a status of treat-to-target (T2T, DAS28 <= 3.2), is hard to predict. We try to make it predictable by applying machine learning to a database from the Smart System of Disease Management (SSDM), an interactive mobile disease management app. Objectives: To develop and validate machine learning algorithms for flare prediction in RA. Methods: Patients were trained to use SSDM and input their data, including demographics, comorbidities (COMBs), lab tests, medications, and monthly self-assessments, including DAS28, HAQ, SF-36, and the Hospital Anxiety and Depression Scale (HADS). The data were uploaded to the cloud and synchronized to the mobile devices of authorized rheumatologists. The COMBs were coded by ICD-9, and medications were listed as cDMARDs, Bio (BioDMARDs), NSAIDs, Steroid, FS (food supplements), MC (medicine for COMBs), TCM (Traditional Chinese Medicine), and combinations. Results: From January 2015 to January 2020, 8811 RA patients (85% female and 15% male) reached T2T. 4556 remained flare-free and 4255 suffered at least one flare. On average, 160 attributes were extracted from each flare-free patient at the time of reaching T2T, and from each flare patient 3 months before the flare. Patients were randomly assigned to a model setup (training) group (70%) and a validation (testing) group (30%). For training, data were processed using Python, with statistical analyses in R. Random forests were implemented in R; logistic regression was run via glm in base R. The random forest comprises a set of decision trees, in which "splits" reflect binary (i.e., yes/no) decisions with respect to attributes. Bootstrapping was used to assess, quantify, and adjust for model optimism. Model performance was evaluated using AUC, precision, and recall metrics. Brier scores for the accuracy of probabilistic predictions range from 0 to 1 (0 is perfect discrimination). On testing, model performance for the prediction window was 0.78 for AUC (95% CI), 0.71 for recall (sensitivity), 0.195 for Brier score, and 0.68 for precision (true positives 893, false positives 417, false negatives 367, true negatives 966). Based on weightings in the random forest, the top 10 pro-flare attributes were CRP, swollen joint count (SJC), tender joint count (TJC), HAQ, DAS28, morning stiffness, gout, MCTD, OA, and duration, while the top 10 anti-flare attributes were cDMARDs+Bio, cDMARDs+steroid+NSAIDs, stable HAQ, stable morning stiffness, stable SJC, medicine for COMBs, cDMARDs+TCM, stable TJC, stable ESR, and income of 100-200k (Fig. 1). The top-weighted COMBs for pro-flaring were gout (0.81), MRD (0.75), OA (0.56), and AS (0.48). Monotherapy with either Bio, NSAIDs, steroid, or TCM was pro-flare, while monotherapy with cDMARDs was anti-flare (-0.21). Conclusion: The attempt to develop a machine learning algorithm for RA flare prediction was successful. The discrimination was acceptable. The attributes of both pro-flare and anti-flare are identified, which may inspire proactive intervention. Acknowledgments: SSDM was developed by Shanghai Gothic Internet Technology Co., Ltd. Disclosure of Interests: None declared
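
A sketch of the reported evaluation metrics on a synthetic flare-prediction task; the ~160 SSDM attributes are replaced by random features, and scikit-learn replaces the R implementation used in the study.

```python
# Synthetic illustration: a random forest flare classifier scored with the same
# metrics reported in the abstract (AUC, recall, precision, Brier score).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_auc_score, recall_score, precision_score,
                             brier_score_loss)

rng = np.random.default_rng(8)
X = rng.normal(size=(8811, 160))                              # patient attributes (toy)
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=8811)) > 0     # 1 = flare within window (toy)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, stratify=y, random_state=8)
rf = RandomForestClassifier(n_estimators=200, random_state=8).fit(X_tr, y_tr)

proba = rf.predict_proba(X_te)[:, 1]
pred = rf.predict(X_te)
print("AUC:      ", roc_auc_score(y_te, proba))
print("Recall:   ", recall_score(y_te, pred))
print("Precision:", precision_score(y_te, pred))
print("Brier:    ", brier_score_loss(y_te, proba))
```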


Author(s):  
Tammy Jiang ◽  
Jaimie L Gradus ◽  
Timothy L Lash ◽  
Matthew P Fox

Abstract Although variables are often measured with error, the impact of measurement error on machine learning predictions is seldom quantified. The purpose of this study was to assess the impact of measurement error on random forest model performance and variable importance. First, we assessed the impact of misclassification of predictors (i.e., measurement error in categorical variables) on random forest model performance (e.g., accuracy, sensitivity) and variable importance (mean decrease in accuracy) using data from the United States National Comorbidity Survey Replication (2001–2003). Second, we simulated datasets in which the true model performance and variable importance measures were known and verified that quantitative bias analysis recovered the truth in misclassified versions of the datasets. Our findings show that measurement error in the data used to construct random forests can distort model performance and variable importance measures, and that bias analysis can recover the correct results. This study highlights the utility of applying quantitative bias analysis in machine learning to quantify the impact of measurement error on study results.
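
An illustrative sketch of the first analysis on synthetic data: flip a fraction of a binary predictor's values to mimic non-differential misclassification, then compare random forest accuracy and (permutation-based) variable importance with and without the error. The dataset, flip rate, and importance measure are stand-ins for those used in the study.

```python
# Synthetic example: how misclassifying a binary predictor distorts random-forest
# accuracy and the predictor's permutation importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)
n = 3000
x_true = rng.integers(0, 2, n)                    # binary predictor, gold standard
noise = rng.normal(size=(n, 4))                   # other predictors
y = ((x_true + noise[:, 0]) > 0.5).astype(int)    # outcome depends on the binary predictor

def rf_results(x_col, label):
    X = np.column_stack([x_col, noise])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=9)
    rf = RandomForestClassifier(n_estimators=200, random_state=9).fit(X_tr, y_tr)
    imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=9)
    print(f"{label}: accuracy = {rf.score(X_te, y_te):.3f}, "
          f"importance of predictor = {imp.importances_mean[0]:.3f}")

rf_results(x_true, "no misclassification ")
flip = rng.random(n) < 0.20                       # 20% non-differential misclassification
rf_results(np.where(flip, 1 - x_true, x_true), "20% misclassification")
```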

