Broken Rail Prediction With Machine Learning-Based Approach

Author(s):  
Zhipeng Zhang ◽  
Kang Zhou ◽  
Xiang Liu

Abstract Broken rails are the most frequent cause of freight train derailments in the United States. According to the U.S. Federal Railroad Administration (FRA) railroad accident database, there were over 900 Class I railroad freight-train derailments caused by broken rails between 2000 and 2017. In 2017 alone, broken-rail-caused freight train derailments cost Class I railroads $15.8 million in track and rolling stock damage. Preventing broken rails is therefore crucial for reducing the risk of broken-rail-caused derailments. Although big data is growing rapidly in the railroad industry, little prior research has taken advantage of these data to uncover the relationship between real-world factors and broken rail occurrence. This article aims to predict the occurrence of broken rails via a machine learning approach that simultaneously accounts for track files, traffic information, maintenance history, and prior defect information. For this prediction task, a machine learning algorithm called extreme gradient boosting (XGBoost) is developed with various types of variables, including track characteristics (e.g. rail profile and rail laid information), traffic-related information (e.g. gross tonnage recorded over time, number of passing cars), maintenance records (e.g. rail grinding and track ballast cleaning), and historical rail defect records. The area under the curve (AUC) is used as the evaluation metric to quantify the prediction accuracy of the developed machine learning model. The preliminary result shows that the AUC of the one-year XGBoost-based prediction model is 0.83, higher than that of two comparative models, logistic regression and random forests. Furthermore, the feature importance analysis shows that segment length, traffic tonnage, number of car passes, rail age, and the number of defects detected in the past six months are relatively more important for the prediction of broken rails. The prediction model and its outcomes, along with future research into the relationship between broken rails and broken-rail-caused derailments, can benefit railroads' practical maintenance planning and capital planning.
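
As a rough illustration of the modeling step described above, the sketch below trains an XGBoost classifier on segment-level features, scores it with AUC, and ranks feature importances. The feature names (segment_length, annual_tonnage, car_passes, rail_age, defects_past_6mo) and the synthetic data are placeholders inspired by the abstract, not the study's actual dataset or variables.

```python
# Minimal sketch: XGBoost classifier evaluated by AUC, with a feature-importance ranking.
# Synthetic, illustrative data only.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 5000
X = pd.DataFrame({
    "segment_length": rng.uniform(0.1, 5.0, n),          # hypothetical segment length
    "annual_tonnage": rng.uniform(1, 100, n),             # hypothetical gross tonnage
    "car_passes": rng.integers(1_000, 1_000_000, n),
    "rail_age": rng.uniform(0, 50, n),
    "defects_past_6mo": rng.poisson(0.3, n),
})
# Synthetic label loosely tied to the features so the example trains meaningfully.
logit = 0.02 * X["annual_tonnage"] + 0.05 * X["rail_age"] + 0.8 * X["defects_past_6mo"] - 4
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05, eval_metric="auc")
model.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"test AUC: {auc:.3f}")
print(pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False))
```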

Author(s):  
Jayeshkumar Patel ◽  
Amit Ladani ◽  
Nethra Sambamoorthi ◽  
Traci LeMasters ◽  
Nilanjana Dwibedi ◽  
...  

Evidence from some studies suggests that osteoarthritis (OA) patients are often prescribed non-steroidal anti-inflammatory drugs (NSAIDs) that are not in accordance with their cardiovascular (CV) or gastrointestinal (GI) risk profiles. However, no such study has been carried out in the United States. Therefore, we sought to examine the prevalence and predictors of potentially inappropriate NSAID use in older adults (age > 65) with OA using machine learning with real-world data from the Optum De-identified Clinformatics® Data Mart. We identified a retrospective cohort of eligible individuals using data from 2015 (baseline) and 2016 (follow-up). Potentially inappropriate NSAID use was identified using the type (COX-2 selective vs. non-selective) and length of NSAID use and an individual’s CV and GI risk. Predictors of potentially inappropriate NSAID use were identified using eXtreme Gradient Boosting. Our study cohort comprised 44,990 individuals (mean age 75.9 years). We found that 12.8% of individuals had potentially inappropriate NSAID use, but the rate was disproportionately higher (44.5%) in individuals at low CV/high GI risk. A longer duration of NSAID use during baseline (AOR 1.02; 95% CI: 1.02–1.02 for both non-selective and selective NSAIDs) was associated with a higher risk of potentially inappropriate NSAID use. Additionally, individuals with low CV/high GI risk (AOR 1.34; 95% CI: 1.20–1.50) and high CV/low GI risk (AOR 1.61; 95% CI: 1.34–1.93) were also more likely to have potentially inappropriate NSAID use. Heightened surveillance of older adults with OA requiring NSAIDs is warranted.
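
The abstract does not spell out the exact criteria used to flag potentially inappropriate NSAID use, so the snippet below is only a hypothetical, simplified version of such a rule: it combines NSAID type, duration of use, and CV/GI risk into a single flag. The 90-day cut-off and the specific rule combinations are invented purely for illustration.

```python
# Hypothetical sketch of a rule-based flag for potentially inappropriate NSAID use.
# The study's actual criteria are not given in the abstract; thresholds are placeholders.
from dataclasses import dataclass

@dataclass
class NsaidExposure:
    cox2_selective: bool      # COX-2 selective vs. non-selective NSAID
    days_supplied: int        # length of NSAID use
    high_cv_risk: bool        # cardiovascular risk profile
    high_gi_risk: bool        # gastrointestinal risk profile

def potentially_inappropriate(e: NsaidExposure, long_use_days: int = 90) -> bool:
    """Placeholder logic: long non-selective use despite high GI risk, or long
    COX-2 selective use despite high CV risk, is flagged as potentially inappropriate."""
    long_use = e.days_supplied >= long_use_days
    if long_use and e.high_gi_risk and not e.cox2_selective:
        return True
    if long_use and e.high_cv_risk and e.cox2_selective:
        return True
    return False

print(potentially_inappropriate(NsaidExposure(False, 120, False, True)))  # True
```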


PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0253988
Author(s):  
Akihiro Shimoda ◽  
Yue Li ◽  
Hana Hayashi ◽  
Naoki Kondo

Due to the difficulty of early diagnosis of Alzheimer’s disease (AD) related to cost and differentiation capability, it is necessary to identify low-cost, accessible, and reliable tools for identifying AD risk in the preclinical stage. We hypothesized that cognitive ability, as expressed in the vocal features of daily conversation, is associated with AD progression. Thus, we developed a novel machine learning prediction model to identify AD risk using the rich voice data collected from daily conversations, and evaluated its predictive performance in comparison with a classification method based on the Japanese version of the Telephone Interview for Cognitive Status (TICS-J). We used 1,465 audio files from 99 healthy controls (HC) and 151 audio files from 24 AD patients, derived from a dementia prevention program conducted by Hachioji City, Tokyo, between March and May 2020. After extracting vocal features from each audio file, we developed machine-learning models based on extreme gradient boosting (XGBoost), random forest (RF), and logistic regression (LR), using each audio file as one observation. We evaluated the predictive performance of the developed models by describing the receiver operating characteristic (ROC) curve and calculating the areas under the curve (AUCs), sensitivity, and specificity. Further, we conducted classifications by considering each participant as one observation, computing the average predictive value of their audio files, and comparing the result with the predictive performance of the TICS-J-based questionnaire. Of 1,616 audio files in total, 1,308 (81.0%) were randomly allocated to the training data and 308 (19.1%) to the validation data. For audio file-based prediction, the AUCs for XGBoost, RF, and LR were 0.863 (95% confidence interval [CI]: 0.794–0.931), 0.882 (95% CI: 0.840–0.924), and 0.893 (95% CI: 0.832–0.954), respectively. For participant-based prediction, the AUCs for XGBoost, RF, LR, and TICS-J were 1.000 (95% CI: 1.000–1.000), 1.000 (95% CI: 1.000–1.000), 0.972 (95% CI: 0.918–1.000), and 0.917 (95% CI: 0.918–1.000), respectively. The difference in predictive accuracy between XGBoost and TICS-J approached statistical significance (p = 0.065). Our novel prediction model using the vocal features of daily conversations demonstrated the potential to be useful for AD risk assessment.
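
A minimal sketch of the participant-level aggregation step described above: file-level predicted probabilities are averaged per participant before computing the participant-level AUC. The probabilities and labels are synthetic; extracting vocal features and training the underlying classifier are outside the scope of this snippet.

```python
# Sketch: aggregate audio-file-level predicted probabilities to one value per participant,
# then score the participant-level AUC. Synthetic, illustrative data only.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n_participants, n_files = 60, 400
labels = pd.Series(rng.integers(0, 2, n_participants), name="is_ad")   # per-participant label
participant_id = rng.integers(0, n_participants, n_files)
# Pretend file-level probabilities from a trained classifier, loosely tied to the label.
file_prob = np.clip(0.35 * labels.values[participant_id] + 0.5 * rng.random(n_files), 0, 1)
files = pd.DataFrame({"participant_id": participant_id, "file_prob": file_prob})

# Average each participant's file-level probabilities into a single score.
per_participant = files.groupby("participant_id")["file_prob"].mean()
y_true = labels.loc[per_participant.index]
print("participant-level AUC:", round(roc_auc_score(y_true, per_participant), 3))
```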


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. e16801-e16801
Author(s):  
Daniel R Cherry ◽  
Qinyu Chen ◽  
James Don Murphy

e16801 Background: Pancreatic cancer has an insidious presentation, with four in five patients presenting with disease not amenable to potentially curative surgery. Efforts to screen patients for pancreatic cancer using population-wide strategies have proven ineffective. We applied a machine learning approach to create an early prediction model drawing on the content of patients’ electronic health records (EHRs). Methods: We used patient data from OptumLabs, which included de-identified data extracted from patient EHRs collected between 2009 and 2017. We identified patients diagnosed with pancreatic cancer at age 40 or later, whom we categorized into early-stage pancreatic cancer (ESPC; n = 3,322) and late-stage pancreatic cancer (LSPC; n = 25,908) groups. ESPC cases were matched to non-pancreatic cancer controls in a ratio of 1:16 based on diagnosis year and geographic division, and the cohort was divided into training (70%) and test (30%) sets. The prediction model was built by applying an eXtreme Gradient Boosting machine learning algorithm to ESPC patients’ EHRs in the year preceding diagnosis, with features including patient demographics, procedure and clinical diagnosis codes, clinical notes, and medications. Model discrimination was assessed with sensitivity, specificity, positive predictive value (PPV), and area under the curve (AUC), with a score of 1.0 indicating perfect prediction. Results: The final AUC in the test set was 0.841, and the model included 583 features, of which 248 (42.5%) were physician note elements, 146 (25.0%) were procedure codes, 91 (15.6%) were diagnosis codes, 89 (15.3%) were medications, and 9 (1.5%) were demographic features. The most important features were history of pancreatic disorders (not diabetes or cancer), age, income, biliary tract disease, education level, obstructive jaundice, and abdominal pain. We evaluated model performance at varying classification thresholds. When applied to patients over 40, choosing a threshold with a sensitivity of 20% produced a specificity of 99.9% and a PPV of 2.5%. The model PPV increased with age; for patients over 80, PPV was 8.0%. LSPC patients identified by the model would have been detected a median of 4 months before their actual diagnosis, with a quarter of these patients identified at least 14 months earlier. Conclusions: Using EHR data to identify early-stage pancreatic cancer patients shows promise. While widespread use of this approach on an unselected population would produce high rates of false positives, this technique could be employed among high-risk patients, or paired with other screening tools.
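
A small sketch of the threshold-evaluation step: given predicted risk scores, pick the operating point that reaches a target sensitivity (20% here) and report the resulting specificity and PPV, as one would when screening a low-prevalence population. The scores, prevalence, and resulting numbers are synthetic and will not reproduce the study's figures.

```python
# Sketch: choose a classification threshold that reaches a target sensitivity and
# report specificity and PPV at that operating point. Synthetic scores only.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
n = 100_000
y = (rng.random(n) < 0.002).astype(int)                 # rare outcome (illustrative prevalence)
scores = rng.normal(0, 1, n) + 1.5 * y                  # synthetic predicted risk scores

fpr, tpr, thresholds = roc_curve(y, scores)
target_sensitivity = 0.20
idx = np.argmax(tpr >= target_sensitivity)              # first threshold reaching the target
threshold = thresholds[idx]

pred = scores >= threshold
tp = np.sum(pred & (y == 1)); fp = np.sum(pred & (y == 0))
tn = np.sum(~pred & (y == 0)); fn = np.sum(~pred & (y == 1))
print(f"threshold={threshold:.2f}  sensitivity={tp/(tp+fn):.2f}  "
      f"specificity={tn/(tn+fp):.3f}  PPV={tp/(tp+fp):.3f}")
```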


Author(s):  
Ryan J. McGuire ◽  
Sean C. Yu ◽  
Philip R. O. Payne ◽  
Albert M. Lai ◽  
M. Cristina Vazquez-Guillamet ◽  
...  

Infection caused by carbapenem-resistant (CR) organisms is a rising problem in the United States. While the risk factors for antibiotic resistance are well known, there remains a large need for early identification of antibiotic-resistant infections. Using machine learning (ML), we sought to develop a prediction model for carbapenem resistance. All patients >18 years of age admitted to a tertiary-care academic medical center between Jan 1, 2012 and Oct 10, 2017 with ≥1 bacterial culture were eligible for inclusion. All demographic, medication, vital sign, procedure, laboratory, and culture/sensitivity data were extracted from the electronic health record. Organisms were considered CR if a single isolate was reported as intermediate or resistant. CR and non-CR patients were temporally matched to maintain the positive/negative case ratio. Extreme gradient boosting was used for model development. In total, 68,472 patients met the inclusion criteria, and 1,088 CR patients were identified. Sixty-seven features were used for predictive modeling. The most important features were the number of prior antibiotic days, recent central venous catheter placement, and inpatient surgery. After model training, the area under the receiver operating characteristic curve was 0.846. The sensitivity of the model was 30%, with a positive predictive value (PPV) of 30% and a negative predictive value of 99%. Using readily available clinical data, we were able to create an ML model capable of predicting CR infections at the time of culture collection with a high PPV.
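
One common way to handle the strong class imbalance described above (roughly 1,088 CR patients among 68,472) is XGBoost's scale_pos_weight; the sketch below shows that approach together with sensitivity/PPV/NPV reporting on synthetic data. This is an assumption made for illustration, not necessarily how the authors configured their model.

```python
# Sketch: XGBoost on an imbalanced cohort with scale_pos_weight, reporting
# AUROC, sensitivity, PPV, and NPV. Synthetic data; not the study's feature set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=30_000, n_features=20, weights=[0.985], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

pos_weight = (y_tr == 0).sum() / (y_tr == 1).sum()      # up-weight the rare positive class
model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                      scale_pos_weight=pos_weight, eval_metric="auc").fit(X_tr, y_tr)

prob = model.predict_proba(X_te)[:, 1]
pred = (prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print(f"AUROC={roc_auc_score(y_te, prob):.3f}  sensitivity={tp/(tp+fn):.2f}  "
      f"PPV={tp/(tp+fp):.2f}  NPV={tn/(tn+fn):.2f}")
```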


Mathematics ◽  
2020 ◽  
Vol 8 (9) ◽  
pp. 1590
Author(s):  
Muhammad Syafrudin ◽  
Ganjar Alfian ◽  
Norma Latif Fitriyani ◽  
Muhammad Anshari ◽  
Tony Hadibarata ◽  
...  

Detecting self-care problems is one of the most important and challenging issues for occupational therapists, since it requires a complex and time-consuming process. Machine learning algorithms have recently been applied to overcome this issue. In this study, we propose a self-care prediction model called GA-XGBoost, which combines genetic algorithms (GAs) with extreme gradient boosting (XGBoost) for predicting self-care problems of children with disabilities. Because the selected feature subset affects model performance, we use a GA to find the optimal feature subsets and thereby improve the model’s performance. To validate the effectiveness of GA-XGBoost, we present six experiments: comparing GA-XGBoost with other machine learning models and with previous study results, a statistical significance test, an impact analysis of feature selection, a comparison with other feature selection methods, and a sensitivity analysis of GA parameters. During the experiments, we use accuracy, precision, recall, and F1-score to measure the performance of the prediction models. The results show that GA-XGBoost performs better than the other prediction models and the previous study results. In addition, we design and develop a web-based self-care prediction application to help therapists diagnose the self-care problems of children with disabilities, so that appropriate treatment/therapy can be provided to each child to improve their therapeutic outcome.
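
A compact sketch of the general GA-XGBoost idea: a genetic algorithm evolves binary feature masks, and each mask's fitness is the cross-validated performance of an XGBoost model trained on the selected features. The population size, mutation rate, selection scheme, and dataset below are all illustrative choices, not the study's settings.

```python
# Sketch: genetic-algorithm feature-subset selection wrapped around XGBoost.
# Small, illustrative GA settings; synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1_000, n_features=30, n_informative=8, random_state=0)

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    clf = XGBClassifier(n_estimators=60, max_depth=3, eval_metric="logloss")
    return cross_val_score(clf, X[:, mask], y, cv=3, scoring="accuracy").mean()

pop_size, n_gen, mut_rate = 12, 5, 0.05
pop = rng.random((pop_size, X.shape[1])) < 0.5          # binary chromosomes: feature on/off

for gen in range(n_gen):
    scores = np.array([fitness(ind) for ind in pop])
    order = np.argsort(scores)[::-1]
    parents = pop[order[: pop_size // 2]]               # truncation selection
    children = []
    while len(children) < pop_size - len(parents):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, X.shape[1])                # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        child ^= rng.random(X.shape[1]) < mut_rate       # bit-flip mutation
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best), " fitness:", round(fitness(best), 3))
```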


Diagnostics ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. 943
Author(s):  
Joung Ouk (Ryan) Kim ◽  
Yong-Suk Jeong ◽  
Jin Ho Kim ◽  
Jong-Weon Lee ◽  
Dougho Park ◽  
...  

Background: This study proposes a cardiovascular disease (CVD) prediction model using machine learning (ML) algorithms based on the National Health Insurance Service-Health Screening datasets. Methods: We extracted 4699 patients aged over 45 as the CVD group, diagnosed according to the International Classification of Diseases system (I20–I25). In addition, 4699 random subjects without a CVD diagnosis were enrolled as a non-CVD group. Both groups were matched by age and gender. Various ML algorithms were applied to perform CVD prediction, and the performances of all the prediction models were compared. Results: The extreme gradient boosting, gradient boosting, and random forest algorithms exhibited the best average prediction accuracy (area under the receiver operating characteristic curve (AUROC): 0.812, 0.812, and 0.811, respectively) among all algorithms validated in this study. Based on AUROC, the ML algorithms improved CVD prediction performance compared to previously proposed prediction models. Preexisting CVD history was the most important factor contributing to the accuracy of the prediction model, followed by total cholesterol, low-density lipoprotein cholesterol, waist-height ratio, and body mass index. Conclusions: Our results indicate that the proposed health screening dataset-based CVD prediction model using ML algorithms is readily applicable, produces validated results, and outperforms previous CVD prediction models.
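
A minimal sketch of the model-comparison step: several classifiers are scored on the same data by cross-validated AUROC. The synthetic data merely stands in for the age- and gender-matched health-screening cohort, and the hyperparameters are illustrative.

```python
# Sketch: comparing several classifiers by cross-validated AUROC on the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=4_000, n_features=15, n_informative=6, random_state=0)

models = {
    "xgboost": XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss"),
    "gradient_boosting": GradientBoostingClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    auroc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:>20s}  AUROC={auroc:.3f}")
```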


Electronics ◽  
2022 ◽  
Vol 11 (2) ◽  
pp. 218
Author(s):  
SaravanaKumar Venkatesan ◽  
Jonghyun Lim ◽  
Hoon Ko ◽  
Yongyun Cho

Context: Energy utilization is one of the factors most closely tied to many areas of the smart farm, affecting plant growth, crop production, device automation, and energy supply alike. Recently, fourth-industrial-revolution technologies such as IoT, artificial intelligence, and big data have been widely used in smart farm environments to use energy efficiently and to control smart farm conditions. In particular, machine learning with big data analysis is actively used as one of the most potent prediction methods supporting energy use in the smart farm. Purpose: This study proposes a machine learning-based prediction model for peak energy use, based on energy-related data collected from various environmental and growth devices in a smart paprika farm of the Jeonnam Agricultural Research and Extension Service in South Korea between 2019 and 2021. Scientific method: To find the best-performing prediction model, comparative evaluation tests are performed using representative ML algorithms such as artificial neural networks, support vector regression, random forest, K-nearest neighbors, extreme gradient boosting, and gradient boosting machine, as well as the time series algorithm ARIMA, with binary classification and different numbers of input features. Validate: This article provides an effective and viable way for smart farm managers or greenhouse farmers to better manage agricultural energy economically and environmentally. We therefore hope that the recommended ML method will help improve the smart farm’s energy use and energy policies in various fields related to agricultural energy. Conclusion: Seven performance metrics, including R-squared, root mean squared error, and mean absolute error, are used to compare the algorithms. The random forest-based model is the most successful, with a prediction accuracy of 92%. Therefore, the proposed model may contribute to the development of various applications for energy usage in a smart farm, such as a notification service for energy usage peak times or energy usage control for each device.
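
A brief sketch of how candidate regressors for peak-energy prediction can be compared with R-squared, RMSE, and MAE on a hold-out split. The synthetic features stand in for the farm's environmental and growth-device measurements; the models and hyperparameters are illustrative, not the study's configuration.

```python
# Sketch: scoring regressors with R2, RMSE, and MAE on a hold-out split. Synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=3_000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

regressors = {
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "xgboost": XGBRegressor(n_estimators=300, max_depth=4),
}
for name, reg in regressors.items():
    pred = reg.fit(X_tr, y_tr).predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    print(f"{name:>13s}  R2={r2_score(y_te, pred):.3f}  "
          f"RMSE={rmse:.2f}  MAE={mean_absolute_error(y_te, pred):.2f}")
```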


2019 ◽  
Author(s):  
Kasper Van Mens ◽  
Joran Lokkerbol ◽  
Richard Janssen ◽  
Robert de Lange ◽  
Bea Tiemens

BACKGROUND It remains a challenge to predict which treatment will work for which patient in mental healthcare. OBJECTIVE In this study, we compare machine learning algorithms that predict, during treatment, which patients will not benefit from brief mental health treatment, and we present trade-offs that must be considered before an algorithm can be used in clinical practice. METHODS Using an anonymized dataset containing routine outcome monitoring data from a mental healthcare organization in the Netherlands (n = 2,655), we applied three machine learning algorithms to predict treatment outcome. The algorithms were internally validated with cross-validation on a training sample (n = 1,860) and externally validated on an unseen test sample (n = 795). RESULTS The performance of the three algorithms did not differ significantly on the test set. With a default classification cut-off at a predicted probability of 0.5, the extreme gradient boosting algorithm showed the highest positive predictive value (PPV) of 0.71 (0.61–0.77), with a sensitivity of 0.35 (0.29–0.41) and an area under the curve of 0.78. A trade-off can be made between PPV and sensitivity by choosing different cut-off probabilities. With a cut-off at 0.63, the PPV increased to 0.87 and the sensitivity dropped to 0.17. With a cut-off at 0.38, the PPV decreased to 0.61 and the sensitivity increased to 0.57. CONCLUSIONS Machine learning can be used to predict treatment outcomes based on routine monitoring data. This allows practitioners to choose their own trade-off between being selective and more certain versus inclusive and less certain.
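
A small sketch of the cut-off trade-off discussed above: sweeping a few probability cut-offs and reading off PPV and sensitivity at each. The predicted probabilities here are synthetic, so the numbers will not match the reported values.

```python
# Sketch: PPV/sensitivity trade-off at different probability cut-offs. Synthetic data.
import numpy as np

rng = np.random.default_rng(3)
n = 800
y = rng.integers(0, 2, n)
prob = np.clip(0.35 * y + rng.beta(2, 3, n), 0, 1)      # synthetic predicted probabilities

for cutoff in (0.38, 0.50, 0.63):
    pred = prob >= cutoff
    tp = np.sum(pred & (y == 1)); fp = np.sum(pred & (y == 0)); fn = np.sum(~pred & (y == 1))
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sens = tp / (tp + fn) if tp + fn else float("nan")
    print(f"cut-off {cutoff:.2f}: PPV={ppv:.2f}, sensitivity={sens:.2f}")
```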


2021 ◽  
Vol 13 (5) ◽  
pp. 1021
Author(s):  
Hu Ding ◽  
Jiaming Na ◽  
Shangjing Jiang ◽  
Jie Zhu ◽  
Kai Liu ◽  
...  

Artificial terraces are of great importance for agricultural production and soil and water conservation. Automatic, high-accuracy mapping of artificial terraces is the basis of monitoring and related studies. Previous research achieved artificial terrace mapping based on high-resolution digital elevation models (DEMs) or imagery. Because contextual information is important for terrace mapping, object-based image analysis (OBIA) combined with machine learning (ML) technologies is widely used. However, the selection of an appropriate classifier is of great importance for the terrace mapping task. In this study, the performance of an integrated framework using OBIA and ML for terrace mapping was tested. A catchment, Zhifanggou, in the Loess Plateau, China, was used as the study area. First, optimized image segmentation was conducted. Then, features from the DEMs and imagery were extracted, and the correlations between the features were analyzed and ranked for classification. Finally, three commonly used ML classifiers, namely extreme gradient boosting (XGBoost), random forest (RF), and k-nearest neighbor (KNN), were used for terrace mapping. The comparison with the ground truth, as delineated by field survey, indicated that random forest performed best, with a 95.60% overall accuracy (followed by 94.16% and 92.33% for XGBoost and KNN, respectively). The influence of class imbalance and feature selection is discussed. This work provides a credible framework for mapping artificial terraces.
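
A minimal sketch of the classifier comparison: XGBoost, random forest, and KNN are trained on the same tabular features and compared by overall accuracy on held-out samples. The synthetic features stand in for the segment-level DEM and imagery features used in the study.

```python
# Sketch: comparing XGBoost, RF, and KNN by overall accuracy on a held-out set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2_000, n_features=12, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

classifiers = {
    "xgboost": XGBClassifier(n_estimators=200, max_depth=4, eval_metric="mlogloss"),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}
for name, clf in classifiers.items():
    acc = accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
    print(f"{name:>13s}  overall accuracy = {acc:.2%}")
```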


2021 ◽  
Vol 13 (6) ◽  
pp. 1147
Author(s):  
Xiangqian Li ◽  
Wenping Yuan ◽  
Wenjie Dong

To forecast the terrestrial carbon cycle and monitor food security, vegetation growth must be accurately predicted; however, current process-based ecosystem and crop-growth models are limited in their effectiveness. This study developed a machine learning model using the extreme gradient boosting method to predict vegetation growth throughout the growing season in China from 2001 to 2018. The model used satellite-derived vegetation data for the first month of each growing season, CO2 concentration, and several meteorological factors as data sources for the explanatory variables. Results showed that the model could reproduce the spatiotemporal distribution of vegetation growth as represented by the satellite-derived normalized difference vegetation index (NDVI). The predictive error for the growing season NDVI was less than 5% for more than 98% of vegetated areas in China; the model represented seasonal variations in NDVI well. The coefficient of determination (R2) between the monthly observed and predicted NDVI was 0.83, and more than 69% of vegetated areas had an R2 > 0.8. The effectiveness of the model was examined for a severe drought year (2009), and results showed that the model could reproduce the spatiotemporal distribution of NDVI even under extreme conditions. This model provides an alternative method for predicting vegetation growth and has great potential for monitoring vegetation dynamics and crop growth.
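
A rough sketch of the regression setup described above: an XGBoost regressor predicts growing-season NDVI from early-season NDVI, CO2 concentration, and meteorological drivers, and is scored by R-squared and the share of samples with relative error under 5%. All variables and data below are synthetic placeholders, not the satellite or meteorological records used in the study.

```python
# Sketch: XGBoost regression of growing-season NDVI with R2 and relative-error summary.
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(4)
n = 5_000
ndvi_first_month = rng.uniform(0.2, 0.8, n)
co2 = rng.uniform(380, 420, n)
temp = rng.uniform(5, 30, n)
precip = rng.uniform(0, 300, n)
# Synthetic target loosely driven by the predictors, plus noise.
ndvi_season = np.clip(0.6 * ndvi_first_month + 0.004 * temp + 0.0003 * precip
                      + rng.normal(0, 0.02, n), 0, 1)

X = np.column_stack([ndvi_first_month, co2, temp, precip])
X_tr, X_te, y_tr, y_te = train_test_split(X, ndvi_season, test_size=0.3, random_state=0)

pred = XGBRegressor(n_estimators=400, max_depth=4, learning_rate=0.05).fit(X_tr, y_tr).predict(X_te)
rel_err = np.abs(pred - y_te) / np.abs(y_te)
print(f"R2 = {r2_score(y_te, pred):.2f}; share with relative error < 5%: {(rel_err < 0.05).mean():.1%}")
```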

