Machine Learning Accurately Predicts Next Season NHL Player Injury Before It Occurs: Validation of 10,449 Player-Years from 2007-17

2020 ◽  
Vol 8 (7_suppl6) ◽  
pp. 2325967120S0036
Author(s):  
Audrey Wright ◽  
Jaret Karnuta ◽  
Bryan Luu ◽  
Heather Haeberle ◽  
Eric Makhni ◽  
...  

Objectives: With the accumulation of big data surrounding the National Hockey League (NHL) and the advent of advanced computational processors, machine learning (ML) is ideally suited to develop a predictive algorithm capable of imbibing historical data to accurately project a player’s future availability to play based on prior injury and performance. To leverage available analytics to permit data-driven injury prevention strategies and informed decisions for NHL franchises beyond static logistic regression (LR) analysis, the objective of this study of NHL players was to (1) characterize the epidemiology of publicly reported NHL injuries from 2007-17, (2) determine the validity of a machine learning model in predicting next-season injury risk for both goalies and non-goalies, and (3) compare the performance of modern ML algorithms versus LR analyses. Methods: Hockey player data were compiled for the years 2007 to 2017 from two publicly reported databases in the absence of an official NHL-approved database. Attributes acquired from each NHL player for each professional year included age, 85 player metrics, and injury history. A total of 5 ML algorithms were created for both non-goalie and goalie data: Random Forest, K-Nearest Neighbors, Naive Bayes, XGBoost, and Top 3 Ensemble. Logistic regression was also performed for both non-goalie and goalie data. Area under the receiver operating characteristic curve (AUC) primarily determined validation. Results: Player data were generated from 2,109 non-goalies and 213 goalies with an average follow-up of 4.5 years. The results are shown below in Table 1. For models predicting following-season injury risk for non-goalies, XGBoost performed the best with an AUC of 0.948, compared with an AUC of 0.937 for logistic regression. For models predicting following-season injury risk for goalies, XGBoost had the highest AUC with 0.956, compared with an AUC of 0.947 for LR. 
Conclusion: Advanced ML models such as XGBoost outperformed LR and demonstrated good to excellent capability of predicting whether a publicly reportable injury is likely to occur the next season. As more player-specific data become available, algorithm refinement may be possible to strengthen predictive insights and allow ML to offer quantitative risk management for franchises, present opportunity for targeted preventative intervention by medical personnel, and replace regression analysis as the new gold standard for predictive modeling. [Figure: see text]
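The validation setup described above can be sketched in a few lines: train a boosted-tree model and a logistic regression on player-year feature rows, then compare held-out AUCs. The data here is synthetic and scikit-learn's GradientBoostingClassifier stands in for XGBoost; feature meanings are illustrative, not the study's 85 metrics.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_players = 2000
# Illustrative stand-ins for age, prior games played, prior injury count, etc.
X = rng.normal(size=(n_players, 10))
# Synthetic next-season injury label, loosely driven by two features.
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n_players)
y = (logits > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC on held-out rows, the study's primary validation metric.
auc_gbm = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])
auc_lr = roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1])
print(f"boosted trees AUC={auc_gbm:.3f}  logistic regression AUC={auc_lr:.3f}")
```

On real player-year data the gap between the two AUCs, not the absolute numbers from this synthetic run, is the quantity of interest.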

2020 ◽  
Vol 8 (9) ◽  
pp. 232596712095340
Author(s):  
Bryan C. Luu ◽  
Audrey L. Wright ◽  
Heather S. Haeberle ◽  
Jaret M. Karnuta ◽  
Mark S. Schickendantz ◽  
...  

Background: The opportunity to quantitatively predict next-season injury risk in the National Hockey League (NHL) has become a reality with the advent of advanced computational processors and machine learning (ML) architecture. Unlike static regression analyses that provide a momentary prediction, ML algorithms are dynamic in that they are readily capable of imbibing historical data to build a framework that improves with additive data. Purpose: To (1) characterize the epidemiology of publicly reported NHL injuries from 2007 to 2017, (2) determine the validity of a machine learning model in predicting next-season injury risk for both goalies and position players, and (3) compare the performance of modern ML algorithms versus logistic regression (LR) analyses. Study Design: Descriptive epidemiology study. Methods: Professional NHL player data were compiled for the years 2007 to 2017 from 2 publicly reported databases in the absence of an official NHL-approved database. Attributes acquired from each NHL player from each professional year included age, 85 performance metrics, and injury history. A total of 5 ML algorithms were created for both position player and goalie data: random forest, K-Nearest Neighbors, Naïve Bayes, XGBoost, and Top 3 Ensemble. LR was also performed for both position player and goalie data. Area under the receiver operating characteristic curve (AUC) primarily determined validation. Results: Player data were generated from 2109 position players and 213 goalies. For models predicting next-season injury risk for position players, XGBoost performed the best with an AUC of 0.948, compared with an AUC of 0.937 for LR (P < .0001). For models predicting next-season injury risk for goalies, XGBoost had the highest AUC with 0.956, compared with an AUC of 0.947 for LR (P < .0001). 
Conclusion: Advanced ML models such as XGBoost outperformed LR and demonstrated good to excellent capability of predicting whether a publicly reportable injury is likely to occur the next season.


2020 ◽  
Vol 8 (11) ◽  
pp. 232596712096304
Author(s):  
Jaret M. Karnuta ◽  
Bryan C. Luu ◽  
Heather S. Haeberle ◽  
Paul M. Saluan ◽  
Salvatore J. Frangiamore ◽  
...  

Background: Machine learning (ML) allows for the development of a predictive algorithm capable of imbibing historical data on a Major League Baseball (MLB) player to accurately project the player's future availability. Purpose: To determine the validity of an ML model in predicting the next-season injury risk and anatomic injury location for both position players and pitchers in the MLB. Study Design: Descriptive epidemiology study. Methods: Using 4 online baseball databases, we compiled MLB player data, including age, performance metrics, and injury history. A total of 84 ML algorithms were developed. The output of each algorithm reported whether the player would sustain an injury the following season as well as the injury’s anatomic site. The area under the receiver operating characteristic curve (AUC) primarily determined validation. Results: Player data were generated from 1931 position players and 1245 pitchers, with a mean follow-up of 4.40 years (13,982 player-years) between the years of 2000 and 2017. Injured players spent a total of 108,656 days on the disabled list, with a mean of 34.21 total days per player. The mean AUC for predicting next-season injuries was 0.76 among position players and 0.65 among pitchers using the top 3 ensemble classification. Back injuries had the highest AUC among both position players and pitchers, at 0.73. Advanced ML models outperformed logistic regression in 13 of 14 cases. Conclusion: Advanced ML models generally outperformed logistic regression and demonstrated fair capability in predicting publicly reportable next-season injuries, including the anatomic region for position players, although not for pitchers.
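The anatomic-site prediction described above can be framed as one binary classifier per region, each validated by its own AUC. The sketch below assumes that framing; the data is synthetic, the three site names are illustrative, and scikit-learn's GradientBoostingClassifier stands in for the study's 84 algorithms.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(1500, 8))          # stand-in player-year features
sites = ["back", "shoulder", "elbow"]   # illustrative anatomic regions

aucs = {}
for k, site in enumerate(sites):
    # Synthetic per-site injury label, weakly driven by one feature each.
    y = (0.8 * X[:, k] + rng.normal(scale=1.0, size=1500) > 0.5).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=k
    )
    clf = GradientBoostingClassifier(random_state=k).fit(X_tr, y_tr)
    aucs[site] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{site}: AUC={aucs[site]:.2f}")
```

Training separate binary models per site, rather than one multiclass model, mirrors how per-region AUCs such as the 0.73 for back injuries can be reported independently.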


Author(s):  
Dhilsath Fathima.M ◽  
S. Justin Samuel ◽  
R. Hari Haran

Aim: This work aims to develop an improved and robust machine learning model for predicting myocardial infarction (MI), which could have substantial clinical impact. Objectives: This paper explains how to build a machine learning-based computer-aided analysis system for early and accurate prediction of MI, using the Framingham Heart Study dataset for validation and evaluation. The proposed computer-aided analysis model will support medical professionals in predicting myocardial infarction proficiently. Methods: The proposed model utilizes mean imputation to remove missing values from the dataset, then applies principal component analysis (PCA) to extract the optimal features and enhance the performance of the classifiers. After PCA, the reduced features are partitioned into a training dataset (70%) and a testing dataset (30%). The training dataset is given as input to four widely used classifiers: support vector machine (SVM), k-nearest neighbor (k-NN), logistic regression, and decision tree. The test dataset is used to evaluate the output of the machine learning model using performance metrics including the confusion matrix, classifier accuracy, precision, sensitivity, F1-score, and the AUC-ROC curve. Results: The outputs of the classifiers were evaluated using these performance measures. We observed that logistic regression provides higher accuracy than the k-NN, SVM, and decision tree classifiers, and that PCA performs well as a feature extraction method for enhancing the performance of the proposed model. From these analyses, we conclude that logistic regression has good mean accuracy and standard deviation of accuracy compared with the other three algorithms. The AUC-ROC curves of the classifiers, shown in Figures 4 and 5, indicate that logistic regression exhibits a good AUC-ROC score, around 70%, compared with the k-NN and decision tree algorithms. 
Conclusion: From the result analysis, we infer that the proposed machine learning model can act as an optimal decision-making system to predict acute myocardial infarction at an earlier stage than existing machine learning-based prediction models. It is capable of predicting the presence of acute myocardial infarction from heart disease risk factors, informing the decision of when to start lifestyle modification and medical treatment to prevent heart disease.
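The pipeline described in the methods (mean imputation, PCA feature reduction, a 70/30 split, and the four named classifiers) can be sketched as below. The data is synthetic, standing in for the Framingham dataset, and the number of principal components is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 15))                # stand-in risk-factor features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # synthetic MI label
X[rng.random(X.shape) < 0.05] = np.nan        # simulate missing entries

# 70% training / 30% testing split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=1),
}
scores = {}
for name, clf in models.items():
    # Mean imputation -> PCA -> classifier, matching the described order.
    pipe = make_pipeline(
        SimpleImputer(strategy="mean"), PCA(n_components=5), clf
    )
    pipe.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, pipe.predict(X_te))
    print(f"{name}: accuracy={scores[name]:.3f}")
```

Wrapping imputation and PCA inside the pipeline ensures both are fitted on training data only, avoiding leakage into the 30% test split.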


2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Matthijs Blankers ◽  
Louk F. M. van der Post ◽  
Jack J. M. Dekker

Abstract Background Accurate prediction models for whether patients on the verge of a psychiatric crisis need hospitalization are lacking, and machine learning methods may help improve the accuracy of psychiatric hospitalization prediction models. In this paper we evaluate the accuracy of ten machine learning algorithms, including the generalized linear model (GLM/logistic regression), to predict psychiatric hospitalization in the first 12 months after a psychiatric crisis care contact. We also evaluate an ensemble model to optimize the accuracy, and we explore individual predictors of hospitalization. Methods Data from 2084 patients included in the longitudinal Amsterdam Study of Acute Psychiatry with at least one reported psychiatric crisis care contact were included. The target variable for the prediction models was whether the patient was hospitalized in the 12 months following inclusion. The predictive power of 39 variables related to patients’ socio-demographics, clinical characteristics and previous mental health care contacts was evaluated. The accuracy and area under the receiver operating characteristic curve (AUC) of the machine learning algorithms were compared, and we also estimated the relative importance of each predictor variable. The best and least performing algorithms were compared with GLM/logistic regression using net reclassification improvement analysis, and the five best performing algorithms were combined in an ensemble model using stacking. Results All models performed above chance level. We found Gradient Boosting to be the best performing algorithm (AUC = 0.774) and K-Nearest Neighbors to be the least performing (AUC = 0.702). The performance of GLM/logistic regression (AUC = 0.76) was slightly above average among the tested algorithms. In a Net Reclassification Improvement analysis, Gradient Boosting outperformed GLM/logistic regression by 2.9% and K-Nearest Neighbors by 11.3%. GLM/logistic regression outperformed K-Nearest Neighbors by 8.7%. 
Nine of the top-10 most important predictor variables were related to previous mental health care use. Conclusions Gradient Boosting led to the highest predictive accuracy and AUC while GLM/logistic regression performed average among the tested algorithms. Although statistically significant, the magnitude of the differences between the machine learning algorithms was in most cases modest. The results show that a predictive accuracy similar to the best performing model can be achieved when combining multiple algorithms in an ensemble model.
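The stacking step described above (combining the five best performing algorithms into an ensemble) can be sketched as follows. The base learners here are two scikit-learn stand-ins rather than the study's five, and the data is synthetic; the meta-learner choice of logistic regression is an assumption.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 12))   # stand-in socio-demographic/clinical features
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.7, size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

stack = StackingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier(random_state=2)),
        ("rf", RandomForestClassifier(random_state=2)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # meta-learner is trained on out-of-fold base predictions
)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"stacked ensemble AUC={auc:.3f}")
```

Using out-of-fold predictions (the `cv` argument) prevents the meta-learner from simply memorizing base learners that have already seen each training row.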


2019 ◽  
Author(s):  
Matthijs Blankers ◽  
Louk F. M. van der Post ◽  
Jack J. M. Dekker

Abstract Background: It is difficult to accurately predict whether a patient on the verge of a potential psychiatric crisis will need to be hospitalized. Machine learning may be helpful to improve the accuracy of psychiatric hospitalization prediction models. In this paper we evaluate and compare the accuracy of ten machine learning algorithms, including the commonly used generalized linear model (GLM/logistic regression), to predict psychiatric hospitalization in the first 12 months after a psychiatric crisis care contact, and explore the most important predictor variables of hospitalization. Methods: Data from 2,084 patients with at least one reported psychiatric crisis care contact included in the longitudinal Amsterdam Study of Acute Psychiatry were used. The accuracy and area under the receiver operating characteristic curve (AUC) of the machine learning algorithms were compared. We also estimated the relative importance of each predictor variable. The best and least performing algorithms were compared with GLM/logistic regression using net reclassification improvement analysis. The target variable for the prediction models was whether or not the patient was hospitalized in the 12 months following inclusion in the study. The 39 predictor variables were related to patients’ socio-demographics, clinical characteristics and previous mental health care contacts. Results: We found Gradient Boosting to perform the best (AUC=0.774) and K-Nearest Neighbors to perform the least (AUC=0.702). The performance of GLM/logistic regression (AUC=0.76) was above average among the tested algorithms. Gradient Boosting outperformed GLM/logistic regression and K-Nearest Neighbors, and GLM outperformed K-Nearest Neighbors in a Net Reclassification Improvement analysis, although the differences between Gradient Boosting and GLM/logistic regression were small. Nine of the top-10 most important predictor variables were related to previous mental health care use. 
Conclusions: Gradient Boosting led to the highest predictive accuracy and AUC, while GLM/logistic regression performed average among the tested algorithms. Although statistically significant, the magnitude of the differences between the machine learning algorithms was modest. Future studies may consider combining multiple algorithms in an ensemble model for optimal performance and to mitigate the risk of choosing suboptimally performing algorithms.


2017 ◽  
Author(s):  
Aymen A. Elfiky ◽  
Maximilian J. Pany ◽  
Ravi B. Parikh ◽  
Ziad Obermeyer

ABSTRACT Background: Cancer patients who die soon after starting chemotherapy incur costs of treatment without benefits. Accurately predicting mortality risk from chemotherapy is important, but few patient data-driven tools exist. We sought to create and validate a machine learning model predicting mortality for patients starting new chemotherapy. Methods: We obtained electronic health records for patients treated at a large cancer center (26,946 patients; 51,774 new regimens) over 2004-14, linked to Social Security data for date of death. The model was derived using 2004-11 data, and performance measured on non-overlapping 2012-14 data. Findings: 30-day mortality from chemotherapy start was 2.1%. Common cancers included breast (21.1%), colorectal (19.3%), and lung (18.0%). Model predictions were accurate for all patients (AUC 0.94). Predictions for patients starting palliative chemotherapy (46.6% of regimens), for whom prognosis is particularly important, remained highly accurate (AUC 0.92). To illustrate model discrimination, we ranked patients initiating palliative chemotherapy by model-predicted mortality risk, and calculated observed mortality by risk decile. 30-day mortality in the highest-risk decile was 22.6%; in the lowest-risk decile, no patients died. Predictions remained accurate across all primary cancers, stages, and chemotherapies, even for clinical trial regimens that first appeared in years after the model was trained (AUC 0.94). The model also performed well for prediction of 180-day mortality (AUC 0.87; mortality 74.8% in the highest risk decile vs. 0.2% in the lowest). Predictions were more accurate than data from randomized trials of individual chemotherapies, or SEER estimates. Interpretation: A machine learning algorithm accurately predicted short-term mortality in patients starting chemotherapy using EHR data. Further research is necessary to determine generalizability and the feasibility of applying this algorithm in clinical settings.
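The discrimination check described above (ranking patients by predicted risk and comparing observed mortality across deciles) can be sketched in a few lines. Predictions and outcomes here are synthetic, not the cancer-center cohort.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
risk = rng.random(n)                # model-predicted 30-day mortality risk
died = rng.random(n) < risk * 0.2   # synthetic outcomes correlated with risk

# Assign each patient to a predicted-risk decile (0 = lowest, 9 = highest).
cuts = np.quantile(risk, np.linspace(0.1, 0.9, 9))
deciles = np.digitize(risk, cuts)

# Observed mortality rate within each decile.
rates = [died[deciles == d].mean() for d in range(10)]
print(f"lowest-decile mortality={rates[0]:.3f}, highest={rates[9]:.3f}")
```

A well-discriminating model shows a steep gradient across deciles, as in the study's 22.6% highest-decile versus 0% lowest-decile observed mortality.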


However, people often search for a restaurant using just the word “restaurant”, while the word “restaurant” means different things to different individuals. For an Asian user, it can mean a “Chinese restaurant” or a “Thai restaurant”. Correctly interpreting search requests based on people’s preferences is a challenge. Building a machine-learning model based on the activity history of a registered user can solve this problem. The activity histories used by this research are reviews and ratings from users. This project introduces a data processing pipeline that uses reviews from registered users to generate a machine-learning model for each registered user. The project also defines an architecture that uses the generated machine-learning models to support real-time personalized recommendations for restaurant searches, including the types of food the recommended restaurants are good at. Finally, to develop a good machine learning model, this project considers different collaborative filtering methodologies for predicting restaurants from user ratings. Slope One, the k-Nearest Neighbors algorithm, and multiclass SVM classification are some of the collaborative filtering methodologies considered in this project.
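Of the collaborative-filtering methods named above, Slope One is simple enough to sketch directly: a user's rating for an unseen restaurant is predicted from average pairwise rating deviations between restaurants. The users, restaurants, and ratings below are illustrative, not the project's data.

```python
from collections import defaultdict

ratings = {  # user -> {restaurant: rating}
    "alice": {"thai_place": 5, "sushi_bar": 3},
    "bob": {"thai_place": 4, "sushi_bar": 2, "taco_truck": 4},
    "carol": {"sushi_bar": 4, "taco_truck": 5},
}

# Accumulate rating deviations: diffs[j][i] sums (r_j - r_i) over users
# who rated both items; counts[j][i] tracks how many such users exist.
diffs = defaultdict(lambda: defaultdict(float))
counts = defaultdict(lambda: defaultdict(int))
for user_ratings in ratings.values():
    for j, rj in user_ratings.items():
        for i, ri in user_ratings.items():
            if i != j:
                diffs[j][i] += rj - ri
                counts[j][i] += 1

def predict(user, item):
    """Weighted Slope One prediction of `user`'s rating for `item`."""
    num = den = 0.0
    for i, ri in ratings[user].items():
        c = counts[item][i]
        if c:
            # Average deviation of `item` vs `i`, shifted by the user's
            # own rating of `i`, weighted by the number of co-raters.
            num += (diffs[item][i] / c + ri) * c
            den += c
    return num / den if den else None

print(predict("alice", "taco_truck"))
```

Because it pre-aggregates only pairwise deviations, Slope One updates cheaply as new ratings arrive, which suits the real-time recommendation architecture the project describes.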


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Andrei Bratu ◽  
Gabriela Czibula

Data augmentation is a commonly used technique in data science for improving the robustness and performance of machine learning models. The purpose of the paper is to study the feasibility of generating synthetic data points of a temporal nature towards this end. A general approach named DAuGAN (Data Augmentation using Generative Adversarial Networks) is presented for identifying poorly represented sections of a time series, studying the synthesis and integration of new data points, and improving performance on a benchmark machine learning model. The problem is studied and applied in the domain of algorithmic trading, whose constraints are presented and taken into consideration. The experimental results highlight an improvement in performance for a benchmark reinforcement learning agent trained to trade a financial instrument on a dataset enhanced with DAuGAN.
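DAuGAN's first step, identifying poorly represented sections of a time series, can be illustrated with a simple value-binning heuristic: bin the series and flag sparse bins as the regions a generator would be asked to synthesize. The histogram threshold here is an assumption for illustration, not the paper's exact criterion.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic price series with two regimes, leaving a sparse gap between them.
prices = np.concatenate([rng.normal(100, 2, 900), rng.normal(120, 2, 100)])

counts, edges = np.histogram(prices, bins=10)
sparse = counts < 0.5 * counts.mean()   # heuristic sparsity threshold
sparse_ranges = [(edges[i], edges[i + 1]) for i in np.flatnonzero(sparse)]
print(f"{len(sparse_ranges)} under-represented value ranges flagged")
```

The flagged ranges would then condition the GAN to generate synthetic points there, rebalancing the training set for the downstream agent.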


2020 ◽  
Author(s):  
Carlo M. Bertoncelli ◽  
Paola Altamura ◽  
Domenico Bertoncelli ◽  
Virginie Rampal ◽  
Edgar Ramos Vieira ◽  
...  

Abstract Neuromuscular hip dysplasia (NHD) is a common and severe problem in patients with cerebral palsy (CP). Previous studies have so far identified only spasticity (SP) and high levels of Gross Motor Function Classification System as factors associated with NHD. The aim of this study is to develop a machine learning model to identify additional risk factors of NHD. This was a cross-sectional multicenter descriptive study of 102 teenagers with CP (60 males, 42 females; 60 inpatients, 42 outpatients; mean age 16.5 ± 1.2 years, range 12–18 years). Data on etiology, diagnosis, SP, epilepsy (E), clinical history, and functional assessments were collected between 2007 and 2017. Hip dysplasia was defined as femoral head lateral migration percentage > 33% on pelvic radiogram. A logistic regression-prediction model named PredictMed was developed to identify risk factors of NHD. Twenty-eight (27%) teenagers with CP had NHD, of which 18 (67%) had dislocated hips. The logistic regression model identified poor walking abilities (p < 0.001; odds ratio [OR] infinity; 95% confidence interval [CI] infinity), scoliosis (p = 0.01; OR 3.22; 95% CI 1.30–7.92), trunk muscles' tone disorder (p = 0.002; OR 4.81; 95% CI 1.75–13.25), SP (p = 0.006; OR 6.6; 95% CI 1.46–30.23), poor motor function (p = 0.02; OR 5.5; 95% CI 1.2–25.2), and E (p = 0.03; OR 2.6; standard error 0.44) as risk factors of NHD. The accuracy of the model was 77%. PredictMed identified trunk muscles' tone disorder, severe scoliosis, E, and SP as risk factors of NHD in teenagers with CP.
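The kind of risk-factor analysis behind PredictMed can be sketched as a logistic regression over binary predictors, with odds ratios read off the exponentiated coefficients. The data below is synthetic, not the study cohort, and the three factor names are illustrative stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 500
# Illustrative binary risk factors: scoliosis, spasticity, epilepsy.
X = rng.integers(0, 2, size=(n, 3))
# Synthetic NHD label generated from a known logistic model.
logits = -1.0 + 1.2 * X[:, 0] + 1.8 * X[:, 1] + 0.9 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
# OR = exp(coefficient): the multiplicative change in odds of NHD when the
# risk factor is present.
odds_ratios = np.exp(model.coef_[0])
for name, or_ in zip(["scoliosis", "spasticity", "epilepsy"], odds_ratios):
    print(f"{name}: OR={or_:.2f}")
```

An OR above 1, as reported for scoliosis (3.22) and spasticity (6.6) in the study, indicates the factor raises the odds of NHD; confidence intervals excluding 1 make that association statistically credible.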

