A comparison of prediction approaches for identifying prodromal Parkinson disease

PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0256592
Author(s):  
Mark N. Warden ◽  
Susan Searles Nielsen ◽  
Alejandra Camacho-Soto ◽  
Roman Garnett ◽  
Brad A. Racette

Identifying people with Parkinson disease during the prodromal period, including via algorithms in administrative claims data, is an important research and clinical priority. We sought to improve upon an existing penalized logistic regression model, based on diagnosis and procedure codes, by adding prescription medication data or using machine learning. Using Medicare Part D beneficiaries age 66–90 from a population-based case-control study of incident Parkinson disease, we fit a penalized logistic regression both with and without Part D data. We also built a predictive algorithm using a random forest classifier for comparison. In a combined approach, we introduced the probability of Parkinson disease from the random forest as a predictor in the penalized regression model. We calculated the receiver operator characteristic area under the curve (AUC) for each model. All models performed well, with AUCs ranging from 0.824 (simplest model) to 0.835 (combined approach). We conclude that medication data and random forests improve Parkinson disease prediction, but are not essential.
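The AUC used to compare these models can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The sketch below illustrates that definition only; the function name `roc_auc` and the toy labels and scores are assumptions for illustration and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the share of (case, control) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = case, 0 = control.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of 9 pairs are ranked correctly
print(round(auc, 3))
```

The pairwise loop is quadratic in sample size, so it suits small illustrations; for claims data on the scale described above, a rank-based implementation from a statistics library would be the practical choice.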

Neurology ◽  
2017 ◽  
Vol 89 (14) ◽  
pp. 1448-1456 ◽  
Author(s):  
Susan Searles Nielsen ◽  
Mark N. Warden ◽  
Alejandra Camacho-Soto ◽  
Allison W. Willis ◽  
Brenton A. Wright ◽  
...  

Objective: To use administrative medical claims data to identify patients with incident Parkinson disease (PD) prior to diagnosis. Methods: Using a population-based case-control study of incident PD in 2009 among Medicare beneficiaries aged 66–90 years (89,790 cases, 118,095 controls) and the elastic net algorithm, we developed a cross-validated model for predicting PD using only demographic data and 2004–2009 Medicare claims data. We then compared this model to more basic models containing only demographic data and diagnosis codes for constipation, taste/smell disturbance, and REM sleep behavior disorder, using each model's receiver operator characteristic area under the curve (AUC). Results: We observed all established associations between PD and age, sex, race/ethnicity, tobacco smoking, and the above medical conditions. A model with those predictors had an AUC of only 0.670 (95% confidence interval [CI] 0.668–0.673). In contrast, the AUC for a predictive model with 536 diagnosis and procedure codes was 0.857 (95% CI 0.855–0.859). At the optimal cut point, sensitivity was 73.5% and specificity was 83.2%. Conclusions: Using only demographic data and selected diagnosis and procedure codes readily available in administrative claims data, it is possible to identify individuals with a high probability of eventually being diagnosed with PD.
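The abstract reports sensitivity and specificity at an "optimal cut point" without saying how that point was chosen. One common convention, assumed here purely for illustration, is to maximize Youden's J (sensitivity + specificity − 1) over candidate thresholds. The helper names and toy data below are hypothetical.

```python
def sens_spec(labels, scores, cut):
    """Sensitivity and specificity when scores >= cut are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cut)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cut)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cut)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cut)
    return tp / (tp + fn), tn / (tn + fp)


def youden_cut(labels, scores):
    """Return (J, cut, sensitivity, specificity) for the threshold maximizing Youden's J."""
    best = None
    for cut in sorted(set(scores)):
        se, sp = sens_spec(labels, scores, cut)
        j = se + sp - 1.0
        if best is None or j > best[0]:
            best = (j, cut, se, sp)
    return best


# Toy data (hypothetical): 1 = case, 0 = control.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.5, 0.7, 0.4, 0.2, 0.1]
j, cut, se, sp = youden_cut(labels, scores)
print(cut, se, sp)
```

Other definitions of "optimal" exist (for example, the point closest to the top-left corner of the ROC curve, or a cost-weighted threshold), and they can select different cut points on the same data.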


2021 ◽  
Vol 9 ◽  
Author(s):  
Qiao-Ying Xie ◽  
Ming-Wei Wang ◽  
Zu-Ying Hu ◽  
Cheng-Jian Cao ◽  
Cong Wang ◽  
...  

Aim: Metabolic syndrome (MS) screening is essential for the early detection of the occupational population. This study aimed to screen out biomarkers related to MS and establish a risk assessment and prediction model for the routine physical examination of an occupational population. Methods: The least absolute shrinkage and selection operator (Lasso) regression algorithm of machine learning was used to screen biomarkers related to MS. Then, the accuracy of the logistic regression model was further verified based on the Lasso regression algorithm. The areas under the receiver operating characteristic curves were used to evaluate the selection accuracy of biomarkers in identifying MS subjects with risk. The screened biomarkers were used to establish a logistic regression model and calculate the odds ratio (OR) of the corresponding biomarkers. A nomogram risk prediction model was established based on the selected biomarkers, and the consistency index (C-index) and calibration curve were derived. Results: A total of 2,844 occupational workers were included, and 10 biomarkers related to MS were screened. The number of non-MS cases was 2,189 and that of MS was 655. The area under the curve (AUC) value for non-Lasso and Lasso logistic regression was 0.652 and 0.907, respectively. The established risk assessment model revealed that the main risk biomarkers were absolute basophil count (OR: 3.38, CI: 1.05–6.85), platelet packed volume (OR: 2.63, CI: 2.31–3.79), leukocyte count (OR: 2.01, CI: 1.79–2.19), red blood cell count (OR: 1.99, CI: 1.80–2.71), and alanine aminotransferase level (OR: 1.53, CI: 1.12–1.98). Furthermore, favorable results with C-indexes (0.840) and calibration curves closer to ideal curves indicated the accurate predictive ability of this nomogram. Conclusions: The risk assessment model based on the Lasso logistic regression algorithm helped identify MS with high accuracy in physically examining an occupational population.
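Odds ratios such as those listed above are conventionally obtained by exponentiating a logistic regression coefficient, with a Wald-type confidence interval from the coefficient's standard error. The sketch below shows that arithmetic only; the coefficient and standard error are hypothetical values, not figures from the study, and the abstract does not state which interval method the authors used.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald confidence interval from a log-odds coefficient.

    z=1.96 gives an approximate 95% interval.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient: log(2), i.e. a doubling of the odds per unit increase.
beta = math.log(2.0)
se = 0.1
or_, lower, upper = odds_ratio_ci(beta, se)
print(round(or_, 2), round(lower, 2), round(upper, 2))
```

Because the interval is symmetric on the log-odds scale, it is asymmetric around the odds ratio itself.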


2021 ◽  
Vol 8 ◽  
Author(s):  
Robert A. Reed ◽  
Andrei S. Morgan ◽  
Jennifer Zeitlin ◽  
Pierre-Henri Jarreau ◽  
Héloïse Torchin ◽  
...  

Introduction: Preterm babies are a vulnerable population that experience significant short and long-term morbidity. Rehospitalisations constitute an important, potentially modifiable adverse event in this population. Improving the ability of clinicians to identify those patients at the greatest risk of rehospitalisation has the potential to improve outcomes and reduce costs. Machine-learning algorithms can provide potentially advantageous methods of prediction compared to conventional approaches like logistic regression. Objective: To compare two machine-learning methods (least absolute shrinkage and selection operator (LASSO) and random forest) to expert-opinion driven logistic regression modelling for predicting unplanned rehospitalisation within 30 days in a large French cohort of preterm babies. Design, Setting and Participants: This study used data derived exclusively from the population-based prospective cohort study of French preterm babies, EPIPAGE 2. Only those babies discharged home alive and whose parents completed the 1-year survey were eligible for inclusion in our study. All predictive models used a binary outcome, denoting a baby's status for an unplanned rehospitalisation within 30 days of discharge. Predictors included those quantifying clinical, treatment, maternal and socio-demographic factors. The predictive abilities of models constructed using LASSO and random forest algorithms were compared with a traditional logistic regression model. The logistic regression model comprised 10 predictors, selected by expert clinicians, while the LASSO and random forest included 75 predictors. Performance measures were derived using 10-fold cross-validation. Performance was quantified using area under the receiver operator characteristic curve, sensitivity, specificity, Tjur's coefficient of determination and calibration measures. Results: The rate of 30-day unplanned rehospitalisation in the eligible population used to construct the models was 9.1% (95% CI 8.2–10.1) (350/3,841). The random forest model demonstrated both an improved AUROC (0.65; 95% CI 0.59–0.7; p = 0.03) and specificity vs. logistic regression (AUROC 0.57; 95% CI 0.51–0.62, p = 0.04). The LASSO performed similarly (AUROC 0.59; 95% CI 0.53–0.65; p = 0.68) to logistic regression. Conclusions: Compared to an expert-specified logistic regression model, random forest offered improved prediction of 30-day unplanned rehospitalisation in preterm babies. However, all models offered relatively low levels of predictive ability, regardless of modelling method.
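The 10-fold cross-validation mentioned above partitions the sample into ten folds, and each fold serves once as the held-out set while the remaining nine are used for fitting. A minimal sketch of the index bookkeeping follows; the function name, the seed, and the sample size of 25 are assumptions for illustration, and the study's actual fold assignment (for example, whether it was stratified by outcome) is not described in the abstract.

```python
import random


def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # shuffle once so folds are random
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Hypothetical sample of 25 observations split into 10 folds.
splits = list(kfold_indices(25, k=10))
for train, test in splits:
    print(len(train), len(test))
```

With an outcome as rare as the 9.1% rehospitalisation rate reported here, a stratified variant that keeps the event rate similar across folds is often preferred.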


Neurology ◽  
2021 ◽  
pp. 10.1212/WNL.0000000000012863
Author(s):  
Basile Kerleroux ◽  
Joseph Benzakoun ◽  
Kévin Janot ◽  
Cyril Dargazanli ◽  
Dimitri Daly Eraya ◽  
...  

Objective: Individualized patient selection for mechanical thrombectomy (MT) in patients with acute ischemic stroke (AIS) and large ischemic core (LIC) at baseline is an unmet need. We tested the hypothesis that assessing the functional relevance of both the infarcted and hypo-perfused brain tissue would improve the selection framework of patients with LIC for MT. Methods: Multicenter, retrospective study of adults with LIC (ischemic core volume > 70 ml on MR-DWI), with MRI perfusion, treated with MT or best medical management (BMM). Primary outcome was 3-month modified Rankin Scale (mRS), favourable if 0-3. Global and regional-eloquence-based core-perfusion mismatch ratios were derived. The predictive accuracy for clinical outcome of eloquent regions involvement was compared in multivariable and bootstrap-random-forest models. Results: A total of 138 patients with baseline LIC were included (MT n=96 or BMM n=42; mean age±SD, 72.4±14.4 years; 34.1% females; mRS=0-3: 45.1%). Mean core and critically-hypo-perfused volume were 100.4±36.3 ml and 157.6±56.2 ml respectively and did not differ between groups. Models considering the functional relevance of the infarct location showed a better accuracy for the prediction of mRS=0-3, with a c-statistic of 0.76 and 0.83 for the logistic regression model and bootstrap-random-forest testing sets respectively. In these models, the interaction between treatment effect of MT and the mismatch was significant (p=0.04). In comparison, in the logistic regression model disregarding functional eloquence, the c-statistic was 0.67 and the interaction between MT and the mismatch was not significant. Conclusion: Considering functional eloquence of hypo-perfused tissue in patients with a large infarct core at baseline allows for a more precise estimation of the expected treatment benefit.


2019 ◽  
Vol 18 ◽  
pp. 153303381984663 ◽  
Author(s):  
Chang-Liang Luo ◽  
Yuan Rong ◽  
Hao Chen ◽  
Wu-Wen Zhang ◽  
Long Wu ◽  
...  

α-Fetoprotein is commonly used in the diagnosis of hepatocellular carcinoma. However, the diagnostic significance of α-fetoprotein has been questioned because a number of patients with hepatocellular carcinoma are α-fetoprotein negative. It is therefore necessary to develop novel noninvasive techniques for the early diagnosis of hepatocellular carcinoma, particularly when α-fetoprotein level is low or negative. The current study aimed to evaluate the diagnostic efficiency of hematological parameters to determine which can act as surrogate markers in α-fetoprotein–negative hepatocellular carcinoma. Therefore, a retrospective study was conducted on a training set recruited from Zhongnan Hospital of Wuhan University—including 171 α-fetoprotein–negative patients with hepatocellular carcinoma and 102 healthy individuals. The results show that mean values of mean platelet volume, red blood cell distribution width, mean platelet volume–PC ratio, neutrophils–lymphocytes ratio, and platelet count–lymphocytes ratio were significantly higher in patients with hepatocellular carcinoma in comparison to the healthy individuals. Most of these parameters showed moderate area under the curve in α-fetoprotein–negative patients with hepatocellular carcinoma, but their sensitivities or specificities were not satisfactory enough. So, we built a logistic regression model combining multiple hematological parameters. This model presented better diagnostic efficiency with area under the curve of 0.922, sensitivity of 83.0%, and specificity of 93.1%. In addition, the 4 validation sets from different hospitals were used to validate the model. They all showed good area under the curve with satisfactory sensitivities or specificities. These data indicate that the logistic regression model combining multiple hematological parameters has better diagnostic efficiency, and they might be helpful for the early diagnosis for α-fetoprotein–negative hepatocellular carcinoma.


2020 ◽  
Author(s):  
Victoria Garcia-Montemayor ◽  
Alejandro Martin-Malo ◽  
Carlo Barbieri ◽  
Francesco Bellocchio ◽  
Sagrario Soriano ◽  
...  

Abstract Background Besides the classic logistic regression analysis, non-parametric methods based on machine learning techniques such as random forest are presently used to generate predictive models. The aim of this study was to evaluate random forest mortality prediction models in haemodialysis patients. Methods Data were acquired from incident haemodialysis patients between 1995 and 2015. Prediction of mortality at 6 months, 1 year and 2 years of haemodialysis was calculated using random forest and the accuracy was compared with logistic regression. Baseline data were constructed with the information obtained during the initial period of regular haemodialysis. Aiming to increase accuracy concerning baseline information of each patient, the period of time used to collect data was set at 30, 60 and 90 days after the first haemodialysis session. Results There were 1571 incident haemodialysis patients included. The mean age was 62.3 years and the average Charlson comorbidity index was 5.99. The mortality prediction models obtained by random forest appear to be adequate in terms of accuracy [area under the curve (AUC) 0.68–0.73] and superior to logistic regression models (ΔAUC 0.007–0.046). Results indicate that both random forest and logistic regression develop mortality prediction models using different variables. Conclusions Random forest is an adequate method, and superior to logistic regression, to generate mortality prediction models in haemodialysis patients.


2020 ◽  
Author(s):  
Mayssa Traboulsi ◽  
Zainab El Alaoui Talibi ◽  
Abdellatif Boussaid

Abstract Background: Preterm Birth (PTB) can negatively affect the health of mothers as well as infants. Prediction of this gynecological complication remains difficult, especially in Middle and Low-Income countries, because of limited access to specific tests and data collection scarcity. Multiparous women in our study presented a higher PTB prevalence compared to nulliparous women. Methods: In a cohort study from Northern Lebanon of 1996 women, 922 were multiparous, presenting a PTB prevalence of 8%. We analyzed the personal, demographic, and health indicators available for this group of women. We compared 4 modified logistic regression models (up-sampling, lasso penalized regression) to develop a nomogram that can screen for preterm birth in multiparous women. The models were trained and validated on different data sets. Results: The best PTB prediction of the Logistic regression model reached around 88%. This was obtained using a Logistic Regression Model trained on up-sampled datasets and LASSO (Least Absolute Shrinkage and Selection Operator) penalized. The regression coefficients of the 6 selected variables (Pre-hemorrhage, Social status, Residence, Age, BMI, and Weight gain) were used to create a nomogram to screen multiparous women for PTB risk. Conclusions: The nomogram based on readily available indicators for multiparous women reasonably predicted most of the women at risk of PTB. This tool will allow physicians to screen women at high risk of spontaneous preterm birth and to run adequate additional tests, leading to better medical surveillance that can reduce PTB incidence.
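Up-sampling, as named in the methods above, generally means resampling the rarer outcome class with replacement until the classes are balanced. The sketch below illustrates that idea under stated assumptions: the function name, the `ptb` field, the seed, and the toy 8-in-100 data are hypothetical, and the abstract does not specify the authors' exact resampling procedure.

```python
import random


def upsample_minority(rows, label_key="ptb", seed=0):
    """Resample the minority class with replacement until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("minority class is empty")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Toy training set (hypothetical): 8 events among 100 records.
rows = [{"id": i, "ptb": 1 if i < 8 else 0} for i in range(100)]
balanced = upsample_minority(rows)
print(len(balanced), sum(r["ptb"] for r in balanced))
```

Resampling is normally applied to the training split only, so that the validation data keep the original prevalence; this is consistent with the abstract's statement that models were trained and validated on different data sets.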


2020 ◽  
Vol 71 (1) ◽  
pp. 299-305
Author(s):  
Fernando González-Mohíno ◽  
Jesús Santos del Cerro ◽  
Andrew Renfree ◽  
Inmaculada Yustres ◽  
José Mª González-Ravé

Abstract The purpose of this analysis was to quantify the probability of achieving a top-3 finishing position during 800-m races at a global championship, based on dispersion of the runners during the first and second laps and the difference in split times between laps. Overall race times, intermediate and finishing positions and 400 m split times were obtained for 43 races over 800 m (21 men’s and 22 women’s) comprising 334 individual performances, 128 of which resulted in higher positions (top-3) and 206 the remaining positions. Intermediate and final positions along with times, the dispersion of the runners during the intermediate and final splits (SS1 and SS2), as well as differences between the two split times (Dsplits) were calculated. A logistic regression model was created to determine the influence of these factors in achieving a top-3 position. The final position was most strongly associated with SS2, but also with SS1 and Dsplits. The Global Significance Test showed that the model was significant (p < 0.001) with a predictive ability of 91.08% and an area under the curve coefficient of 0.9598. The values of sensitivity and specificity were 96.8% and 82.5%, respectively. The model demonstrated that SS1, SS2 and Dsplits explained the finishing position in the 800-m event in global championships.


2021 ◽  
Vol 9 ◽  
Author(s):  
Brett Snider ◽  
Edward A. McBean ◽  
John Yawney ◽  
S. Andrew Gadsden ◽  
Bhumi Patel

The Severe Acute Respiratory Syndrome Coronavirus 2 pandemic has challenged medical systems to the brink of collapse around the globe. In this paper, logistic regression and three other artificial intelligence models (XGBoost, Artificial Neural Network and Random Forest) are described and used to predict mortality risk of individual patients. The database is based on census data for the designated area and co-morbidities obtained using data from the Ontario Health Data Platform. The dataset consisted of more than 280,000 COVID-19 cases in Ontario for a wide range of age groups: 0–9, 10–19, 20–29, 30–39, 40–49, 50–59, 60–69, 70–79, 80–89, and 90+. Findings from logistic regression, XGBoost, Artificial Neural Network and Random Forest all demonstrate excellent discrimination (area under the curve for all models exceeded 0.948, with the best performance being 0.956 for an XGBoost model). Based on SHapley Additive exPlanations values, the importance of 24 variables is identified, and the findings indicate that the most important variables are, in order of importance, age, date of test, sex, and presence/absence of chronic dementia. The findings from this study allow the identification of out-patients who are likely to deteriorate into severe cases, allowing medical professionals to make decisions on timely treatments. Furthermore, the methodology and results may be extended to other public health regions.


2018 ◽  
Author(s):  
Florian Privé ◽  
Hugues Aschard ◽  
Michael G.B. Blum

Abstract Polygenic Risk Scores (PRS) combine the information across many single-nucleotide polymorphisms (SNPs) in a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify high-genetic risk individuals for a given disease. The “Clumping+Thresholding” (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method to jointly estimate SNP effects, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. The choice of hyper-parameters for a predictive model is very important since it can dramatically impact its predictive performance. As an example, AUC values range from less than 60% to 90% in a model with 30 causal SNPs, depending on the p-value threshold in C+T. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. PLR consistently achieves higher predictive performance than the two other methods while being as fast as C+T. We find that improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. We confirm these results in a data analysis of a case-control study for celiac disease where PLR and the standard C+T method achieve AUC of 89% and of 82.5%. In conclusion, our study demonstrates that penalized logistic regression can achieve more discriminative polygenic risk scores, while being applicable to large-scale individual-level data thanks to the implementation we provide in the R package bigstatsr.

