Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Iqbal Madakkatel ◽  
Ang Zhou ◽  
Mark D. McDonnell ◽  
Elina Hyppönen

Abstract We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. In this study, mortality models were built using gradient boosting decision trees (GBDT), and important predictors were identified using SHAP values, a feature attribution method based on Shapley values. Cox models controlled for false discovery rate were used for confounder adjustment, interpretability, and further validation. The pipeline was tested using information from 502,506 UK Biobank participants, aged 37–73 years at recruitment and followed over seven years for mortality registrations. From the 11,639 predictors included in GBDT, 193 potential risk factors had SHAP values ≥ 0.05, passed the correlation test, and were selected for further modelling. Of the total summed variable importance, 60% was directly health related, and baseline characteristics, sociodemographics, and lifestyle factors each contributed about 10%. Cox models adjusted for baseline characteristics showed evidence for an association with mortality for 166 of the 193 predictors. These included mostly well-known risk factors (e.g., age, sex, ethnicity, education, material deprivation, smoking, physical activity, self-rated health, BMI, and many disease outcomes). For 19 predictors we saw evidence for an association in the unadjusted but not adjusted analyses, suggesting bias by confounding. Our GBDT-SHAP pipeline was able to identify relevant predictors ‘hidden’ within thousands of variables, providing an efficient and pragmatic solution for the first stage of hypothesis-free risk factor identification.
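To make the Shapley-value idea behind SHAP concrete, the sketch below computes exact Shapley values for a tiny toy model by averaging each feature's marginal contribution over all feature orderings. It is an illustration only, not the authors' pipeline: the value function `toy_risk`, the feature names, and the effect sizes are hypothetical, and the 0.05 cut-off is reused here on toy attributions rather than on the aggregate SHAP importance from a fitted GBDT.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = set()
        for f in order:
            before = value_fn(frozenset(coalition))
            coalition.add(f)
            after = value_fn(frozenset(coalition))
            totals[f] += after - before
    return {f: totals[f] / len(orderings) for f in features}


def toy_risk(present):
    """Hypothetical risk score: additive effects plus an age x smoking interaction."""
    score = 0.0
    if "age" in present:
        score += 0.30
    if "smoking" in present:
        score += 0.20
    if "age" in present and "smoking" in present:
        score += 0.10  # interaction term, split equally between the two features
    if "tea_intake" in present:
        score += 0.01
    return score


phi = shapley_values(toy_risk, ["age", "smoking", "tea_intake"])

# Keep features whose attribution magnitude reaches the (assumed) threshold.
THRESHOLD = 0.05
selected = [f for f, v in phi.items() if abs(v) >= THRESHOLD]
```

In this toy case the attributions sum to the difference between the full-model score and the empty-model score, which is the efficiency property that makes Shapley-based importances add up to a total. Exhaustive enumeration is only feasible for a handful of features; with thousands of predictors a dedicated tree-based explainer would be needed.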

2021 ◽  
Author(s):  
Iqbal Madakkatel ◽  
Ang Zhou ◽  
Mark McDonnell ◽  
Elina Hypponen

Background: Machine learning (ML) can harness information from large databases with complex structures. We present a simple and fast hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. Methods: Mortality models were built using gradient boosting decision trees (GBDT) and important predictors were identified using SHAP values. Cox models controlled for false discovery rate were used for interpretability and further validation. The pipeline was tested using information from 502,506 UK Biobank participants, aged 37 to 73 years at recruitment and followed over seven years for mortality registrations. Results: From the 11,639 predictors included in GBDT, 193 potential risk factors had SHAP values of 0.05 or greater and were selected for further modelling. Of the total summed variable importance, 60% was directly health related, and baseline characteristics, sociodemographics, and lifestyle factors each contributed about 10%. Cox models adjusted for baseline characteristics showed evidence for an association with mortality for 166 of the 193 predictors. For 19 predictors we saw evidence for an association in the unadjusted but not adjusted analyses, suggesting confounding by basic characteristics. Identified "important" predictors included traditional risk factors such as age, sex, ethnicity, education, material deprivation, smoking, physical activity, self-rated health, BMI, hypertension, cardiovascular diseases, cancer diagnoses and type 2 diabetes, as confirmed by previous studies. Conclusion: Our approach provides a fast and pragmatic solution for hypothesis-free risk factor identification.
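The abstract states that the Cox models were controlled for false discovery rate but does not name the procedure. As a minimal sketch, assuming the widely used Benjamini-Hochberg step-up procedure, the code below flags which of a set of p-values would be declared significant at a chosen FDR level. The p-values are hypothetical and stand in for per-predictor Cox model results.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking hypotheses rejected at FDR level alpha.

    Assumption: Benjamini-Hochberg is used here for illustration; the
    source does not say which FDR procedure was applied.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, one per candidate predictor.
p = [0.2, 0.001, 0.038, 0.015, 0.035]
flags = benjamini_hochberg(p, alpha=0.05)
```

Note the step-up behaviour: 0.035 exceeds its own rank threshold (3/5 × 0.05 = 0.03) but is still rejected because the next-ranked p-value, 0.038, falls below its threshold of 0.04.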


2021 ◽  
Author(s):  
Iqbal Madakkatel ◽  
Catherine King ◽  
Ang Zhou ◽  
Anwar Mulugeta ◽  
Amanda Lumsden ◽  
...  

Severe acute respiratory syndrome coronavirus has infected over 114 million people worldwide as of March 2021, with worldwide mortality rates ranging between 1% and 10%. We use information on up to 421,111 UK Biobank participants to identify possible predictors for long-term susceptibility to severe COVID-19 infection (N = 1,088) and mortality (N = 376). We include 36,168 predictors in our analyses and use a gradient boosting decision tree (GBDT) algorithm and feature attribution based on Shapley values, together with traditional epidemiological approaches to identify possible risk factors. Our analyses show associations between socio-demographic factors (e.g. age, sex, ethnicity, education, material deprivation, accommodation type) and lifestyle indicators (e.g. smoking, physical activity, walking pace, tea intake, and dietary changes) with risk of developing severe COVID-19 symptoms. Blood (cystatin C, C-reactive protein, gamma glutamyl transferase and alkaline phosphatase) and urine (microalbuminuria) biomarkers measured more than 10 years earlier predicted severe COVID-19. We also confirm increased risks for several pre-existing disease outcomes (e.g. lung diseases, type 2 diabetes, hypertension, circulatory diseases, anemia, and mental disorders). Analyses on mortality were possible within a sub-group testing positive for COVID-19 infection (N = 1,953), with our analyses confirming associations with age, smoking status, and prior primary diagnosis of urinary tract infection.


2020 ◽  
Author(s):  
Neil Kale

BACKGROUND Despite worldwide efforts to develop an effective COVID vaccine, it is quite evident that initial supplies will be limited. Therefore, it is important to develop methods that will ensure that the COVID vaccine is allocated to the people who are at major risk until there is a sufficient global supply. OBJECTIVE The purpose of this study was to develop a machine-learning tool that could be applied to assess the risk in Massachusetts towns based on community-wide social, medical, and lifestyle risk factors. METHODS I compiled Massachusetts town data for 29 potential risk factors, such as the prevalence of preexisting comorbid conditions like COPD and social factors such as racial composition, and implemented logistic regression to predict the number of COVID cases in each town. RESULTS Of the 29 factors, 14 were found to be significant (p < 0.1) indicators: poverty, food insecurity, lack of high school education, lack of health insurance coverage, premature mortality, population, population density, recent population growth, Asian percentage, high-occupancy housing, and preexisting prevalence of cancer, COPD, overweightness, and heart attacks. The machine-learning approach is 80% accurate in the state of Massachusetts and finds the 9 highest-risk communities: Lynn, Brockton, Revere, Randolph, Lowell, New Bedford, Everett, Waltham, and Fitchburg. The 5 most at-risk counties are Suffolk, Middlesex, Bristol, Norfolk, and Plymouth. CONCLUSIONS With appropriate data, the tool could evaluate risk in other communities, or even enumerate individual patient susceptibility. A ranking of communities by risk may help policymakers ensure equitable allocation of limited doses of the COVID vaccine.


Neonatology ◽  
2021 ◽  
pp. 1-8
Author(s):  
Kasia Trzcionkowska ◽  
Floris Groenendaal ◽  
Peter Andriessen ◽  
Peter H. Dijk ◽  
Frank A.M. van den Dungen ◽  
...  

Introduction: Retinopathy of prematurity (ROP) remains an important cause for preventable blindness. Aside from gestational age (GA) and birth weight, risk factor assessment can be important for determination of infants at risk of (severe) ROP. Methods: Prospective, multivariable risk-analysis study (NEDROP-2) was conducted, including all infants born in 2017 in the Netherlands considered eligible for ROP screening by pediatricians. Ophthalmologists provided data of screened infants, which were combined with risk factors from the national perinatal database (Perined). Clinical data and potential risk factors were compared to the first national ROP inventory (NEDROP-1, 2009). During the second period, more strict risk factor-based screening inclusion criteria were applied. Results: Of 1,287 eligible infants, 933 (72.5%) were screened for ROP and matched with the Perined data. Any ROP was found in 264 infants (28.3% of screened population, 2009: 21.9%) and severe ROP (sROP) (stage ≥3) in 41 infants (4.4%, 2009: 2.1%). The risk for any ROP is decreased with a higher GA (odds ratio [OR] 0.59 and 95% confidence interval [CI] 0.54–0.66) and increased for small for GA (SGA) (1.73, 1.11–2.62), mechanical ventilation >7 days (2.13, 1.35–3.37) and postnatal corticosteroids (2.57, 1.44–4.66). For sROP, significant factors were GA (OR 0.37 and CI 0.27–0.50), SGA (OR 5.65 and CI 2.17–14.92), postnatal corticosteroids (OR 3.81 and CI 1.72–8.40), and perforated necrotizing enterocolitis (OR 7.55 and CI 2.29–24.48). Conclusion: In the Netherlands, sROP was diagnosed more frequently since 2009. No new risk factors for ROP were determined in the present study, apart from those already included in the current screening guideline.
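As a quick arithmetic check, the reported percentages can be reproduced from the counts given in the abstract. The helper name `pct` is introduced here for illustration only. The screening rate uses eligible infants as the denominator, while the ROP rates use screened infants.

```python
# Counts as reported in the abstract.
eligible = 1287
screened = 933
any_rop = 264
severe_rop = 41


def pct(part, whole):
    """Percentage rounded to one decimal place (hypothetical helper)."""
    return round(100 * part / whole, 1)


screening_rate = pct(screened, eligible)   # share of eligible infants screened
any_rop_rate = pct(any_rop, screened)      # any ROP among screened infants
severe_rop_rate = pct(severe_rop, screened)  # stage >= 3 among screened infants
```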


2007 ◽  
Vol 70 (6) ◽  
pp. 1350-1359 ◽  
Author(s):  
JULIE ARSENAULT ◽  
ANN LETELLIER ◽  
SYLVAIN QUESSY ◽  
JEAN-PIERRE MORIN ◽  
MARTINE BOULIANNE

An observational study was conducted to estimate prevalence and risk factors for carcass contamination by Salmonella and Campylobacter spp. in 60 lots of turkey slaughtered over 10 months in the province of Quebec, Canada. Carcass contamination was evaluated by the carcass rinse technique for about 30 birds per lot. Exposure to potential risk factors was evaluated with questionnaires, meteorological data, and cecal cultures. Multivariable negative binomial regression models were used for risk factor analysis. Prevalence of Salmonella-positive carcasses was 31.2% (95% confidence interval, 22.8 to 39.5%). Variables positively associated (P ≤ 0.05) with the proportion of lot-positive carcasses were ≥0.5% of carcass condemnation due to various pathologies, cecal samples positive for Salmonella, low wind speed during transportation, closure of lateral curtains of truck during transportation, and slaughtering on a weekday other than Monday. When only Salmonella-positive cecal culture lots were considered, the proportion of carcasses positive for Salmonella was significantly higher in lots exposed to a >5°C outside temperature variation during transportation, slaughtered on a weekday other than Monday, and in which ≥4% of carcasses had visible contamination. Prevalence of Campylobacter-positive carcasses was 36.9% (95% confidence interval, 27.6 to 46.3%). The proportion of positive carcasses was significantly higher in lots with Campylobacter-positive cecal cultures and lots undergoing ≥2 h of transit to slaughterhouse. For lots with Campylobacter-positive cecal cultures, variables significantly associated with an increased incidence of carcass contamination were ≥4% of carcasses with visible contamination, crating for ≥8 h before slaughtering, and no antimicrobials used during rearing.


2016 ◽  
pp. AAC.01503-16 ◽  
Author(s):  
Chih-Han Juan ◽  
Yi-Wei Huang ◽  
Yi-Tsung Lin ◽  
Tsuey-Ching Yang ◽  
Fu-Der Wang

A rise in tigecycline resistance in Klebsiella pneumoniae has been reported recently worldwide. We aim to identify risk factors, outcomes, and mechanisms for adult patients with tigecycline non-susceptible K. pneumoniae bacteremia in Taiwan. We conducted a matched case-control study (ratio of 1:1) in a medical center in Taiwan from January 2011 through June 2015. The cases were patients with tigecycline non-susceptible K. pneumoniae bacteremia, and the controls were patients with tigecycline susceptible K. pneumoniae bacteremia. Logistic regression was performed to evaluate the potential risk factors for tigecycline non-susceptible K. pneumoniae bacteremia. Quantitative RT-PCR was performed to analyze acrA, oqxA, ramA, rarA, and kpgA expression among these isolates. A total of 43 cases were matched with 43 controls. The 14-day mortality of patients with tigecycline non-susceptible K. pneumoniae bacteremia was 30.2%, and the 28-day mortality was 41.9%. The attributable mortality of tigecycline non-susceptible K. pneumoniae at 14 days and 28 days was 9.3% and 18.6%, respectively. Fluoroquinolone use within 30 days prior to bacteremia was the only independent risk factor for tigecycline non-susceptible K. pneumoniae bacteremia. Tigecycline non-susceptibility in K. pneumoniae was mostly caused by overexpression of AcrAB and/or OqxAB efflux pumps, together with the upregulation of RamA and/or RarA, respectively. One isolate had isolated overexpression of kpgA. In conclusion, tigecycline non-susceptible K. pneumoniae bacteremia was associated with high mortality, and prior fluoroquinolone use was the independent risk factor for acquisition of tigecycline non-susceptible K. pneumoniae. The overexpression of AcrAB and/or OqxAB contributes to tigecycline non-susceptibility in K. pneumoniae.


2020 ◽  
Author(s):  
Xueyan Li ◽  
Genshan Ma ◽  
Xiaobo Qian ◽  
Yamou Wu ◽  
Xiaochen Huang ◽  
...  

Abstract Background: We aimed to assess the performance of machine learning algorithms for the prediction of risk factors of postoperative ileus (POI) in patients who underwent laparoscopic colorectal surgery for malignant lesions. Methods: We conducted analyses in a retrospective observational study with a total of 637 patients at Suzhou Hospital of Nanjing Medical University. Four machine learning algorithms (logistic regression, decision tree, random forest, gradient boosting decision tree) were considered to predict risk factors of POI. The cases were randomly divided into training and testing data sets at a ratio of 8:2. The performance of each model was evaluated by area under the receiver operating characteristic curve (AUC), precision, recall and F1-score. Results: The morbidity of POI in this study was 19.15% (122/637). Gradient boosting decision tree reached the highest AUC (0.76) and was the best model for POI risk prediction. In addition, the results of the importance matrix of gradient boosting decision tree showed that the five most important variables were time to first passage of flatus, opioids during POD3, duration of surgery, height and weight. Conclusions: The gradient boosting decision tree was the optimal model to predict the risk of POI in patients who underwent laparoscopic colorectal surgery for malignant lesions. The results of our study could be useful for clinical guidelines in POI risk prediction.
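The evaluation protocol described above (a random 8:2 split, then AUC, precision, recall and F1-score) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, the labels and scores are made up, and the 0.5 decision threshold is an assumption since the abstract does not give one.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle and split rows into (train, test); 0.2 gives an 8:2 ratio."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 637 patient indices, matching the cohort size in the abstract.
train, test = split_train_test(list(range(637)))

# Hypothetical labels and model scores for a small test set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
area = auc(y_true, scores)
```

AUC is computed from the raw scores and so does not depend on the threshold, whereas precision, recall and F1 all change if the threshold moves.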


2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Georgios Kantidakis ◽  
Hein Putter ◽  
Carlo Lancia ◽  
Jacob de Boer ◽  
Andries E. Braat ◽  
...  

Abstract Background Predicting survival of recipients after liver transplantation is regarded as one of the most important challenges in contemporary medicine. Hence, improving on current prediction models is of great interest. Nowadays, there is a strong discussion in the medical field about machine learning (ML) and whether it has greater potential than traditional regression models when dealing with complex data. Criticism of ML is related to unsuitable performance measures and lack of interpretability, which is important for clinicians. Methods In this paper, ML techniques such as random forests and neural networks are applied to large data of 62,294 patients from the United States with 97 predictors, selected on clinical/statistical grounds from more than 600, to predict survival from transplantation. Of particular interest is also the identification of potential risk factors. A comparison is performed between 3 different Cox models (with all variables, backward selection and LASSO) and 3 machine learning techniques: a random survival forest and 2 partial logistic artificial neural networks (PLANNs). For PLANNs, novel extensions to their original specification are tested. Emphasis is placed on the advantages and pitfalls of each method and on the interpretability of the ML techniques. Results Well-established predictive measures are employed from the survival field (C-index, Brier score and Integrated Brier Score) and the strongest prognostic factors are identified for each model. Clinical endpoint is overall graft-survival defined as the time between transplantation and the date of graft-failure or death. The random survival forest shows slightly better predictive performance than Cox models based on the C-index. Neural networks show better performance than both Cox models and random survival forest based on the Integrated Brier Score at 10 years.
Conclusion In this work, it is shown that machine learning techniques can be a useful tool for both prediction and interpretation in the survival context. Of the ML techniques examined here, PLANN with 1 hidden layer predicts survival probabilities most accurately, being as calibrated as the Cox model with all variables. Trial registration Retrospective data were provided by the Scientific Registry of Transplant Recipients under Data Use Agreement number 9477 for analysis of risk factors after liver transplantation.
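For readers unfamiliar with the performance measures named above, the following is a minimal sketch of a concordance index and a Brier score. It makes simplifying assumptions that the paper may not share: the C-index is Harrell's pairwise version with tied event times skipped, and the Brier score is the plain mean squared error at a single time point with no censoring weights (the Integrated Brier Score additionally averages over time). All data values are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style C-index: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i has the shorter follow-up
    time and that time is an observed event (events[i] == 1).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable


def brier_score(outcomes, predicted_probs):
    """Mean squared difference between 0/1 outcomes and predicted probabilities.

    Simplification: ignores censoring, unlike survival-specific versions.
    """
    pairs = list(zip(outcomes, predicted_probs))
    return sum((p - o) ** 2 for o, p in pairs) / len(pairs)


# Hypothetical follow-up times, event indicators (1 = graft failure or death,
# 0 = censored) and model risk scores for five recipients.
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.6, 0.5, 0.1]
c_index = concordance_index(times, events, risk)

# Hypothetical event status at a fixed horizon and predicted event probabilities.
brier = brier_score([1, 0, 1, 0], [0.8, 0.3, 0.6, 0.1])
```

The two measures answer different questions: the C-index only checks whether higher-risk patients fail sooner (discrimination), while the Brier score also penalises probabilities that are badly calibrated, which is one reason models can rank differently on each.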


2016 ◽  
Vol 56 (4) ◽  
pp. 226
Author(s):  
Yuni Purwanti ◽  
Sutaryo Sutaryo ◽  
Sri Mulatsih ◽  
Pungky Ardani Kusuma

Background Wilms tumor is the most common renal malignancy in children (95%) and one of the leading causes of death in children, with high mortality rates in developing countries. Identifying risk factors for mortality is important in order to provide early intervention to improve cure rates. Objective To identify risk factors for mortality in children with Wilms tumor. Methods We performed a case-control study of children (0-18 years of age) with Wilms tumor admitted to Dr. Sardjito Hospital between 2005 and 2012. The case group consisted of children who died of Wilms tumor, whereas the control group were children who survived. Data were collected from medical records. Statistical analyses using Chi-square and logistic regression tests were done to determine odds ratios and 95% CI of the potential risk factors for mortality from Wilms tumor. Results Thirty-five children with Wilms tumor were admitted to Dr. Sardjito Hospital during the study period. Nine (26%) children died and 26 survived. Stage ≥III was a significant risk factor for mortality in children with Wilms tumor (OR 62.8; 95%CI 5.6 to 70.5). Age ≥2 years (OR 1.4; 95%CI 0.1 to 14.3) and male sex (OR 1.2; 95%CI 0.1 to 10.8) were not significant risk factors for mortality. Conclusion Stage ≥III is a risk factor for mortality in children with Wilms tumor.


Sensors ◽  
2020 ◽  
Vol 20 (9) ◽  
pp. 2734 ◽  
Author(s):  
Ayan Chatterjee ◽  
Martin W. Gerdes ◽  
Santiago G. Martinez

Social determining factors such as the adverse influence of globalization, supermarket growth, fast unplanned urbanization, sedentary lifestyle, economy, and social position slowly develop behavioral risk factors in humans. Behavioral risk factors such as unhealthy habits, improper diet, and physical inactivity lead to physiological risks, and “obesity/overweight” is one of the consequences. “Obesity and overweight” are among the major lifestyle diseases that lead to other health conditions, such as cardiovascular diseases (CVDs), chronic obstructive pulmonary disease (COPD), cancer, diabetes type II, hypertension, and depression. They are not restricted to any age or socio-economic background. The “World Health Organization” (WHO) has anticipated that 30% of global deaths will be caused by lifestyle diseases by 2030 and that these can be prevented with the appropriate identification of associated risk factors and behavioral intervention plans. Health behavior change should be given priority to avoid life-threatening damages. The primary purpose of this study is not to present a risk prediction model but to provide a review of various machine learning (ML) methods and their execution using available sample health data in a public repository related to lifestyle diseases, such as obesity, CVDs, and diabetes type II. In this study, we targeted people, both male and female, aged >20 and <60 years, excluding pregnancy and genetic factors. This paper qualifies as a tutorial article on how to use different ML methods to identify potential risk factors of obesity/overweight.
Although institutions such as the “Center for Disease Control and Prevention (CDC)” and the “National Institute for Clinical Excellence (NICE)” work through their guidelines to understand the causes and consequences of overweight/obesity, we aimed to utilize the potential of data science to assess the correlated risk factors of obesity/overweight after analyzing the existing datasets available in “Kaggle” and the “University of California, Irvine (UCI) database”, and to check, with data-visualization techniques and regression analysis, how the potential risk factors change as the body-energy imbalance changes. Analyzing existing obesity/overweight related data using machine learning algorithms did not produce any brand-new risk factors, but it helped us to understand: (a) how are identified risk factors related to weight change and how do we visualize it? (b) what will be the nature of the data (potential monitorable risk factors) to be collected over time to develop our intended eCoach system for the promotion of a healthy lifestyle targeting “obesity and overweight” as a study case in the future? (c) why have we used the existing “Kaggle” and “UCI” datasets for our preliminary study? (d) which classification and regression models perform better, according to the performance metrics, given the limited volume of the dataset?

