Interpretable machine learning prediction of all-cause mortality

Author(s):  
Wei Qiu ◽  
Hugh Chen ◽  
Ayse Berceste Dincer ◽  
Su-In Lee

Abstract: Explainable artificial intelligence provides an opportunity to improve prediction accuracy over standard linear models using “black box” machine learning (ML) models while still revealing insights into a complex outcome such as all-cause mortality. We propose the IMPACT (Interpretable Machine learning Prediction of All-Cause morTality) framework that implements and explains complex, non-linear ML models in epidemiological research, by combining a tree ensemble mortality prediction model and an explainability method. We use 133 variables from NHANES 1999–2014 datasets (number of samples: n = 47,261) to predict all-cause mortality. To explain our model, we extract local (i.e., per-sample) explanations to verify well-studied mortality risk factors, and make new discoveries. We present major factors for predicting x-year mortality (x = 1, 3, 5) across different age groups and their individualized impact on mortality prediction. Moreover, we highlight interactions between risk factors associated with mortality prediction, which leads to findings that linear models do not reveal. We demonstrate that compared with traditional linear models, tree-based models have unique strengths such as: (1) improving prediction power, (2) making no distribution assumptions, (3) capturing non-linear relationships and important thresholds, (4) identifying feature interactions, and (5) detecting different non-linear relationships between models. Given the popularity of complex ML models in prognostic research, combining these models with explainability methods has implications for further applications of ML in medical fields. To our knowledge, this is the first study that combines complex ML models and state-of-the-art feature attributions to explain mortality prediction, which enables us to achieve higher prediction accuracy and gain new insights into the effect of risk factors on mortality.
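To make the idea of a local (per-sample) explanation concrete, the sketch below computes exact Shapley-value attributions for one individual. This is only an illustration under stated assumptions: the abstract does not name its explainability method, and the `toy_risk` function, its thresholds, the feature names, and the baseline individual are all hypothetical stand-ins for a trained tree ensemble.

```python
from itertools import combinations
from math import factorial


def toy_risk(age, smoker, sbp):
    """Hypothetical non-linear risk score (not the IMPACT model)."""
    risk = 0.0
    if age > 60:              # threshold effect
        risk += 0.2
    if sbp > 140:             # second threshold effect
        risk += 0.1
    if age > 60 and smoker:   # interaction between two risk factors
        risk += 0.3
    return risk


def shapley_values(model, sample, baseline):
    """Exact Shapley attributions for one sample.

    Features outside a coalition are replaced by their baseline values.
    The cost is exponential in the number of features, so this is only
    practical for a handful of features.
    """
    names = list(sample)
    n = len(names)

    def value(coalition):
        args = {k: (sample[k] if k in coalition else baseline[k]) for k in names}
        return model(**args)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


# Hypothetical individual and reference (baseline) individual.
sample = {"age": 70, "smoker": 1, "sbp": 150}
baseline = {"age": 40, "smoker": 0, "sbp": 120}
phi = shapley_values(toy_risk, sample, baseline)
print(phi)
```

The attributions sum to the difference between the individual's score and the baseline score, and the interaction term is split evenly between the two features that jointly trigger it, which is the kind of effect a purely additive linear model cannot represent.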

Stroke ◽  
2020 ◽  
Vol 51 (Suppl_1) ◽  
Author(s):  
Agni Orfanoudaki ◽  
Amre M Nouh ◽  
Emma Chesley ◽  
Christian Cadisch ◽  
Barry Stein ◽  
...  

Background: Current stroke risk assessment tools presume the impact of risk factors is linear and cumulative. However, both novel risk factors and their interplay influencing stroke incidence are difficult to reveal using traditional linear models. Objective: To improve upon the Revised-Framingham Stroke Risk Score and design an interactive non-linear Stroke Risk Score (NSRS). Our work aimed at increasing the accuracy of event prediction and uncovering new relationships in an interpretable user-friendly fashion. Methods: A two-phase approach was used to develop our stroke risk score predictor. First, clinical examinations of the Framingham offspring cohort were utilized as the training dataset for the predictive model consisting of 14,196 samples where each clinical examination was considered an independent observation. Optimal Classification Trees (OCT) were used to train a model to predict 10-year stroke risk. Second, this model was validated with 17,527 observations from the Boston Medical Center. The NSRS was developed into an online user-friendly application in the form of a questionnaire (http://www.mit.edu/~agniorf/files/questionnaire_Cohort2.html). Results: The algorithm suggests a key dichotomy between patients with or without history of cardiovascular disease. While the model agrees with known findings, it also identified 23 unique stroke risk profiles and introduced new non-linear relationships, such as the role of T-wave abnormality on electrocardiography and hematocrit levels in a patient’s risk profile. Our results in both the training and validation populations suggested that the non-linear approach significantly improves upon the existing revised Framingham stroke risk calculator in the c-statistic (training 87.43% (CI 0.85-0.90) vs. 73.74% (CI 0.70-0.76); validation 75.29% (CI 0.74-0.76) vs. 65.93% (CI 0.64-0.67)), even in multi-ethnicity populations.
Conclusions: We constructed a highly predictive, interpretable and user-friendly stroke risk calculator using novel machine learning, uncovering new risk factors, interactions and unique profiles. The clinical implications include prioritization of risk factor modification and personalized care improving targeted intervention for stroke prevention.
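The c-statistic used for the comparison above can be illustrated with a short sketch. It assumes a binary outcome, in which case the c-statistic equals the area under the ROC curve; the abstract does not say exactly how its values were computed, and the risks and outcomes below are made-up numbers, not study data.

```python
def c_statistic(risks, events):
    """Concordance for a binary outcome.

    Returns the fraction of (event, non-event) pairs in which the event
    case received the higher predicted risk; tied risks count as half.
    """
    pos = [r for r, e in zip(risks, events) if e == 1]
    neg = [r for r, e in zip(risks, events) if e == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Hypothetical predicted risks and observed outcomes (1 = event).
risks = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
events = [1, 1, 1, 0, 0, 0]
c = c_statistic(risks, events)
print(f"c-statistic: {c:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why a move from roughly 0.66 to 0.75 on a validation cohort is read as a meaningful gain in discrimination.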


2020 ◽  
Vol 12 (5) ◽  
pp. 379-391
Author(s):  
Ihsane Gryech ◽  
Mounir Ghogho ◽  
Hajar Elhammouti ◽  
Nada Sbihi ◽  
Abdellatif Kobbane

The presence of pollutants in the air has a direct impact on our health and causes detrimental changes to our environment. Air quality monitoring is therefore of paramount importance. The high cost of the acquisition and maintenance of accurate air quality stations implies that only a small number of these stations can be deployed in a country. To improve the spatial resolution of the air monitoring process, an interesting idea is to develop data-driven models to predict air quality based on readily available data. In this paper, we investigate the correlations between air pollutant concentrations and meteorological and road traffic data. Using machine learning, regression models are developed to predict pollutant concentrations. Both linear and non-linear models are investigated in this paper. It is shown that non-linear models, namely Random Forest (RF) and Support Vector Regression (SVR), better describe the impact of traffic flows and meteorology on the concentrations of pollutants in the atmosphere. It is also shown that more accurate prediction models can be obtained when including the concentrations of some pollutants as predictors. This may be used to infer the concentrations of some pollutants using those of other pollutants, thereby reducing the number of air pollution sensors.


Author(s):  
Salvatore Tedesco ◽  
Martina Andrulli ◽  
Markus Åkerlund Larsson ◽  
Daniel Kelly ◽  
Antti Alamäki ◽  
...  

As global demographics change, ageing is a global phenomenon which is increasingly of interest in our modern and rapidly changing society. Thus, the application of proper prognostic indices in clinical decisions regarding mortality prediction has assumed a significant importance for personalized risk management (i.e., identifying patients who are at high or low risk of death) and to help ensure effective healthcare services to patients. Consequently, prognostic modelling expressed as all-cause mortality prediction is an important step for effective patient management. Machine learning has the potential to transform prognostic modelling. In this paper, results on the development of machine learning models for all-cause mortality prediction in a cohort of healthy older adults are reported. The models are based on features covering anthropometric variables, physical and lab examinations, questionnaires, and lifestyles, as well as wearable data collected in free-living settings, obtained for the “Healthy Ageing Initiative” study conducted on 2291 recruited participants. Several machine learning techniques including feature engineering, feature selection, data augmentation and resampling were investigated for this purpose. A detailed empirical comparison of the impact of the different techniques is presented and discussed. The achieved performances were also compared with a standard epidemiological model. This investigation showed that, for the dataset under consideration, the best results were achieved with Random UnderSampling in conjunction with Random Forest (either with or without probability calibration). However, while including probability calibration slightly reduced the average performance, it increased the model robustness, as indicated by the lower 95% confidence intervals. 
The analysis showed that machine learning models could provide comparable results to standard epidemiological models while being completely data-driven and disease-agnostic, thus demonstrating the opportunity for building machine learning models on health records data for research and clinical practice. However, further testing is required to significantly improve the model performance and its robustness.
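The resampling step named above, Random UnderSampling, can be sketched in a few lines. This is a minimal illustration, assuming the goal is to shrink every class to the size of the smallest one before training; the function name, the seed, and the toy data are hypothetical, and the study's actual pipeline is not described in the abstract.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so every class has as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sampling without replacement; the rarest class is kept in full.
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical imbalanced data: 10 deaths (label 1) among 100 participants.
rows = [[i] for i in range(100)]
labels = [1 if i < 10 else 0 for i in range(100)]
bal_rows, bal_labels = random_undersample(rows, labels)
counts = Counter(bal_labels)
print(counts)
```

Undersampling should be applied to the training split only, so that evaluation still reflects the original class balance.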


2020 ◽  
Author(s):  
Julio Chevarria ◽  
Donal J Sexton ◽  
Susan L Murray ◽  
Chaudhry E Adeel ◽  
Patrick O’Kelly ◽  
...  

Abstract. Background: Non-traditional cardiovascular risk factors, including calcium and phosphate derangement, may play a role in mortality in renal transplant. The data regarding this effect are conflicting. Our aim was to assess the impact of calcium and phosphate derangements in the first 90 days post-transplant on allograft and recipient outcomes. Methods: We performed a retrospective cohort review of all-adult, first renal transplants in the Republic of Ireland between 1999 and 2015. We divided patients into tertiles based on serum phosphate and calcium levels post-transplant. We assessed their effect on death-censored graft survival and all-cause mortality. We used Stata for statistical analysis and did survival analysis and spline curves to assess the association. Results: We included 1525 renal transplant recipients. Of the total, 86.3% had hypophosphataemia and 36.1% hypercalcaemia. Patients in the lowest phosphate tertile were younger, more likely female, had lower weight, more time on dialysis, received a kidney from a younger donor, had less delayed graft function and better transplant function compared with other tertiles. Patients in the highest calcium tertile were younger, more likely male, had higher body mass index, more time on dialysis and better transplant function. Adjusting for differences between groups, we were unable to show any difference in death-censored graft failure [phosphate = 1.14, 95% confidence interval (CI) 0.92–1.41; calcium = 0.98, 95% CI 0.80–1.20] or all-cause mortality (phosphate = 1.10, 95% CI 0.91–1.32; calcium = 0.96, 95% CI 0.81–1.13) based on tertiles of calcium or phosphate in the initial 90 days. Conclusions: Hypophosphataemia and hypercalcaemia are common occurrences post-kidney transplant. We have identified different risk factors for these metabolic derangements. The calcium and phosphate levels exhibit no independent association with death-censored graft failure and mortality.
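Dividing patients into tertiles, as described in the Methods, amounts to ranking a measurement and cutting the ranking into three equal groups. The sketch below shows one way to do that; the study itself used Stata, and the rank-based rule, the function name, and the phosphate values here are assumptions made for illustration only.

```python
def assign_tertiles(values):
    """Return a tertile index per value: 0 = lowest third, 2 = highest third.

    Groups are formed by rank, so tied values that straddle a cut point
    are split by their original order rather than kept together.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    tertile = [0] * n
    for rank, idx in enumerate(order):
        tertile[idx] = min(rank * 3 // n, 2)
    return tertile


# Hypothetical serum phosphate values for nine patients.
phosphate = [0.5, 0.9, 0.7, 1.1, 0.6, 1.3, 0.8, 1.0, 0.4]
groups = assign_tertiles(phosphate)
print(groups)
```

Outcomes such as graft failure or mortality can then be compared across the three groups, with the lowest or middle tertile typically serving as the reference.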


Electronics ◽  
2020 ◽  
Vol 9 (2) ◽  
pp. 374 ◽  
Author(s):  
Sudhanshu Kumar ◽  
Monika Gahalawat ◽  
Partha Pratim Roy ◽  
Debi Prosad Dogra ◽  
Byung-Gyu Kim

Sentiment analysis is a rapidly growing field of research due to the explosive growth in digital information. In the modern world of artificial intelligence, sentiment analysis is one of the essential tools to extract emotion information from massive data. Sentiment analysis is applied to a variety of user data from customer reviews to social network posts. To the best of our knowledge, there is little work on sentiment analysis based on the categorization of users by demographics. Demographics play an important role in deciding the marketing strategies for different products. In this study, we explore the impact of age and gender in sentiment analysis, as this can help e-commerce retailers to market their products based on specific demographics. The dataset is created by collecting reviews on books from Facebook users by asking them to answer a questionnaire containing questions about their preferences in books, along with their age groups and gender information. Next, the paper analyzes the segmented data for sentiments based on each age group and gender. Finally, sentiment analysis is done using different Machine Learning (ML) approaches including maximum entropy, support vector machine, convolutional neural network, and long short-term memory to study the impact of age and gender on user reviews. Experiments have been conducted to identify new insights into the effect of age and gender for sentiment analysis.


Author(s):  
Lilian Messias Sampaio Brito ◽  
Luis Paulo Gomes Mascarenhas ◽  
Deise Cristiane Moser ◽  
Ana Cláudia Kapp Titski ◽  
Monica Nunes Lima Cat ◽  
...  

DOI: http://dx.doi.org/10.5007/1980-0037.2016v18n6p678 The aim of this study was to investigate the impact of physical activity (PA) and cardiorespiratory fitness (CRF) levels on the prevalence of overweight and high blood pressure levels in adolescents. In this observational, cross-sectional study, 614 boys aged 10-14 years were assessed for height, body mass, body mass index (BMI), waist circumference (WC) and blood pressure (BP). CRF was assessed using a run test (Léger Test) and subjects were then grouped according to their CRF level. PA level was assessed through a questionnaire (The Three Day Physical Activity Recall) and classified into two groups, namely > 300 minutes of PA/week and < 300 minutes of PA/week. Maturational stage was evaluated according to the development of pubic hair (self-assessment) as proposed by Tanner. We used descriptive statistics and univariate and multivariate analyses, both for all participants and with subjects divided by age. Fifty percent of the sample performed < 300 minutes of PA/week and 67.6% had unsatisfactory CRF levels. There was a higher prevalence of unsatisfactory CRF levels among subjects with altered BMI (overweight), WC (abdominal obesity) or BP (high blood pressure) for all age groups. PA history, however, did not show any significance. A total of 31% of participants were overweight, 24.8% had abdominal obesity and 15.4% had increased BP. Unsatisfactory CRF levels were found to be a better predictor for the diagnosis of cardiovascular (CV) disease risk factors than PA history, regardless of age group.


2021 ◽  
Vol 28 (1) ◽  
pp. e100439
Author(s):  
Lukasz S Wylezinski ◽  
Coleman R Harris ◽  
Cody N Heiser ◽  
Jamieson D Gray ◽  
Charles F Spurlock

Introduction: The SARS-CoV-2 (COVID-19) pandemic has exposed health disparities throughout the USA, particularly among racial and ethnic minorities. As a result, there is a need for data-driven approaches to pinpoint the unique constellation of clinical and social determinants of health (SDOH) risk factors that give rise to poor patient outcomes following infection in US communities. Methods: We combined county-level COVID-19 testing data, COVID-19 vaccination rates and SDOH information in Tennessee. Between February and May 2021, we trained machine learning models on a semimonthly basis using these datasets to predict COVID-19 incidence in Tennessee counties. We then analyzed SDOH data features at each time point to rank the impact of each feature on model performance. Results: Our results indicate that COVID-19 vaccination rates play a crucial role in determining future COVID-19 disease risk. Beginning in mid-March 2021, higher vaccination rates significantly correlated with lower COVID-19 case growth predictions. Further, as the relative importance of COVID-19 vaccination data features grew, demographic SDOH features such as age, race and ethnicity decreased while the impact of socioeconomic and environmental factors, including access to healthcare and transportation, increased. Conclusion: Incorporating a data framework to track the evolving patterns of community-level SDOH risk factors could provide policy-makers with additional data resources to improve health equity and resilience to future public health emergencies.


2021 ◽  
Vol 251 ◽  
pp. 01017
Author(s):  
Zhixiang Lu

With the vigorous development of the sharing economy, the short-term rental industry has also spawned many emerging industries that belong to the sharing economy. However, due to the impact of the COVID-19 pandemic in 2020, many sharing economy industries, including the short-term housing leasing industry, have been affected. This study takes the rental information of 1,004 short-term rental houses in New York in April 2020 as an example. Through machine learning and quantitative analysis, we conducted statistical and visual analysis of the impact of different factors on housing rental status. The project uses a machine learning model to predict changes in the rental status of a house over the time series. The results show that the prediction accuracy of the random forest model reached more than 94%, and the prediction accuracy of the logistic model reached more than 74%. We also further explored the impact of time span differences and regional differences on housing rental status.


Author(s):  
David Opeoluwa Oyewola ◽  
Emmanuel Gbenga Dada ◽  
Juliana Ngozi Ndunagu ◽  
Terrang Abubakar Umar ◽  
Akinwunmi S.A

Since the declaration of COVID-19 as a global pandemic, it has been transmitted to more than 200 nations of the world. The harmful impact of the pandemic on the economy of nations is far greater than anything suffered in almost a century. The main objective of this paper is to apply Structural Equation Modeling (SEM) and Machine Learning (ML) to determine the relationships among COVID-19 risk factors, epidemiology factors and economic factors. Structural equation modeling is a statistical technique for calculating and evaluating the relationships of manifest and latent variables. It explores the causal relationship between variables while taking measurement error into account. Bagging (BAG), Boosting (BST), Support Vector Machine (SVM), Decision Tree (DT) and Random Forest (RF) Machine Learning techniques were applied to predict the impact of COVID-19 risk factors. Data from patients who came into contact with coronavirus disease were collected from the Kaggle database between 23 January 2020 and 24 June 2020. Results indicate that COVID-19 risk factors have negative effects on epidemiology factors. They also have negative effects on economic factors.

