Explainable machine learning models to understand determinants of COVID-19 mortality in the United States (Preprint)

2020 ◽  
Author(s):  
Piyush Mathur ◽  
Tavpritesh Sethi ◽  
Anya Mathur ◽  
Kamal Maheshwari ◽  
Jacek Cywinski ◽  
...  

Introduction: The COVID-19 pandemic exhibits an uneven geographic spread, which leads to a locational mismatch of testing, mitigation measures, and allocation of healthcare resources (human, equipment, and infrastructure).(1) In the absence of effective treatment, understanding and predicting the spread of COVID-19 is unquestionably valuable for public health and hospital authorities to plan for and manage the pandemic. While many models have been developed to predict mortality, the authors sought to develop a machine learning prediction model that estimates the relative association of socioeconomic, demographic, travel, and health care characteristics with COVID-19 mortality among states in the United States (US).
Methods: State-wise data were collected for all the features predicting COVID-19 mortality and for deriving feature importance (eTable 1 in the Supplement).(2) Key feature categories include demographic characteristics of the population, pre-existing healthcare utilization, travel, weather, socioeconomic variables, racial distribution, and timing of disease mitigation measures (Figures 1 & 2). Two machine learning models, CatBoost regression and random forest, were trained independently to predict mortality in states on data partitioned into a training (80%) and test (20%) set.(3) Model accuracy was assessed by the R2 score. The importance of features for predicting mortality was calculated via two methods: SHAP (SHapley Additive exPlanations) values computed on the CatBoost model, and Boruta, a random forest based method trained with 10,000 trees for calculating statistical significance.(3-5)
Results: Results are based on 60,604 total deaths in the US as of April 30, 2020. The actual number of deaths ranged widely, from 7 (Wyoming) to 18,909 (New York). The CatBoost regression model obtained an R2 score of 0.99 on the training data set and 0.50 on the test set. The random forest model obtained an R2 score of 0.88 on the training data set and 0.39 on the test set. Nine out of twenty variables were significantly higher than the maximum variable importance achieved by the shadow dataset in Boruta regression (Figure 2). Both models showed high feature importance for pre-existing high healthcare utilization, reflected in nursing home beds per capita and doctors per 100,000 population. Overall population characteristics such as total population and population density also correlated positively with the number of deaths. Notably, both models revealed a high positive correlation of deaths with the percentage of African Americans. Direct flights from China, especially Wuhan, were also significant in both models as predictors of deaths, reflecting early spread of the disease. Associations between deaths and weather patterns, hospital bed capacity, median age, and timing of administrative action to mitigate disease spread (such as the closure of educational institutions or stay-at-home orders) were not significant. The lack of some associations, e.g., with administrative action, may reflect delayed outcomes of interventions not yet visible in the data.
Discussion: COVID-19 disease has varied spread and mortality across communities amongst different states in the US. While our models show that high population density, pre-existing need for medical care, and foreign travel may increase transmission and thus COVID-19 mortality, the effect of geographic, climate, and racial disparities on COVID-19 related mortality is not clear.
The purpose of our study was not accurate state-wise prediction of deaths in the US, which has already proven challenging.(6) Location-based understanding of key determinants of COVID-19 mortality is critically needed for focused targeting of mitigation and control measures. Risk assessment based understanding of determinants affecting COVID-19 outcomes, using a dynamic and scalable machine learning model such as the two proposed, can help guide resource management and policy framework.
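
As an illustration of the pipeline described above, the following minimal sketch trains a CatBoost regressor on state-level features and ranks them with SHAP. It uses synthetic data and illustrative column names (pop_density, nursing_home_beds, etc.), not the authors' actual dataset or code.

```python
# Minimal sketch: CatBoost regression + SHAP feature importance.
# Synthetic data with illustrative column names; not the study's dataset.
import numpy as np
import pandas as pd
import shap
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(50, 4)),  # 50 states, 4 hypothetical features
    columns=["pop_density", "nursing_home_beds", "doctors_per_100k", "china_flights"],
)
df["deaths"] = 2.0 * df["pop_density"] + df["nursing_home_beds"] + rng.normal(0, 0.1, 50)

X, y = df.drop(columns=["deaths"]), df["deaths"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = CatBoostRegressor(verbose=0, random_seed=0)
model.fit(X_train, y_train)
print("train R2:", r2_score(y_train, model.predict(X_train)))
print("test  R2:", r2_score(y_test, model.predict(X_test)))

# SHAP assigns each feature a signed contribution to every prediction;
# mean absolute SHAP values give a global importance ranking.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
importance = np.abs(shap_values).mean(axis=0)
for name, imp in sorted(zip(X.columns, importance), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```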

2021 ◽  
Vol 9 ◽  
Author(s):  
Daniel Lowell Weller ◽  
Tanzy M. T. Love ◽  
Martin Wiedmann

Recent studies have shown that predictive models can supplement or provide alternatives to E. coli testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners × 3 resampling approaches) and 108 nested (5 learners × 9 feature sets × 3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.
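
A minimal sketch of the resampling comparison described above, assuming scikit-learn and imbalanced-learn; synthetic binary labels stand in for Listeria presence/absence, and the feature matrix for the microbial, physicochemical, spatial, and weather features.

```python
# Sketch: random forest on imbalanced labels, with and without SMOTE.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# ~10% positives mimics an imbalanced contamination outcome.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("no resampling F1:", f1_score(y_te, rf.predict(X_te)))

# SMOTE synthesizes new minority-class points by interpolating between
# nearest neighbors; it is applied to the training fold only, never the test set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
rf_smote = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print("SMOTE F1:", f1_score(y_te, rf_smote.predict(X_te)))
```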


2020 ◽  
Vol 8 (6) ◽  
pp. 1623-1630

As huge amounts of data accumulate, methods are needed to draw out the required information from what is available. Machine learning contributes to many such fields. The fast-growing population has been accompanied by the evolution of a wide range of diseases, which in turn creates the need for machine learning models that use patients' datasets. Analyses of datasets from different sources show that cancer is among the most hazardous diseases and may cause the death of the patient. Surveys indicate that cancer can often be cured when caught in its initial stages, but may be fatal in later stages. Lung cancer is one of the major types of cancer, and its detection depends heavily on past data, which makes early-stage detection essential. The proposed work uses a machine learning algorithm to group individual records into categories and predict whether a person is likely to be exposed to cancer at an early stage. A random forest algorithm is implemented and achieves an accuracy of 97%, outperforming KNN and Naive Bayes: the KNN algorithm does not learn anything from the training data but merely uses it at classification time, and Naive Bayes yields less accurate predictions. The proposed system predicts the chances of lung cancer at three levels, namely low, medium, and high. Mortality rates can thus be reduced significantly.
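
A hedged sketch of the three-level (low/medium/high) risk classifier described above; the synthetic features stand in for patient attributes, which the abstract does not enumerate.

```python
# Sketch: random forest predicting three risk levels on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Three classes play the role of the low/medium/high risk levels.
X, y = make_classification(n_samples=900, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
levels = {0: "low", 1: "medium", 2: "high"}
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("first patient risk:", levels[clf.predict(X_te[:1])[0]])
```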


Information ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 57
Author(s):  
Shrirang A. Kulkarni ◽  
Jodh S. Pannu ◽  
Andriy V. Koval ◽  
Gabriel J. Merrin ◽  
Varadraj P. Gurupur ◽  
...  

Background and objectives: Machine learning approaches using random forest have been effectively used to provide decision support in health and medical informatics. This is especially true when predicting variables associated with Medicare reimbursements. However, more work is needed to analyze and predict data associated with reimbursements through Medicare and Medicaid services for physical therapy practices in the United States. The key objective of this study is to analyze different machine learning models to predict key variables associated with Medicare standardized payments for physical therapy practices in the United States. Materials and Methods: This study employs five methods, namely, multiple linear regression, decision tree regression, random forest regression, K-nearest neighbors, and the linear generalized additive model (GAM), to predict key variables associated with Medicare payments for physical therapy practices in the United States. Results: The study described in this article adds to the body of knowledge on the effective use of random forest regression and the linear generalized additive model in predicting Medicare standardized payments. It turns out that random forest regression may have an edge over the other methods employed for this purpose. Conclusions: The study provides useful insight by comparing the performance of the aforementioned methods, identifies a few intricate details associated with predicting Medicare costs, and ascertains that the linear generalized additive model and random forest regression are the most suitable machine learning models for predicting key variables associated with standardized Medicare payments.
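
A minimal sketch comparing the scikit-learn regressors named above on cross-validated R². Synthetic data replaces the Medicare physical therapy dataset, and the GAM is omitted here to keep the sketch to scikit-learn estimators.

```python
# Sketch: comparing four regressors by 5-fold cross-validated R2.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
models = {
    "multiple linear regression": LinearRegression(),
    "decision tree regression": DecisionTreeRegressor(random_state=0),
    "random forest regression": RandomForestRegressor(random_state=0),
    "k-nearest neighbors": KNeighborsRegressor(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f}")
```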


2020 ◽  
Author(s):  
Piyush Mathur ◽  
Tavpritesh Sethi ◽  
Anya Mathur ◽  
Kamal Maheshwari ◽  
Jacek B Cywinski ◽  
...  

Abstract
Background: COVID-19 is now one of the leading causes of mortality amongst adults in the United States for the year 2020. Multiple epidemiological models have been built, often based on limited data, to understand the spread and impact of the pandemic. However, many geographic and local factors may have played an important role in higher morbidity and mortality in certain populations.
Objective: The goal of this study was to develop machine learning models to understand the relative association of socioeconomic, demographic, travel, and health care characteristics of different states across the United States with COVID-19 mortality.
Methods: Using multiple public data sets, 24 variables linked to COVID-19 disease were chosen to build the models. Two independent machine learning models using CatBoost regression and random forest were developed. SHAP feature importance and the Boruta algorithm were used to elucidate the relative importance of features for COVID-19 mortality in the United States.
Results: Feature importances from both models, i.e., CatBoost and random forest, consistently showed that high population density, number of nursing homes, number of nursing home beds, and foreign travel were the strongest predictors of COVID-19 mortality. The percentage of African Americans in the population was also found to be of high importance in predicting COVID-19 mortality, whereas racial majority (primarily, Caucasian) was not. Both models fitted the training data well, with R2 scores of 0.99 and 0.88, respectively. The effects of median age, median income, climate, and disease mitigation measures on COVID-19 related mortality remained unclear.
Conclusions: COVID-19 policy making will need to take population density, pre-existing medical care, and state travel policies into account. Our models identified and quantified the relative importance of each of these for mortality predictions using machine learning.
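
A hedged sketch of the Boruta step described above, assuming the boruta_py package (BorutaPy); synthetic data replaces the 24 state-level variables, and the tree count is reduced from the study's 10,000 for speed.

```python
# Sketch: Boruta feature selection with a random forest base learner.
# Boruta duplicates every feature as a shuffled "shadow" copy and keeps
# only real features that repeatedly beat the best shadow feature.
import numpy as np
from boruta import BorutaPy
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=24, n_informative=6,
                       noise=5.0, random_state=0)

rf = RandomForestRegressor(n_jobs=-1, random_state=0)
# The study used 10,000 trees; 500 keeps this sketch fast.
selector = BorutaPy(rf, n_estimators=500, random_state=0)
selector.fit(X, y)  # BorutaPy expects numpy arrays, not DataFrames

print("confirmed feature indices:", np.where(selector.support_)[0])
print("tentative feature indices:", np.where(selector.support_weak_)[0])
```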


Author(s):  
Christoph Völker ◽  
Rafia Firdous ◽  
Dietmar Stephan ◽  
Sabine Kruschwitz

Abstract
Alkali-activated binders (AAB) can provide a clean alternative to conventional cement in terms of CO2 emissions. However, as yet there are no sufficiently accurate material models to effectively predict the AAB properties, thus making optimal mix design highly costly and reducing the attractiveness of such binders. This work adopts sequential learning (SL) in high-dimensional material spaces (consisting of composition and processing data) to find AABs that exhibit desired properties. The SL approach combines machine learning models and feedback from real experiments. For this purpose, 131 data points were collected from different publications. The data sources are described in detail, and the differences between the binders are discussed. The sought-after target property is the compressive strength of the binders after 28 days. The success is benchmarked in terms of the number of experiments required to find materials with the desired strength. The influence of some constraints was systematically analyzed, e.g., the possibility to parallelize the experiments, the influence of the chosen algorithm and the size of the training data set. The results show the advantage of SL, i.e., the amount of data required can potentially be reduced by at least one order of magnitude compared to traditional machine learning models, while at the same time exploiting highly complex information. This brings applications in laboratory practice within reach.
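
A minimal sketch of such an SL loop under stated assumptions: a random forest surrogate whose per-tree spread serves as the uncertainty estimate, a synthetic strength function in place of real lab measurements, and a simple exploit-plus-explore acquisition score. The paper's actual algorithms and data differ.

```python
# Sketch: sequential learning loop — fit a surrogate on the labeled pool,
# pick the most promising candidate, "measure" it, and repeat.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(131, 6))                    # composition/processing features
strength = 20 + 60 * X[:, 0] * X[:, 1] + rng.normal(0, 2, 131)  # hidden "lab" truth

labeled = list(range(10))                         # small initial training set
candidates = [i for i in range(131) if i not in labeled]
target = 60.0                                     # desired 28-day strength

for round_no in range(1, 21):
    model = RandomForestRegressor(random_state=0).fit(X[labeled], strength[labeled])
    # Mean across trees is the prediction; spread across trees is a
    # cheap uncertainty proxy for the acquisition score.
    per_tree = np.stack([tree.predict(X[candidates]) for tree in model.estimators_])
    score = per_tree.mean(axis=0) + per_tree.std(axis=0)
    pick = candidates[int(np.argmax(score))]
    labeled.append(pick)                          # "run the experiment"
    candidates.remove(pick)
    if strength[pick] >= target:
        print(f"target strength reached after {round_no} new experiments")
        break
```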


2021 ◽  
Author(s):  
Dong Wang ◽  
JinBo Li ◽  
Yali Sun ◽  
Xianfei Ding ◽  
Xiaojuan Zhang ◽  
...  

Abstract
Background: Although numerous studies are conducted every year on how to reduce the fatality rate associated with sepsis, it is still a major challenge faced by patients, clinicians, and medical systems worldwide. Early identification and prediction of patients at risk of sepsis and of adverse outcomes associated with sepsis are critical. We aimed to develop an artificial intelligence algorithm that can predict sepsis early.
Methods: This was a secondary analysis of an observational cohort study from the Intensive Care Unit of the First Affiliated Hospital of Zhengzhou University. A total of 4449 infected patients were randomly assigned to the development and validation data sets at a ratio of 4:1. After extracting electronic medical record data, a set of 55 features (variables) was calculated and passed to the random forest algorithm to predict the onset of sepsis.
Results: The pre-procedure clinical variables were used to build a prediction model from the training data set using the random forest machine learning method; 5-fold cross-validation was used to evaluate the prediction accuracy of the model. Finally, we tested the model using the validation data set. The area under the receiver operating characteristic (ROC) curve (AUC) obtained by the model was 0.91, the sensitivity was 87%, and the specificity was 89%.
Conclusions: The newly established model can accurately predict the onset of sepsis in ICU patients in clinical settings as early as possible. Prospective studies are necessary to determine the clinical utility of the proposed sepsis prediction model.
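
A sketch of the evaluation protocol above: a random forest scored by 5-fold cross-validated AUC on the development set, then a final check on the held-out validation split. Synthetic data replaces the 55 EMR features.

```python
# Sketch: 4:1 development/validation split with 5-fold CV on the dev set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=4449, n_features=55, weights=[0.8, 0.2], random_state=0)
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # 4:1 ratio as in the study

clf = RandomForestClassifier(random_state=0)
cv_auc = cross_val_score(clf, X_dev, y_dev, cv=5, scoring="roc_auc")
print("5-fold CV AUC:", cv_auc.mean())

clf.fit(X_dev, y_dev)
print("validation AUC:", roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
```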


2019 ◽  
Author(s):  
Mojtaba Haghighatlari ◽  
Ching-Yen Shih ◽  
Johannes Hachmann

The appropriate sampling of training data out of a potentially imbalanced data set is of critical importance for the development of robust and accurate machine learning models. A challenge that underpins this task is the partitioning of the data into groups of similar instances, and the analysis of the group populations. In molecular data sets, different groups of molecules may be hard to identify. However, if the distribution of a given data set is ignored then some of these groups may remain under-represented and the sampling biased, even if the size of data is large. In this study, we use the example of the Harvard Clean Energy Project (CEP) data set to assess the challenges posed by imbalanced data and the impact that accounting for different groups during the selection of training data has on the quality of the resulting machine learning models. We employ a partitioning criterion based on the underlying rules for the CEP molecular library generation to identify groups of structurally similar compounds. First, we evaluate the performance of regression models that are trained globally (i.e., by randomly sampling the entire data set for training data). This traditional approach serves as the benchmark reference. We compare its results with those of models that are trained locally, i.e., within each of the identified molecular domains. We demonstrate that local models outperform the best reported global models by considerable margins and are more efficient in their training data needs. We propose a strategy to redesign training sets for the development of improved global models. While the resulting uniform training sets can successfully yield robust global models, we identify the distribution mismatch between feature representations of different molecular domains as a critical limitation for any further improvement. We take advantage of the discovered distribution shift and propose an ensemble of classification and regression models to achieve a generalized and reliable model that outperforms the state-of-the-art model, trained on the CEP data set. Moreover, this study provides a benchmark for the development of future methodologies concerned with imbalanced chemical data.
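
As a hedged illustration of the local-model idea, the sketch below routes each instance to a group with a classifier and defers to that group's regressor; synthetic clusters stand in for the CEP structural domains, and the model choices are illustrative, not the authors'.

```python
# Sketch: classifier routes to a domain, then a per-domain regressor predicts.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X, domain = make_blobs(n_samples=600, centers=3, n_features=8, random_state=0)
y = X[:, 0] * (domain + 1) + rng.normal(0, 0.1, len(X))  # property differs per domain

router = RandomForestClassifier(random_state=0).fit(X, domain)
local_models = {
    d: RandomForestRegressor(random_state=0).fit(X[domain == d], y[domain == d])
    for d in np.unique(domain)
}

def predict(x_new):
    # Classify the domain first, then defer to that domain's local regressor.
    d = router.predict(x_new.reshape(1, -1))[0]
    return local_models[d].predict(x_new.reshape(1, -1))[0]

print("prediction:", predict(X[0]))
```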


Author(s):  
Soroor Karimi ◽  
Bohan Xu ◽  
Alireza Asgharpour ◽  
Siamack A. Shirazi ◽  
Sandip Sen

Abstract
AI approaches include machine learning algorithms in which models are trained on existing data to predict the behavior of the system for previously unseen cases. Recent studies at the Erosion/Corrosion Research Center (E/CRC) have shown that these methods can be quite effective in predicting erosion. However, these methods are not widely used in the engineering industries due to the lack of work and information in this area. Moreover, in most of the available literature, the reported models and results have not been rigorously tested, which suggests that these models cannot be fully trusted for the applications for which they are trained. Therefore, in this study three machine learning models, namely Elastic Net, Random Forest, and Support Vector Machine (SVM), are utilized to increase the confidence in these tools. First, these models are trained with a training data set. Next, the model hyper-parameters are optimized using nested cross validation. Finally, the results are verified with a test data set. This process is repeated several times to ensure the accuracy of the results. In order to predict erosion under different conditions with these three models, six main variables are considered in the training data set: material hardness, pipe diameter, particle size, liquid viscosity, liquid superficial velocity, and gas superficial velocity. All three studied models show good prediction performance. The Random Forest and SVM approaches, however, show slightly better results than Elastic Net. The performance of these models is compared to both CFD erosion simulation results and to Sand Production Pipe Saver (SPPS) results, a mechanistic erosion prediction software developed at the E/CRC. The comparison shows that the SVM predictions match both CFD and SPPS more closely. The application of the AI models to determine the uncertainty of calculated erosion is also discussed.
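
A sketch of the nested cross-validation used above: hyper-parameters are tuned in an inner loop (GridSearchCV) while an outer loop provides an unbiased performance estimate. Synthetic features stand in for the six erosion variables (hardness, pipe diameter, particle size, etc.).

```python
# Sketch: nested CV — inner grid search, outer performance estimate.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)

inner = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=3,
)
# Each outer fold refits the full inner search, so the test fold never
# influences hyper-parameter selection.
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="r2")
print("nested CV R2:", outer_scores.mean())
```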


2021 ◽  
Vol 6 (2) ◽  
pp. 213
Author(s):  
Nadya Intan Mustika ◽  
Bagus Nenda ◽  
Dona Ramadhan

This study aims to implement a machine learning algorithm for detecting fraud based on a historical data set from a retail consumer financing company. The outputs of the machine learning models are used as samples for the fraud detection team. Data analysis is performed through data processing, feature selection, hold-out validation, and accuracy testing. Five machine learning methods are applied in this study: Logistic Regression, K-Nearest Neighbor (KNN), Decision Tree, Random Forest, and Support Vector Machine (SVM). Historical data are divided into two groups: training data and test data. The results show that the Random Forest algorithm has the highest accuracy, with a training score of 0.994999 and a test score of 0.745437. This means that the Random Forest algorithm is the most accurate of the tested methods for detecting fraud. Further research is suggested to add more predictor variables to increase the accuracy value and to apply this method to different financial institutions and industries.
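
A minimal sketch of the hold-out evaluation above: split once into train/test, then compare training and test accuracy. A large gap between the two (as in the reported 0.995 vs 0.745) usually signals overfitting. Synthetic data stands in for the company's transaction records.

```python
# Sketch: hold-out split and train/test accuracy comparison.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Fraud is rare, so positives are heavily under-represented.
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("training score:", clf.score(X_tr, y_tr))
print("test score:", clf.score(X_te, y_te))
```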

