scholarly journals Explainable machine learning models to understand determinants of COVID-19 mortality in the United States

Author(s):  
Piyush Mathur ◽  
Tavpritesh Sethi ◽  
Anya Mathur ◽  
Kamal Maheshwari ◽  
Jacek B Cywinski ◽  
...  

AbstractBackgroundCOVID-19 is now one of the leading causes of mortality amongst adults in the United States for the year 2020. Multiple epidemiological models have been built, often based on limited data, to understand the spread and impact of the pandemic. However, many geographic and local factors may have played an important role in higher morbidity and mortality in certain populations.ObjectiveThe goal of this study was to develop machine learning models to understand the relative association of socioeconomic, demographic, travel, and health care characteristics of different states across the United States and COVID-19 mortality.MethodsUsing multiple public data sets, 24 variables linked to COVID-19 disease were chosen to build the models. Two independent machine learning models using CatBoost regression and random forest were developed. SHAP feature importance and a Boruta algorithm were used to elucidate the relative importance of features on COVID-19 mortality in the United States.ResultsFeature importances from both the categorical models, i.e., CatBoost and random forest consistently showed that a high population density, number of nursing homes, number of nursing home beds and foreign travel were strongest predictors of COVID-19 mortality. Percentage of African American amongst the population was also found to be of high importance in prediction of COVID-19 mortality whereas racial majority (primarily, Caucasian) was not. Both models fitted the data well with a training R2 of 0.99 and 0.88 respectively. The effect of median age,median income, climate and disease mitigation measures on COVID-19 related mortality remained unclear.ConclusionsCOVID-19 policy making will need to take population density, pre-existing medical care and state travel policies into account. Our models identified and quantified the relative importance of each of these for mortality predictions using machine learning.

Information ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 57
Author(s):  
Shrirang A. Kulkarni ◽  
Jodh S. Pannu ◽  
Andriy V. Koval ◽  
Gabriel J. Merrin ◽  
Varadraj P. Gurupur ◽  
...  

Background and objectives: Machine learning approaches using random forest have been effectively used to provide decision support in health and medical informatics. This is especially true when predicting variables associated with Medicare reimbursements. However, more work is needed to analyze and predict data associated with reimbursements through Medicare and Medicaid services for physical therapy practices in the United States. The key objective of this study is to analyze different machine learning models to predict key variables associated with Medicare standardized payments for physical therapy practices in the United States. Materials and Methods: This study employs five methods, namely, multiple linear regression, decision tree regression, random forest regression, K-nearest neighbors, and linear generalized additive model, (GAM) to predict key variables associated with Medicare payments for physical therapy practices in the United States. Results: The study described in this article adds to the body of knowledge on the effective use of random forest regression and linear generalized additive model in predicting Medicare Standardized payment. It turns out that random forest regression may have any edge over other methods employed for this purpose. Conclusions: The study provides a useful insight into comparing the performance of the aforementioned methods, while identifying a few intricate details associated with predicting Medicare costs while also ascertaining that linear generalized additive model and random forest regression as the most suitable machine learning models for predicting key variables associated with standardized Medicare payments.


2020 ◽  
Vol 8 (1) ◽  
pp. 15-21
Author(s):  
James G. Koomson

The unprecedented outbreak of COVID-19 also known as the coronavirus has caused a pandemic like none ever seen before this century. Its impact has been massive on a global level. The deadly virus has commanded nations around the world to increase their efforts to fight against the spread of the virus after the stress it has put on resources. With the number of new cases increasing day by day around the world, the objective of this paper is to contribute towards the analysis of the virus by leveraging machine learning models to understand its behavior and predict future patterns in the United States (US) based on data obtained from the COVID-19 Tracking Project.


2020 ◽  
Author(s):  
Anaiy Somalwar

COVID-19 has become a great national security problem for the United States and many other countries, where public policy and healthcare decisions are based on the several models for the prediction of the future deaths and cases of COVID-19. While the most commonly used models for COVID-19 include epidemiological models and Gaussian curve-fitting models, recent literature has indicated that these models could be improved by incorporating machine learning. However, within this research on potential machine learning models for COVID-19 forecasting, there has been a large emphasis on providing an array of different types of machine learning models rather than optimizing a single one. In this research, we suggest and optimize a linear machine learning model with a gradient-based optimizer for the prediction of future COVID-19 cases and deaths in the United States. We also suggest that a hybrid of a machine learning model for shorter range predictions and a Gaussian curve-fitting model or an epidemiological model for longer range predictions could greatly increase the accuracy of COVID-19 forecasting.


2020 ◽  
Author(s):  
Anaiy Somalwar

UNSTRUCTURED COVID-19 has become a great national security problem for the United States and many other countries, where public policy and healthcare decisions are based on the several models for the prediction of the future deaths and cases of COVID-19. While the most commonly used models for COVID-19 include epidemiological models and Gaussian curve-fitting models, recent literature has indicated that these models could be improved by incorporating machine learning. However, within this research on potential machine learning models for COVID-19 forecasting, there has been a large emphasis on providing an array of different types of machine learning models rather than optimizing a single one. In this research, we suggest and optimize a linear machine learning model with a gradient-based optimizer for the prediction of future COVID-19 cases and deaths in the United States. We also suggest that a hybrid of a machine learning model for shorter range predictions and a Gaussian curve-fitting model or an epidemiological model for longer range predictions could greatly increase the accuracy of COVID-19 forecasting. INTERNATIONAL REGISTERED REPORT RR2-https://doi.org/10.1101/2020.08.13.20174631


Author(s):  
Farrikh Alzami ◽  
Erika Devi Udayanti ◽  
Dwi Puji Prabowo ◽  
Rama Aria Megantara

Sentiment analysis in terms of polarity classification is very important in everyday life, with the existence of polarity, many people can find out whether the respected document has positive or negative sentiment so that it can help in choosing and making decisions. Sentiment analysis usually done manually. Therefore, an automatic sentiment analysis classification process is needed. However, it is rare to find studies that discuss extraction features and which learning models are suitable for unstructured sentiment analysis types with the Amazon food review case. This research explores some extraction features such as Word Bags, TF-IDF, Word2Vector, as well as a combination of TF-IDF and Word2Vector with several machine learning models such as Random Forest, SVM, KNN and Naïve Bayes to find out a combination of feature extraction and learning models that can help add variety to the analysis of polarity sentiments. By assisting with document preparation such as html tags and punctuation and special characters, using snowball stemming, TF-IDF results obtained with SVM are suitable for obtaining a polarity classification in unstructured sentiment analysis for the case of Amazon food review with a performance result of 87,3 percent.


2021 ◽  
Vol 9 ◽  
Author(s):  
Daniel Lowell Weller ◽  
Tanzy M. T. Love ◽  
Martin Wiedmann

Recent studies have shown that predictive models can supplement or provide alternatives to E. coli-testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners*3 resampling approaches) and 108 nested (5 learners*9 feature sets*3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.


Sign in / Sign up

Export Citation Format

Share Document