Applying Random Forest Model Algorithm to GFR Estimation

2020 ◽  
Author(s):  
Peijia Liu ◽  
Dong Yang ◽  
Shaomin Li ◽  
Yutian Chong ◽  
Wentao Hu ◽  
...  

Abstract Background The use of estimated-GFR equations is critical for managing kidney disease in the clinic. However, the performance of the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation has not improved substantially in the past eight years. Here we hypothesized that a random forest regression (RF) method could outperform the revised linear regression used to build the CKD-EPI equation. Methods A total of 1732 participants were enrolled in this study (1333 in the development data set from Tianhe District and 399 in the external data set from Luogang District). Recursive feature elimination (RFE) was applied to the development data to select important variables and build random forest models. The same variables were then used to develop an estimated-GFR equation with linear regression for comparison. The performance of these equations was measured by bias, 30% accuracy, precision and root mean square error (RMSE). Results Of all the variables, creatinine, cystatin C, weight, body mass index (BMI), age, uric acid (UA), blood urea nitrogen (BUN), hematocrit (HCT) and apolipoprotein B (APOB) were selected by the RFE method. The results revealed that the overall performance of the random forest regression models exceeded that of the revised regression models based on the same variables. In the 9-variable model, the RF model was better than revised linear regression in terms of bias, precision, 30% accuracy and RMSE (0.78 vs 2.98, 16.90 vs 23.62, 0.84 vs 0.80, 16.88 vs 18.70, all P < 0.01). In the 4-variable model, the random forest regression model showed an improvement in precision and RMSE compared with the revised regression model (20.82 vs 25.25, P < 0.01; 19.08 vs 20.60, P < 0.001). Bias and 30% accuracy were preferable, but the differences were not statistically significant (0.34 vs 2.07, P = 0.10; 0.8 vs 0.78, P = 0.19, respectively). Conclusions Random forest regression models performed better than revised linear regression models for GFR estimation.
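A minimal sketch of the pipeline this abstract describes: recursive feature elimination (RFE) with a random forest as the ranking estimator selects variables, then a random forest regressor and a linear model are fitted on the same selected variables. The data below are synthetic stand-ins, not the study cohort, and the variable count is illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 300
# Hypothetical predictor matrix: 12 candidate variables (think creatinine,
# cystatin C, weight, ... plus noise columns); "GFR" depends on the first four.
X = rng.normal(size=(n, 12))
y = 90 - 25 * X[:, 0] - 15 * X[:, 1] + 5 * X[:, 2] + 3 * X[:, 3] + rng.normal(scale=5, size=n)

# RFE with a forest as the ranking estimator, keeping 4 variables.
selector = RFE(RandomForestRegressor(n_estimators=50, random_state=0),
               n_features_to_select=4)
selector.fit(X, y)
X_sel = X[:, selector.support_]

# Fit both model families on the same selected variables, as the study does.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_sel, y)
lr = LinearRegression().fit(X_sel, y)

rmse_rf = mean_squared_error(y, rf.predict(X_sel)) ** 0.5
rmse_lr = mean_squared_error(y, lr.predict(X_sel)) ** 0.5
print(f"selected columns: {np.flatnonzero(selector.support_)}")
print(f"in-sample RMSE  RF: {rmse_rf:.2f}  LR: {rmse_lr:.2f}")
```

In the paper the comparison is made on an external validation set with bias, precision and 30% accuracy as well; in-sample RMSE here is only to show the moving parts.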


2021 ◽  
Vol 13 (16) ◽  
pp. 3123
Author(s):  
Chunzhu Wei ◽  
Qianying Zhao ◽  
Yang Lu ◽  
Dongjie Fu

Pearl River Delta (PRD), as one of the most densely populated regions in the world, is facing both natural changes (e.g., sea level rise) and human-induced changes (e.g., dredging for navigation and land reclamation). Bathymetric information is thus important for the protection and management of the estuarine environment, but little effort has been made to comprehensively evaluate the performance of different methods and datasets. In this study, two linear regression models (the linear band model and the log-transformed band ratio model) and two non-linear regression models (the support vector regression model and the random forest regression model) were applied to Landsat 8 (L8) and Sentinel-2 (S2) imagery for bathymetry mapping in 2019 and 2020. Results suggested that a priori area clustering based on spectral features using the K-means algorithm improved estimation accuracy. The random forest regression model performed best, and the three-band combinations outperformed two-band combinations in all models. When the non-linear models were applied with the three-band combination (red, green, blue) to L8 and S2 imagery, the Root Mean Square Error (Mean Absolute Error) decreased by 23.10% (35.53%), and the coefficient of determination (Kling-Gupta efficiency) increased by 0.08 (0.09) on average, compared to those using the linear regression models. Despite the differences in spatial resolution and band wavelength, L8 and S2 performed similarly in bathymetry estimation. This study quantified the relative performance of different models and may shed light on the potential combination of multiple data sources for more timely and accurate bathymetry mapping.
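The a priori clustering strategy above can be sketched as follows: K-means first groups pixels by spectral features, then a separate random forest regressor estimates depth within each cluster. Band values and depths here are synthetic placeholders, not L8/S2 data, and the two "water types" are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 600
# Hypothetical three-band reflectance (red, green, blue) for two water types.
bands = np.vstack([
    rng.normal([0.10, 0.20, 0.30], 0.02, size=(n // 2, 3)),  # clearer water
    rng.normal([0.25, 0.30, 0.28], 0.02, size=(n // 2, 3)),  # more turbid water
])
depth = 20 - 40 * bands[:, 2] + rng.normal(scale=0.5, size=n)  # toy depth signal

# Step 1: cluster the scene by spectral features (a priori area clustering).
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(bands)

# Step 2: fit one regressor per cluster and predict within that cluster.
pred = np.empty(n)
for k in np.unique(labels):
    mask = labels == k
    model = RandomForestRegressor(n_estimators=100, random_state=1)
    model.fit(bands[mask], depth[mask])
    pred[mask] = model.predict(bands[mask])

rmse = float(np.sqrt(np.mean((pred - depth) ** 2)))
print(f"per-cluster RF RMSE: {rmse:.2f} m")
```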


1995 ◽  
Vol 3 (3) ◽  
pp. 133-142 ◽  
Author(s):  
M. Hana ◽  
W.F. McClure ◽  
T.B. Whitaker ◽  
M. White ◽  
D.R. Bahler

Two artificial neural network models were used to estimate the nicotine in tobacco: (i) a back-propagation network and (ii) a linear network. The back-propagation network consisted of an input layer, an output layer and one hidden layer. The linear network consisted of an input layer and an output layer. Both networks used the generalised delta rule for learning. The performance of both networks was compared to the multiple linear regression (MLR) method of calibration. The nicotine content in tobacco samples was estimated for two different data sets. Data set A contained 110 near infrared (NIR) spectra, each consisting of reflected energy at eight wavelengths. Data set B consisted of 200 NIR spectra, with each spectrum having 840 spectral data points. The fast Fourier transform was applied to data set B in order to compress each spectrum into 13 Fourier coefficients. For data set A, the linear regression model gave the best results, followed by the back-propagation network and then the linear network. The true performance of the linear regression model was better than the back-propagation and linear networks by 14.0% and 18.1%, respectively. For data set B, the back-propagation network gave the best result, followed by MLR and the linear network; the linear network and MLR models gave almost the same results. The true performance of the back-propagation network model was better than the MLR and linear network by 35.14%.
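The spectral-compression step mentioned above can be sketched with NumPy: a spectrum of 840 points is reduced to its first 13 Fourier coefficients before calibration. The spectra below are synthetic smooth curves, not the paper's tobacco data, and using coefficient magnitudes as features is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_points, n_coef = 20, 840, 13

# Hypothetical smooth NIR-like spectra: low-frequency components plus noise.
t = np.linspace(0, 1, n_points)
spectra = np.array([
    rng.uniform(0.5, 1.5) * np.sin(2 * np.pi * t)
    + rng.uniform(0.2, 0.8) * np.cos(4 * np.pi * t)
    + rng.normal(scale=0.01, size=n_points)
    for _ in range(n_samples)
])

# Keep the first 13 real-FFT coefficients as the compressed representation.
coeffs = np.fft.rfft(spectra, axis=1)[:, :n_coef]
features = np.abs(coeffs)  # magnitude spectrum as the calibration input

print(features.shape)  # 840 points per spectrum compressed to 13 features
```

Because the spectra are smooth, almost all of their energy sits in these low-order coefficients, which is what makes the 840-to-13 compression viable for calibration.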


2020 ◽  
Vol 9 (11) ◽  
pp. 654
Author(s):  
Guanwei Zhao ◽  
Muzhuang Yang

Mapping population distribution at fine resolutions with high accuracy is crucial to urban planning and management. This paper takes Guangzhou city as the study area and illustrates the gridded population distribution map produced by machine learning methods based on a zoning strategy, with multisource geospatial data such as night light remote sensing data, point of interest data, land use data, and so on. The street-level accuracy evaluation results show that the proposed approach achieved good overall accuracy, with the coefficient of determination (R2) being 0.713 and root mean square error (RMSE) being 5512.9. Meanwhile, the goodness of fit for the single linear regression (LR) model and the random forest (RF) regression model is 0.0039 and 0.605, respectively. For dense areas, the accuracy of the random forest model is better than the linear regression model, while for sparse areas, the accuracy of the linear regression model is better than the random forest model. The results indicated that the proposed method has great potential in fine-scale population mapping. Therefore, it is advised that the zonal modeling strategy should be the primary choice for solving regional differences in population distribution mapping research.
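A hedged sketch of the zonal modeling idea: grid cells are split into dense and sparse zones, and a different model is fitted in each (a random forest for dense zones, linear regression for sparse zones, mirroring the accuracy pattern reported above). The covariates (night-light intensity, POI density) and populations are synthetic, and the median split is an invented stand-in for a real zoning rule.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 400
X = rng.uniform(0, 1, size=(n, 2))  # e.g., night-light intensity, POI density
pop = 1000 * X[:, 0] + 500 * X[:, 1] ** 2 + rng.normal(scale=30, size=n)

dense = pop > np.median(pop)        # toy density split for illustration

# Fit a different model per zone, then stitch the zonal predictions together.
pred = np.empty(n)
rf = RandomForestRegressor(n_estimators=100, random_state=3).fit(X[dense], pop[dense])
lr = LinearRegression().fit(X[~dense], pop[~dense])
pred[dense] = rf.predict(X[dense])
pred[~dense] = lr.predict(X[~dense])

rmse = float(np.sqrt(np.mean((pred - pop) ** 2)))
print(f"zonal-model RMSE: {rmse:.1f}")
```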


2020 ◽  
Author(s):  
Niema Ghanad Poor ◽  
Nicholas C West ◽  
Rama Syamala Sreepada ◽  
Srinivas Murthy ◽  
Matthias Görges

BACKGROUND In the pediatric intensive care unit (PICU), quantifying illness severity can be guided by risk models to enable timely identification and appropriate intervention. Logistic regression models, including the pediatric index of mortality 2 (PIM-2) and pediatric risk of mortality III (PRISM-III), produce a mortality risk score using data that are routinely available at PICU admission. Artificial neural networks (ANNs) outperform regression models in some medical fields. OBJECTIVE In light of this potential, we aim to examine ANN performance, compared to that of logistic regression, for mortality risk estimation in the PICU. METHODS The analyzed data set included patients from North American PICUs whose discharge diagnostic codes indicated evidence of infection and included the data used for the PIM-2 and PRISM-III calculations and their corresponding scores. We stratified the data set into training and test sets, with approximately equal mortality rates, in an effort to replicate real-world data. Data preprocessing included imputing missing data through simple substitution and normalizing data into binary variables using PRISM-III thresholds. A 2-layer ANN model was built to predict pediatric mortality, along with a simple logistic regression model for comparison. Both models used the same features required by PIM-2 and PRISM-III. Alternative ANN models using single-layer or unnormalized data were also evaluated. Model performance was compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC) and their empirical 95% CIs. RESULTS Data from 102,945 patients (including 4068 deaths) were included in the analysis. 
The highest performing ANN (AUROC 0.871, 95% CI 0.862-0.880; AUPRC 0.372, 95% CI 0.345-0.396) that used normalized data performed better than PIM-2 (AUROC 0.805, 95% CI 0.801-0.816; AUPRC 0.234, 95% CI 0.213-0.255) and PRISM-III (AUROC 0.844, 95% CI 0.841-0.855; AUPRC 0.348, 95% CI 0.322-0.367). The performance of this ANN was also significantly better than that of the logistic regression model (AUROC 0.862, 95% CI 0.852-0.872; AUPRC 0.329, 95% CI 0.304-0.351). The performance of the ANN that used unnormalized data (AUROC 0.865, 95% CI 0.856-0.874) was slightly inferior to our highest performing ANN; the single-layer ANN architecture performed poorly and was not investigated further. CONCLUSIONS A simple ANN model performed slightly better than the benchmark PIM-2 and PRISM-III scores and a traditional logistic regression model trained on the same data set. The small performance gains achieved by this two-layer ANN model may not offer clinically significant improvement; however, further research with other or more sophisticated model designs and better imputation of missing data may be warranted.
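A rough sketch of the comparison this abstract describes: a small neural network with two hidden layers versus logistic regression on binarized risk features, scored with AUROC and AUPRC on a stratified test split. Features and outcomes are simulated from an invented logistic model; the real PIM-2/PRISM-III inputs, thresholds, and layer sizes are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
n = 4000
X = rng.integers(0, 2, size=(n, 10)).astype(float)  # binary threshold features
logit = -4.0 + X @ rng.normal(0.6, 0.2, size=10)    # toy outcome generator
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Stratified split keeps the outcome rate roughly equal across sets,
# as the study does for mortality.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=4)

ann = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=4).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=500).fit(X_tr, y_tr)

for name, model in [("ANN", ann), ("LR", lr)]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name}: AUROC={roc_auc_score(y_te, p):.3f}  "
          f"AUPRC={average_precision_score(y_te, p):.3f}")
```

AUPRC is reported alongside AUROC because the outcome is imbalanced; with rare events, AUROC alone can look flattering while precision at clinically useful recall remains low.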


2020 ◽  
Vol 12 (5) ◽  
pp. 41-51
Author(s):  
Shaimaa Mahmoud ◽  
◽  
Mahmoud Hussein ◽  
Arabi Keshk

Opinion mining in social network data is considered one of the most important research areas because a large number of users interact with different topics there. This paper discusses the problem of predicting future product ratings from users' comments. Researchers have approached this problem with machine learning algorithms (e.g., logistic regression, random forest regression, support vector regression, simple linear regression, multiple linear regression, polynomial regression and decision trees). However, the accuracy of these techniques still needs to be improved. In this study, we introduce an approach for predicting future product ratings using LR, RFR and SVR. Our data set consists of tweets and their ratings from 1 to 5. The main goal of our approach is to improve prediction accuracy over existing techniques. SVR predicts future product ratings with a Mean Squared Error (MSE) of 0.4122, the linear regression model with an MSE of 0.4986, and random forest regression with an MSE of 0.4770, which is better than the accuracy of existing approaches.
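An illustrative sketch of this rating-prediction setup: TF-IDF features from short texts feed an SVR predicting a 1-5 rating, scored by MSE. The tweets and ratings below are invented placeholders, not the paper's data set, and the linear kernel is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Hypothetical tweet-like texts with 1-5 ratings.
texts = [
    "love this phone great battery", "terrible screen broke fast",
    "good value for money", "worst purchase ever",
    "amazing camera quality", "battery dies quickly bad",
    "great phone love the camera", "broke after a week terrible",
]
ratings = np.array([5, 1, 4, 1, 5, 2, 5, 1], dtype=float)

# Vectorize the text, then fit a support vector regressor on the ratings.
X = TfidfVectorizer().fit_transform(texts)
model = SVR(kernel="linear", C=1.0).fit(X, ratings)

mse = mean_squared_error(ratings, model.predict(X))
print(f"in-sample MSE: {mse:.3f}")
```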




2020 ◽  
Author(s):  
Nusrat Rouf ◽  
Majid Bashir Malik ◽  
Tasleem Arif

Abstract Introduction: Advancements in information technology, be it hardware, software or communication technology, over the past few decades have rapidly impacted almost every field of study. Machine learning tools and techniques are nowadays applied in every field, opening the way for interdisciplinary research by promising effective analysis and decision-making strategies. COVID-19 has badly affected more than 200 countries within a short span of time, drastically affecting both daily activities and economic activities. Herd behavior of investors triggered panic selling, and as a result stock markets around the world plunged. Methods: In this paper, we analyze the impact of COVID-19 on the NSE (National Stock Exchange) index Nifty50. We employ Pearson correlation and investigate the impact of total confirmed cases and daily cases on the Nifty50 closing price. We use various machine learning regression models for predictive analysis, viz. linear regression with polynomial terms (quadratic, cubic), decision tree regression and random forest regression. Model performance is measured using MSE (mean square error), RMSE (root mean square error) and R2 (R squared) evaluators. Results: Correlation analysis reveals that total confirmed cases and daily cases in both India and the world have a negative correlation with Nifty50 closing prices; moreover, Nifty50 closing prices are more negatively correlated with total confirmed and daily cases in India. Predictive analysis shows that the random forest regression model outperforms all other models. Conclusion: We analyze and predict the impact of COVID-19 on the closing price of the Nifty50 index using Pearson correlation and various machine learning regression models. Results reveal that market volatility is directly proportional to the increase in the number of COVID-19 cases, and the random forest regression model shows comparatively better RMSE and R2 values.
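The regression comparison above can be sketched compactly: linear regression with polynomial terms, a decision tree, and a random forest, each scored with RMSE and R2. Case counts and closing prices below are synthetic placeholders with an assumed negative relation, not NSE data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
cases = np.sort(rng.uniform(0, 1e5, size=120)).reshape(-1, 1)          # daily cases
close = 12000 - 0.02 * cases[:, 0] + rng.normal(scale=150, size=120)   # toy prices

models = {
    "quadratic LR": (PolynomialFeatures(2).fit_transform(cases), LinearRegression()),
    "decision tree": (cases, DecisionTreeRegressor(max_depth=4, random_state=5)),
    "random forest": (cases, RandomForestRegressor(n_estimators=100, random_state=5)),
}
for name, (X, model) in models.items():
    pred = model.fit(X, close).predict(X)
    mse = mean_squared_error(close, pred)
    print(f"{name}: RMSE={mse ** 0.5:.1f}  R2={r2_score(close, pred):.3f}")
```

A held-out split (or time-series cross-validation, given that these are daily prices) would be needed for a fair comparison; the in-sample scores here only show the evaluation machinery.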


2019 ◽  
Vol 12 (3) ◽  
pp. 1209-1225 ◽  
Author(s):  
Christoph A. Keller ◽  
Mat J. Evans

Abstract. Atmospheric chemistry models are a central tool to study the impact of chemical constituents on the environment, vegetation and human health. These models are numerically intense, and previous attempts to reduce the numerical cost of chemistry solvers have not delivered transformative change. We show here the potential of a machine learning (in this case random forest regression) replacement for the gas-phase chemistry in atmospheric chemistry transport models. Our training data consist of 1 month (July 2013) of output of chemical conditions together with the model physical state, produced from the GEOS-Chem chemistry model v10. From this data set we train random forest regression models to predict the concentration of each transported species after the integrator, based on the physical and chemical conditions before the integrator. The choice of prediction type has a strong impact on the skill of the regression model. We find best results from predicting the change in concentration for long-lived species and the absolute concentration for short-lived species. We also find improvements from a simple implementation of chemical families (NOx = NO + NO2). We then implement the trained random forest predictors back into GEOS-Chem to replace the numerical integrator. The machine-learning-driven GEOS-Chem model compares well to the standard simulation. For ozone (O3), errors from using the random forests (compared to the reference simulation) grow slowly and after 5 days the normalized mean bias (NMB), root mean square error (RMSE) and R2 are 4.2 %, 35 % and 0.9, respectively; after 30 days the errors increase to 13 %, 67 % and 0.75, respectively. The biases become largest in remote areas such as the tropical Pacific where errors in the chemistry can accumulate with little balancing influence from emissions or deposition. Over polluted regions the model error is less than 10 % and has significant fidelity in following the time series of the full model. 
Modelled NOx shows similar features, with the most significant errors occurring in remote locations far from recent emissions. For other species such as inorganic bromine species and short-lived nitrogen species, errors become large, with NMB, RMSE and R2 reaching >2100 %, >400 % and <0.1, respectively. This proof-of-concept implementation takes 1.8 times more time than the direct integration of the differential equations, but optimization and software engineering should allow substantial increases in speed. We discuss potential improvements in the implementation, some of its advantages from both a software and hardware perspective, its limitations, and its applicability to operational air quality activities.
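The prediction-type choice discussed above (tendency for long-lived species, absolute concentration for short-lived ones) can be illustrated with a toy one-variable "integrator": the forest learns the concentration change over a time step and the post-integrator concentration is reconstructed from it. The chemistry here is a fabricated first-order loss, not GEOS-Chem output.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
n = 500
c_before = rng.uniform(10, 100, size=n)   # concentration entering the integrator
temp = rng.uniform(250, 310, size=n)      # one physical-state feature
X = np.column_stack([c_before, temp])

# Fabricated "integrator": slow first-order loss, as for a long-lived species.
c_after = c_before * (1 - 0.01 * (temp / 300)) + rng.normal(scale=0.1, size=n)

# Long-lived species: predict the change (tendency), then reconstruct the
# concentration by adding it back to the pre-integrator state.
rf = RandomForestRegressor(n_estimators=100, random_state=6)
rf.fit(X, c_after - c_before)
c_pred = c_before + rf.predict(X)

rmse = float(np.sqrt(np.mean((c_pred - c_after) ** 2)))
print(f"RMSE of reconstructed concentration: {rmse:.3f}")
```

Predicting the change keeps the target small relative to the state, so regression error does not swamp the slowly varying concentration; for a short-lived species, whose post-integrator value is nearly independent of its pre-integrator value, one would instead fit `rf.fit(X, c_after)` directly.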

