Development of a Diabetes Melitus Detection and Prediction Model Using Light Gradient Boosting Machine and K-Nearest Neighbour

2021 ◽  
Vol 3 (1) ◽  
Author(s):  
B. A Omodunbi

Diabetes mellitus is a health disorder that occurs when the blood sugar level becomes extremely high due to the body's inability to produce the required amount of insulin. The ailment is among the major causes of death in Nigeria and the world at large. This study was carried out to detect diabetes mellitus in patients by developing a hybrid model that comprises two machine learning models, namely Light Gradient Boosting Machine (LGBM) and K-Nearest Neighbor (KNN). The performance metrics employed in evaluating the findings of this study are the Receiver Operating Characteristic (ROC) curve, five-fold cross-validation, precision, and accuracy score. The proposed system had an accuracy of 91% and the area under the Receiver Operating Characteristic curve was 93%. The experimental result shows that the prediction accuracy of the hybrid model is better than traditional machine learning
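
The abstract does not say how the LGBM and KNN components are combined, so the following is only a minimal sketch of the KNN half together with the accuracy and precision metrics it names. The feature pairs (glucose, BMI), the labels, and k = 3 are invented for illustration and are not taken from the study.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points (Euclidean distance)."""
    dists = sorted((math.dist(row, x), label) for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def accuracy_and_precision(y_true, y_pred, positive=1):
    """Accuracy = correct / total; precision = true positives / predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision

# Toy, made-up data: (glucose, BMI) pairs; 1 = diabetic, 0 = not diabetic.
train_X = [(85, 22.0), (90, 24.5), (95, 23.0), (150, 33.0), (160, 35.5), (170, 31.0)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(88, 23.0), (155, 34.0), (100, 25.0), (165, 30.0)]
test_y = [0, 1, 0, 1]

test_pred = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
accuracy, precision = accuracy_and_precision(test_y, test_pred)
print(test_pred, accuracy, precision)
```

In practice the features would normally be scaled before computing distances, since KNN is sensitive to the units of each column.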

2021 ◽  
Author(s):  
Zhihao Song ◽  
Bin Chen ◽  
Yue Huang ◽  
Li Dong ◽  
Tingting Yang

Abstract. The satellite remote-sensing aerosol optical depth (AOD) and meteorological elements were employed to invert PM2.5 in order to control air pollution more effectively. This paper proposes a restricted gradient-descent linear hybrid machine learning model (RGD-LHMLM) by integrating a random forest (RF), a gradient boosting regression tree (GBRT), and a deep neural network (DNN) to estimate the concentration of PM2.5 in China in 2019. The research data included Himawari-8 AOD with high spatiotemporal resolution, ERA-5 meteorological data, and geographic information. The results showed that, in the hybrid model developed by linear fitting, the DNN accounted for the largest proportion, with a weight coefficient of 0.62. The R2 values of RF, GBRT, and DNN were 0.79, 0.81, and 0.8, respectively. The generalization ability of the mixed model was better than that of each sub-model: R2 reached 0.84, and RMSE and MAE were 12.92 µg/m3 and 8.01 µg/m3, respectively. For the RGD-LHMLM, R2 was above 0.7 in more than 70 % of the sites, and RMSE and MAE were below 20 µg/m3 and 15 µg/m3, respectively, in more than 70 % of the sites due to the correlation coefficient having seasonal difference between the meteorological factor and PM2.5. Furthermore, the hybrid model performed best in winter (mean R2 was 0.84) and worst in summer (mean R2 was 0.71). The spatiotemporal distribution characteristics of PM2.5 in China were then estimated and analyzed. According to the results, there was severe pollution in winter, with an average PM2.5 concentration of 62.10 µg/m3. However, there was only slight pollution in summer, with an average PM2.5 concentration of 47.39 µg/m3. The findings also indicate that North China and East China are more polluted than other areas and that their average annual concentration of PM2.5 was 82.68 µg/m3.
Moreover, there was relatively low pollution in Inner Mongolia, Qinghai, and Tibet, for their average PM2.5 concentrations were reported below 40 µg/m3.
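
The three scores quoted in this abstract (R2, RMSE, MAE) have standard definitions, sketched below in plain Python. The observed and predicted values are made-up numbers chosen for easy arithmetic, not data from the study.

```python
import math

def regression_metrics(observed, predicted):
    """Return (R2, RMSE, MAE) for paired observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)              # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / n
    return r2, rmse, mae

# Hypothetical PM2.5 values in µg/m3 (illustrative only).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

r2, rmse, mae = regression_metrics(observed, predicted)
print(round(r2, 3), round(rmse, 3), round(mae, 3))
```

RMSE penalises large errors more heavily than MAE, which is why the two are usually reported together.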


2021 ◽  
Vol 14 (8) ◽  
pp. 5333-5347
Author(s):  
Zhihao Song ◽  
Bin Chen ◽  
Yue Huang ◽  
Li Dong ◽  
Tingting Yang

Abstract. Satellite remote sensing aerosol optical depth (AOD) and meteorological elements were employed to invert PM2.5 (the fine particulate matter with a diameter below 2.5 µm) in order to control air pollution more effectively. This paper proposes a restricted gradient-descent linear hybrid machine learning model (RGD-LHMLM) by integrating a random forest (RF), a gradient boosting regression tree (GBRT), and a deep neural network (DNN) to estimate the concentration of PM2.5 in China in 2019. The research data included Himawari-8 AOD with high spatiotemporal resolution, ERA5 meteorological data, and geographic information. The results showed that, in the hybrid model developed by linear fitting, the DNN accounted for the largest proportion, with a weight coefficient of 0.62. The R2 values of RF, GBRT, and DNN were reported as 0.79, 0.81, and 0.8, respectively. The generalization ability of the mixed model was better than that of each sub-model: R2 (determination coefficient) reached 0.84, and RMSE (root mean square error) and MAE (mean absolute error) were reported as 12.92 and 8.01 µg m−3, respectively. For the RGD-LHMLM, R2 was above 0.7 in more than 70 % of the sites and RMSE and MAE were below 20 and 15 µg m−3, respectively, in more than 70 % of the sites due to the correlation coefficient having a seasonal difference between the meteorological factor and PM2.5. Furthermore, the hybrid model performed best in winter (mean R2 was 0.84) and worst in summer (mean R2 was 0.71). The spatiotemporal distribution characteristics of PM2.5 in China were then estimated and analyzed. According to the results, there was severe pollution in winter with an average concentration of PM2.5 being reported as 62.10 µg m−3. However, there was only slight pollution in summer with an average concentration of PM2.5 being reported as 47.39 µg m−3.
The period from 10:00 to 15:00 LT (Beijing time, UTC+8) every day is the best time for model inversion; at this time the pollution is also high. The findings also indicate that North China and East China are more polluted than other areas, and their average annual concentration of PM2.5 was reported as 82.68 µg m−3. Moreover, there was relatively low pollution in Inner Mongolia, Qinghai, and Tibet, for their average PM2.5 concentrations were reported below 40 µg m−3.
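
The abstract does not spell out what "restricted" means in the RGD-LHMLM. One plausible reading, assumed here purely for illustration, is a linear blend of the sub-model predictions whose weights are fitted by gradient descent and kept non-negative. The sketch below follows that assumption; the function name, learning rate, step count, and toy predictions are all hypothetical.

```python
def fit_blend_weights(sub_preds, target, lr=0.1, steps=2000):
    """Fit non-negative weights w so that sum_j w[j] * sub_preds[j] approximates target.

    Plain gradient descent on mean squared error, with each weight clipped at zero
    after every step (the non-negativity restriction is an assumption, see lead-in).
    """
    m, n = len(sub_preds), len(target)
    w = [1.0 / m] * m  # start from an equal-weight blend
    for _ in range(steps):
        blended = [sum(w[j] * sub_preds[j][i] for j in range(m)) for i in range(n)]
        resid = [b - t for b, t in zip(blended, target)]
        grad = [2.0 / n * sum(resid[i] * sub_preds[j][i] for i in range(n)) for j in range(m)]
        w = [max(0.0, w[j] - lr * grad[j]) for j in range(m)]
    return w

# Toy sub-model outputs (stand-ins for RF, GBRT, DNN predictions; not real data).
rf_pred = [1.0, 0.0, 0.0, 1.0]
gbrt_pred = [0.0, 1.0, 0.0, 1.0]
dnn_pred = [0.0, 0.0, 1.0, 1.0]
# Target constructed as 0.2*RF + 0.3*GBRT + 0.5*DNN so the true weights are known.
target = [0.2, 0.3, 0.5, 1.0]

weights = fit_blend_weights([rf_pred, gbrt_pred, dnn_pred], target)
print([round(x, 4) for x in weights])
```

Because the toy target is an exact linear combination, the fitted weights recover the coefficients used to build it; with real predictions the weights would only approximate the best blend.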


Diagnostics ◽  
2021 ◽  
Vol 11 (11) ◽  
pp. 2102
Author(s):  
Eyal Klang ◽  
Robert Freeman ◽  
Matthew A. Levin ◽  
Shelly Soffer ◽  
Yiftach Barash ◽  
...  

Background & Aims: We aimed to identify specific emergency department (ED) risk factors for developing complicated acute diverticulitis (AD) and to evaluate a machine learning (ML) model for predicting complicated AD. Methods: We analyzed data retrieved from unselected consecutive large bowel AD patients from five hospitals in the Mount Sinai health system, NY. The study time frame was from January 2011 through March 2021. Data were used to train and evaluate a gradient-boosting machine learning model to identify patients with complicated diverticulitis, defined as a need for invasive intervention or in-hospital mortality. The model was trained and evaluated on data from four hospitals and externally validated on held-out data from the fifth hospital. Results: The final cohort included 4997 AD visits. Of them, 129 (2.9%) visits had complicated diverticulitis. Patients with complicated diverticulitis were more likely to be men, to be black, and to arrive by ambulance. Regarding laboratory values, patients with complicated diverticulitis had higher levels of absolute neutrophils (AUC 0.73), white blood cells (AUC 0.70), platelets (AUC 0.68), and lactate (AUC 0.61), and lower levels of albumin (AUC 0.69), chloride (AUC 0.64), and sodium (AUC 0.61). In the external validation cohort, the ML model showed AUC 0.85 (95% CI 0.78–0.91) for predicting complicated diverticulitis. At the threshold given by Youden’s index, the model showed a sensitivity of 88% with a false positive rate of 1:3.6. Conclusions: An ML model trained on clinical measures provides a proof-of-concept performance in predicting complications in patients presenting to the ED with AD. Clinically, it implies that an ML model may classify low-risk patients to be discharged from the ED for further treatment under an ambulatory setting.
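
Youden's index, mentioned in the results above, picks the decision threshold that maximises J = sensitivity + specificity - 1. The sketch below shows that selection on invented risk scores; the scores, labels, and function name are illustrative assumptions, not the study's data or code.

```python
def youden_best_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity, J) for the threshold maximising J.

    A case is predicted positive when its score is >= threshold.
    Ties in J are resolved in favour of the lowest threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 0)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1.0
        if best is None or j > best[3]:
            best = (thr, sens, spec, j)
    return best

# Hypothetical model risk scores; 1 = complicated diverticulitis, 0 = uncomplicated.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

best = youden_best_threshold(scores, labels)
print(best)
```

Youden's index weights sensitivity and specificity equally; a clinical deployment might instead choose a threshold that favours sensitivity.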


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Chalachew Muluken Liyew ◽  
Haileyesus Amsaya Melese

Abstract Predicting the amount of daily rainfall improves agricultural productivity and secures food and water supply to keep citizens healthy. Several studies have applied data mining and machine learning techniques to environmental datasets from different countries to predict rainfall. An erratic rainfall distribution in the country affects the agriculture on which the economy of the country depends. Wise use of rainwater should be planned and practiced in the country to minimize the problems of drought and flooding. The main objective of this study is to identify the relevant atmospheric features that cause rainfall and predict the intensity of daily rainfall using machine learning techniques. The Pearson correlation technique was used to select relevant environmental variables, which were used as input for the machine learning model. The dataset was collected from the local meteorological office at Bahir Dar City, Ethiopia to measure the performance of three machine learning techniques (Multivariate Linear Regression, Random Forest, and Extreme Gradient Boost). Root mean squared error and mean absolute error were used to measure the performance of the machine learning model. The result of the study revealed that the Extreme Gradient Boosting machine learning algorithm performed better than the others.
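
As an illustration of Pearson-correlation feature selection, the sketch below keeps the variables whose absolute correlation with rainfall exceeds a cutoff. The variable names, the toy measurements, and the 0.2 cutoff are assumptions made for this example; the abstract does not state which threshold or variables the study used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_features(features, target, min_abs_r=0.2):
    """Return the names of features whose |r| with the target is at least min_abs_r."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) >= min_abs_r]

# Made-up daily observations (illustrative only).
rainfall = [0.0, 2.0, 5.0, 9.0, 14.0]
features = {
    "humidity": [40.0, 50.0, 60.0, 75.0, 90.0],        # rises with rainfall
    "temperature": [30.0, 28.0, 25.0, 22.0, 20.0],     # falls as rainfall rises
    "wind_direction_code": [3.0, 5.0, 1.0, 5.0, 3.0],  # unrelated to rainfall here
}

selected = select_features(features, rainfall)
print(selected)
```

Pearson correlation only captures linear relationships, so a variable with a strong non-linear effect on rainfall could be discarded by this filter.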


2020 ◽  
Vol 9 (3) ◽  
pp. 875
Author(s):  
Young Suk Kwon ◽  
Moon Seong Baek

The quick sepsis-related organ failure assessment (qSOFA) score has been introduced to predict the likelihood of organ dysfunction in patients with suspected infection. We hypothesized that machine-learning models using qSOFA variables for predicting three-day mortality would provide better accuracy than the qSOFA score in the emergency department (ED). Between January 2016 and December 2018, the medical records of patients aged over 18 years with suspected infection were retrospectively obtained from four EDs in Korea. Data from three hospitals (n = 19,353) were used as training-validation datasets and data from one (n = 4234) as the test dataset. Machine-learning algorithms including extreme gradient boosting, light gradient boosting machine, and random forest were used. We assessed the prediction ability of machine-learning models using the area under the receiver operating characteristic (AUROC) curve, and DeLong’s test was used to compare AUROCs between the qSOFA scores and qSOFA-based machine-learning models. A total of 447,926 patients visited EDs during the study period. We analyzed 23,587 patients with suspected infection who were admitted to the EDs. The median age of the patients was 63 years (interquartile range: 43–78 years) and in-hospital mortality was 4.0% (n = 941). For predicting three-day mortality among patients with suspected infection in the ED, the AUROC of the qSOFA-based machine-learning model (0.86 [95% CI 0.85–0.87]) was higher than that of the qSOFA scores (0.78 [95% CI 0.77–0.79], p < 0.001). For predicting three-day mortality in patients with suspected infection in the ED, the qSOFA-based machine-learning model was found to be superior to the conventional qSOFA scores.
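
To make the baseline concrete, the sketch below computes the conventional qSOFA score (one point each for respiratory rate of 22/min or more, systolic blood pressure of 100 mmHg or less, and Glasgow Coma Scale below 15) and its AUROC using the pairwise-ranking definition. The patient records are invented, and DeLong's test for comparing two AUROCs is not implemented here.

```python
def qsofa_score(resp_rate, systolic_bp, gcs):
    """Conventional qSOFA: one point per criterion met (range 0 to 3)."""
    return int(resp_rate >= 22) + int(systolic_bp <= 100) + int(gcs < 15)

def auroc(scores, labels):
    """AUROC as the probability that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical patients: (respiratory rate, systolic BP, GCS, died within three days).
patients = [
    (16, 120, 15, 0),
    (24, 130, 15, 0),
    (18, 95, 15, 0),
    (20, 110, 15, 0),
    (26, 90, 15, 1),
    (30, 85, 12, 1),
    (22, 118, 15, 1),
]

scores = [qsofa_score(rr, sbp, gcs) for rr, sbp, gcs, _ in patients]
labels = [died for _, _, _, died in patients]
auc = auroc(scores, labels)
print(scores, round(auc, 4))
```

A machine-learning model trained on the same three variables can use their raw values rather than the thresholded points, which is one reason it may discriminate better than the integer score.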


2021 ◽  
Author(s):  
Eric Sonny Mathew ◽  
Moussa Tembely ◽  
Waleed AlAmeri ◽  
Emad W. Al-Shalabi ◽  
Abdul Ravoof Shaik

Abstract A meticulous interpretation of steady-state or unsteady-state relative permeability (Kr) experimental data is required to determine a complete set of Kr curves. In this work, three different machine learning models were developed to assist in a faster estimation of these curves from steady-state drainage coreflooding experimental runs. The three models that were tested and compared were extreme gradient boosting (XGB), deep neural network (DNN) and recurrent neural network (RNN) algorithms. Based on existing mathematical models, a leading-edge framework was developed in which a large database of Kr and Pc curves was generated. This database was used to perform thousands of coreflood simulation runs representing oil-water drainage steady-state experiments. The results obtained from these simulation runs, mainly pressure drop along with other conventional core analysis data, were utilized to estimate Kr curves based on Darcy's law. These analytically estimated Kr curves along with the previously generated Pc curves were fed as features into the machine learning model. The entire data set was split into 80% for training and 20% for testing. The K-fold cross-validation technique was applied to increase the model accuracy by splitting the training data (80% of the data set) into 10 folds. In this manner, for each of the 10 experiments, 9 folds were used for training and the remaining one was used for model validation. Once the model was trained and validated, it was subjected to blind testing on the remaining 20% of the data set. The machine learning model learns to capture fluid flow behavior inside the core from the training dataset. The trained/tested model was thereby employed to estimate Kr curves based on available experimental results. The performance of the developed model was assessed using the values of the coefficient of determination (R2) along with the loss calculated during training/validation of the model.
The respective cross plots along with comparisons of ground-truth versus AI-predicted curves indicate that the model is capable of making accurate predictions, with an error percentage between 0.2 and 0.6% on history matching experimental data for all three tested ML techniques (XGB, DNN, and RNN). This implies that the AI-based model exhibits better efficiency and reliability in determining Kr curves when compared to conventional methods. The results also include a comparison between classical machine learning approaches and shallow and deep neural networks in terms of accuracy in predicting the final Kr curves. The various models discussed in this research work currently focus on the prediction of Kr curves for drainage steady-state experiments; however, the work can be extended to capture the imbibition cycle as well.
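
The data handling described in this abstract (an 80/20 train/test split followed by 10-fold cross-validation on the training portion) can be sketched with index bookkeeping alone. The sample count of 100, the seed, and the function names are hypothetical; the abstract does not state the size of the data set or how the shuffling was done.

```python
import random

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out test_fraction of them for blind testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def k_fold(indices, k=10):
    """Yield (train, validation) index lists; each fold serves as validation exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, validation

# Hypothetical data set of 100 simulated coreflood runs.
train_idx, test_idx = train_test_split_indices(100)
splits = list(k_fold(train_idx, k=10))

for fold_no, (tr, va) in enumerate(splits, start=1):
    print(fold_no, len(tr), len(va))
```

The held-out 20% is never touched during the ten training and validation rounds, which is what makes the final evaluation a blind test.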


2021 ◽  
Author(s):  
Ada Y. Chen ◽  
Juyong Lee ◽  
Ana Damjanovic ◽  
Bernard R. Brooks

We present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction tool PROPKA. The overall RMSE for this model is 0.69, with surface and buried RMSE values being 0.56 and 0.78, respectively, considering six residue types (Asp, Glu, His, Lys, Cys and Tyr), and 0.63 when considering Asp, Glu, His and Lys only. We provide pKa predictions for proteins in the human proteome from the AlphaFold Protein Structure Database and observe that 1% of Asp/Glu/Lys residues have highly shifted pKa values close to the physiological pH.

