MALARIA PREDICTION MODEL USING ADVANCED ENSEMBLE MACHINE LEARNING TECHNIQUES

2021 ◽  
Vol 10 (6) ◽  
pp. 3794-3801
Author(s):  
Yusuf Aliyu Adamu

Malaria is a life-threatening disease that causes deaths worldwide, and early prediction is necessary to prevent its rapid transmission. In this work, an enhanced ensemble learning approach for predicting malaria outbreaks is proposed. Using a mean-based splitting strategy, the dataset is randomly partitioned into smaller groups. Each split is then modelled with a classification and regression tree (CART), and an accuracy-based weighted aging classifier ensemble is used to construct a homogeneous ensemble from the resulting CART models. This approach is intended to achieve higher performance. Seven different algorithms were tested, along with one ensemble method that combines all seven classifiers. The proposed method achieved an accuracy, precision, and sensitivity of 93%, 92%, and 100%, respectively, outperforming the individual machine learning classifiers and the ensemble method used in this research. The correlation between the variables used is established, along with how each factor contributes to malaria incidence. The results indicate that malaria outbreaks can be predicted successfully using the suggested technique.
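The idea of weighting ensemble members by their accuracy can be illustrated with a minimal sketch. This is not the authors' weighted aging ensemble: the aging mechanism and the mean-based splitting are omitted, the base models are represented only by their label predictions, and all names and toy values below are hypothetical.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_vote(predictions, weights):
    """Combine per-model label predictions by weighted voting.

    predictions: one list of predicted labels per base model.
    weights: one non-negative weight per base model.
    """
    combined = []
    for sample_preds in zip(*predictions):
        score = defaultdict(float)
        for label, weight in zip(sample_preds, weights):
            score[label] += weight
        # The label with the largest total weight wins;
        # ties go to the smaller label because candidates are sorted.
        combined.append(max(sorted(score), key=lambda lab: score[lab]))
    return combined


# Hypothetical validation labels and the predictions of three base trees.
y_val = [1, 0, 1, 1, 0, 0]
val_preds = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
# Assumption: each model's weight is its validation accuracy.
weights = [accuracy(y_val, p) for p in val_preds]

# Hypothetical predictions of the same three trees on four new samples.
test_preds = [
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
ensemble = weighted_vote(test_preds, weights)
print(ensemble)
```

In this toy case the most accurate tree is outvoted on the last sample because the combined weight of the two weaker trees exceeds its own weight.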

10.2196/18910 ◽  
2020 ◽  
Vol 8 (7) ◽  
pp. e18910
Author(s):  
Debbie Rankin ◽  
Michaela Black ◽  
Raymond Bond ◽  
Jonathan Wallace ◽  
Maurice Mulvenna ◽  
...  

Background The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. Objective This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. Methods A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. Results A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. 
When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. Conclusions The results of this study are promising with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
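The evaluation protocol described above (train on real or on synthetic data, always test on real data) can be sketched in a few lines. This is only an illustration under strong assumptions: the "parametric" generator here is a simple independent Gaussian per class and feature, the classifier is a 1-nearest-neighbour rule, and the data are hypothetical toy values rather than any of the 19 datasets used in the study.

```python
import random
import statistics


def fit_parametric_generator(rows, labels):
    """Estimate a mean and standard deviation per class and per feature."""
    params = {}
    for cls in sorted(set(labels)):
        class_rows = [r for r, y in zip(rows, labels) if y == cls]
        columns = list(zip(*class_rows))
        params[cls] = [(statistics.mean(c), statistics.pstdev(c)) for c in columns]
    return params


def sample_synthetic(params, n_per_class, rng):
    """Draw synthetic rows from independent per-feature Gaussians."""
    rows, labels = [], []
    for cls, feature_params in params.items():
        for _ in range(n_per_class):
            rows.append([rng.gauss(mu, sd) for mu, sd in feature_params])
            labels.append(cls)
    return rows, labels


def predict_1nn(train_rows, train_labels, row):
    """Label of the closest training row (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(t, row)) for t in train_rows]
    return train_labels[dists.index(min(dists))]


def accuracy_on(train_rows, train_labels, test_rows, test_labels):
    """Accuracy of the 1-NN rule fitted on the training set."""
    hits = sum(predict_1nn(train_rows, train_labels, r) == y
               for r, y in zip(test_rows, test_labels))
    return hits / len(test_labels)


def toy_real_data(n_per_class, rng):
    """Hypothetical two-class data: class 0 near (0, 0), class 1 near (3, 3)."""
    rows, labels = [], []
    for cls, centre in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            rows.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
            labels.append(cls)
    return rows, labels


rng = random.Random(0)
real_train_rows, real_train_labels = toy_real_data(20, rng)
real_test_rows, real_test_labels = toy_real_data(10, rng)

# Fit the generator on real training data, then sample a synthetic training set.
generator = fit_parametric_generator(real_train_rows, real_train_labels)
synth_rows, synth_labels = sample_synthetic(generator, 20, rng)

# Both models are scored on the same real test set.
acc_real = accuracy_on(real_train_rows, real_train_labels,
                       real_test_rows, real_test_labels)
acc_synth = accuracy_on(synth_rows, synth_labels,
                        real_test_rows, real_test_labels)
print(f"trained on real: {acc_real:.2f}, trained on synthetic: {acc_synth:.2f}")
```

The quantity of interest in the study corresponds to the gap between the two accuracies, computed per dataset, generator, and classifier.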




2021 ◽  
pp. 1-18
Author(s):  
Shashikant Rathod ◽  
Leena Phadke ◽  
Uttam Chaskar ◽  
Chetankumar Patil

BACKGROUND: According to the World Health Organization, one in ten adults will have Type 2 Diabetes Mellitus (T2DM) in the next few years. Autonomic dysfunction is one of the significant complications of T2DM. Autonomic dysfunction is usually assessed by the standard Ewing’s test and resting Heart Rate Variability (HRV) indices. OBJECTIVE: Resting HRV has limited use in screening due to its large intra- and inter-individual variations. Therefore, a combined approach of resting and orthostatic challenge HRV measurement with a machine learning technique was used in the present study. METHODS: A total of 213 subjects of both genders, aged 20 to 70 years, participated in this study from March 2018 to December 2019 at Smt. Kashibai Navale Medical College and General Hospital (SKNMCGH) in Pune, India. The volunteers were categorized according to their glycemic status as control (n = 51, euglycemic) and T2DM (n = 162). Short-term ECG signals were recorded at rest and after an orthostatic challenge. The HRV indices were extracted from the ECG signal as per HRV-Taskforce guidelines. RESULTS: We observed a significant difference in time, frequency, and non-linear resting HRV indices between the control and T2DM groups. A blunted autonomic response to an orthostatic challenge, quantified by percentage difference, was observed in T2DM compared to the control group. HRV patterns during rest and the orthostatic challenge were extracted by various machine learning algorithms. The classification and regression tree (CART) model showed the best performance among all the machine learning algorithms. It achieved an accuracy of 84.04%, a sensitivity of 89.51%, and a specificity of 66.67%, with an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.78, compared to resting HRV alone with 75.12% accuracy, 86.42% sensitivity, 39.22% specificity, and an AUC of 0.63 for differentiating autonomic dysfunction in non-diabetic control and T2DM. 
CONCLUSION: It was possible to develop a Classification and Regression Tree (CART) model to detect autonomic dysfunction. The percentage difference between resting and orthostatic challenge HRV indicates the blunted autonomic response. The developed CART model differentiates autonomic dysfunction in T2DM better using both resting and orthostatic challenge HRV data than using resting HRV data alone. Thus, monitoring HRV parameters using the CART model during rest and after orthostatic challenge may be a better alternative for detecting autonomic dysfunction in T2DM than resting HRV alone.
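The reported accuracy, sensitivity, and specificity are mutually consistent with the stated group sizes, which a short calculation can show. The confusion-matrix counts below are not reported in the abstract; they are inferred by back-calculation from n = 162 (T2DM) and n = 51 (control), assuming T2DM is the positive class.

```python
def metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) as percentages."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return tuple(round(100 * v, 2) for v in (accuracy, sensitivity, specificity))


# Inferred counts (assumption): 162 T2DM subjects are positives, 51 controls are negatives.
combined = metrics(tp=145, fn=17, tn=34, fp=17)   # resting + orthostatic challenge HRV
resting = metrics(tp=140, fn=22, tn=20, fp=31)    # resting HRV alone

print("combined:", combined)
print("resting: ", resting)
```

Under these inferred counts, adding the orthostatic challenge data changes sensitivity only slightly (145 versus 140 of 162 T2DM subjects detected), while the number of correctly classified controls rises from 20 to 34 of 51, which accounts for most of the accuracy gain.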


2014 ◽  
pp. 115-123
Author(s):  
Rachid Beghdad

The purpose of this study is to identify some higher-level KDD features, and to train the resulting set with an appropriate machine learning technique, in order to classify and predict attacks. To achieve that, a two-step approach is proposed. First, Fisher’s ANOVA technique was used to deduce the important features. Second, four types of classification trees, ID3, C4.5, classification and regression tree (CART), and random tree (RnDT), were tested to classify and detect attacks. According to our tests, RnDT leads to the best results. That is why we present here the classification and prediction results of this technique in detail. Some of the remaining results will be used later to make comparisons. We used the KDD’99 data sets to evaluate the considered algorithms. For these evaluations, only the case of the four attack categories was considered. Our simulations show the efficiency of our approach, and also show that it is very competitive with some similar previous works.
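The feature-selection step can be illustrated with a one-way ANOVA F-statistic computed per feature, where a larger F suggests that the feature separates the classes better. This is a minimal sketch, not the authors' exact procedure; the feature names `f1` and `f2`, the class labels, and the values are hypothetical and are not taken from KDD'99.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of numeric values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(features, labels):
    """Sort features by decreasing F-statistic across the class labels."""
    scores = {}
    for name, values in features.items():
        groups = [[v for v, y in zip(values, labels) if y == cls]
                  for cls in sorted(set(labels))]
        scores[name] = anova_f(groups)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three classes, three samples each.
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
features = {
    "f1": [1, 2, 3, 2, 3, 4, 5, 6, 7],   # class means differ clearly
    "f2": [1, 5, 3, 2, 4, 6, 3, 1, 5],   # class means are nearly equal
}
ranked = rank_features(features, labels)
print(ranked)
```

Features at the top of the ranking would be kept for the tree classifiers; the cut-off is a separate choice that the abstract does not specify.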


Author(s):  
K Sumanth Reddy ◽  
Gaddam Pranith ◽  
Karre Varun ◽  
Thipparthy Surya Sai Teja

The compressive strength of concrete plays an important role in determining the durability and performance of concrete. Due to rapid growth in material engineering, finalizing an appropriate mix proportion to obtain the desired compressive strength has become a cumbersome and laborious task. The problem becomes more complex when trying to obtain a rational relation between the concrete materials used and the strength obtained. Developments in computational methods make it possible to obtain such a relation using machine learning techniques, which reduce the influence of outliers and unwanted variables on the determination of compressive strength. In this paper, basic machine learning techniques, namely multilayer perceptron neural network (MLP), support vector machines (SVM), linear regression (LR), and classification and regression tree (CART), have been used to develop a model for determining the compressive strength for two different sets of data (ingredients). Among all the techniques used, SVM provides better results than the others. However, SVM cannot be considered a universal model, because recent literature has shown that such models need more data and that the dynamic nature of the attributes involved plays an important role in determining the efficacy of the model.


2018 ◽  
Vol 2018 ◽  
pp. 1-9 ◽  
Author(s):  
Ya-Han Hu ◽  
Chun-Tien Tai ◽  
Chih-Fong Tsai ◽  
Min-Wei Huang

Digoxin is a high-alert medication because of its narrow therapeutic range and high potential for drug-to-drug interactions (DDIs). Approximately 50% of digoxin toxicity cases are preventable, which motivated us to improve the treatment outcomes of digoxin. The objective of this study is to apply machine learning techniques to predict the appropriateness of initial digoxin dosage. A total of 307 inpatients who were treated with digoxin between 2004 and 2013 at a medical center in Taiwan were included in the study. Ten independent variables, including demographic information, laboratory data, and whether the patients had CHF, were also noted. A patient whose serum digoxin concentration was controlled at 0.5–0.9 ng/mL after the initial digoxin dosage was defined as having an appropriate use of digoxin; otherwise, the patient was defined as having an inappropriate use of digoxin. Weka 3.7.3, an open source machine learning software, was adopted to develop prediction models. Six machine learning techniques were considered, including decision tree (C4.5), k-nearest neighbors (kNN), classification and regression tree (CART), random forest (RF), multilayer perceptron (MLP), and logistic regression (LGR). In the non-DDI group, the area under ROC curve (AUC) of RF (0.912) was excellent, followed by that of MLP (0.813), CART (0.791), and C4.5 (0.784); the remaining classifiers performed poorly. For the DDI group, the AUC of RF (0.892) was the best, followed by CART (0.795), MLP (0.777), and C4.5 (0.774); the other classifiers’ performances were less than ideal. The decision tree-based approaches and MLP exhibited markedly superior accuracy performance, regardless of DDI status. Although digoxin is a high-alert medication, its initial dose can be accurately determined by using data mining techniques such as decision tree-based and MLP approaches. 
Developing a dosage decision support system may serve as a supplementary tool for clinicians and also increase drug safety in clinical practice.
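The target variable in this study follows directly from the stated therapeutic window, which can be written as a small labelling rule. This is a sketch only: the function name is hypothetical, and treating both bounds as inclusive is an assumption, since the abstract does not say how boundary values were handled.

```python
def digoxin_label(serum_ng_per_ml, low=0.5, high=0.9):
    """Label a patient by serum digoxin concentration after the initial dose.

    Assumption: the 0.5-0.9 ng/mL window is inclusive at both ends.
    """
    if low <= serum_ng_per_ml <= high:
        return "appropriate"
    return "inappropriate"


# Hypothetical concentrations, for illustration only.
for value in (0.3, 0.5, 0.7, 1.2):
    print(value, digoxin_label(value))
```

The classifiers listed in the abstract would then be trained to predict this binary label from the ten independent variables.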


2019 ◽  
Vol 147 ◽  
Author(s):  
Phani Krishna Kondeti ◽  
Kumar Ravi ◽  
Srinivasa Rao Mutheneni ◽  
Madhusudhan Rao Kadiri ◽  
Sriram Kumaraswamy ◽  
...  

Filariasis is one of the major public health concerns in India. Approximately 600 million people spread across 250 districts of India are at risk of filariasis. To predict this disease, a pilot-scale study was carried out in 30 villages of Karimnagar district of Telangana from 2004 to 2007 to collect epidemiological and socio-economic data. The collected data were analysed by employing various machine learning techniques such as Naïve Bayes (NB), logistic model tree, probabilistic neural network, J48 (C4.5), classification and regression tree, JRip and gradient boosting machine. The performances of these algorithms are reported using sensitivity, specificity, accuracy and area under ROC curve (AUC). Among all employed classification methods, NB yielded the best AUC of 64% and was equally statistically significant with the rest of the classifiers. In addition, the J48 algorithm generated 23 decision rules that help in developing an early warning system to implement better prevention and control efforts in the management of filariasis.
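The AUC reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that pairwise definition follows; the labels and scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly.

    labels: 1 for positive cases, 0 for negative cases.
    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for four cases.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(example_auc)
```

On this reading, an AUC of 64% means that about 64 of every 100 positive-negative pairs are ordered correctly, against 50 for random scoring.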


Author(s):  
Pardomuan Robinson Sihombing ◽  
Istiqomatul Fajriyah Yuliati

This study examines the application of several machine learning methods, taking the imbalanced data problem into account, to classification modelling of the risk of low birth weight (LBW; BBLR in Indonesian) in infants, which is expected to contribute to reducing LBW births in Indonesia. The machine learning methods used are Classification and Regression Tree (CART), Naïve Bayes, Random Forest, and Support Vector Machine (SVM). Classification modelling with resampling techniques on imbalanced, large datasets proved able to improve classification accuracy, particularly for the minority class, as shown by higher sensitivity values compared with the original data (without treatment). Furthermore, among the five classification models tested, the random forest model gave the best performance, with the highest sensitivity, specificity, G-mean, and AUC values. The most important and influential variables in classifying LBW risk are birth spacing and birth order, antenatal check-ups, and maternal age.
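A minimal sketch of resampling for imbalanced data, together with the G-mean metric, is given below. The abstract does not state which resampling technique was used, so random oversampling of the minority class is an assumption here, and the rows and labels are hypothetical.

```python
import math
import random
from collections import Counter


def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen minority rows until all classes are the same size."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sensitivity * specificity)


# Hypothetical imbalanced data: 8 majority rows (label 0), 2 minority rows (label 1).
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels, random.Random(0))
print(Counter(balanced_labels))
print(round(g_mean(0.9, 0.4), 3))
```

G-mean is low whenever either class is poorly recognised, which is why it is often preferred to plain accuracy on imbalanced data. Resampling should be applied to the training portion only, so that test-set metrics reflect the original class distribution.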

