Flood Susceptibility Modeling in a Subtropical Humid Low-Relief Alluvial Plain Environment: Application of Novel Ensemble Machine Learning Approach

This study developed a new ensemble model and tested a second ensemble model for flood susceptibility mapping in the Middle Ganga Plain (MGP). The results of the two models were compared quantitatively to assess their performance in zoning flood-susceptible areas in the low-relief, humid subtropical fluvial floodplain environment of the MGP. This part of the MGP, which lies in the central Ganga River Basin (GRB), is experiencing worsening floods under a changing climate, causing increased loss of life and property. With its monsoonal subtropical humid climate, ground subsidence induced by active tectonics, growing population, and shifting land use/land cover trends and patterns, the MGP is an ideal natural laboratory for testing all genres of susceptibility prediction models, in order to identify the best-performing model for this type of topoclimatic setting with a constant number of input parameters. This will help in achieving the goal of model universality, i.e., finding the best-performing susceptibility prediction model for this type of topoclimatic setting with a similar number and type of input variables. Based on a highly accurate flood inventory and 12 flood predictors (FPs), selected using field experience of the study area and a literature survey, two machine learning (ML) ensemble models, CART-FR and CART-EBF, were developed by bagging frequency ratio (FR) and evidential belief function (EBF) with classification and regression tree (CART), and were applied to flood susceptibility zonation mapping. Flood and non-flood points randomly generated from the flood inventory were split in a 70:30 ratio for training and validation of the ensembles.
Performance was evaluated using a threshold-independent statistic, the area under the receiver operating characteristic (AUROC) curve, together with 14 threshold-dependent evaluation metrics and the seed cell area index (SCAI), each assessing a different aspect of the ensembles. The evaluation suggests that CART-EBF (AUCSR = 0.843; AUCPR = 0.819) performed better than CART-FR (AUCSR = 0.828; AUCPR = 0.802). The variability in the performance of these novel ensembles, and their comparison with the results of other published models, underscores the need to test these and other genres of susceptibility models in other topoclimatic environments as well. The results of this study are important for natural hazard managers and can be used to estimate damages through risk analysis.
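The frequency ratio (FR) used in one of the ensembles can be illustrated with a small sketch. It assumes the standard FR definition (the share of flood points falling in a predictor class divided by the share of the study area in that class). The function name, class labels, and toy counts are hypothetical and are not taken from the study, and the way FR values are combined with bagged CART is not shown here.

```python
from collections import Counter


def frequency_ratio(cell_classes, flood_classes):
    """Return {class: FR}, where FR = (share of flood points) / (share of area)."""
    area = Counter(cell_classes)      # number of cells in each predictor class
    floods = Counter(flood_classes)   # number of flood points in each class
    n_cells = len(cell_classes)
    n_floods = len(flood_classes)
    return {
        c: (floods.get(c, 0) / n_floods) / (area[c] / n_cells)
        for c in area
    }


# Hypothetical toy data: the elevation class of every cell in the study area,
# and the elevation class of each cell recorded as flooded.
cells = ["low"] * 50 + ["mid"] * 30 + ["high"] * 20
floods = ["low"] * 8 + ["mid"] * 2
fr = frequency_ratio(cells, floods)
```

An FR above 1 marks a class that holds more flood points than its share of the area would suggest; an FR below 1 marks the opposite.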


Optimization of state-of-the-art fuzzy-metaheuristic ANFIS-based machine learning models for flood susceptibility prediction mapping in the Middle Ganga Plain, India

The Science of The Total Environment ◽

10.1016/j.scitotenv.2020.141565 ◽

2021 ◽

Vol 750 ◽

pp. 141565

Author(s):

Aman Arora ◽

Alireza Arabameri ◽

Manish Pandey ◽

Masood A. Siddiqui ◽

U.K. Shukla ◽

...

Keyword(s):

Machine Learning ◽

State Of The Art ◽

Ganga Plain ◽

Learning Models ◽

Flood Susceptibility ◽

Middle Ganga Plain ◽

Machine Learning Models


GIS-based groundwater potential mapping using boosted regression tree, classification and regression tree, and random forest machine learning models in Iran

Environmental Monitoring and Assessment ◽

10.1007/s10661-015-5049-6 ◽

2015 ◽

Vol 188 (1) ◽

Cited By ~ 208

Author(s):

Seyed Amir Naghibi ◽

Hamid Reza Pourghasemi ◽

Barnali Dixon

Keyword(s):

Machine Learning ◽

Regression Tree ◽

Groundwater Potential ◽

Classification And Regression Tree ◽

Learning Models ◽

Boosted Regression Tree ◽

Potential Mapping ◽

Classification And Regression ◽

Groundwater Potential Mapping ◽

Machine Learning Models


Single-Channel EEG-Based Machine Learning Method for Prescreening Major Depressive Disorder

International Journal of Information Technology & Decision Making ◽

10.1142/s0219622019500342 ◽

2019 ◽

Vol 18 (05) ◽

pp. 1579-1603 ◽

Cited By ~ 2

Author(s):

Zhijiang Wan ◽

Hao Zhang ◽

Jiajin Huang ◽

Haiyan Zhou ◽

Jie Yang ◽

...

Keyword(s):

Machine Learning ◽

Major Depressive Disorder ◽

Depressive Disorder ◽

Single Channel ◽

Regression Tree ◽

Classification And Regression Tree ◽

Machine Learning Method ◽

Learning Method ◽

Eeg Analysis ◽

Major Depressive

Many studies have developed machine learning methods for discriminating Major Depressive Disorder (MDD) patients from normal controls based on multi-channel electroencephalogram (EEG) data, but few have addressed using single-channel EEG collected from the forehead scalp to discriminate MDD. The first EEG dataset was collected from the Fp1 and Fp2 electrodes of a 32-channel EEG system. The results demonstrate that classification performance based on EEG from the Fp1 location exceeds that based on the Fp2 location, and show that single-channel EEG analysis can discriminate MDD at the level of multi-channel EEG analysis. Furthermore, a portable EEG device recording from the Fp1 location was used to collect a second dataset. The Classification and Regression Tree combined with a genetic algorithm (GA) achieved the highest accuracy of 86.67% under leave-one-participant-out cross-validation, which shows that the single-channel EEG-based machine learning method is promising for supporting MDD prescreening applications.
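Leave-one-participant-out cross-validation can be sketched as follows. This is a minimal illustration of the splitting scheme only, assuming each EEG segment is tagged with a participant identifier; the function name and the toy identifiers are hypothetical, and the classifier itself is not shown.

```python
def leave_one_participant_out(participant_ids):
    """Yield (held_out, train_idx, test_idx), holding out each participant once."""
    for held_out in sorted(set(participant_ids)):
        # All segments from the held-out participant form the test fold,
        # so no participant contributes to both training and testing.
        train = [i for i, p in enumerate(participant_ids) if p != held_out]
        test = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train, test


# Hypothetical example: six EEG segments recorded from three participants.
ids = ["p1", "p1", "p2", "p2", "p3", "p3"]
folds = list(leave_one_participant_out(ids))
```

Splitting by participant rather than by segment keeps segments from the same person out of both folds at once, which would otherwise inflate the accuracy estimate.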


Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing

JMIR Medical Informatics ◽

10.2196/18910 ◽

2020 ◽

Vol 8 (7) ◽

pp. e18910

Author(s):

Debbie Rankin ◽

Michaela Black ◽

Raymond Bond ◽

Jonathan Wallace ◽

Maurice Mulvenna ◽

...

Keyword(s):

Machine Learning ◽

Health Care ◽

Bayesian Network ◽

Synthetic Data ◽

Regression Tree ◽

Real Data ◽

Classification And Regression Tree ◽

Supervised Machine Learning ◽

Statistical Disclosure ◽

Classification And Regression

Background: The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. Objective: This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. Methods: A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. Results: A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. Conclusions: The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
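The match rates quoted in the Results can be checked with simple arithmetic. The sketch below only recomputes the percentages from the counts stated in the abstract (matches out of 19 datasets); the helper name is hypothetical.

```python
def pct(matches, total=19):
    """Percentage of datasets with a matching winning classifier, rounded to an integer."""
    return round(100 * matches / total)


# Counts and percentages as quoted in the abstract: {matches out of 19: reported %}.
reported = {5: 26, 4: 21, 18: 95, 14: 74, 10: 53, 13: 68}
recomputed = {m: pct(m) for m in reported}
```

Each recomputed value agrees with the percentage reported alongside its count.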


Estimating the Optimal Dexketoprofen Pharmaceutical Formulation with Machine Learning Methods and Statistical Approaches

Healthcare Informatics Research ◽

10.4258/hir.2021.27.4.279 ◽

2021 ◽

Vol 27 (4) ◽

pp. 279-286

Author(s):

Atakan Başkor ◽

Yağmur Pirinçci Tok ◽

Burcu Mesut ◽

Yıldız Özsoy ◽

Tamer Uçar

Keyword(s):

Machine Learning ◽

Regression Tree ◽

Cost Effective ◽

Pharmaceutical Formulation ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Support Vector ◽

Disintegration Time ◽

Pharmaceutical Dosage Form ◽

Extreme Gradient Boosting

Objectives: Orally disintegrating tablets (ODTs) can be taken without drinking water; this feature makes ODTs easy to use and suitable for specific groups of patients. Oral administration of drugs is the most commonly used route, and tablets constitute the most preferable pharmaceutical dosage form. However, the preparation of ODTs is costly and requires long trials, which creates obstacles for dosage trials. The aim of this study was to identify the most appropriate ODT dexketoprofen formulation using machine learning (ML) models, with the goal of providing a cost-effective and time-saving solution. Methods: This research utilized nonlinear regression models, including the k-nearest neighbors (k-NN), support vector regression (SVR), classification and regression tree (CART), bootstrap aggregating (bagging), random forest (RF), gradient boosting machine (GBM), and extreme gradient boosting (XGBoost) methods, as well as the t-test, to predict the quantity of various components in the dexketoprofen formulation within fixed criteria. Results: All the models were developed with Python libraries. The performance of the ML models was evaluated with R² values and the root mean square error (RMSE). The GBM algorithm gave the best results, with values of 0.99 and 2.88 for hardness, 0.92 and 0.02 for friability, and 0.97 and 10.09 for disintegration time. Conclusions: In this study, we developed a computational approach to estimate the optimal pharmaceutical formulation of dexketoprofen. The results were evaluated by an expert, and it was found that they complied with Food and Drug Administration criteria.
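The two evaluation measures named in the Results, the coefficient of determination and the root mean square error, can be sketched in a few lines. The function names and the measured and predicted values below are hypothetical and are not data from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical measured vs. predicted values for one tablet property.
measured = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]
scores = (r_squared(measured, predicted), rmse(measured, predicted))
```

RMSE is expressed in the units of the predicted property, so it is not directly comparable across properties measured on different scales, whereas the coefficient of determination is unitless.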


Consensus of Feature Selection Methods and Reduced Generalization Gap Model to Improve Diagnosis of Heart Disease

Journal of Scientific Research ◽

10.3329/jsr.v13i3.53290 ◽

2021 ◽

Vol 13 (3) ◽

pp. 901-913

Author(s):

S. Gupta ◽

R. R. Sedamkar

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Heart Disease ◽

Missing Values ◽

Performance Metrics ◽

Model Performance ◽

Regression Tree ◽

Classification And Regression Tree ◽

Proposed Model ◽

Time Required

Improving the diagnostic ability of machine learning (ML) models to a level acceptable to the healthcare community remains a concern. Critical care disease datasets are available online, and researchers have experimented on them with differing numbers of instances and features for similar disease prediction tasks. Further, different ML models have different preprocessing requirements. The Framingham heart disease data is multicollinear and has missing values. Thus, the proposed model explores the differing preprocessing needs of ML models, followed by feature selection in consensus with domain experts and feature extraction to resolve multicollinearity. Missing values were imputed differently for each feature. The work also identifies the optimal training set size by plotting a learning curve and choosing the size that gives the minimum generalization gap. When this hyperparameter-tuned model is tested, performance improves with respect to the support-weighted F score, with stratification applied because the data is imbalanced. Experimental results demonstrate improvements of up to 3% in weighted F score, precision, recall, and accuracy, and of 8% in F1 score, for the Logistic Regression classifier with the proposed model. Further, the time required for hyperparameter tuning is reduced by 50% for tree-based models, particularly Classification and Regression Tree (CART).
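The F score weighted by support can be sketched as below. This assumes the usual definition (per-class F1 averaged with weights equal to each class's share of the true labels); the function name and the toy labels are hypothetical and do not come from the Framingham data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total  # weight by the class's share of true labels
    return score


# Hypothetical imbalanced labels: eight negatives and two positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
wf1 = weighted_f1(y_true, y_pred)
```

Because the weights follow class frequency, the majority class dominates this average, which is one reason it is usually reported alongside per-class scores on imbalanced data.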


HOW MACHINE LEARNING METHOD PERFORMANCE FOR IMBALANCED DATA

TEKNOKOM ◽

10.31943/teknokom.v4i2.64 ◽

2021 ◽

Vol 4 (2) ◽

pp. 48-52

Author(s):

Pardomuan Robinson Sihombing

Keyword(s):

Machine Learning ◽

Regression Tree ◽

Imbalanced Data ◽

Large Data ◽

Original Data ◽

Classification And Regression Tree ◽

Support Vector ◽

Method Performance ◽

Survey Statistics ◽

Working Status

This study examines the application of several machine learning classification methods to imbalanced data. The research was conducted as a case study of classification modeling for working status in Banten Province in 2020. The data come from the National Labor Force Survey of Statistics Indonesia. The machine learning methods used are Classification and Regression Tree (CART), Naïve Bayes, Random Forest, Rotation Forest, Support Vector Machine (SVM), Neural Network Analysis, One Rule (OneR), and Boosting. Classification modeling with resampling techniques on imbalanced, large data sets is shown to improve classification accuracy, especially for minority classes, as seen in sensitivity and specificity values that are more balanced than those obtained on the original (untreated) data. Furthermore, of the eight classification models tested, the Boosting model performs best, with the highest sensitivity, specificity, G-mean, and kappa coefficient values. The most influential variables in the classification of working status are marital status, education, and age.
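The four measures used to rank the models can be computed from a binary confusion matrix, as in the sketch below. It assumes the standard definitions of sensitivity, specificity, G-mean (their geometric mean), and Cohen's kappa; the function name and the counts are hypothetical and are not results from the study.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, G-mean, and Cohen's kappa from confusion-matrix counts."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    g_mean = math.sqrt(sensitivity * specificity)
    observed = (tp + tn) / n      # observed agreement (accuracy)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": g_mean,
        "kappa": kappa,
    }


# Hypothetical counts with a minority positive class (40 of 200 cases).
m = imbalance_metrics(tp=30, fn=10, fp=20, tn=140)
```

G-mean drops sharply when either class is poorly recognized, so it penalizes a model that reaches high accuracy by ignoring the minority class.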


Predictors of Turnover Intention in U.S. Federal Government Workforce: Machine Learning Evidence That Perceived Comprehensive HR Practices Predict Turnover Intention

Public Personnel Management ◽

10.1177/0091026020977562 ◽

2020 ◽

pp. 009102602097756

Author(s):

In Gu Kang ◽

Ben Croft ◽

Barbara A. Bichelmeyer

Keyword(s):

Machine Learning ◽

Job Satisfaction ◽

At Risk ◽

Turnover Intention ◽

Regression Tree ◽

Classification And Regression Tree ◽

Federal Employees ◽

Organizational Policies ◽

Machine Learning Classification ◽

Hr Practices

This study aims to identify important predictors of turnover intention and to characterize subgroups of U.S. federal employees at high risk for turnover intention. Data were drawn from the 2018 Federal Employee Viewpoint Survey (FEVS, unweighted N = 598,003), a nationally representative sample of U.S. federal employees. Machine learning Classification and Regression Tree (CART) analyses, which accounted for sample weights, were conducted to predict turnover intention. The CART analyses identified six at-risk subgroups. Predictor importance scores showed that job satisfaction was the strongest predictor of turnover intention, followed by satisfaction with organization, loyalty, accomplishment, involvement in decisions, likeness to job, satisfaction with promotion opportunities, skill development opportunities, organizational tenure, and pay satisfaction. Consequently, Human Resource (HR) departments should seek to implement comprehensive HR practices that enhance employees’ perceptions of job satisfaction, workplace environments and systems, and favorable organizational policies and supports, and should design tailored interventions for the at-risk subgroups.
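One way a classification tree can account for sample weights is to compute node impurity from summed weights instead of raw counts. The sketch below assumes Gini impurity, a common CART splitting criterion; the abstract does not state which criterion or software was used, and the function name, labels, and weights are hypothetical.

```python
def weighted_gini(labels, weights):
    """Gini impurity of a node, with each observation counted by its survey weight."""
    total = sum(weights)
    class_weight = {}
    for y, w in zip(labels, weights):
        class_weight[y] = class_weight.get(y, 0.0) + w
    return 1.0 - sum((w / total) ** 2 for w in class_weight.values())


# Hypothetical node with three respondents; the "leave" respondent
# represents twice as many employees as each "stay" respondent.
labels = ["stay", "stay", "leave"]
gini_weighted = weighted_gini(labels, [1.0, 1.0, 2.0])
gini_unweighted = weighted_gini(labels, [1.0, 1.0, 1.0])
```

With the weights applied, the node is evenly split between the two classes, so its impurity is higher than the unweighted count would suggest.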


MALARIA PREDICTION MODEL USING ADVANCED ENSEMBLE MACHINE LEARNING TECHNIQUES

Journal of medical pharmaceutical and allied sciences ◽

10.22270/jmpas.v10i6.1701 ◽

2021 ◽

Vol 10 (6) ◽

pp. 3794-3801

Author(s):

Yusuf Aliyu Adamu

Keyword(s):

Machine Learning ◽

Malaria Incidence ◽

Regression Tree ◽

Ensemble Method ◽

Classification And Regression Tree ◽

Machine Learning Techniques ◽

Ensemble Machine Learning ◽

Suggested Technique ◽

Life Threatening ◽

Classification And Regression

Malaria is a life-threatening disease that causes deaths globally, and its early prediction is necessary to prevent rapid transmission. In this work, an enhanced ensemble learning approach for predicting malaria outbreaks is proposed. Using a mean-based splitting strategy, the dataset is randomly partitioned into smaller groups. The splits are then modelled using classification and regression trees, and an accuracy-based weighted aging classifier ensemble is used to construct a homogeneous ensemble from the several Classification and Regression Tree models. This approach is intended to achieve higher performance. Seven different algorithms were tested, along with one ensemble method that combines all seven classifiers. The proposed method achieved accuracy, precision, and sensitivity of 93%, 92%, and 100%, respectively, outperforming the individual machine learning classifiers and the ensemble method used in this research. The correlation between the variables used is established, along with how each factor contributes to malaria incidence. The results indicate that malaria outbreaks can be predicted successfully using the suggested technique.
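The idea of weighting ensemble members by their accuracy can be illustrated with a simple weighted vote. This is a generic sketch and not the authors' method: the aging component of the ensemble named in the abstract is not modelled, and the function name, labels, and accuracy values are hypothetical.

```python
from collections import defaultdict


def accuracy_weighted_vote(predictions, accuracies):
    """Return the label whose supporting models have the largest summed accuracy."""
    scores = defaultdict(float)
    for pred, acc in zip(predictions, accuracies):
        scores[pred] += acc  # each model votes with weight equal to its accuracy
    return max(scores, key=scores.get)


# Hypothetical predictions from three tree models and their validation accuracies.
label = accuracy_weighted_vote(
    ["outbreak", "no outbreak", "no outbreak"],
    [0.95, 0.40, 0.45],
)
```

In this example the single most accurate model outweighs the two weaker models that disagree with it, which a plain majority vote would not allow.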


ANALISA AKAR MASALAH RADIAL RUN OUT BAN MENGGUNAKAN DECISION TREE (ROOT CAUSE ANALYSIS OF TIRE RADIAL RUN-OUT USING DECISION TREE)

Jurnal Muara Sains, Teknologi, Kedokteran dan Ilmu Kesehatan ◽

10.24912/jmstkik.v5i2.9417 ◽

2021 ◽

Vol 5 (2) ◽

pp. 351

Author(s):

Bambang Biantoro ◽

Hernadewita Hernadewita

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Production Process ◽

Radio Frequency Identification ◽

Regression Tree ◽

Classification And Regression Tree ◽

Process Data ◽

Tree Model ◽

Tire Industry ◽

Root Cause

Problem solving in multistage production processes is a challenge for industry. The use of modern techniques such as machine learning to solve quality problems continues to develop. One such machine learning technique is the decision tree. The tire industry has entered the era of the fourth industrial revolution (Industry 4.0) with the use of information technology such as barcodes or radio frequency identification. Using machine learning on data to find the root cause of problems can support the tire industry in industrial competition. This study aims to explore process data in the tire industry to solve one tire quality problem, namely tire radial run-out. The root cause was sought using the Classification and Regression Tree (CART) technique. The input variables cover 60 factors in the production process. The study found that the factors influencing the radial run-out value are the lots of the Tread, Bead, and Sidewall components. The factors causing high tire radial run-out are variations in the lots of the Tread and Bead components. The resulting decision tree model has a precision of 74.7% in detecting high radial run-out events. Improvements to the Tread and Bead component lots derived from the decision tree can reduce the radial run-out defect rate by 99.9%.
Keywords: Decision tree; Root cause analysis; Radial run-out Tire; Data mining
