The Comparison of Tree-Based Ensemble Machine Learning for Classifying Public Datasets

2021 ◽  
Vol 1 (1) ◽  
pp. 407-413
Author(s):  
Nur Heri Cahyana ◽  
Yuli Fauziah ◽  
Agus Sasmito Aribowo

This study aims to determine the best tree-based ensemble machine learning methods for classifying the datasets used, 34 in total. It also examines the relationship between the number of records and columns of each test dataset and the number of estimators (trees) for each ensemble model, namely Random Forest, Extra Tree Classifier, AdaBoost, and Gradient Boosting. The four methods are compared on maximum accuracy and on the number of estimators required when classifying the test datasets. From the experimental results, the best tree-based ensemble method and the best number of estimators were obtained for each dataset used in the study. Extra Tree is the best classifier for both binary-class and multi-class problems, Random Forest is good for multi-class problems, and AdaBoost is a fairly good method for binary-class problems. The numbers of rows, columns, and classes are positively correlated with the number of estimators: processing a dataset with many rows, columns, or classes requires more estimators than processing a small one. The number of classes, however, is negatively correlated with accuracy, meaning that accuracy decreases as the number of classes to be distinguished grows.
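The comparison protocol described above can be sketched as follows; this is an illustrative reconstruction, not the authors' code, using a synthetic dataset and an invented estimator grid:

```python
# Hypothetical sketch of the comparison: for each tree-based ensemble,
# sweep the number of estimators and record the best held-out accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)

X, y = make_classification(n_samples=400, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {"RandomForest": RandomForestClassifier,
          "ExtraTrees": ExtraTreesClassifier,
          "AdaBoost": AdaBoostClassifier,
          "GradientBoosting": GradientBoostingClassifier}

best = {}
for name, cls in models.items():
    for n in (10, 50, 100):              # illustrative estimator grid
        acc = cls(n_estimators=n, random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
        if name not in best or acc > best[name][0]:
            best[name] = (acc, n)        # (best accuracy, n_estimators)

for name, (acc, n) in best.items():
    print(f"{name}: acc={acc:.3f} at n_estimators={n}")
```

On a real study, the grid of estimator counts and the 34 public datasets would replace the synthetic stand-ins here.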

Energies ◽  
2021 ◽  
Vol 14 (4) ◽  
pp. 1052
Author(s):  
Baozhong Wang ◽  
Jyotsna Sharma ◽  
Jianhua Chen ◽  
Patricia Persaud

Estimation of fluid saturation is an important step in dynamic reservoir characterization, and machine learning techniques have been increasingly used in reservoir saturation prediction workflows in recent years. However, most of these studies require input parameters derived from cores, petrophysical logs, or seismic data, which may not always be readily available. Additionally, very few studies incorporate production data, which is an important reflection of dynamic reservoir properties and typically the most frequently and reliably measured quantity throughout the life of a field. In this research, a random forest ensemble machine learning algorithm is implemented that uses field-wide production and injection data (both measured at the surface) as the only input parameters to predict time-lapse oil saturation profiles at well locations. The algorithm is optimized using feature selection based on feature importance scores and Pearson correlation coefficients, in combination with geophysical domain knowledge. The workflow is demonstrated using actual field data from a structurally complex, heterogeneous, and heavily faulted offshore reservoir. The random forest model captures the trends from three and a half years of historical field production, injection, and simulated saturation data to predict future time-lapse oil saturation profiles at four deviated well locations with over 90% R-squared, less than 6% root mean square error, and less than 7% mean absolute percentage error in each case.
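The feature-selection step described here can be illustrated with a minimal sketch, assuming the common pattern of ranking features by random-forest importance and then dropping one of any pair whose Pearson correlation exceeds a threshold (the data, threshold, and feature count are invented stand-ins):

```python
# Illustrative sketch (not the authors' code): importance-ranked,
# correlation-filtered feature selection for a random-forest regressor.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]   # most important first

selected, threshold = [], 0.9                       # illustrative threshold
for j in order:
    # keep feature j only if it is weakly correlated with all kept features
    if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < threshold
           for k in selected):
        selected.append(int(j))

print("kept features (by importance):", selected)
```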


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 183-195
Author(s):  
Thingbaijam Lenin ◽  
N. Chandrasekaran

A student’s academic performance is one of the most important parameters for evaluating the standard of any institute. It has become of paramount importance for any institute to identify students at risk of underperforming, failing, or even dropping out of a course. Machine learning techniques may be used to develop a model for predicting a student’s performance as early as the time of admission. The task, however, is challenging because the educational data available for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm, random forest (rf), and boosting algorithms, adaptive boosting (adaboost), stochastic gradient boosting (gbm), and extreme gradient boosting (xgbTree), in an attempt to develop a model for predicting student performance at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are highly imbalanced and also contain missing values. We employ the k-nearest neighbour (knn) data imputation technique to tackle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using precision, specificity, recall, and kappa metrics. As the data are imbalanced, we avoid using accuracy as the metric for evaluating the models and instead use balanced accuracy and F-score. We compare the ensemble techniques with the single classifier C4.5. The best results are provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
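The imputation-plus-evaluation pipeline described above can be sketched as follows; this is a hedged illustration on synthetic imbalanced data (the study used R-style tooling such as xgbTree, whereas this sketch uses scikit-learn equivalents):

```python
# Sketch of the pipeline: kNN imputation of missing values, then 10-fold
# cross-validation of a random forest scored by balanced accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic imbalanced dataset (~10% minority class) with injected NaNs
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan       # ~5% missing values

pipe = make_pipeline(KNNImputer(n_neighbors=5),
                     RandomForestClassifier(n_estimators=100, random_state=0))
scores = cross_val_score(pipe, X, y, cv=10, scoring="balanced_accuracy")
print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the imputer inside the pipeline ensures imputation is refit within each fold, avoiding leakage from the held-out data.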


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 145968-145983 ◽  
Author(s):  
Amirhosein Mosavi ◽  
Ataollah Shirzadi ◽  
Bahram Choubin ◽  
Fereshteh Taromideh ◽  
Farzaneh Sajedi Hosseini ◽  
...  

2021 ◽  
Vol 38 (9) ◽  
pp. A5.3-A6
Author(s):  
Thilo Reich ◽  
Adam Bancroft ◽  
Marcin Budka

Background: The recording practices for electronic patient records used by ambulance crews are continuously developing. South Central Ambulance Service (SCAS) adapted the common AVPU scale (Alert, Voice, Pain, Unresponsive) in 2019 to include an option for ‘New Confusion’. Progressing to this new AVCPU scale made comparisons with older data impossible. We demonstrate a method to retrospectively classify patients into the alertness levels most influenced by this update.

Methods: SCAS provided ~1.6 million electronic patient records, including vital signs, demographics, and presenting-complaint free text. These were split into training, validation, and testing datasets (80%, 10%, and 10%, respectively) and undersampled to the minority class. The data were used to train and validate predictions of the classes most affected by the modification of the scale (Alert, New Confusion, Voice). A transfer-learning natural language processing (NLP) classifier, built on the language model described by Smerity et al. (2017), was used to classify the presenting-complaint free text. A second approach used vital signs, demographics, conveyance, and assessments (30 metrics) for classification. Categorical data were binary encoded and continuous variables were normalised. Twenty machine learning algorithms were empirically tested, and the best three vital-sign-based algorithms (Random Forest, Extra Tree Classifier, Decision Tree) were combined with the NLP classifier in a voting ensemble using a Random Forest output layer.

Results: The ensemble method achieved a weighted F1 of 0.78 on the test set. The sensitivities/specificities for the classes are: 84%/90% (Alert), 73%/89% (Newly Confused), and 68%/93% (Voice).

Conclusions: The ensemble combining free text and vital signs achieved high sensitivity and specificity when reclassifying the alertness levels of prehospital patients. This study demonstrates the capability of machine learning classifiers to recover missing data, allowing the comparison of data collected under different recording standards.
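The voting-ensemble idea can be sketched in miniature; this hedged example combines the three tree-based classifiers named above by soft voting on synthetic data, and omits the NLP branch and the Random Forest output layer of the actual study:

```python
# Simplified sketch of the vital-signs branch: three tree-based
# classifiers combined by soft voting (probability averaging).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              VotingClassifier)
from sklearn.tree import DecisionTreeClassifier

# synthetic 3-class stand-in for the Alert / New Confusion / Voice task
X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("et", ExtraTreesClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="soft")                       # average predicted probabilities
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
print(f"held-out accuracy: {acc:.3f}")
```

In the study, a stacked Random Forest rather than simple voting fused the branch outputs; `VotingClassifier` is used here only as the nearest off-the-shelf analogue.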


2020 ◽  
Vol 2020 ◽  
pp. 1-13 ◽  
Author(s):  
Majid Nour ◽  
Kemal Polat

Hypertension (high blood pressure) is a common and important disease in the general population, and its early detection is significant for early treatment. Hypertension is defined as systolic blood pressure higher than 140 mmHg or diastolic blood pressure higher than 90 mmHg. In this paper, in order to detect hypertension types based on personal information and features, four machine learning (ML) methods, comprising the C4.5 decision tree classifier (DTC), random forest, linear discriminant analysis (LDA), and linear support vector machine (LSVM), were used and compared with each other. To the best of our knowledge, this is the first work in the literature to classify hypertension types using classification algorithms based on personal data. To account for variability across classifier types, four different classifier algorithms were selected for this problem. The hypertension dataset contains eight features, including sex, age, height (cm), weight (kg), systolic blood pressure (mmHg), diastolic blood pressure (mmHg), heart rate (bpm), and BMI (kg/m2), to explain the hypertension status, and four classes comprising normal (healthy), prehypertension, stage-1 hypertension, and stage-2 hypertension. In the classification of the hypertension dataset, the obtained classification accuracies are 99.5%, 99.5%, 96.3%, and 92.7% using the C4.5 decision tree classifier, random forest, LDA, and LSVM, respectively. The results show that ML methods can be confidently used in the automatic determination of hypertension types.
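A comparison of the four methods can be sketched on synthetic four-class data; note that scikit-learn has no C4.5 implementation, so `DecisionTreeClassifier` (a CART variant) and `LinearSVC` stand in for C4.5 and the linear SVM here:

```python
# Illustrative four-way classifier comparison on synthetic 4-class data
# (stand-in for normal / prehypertension / stage-1 / stage-2 labels).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=8, n_classes=4,
                           n_informative=6, random_state=0)
models = {"DT": DecisionTreeClassifier(random_state=0),
          "RF": RandomForestClassifier(random_state=0),
          "LDA": LinearDiscriminantAnalysis(),
          "LSVM": LinearSVC(random_state=0)}

results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {results[name]:.3f}")
```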


Author(s):  
Shuxin Chen ◽  
Weimin Sun ◽  
Ying He

Abstract: Measuring the stellar parameters of A-type stars is more difficult than for FGK stars because of the sparse features in their spectra and the degeneracy between effective temperature (Teff) and gravity (log g). Modeling the relationship between fundamental stellar parameters and spectral features through machine learning is possible because we can exploit the advantage of big data rather than a few sparse known features. Once the model is successfully trained, it can be an efficient approach for predicting Teff and log g for A-type stars, especially when there is large uncertainty in the continuum caused by flux calibration or extinction. In this paper, A-type stars are selected from LAMOST DR7 with a signal-to-noise ratio greater than 50 and Teff ranging from 7000 K to 8500 K. We apply the Random Forest (RF) algorithm, one of the most widely used machine learning algorithms, to establish the regression relationship between the flux at all wavelengths and the corresponding stellar parameters (Teff and log g, respectively). The trained RF model not only regresses the stellar parameters but also ranks the wavelengths by their sensitivity to the parameters. According to the rankings, we define line indices by merging adjacent wavelengths. The objectively defined line indices in this work are amendments to the Lick indices, including some weak lines. We use the Support Vector Regression algorithm based on our newly defined line indices to measure temperature and gravity, and use common stars from SIMBAD to evaluate our results. In addition, the Gaia HR diagram is used to check the accuracy of Teff and log g.
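The regress-then-rank step can be sketched in a few lines; this is a hedged toy with random stand-in "spectra" rather than LAMOST data, where one wavelength bin is deliberately made informative:

```python
# Hedged sketch: fit a random-forest regressor from flux to Teff, then
# rank wavelength bins by importance, as the abstract describes.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
flux = rng.normal(size=(200, 50))            # 200 toy spectra, 50 wavelength bins
teff = 7000 + 1500 * rng.random(200)         # targets in the 7000-8500 K range
teff += 300 * flux[:, 10]                    # make bin 10 genuinely Teff-sensitive

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(flux, teff)
ranking = np.argsort(rf.feature_importances_)[::-1]
print("most Teff-sensitive wavelength bins:", ranking[:5])
```

In the paper, the analogous ranking over real wavelengths is what seeds the merged line indices.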


Forests ◽  
2021 ◽  
Vol 12 (4) ◽  
pp. 461
Author(s):  
Mahmoud Bayat ◽  
Harold Burkhart ◽  
Manouchehr Namiranian ◽  
Seyedeh Kosar Hamidi ◽  
Sahar Heidari ◽  
...  

Forest ecosystems play multiple important roles in meeting the habitat needs of different organisms and providing a variety of services to humans. Biodiversity is one of the structural features of dynamic and complex forest ecosystems. One of the most challenging issues in assessing forest ecosystems is understanding the relationship between biodiversity and environmental factors. The aim of this study was to investigate the effect of biotic and abiotic factors on tree diversity of the Hyrcanian forests in northern Iran. For this purpose, we analyzed tree diversity in 8 forest sites at different locations from east to west of the Caspian Sea. A total of 15,988 trees were measured in 655 circular permanent sample plots (0.1 ha). A combination of machine learning methods was used for modeling and investigating the relationship between tree diversity and biotic and abiotic factors. The machine learning models included generalized additive models (GAMs), support vector machine (SVM), random forest (RF), and K-nearest neighbor (KNN). To determine the most important factors related to tree diversity, we used variables such as the average diameter at breast height (DBH) in the plot, basal area of the largest trees (BAL), basal area (BA), number of trees per hectare, tree species, slope, aspect, and elevation. A comparison of RMSEs, relative RMSEs, and coefficients of determination across the different methods showed that the random forest (RF) method produced the best models among all those tested. Based on the results of the RF method, elevation, BA, and BAL were recognized as the most influential factors explaining the variation in tree diversity.
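The model-comparison step can be sketched briefly; this hedged example fits SVM, RF, and KNN regressors to synthetic stand-in data and compares held-out RMSE (GAMs have no direct scikit-learn equivalent and are omitted):

```python
# Minimal sketch of the RMSE-based comparison described above.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# synthetic stand-in for plot-level diversity vs. stand/site variables
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rmses = {}
for name, model in {"SVM": SVR(),
                    "RF": RandomForestRegressor(random_state=0),
                    "KNN": KNeighborsRegressor()}.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmses[name] = mean_squared_error(y_te, pred) ** 0.5
    print(f"{name}: RMSE = {rmses[name]:.2f}")
```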


2019 ◽  
Vol 8 (4) ◽  
pp. 1477-1483

With fast-moving technological advancement, internet usage has increased rapidly in all fields. Money transactions for applications such as online shopping, banking, bill settlement in industry, online ticket booking for travel and hotels, fee payments to educational organizations, payments for hospital treatment, and supermarket payments all use online credit card transactions. This opens the door to fraudulent use of other accounts and transactions, resulting in loss of service and profit to the institution. With this background, this paper focuses on predicting fraudulent credit card transactions. The Credit Card Transaction dataset from the Kaggle machine learning repository is used for the prediction analysis. The analysis of fraudulent credit card transactions is carried out in four steps. First, the relationships between the variables of the dataset are identified and represented graphically. Second, the feature importance of the dataset is identified using Random Forest, AdaBoost, Logistic Regression, Decision Tree, Extra Tree, Gradient Boosting, and Naive Bayes classifiers. Third, the extracted feature importance of the credit card transaction dataset is fitted to the Random Forest, AdaBoost, Logistic Regression, Decision Tree, Extra Tree, Gradient Boosting, and Naive Bayes classifiers. Fourth, performance analysis is done using metrics such as accuracy, F-score, AUC score, precision, and recall. The implementation is done in Python in the Anaconda Spyder IDE. Experimental results show that the Decision Tree classifier achieved the most effective prediction, with a precision of 1.0, recall of 1.0, F-score of 1.0, AUC score of 89.09, and accuracy of 99.92%.
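The extract-importance-then-refit workflow can be sketched as follows; this is an illustrative stand-in on synthetic imbalanced data, not the Kaggle dataset, with one classifier shown instead of the full seven:

```python
# Illustrative fraud-detection sketch: rank features by random-forest
# importance, keep the top 10, refit, and report precision/recall/F1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# synthetic imbalanced data (~5% "fraud") as a stand-in for the real set
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
top = np.argsort(rf.feature_importances_)[::-1][:10]   # keep 10 best features

clf = RandomForestClassifier(n_estimators=100, random_state=0)
pred = clf.fit(X_tr[:, top], y_tr).predict(X_te[:, top])
print(f"precision={precision_score(y_te, pred, zero_division=0):.2f} "
      f"recall={recall_score(y_te, pred):.2f} "
      f"f1={f1_score(y_te, pred):.2f}")
```

With class imbalance this strong, precision and recall on the minority class are far more informative than raw accuracy, which is why the paper reports all of them.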


2020 ◽  
Vol 12 (11) ◽  
pp. 4748
Author(s):  
Minrui Zheng ◽  
Wenwu Tang ◽  
Akinwumi Ogundiran ◽  
Jianxin Yang

Settlement models help us understand the social–ecological functioning of landscapes and the associated land use and land cover change. One issue in settlement modeling is that models are typically used to explore the relationship between settlement locations and associated influential factors (e.g., slope and aspect), yet few studies in settlement modeling have adopted landscape visibility analysis. Landscape visibility provides useful information for understanding human decision-making associated with the establishment of settlements. In recent years, machine learning algorithms have demonstrated their capability to improve the performance of settlement modeling, particularly in capturing the nonlinear relationship between settlement locations and their drivers. However, simulation models using machine learning algorithms in settlement modeling are still not well studied, and overfitting and the optimization of model parameters remain major challenges for most machine learning algorithms. Therefore, in this study, we pursued two research objectives. First, we aimed to evaluate the contribution of viewsheds and landscape visibility to the simulation modeling of settlement locations. Second, we examined the performance of machine learning algorithm-based simulation models for settlement location studies. Our study region is located in the metropolitan area of Oyo Empire, Nigeria, West Africa, ca. AD 1570–1830, and its pre-Imperial antecedents, ca. AD 1360–1570. We developed an event-driven spatial simulation model enabled by the random forest algorithm to represent dynamics in settlement systems in our study region. Experimental results demonstrate that viewsheds and landscape visibility may offer additional insight into the underlying mechanisms that drive settlement locations. 
The random forest algorithm, as a machine learning algorithm, provides solid support for establishing the relationship between settlement occurrences and their drivers.
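The core idea of relating settlement occurrence to terrain and visibility covariates can be sketched with a toy model; the variables, the decision rule generating the labels, and the data are all invented stand-ins for the paper's inputs:

```python
# Hypothetical sketch: random forest predicting settlement presence from
# slope, aspect, and viewshed-derived visibility on synthetic terrain.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
slope = rng.uniform(0, 45, n)          # degrees
aspect = rng.uniform(0, 360, n)        # degrees
visibility = rng.uniform(0, 1, n)      # viewshed fraction, 0-1
# toy rule: settlements favour gentle slopes with high visibility
present = ((slope < 15) & (visibility > 0.5)).astype(int)

X = np.column_stack([slope, aspect, visibility])
rf = RandomForestClassifier(n_estimators=100, random_state=0)
score = cross_val_score(rf, X, present, cv=5).mean()
print(f"mean CV accuracy: {score:.3f}")
```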

