The Comparison of Tree-Based Ensemble Machine Learning for Classifying Public Datasets

2021 ◽  
Vol 1 (1) ◽  
pp. 407-413
Author(s):  
Nur Heri Cahyana ◽  
Yuli Fauziah ◽  
Agus Sasmito Aribowo

This study aims to determine the best tree-based ensemble machine learning methods for classifying the datasets used, 34 in total. It also examines the relationship between the number of records and columns of each test dataset and the number of estimators (trees) for each ensemble model, namely Random Forest, Extra Tree Classifier, AdaBoost, and Gradient Boosting. The four methods are compared on maximum accuracy and on the number of estimators required when classifying the test datasets. From the experimental results, the best tree-based ensemble method and the best number of estimators were obtained for each dataset used in the study. Extra Tree is the best classifier for both binary-class and multi-class problems, Random Forest is good for multi-class problems, and AdaBoost is a fairly good method for binary-class problems. The numbers of rows, columns, and classes are positively correlated with the number of estimators: processing a dataset with many rows, columns, or classes requires more estimators than processing a small one. The number of classes, however, is negatively correlated with accuracy, meaning that accuracy decreases as the number of classes to be distinguished grows.
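The comparison protocol described above can be sketched as follows; this is an illustrative reconstruction, not the authors' code, using a synthetic dataset and an invented estimator grid:

```python
# Hypothetical sketch of the comparison: for each tree-based ensemble,
# sweep the number of estimators and record the best held-out accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)

X, y = make_classification(n_samples=400, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {"RandomForest": RandomForestClassifier,
          "ExtraTrees": ExtraTreesClassifier,
          "AdaBoost": AdaBoostClassifier,
          "GradientBoosting": GradientBoostingClassifier}

best = {}
for name, cls in models.items():
    for n in (10, 50, 100):              # illustrative estimator grid
        acc = cls(n_estimators=n, random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
        if name not in best or acc > best[name][0]:
            best[name] = (acc, n)        # (best accuracy, n_estimators)

for name, (acc, n) in best.items():
    print(f"{name}: acc={acc:.3f} at n_estimators={n}")
```

On a real study, the grid of estimator counts and the 34 public datasets would replace the synthetic stand-ins here.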

Energies ◽  
2021 ◽  
Vol 14 (4) ◽  
pp. 1052
Author(s):  
Baozhong Wang ◽  
Jyotsna Sharma ◽  
Jianhua Chen ◽  
Patricia Persaud

Estimation of fluid saturation is an important step in dynamic reservoir characterization, and machine learning techniques have been increasingly used in reservoir saturation prediction workflows in recent years. However, most of these studies require input parameters derived from cores, petrophysical logs, or seismic data, which may not always be readily available. Additionally, very few studies incorporate production data, which is an important reflection of dynamic reservoir properties and typically the most frequently and reliably measured quantity throughout the life of a field. In this research, a random forest ensemble machine learning algorithm is implemented that uses field-wide production and injection data (both measured at the surface) as the only input parameters to predict time-lapse oil saturation profiles at well locations. The algorithm is optimized using feature selection based on feature importance scores and Pearson correlation coefficients, in combination with geophysical domain knowledge. The workflow is demonstrated using actual field data from a structurally complex, heterogeneous, and heavily faulted offshore reservoir. The random forest model captures the trends from three and a half years of historical field production, injection, and simulated saturation data to predict future time-lapse oil saturation profiles at four deviated well locations with over 90% R-squared, less than 6% root mean square error, and less than 7% mean absolute percentage error in each case.
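The feature-selection step described here can be illustrated with a minimal sketch, assuming the common pattern of ranking features by random-forest importance and then dropping one of any pair whose Pearson correlation exceeds a threshold (the data, threshold, and feature count are invented stand-ins):

```python
# Illustrative sketch (not the authors' code): importance-ranked,
# correlation-filtered feature selection for a random-forest regressor.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]   # most important first

selected, threshold = [], 0.9                       # illustrative threshold
for j in order:
    # keep feature j only if it is weakly correlated with all kept features
    if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < threshold
           for k in selected):
        selected.append(int(j))

print("kept features (by importance):", selected)
```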


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 183-195
Author(s):  
Thingbaijam Lenin ◽  
N. Chandrasekaran

A student’s academic performance is one of the most important parameters for evaluating the standard of any institute. It has become of paramount importance for any institute to identify students at risk of underperforming, failing, or even dropping out of a course. Machine learning techniques may be used to develop a model for predicting a student’s performance as early as the time of admission. The task, however, is challenging because the educational data available for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm, random forest (rf), and boosting algorithms, adaptive boosting (adaboost), stochastic gradient boosting (gbm), and extreme gradient boosting (xgbTree), in an attempt to develop a model for predicting student performance at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are highly imbalanced and also contain missing values. We employ the k-nearest neighbour (knn) data imputation technique to tackle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using precision, specificity, recall, and kappa metrics. As the data are imbalanced, we avoid using accuracy as the metric for evaluating the models and instead use balanced accuracy and F-score. We compare the ensemble techniques with the single classifier C4.5. The best results are provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
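The imputation-plus-evaluation pipeline described above can be sketched as follows; this is a hedged illustration on synthetic imbalanced data (the study used R-style tooling such as xgbTree, whereas this sketch uses scikit-learn equivalents):

```python
# Sketch of the pipeline: kNN imputation of missing values, then 10-fold
# cross-validation of a random forest scored by balanced accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic imbalanced dataset (~10% minority class) with injected NaNs
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan       # ~5% missing values

pipe = make_pipeline(KNNImputer(n_neighbors=5),
                     RandomForestClassifier(n_estimators=100, random_state=0))
scores = cross_val_score(pipe, X, y, cv=10, scoring="balanced_accuracy")
print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the imputer inside the pipeline ensures imputation is refit within each fold, avoiding leakage from the held-out data.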


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 145968-145983 ◽  
Author(s):  
Amirhosein Mosavi ◽  
Ataollah Shirzadi ◽  
Bahram Choubin ◽  
Fereshteh Taromideh ◽  
Farzaneh Sajedi Hosseini ◽  
...  

2021 ◽  
Vol 38 (9) ◽  
pp. A5.3-A6
Author(s):  
Thilo Reich ◽  
Adam Bancroft ◽  
Marcin Budka

Background: The recording practices for electronic patient records used by ambulance crews are continuously developing. South Central Ambulance Service (SCAS) adapted the common AVPU scale (Alert, Voice, Pain, Unresponsive) in 2019 to include an option for ‘New Confusion’. Progressing to this new AVCPU scale made comparisons with older data impossible. We demonstrate a method to retrospectively classify patients into the alertness levels most influenced by this update.

Methods: SCAS provided ~1.6 million electronic patient records, including vital signs, demographics, and presenting-complaint free text. These were split into training, validation, and testing datasets (80%, 10%, and 10%, respectively) and undersampled to the minority class. The data were used to train and validate predictions of the classes most affected by the modification of the scale (Alert, New Confusion, Voice). A transfer-learning natural language processing (NLP) classifier, built on the language model described by Smerity et al. (2017), was used to classify the presenting-complaint free text. A second approach used vital signs, demographics, conveyance, and assessments (30 metrics) for classification. Categorical data were binary encoded and continuous variables were normalised. Twenty machine learning algorithms were empirically tested, and the best three vital-sign-based algorithms (Random Forest, Extra Tree Classifier, Decision Tree) were combined with the NLP classifier in a voting ensemble using a Random Forest output layer.

Results: The ensemble method achieved a weighted F1 of 0.78 on the test set. The sensitivities/specificities for the classes are: 84%/90% (Alert), 73%/89% (Newly Confused), and 68%/93% (Voice).

Conclusions: The ensemble combining free text and vital signs achieved high sensitivity and specificity when reclassifying the alertness levels of prehospital patients. This study demonstrates the capability of machine learning classifiers to recover missing data, allowing the comparison of data collected under different recording standards.
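The voting-ensemble idea can be sketched in miniature; this hedged example combines the three tree-based classifiers named above by soft voting on synthetic data, and omits the NLP branch and the Random Forest output layer of the actual study:

```python
# Simplified sketch of the vital-signs branch: three tree-based
# classifiers combined by soft voting (probability averaging).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              VotingClassifier)
from sklearn.tree import DecisionTreeClassifier

# synthetic 3-class stand-in for the Alert / New Confusion / Voice task
X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("et", ExtraTreesClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="soft")                       # average predicted probabilities
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
print(f"held-out accuracy: {acc:.3f}")
```

In the study, a stacked Random Forest rather than simple voting fused the branch outputs; `VotingClassifier` is used here only as the nearest off-the-shelf analogue.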


2020 ◽  
Vol 2020 ◽  
pp. 1-13 ◽  
Author(s):  
Majid Nour ◽  
Kemal Polat

Hypertension (high blood pressure) is a common and important disease in the general population, and its early detection is significant for early treatment. Hypertension is defined as systolic blood pressure higher than 140 mmHg or diastolic blood pressure higher than 90 mmHg. In this paper, in order to detect hypertension types based on personal information and features, four machine learning (ML) methods, comprising the C4.5 decision tree classifier (DTC), random forest, linear discriminant analysis (LDA), and linear support vector machine (LSVM), were used and compared with each other. To the best of our knowledge, this is the first work in the literature to classify hypertension types using classification algorithms based on personal data. To account for variability across classifier types, four different classifier algorithms were selected for this problem. The hypertension dataset contains eight features, including sex, age, height (cm), weight (kg), systolic blood pressure (mmHg), diastolic blood pressure (mmHg), heart rate (bpm), and BMI (kg/m2), to explain the hypertension status, and four classes comprising normal (healthy), prehypertension, stage-1 hypertension, and stage-2 hypertension. In the classification of the hypertension dataset, the obtained classification accuracies are 99.5%, 99.5%, 96.3%, and 92.7% using the C4.5 decision tree classifier, random forest, LDA, and LSVM, respectively. The results show that ML methods can be confidently used in the automatic determination of hypertension types.
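A comparison of the four methods can be sketched on synthetic four-class data; note that scikit-learn has no C4.5 implementation, so `DecisionTreeClassifier` (a CART variant) and `LinearSVC` stand in for C4.5 and the linear SVM here:

```python
# Illustrative four-way classifier comparison on synthetic 4-class data
# (stand-in for normal / prehypertension / stage-1 / stage-2 labels).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=8, n_classes=4,
                           n_informative=6, random_state=0)
models = {"DT": DecisionTreeClassifier(random_state=0),
          "RF": RandomForestClassifier(random_state=0),
          "LDA": LinearDiscriminantAnalysis(),
          "LSVM": LinearSVC(random_state=0)}

results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {results[name]:.3f}")
```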


Author(s):  
Shuxin Chen ◽  
Weimin Sun ◽  
Ying He

Abstract: Measuring the stellar parameters of A-type stars is more difficult than for FGK stars because of the sparse features in their spectra and the degeneracy between effective temperature (Teff) and gravity (log g). Modeling the relationship between fundamental stellar parameters and spectral features through machine learning is possible because we can exploit the advantage of big data rather than a few sparse known features. Once the model is successfully trained, it can be an efficient approach for predicting Teff and log g for A-type stars, especially when there is large uncertainty in the continuum caused by flux calibration or extinction. In this paper, A-type stars are selected from LAMOST DR7 with a signal-to-noise ratio greater than 50 and Teff ranging from 7000 K to 8500 K. We apply the Random Forest (RF) algorithm, one of the most widely used machine learning algorithms, to establish the regression relationship between the flux at all wavelengths and the corresponding stellar parameters (Teff and log g, respectively). The trained RF model not only regresses the stellar parameters but also ranks the wavelengths by their sensitivity to the parameters. According to the rankings, we define line indices by merging adjacent wavelengths. The objectively defined line indices in this work are amendments to the Lick indices, including some weak lines. We use the Support Vector Regression algorithm based on our newly defined line indices to measure temperature and gravity, and use common stars from SIMBAD to evaluate our results. In addition, the Gaia HR diagram is used to check the accuracy of Teff and log g.
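The regress-then-rank step can be sketched in a few lines; this is a hedged toy with random stand-in "spectra" rather than LAMOST data, where one wavelength bin is deliberately made informative:

```python
# Hedged sketch: fit a random-forest regressor from flux to Teff, then
# rank wavelength bins by importance, as the abstract describes.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
flux = rng.normal(size=(200, 50))            # 200 toy spectra, 50 wavelength bins
teff = 7000 + 1500 * rng.random(200)         # targets in the 7000-8500 K range
teff += 300 * flux[:, 10]                    # make bin 10 genuinely Teff-sensitive

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(flux, teff)
ranking = np.argsort(rf.feature_importances_)[::-1]
print("most Teff-sensitive wavelength bins:", ranking[:5])
```

In the paper, the analogous ranking over real wavelengths is what seeds the merged line indices.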


Forests ◽  
2021 ◽  
Vol 12 (4) ◽  
pp. 461
Author(s):  
Mahmoud Bayat ◽  
Harold Burkhart ◽  
Manouchehr Namiranian ◽  
Seyedeh Kosar Hamidi ◽  
Sahar Heidari ◽  
...  

Forest ecosystems play multiple important roles in meeting the habitat needs of different organisms and providing a variety of services to humans. Biodiversity is one of the structural features of dynamic and complex forest ecosystems. One of the most challenging issues in assessing forest ecosystems is understanding the relationship between biodiversity and environmental factors. The aim of this study was to investigate the effect of biotic and abiotic factors on tree diversity of the Hyrcanian forests in northern Iran. For this purpose, we analyzed tree diversity in 8 forest sites at different locations from east to west of the Caspian Sea. A total of 15,988 trees were measured in 655 circular permanent sample plots (0.1 ha). A combination of machine learning methods was used for modeling and investigating the relationship between tree diversity and biotic and abiotic factors. The machine learning models included generalized additive models (GAMs), support vector machine (SVM), random forest (RF), and K-nearest neighbor (KNN). To determine the most important factors related to tree diversity, we used variables such as the average diameter at breast height (DBH) in the plot, basal area of the largest trees (BAL), basal area (BA), number of trees per hectare, tree species, slope, aspect, and elevation. A comparison of RMSEs, relative RMSEs, and coefficients of determination across the different methods showed that the random forest (RF) method produced the best models among all those tested. Based on the results of the RF method, elevation, BA, and BAL were recognized as the most influential factors explaining the variation in tree diversity.
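The model-comparison step can be sketched briefly; this hedged example fits SVM, RF, and KNN regressors to synthetic stand-in data and compares held-out RMSE (GAMs have no direct scikit-learn equivalent and are omitted):

```python
# Minimal sketch of the RMSE-based comparison described above.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# synthetic stand-in for plot-level diversity vs. stand/site variables
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rmses = {}
for name, model in {"SVM": SVR(),
                    "RF": RandomForestRegressor(random_state=0),
                    "KNN": KNeighborsRegressor()}.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmses[name] = mean_squared_error(y_te, pred) ** 0.5
    print(f"{name}: RMSE = {rmses[name]:.2f}")
```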


2019 ◽  
Vol 8 (4) ◽  
pp. 1477-1483

With fast-moving technological advancement, internet usage has increased rapidly in all fields. Money transactions for applications such as online shopping, banking, bill settlement in industry, online ticket booking for travel and hotels, fee payments to educational organizations, payments for hospital treatment, and supermarket payments all use online credit card transactions. This opens the door to fraudulent use of other accounts and transactions, resulting in loss of service and profit to the institution. With this background, this paper focuses on predicting fraudulent credit card transactions. The Credit Card Transaction dataset from the Kaggle machine learning repository is used for the prediction analysis. The analysis of fraudulent credit card transactions is carried out in four steps. First, the relationships between the variables of the dataset are identified and represented graphically. Second, the feature importance of the dataset is identified using Random Forest, AdaBoost, Logistic Regression, Decision Tree, Extra Tree, Gradient Boosting, and Naive Bayes classifiers. Third, the extracted feature importance of the credit card transaction dataset is fitted to the Random Forest, AdaBoost, Logistic Regression, Decision Tree, Extra Tree, Gradient Boosting, and Naive Bayes classifiers. Fourth, performance analysis is done using metrics such as accuracy, F-score, AUC score, precision, and recall. The implementation is done in Python in the Anaconda Spyder IDE. Experimental results show that the Decision Tree classifier achieved the most effective prediction, with a precision of 1.0, recall of 1.0, F-score of 1.0, AUC score of 89.09, and accuracy of 99.92%.
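The extract-importance-then-refit workflow can be sketched as follows; this is an illustrative stand-in on synthetic imbalanced data, not the Kaggle dataset, with one classifier shown instead of the full seven:

```python
# Illustrative fraud-detection sketch: rank features by random-forest
# importance, keep the top 10, refit, and report precision/recall/F1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# synthetic imbalanced data (~5% "fraud") as a stand-in for the real set
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
top = np.argsort(rf.feature_importances_)[::-1][:10]   # keep 10 best features

clf = RandomForestClassifier(n_estimators=100, random_state=0)
pred = clf.fit(X_tr[:, top], y_tr).predict(X_te[:, top])
print(f"precision={precision_score(y_te, pred, zero_division=0):.2f} "
      f"recall={recall_score(y_te, pred):.2f} "
      f"f1={f1_score(y_te, pred):.2f}")
```

With class imbalance this strong, precision and recall on the minority class are far more informative than raw accuracy, which is why the paper reports all of them.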


2020 ◽  
Vol 12 (11) ◽  
pp. 4748
Author(s):  
Minrui Zheng ◽  
Wenwu Tang ◽  
Akinwumi Ogundiran ◽  
Jianxin Yang

Settlement models help us understand the social–ecological functioning of landscapes and the associated land use and land cover change. One issue in settlement modeling is that models are typically used to explore the relationship between settlement locations and associated influential factors (e.g., slope and aspect), yet few studies in settlement modeling have adopted landscape visibility analysis. Landscape visibility provides useful information for understanding human decision-making associated with the establishment of settlements. In recent years, machine learning algorithms have demonstrated their capability to improve the performance of settlement modeling, particularly in capturing the nonlinear relationship between settlement locations and their drivers. However, simulation models using machine learning algorithms in settlement modeling are still not well studied, and overfitting and the optimization of model parameters remain major challenges for most machine learning algorithms. Therefore, in this study, we pursued two research objectives. First, we aimed to evaluate the contribution of viewsheds and landscape visibility to the simulation modeling of settlement locations. Second, we examined the performance of machine learning algorithm-based simulation models for settlement location studies. Our study region is located in the metropolitan area of Oyo Empire, Nigeria, West Africa, ca. AD 1570–1830, and its pre-Imperial antecedents, ca. AD 1360–1570. We developed an event-driven spatial simulation model enabled by the random forest algorithm to represent dynamics in settlement systems in our study region. Experimental results demonstrate that viewsheds and landscape visibility may offer additional insight into the underlying mechanisms that drive settlement locations. 
The random forest algorithm, as a machine learning algorithm, provides solid support for establishing the relationship between settlement occurrences and their drivers.
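The core idea of relating settlement occurrence to terrain and visibility covariates can be sketched with a toy model; the variables, the decision rule generating the labels, and the data are all invented stand-ins for the paper's inputs:

```python
# Hypothetical sketch: random forest predicting settlement presence from
# slope, aspect, and viewshed-derived visibility on synthetic terrain.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
slope = rng.uniform(0, 45, n)          # degrees
aspect = rng.uniform(0, 360, n)        # degrees
visibility = rng.uniform(0, 1, n)      # viewshed fraction, 0-1
# toy rule: settlements favour gentle slopes with high visibility
present = ((slope < 15) & (visibility > 0.5)).astype(int)

X = np.column_stack([slope, aspect, visibility])
rf = RandomForestClassifier(n_estimators=100, random_state=0)
score = cross_val_score(rf, X, present, cv=5).mean()
print(f"mean CV accuracy: {score:.3f}")
```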

