Coal Pit Mapping with Random Forest-Based Ensemble Machine Learning at Lower Benue Trough

Author(s):  
Okeke Francis Ifeanyi ◽  
Ibrahim Adesina Adekunle ◽  
Echeonwu Emmanuel Chinyere


Energies ◽
2021 ◽  
Vol 14 (4) ◽  
pp. 1052
Author(s):  
Baozhong Wang ◽  
Jyotsna Sharma ◽  
Jianhua Chen ◽  
Patricia Persaud

Estimation of fluid saturation is an important step in dynamic reservoir characterization. Machine learning techniques have been increasingly used in recent years in reservoir saturation prediction workflows. However, most of these studies require input parameters derived from cores, petrophysical logs, or seismic data, which may not always be readily available. Additionally, very few studies incorporate production data, which are an important reflection of the dynamic reservoir properties and also typically the most frequently and reliably measured quantities throughout the life of a field. In this research, a random forest ensemble machine learning algorithm is implemented that uses field-wide production and injection data (both measured at the surface) as the only input parameters to predict time-lapse oil saturation profiles at well locations. The algorithm is optimized using feature selection based on feature importance scores and the Pearson correlation coefficient, in combination with geophysical domain knowledge. The workflow is demonstrated on actual field data from a structurally complex, heterogeneous, and heavily faulted offshore reservoir. The random forest model captures the trends in three and a half years of historical field production, injection, and simulated saturation data to predict future time-lapse oil saturation profiles at four deviated well locations with over 90% R-squared, less than 6% root mean square error, and less than 7% mean absolute percentage error in each case.
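
As a rough illustration of the workflow described above, the sketch below pairs a random forest regressor with feature selection based on importance scores and Pearson correlation. All data, column names, and thresholds are synthetic placeholders, not the study's field data.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic stand-in for the field data: 42 monthly time steps of surface
    # production/injection rates plus a simulated oil-saturation target.
    rng = np.random.default_rng(0)
    rates = pd.DataFrame(rng.normal(size=(42, 10)),
                         columns=[f"rate_{i}" for i in range(10)])
    saturation = rates.iloc[:, :3].sum(axis=1) * 0.05 + 0.5 + rng.normal(0, 0.01, 42)

    rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(rates, saturation)

    # Keep features with a non-trivial importance score and a meaningful Pearson
    # correlation with the target; both thresholds here are placeholders.
    importance = pd.Series(rf.feature_importances_, index=rates.columns)
    pearson = rates.apply(lambda col: np.corrcoef(col, saturation)[0, 1])
    keep = importance[(importance > 0.02) & (pearson.abs() > 0.1)].index

    rf_final = RandomForestRegressor(n_estimators=500, random_state=0).fit(rates[keep], saturation)
    print(sorted(keep))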


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 183-195
Author(s):  
Thingbaijam Lenin ◽  
N. Chandrasekaran

A student's academic performance is one of the most important parameters for evaluating the standard of any institute. It has become of paramount importance for any institute to identify students at risk of underperforming, failing, or even dropping out of a course. Machine learning techniques may be used to develop a model for predicting a student's performance as early as at the time of admission. The task, however, is challenging because the educational data available for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm (random forest, rf) and boosting algorithms (adaptive boosting, adaboost; stochastic gradient boosting, gbm; extreme gradient boosting, xgbTree), in an attempt to develop a model for predicting student performance at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are highly imbalanced and also contain missing values. We employ the k-nearest neighbour (knn) data imputation technique to handle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using precision, specificity, recall, and kappa metrics. As the data are imbalanced, we avoid using accuracy as the metric for evaluating the models and instead use balanced accuracy and F-score. We compare the ensemble techniques with the single classifier C4.5. The best results are provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
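
The evaluation protocol described above (kNN imputation, 10-fold cross-validation, imbalance-aware metrics) might look roughly like the sketch below; the synthetic data, class weights, and hyperparameters are placeholders rather than the study's student records.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.impute import KNNImputer
    from sklearn.model_selection import StratifiedKFold, cross_validate
    from sklearn.pipeline import make_pipeline

    # Synthetic stand-in for the imbalanced, incomplete student data.
    X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)
    X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan  # simulate missing values

    models = {"rf": RandomForestClassifier(n_estimators=300, random_state=0),
              "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0)}

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, clf in models.items():
        # Impute inside each fold so no information leaks from the test split.
        pipe = make_pipeline(KNNImputer(n_neighbors=5), clf)
        scores = cross_validate(pipe, X, y, cv=cv, scoring=["balanced_accuracy", "f1"])
        print(name, scores["test_balanced_accuracy"].mean(), scores["test_f1"].mean())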


Author(s):  
Oyelakin A. M ◽  
Alimi O. M ◽  
Mustapha I. O ◽  
Ajiboye I. K

Phishing attacks have been used in different ways to harvest the confidential information of unsuspecting internet users. To stem the tide of phishing-based attacks, several machine learning techniques have been proposed in the past. However, fewer studies have investigated single and ensemble machine learning-based models for the classification of phishing attacks. This study carried out a performance analysis of selected single and ensemble machine learning (ML) classifiers for phishing classification. The focus is to investigate how these algorithms behave in the classification of phishing attacks in the chosen dataset. Logistic Regression and Decision Trees were chosen as the single classifiers, while a simple voting technique and Random Forest were used as the ensemble machine learning algorithms. Accuracy, precision, recall, and F1-score were used as performance metrics. The Logistic Regression algorithm recorded an accuracy of 0.86, precision of 0.89, recall of 0.87, and F1-score of 0.81. Similarly, the Decision Tree classifier achieved an accuracy of 0.87, precision of 0.83, recall of 0.88, and F1-score of 0.81. The voting ensemble achieved an accuracy of 0.92, precision of 0.90, recall of 0.92, and F1-score of 0.92. The Random Forest algorithm recorded 0.98, 0.97, 0.98, and 0.97 for accuracy, precision, recall, and F1-score, respectively. From the experimental analyses, the Random Forest algorithm outperformed the simple voting classifier and the two single algorithms used for phishing URL detection. The study established that the ensemble techniques used in the experiments are more efficient for phishing URL identification than the single classifiers.
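
A minimal sketch of the comparison described above, with a synthetic feature matrix standing in for the preprocessed phishing-URL dataset:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for extracted phishing-URL features and labels.
    X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    single = {"logistic_regression": LogisticRegression(max_iter=1000),
              "decision_tree": DecisionTreeClassifier(random_state=0)}
    models = {**single,
              "voting": VotingClassifier(list(single.items()), voting="hard"),
              "random_forest": RandomForestClassifier(n_estimators=200, random_state=0)}

    # Accuracy, precision, recall, and F1-score for each single and ensemble model.
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        print(name)
        print(classification_report(y_test, clf.predict(X_test)))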


2021 ◽  
Vol 1 (1) ◽  
pp. 407-413
Author(s):  
Nur Heri Cahyana ◽  
Yuli Fauziah ◽  
Agus Sasmito Aribowo

This study aims to determine the best tree-based ensemble machine learning methods for classifying the datasets used, a total of 34 datasets. The study also examines the relationship between the number of records and columns of each test dataset and the number of estimators (trees) required by each ensemble model, namely Random Forest, Extra Trees, AdaBoost, and Gradient Boosting. The four methods are compared on their maximum accuracy and the number of estimators needed when classifying the test datasets. Based on the experimental results, the best tree-based ensemble methods and the best number of estimators for classifying each dataset used in the study were obtained. The Extra Trees method is the best classifier for both binary-class and multi-class problems, Random Forest performs well on multi-class problems, and AdaBoost is a reasonably good method for binary-class problems. The number of rows, columns, and data classes is positively correlated with the number of estimators: processing a dataset with many rows, columns, or classes requires more estimators than processing a small one. However, the number of classes is negatively correlated with accuracy, meaning that accuracy decreases as the number of classes to be classified increases.
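
The sweep over estimator counts described above could be sketched as follows; the dataset and estimator grid are illustrative placeholders.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                                  GradientBoostingClassifier, RandomForestClassifier)
    from sklearn.model_selection import cross_val_score

    # Synthetic multi-class stand-in for one of the 34 test datasets.
    X, y = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=0)

    models = {"RandomForest": RandomForestClassifier,
              "ExtraTrees": ExtraTreesClassifier,
              "AdaBoost": AdaBoostClassifier,
              "GradientBoosting": GradientBoostingClassifier}

    for name, cls in models.items():
        # Record the best cross-validated accuracy and the estimator count that reached it.
        best = max((cross_val_score(cls(n_estimators=n, random_state=0), X, y, cv=5).mean(), n)
                   for n in (50, 100, 200))
        print(f"{name}: accuracy={best[0]:.3f} at {best[1]} estimators")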


2021 ◽  
Author(s):  
Chetna Longani ◽  
Sai Prasad Potharaju ◽  
Sandhya Deore

Pre-owned or so-called used cars have a vast market across the globe. Before acquiring a used car, the buyer should be able to decide whether the price affixed to the car is genuine. Several facets, including mileage, year, model, make, and run, need to be considered before purchasing any pre-owned car, and both the seller and the buyer should get a fair deal. This paper presents a system implemented to predict a fair price for any pre-owned car; the system works well for anticipating the prices of used cars in the Mumbai region. Ensemble machine learning techniques, namely the Random Forest algorithm and eXtreme Gradient Boosting, are deployed to develop models that can predict an appropriate price for used cars, and the techniques are compared to determine the optimal one. Both methods provided comparable performance, with eXtreme Gradient Boosting outperforming the Random Forest algorithm: the Root Mean Squared Error of Random Forest was 3.44, whereas eXtreme Gradient Boosting achieved 0.53.
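
A minimal sketch of the regression comparison described above; the synthetic features stand in for listing attributes such as year, mileage, and make, and the xgboost package is assumed to be installed.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor  # assumes the xgboost package is available

    # Synthetic stand-in for the Mumbai used-car listings.
    X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for name, model in {"random_forest": RandomForestRegressor(random_state=0),
                        "xgboost": XGBRegressor(random_state=0)}.items():
        model.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
        print(f"{name}: RMSE = {rmse:.2f}")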


Author(s):  
Preeth B. Meena ◽  
Radha, P.

In today's scenario, disease prediction plays an important role in the medical field, and early detection of diseases is essential given modern fast-food habits and lifestyles. In a previous study on predicting diseases from radiology test reports, three classifiers, Naïve Bayes (NB), Support Vector Machine (SVM), and Modified Extreme Learning Machine (MELM), were used to classify a disease as positive or negative and to increase the accuracy of the results. To improve the efficiency of disease prediction and to find which diseases most affect society, an ensemble machine learning algorithm is used here. The huge volume of data from the healthcare industry was preprocessed, categorized, and analyzed to predict which patients should be treated and given priority and which diseases affect society the most. Ensemble machine learning's popularity in the medical industry is due to a variety of factors; the classifiers used are K-Nearest Neighbors, Nearest Mean Classifier, Mean Feature Voting Classifier, KD-tree KNN, and Random Forest. Automating manual processes in the medical field has become important, and electronic medical records together with significant advances in health care provide an opportunity to find out which patients need to be given more attention. Several methodologies and techniques were used to preprocess the data in order to meet the study's requirements, and feature selection was performed using Tabu search to improve the performance of the machine learning algorithms. When ensemble prediction is combined with the Random Forest algorithm as the combiner, the results are more reliable. The aim of this study is to create a system that classifies medical records as diseased or not and identifies which disease rates have increased. This research will help individuals in society get treated easily and take preventive measures to avoid diseases.
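
One way to read the ensemble idea above is as stacking: several base classifiers whose predictions are combined by a Random Forest meta-learner. The sketch below follows that reading with synthetic data; the Mean Feature Voting Classifier and the Tabu-search feature selection are omitted, and all parameters are placeholders.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

    # Synthetic stand-in for preprocessed medical records (diseased vs. not).
    X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)

    base = [("knn", KNeighborsClassifier(n_neighbors=5)),
            ("kdtree_knn", KNeighborsClassifier(algorithm="kd_tree")),
            ("nearest_mean", NearestCentroid()),
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0))]

    # Random Forest acts as the combiner over the base classifiers' predictions.
    stack = StackingClassifier(estimators=base,
                               final_estimator=RandomForestClassifier(random_state=0))
    print(cross_val_score(stack, X, y, cv=5, scoring="f1").mean())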


Author(s):  
Agus Sasmito Aribowo ◽  
Halizah Basiron ◽  
Noor Fazilla Abd Yusof ◽  
Siti Khomsah

A cross-domain sentiment analysis (CDSA) study in the Indonesian language using tree-based ensemble machine learning is quite interesting. CDSA is useful for supporting the labeling of cross-domain sentiment and reducing dependence on experts; however, the effect of unstructured elements in opinions, such as stop words, idiomatic expressions, and Indonesian slang words, has not yet been identified. This study aimed to obtain the best CDSA model for opinions in the Indonesian language, which are commonly full of stop words and slang words in Indonesian dialects, and to observe the benefits of stop-word cleaning and slang-word conversion for CDSA in Indonesian. It also sought to find out which machine learning method is most suitable for this model. The study started by crawling five datasets of YouTube comments from five different domains. Each dataset was copied into two groups: one without stop-word cleaning and slang-word conversion, and one with stop-word cleaning and slang-word conversion. A CDSA model was built for each dataset group and then tested using two types of tree-based ensemble machine learning, the Random Forest (RF) and Extra Trees (ET) classifiers, and, for comparison, three types of non-ensemble machine learning: Naïve Bayes (NB), SVM, and Decision Tree (DT). The results suggest that the accuracy of CDSA in the Indonesian language increases when stop words are removed and slang words are converted. The best classifier model was built using tree-based ensemble machine learning, particularly ET, which in this study achieved the highest accuracy of 91.19%. This model is expected to be an alternative CDSA technique for the Indonesian language.
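
A small sketch of the preprocessing-plus-classifier pipeline described above; the stop-word list, slang dictionary, and comments are tiny illustrative stand-ins rather than the resources and datasets used in the study.

    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline

    STOPWORDS = {"yang", "dan", "di"}             # illustrative Indonesian stop words
    SLANG = {"gue": "saya", "bgt": "banget"}      # illustrative slang-to-formal map

    def clean(text):
        # Convert slang to its formal form, then drop stop words.
        tokens = [SLANG.get(tok, tok) for tok in text.lower().split()]
        return " ".join(tok for tok in tokens if tok not in STOPWORDS)

    comments = ["gue suka bgt video ini", "video yang jelek dan membosankan"]
    labels = [1, 0]                               # toy sentiment labels

    for name, clf in {"extra_trees": ExtraTreesClassifier(random_state=0),
                      "random_forest": RandomForestClassifier(random_state=0)}.items():
        model = make_pipeline(TfidfVectorizer(preprocessor=clean), clf)
        model.fit(comments, labels)               # cross-domain train/test split omitted
        print(name, model.predict(["film ini bagus bgt"]))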


2021 ◽  
Vol 2078 (1) ◽  
pp. 012027
Author(s):  
Ze yuan Liu ◽  
Xin long Li

Abstract. Remarkable advances in ensemble machine learning methods, such as random forest algorithms, have enabled significant analyses of large data. However, these algorithms use only the existing features during learning, which imposes an upper limit on the achievable accuracy no matter how good the algorithms are. Moreover, low classification accuracy arises especially when one class accounts for a much lower proportion of observations than the others in the training dataset. The aim of the present study is to design a hierarchical classifier that extracts new features using ensemble machine learning regressors and statistical methods within the overall machine learning process. In stage 1, the categorical variables are summarized by a random forest algorithm to create a new variable through regression analysis, while the remaining numerical variables serve as input to a factor analysis (FA) that calculates factor scores for each observation. Then, all the features are learned by a random forest classifier in stage 2. Diverse datasets consisting of categorical and numerical variables are used with this method. The experimental results show that classification accuracy increased by 8.61%; the method also significantly improves the classification accuracy for observations from classes with a low proportion in the training dataset.
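
Under one reading of the two-stage design above, a stage-1 random forest regressor summarizes the categorical columns into a single engineered feature, factor analysis summarizes the numeric columns into factor scores, and a stage-2 random forest classifies on the augmented feature set. The sketch below follows that reading with synthetic data; all names and sizes are placeholders.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import FactorAnalysis
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # Synthetic stand-in: two categorical columns, four numeric columns, binary target.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"cat_a": rng.integers(0, 3, 500),
                       "cat_b": rng.integers(0, 4, 500),
                       **{f"num_{i}": rng.normal(size=500) for i in range(4)}})
    y = rng.integers(0, 2, 500)

    # Stage 1a: regress the target on the one-hot categorical columns and keep the
    # prediction as a new feature (in-sample here purely for brevity).
    cat = pd.get_dummies(df[["cat_a", "cat_b"]].astype("category"))
    df["cat_score"] = RandomForestRegressor(n_estimators=200, random_state=0).fit(cat, y).predict(cat)

    # Stage 1b: factor scores for the numeric columns.
    num_cols = [c for c in df.columns if c.startswith("num_")]
    factors = FactorAnalysis(n_components=2, random_state=0).fit_transform(df[num_cols])
    df["factor_1"], df["factor_2"] = factors[:, 0], factors[:, 1]

    # Stage 2: a random forest classifier learns on the augmented feature set.
    stage2 = RandomForestClassifier(n_estimators=200, random_state=0).fit(df, y)
    print(stage2.score(df, y))  # in-sample score, illustrative only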


Author(s):  
Jun Pei ◽  
Zheng Zheng ◽  
Hyunji Kim ◽  
Lin Song ◽  
Sarah Walworth ◽  
...  

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function's ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF6) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the 'comparison' concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found that our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions as GARF but has fixed peak heights. The accuracy comparison of RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.
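
The 'comparison' concept is often framed as a pairwise task: instead of scoring poses one at a time, a classifier decides which of two poses (native vs. decoy) is more stable from the difference of their descriptors. The sketch below is a speculative illustration of that framing with random stand-in descriptors, not the GARF-derived features used in this work.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    n_pairs, n_descriptors = 1000, 50

    # Stand-in pairwise-interaction descriptors for native and decoy poses.
    native = rng.normal(loc=0.2, size=(n_pairs, n_descriptors))
    decoy = rng.normal(loc=0.0, size=(n_pairs, n_descriptors))

    # One training example per comparison in both orders, which keeps the two
    # classes balanced even if the raw pose set is not.
    X = np.vstack([native - decoy, decoy - native])
    y = np.concatenate([np.ones(n_pairs), np.zeros(n_pairs)])

    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
    print(clf.feature_importances_[:5])  # relative importance of each descriptor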


2020 ◽  
Vol 15 (2) ◽  
pp. 121-134 ◽  
Author(s):  
Eunmi Kwon ◽  
Myeongji Cho ◽  
Hayeon Kim ◽  
Hyeon S. Son

Background: The host tropism determinants of influenza virus, which cause changes in the host range and increase the likelihood of interaction with specific hosts, are critical for understanding the infection and propagation of the virus in diverse host species.
Methods: Six types of protein sequences of influenza viral strains isolated from three classes of hosts (avian, human, and swine) were obtained. Random forest, naïve Bayes classification, and k-nearest neighbor algorithms were used for host classification. The Java language was used for sequence-analysis programming and for identifying host-specific position markers.
Results: A machine learning technique was explored to derive the physicochemical properties of amino acids used in host classification and prediction. The HA protein was found to play the most important role in determining the host tropism of the influenza virus, and the random forest method yielded the highest accuracy in host prediction. Conserved amino acids that exhibited host-specific differences were also selected and verified, and they were found to be useful position markers for host classification. Finally, ANOVA and post-hoc testing revealed that the physicochemical properties of amino acids, comprising protein sequences combined with position markers, differed significantly among hosts.
Conclusion: The host tropism determinants and position markers described in this study can be used in related research to classify, identify, and predict the hosts of influenza viruses that are currently circulating or likely to emerge in the future.
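
The classification step described in the Methods might be sketched as below, with a synthetic table of physicochemical features standing in for the real sequence-derived data (the study's own pipeline was written in Java).

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic stand-in: per-sequence physicochemical features (e.g. hydrophobicity,
    # charge) and the host class of each strain.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 12))
    y = rng.choice(["avian", "human", "swine"], size=300)

    for name, clf in {"random_forest": RandomForestClassifier(random_state=0),
                      "naive_bayes": GaussianNB(),
                      "knn": KNeighborsClassifier(n_neighbors=5)}.items():
        print(name, cross_val_score(clf, X, y, cv=5).mean())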

