The random forest algorithm for statistical learning

Author(s):  
Matthias Schonlau ◽  
Rosie Yuyan Zou

Random forests (Breiman, 2001, Machine Learning 45: 5–32) is a statistical- or machine-learning algorithm for prediction. In this article, we introduce a corresponding new command, rforest. We give an overview of the random forest algorithm and illustrate its use with two examples. The first is a classification problem that predicts whether a credit card holder will default on his or her debt. The second is a regression problem that predicts the log-scaled number of shares of online news articles. We conclude with a discussion that summarizes key points demonstrated in the examples.
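The core idea behind the algorithm, bootstrap sampling (bagging) followed by a majority vote, can be sketched in a few lines of Python. This is a simplified illustration and not the rforest command: each "tree" is a one-split stump restricted to one randomly chosen feature, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows, rng):
    """Fit a one-split tree on a single randomly chosen feature.

    rows is a list of (features, label) pairs. Real random forests grow
    deeper trees and sample features at every split; a stump keeps this short.
    """
    j = rng.randrange(len(rows[0][0]))            # random feature index
    majority = Counter(y for _, y in rows).most_common(1)[0][0]
    best = (len(rows) + 1, None, majority, majority)
    values = sorted({x[j] for x, _ in rows})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2                         # candidate threshold
        left = Counter(y for x, y in rows if x[j] <= t)
        right = Counter(y for x, y in rows if x[j] > t)
        l_lab = left.most_common(1)[0][0]
        r_lab = right.most_common(1)[0][0]
        errors = len(rows) - left[l_lab] - right[r_lab]
        if errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    return (j,) + best[1:]


def predict_stump(stump, x):
    j, t, l_lab, r_lab = stump
    if t is None:                                 # sample had one distinct value
        return l_lab
    return l_lab if x[j] <= t else r_lab


def fit_forest(rows, n_trees=25, seed=0):
    """Fit each stump on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng), rng) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Classify x by majority vote over all stumps."""
    votes = Counter(predict_stump(s, x) for s in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: class 0 has small feature values, class 1 large ones.
rows = [((i, i + 1), 0) for i in range(10)] + [((i + 20, i + 21), 1) for i in range(10)]
forest = fit_forest(rows)
print(predict_forest(forest, (1, 2)), predict_forest(forest, (25, 26)))
```

For a regression problem such as the news-shares example, the vote would be replaced by an average of the trees' numeric predictions.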

2021 ◽  
Vol 8 (3) ◽  
pp. 209-221
Author(s):  
Li-Li Wei ◽  
Yue-Shuai Pan ◽  
Yan Zhang ◽  
Kai Chen ◽  
Hao-Yu Wang ◽  
...  

Objective: To study the application of a machine learning algorithm for predicting gestational diabetes mellitus (GDM) in early pregnancy. Methods: This study identified indicators related to GDM through a literature review and expert discussion. Pregnant women who had attended medical institutions for an antenatal examination from November 2017 to August 2018 were selected for analysis, and the collected indicators were retrospectively analyzed. Using Python, the indicators were classified and modeled with a random forest regression algorithm, and the performance of the prediction model was analyzed. Results: We obtained 4806 analyzable samples from 1625 pregnant women. Among these, 3265 samples with all 67 indicators were used to establish data set F1; all 4806 samples, with the same 38 indicators, were used to establish data set F2. F1 and F2 were each used to train the random forest algorithm. The overall predictive accuracy of the F1 model was 93.10%, the area under the receiver operating characteristic curve (AUC) was 0.66, and the predictive accuracy for GDM-positive cases was 37.10%. The corresponding values for the F2 model were 88.70%, 0.87, and 79.44%. The results thus showed that the F2 prediction model performed better than the F1 model. To explore the impact of the sacrificed indicators on GDM prediction, the F3 data set was established using the 3265 samples of F1 with the 38 indicators of F2. After training, the overall predictive accuracy of the F3 model was 91.60%, the AUC was 0.58, and the predictive accuracy for positive cases was 15.85%. Conclusions: In this study, a model for predicting GDM from several input variables (e.g., physical examination, past history, personal history, family history, and laboratory indicators) was established using a random forest regression algorithm. The trained prediction model performed well and is valuable as a reference for predicting GDM in women at an early stage of pregnancy. In addition, there are certain requirements for the proportions of negative and positive cases in sample data sets when the random forest algorithm is applied to the early prediction of GDM.
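A high overall accuracy combined with a low accuracy on positive cases, as reported for F1 and F3, can arise when negatives far outnumber positives. The sketch below illustrates the effect with made-up labels; it assumes that "predictive accuracy of GDM-positive cases" means recall on the positive class, which the abstract does not state explicitly, and none of the numbers are the study's data.

```python
def binary_metrics(y_true, y_pred):
    """Return overall accuracy and recall on the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    positive_recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, positive_recall


# Hypothetical imbalanced set: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
# Hypothetical model: every negative is right, only 3 of 10 positives are found.
y_pred = [0] * 90 + [1] * 3 + [0] * 7

accuracy, positive_recall = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, positive recall={positive_recall:.2f}")
```

Here 93% of all cases are classified correctly even though 70% of the positives are missed, which is why the share of positive cases in the training data matters.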


This research paper proposes a solution for identifying whether a transaction is fraudulent. Because most transactions now take place online, they are exposed to theft as they happen, which causes problems for users. The paper therefore focuses on several machine learning algorithms aimed at this kind of real-world problem: Random Forest, Decision Tree, Logistic Regression, Support Vector Machine, K-Nearest Neighbour, and XGBoost.


Breast cancer is one of the most dangerous diseases for women. It occurs when some breast cells begin to grow abnormally. Machine learning is the subfield of computer science that studies programs that generalize from past experience. This project looks at classification, where an algorithm tries to predict the label for a sample. The machine learning algorithm takes many such samples, called the training set, and builds an internal model, which is then used to classify new data. There are two classes, benign and malignant. A Random Forest classifier is used to predict whether the cancer is benign or malignant. The model is trained and tested on the Wisconsin Diagnosis Breast Cancer dataset.


Author(s):  
M. G. Khachatrian ◽  
P. G. Klyucharev

Online social networks are an essential communication tool for millions of people. However, they also serve as an arena of information warfare. One tool of information warfare is bots: software designed to simulate the behaviour of real users in online social networks. The objective of this paper is to develop a model for recognizing bots in online social networks. To develop this model, the Random Forest machine-learning algorithm was used. Since machine-learning algorithms require as much data as possible, the Twitter online social network, which is regularly used in studies on bot recognition, was chosen for the bot recognition problem. For training and testing the Random Forest algorithm, a Twitter account dataset comprising more than 3,000 users and more than 6,000 bots was used. During training and testing, the optimal hyper-parameters of the algorithm, those at which the F1 metric reached its highest value, were determined. Python, which is frequently used for machine-learning problems, was chosen as the implementation language. To compare the developed model with models by other authors, testing was carried out on two Twitter account datasets, each consisting of half bots and half real users. On these datasets, F1 scores of 0.973 and 0.923 were obtained, which are quite high compared with those reported in papers by other authors. As a result, the paper presents a high-accuracy model that can recognize bots in the Twitter online social network.
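Selecting hyper-parameters by the highest F1 value can be sketched as follows. The candidate settings and the validation counts are hypothetical and are not taken from the paper; only the F1 formula (the harmonic mean of precision and recall) is standard.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical validation counts (tp, fp, fn) for three candidate settings.
candidates = {
    "n_trees=50, max_depth=5": (80, 30, 20),
    "n_trees=100, max_depth=10": (90, 10, 10),
    "n_trees=200, max_depth=None": (85, 5, 15),
}

# Score every candidate and keep the one with the highest F1.
scores = {name: f1_score(*counts) for name, counts in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In practice each candidate would be scored on held-out data or by cross-validation rather than from fixed counts.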


Author(s):  
Chitluri Sai Harish B ◽  
G gnana krishna vamsi ◽  
G jaya phani akhil ◽  
J n v hari sravan ◽  
V mounika chowdary

Heart diseases are among the most challenging problems faced by health care sectors all over the world, and they are now very common. With the growing number of deaths from heart disease, there is a need for a system that predicts heart ailments accurately. The work in this paper focuses on finding the best machine learning algorithm for identifying heart disease. Our study compares the precision of three well-known classification algorithms, Decision Tree, Naïve Bayes, and Random Forest, for the prediction of heart disease using a dataset provided by Kaggle. We used various attributes that relate well to heart disease to find the better algorithm for prediction. The results indicate that Random Forest is the most efficient algorithm for predicting heart disease, with an accuracy score of 97.17%.


Author(s):  
Robert Chapleau ◽  
Philippe Gaudette ◽  
Tim Spurr

Even in a context of rapidly evolving transportation and information technologies, household travel surveys remain an essential source of information for transportation planning. Moreover, as planning authorities become increasingly concerned with reducing the use of the private car, travelers’ mode choice patterns should be reexamined. In this study, a machine learning algorithm (Random Forest) was employed to characterize the use of eight different travel modes observed in two consecutive household travel surveys undertaken in Montreal, Canada. The analysis incorporated roughly 160,000 observed trips. The Random Forest algorithm was trained on the 2008 survey data and applied to the 2013 survey. The usefulness of the algorithm was evaluated using two numerical representations: the confusion matrix and the importance matrix. The results of this evaluation showed that the Random Forest algorithm could generate a detailed and precise characterization of travel submarkets for four of the most commonly observed modes of travel (auto-drive, public transit, school bus, and walk) using 11 attributes of households, persons, and trips. However, the auto-passenger mode was difficult to characterize because of its dependence on unobserved intra-household interactions. The algorithm also had difficulty identifying users of rarely observed modes (park-and-ride, kiss-and-ride, bicycle), but performed better in this regard than a traditional mode choice model. Finally, traveler’s age and the spatial orientation of origin–destination pairs were found to be decisive factors in the use of the auto-drive mode. This finding, combined with the stability of mode choice patterns observed over 5 years, highlights the difficulty of significantly reducing automobile use.
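The confusion matrix used for the evaluation can be illustrated with a short Python sketch. The trips below are made up and only reuse three of the mode names; the per-class recall helper is an illustrative addition, not a measure named in the abstract.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count (observed, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))


def recall_per_class(matrix):
    """Share of each observed class that the model labeled correctly."""
    totals = Counter()
    for (observed_label, _), n in matrix.items():
        totals[observed_label] += n
    return {label: matrix[(label, label)] / total for label, total in totals.items()}


# Hypothetical trips: observed mode vs. mode predicted by a trained model.
observed = ["auto-drive"] * 6 + ["walk"] * 3 + ["bicycle"] * 2
predicted = ["auto-drive"] * 5 + ["walk"] + ["walk"] * 3 + ["auto-drive", "bicycle"]

matrix = confusion_matrix(observed, predicted)
recalls = recall_per_class(matrix)
for mode, value in recalls.items():
    print(f"{mode}: {value:.2f}")
```

A rarely observed mode (here the hypothetical bicycle trips) contributes few rows to the matrix, so a handful of misclassifications is enough to pull its recall down.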


Author(s):  
P. Santhi et al.

Machine learning algorithms are applied to many different diseases. Machine learning, a branch of AI, is concerned with machines that learn on their own from data. Nowadays many people suffer heart attacks, which poses a serious problem for doctors. To reduce the number of deaths, heart attacks need to be predicted, and machine learning plays a major role in addressing this problem in this paper. Such prediction can move people out of the danger zone. In this paper we use the KNN algorithm and the Random Forest algorithm to predict heart attacks in advance.


2020 ◽  
Vol 6 (2) ◽  
pp. 97-106
Author(s):  
Khan Nasik Sami ◽  
Zian Md Afique Amin ◽  
Raini Hassan

Waste management is one of the essential issues the world currently faces, whether a country is developed or developing. A key issue in waste segregation is that trash bins in open areas overflow well before the next scheduled cleaning. Waste is sorted by unskilled workers, which is less effective, time-consuming, and impractical given the large volume of waste. We therefore propose an automated waste classification approach using machine learning and deep learning algorithms. The goal is to gather a dataset and sort it into six classes: glass, paper, metal, plastic, cardboard, and waste. Because this is a classification problem, we used classification models and compared four algorithms: CNN, SVM, Random Forest, and Decision Tree. CNN achieved the highest classification accuracy, around 90%. SVM also adapted well to the various kinds of waste, at 85%, while Random Forest and Decision Tree achieved 55% and 65%, respectively.


2021 ◽  
Vol 6 (2) ◽  
pp. 213
Author(s):  
Nadya Intan Mustika ◽  
Bagus Nenda ◽  
Dona Ramadhan

This study aims to implement a machine learning algorithm for detecting fraud based on a historical data set from a retail consumer financing company. The output of the machine learning model is used as samples for the fraud detection team. Data analysis is performed through data processing, feature selection, the hold-out method, and accuracy testing. Five machine learning methods are applied in this study: Logistic Regression, K-Nearest Neighbor (KNN), Decision Tree, Random Forest, and Support Vector Machine (SVM). Historical data are divided into two groups: training data and test data. The results show that the Random Forest algorithm has the highest accuracy, with a training score of 0.994999 and a test score of 0.745437. This means that the Random Forest algorithm is the most accurate method for detecting fraud. Further research is suggested to add more predictor variables to increase the accuracy value and to apply this method to different financial institutions and different industries.
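Dividing historical data into training and test groups, and comparing the two reported scores, can be sketched as below. The 20% test fraction, the seed, and the function name are assumptions, since the abstract does not state how the split was made; only the two scores come from the abstract.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Ten placeholder records stand in for the historical data.
train, test = holdout_split(range(10), test_fraction=0.2)
print(len(train), len(test))

# Scores reported in the abstract.
train_score, test_score = 0.994999, 0.745437
gap = train_score - test_score
print(f"train-test gap: {gap:.6f}")
```

A gap of roughly 0.25 between training and test scores is often read as a sign of overfitting, so the test score is the more relevant figure when comparing methods.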

