Algebraic aggregation of random forests: towards explainability and rapid evaluation

International Journal on Software Tools for Technology Transfer ◽

10.1007/s10009-021-00635-x ◽

2021 ◽

Author(s):

Frederik Gossen ◽

Bernhard Steffen

Keyword(s):

Machine Learning ◽

Data Structure ◽

Random Forest ◽

Random Forests ◽

Rapid Evaluation ◽

Running Time ◽

Specific Choice ◽

Size Number

AbstractRandom Forests are one of the most popular classifiers in machine learning. The larger they are, the more precise the outcome of their predictions. However, this comes at a cost: it is increasingly difficult to understand why a Random Forest made a specific choice, and its running time for classification grows linearly with the size (number of trees). In this paper, we propose a method to aggregate large Random Forests into a single, semantically equivalent decision diagram which has the following two effects: (1) minimal, sufficient explanations for Random Forest-based classifications can be obtained by means of a simple three step reduction, and (2) the running time is radically improved. In fact, our experiments on various popular datasets show speed-ups of several orders of magnitude, while, at the same time, also significantly reducing the size of the required data structure.

Download Full-text

For Honor, for Toxicity

Proceedings of the ACM on Human-Computer Interaction ◽

10.1145/3474680 ◽

2021 ◽

Vol 5 (CHI PLAY) ◽

pp. 1-29

Author(s):

Alessandro Canossa ◽

Dmitry Salimov ◽

Ahmad Azadvar ◽

Casper Harteveld ◽

Georgios Yannakakis

Keyword(s):

Machine Learning ◽

Random Forest ◽

Random Forests ◽

Initial Study ◽

Unfair Advantage ◽

Offensive Behavior ◽

Forest Models ◽

Random Forest Models ◽

Action Type ◽

Degree Of Severity

Is it possible to detect toxicity in games just by observing in-game behavior? If so, what are the behavioral factors that will help machine learning to discover the unknown relationship between gameplay and toxic behavior? In this initial study, we examine whether it is possible to predict toxicity in the MOBA gameFor Honor by observing in-game behavior for players that have been labeled as toxic (i.e. players that have been sanctioned by Ubisoft community managers). We test our hypothesis of detecting toxicity through gameplay with a dataset of almost 1,800 sanctioned players, and comparing these sanctioned players with unsanctioned players. Sanctioned players are defined by their toxic action type (offensive behavior vs. unfair advantage) and degree of severity (warned vs. banned). Our findings, based on supervised learning with random forests, suggest that it is not only possible to behaviorally distinguish sanctioned from unsanctioned players based on selected features of gameplay; it is also possible to predict both the sanction severity (warned vs. banned) and the sanction type (offensive behavior vs. unfair advantage). In particular, all random forest models predict toxicity, its severity, and type, with an accuracy of at least 82%, on average, on unseen players. This research shows that observing in-game behavior can support the work of community managers in moderating and possibly containing the burden of toxic behavior.

Download Full-text

Probability Machines

Methods of Information in Medicine ◽

10.3414/me00-01-0052 ◽

2012 ◽

Vol 51 (01) ◽

pp. 74-81 ◽

Cited By ~ 83

Author(s):

J. D. Malley ◽

J. Kruppa ◽

A. Dasgupta ◽

K. G. Malley ◽

A. Ziegler

Keyword(s):

Machine Learning ◽

Random Forest ◽

Random Forests ◽

Nearest Neighbor ◽

Real Data ◽

Nearest Neighbors ◽

Learning Approaches ◽

Pima Indians ◽

Regression Problem ◽

Binary Responses

SummaryBackground: Most machine learning approaches only provide a classification for binary responses. However, probabilities are required for risk estimation using individual patient characteristics. It has been shown recently that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem.Objectives: The aim of this paper is to show how random forests and nearest neighbors can be used for consistent estimation of individual probabilities.Methods: Two random forest algorithms and two nearest neighbor algorithms are described in detail for estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors and other learning machines in detail. We conduct a simulation study to illustrate the validity of the methods. We exemplify the algorithms by analyzing two well-known data sets on the diagnosis of appendicitis and the diagnosis of diabetes in Pima Indians.Results: Simulations demonstrate the validity of the method. With the real data application, we show the accuracy and practicality of this approach. We provide sample code from R packages in which the probability estimation is already available. This means that all calculations can be performed using existing software.Conclusions: Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses. Freely available implementations are available in R and may be used for applications.

Download Full-text

Machine Learning Approach for Malware Detection Using Random Forest Classifier on Process List Data Structure

Proceedings of the 2nd International Conference on Information System and Data Mining - ICISDM '18 ◽

10.1145/3206098.3206113 ◽

2018 ◽

Cited By ~ 1

Author(s):

Santosh Joshi ◽

Himanshu Upadhyay ◽

Leonel Lagos ◽

Naga Suryamitra Akkipeddi ◽

Valerie Guerra

Keyword(s):

Machine Learning ◽

Data Structure ◽

Random Forest ◽

Malware Detection ◽

Random Forest Classifier ◽

Learning Approach ◽

Machine Learning Approach

Download Full-text

Iterative Update of a Random Forest Classifier for Diabetic Retinopathy

10.3233/faia210136 ◽

2021 ◽

Author(s):

Jordi Pascual-Fontanilles ◽

Aida Valls ◽

Antonio Moreno ◽

Pedro Romero-Aroca

Keyword(s):

Machine Learning ◽

Diabetic Retinopathy ◽

Random Forest ◽

Random Forests ◽

Random Forest Classifier ◽

Diabetic Patients ◽

Similar Data ◽

Machine Learning Classification ◽

Inherent Ambiguity ◽

New Iterative Method

Random Forests are well-known Machine Learning classification mechanisms based on a collection of decision trees. In the last years, they have been applied to assess the risk of diabetic patients to develop Diabetic Retinopathy. The results have been good, despite the unbalance of data between classes and the inherent ambiguity of the problem (patients with similar data may belong to different classes). In this work we propose a new iterative method to update the set of trees in the Random Forest by considering trees generated from the data of the new patients that are visited in the medical centre. With this method, it has been possible to improve the results obtained with standard Random Forests.

Download Full-text

VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data

10.1101/702902 ◽

2019 ◽

Cited By ~ 1

Author(s):

Arash Bayat ◽

Piotr Szul ◽

Aidan R. O’Brien ◽

Robert Dunne ◽

Oscar J. Luo ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Internet Of Things ◽

Random Forests ◽

Life Sciences ◽

High Dimensional ◽

The Internet ◽

Machine Learning Methods ◽

High Dimensional Datasets ◽

The Internet Of Things

AbstractThe demands on machine learning methods to cater for ultra high dimensional datasets, datasets with millions of features, have been increasing in domains like life sciences and the Internet of Things (IoT). While Random Forests are suitable for “wide” datasets, current implementations such as Google’s PLANET lack the ability to scale to such dimensions. Recent improvements by Yggdrasil begin to address these limitations but do not extend to Random Forest. This paper introduces CursedForest, a novel Random Forest implementation on top of Apache Spark and part of the VariantSpark platform, which parallelises processing of all nodes over the entire forest. CursedForest is 9 and up to 89 times faster than Google’s PLANET and Yggdrasil, respectively, and is the first method capable of scaling to millions of features.

Download Full-text

Addressing Measurement Error in Random Forests using Quantitative Bias Analysis

American Journal of Epidemiology ◽

10.1093/aje/kwab010 ◽

2021 ◽

Author(s):

Tammy Jiang ◽

Jaimie L Gradus ◽

Timothy L Lash ◽

Matthew P Fox

Keyword(s):

Machine Learning ◽

Measurement Error ◽

Random Forest ◽

Random Forests ◽

Model Performance ◽

Variable Importance ◽

Bias Analysis ◽

Variable Importance Measures ◽

Quantitative Bias Analysis ◽

The Impact

Abstract Although variables are often measured with error, the impact of measurement error on machine learning predictions is seldom quantified. The purpose of this study was to assess the impact of measurement error on random forest model performance and variable importance. First, we assessed the impact of misclassification (i.e., measurement error of categorical variables) of predictors on random forest model performance (e.g., accuracy, sensitivity) and variable importance (mean decrease in accuracy) using data from the United States National Comorbidity Survey Replication (2001 - 2003). Second, we simulated datasets in which we know the true model performance and variable importance measures and could verify that quantitative bias analysis was recovering the truth in misclassified versions of the datasets. Our findings show that measurement error in the data used to construct random forests can distort model performance and variable importance measures, and that bias analysis can recover the correct results. This study highlights the utility of applying quantitative bias analysis in machine learning to quantify the impact of measurement error on study results.

Download Full-text

Influenza Prediction: Analyzing Machine Learning Algorithms

Asian Journal of Computer Science and Technology ◽

10.51983/ajcst-2020.9.1.2155 ◽

2020 ◽

Vol 9 (1) ◽

pp. 14-18

Author(s):

Sapna Yadav ◽

Pankaj Agarwal

Keyword(s):

Machine Learning ◽

Random Forest ◽

Linear Regression ◽

Random Forests ◽

Ridge Regression ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Digital Data ◽

Support Vector ◽

Lasso Regression

Analyzing online or digital data for detecting epidemics is one of the hot areas of research and now becomes more relevant during the present outbreak of Covid-19. There are several different types of the influenza virus and moreover they keep evolving constantly in the same manner the COVID-19 virus has done. As a result, they pose a greater challenge when it comes to analyzing them, predicting when, where and at what degree of severity it will outbreak during the flu season across the world. There is need for greater surveillance to both seasonal and pandemic influenza to ensure the health and safety of the mankind. The objective of work is to apply machine learning algorithms for building predictive models that can predict where the occurrence, peak and severity of influenza in each season. For this work we have considered a freely available dataset of Ireland which is recorded for the duration of 2005 to 2016. Specifically, we have tested three ML Algorithms namely Linear Regression, Support Vector Regression and Random Forests. We found Random Forests is giving better predictive results. We also conducted experiment through weka tool and tested Zero R, Linear Regression, Lazy Kstar, Random Forest, REP Tree, Multilayer Perceptron models. We again found the Random Forest is performing better in comparison to all other models. We also evaluated other regression models including Ridge Regression, modified Ridge regression, Lasso Regression, K Neighbor Regression and evaluated the mean absolute errors. We found that modified Ridge regression is producing minimum error. The proposed work is inclined towards finding the suitability & appropriate ML algorithm for solving this problem on Flu.

Download Full-text

Predicting deceased donor kidney transplant outcomes: Comparing KDRI/KDPI with machine learning

10.21203/rs.2.16892/v1 ◽

2019 ◽

Author(s):

Eric Pahl ◽

Nick Street ◽

Hans Johnson ◽

Alan Reed

Keyword(s):

Machine Learning ◽

Random Forest ◽

Kidney Transplant ◽

Random Forests ◽

Deceased Donor ◽

Error Rates ◽

Donor Kidney ◽

Deceased Donor Kidney ◽

Cost Effective Treatment ◽

Deceased Donor Kidney Transplant

Abstract Background: Kidney transplantation is a cost-effective treatment for end-stage renal failure patients that provides a significant survival benefit and improves their quality of life compared to other forms of renal replacement. The predominant method used for donor kidney quality assessment is the Cox regression-based, piecewise linear kidney donor risk index (KDRI). A machine learning method (random forest) was compared to KDRI for predicting graft failure at 12, 24, and 36 months after transplantation. Methods: Random forest was trained and evaluated with the same deceased donor kidney transplant data (n=70242) initially used to develop KDRI (1995-2005) and included four readily available recipient variables from the estimated post-transplant survival score. Results: When comparing type II error rates of 10%, random forests predicted an additional 2,148 successful grafts at 36 months after transplant (126%) than KDRI. Many high-KDRI kidneys, at risk for discard, were correctly predicted for successful transplantation with random forests. Random forest performed significantly better than KDRI for graft Kaplan-Meier survival analysis from 0-240 months (log-rank test p<0.00). Conclusions: Machine learning methods can provide a significant improvement over KDRI for the assessment of kidney offers. This work lays the foundation for the use of machine learning methodologies in transplantation and describes the steps to measure, analyze, and validate future models.

Download Full-text

Prediction of Domestic Airline Tickets using Machine Learning

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.35053 ◽

2021 ◽

Vol 9 (VI) ◽

pp. 666-674

Author(s):

Pranita Rajure

Keyword(s):

Machine Learning ◽

Random Forest ◽

Random Forests ◽

Learning Algorithm ◽

Time Series Prediction ◽

Complex Problem ◽

Reasonable Prediction ◽

The Individual ◽

Domestic Airline ◽

Individual Trees

Airlines usually keep their price strategies as commercial secrets and information is always asymmetric, it is difficult for ordinary customers to estimate future flight price changes. However, a reasonable prediction can help customers make decisions when to buy air tickets for a lower price. Flight price prediction can be regarded as a typical time series prediction problem. When you give customers a device that can help them save some money, they will pay you back with loyalty, which is priceless. Interesting fact: Fareboom users started spending twice as much time per session within a month of the release of an airfare price forecasting feature. Considering the features such as departure time, the number of days left for departure and time of the day it will give the best time to buy the ticket. Features are extracted from the collected data to apply Random Forest Machine Learning (ML) model. Then using this information, we are intended to build a system that can help buyers whether to buy a ticket or not. We have used Random Forest Algorithm which is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model. With that said, random forests are a strong modelling technique and much more robust than a single decision tree. They aggregate many decision trees to limit over fitting as well as error due to bias and therefore yield useful results. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them.

Download Full-text

On Explaining Random Forests with SAT

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/356 ◽

2021 ◽

Author(s):

Yacine Izza ◽

Joao Marques-Silva

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Random Forest ◽

Random Forests ◽

Experimental Results ◽

Polynomial Algorithms ◽

Practical Applications ◽

Heuristic Approaches ◽

Wide Range ◽

Sat Solver

Random Forest (RFs) are among the most widely used Machine Learning (ML) classifiers. Even though RFs are not interpretable, there are no dedicated non-heuristic approaches for computing explanations of RFs. Moreover, there is recent work on polynomial algorithms for explaining ML models, including naive Bayes classifiers. Hence, one question is whether finding explanations of RFs can be solved in polynomial time. This paper answers this question negatively, by proving that computing one PI-explanation of an RF is D^P-hard. Furthermore, the paper proposes a propositional encoding for computing explanations of RFs, thus enabling finding PI-explanations with a SAT solver. This contrasts with earlier work on explaining boosted trees (BTs) and neural networks (NNs), which requires encodings based on SMT/MILP. Experimental results, obtained on a wide range of publicly available datasets, demonstrate that the proposed SAT-based approach scales to RFs of sizes common in practical applications. Perhaps more importantly, the experimental results demonstrate that, for the vast majority of examples considered, the SAT-based approach proposed in this paper significantly outperforms existing heuristic approaches.

Download Full-text