On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data

This paper proposes a cluster-aware supervised learning (CluSL) framework, which integrates the clustering analysis with supervised learning. The objective of CluSL is to simultaneously find the best clusters of the data points and minimize the sum of loss functions within each cluster. This framework has many potential applications in healthcare, operations management, manufacturing, and so on. Because CluSL, in general, is nonconvex, we develop a regularized alternating minimization (RAM) algorithm to solve it, where at each iteration, we penalize the distance between the current clustering solution and the one from the previous iteration. By choosing a proper penalty function, we show that each iteration of the RAM algorithm can be computed efficiently. We further prove that the proposed RAM algorithm will always converge to a stationary point within a finite number of iterations. This is the first known convergence result in cluster-aware learning literature. Furthermore, we extend CluSL to the high-dimensional data sets, termed the F-CluSL framework. In F-CluSL, we cluster features and minimize loss function at the same time. Similarly, to solve F-CluSL, a variant of the RAM algorithm (i.e., F-RAM) is developed and proven to be convergent to an [Formula: see text]-stationary point. Our numerical studies demonstrate that the proposed CluSL and F-CluSL can outperform the existing ones such as random forests and support vector classification, both in the interpretability of learning results and in prediction accuracy. Summary of Contribution: Aligned with the mission and scope of the INFORMS Journal on Computing, this paper proposes a cluster-aware supervised learning (CluSL) framework, which integrates the clustering analysis with supervised learning. Because CluSL is, in general, nonconvex, a regularized alternating projection algorithm is developed to solve it and is proven to always find a stationary solution. We further generalize the framework to the high-dimensional data set, F-CluSL. Our numerical studies demonstrate that the proposed CluSL and F-CluSL can deliver more interpretable learning results and outperform the existing ones such as random forests and support vector classification in computational time and prediction accuracy.

Download Full-text

Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data

The Scientific World JOURNAL ◽

10.1155/2015/471371 ◽

2015 ◽

Vol 2015 ◽

pp. 1-18 ◽

Cited By ~ 16

Author(s):

Thanh-Tung Nguyen ◽

Joshua Zhexue Huang ◽

Thuy Thi Nguyen

Keyword(s):

Feature Selection ◽

Random Forests ◽

Selection Process ◽

High Dimensional Data ◽

Feature Weighting ◽

High Dimensional ◽

Feature Subset ◽

Value Assessment ◽

Statistical Measures ◽

Real World Datasets

Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features usingp-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.

Download Full-text

Classifying Very-High-Dimensional Data with Random Forests of Oblique Decision Trees

Advances in Knowledge Discovery and Management - Studies in Computational Intelligence ◽

10.1007/978-3-642-00580-0_3 ◽

2010 ◽

pp. 39-55 ◽

Cited By ~ 16

Author(s):

Thanh-Nghi Do ◽

Philippe Lenca ◽

Stéphane Lallich ◽

Nguyen-Khang Pham

Keyword(s):

Decision Trees ◽

Random Forests ◽

High Dimensional Data ◽

High Dimensional ◽

Very High

Download Full-text

A computationally fast variable importance test for random forests for high-dimensional data

Advances in Data Analysis and Classification ◽

10.1007/s11634-016-0276-4 ◽

2016 ◽

Vol 12 (4) ◽

pp. 885-915 ◽

Cited By ~ 33

Author(s):

Silke Janitza ◽

Ender Celik ◽

Anne-Laure Boulesteix

Keyword(s):

Random Forests ◽

High Dimensional Data ◽

Variable Importance ◽

High Dimensional ◽

Fast Variable

Download Full-text

Random forests for high-dimensional longitudinal data

Statistical Methods in Medical Research ◽

10.1177/0962280220946080 ◽

2020 ◽

pp. 096228022094608

Author(s):

Louis Capitaine ◽

Robin Genuer ◽

Rodolphe Thiébaut

Keyword(s):

Longitudinal Data ◽

Random Forests ◽

High Dimensional Data ◽

Vaccine Trial ◽

Repeated Measurements ◽

Supervised Machine Learning ◽

Estimation Methods ◽

High Dimensional ◽

Hiv Viral Load ◽

Gene Transcripts

Random forests are one of the state-of-the-art supervised machine learning methods and achieve good performance in high-dimensional settings where p, the number of predictors, is much larger than n, the number of observations. Repeated measurements provide, in general, additional information, hence they are worth accounted especially when analyzing high-dimensional data. Tree-based methods have already been adapted to clustered and longitudinal data by using a semi-parametric mixed effects model, in which the non-parametric part is estimated using regression trees or random forests. We propose a general approach of random forests for high-dimensional longitudinal data. It includes a flexible stochastic model which allows the covariance structure to vary over time. Furthermore, we introduce a new method which takes intra-individual covariance into consideration to build random forests. Through simulation experiments, we then study the behavior of different estimation methods, especially in the context of high-dimensional data. Finally, the proposed method has been applied to an HIV vaccine trial including 17 HIV-infected patients with 10 repeated measurements of 20,000 gene transcripts and blood concentration of human immunodeficiency virus RNA. The approach selected 21 gene transcripts for which the association with HIV viral load was fully relevant and consistent with results observed during primary infection.

Download Full-text

Multiple Imputation and Random Forests (MIRF) for Unobservable, High-Dimensional Data

The International Journal of Biostatistics ◽

10.2202/1557-4679.1049 ◽

2007 ◽

Vol 3 (1) ◽

Cited By ~ 4

Author(s):

Bareng A. S. Nonyane ◽

Andrea S. Foulkes

Keyword(s):

Multiple Imputation ◽

Random Forests ◽

High Dimensional Data ◽

High Dimensional

Download Full-text