variable importance measures Latest Research Papers

AbstractThis paper reviews and advocates against the use of permute-and-predict (PaP) methods for interpreting black box functions. Methods such as the variable importance measures proposed for random forests, partial dependence plots, and individual conditional expectation plots remain popular because they are both model-agnostic and depend only on the pre-trained model output, making them computationally efficient and widely available in software. However, numerous studies have found that these tools can produce diagnostics that are highly misleading, particularly when there is strong dependence among features. The purpose of our work here is to (i) review this growing body of literature, (ii) provide further demonstrations of these drawbacks along with a detailed explanation as to why they occur, and (iii) advocate for alternative measures that involve additional modeling. In particular, we describe how breaking dependencies between features in hold-out data places undue emphasis on sparse regions of the feature space by forcing the original model to extrapolate to regions where there is little to no data. We explore these effects across various model setups and find support for previous claims in the literature that PaP metrics can vastly over-emphasize correlated features in both variable importance measures and partial dependence plots. As an alternative, we discuss and recommend more direct approaches that involve measuring the change in model performance after muting the effects of the features under investigation.

Download Full-text

A New Noisy Random Forest Based Method for Feature Selection

Cybernetics and Information Technologies ◽

10.2478/cait-2021-0016 ◽

2021 ◽

Vol 21 (2) ◽

pp. 10-28

Author(s):

Yassine Akhiat ◽

Youness Manzali ◽

Mohamed Chahhou ◽

Ahmed Zinedine

Keyword(s):

Feature Selection ◽

Variable Importance ◽

Machine Learning Algorithms ◽

Importance Measure ◽

Feature Subset ◽

Large Set ◽

Stopping Criterion ◽

Processing Step ◽

New Variant ◽

Variable Importance Measures

Abstract Feature selection is an essential pre-processing step in data mining. It aims at identifying the highly predictive feature subset out of a large set of candidate features. Several approaches for feature selection have been proposed in the literature. Random Forests (RF) are among the most used machine learning algorithms not just for their excellent prediction accuracy but also for their ability to select informative variables with their associated variable importance measures. Sometimes RF model over-fits on noisy features, which lead to choosing the noisy features as the informative variables and eliminating the significant ones. Whereas, eliminating and preventing those noisy features first, the low ranked features may become more important. In this study we propose a new variant of RF that provides unbiased variable selection where a noisy feature trick is used to address this problem. First, we add a noisy feature to a dataset. Second, the noisy feature is used as a stopping criterion. If the noisy feature is selected as the best splitting feature, then we stop the creation process because at this level, the model starts to over-fit on the noisy features. Finally, the best subset of features is selected out of the best-ranked feature regarding the Geni impurity of this new variant of RF. To test the validity and the effectiveness of the proposed method, we compare it with RF variable importance measure using eleven benchmarking datasets.

Download Full-text

Addressing Measurement Error in Random Forests using Quantitative Bias Analysis

American Journal of Epidemiology ◽

10.1093/aje/kwab010 ◽

2021 ◽

Author(s):

Tammy Jiang ◽

Jaimie L Gradus ◽

Timothy L Lash ◽

Matthew P Fox

Keyword(s):

Machine Learning ◽

Measurement Error ◽

Random Forest ◽

Random Forests ◽

Model Performance ◽

Variable Importance ◽

Bias Analysis ◽

Variable Importance Measures ◽

Quantitative Bias Analysis ◽

The Impact

Abstract Although variables are often measured with error, the impact of measurement error on machine learning predictions is seldom quantified. The purpose of this study was to assess the impact of measurement error on random forest model performance and variable importance. First, we assessed the impact of misclassification (i.e., measurement error of categorical variables) of predictors on random forest model performance (e.g., accuracy, sensitivity) and variable importance (mean decrease in accuracy) using data from the United States National Comorbidity Survey Replication (2001 - 2003). Second, we simulated datasets in which we know the true model performance and variable importance measures and could verify that quantitative bias analysis was recovering the truth in misclassified versions of the datasets. Our findings show that measurement error in the data used to construct random forests can distort model performance and variable importance measures, and that bias analysis can recover the correct results. This study highlights the utility of applying quantitative bias analysis in machine learning to quantify the impact of measurement error on study results.

Download Full-text

Ranking of a wide multidomain set of predictor variables of children obesity by machine learning variable importance techniques

Scientific Reports ◽

10.1038/s41598-021-81205-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Helena Marcos-Pasero ◽

Gonzalo Colmenarejo ◽

Elena Aguilar-Aguilar ◽

Ana Ramírez de Molina ◽

Guillermo Reglero ◽

...

Keyword(s):

Machine Learning ◽

Childhood Obesity ◽

Metabolic Diseases ◽

Variable Importance ◽

The Body ◽

Gradient Boosting ◽

Gradient Boosting Machine ◽

Potential Risk Factors ◽

Variable Importance Measures ◽

Near Future

AbstractThe increased prevalence of childhood obesity is expected to translate in the near future into a concomitant soaring of multiple cardio-metabolic diseases. Obesity has a complex, multifactorial etiology, that includes multiple and multidomain potential risk factors: genetics, dietary and physical activity habits, socio-economic environment, lifestyle, etc. In addition, all these factors are expected to exert their influence through a specific and especially convoluted way during childhood, given the fast growth along this period. Machine Learning methods are the appropriate tools to model this complexity, given their ability to cope with high-dimensional, non-linear data. Here, we have analyzed by Machine Learning a sample of 221 children (6–9 years) from Madrid, Spain. Both Random Forest and Gradient Boosting Machine models have been derived to predict the body mass index from a wide set of 190 multidomain variables (including age, sex, genetic polymorphisms, lifestyle, socio-economic, diet, exercise, and gestation ones). A consensus relative importance of the predictors has been estimated through variable importance measures, implemented robustly through an iterative process that included permutation and multiple imputation. We expect this analysis will help to shed light on the most important variables associated to childhood obesity, in order to choose better treatments for its prevention.

Download Full-text

Variable-importance Measures

Explanatory Model Analysis ◽

10.1201/9780429027192-19 ◽

2021 ◽

pp. 195-208

Author(s):

Przemyslaw Biecek ◽

Tomasz Burzykowski

Keyword(s):

Variable Importance ◽

Variable Importance Measures

Download Full-text

Variable Importance Measures based on Ensemble Learning Methods for Convective Storm Tracking

2020 Joint 11th International Conference on Soft Computing and Intelligent Systems and 21st International Symposium on Advanced Intelligent Systems (SCIS-ISIS) ◽

10.1109/scisisis50064.2020.9322692 ◽

2020 ◽

Author(s):

Hansoo Lee ◽

Jonggeun Kim ◽

Seunghwan Jung ◽

Minseok Kim ◽

Baekcheon Kim ◽

...

Keyword(s):

Ensemble Learning ◽

Variable Importance ◽

Convective Storm ◽

Learning Methods ◽

Storm Tracking ◽

Variable Importance Measures

Download Full-text

Process Variable Importance Analysis by Use of Random Forests in a Shapley Regression Framework

Minerals ◽

10.3390/min10050420 ◽

2020 ◽

Vol 10 (5) ◽

pp. 420

Author(s):

Chris Aldrich

Keyword(s):

Random Forest ◽

Case Studies ◽

Random Forests ◽

Performance Indicator ◽

Variable Importance ◽

Predictor Variables ◽

Importance Measure ◽

Variable Importance Measure ◽

Operational Variables ◽

Variable Importance Measures

Linear regression is often used as a diagnostic tool to understand the relative contributions of operational variables to some key performance indicator or response variable. However, owing to the nature of plant operations, predictor variables tend to be correlated, often highly so, and this can lead to significant complications in assessing the importance of these variables. Shapley regression is seen as the only axiomatic approach to deal with this problem but has almost exclusively been used with linear models to date. In this paper, the approach is extended to random forests, and the results are compared with some of the empirical variable importance measures widely used with these models, i.e., permutation and Gini variable importance measures. Four case studies are considered, of which two are based on simulated data and two on real world data from the mineral process industries. These case studies suggest that the random forest Shapley variable importance measure may be a more reliable indicator of the influence of predictor variables than the other measures that were considered. Moreover, the results obtained with the Gini variable importance measure was as reliable or better than that obtained with the permutation measure of the random forest.

Download Full-text

Random Forest Regression for Optimizing Variable Planting Rates for Corn and Soybean Using High-Resolution Topographical and Soil Data

10.1101/2020.02.17.952556 ◽

2020 ◽

Cited By ~ 1

Author(s):

Margaret R. Krause ◽

Savanna Crossman ◽

Todd DuMond ◽

Rodman Lott ◽

Jason Swede ◽

...

Keyword(s):

New York ◽

Random Forest ◽

Variable Importance ◽

Yield Response ◽

Variable Rate ◽

Random Forest Regression ◽

Precise Control ◽

Soil Features ◽

Complex Interactions ◽

Variable Importance Measures

ABSTRACTIn recent years, planting machinery that enables precise control of the planting rates has become available for corn (Zea mays L.) and soybean (Glycine max L.). With increasingly available topographical and soil information, there is a growing interest in developing variable rate planting strategies to exploit variation in the agri-landscape in order to maximize production. A random forest regression-based approach was developed to model the interactions between planting rate, topography, and soil characteristics and their effects on yield based on on-farm variable rate planting trials for corn and soybean conducted at 27 sites in New York between 2014 and 2018 (57 site-years) in collaboration with the New York Corn and Soybean Growers Association. Planting rate ranked highly in terms of random forest regression variable importance while explaining relatively minimal yield variation in the linear context, indicating that yield response to planting rate likely depends on complex interactions with agri-landscape features. Models were moderately predictive of yield within site-years and across years at a particular site, while the ability to predict yield across sites was low. Relatedly, variable importance measures for the topographical and soil features varied considerably across sites. Together, these results suggest that local testing may provide the most accurate optimized planting rate designs due to the unique set of conditions at each site. The proposed method was extended to identify the optimal variable rate planting design for maximizing yield at each site given the topographical and soil data, and empirical validation of the resulting designs is currently underway.

Download Full-text

On the behaviour of permutation‐based variable importance measures in random forest clustering

Journal of Chemometrics ◽

10.1002/cem.3178 ◽

2019 ◽

Vol 33 (8) ◽

Author(s):

Stefano Nembrini ◽

Tiziano Frigoli

Keyword(s):

Random Forest ◽

Variable Importance ◽

Variable Importance Measures

Download Full-text

On the behaviour of permutation‐based variable importance measures in random forest clustering

Journal of Chemometrics ◽

10.1002/cem.3135 ◽

2019 ◽

Vol 33 (8) ◽

Author(s):

Stefano Nembrini

Keyword(s):

Random Forest ◽

Variable Importance ◽

Variable Importance Measures

Download Full-text

variable importance measures
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance

A New Noisy Random Forest Based Method for Feature Selection

Addressing Measurement Error in Random Forests using Quantitative Bias Analysis

Ranking of a wide multidomain set of predictor variables of children obesity by machine learning variable importance techniques

Variable-importance Measures

Variable Importance Measures based on Ensemble Learning Methods for Convective Storm Tracking

Process Variable Importance Analysis by Use of Random Forests in a Shapley Regression Framework

Random Forest Regression for Optimizing Variable Planting Rates for Corn and Soybean Using High-Resolution Topographical and Soil Data

On the behaviour of permutation‐based variable importance measures in random forest clustering

On the behaviour of permutation‐based variable importance measures in random forest clustering

Export Citation Format

variable importance measuresRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance

A New Noisy Random Forest Based Method for Feature Selection

Addressing Measurement Error in Random Forests using Quantitative Bias Analysis

Ranking of a wide multidomain set of predictor variables of children obesity by machine learning variable importance techniques

Variable-importance Measures

Variable Importance Measures based on Ensemble Learning Methods for Convective Storm Tracking

Process Variable Importance Analysis by Use of Random Forests in a Shapley Regression Framework

Random Forest Regression for Optimizing Variable Planting Rates for Corn and Soybean Using High-Resolution Topographical and Soil Data

On the behaviour of permutation‐based variable importance measures in random forest clustering

On the behaviour of permutation‐based variable importance measures in random forest clustering

variable importance measures
Recently Published Documents