Process Variable Importance Analysis by Use of Random Forests in a Shapley Regression Framework

Linear regression is often used as a diagnostic tool to understand the relative contributions of operational variables to some key performance indicator or response variable. However, owing to the nature of plant operations, predictor variables tend to be correlated, often highly so, and this can lead to significant complications in assessing the importance of these variables. Shapley regression is seen as the only axiomatic approach to deal with this problem but has almost exclusively been used with linear models to date. In this paper, the approach is extended to random forests, and the results are compared with some of the empirical variable importance measures widely used with these models, i.e., permutation and Gini variable importance measures. Four case studies are considered, of which two are based on simulated data and two on real-world data from the mineral process industries. These case studies suggest that the random forest Shapley variable importance measure may be a more reliable indicator of the influence of predictor variables than the other measures that were considered. Moreover, the results obtained with the Gini variable importance measure were as reliable as, or better than, those obtained with the permutation measure of the random forest.
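To illustrate the Shapley idea referred to in this abstract, the following is a minimal sketch of an exact Shapley decomposition of a model-fit score over predictors. The function name `shapley_values` and the toy table `TOY_R2` are illustrative assumptions, and the R² numbers are made up. In a random forest setting the value function would presumably be the fit of a forest trained on each predictor subset, but the abstract does not give that detail.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Shapley weight of a coalition of size k that excludes p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal gain in fit from adding p to the coalition.
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


# Hypothetical R^2 values for models fitted on each predictor subset.
# x1 and x2 are meant to be strongly correlated; x3 is independent.
TOY_R2 = {
    frozenset(): 0.0,
    frozenset({"x1"}): 0.50,
    frozenset({"x2"}): 0.48,
    frozenset({"x3"}): 0.20,
    frozenset({"x1", "x2"}): 0.52,
    frozenset({"x1", "x3"}): 0.70,
    frozenset({"x2", "x3"}): 0.68,
    frozenset({"x1", "x2", "x3"}): 0.72,
}

importance = shapley_values(["x1", "x2", "x3"], TOY_R2.__getitem__)
```

In this toy table the importances sum to the full-model score (0.72), and the two correlated predictors share the credit instead of one of them absorbing it. The exact computation needs a model fit for every one of the 2^n subsets, so it is only practical for a small number of predictors.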


Inference of genetic networks using random forests: performance improvement using a new variable importance measure

10.21203/rs.3.rs-737867/v1 ◽

2021 ◽

Author(s):

Shuhei Kimura ◽

Yahiro Takeda ◽

Masato Tokuhisa ◽

Mariko Okada

Keyword(s):

Random Forest ◽

Random Forests ◽

Network Inference ◽

Genetic Network ◽

Variable Importance ◽

Genetic Networks ◽

Importance Measure ◽

Random Input ◽

Inference Method ◽

Variable Importance Measure

Background: Among the various methods proposed so far for genetic network inference, this study focuses on random-forest-based methods. In the random-forest-based approach, confidence values are assigned to all of the candidate regulations. To our knowledge, all of the random-forest-based methods make these assignments using the standard variable importance measure defined in tree-based machine learning techniques. We think, however, that this measure has drawbacks in the inference of genetic networks. Results: In this study we therefore propose an alternative measure, which we call "the random-input variable importance measure," and design a new inference method that uses the proposed measure in place of the standard measure in the existing random-forest-based inference method. We show, through numerical experiments, that the use of the random-input variable importance measure improves the performance of the existing random-forest-based inference method by as much as 45.5% with respect to the area under the recall-precision curve (AURPC). Conclusion: This study proposed the random-input variable importance measure for the inference of genetic networks. The use of our measure improved the performance of the random-forest-based inference method. In this study, we checked the performance of the proposed measure only on several genetic network inference problems. However, the experimental results suggest that the proposed measure will work well in other applications of random forests.
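The abstract evaluates ranked confidence values with the area under the recall-precision curve (AURPC). As a rough sketch, the area can be approximated by step-wise summation, which is the same as average precision. The abstract does not say how the authors compute the area, so the function `aurpc` and its summation rule are assumptions, and ties between confidence values are simply left in input order.

```python
def aurpc(confidences, is_true_edge):
    """Step-wise area under the recall-precision curve (average precision)."""
    # Rank candidate regulations from most to least confident.
    ranked = sorted(zip(confidences, is_true_edge), key=lambda t: -t[0])
    total_true = sum(1 for _, truth in ranked if truth)
    if total_true == 0:
        raise ValueError("need at least one true regulation")
    hits = 0
    area = 0.0
    for rank, (_, truth) in enumerate(ranked, start=1):
        if truth:
            hits += 1
            # Recall rises by 1/total_true; precision at this point is hits/rank.
            area += (hits / rank) / total_true
    return area


# Hypothetical confidence values for four candidate regulations.
perfect = aurpc([0.9, 0.8, 0.1, 0.05], [True, True, False, False])
mixed = aurpc([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
```

A ranking that places every true regulation ahead of every false one scores 1.0, and each false candidate ranked above a true one lowers the score.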


Variable Importance Measure System Based on Advanced Random Forest

Computer Modeling in Engineering & Sciences ◽

10.32604/cmes.2021.015378 ◽

2021 ◽

Vol 128 (1) ◽

pp. 65-85

Author(s):

Shufang Song ◽

Ruyang He ◽

Zhaoyin Shi ◽

Weiya Zhang

Keyword(s):

Random Forest ◽

Variable Importance ◽

Importance Measure ◽

Variable Importance Measure


An AUC-based permutation variable importance measure for random forests

BMC Bioinformatics ◽

10.1186/1471-2105-14-119 ◽

2013 ◽

Vol 14 (1) ◽

Cited By ~ 96

Author(s):

Silke Janitza ◽

Carolin Strobl ◽

Anne-Laure Boulesteix

Keyword(s):

Random Forests ◽

Variable Importance ◽

Importance Measure ◽

Variable Importance Measure


A new variable importance measure for random forests with missing data

Statistics and Computing ◽

10.1007/s11222-012-9349-1 ◽

2012 ◽

Vol 24 (1) ◽

pp. 21-34 ◽

Cited By ~ 78

Author(s):

Alexander Hapfelmeier ◽

Torsten Hothorn ◽

Kurt Ulm ◽

Carolin Strobl

Keyword(s):

Missing Data ◽

Random Forests ◽

Variable Importance ◽

Importance Measure ◽

Variable Importance Measure


Improved Variable Importance Measure of Random Forest via Combining of Proximity Measure and Support Vector Machine for Stable Feature Selection

Journal of Information and Computational Science ◽

10.12733/jics20105854 ◽

2015 ◽

Vol 12 (8) ◽

pp. 3241-3252 ◽

Cited By ~ 1

Author(s):

Huazhen Wang

Keyword(s):

Support Vector Machine ◽

Feature Selection ◽

Random Forest ◽

Variable Importance ◽

Importance Measure ◽

Support Vector ◽

Proximity Measure ◽

Variable Importance Measure ◽

Stable Feature


Attribute-oriented Classification with Variable Importance using Random Forest Model

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1297.0782s319 ◽

2019 ◽

Vol 8 (2S3) ◽

pp. 1630-1635

Keyword(s):

Machine Learning ◽

Random Forest ◽

Learning Algorithm ◽

Large Data ◽

Variable Importance ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Importance Measure ◽

Variable Importance Measure

Large datasets raise various classification problems, and the most commonly used machine learning algorithms often fail to produce accurate results on them. Data mining techniques such as ensembles combine individual classifiers for classification and can also be used to generate new data. Random forest is an ensemble supervised machine learning technique used in numerous machine learning applications, such as the classification of text and image data. It is popular because it provides useful by-products such as the variable importance measure and the out-of-bag error. For viable learning and classification, the number of decision trees in the random forest needs to be reduced (pruning). In this paper, we present a systematic overview of the random forest algorithm along with its application areas, and a brief review of machine learning algorithms proposed in recent years. Animal classification is considered an important problem, and most recent studies classify animals using image datasets. Very little work has been done on attribute-oriented animal classification, which poses many challenges in extracting accurate features. We took a real-time dataset from Kaggle to classify animals, selecting the more relevant features with the help of the variable importance measure, and compared the results with other popular machine learning models.


Addressing Measurement Error in Random Forests using Quantitative Bias Analysis

American Journal of Epidemiology ◽

10.1093/aje/kwab010 ◽

2021 ◽

Author(s):

Tammy Jiang ◽

Jaimie L Gradus ◽

Timothy L Lash ◽

Matthew P Fox

Keyword(s):

Machine Learning ◽

Measurement Error ◽

Random Forest ◽

Random Forests ◽

Model Performance ◽

Variable Importance ◽

Bias Analysis ◽

Variable Importance Measures ◽

Quantitative Bias Analysis ◽

The Impact

Although variables are often measured with error, the impact of measurement error on machine learning predictions is seldom quantified. The purpose of this study was to assess the impact of measurement error on random forest model performance and variable importance. First, we assessed the impact of misclassification (i.e., measurement error of categorical variables) of predictors on random forest model performance (e.g., accuracy, sensitivity) and variable importance (mean decrease in accuracy) using data from the United States National Comorbidity Survey Replication (2001-2003). Second, we simulated datasets in which we know the true model performance and variable importance measures and could verify that quantitative bias analysis was recovering the truth in misclassified versions of the datasets. Our findings show that measurement error in the data used to construct random forests can distort model performance and variable importance measures, and that bias analysis can recover the correct results. This study highlights the utility of applying quantitative bias analysis in machine learning to quantify the impact of measurement error on study results.
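The "mean decrease in accuracy" importance named in this abstract can be sketched as a permutation test on one predictor. This is a simplified illustration, not the study's procedure: random forests normally compute it per tree on out-of-bag samples, whereas this sketch shuffles one column of a fixed evaluation set. The `predict` function, the toy rows, and all parameter names are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, column, repeats=20, seed=0):
    """Mean decrease in accuracy after shuffling one predictor column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        permuted = [r[:column] + (v,) + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / repeats


def predict(row):
    # Hypothetical stand-in for a fitted classifier: it uses only column 0.
    return row[0]


# Toy data: column 0 determines the label, column 1 is irrelevant.
rows = [(i % 2, (i * 7) % 5) for i in range(12)]
labels = [r[0] for r in rows]

informative = permutation_importance(predict, rows, labels, column=0)
irrelevant = permutation_importance(predict, rows, labels, column=1)
```

Shuffling the irrelevant column leaves accuracy unchanged, so its importance is exactly zero, while shuffling the informative column lowers accuracy.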


On the behaviour of permutation‐based variable importance measures in random forest clustering

Journal of Chemometrics ◽

10.1002/cem.3135 ◽

2019 ◽

Vol 33 (8) ◽

Author(s):

Stefano Nembrini

Keyword(s):

Random Forest ◽

Variable Importance ◽

Variable Importance Measures


Letter to the Editor: On the stability and ranking of predictors from random forest variable importance measures

Briefings in Bioinformatics ◽

10.1093/bib/bbr016 ◽

2011 ◽

Vol 12 (4) ◽

pp. 369-373 ◽

Cited By ~ 66

Author(s):

K. K. Nicodemus

Keyword(s):

Random Forest ◽

Variable Importance ◽

Letter To The Editor ◽

Variable Importance Measures ◽

The Stability


Making use of open geo-environmental and agricultural datasets to model NO3 pollution in groundwater bodies

10.5194/egusphere-egu2020-22527 ◽

2020 ◽

Author(s):

Christian Schneider

Keyword(s):

Ground Water ◽

Crop Production ◽

Spatial Scales ◽

Variable Importance ◽

Water Bodies ◽

Importance Measure ◽

Variable Importance Measure ◽

Main Challenge ◽

Scale Levels ◽

The Impact

In Germany a vast amount of spatial geo-environmental and climatic data is available. However, anthropogenic data on land use and agriculture are still very sparse, making it difficult to assess the environmental impacts of different agricultural practices. Recently, some data on the spatial pattern of crop production and livestock production were made publicly available. This opened up the opportunity to model the impact of agriculture on nitrate leaching into groundwater bodies.

A high share of groundwater bodies in Germany contains nitrate levels above the legal threshold of 50 mg/l. Our study aims to answer the question of to what extent different types of agriculture contribute to NO3 leaching into groundwater bodies in relation to environmental factors.

We use the random forest (RF) machine learning algorithm to model and predict nitrate exceedance in groundwater bodies. The advantages of the RF algorithm are that it has a high predictive accuracy, it can use metric as well as multi-level categorical datasets, and it calculates a variable importance measure for each predictor used in a model. It therefore gives a measure of the extent to which each predictor contributes to the accuracy of the model. For this study we applied the RF classification and RF regression algorithms at different spatial scales.

Out of 56 environmental predictor datasets of potential importance for NO3 transport into groundwater bodies, 22 were chosen to model NO3 exceedance. A recursive variable elimination scheme was applied to calculate minimum predictor sets based on variable importance. In the end, the predictor set that resulted in the most accurate NO3 prediction was identified and used to model groundwater pollution.

RF modeling proved successful at all three scale levels, with OOB (out-of-bag) accuracy between 0.82 and 0.95. At all scale levels, environmental co-variables played a major role in predicting NO3 exceedance. However, the RF variable importance measure could also be used to identify the contribution of agricultural predictors to NO3 exceedance and to quantitatively prove our hypotheses.

One main challenge was to identify the influence of data quality on the RF variable importance measure.
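The recursive variable elimination scheme mentioned in this abstract could look roughly like the sketch below. The abstract gives no algorithmic detail, so the loop, the callback `fit_and_score`, and the toy importances are all assumptions for illustration. In practice the callback would train a random forest on the given predictors and return its out-of-bag accuracy together with the per-predictor importances.

```python
def recursive_elimination(features, fit_and_score):
    """Repeatedly drop the least important predictor; return the best subset.

    fit_and_score(subset) must return (score, {feature: importance}).
    """
    current = list(features)
    best_subset, best_score = None, float("-inf")
    while current:
        score, importances = fit_and_score(current)
        if score > best_score:
            best_subset, best_score = list(current), score
        # Remove the predictor the current model relies on least.
        weakest = min(current, key=lambda f: importances[f])
        current.remove(weakest)
    return best_subset, best_score


# Hypothetical importances: three informative predictors and two noise ones.
TOY_IMPORTANCE = {"a": 0.5, "b": 0.3, "c": 0.15, "noise1": 0.0, "noise2": 0.0}


def toy_fit_and_score(subset):
    # Stand-in for training a model: the score rewards informative predictors
    # and applies a small penalty for every predictor included.
    score = sum(TOY_IMPORTANCE[f] for f in subset) - 0.02 * len(subset)
    return score, {f: TOY_IMPORTANCE[f] for f in subset}


best_subset, best_score = recursive_elimination(TOY_IMPORTANCE, toy_fit_and_score)
```

With these toy numbers the search keeps the three informative predictors and discards both noise predictors. Refitting after every removal matters because importances of correlated predictors can shift once one of them is dropped.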
