Process Variable Importance Analysis by Use of Random Forests in a Shapley Regression Framework

Minerals ◽  
2020 ◽  
Vol 10 (5) ◽  
pp. 420
Author(s):  
Chris Aldrich

Linear regression is often used as a diagnostic tool to understand the relative contributions of operational variables to some key performance indicator or response variable. However, owing to the nature of plant operations, predictor variables tend to be correlated, often highly so, and this can lead to significant complications in assessing the importance of these variables. Shapley regression is seen as the only axiomatic approach to deal with this problem but has almost exclusively been used with linear models to date. In this paper, the approach is extended to random forests, and the results are compared with some of the empirical variable importance measures widely used with these models, i.e., permutation and Gini variable importance measures. Four case studies are considered, of which two are based on simulated data and two on real-world data from the mineral process industries. These case studies suggest that the random forest Shapley variable importance measure may be a more reliable indicator of the influence of predictor variables than the other measures that were considered. Moreover, the results obtained with the Gini variable importance measure were as reliable as or better than those obtained with the permutation measure of the random forest.
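The Shapley attribution described in this abstract can be illustrated with a minimal sketch. It assumes that the value of a predictor subset is the R² of a model fitted on that subset alone; in the paper's setting that model would be a random forest, whereas here a lookup table of hypothetical R² values (`R2_TABLE`, invented for illustration) stands in for the fitted models. The function name `shapley_importance` is likewise an assumption, not the author's code.

```python
from itertools import combinations
from math import factorial


def shapley_importance(predictors, r2):
    """Split the full-model R^2 among predictors by their Shapley values.

    predictors: list of predictor names.
    r2: callable mapping a frozenset of predictor names to the R^2 of a
        model fitted on that subset alone (the empty set should map to 0).
    """
    n = len(predictors)
    values = {}
    for p in predictors:
        others = [q for q in predictors if q != p]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain in R^2 from adding p to the coalition.
                total += weight * (r2(s | {p}) - r2(s))
        values[p] = total
    return values


# Hypothetical R^2 values for two correlated predictors
# (illustrative numbers only, not taken from the paper).
R2_TABLE = {
    frozenset(): 0.0,
    frozenset({"x1"}): 0.5,
    frozenset({"x2"}): 0.25,
    frozenset({"x1", "x2"}): 0.625,
}

importance = shapley_importance(["x1", "x2"], R2_TABLE.__getitem__)
```

By construction, the values in `importance` sum to the R² of the full model, so each predictor receives a share of the explained variance even when the predictors are correlated. The cost grows exponentially with the number of predictors, since every subset needs its own fitted model.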

2021 ◽  
Author(s):  
Shuhei Kimura ◽  
Yahiro Takeda ◽  
Masato Tokuhisa ◽  
Mariko Okada

Abstract Background: Among the various methods so far proposed for genetic network inference, this study focuses on the random-forest-based methods. Confidence values are assigned to all of the candidate regulations when taking the random-forest-based approach. To our knowledge, all of the random-forest-based methods make the assignments using the standard variable importance measure defined in tree-based machine learning techniques. We think, however, that this measure has drawbacks in the inference of genetic networks. Results: In this study we therefore propose an alternative measure, which we call "the random-input variable importance measure," and design a new inference method that uses the proposed measure in place of the standard measure in the existing random-forest-based inference method. We show, through numerical experiments, that the use of the random-input variable importance measure improves the performance of the existing random-forest-based inference method by as much as 45.5% with respect to the area under the recall-precision curve (AURPC). Conclusion: This study proposed the random-input variable importance measure for the inference of genetic networks. The use of our measure improved the performance of the random-forest-based inference method. In this study, we checked the performance of the proposed measure only on several genetic network inference problems. However, the experimental results suggest that the proposed measure will work well in other applications of random forests.


2021 ◽  
Vol 128 (1) ◽  
pp. 65-85
Author(s):  
Shufang Song ◽  
Ruyang He ◽  
Zhaoyin Shi ◽  
Weiya Zhang

2012 ◽  
Vol 24 (1) ◽  
pp. 21-34 ◽  
Author(s):  
Alexander Hapfelmeier ◽  
Torsten Hothorn ◽  
Kurt Ulm ◽  
Carolin Strobl

2019 ◽  
Vol 8 (2S3) ◽  
pp. 1630-1635

In the present century, various classification problems have arisen with large datasets, and the most commonly used machine learning algorithms fail to classify such data accurately. Data mining techniques such as ensembles, which are made up of individual classifiers, are used for classification and also to generate new data. Random forest is an ensemble supervised machine learning technique used in numerous machine learning applications, such as the classification of text and image data. It is popular because it offers additional useful features such as a variable importance measure and the out-of-bag error. For viable learning and classification with a random forest, the number of decision trees in the forest must be reduced (pruning). In this paper, we present a systematic overview of the random forest algorithm along with its application areas. In addition, we present a brief review of machine learning algorithms proposed in recent years. Animal classification is considered an important problem, and most recent studies classify animals using image datasets. However, very little work has been done on attribute-oriented animal classification, which poses many challenges in extracting accurate features. We took a real-time dataset from Kaggle to classify animals by selecting the more relevant features with the help of the variable importance measure, and compared the results with other popular machine learning models.


Author(s):  
Tammy Jiang ◽  
Jaimie L Gradus ◽  
Timothy L Lash ◽  
Matthew P Fox

Abstract Although variables are often measured with error, the impact of measurement error on machine learning predictions is seldom quantified. The purpose of this study was to assess the impact of measurement error on random forest model performance and variable importance. First, we assessed the impact of misclassification (i.e., measurement error of categorical variables) of predictors on random forest model performance (e.g., accuracy, sensitivity) and variable importance (mean decrease in accuracy) using data from the United States National Comorbidity Survey Replication (2001–2003). Second, we simulated datasets in which we knew the true model performance and variable importance measures and could verify that quantitative bias analysis was recovering the truth in misclassified versions of the datasets. Our findings show that measurement error in the data used to construct random forests can distort model performance and variable importance measures, and that bias analysis can recover the correct results. This study highlights the utility of applying quantitative bias analysis in machine learning to quantify the impact of measurement error on study results.
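The variable importance measure named in this abstract, mean decrease in accuracy, can be sketched roughly as follows. This is a simplified illustration rather than the authors' procedure: a fixed rule (`model`) stands in for a fitted random forest, accuracy is computed on the same rows instead of out-of-bag samples, and the data, function names, and seed are all assumptions made for the example.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, column, rng, repeats=20):
    """Mean decrease in accuracy after shuffling one predictor column."""
    baseline = accuracy(model, rows, labels)
    total_drop = 0.0
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        permuted = [r[:column] + (v,) + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        total_drop += baseline - accuracy(model, permuted, labels)
    return total_drop / repeats


# Hypothetical data: column 0 determines the label, column 1 is noise.
rng = random.Random(0)
rows = [(i % 2, rng.randint(0, 1)) for i in range(200)]
labels = [r[0] for r in rows]


def model(row):
    """Stand-in for a fitted classifier: predicts from column 0 only."""
    return row[0]


imp_signal = permutation_importance(model, rows, labels, 0, rng)
imp_noise = permutation_importance(model, rows, labels, 1, rng)
```

Shuffling the noise column leaves every prediction unchanged, so its importance is zero, while shuffling the informative column lowers accuracy. Misclassifying a predictor before a model is built weakens its link to the outcome, which is one way such an importance score could be distorted.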


2020 ◽  
Author(s):  
Christian Schneider

In Germany a vast number of spatial geo-environmental and climatic datasets are available. However, anthropogenic data on land use and agriculture are still very sparse, making it difficult to assess the environmental impacts of different agricultural practices. Recently, some data on the spatial pattern of crop production and livestock production were made publicly available. This opened up the opportunity to model the impact of agriculture on nitrate leaching into groundwater bodies. A high share of groundwater bodies in Germany contain nitrate levels above the legal threshold of 50 mg l⁻¹. Our study aims to answer the question: to what extent do different types of agriculture contribute to NO₃ leaching into groundwater bodies in relation to environmental factors? We use the random forest (RF) machine learning algorithm to model and predict nitrate exceedance in groundwater bodies. The advantages of the RF algorithm are that it has a high predictive accuracy, it can use metric as well as multi-level categorical datasets, and it calculates a variable importance measure for each predictor used in a model. It therefore gives a measure of the extent to which each predictor contributes to the accuracy of the model. For this study we applied the RF classification and RF regression algorithms on different spatial scales. Out of 56 environmental predictor datasets of potential importance for NO₃ transport into groundwater bodies, 22 were chosen to model NO₃ exceedance. A recursive variable elimination scheme was applied to calculate minimum predictor sets based on variable importance. In the end, the predictor set that resulted in the most accurate NO₃ prediction was identified and used to model groundwater pollution. RF modeling proved to be successful on all three scale levels, with OOB accuracy between 0.82 and 0.95. At all scale levels environmental co-variables played a major role in predicting NO₃ exceedance. But the RF variable importance measure could also be used to identify the contribution of agricultural predictors to NO₃ exceedance and to quantitatively test our hypotheses. One main challenge was to identify the influence of data quality on the RF variable importance measure.
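The recursive variable elimination scheme mentioned in this abstract could look roughly like the sketch below. It assumes that the least important predictor is dropped in each round and that the subset with the best score is kept; the abstract does not spell out these details. The callback `toy_fit_and_score`, the `SIGNAL` table, and the predictor names are hypothetical stand-ins for fitting a random forest and reading off its accuracy and variable importances.

```python
def recursive_elimination(predictors, fit_and_score):
    """Drop the least important predictor each round; keep the best subset.

    fit_and_score: callable taking a list of predictor names and returning
        (score, importance), where importance maps each name to a number.
    """
    current = list(predictors)
    best_subset, best_score = None, float("-inf")
    while current:
        score, importance = fit_and_score(current)
        if score > best_score:
            best_subset, best_score = list(current), score
        # Remove the predictor that contributes least to the model.
        weakest = min(current, key=lambda p: importance[p])
        current.remove(weakest)
    return best_subset, best_score


# Hypothetical signal strengths; the noise predictors carry none.
SIGNAL = {"soil": 0.30, "precipitation": 0.25, "livestock": 0.10,
          "noise_a": 0.0, "noise_b": 0.0}


def toy_fit_and_score(subset):
    """Toy stand-in for model fitting: each predictor costs a small penalty."""
    score = sum(SIGNAL[p] for p in subset) - 0.02 * len(subset)
    importance = {p: SIGNAL[p] for p in subset}
    return score, importance


best_subset, best_score = recursive_elimination(list(SIGNAL), toy_fit_and_score)
```

In this toy setup the two noise predictors are eliminated first and the remaining three form the best-scoring subset. With a real model the importances would be recomputed from a fresh fit in every round, as the callback design allows.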

