Variable Importance
Recently Published Documents

TOTAL DOCUMENTS: 296 (Five Years: 113)
H-INDEX: 30 (Five Years: 5)

2022, Vol 14 (1), pp. 211
Author(s): Pengxiang Zhao, Zohreh Masoumi, Maryam Kalantari, Mahtab Aflaki, Ali Mansourian

Landslides often cause significant casualties and economic losses, so landslide susceptibility mapping (LSM) has become increasingly urgent and important. The potential of deep learning (DL) methods such as convolutional neural networks (CNNs) applied to landslide causative factors has not yet been fully explored. The main aim of this study is to investigate GIS-based LSM in Zanjan, Iran, and to identify the most important causative factors of landslides in the study area. Different machine learning (ML) methods were employed and compared to select the best-performing one: the CNN was compared with four ML algorithms, namely random forest (RF), artificial neural network (ANN), support vector machine (SVM), and logistic regression (LR). To do so, sixteen landslide causative factors were extracted and their related spatial layers prepared; the algorithms were then trained on landslide and non-landslide points. The results show that all five algorithms performed well (precision = 82.43–85.6%, AUC = 0.934–0.967). The RF algorithm achieved the best result, followed by the CNN, SVM, ANN, and LR, respectively. Moreover, variable importance analysis indicates that slope and topographic curvature contribute most to the predictions. These results should benefit planning strategies for landslide risk management.
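The comparison described above can be sketched with scikit-learn. This is a minimal illustration, not the study's pipeline: synthetic data stands in for the sixteen causative-factor layers and the landslide/non-landslide points, and only three of the compared classifiers are shown. Impurity-based importances from the fitted forest play the role of the variable importance analysis.

```python
# Hedged sketch: comparing classifiers for susceptibility mapping and
# reading variable importance from a random forest. Synthetic data stands
# in for the sixteen causative factors and the landslide labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: 16 "causative factors", binary landslide label.
X, y = make_classification(n_samples=1000, n_features=16,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}
auc = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Impurity-based variable importance from the fitted forest
# (sums to 1 across features).
importance = models["RF"].feature_importances_
```

On real causative-factor layers one would compare AUCs across all five algorithms, then inspect `importance` to see which factors (e.g., slope, curvature) dominate.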


2021, Vol 11 (1)
Author(s): Yuan Zhou, Botao Fa, Ting Wei, Jianle Sun, Zhangsheng Yu, ...

Abstract
Investigation of the genetic basis of traits or clinical outcomes relies heavily on identifying relevant variables in molecular data. However, characteristics of these data such as high dimensionality and complex correlation structures hinder the development of related methods, resulting in the inclusion of false positives and negatives. We developed a variable importance measure, termed the ECAR score, that evaluates the importance of variables in a dataset. Based on this score, ranking and selection of variables can be achieved simultaneously. Unlike most current approaches, the ECAR scores aim to rank the influential variables as high as possible while maintaining the grouping property, instead of selecting the ones that are merely predictive. The performance of the ECAR scores was tested and compared to that of other methods on simulated, semi-synthetic, and real datasets. The results show that the ECAR scores improve on the CAR scores in terms of both variable-selection accuracy and the predictive power of high-ranked variables. They also outperform classic methods such as the lasso and stability selection when there is a high degree of correlation among influential variables. As an application, we used the ECAR scores to analyze genes associated with forced expiratory volume in the first second in patients with lung cancer and report six associated genes.
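The ECAR scores themselves are the paper's contribution and are not reproduced here. As background, the CAR scores they build on are the decorrelated (correlation-adjusted) marginal correlations P^{-1/2} r, where P is the predictor correlation matrix and r the vector of marginal predictor-response correlations. A minimal numpy sketch of that baseline, on synthetic data where only the first two predictors are influential:

```python
# Minimal sketch of the baseline CAR scores (decorrelated marginal
# correlations P^{-1/2} r), which the ECAR scores aim to improve on.
import numpy as np

def car_scores(X, y):
    """CAR scores: P^{-1/2} r with P the predictor correlation matrix
    and r the marginal correlations between predictors and response."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    P = np.corrcoef(Xs, rowvar=False)      # predictor correlation matrix
    r = Xs.T @ ys / len(y)                 # marginal correlations with y
    w, V = np.linalg.eigh(P)               # P^{-1/2} via eigendecomposition
    P_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return P_inv_sqrt @ r

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
beta = np.array([2.0, 1.0, 0.0, 0.0, 0.0])   # only the first two matter
y = X @ beta + rng.normal(size=500)
scores = car_scores(X, y)
```

Ranking variables by the squared scores recovers the two influential predictors; the ECAR modification concerns how correlated groups of influential variables are ranked.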


Author(s): Brian D. Williamson, Peter B. Gilbert, Noah R. Simon, Marco Carone

2021, Vol 31 (6)
Author(s): Giles Hooker, Lucas Mentch, Siyu Zhou

Abstract
This paper reviews and advocates against the use of permute-and-predict (PaP) methods for interpreting black box functions. Methods such as the variable importance measures proposed for random forests, partial dependence plots, and individual conditional expectation plots remain popular because they are both model-agnostic and depend only on the pre-trained model output, making them computationally efficient and widely available in software. However, numerous studies have found that these tools can produce diagnostics that are highly misleading, particularly when there is strong dependence among features. The purpose of our work here is to (i) review this growing body of literature, (ii) provide further demonstrations of these drawbacks along with a detailed explanation as to why they occur, and (iii) advocate for alternative measures that involve additional modeling. In particular, we describe how breaking dependencies between features in hold-out data places undue emphasis on sparse regions of the feature space by forcing the original model to extrapolate to regions where there is little to no data. We explore these effects across various model setups and find support for previous claims in the literature that PaP metrics can vastly over-emphasize correlated features in both variable importance measures and partial dependence plots. As an alternative, we discuss and recommend more direct approaches that involve measuring the change in model performance after muting the effects of the features under investigation.
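The contrast the abstract draws can be sketched on a toy problem: two nearly identical features, where permute-and-predict importance reflects forced extrapolation while a retrain-based "mute the feature" measure (here, a simple drop-and-refit, one plausible instance of the recommended approach) reveals that either correlated feature alone is nearly redundant.

```python
# Hedged sketch: PaP permutation importance vs a retrain-based measure
# on strongly correlated features. The drop-and-refit step is one simple
# way to "mute" a feature; the paper discusses such alternatives generally.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)    # nearly a copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x3 + 0.1 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permute-and-predict: shuffling x1 breaks the x1~x2 dependence and forces
# the model to extrapolate off the data ridge.
pap = permutation_importance(rf, X_te, y_te, n_repeats=10,
                             random_state=0).importances_mean

# Retrain-based alternative: drop feature j, refit, measure the score drop.
def drop_and_retrain(j):
    cols = [k for k in range(X.shape[1]) if k != j]
    sub = RandomForestRegressor(n_estimators=200, random_state=0)
    sub.fit(X_tr[:, cols], y_tr)
    return rf.score(X_te, y_te) - sub.score(X_te[:, cols], y_te)

loco = [drop_and_retrain(j) for j in range(3)]
```

Under the retrain measure, dropping x1 costs little (x2 substitutes for it), while dropping x3 costs its full contribution; the PaP numbers do not make that redundancy visible.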


2021, Vol 13 (20), pp. 4188
Author(s): Micah Russell, Jan U. H. Eitel, Timothy E. Link, Carlos A. Silva

Forest canopies exert significant controls over the spatial distribution of snow cover. Canopy snow interception efficiency is controlled by intrinsic processes (e.g., canopy structure), extrinsic processes (e.g., meteorological conditions), and the interaction of intrinsic and extrinsic factors (e.g., air temperature and branch stiffness). In hydrological models, intrinsic processes governing snow interception are typically represented by two-dimensional metrics like the leaf area index (LAI). To improve snow interception estimates and their scalability, new approaches are needed for better characterizing the three-dimensional distribution of canopy elements. Airborne laser scanning (ALS) provides a potential means of achieving this, with recent research focused on using ALS-derived metrics that describe forest spacing to predict interception storage. A wide range of canopy structural metrics that describe individual trees can also be extracted from ALS, although relatively little is known about which of them, and in what combination, best describe intrinsic canopy properties known to affect snow interception. The overarching goal of this study was to identify important ALS-derived canopy structural metrics that could help to further improve our ability to characterize intrinsic factors affecting snow interception. Specifically, we sought to determine how much variance in canopy-intercepted snow volume can be explained by ALS-derived crown metrics, and what suite of existing and novel crown metrics most strongly affects canopy-intercepted snow volume. To achieve this, we first used terrestrial laser scanning (TLS) to quantify snow interception on 14 trees. We then used these snow interception measurements to fit a random forest model with ALS-derived crown metrics as predictors. Next, we bootstrapped 1000 calculations of variable importance (percent increase in mean squared error when a given explanatory variable is removed), keeping for the final model the nine canopy metrics that exceeded a variable importance threshold of 0.2. ALS-derived canopy metrics describing intrinsic tree structure explained approximately two-thirds of the snow interception variability (R2 ≥ 0.65, RMSE ≤ 0.52 m3, relative RMSE ≤ 48%) in our study when extrinsic factors were kept as constant as possible. For comparison, a generalized linear mixed-effects model predicting snow interception volume from LAI alone had a marginal R2 = 0.01. The three most important predictor variables were canopy length, whole-tree volume, and unobstructed returns (a novel metric). These results suggest that a suite of intrinsic variables may be used to map interception potential across larger areas and provide an improvement over interception estimates based on LAI.
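The selection step described (bootstrapped percent-increase-in-MSE importance with a fixed threshold) can be sketched as follows. This is an illustrative reconstruction, not the study's code: synthetic data stands in for the ALS crown metrics and the TLS-measured interception volumes, importance is computed by permutation on a held-out set, and the bootstrap count is reduced.

```python
# Hedged sketch: bootstrap permutation importance (% increase in MSE when
# a feature's values are scrambled) and keep predictors above a threshold.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))                 # stand-in "crown metrics"
y = 2 * X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
base_mse = mean_squared_error(y_te, rf.predict(X_te))

def pct_inc_mse(j, n_boot=100):
    """Mean proportional increase in held-out MSE when feature j is permuted."""
    inc = []
    for _ in range(n_boot):
        Xp = X_te.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        inc.append(mean_squared_error(y_te, rf.predict(Xp)) / base_mse - 1)
    return float(np.mean(inc))

importance = np.array([pct_inc_mse(j) for j in range(X.shape[1])])
kept = np.where(importance > 0.2)[0]   # mirrors the study's 0.2 threshold
```

On the study's data the analogous procedure retained nine of the candidate crown metrics.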


2021, Vol 9
Author(s): Brett Snider, Edward A. McBean, John Yawney, S. Andrew Gadsden, Bhumi Patel

2021, Vol 13 (18), pp. 3790
Author(s): Khang Chau, Meredith Franklin, Huikyo Lee, Michael Garay, Olga Kalashnikova

Exposure to fine particulate matter (PM2.5) air pollution has been shown in numerous studies to be associated with detrimental health effects. However, the ability to conduct epidemiological assessments can be limited by challenges in generating reliable PM2.5 estimates, particularly in parts of the world such as the Middle East, where measurements are scarce and extreme meteorological events such as sandstorms are frequent. To supplement exposure modeling efforts under such conditions, satellite-retrieved aerosol optical depth (AOD) has proven useful due to its global coverage. Using AODs from the Multiangle Implementation of Atmospheric Correction (MAIAC) of the MODerate Resolution Imaging Spectroradiometer (MODIS) and the Multiangle Imaging Spectroradiometer (MISR), combined with meteorological and assimilated aerosol information from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2), we constructed machine learning models to predict PM2.5 in the area surrounding the Persian Gulf, including Kuwait, Bahrain, and the United Arab Emirates (U.A.E.). Our models showed regional differences in predictive performance, with better results in the U.A.E. (median test R2 = 0.66) than in Kuwait (median test R2 = 0.51). Variable importance also differed by region: satellite-retrieved AOD variables were more important for predicting PM2.5 in Kuwait than in the U.A.E. Divergent trends in the temporal and spatial autocorrelations of PM2.5 and AOD in the two regions offered possible explanations for the differences in predictive performance and variable importance. In a test of model transferability, we found that models trained in one region and applied to another did not predict PM2.5 well, even when the transferred model performed better in its own region. Overall, the results of our study suggest that models developed over large geographic areas could generate PM2.5 estimates with greater uncertainty than could be obtained by taking a regional modeling approach. Furthermore, the development of methods to better incorporate spatial and temporal autocorrelations in machine learning models warrants further examination.
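The transferability test described can be sketched in a few lines: fit a model in one region, evaluate it in another where the feature-response relationship differs. This is a toy illustration under stated assumptions: synthetic features stand in for the MAIAC/MISR AOD and MERRA-2 covariates, and the regional difference is simulated as a different AOD-to-PM2.5 slope.

```python
# Hedged sketch of the transferability test: a model trained in region A
# is applied to region B, where the (simulated) AOD-PM2.5 relation differs.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def make_region(slope, n=500):
    """Synthetic region: 4 covariates, first one with a region-specific slope."""
    X = rng.normal(size=(n, 4))
    y = X @ np.array([slope, 1.0, 0.5, 0.0]) + 0.2 * rng.normal(size=n)
    return X, y

Xa, ya = make_region(slope=2.0)   # region A
Xb, yb = make_region(slope=0.5)   # region B

model_a = RandomForestRegressor(n_estimators=200, random_state=0).fit(Xa, ya)
model_b = RandomForestRegressor(n_estimators=200, random_state=0).fit(Xb, yb)

r2_within = model_b.score(Xb, yb)     # regional model on its own region
r2_transfer = model_a.score(Xb, yb)   # model trained in A, applied to B
```

Because the learned slope does not carry over, the transferred model scores far worse than the regional one, mirroring the study's finding that a single large-area model adds uncertainty relative to regional models.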


2021, Vol 50 (Supplement_1)
Author(s): Stephanie Long, Genevieve Lefebvre, Tibor Schuster

Abstract
Background: Advances in causal inference have helped explain the longstanding birthweight and obesity paradoxes: selection bias due to conditioning on a collider variable, i.e., collider-stratification bias (CSB). The lessons learned have critical implications for the interpretation of machine learning (ML) methods, including decision trees and random forests (RFs), that implicitly condition on input variables. RFs are a popular approach for identifying important "predictors" in large datasets through variable importance, defined as the average decrease in prediction accuracy. While CSB has become a recognized concern when estimating exposure-outcome effects, knowledge of its impact on ML variable importance measures (VIMs) is limited. Applying the causal inference framework, we investigated the accuracy of RF VIMs in data-generating mechanisms prone to CSB.
Methods: A Monte Carlo simulation study was conducted, with binary outcome and collider variables generated from logistic models. Two exposure variables stochastically determined the outcome and a collider variable that was causally independent of the outcome. VIMs from RFs were compared to the known causal relevance of the input variables for the outcome.
Results: While the variable importance of the true exposure variables was not systematically affected by CSB, the validity of VIMs can be affected, leading to the erroneous selection of collider variables, causally independent of the outcome, as outcome predictors.
Conclusions: In the presence of CSB, VIMs are not valid measures of the causal relevance of variables and may mislead the selection of truly important factors that affect the outcome.
Key messages: ML must consider causal data-generating mechanisms; otherwise it may lead to erroneous assessments of variable importance for outcome prediction.

