Monitoring forest health using hyperspectral imagery: Does feature selection improve the performance of machine-learning techniques?
This study analyzed highly correlated, feature-rich datasets derived from hyperspectral remote-sensing data using multiple machine-learning and statistical-learning methods. The effect of filter-based feature-selection methods on predictive performance was compared. In addition, the effect of multiple expert-based and data-driven feature sets, derived from the reflectance data, was investigated. Defoliation of trees (%) was modeled as a function of reflectance, and variable importance was assessed using permutation-based feature importance. Overall, the support vector machine (SVM) outperformed random forest (RF), extreme gradient boosting (XGBoost), and lasso (L1) and ridge (L2) regression by at least three percentage points. Combining certain feature sets yielded small increases in predictive performance, while no substantial differences between individual feature sets were observed. For some combinations of learners and feature sets, filter methods achieved better predictive performance than the unfiltered feature sets, while ensemble filters did not have a substantial impact on performance.

Permutation-based feature importance identified features around the red edge as the most important for the models. However, the presence of features in the near-infrared region (800–1000 nm) was essential for achieving the best performance.

More training data and replication in similar benchmarking studies are needed for more generalizable conclusions. Filter methods have the potential to be helpful in high-dimensional situations and can improve the interpretation of feature effects in fitted models, which is an essential requirement in environmental modeling studies.
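To illustrate the type of workflow benchmarked here, the following is a minimal sketch, not the authors' implementation: a univariate filter is combined with an SVM regressor in a single pipeline, and permutation-based feature importance is computed on the fitted model. The synthetic "reflectance" data, band count, filter size, and hyperparameters are all placeholder assumptions.

```python
# Minimal sketch: filter-based feature selection feeding an SVM regressor,
# plus permutation-based feature importance. All data and settings are placeholders.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n_trees, n_bands = 200, 120                           # hypothetical sample size / number of bands
X = rng.uniform(0.0, 1.0, size=(n_trees, n_bands))    # stand-in for reflectance spectra
y = 100 * rng.beta(2, 5, size=n_trees)                # stand-in for defoliation (%)

# Filter-based feature selection (here a simple univariate F-test filter) before the learner.
model = Pipeline([
    ("filter", SelectKBest(f_regression, k=30)),
    ("svm", SVR(kernel="rbf", C=10.0, gamma="scale")),
])

# Cross-validated error of the filter + SVM pipeline.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("CV RMSE:", -scores.mean())

# Permutation-based feature importance on the fitted pipeline
# (ideally computed on held-out data rather than the training set).
model.fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
top_bands = np.argsort(imp.importances_mean)[::-1][:5]
print("Most important band indices:", top_bands)
```

In a full benchmark, the filter size and the learner's hyperparameters would typically be tuned jointly within (nested) cross-validation so that feature selection does not leak information into the performance estimates.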