Monitoring Forest Health Using Hyperspectral Imagery: Does Feature Selection Improve the Performance of Machine-Learning Techniques?

2021 ◽  
Vol 13 (23) ◽  
pp. 4832
Author(s):  
Patrick Schratz ◽  
Jannes Muenchow ◽  
Eugenia Iturritxa ◽  
José Cortés ◽  
Bernd Bischl ◽  
...  

This study analyzed highly correlated, feature-rich datasets from hyperspectral remote sensing data using multiple statistical and machine-learning methods. The effect of filter-based feature selection methods on predictive performance was compared. In addition, the effect of multiple expert-based and data-driven feature sets, derived from the reflectance data, was investigated. Defoliation of trees (%), derived from in situ measurements from fall 2016, was modeled as a function of reflectance. Variable importance was assessed using permutation-based feature importance. Overall, the support vector machine (SVM) outperformed other algorithms, such as random forest (RF), extreme gradient boosting (XGBoost), and lasso (L1) and ridge (L2) regressions by at least three percentage points. The combination of certain feature sets showed small increases in predictive performance, while no substantial differences between individual feature sets were observed. For some combinations of learners and feature sets, filter methods achieved better predictive performances than using no feature selection. Ensemble filters did not have a substantial impact on performance. The most important features were located around the red edge. Additional features in the near-infrared region (800–1000 nm) were also essential to achieve the overall best performances. Filter methods have the potential to be helpful in high-dimensional situations and are able to improve the interpretation of feature effects in fitted models, which is an essential constraint in environmental modeling studies. Nevertheless, more training data and replication in similar benchmarking studies are needed to be able to generalize the results.
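To make the permutation-based feature importance mentioned in the abstract above more concrete, the following is a minimal Python sketch of the general idea: shuffle one feature column at a time and record how much the prediction error grows. It is not the authors' implementation. The names `toy_model` and `permutation_importance`, the synthetic data, the MSE loss, and the `n_repeats` value are all assumptions made for illustration.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when a single feature column is shuffled.

    predict: a fitted model exposed as a function (rows -> predictions).
    X: list of feature rows (lists), y: list of targets.
    """
    rng = random.Random(seed)
    baseline = mse(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only; all other columns stay untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, predict(X_perm)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(rows):
    """Hypothetical stand-in for a fitted learner: uses feature 0 only."""
    return [3.0 * row[0] for row in rows]


# Synthetic data (assumption): the target depends on feature 0, not feature 1.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)
```

In this toy setup, shuffling the unused feature leaves the error unchanged, while shuffling the informative feature increases it. With highly correlated features, such as neighboring spectral bands, importance can be spread across the correlated group, so individual values should be read with care.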

2020 ◽  
Author(s):  
Patrick Schratz ◽  
Jannes Muenchow ◽  
Eugenia Iturritxa ◽  
José Cortés ◽  
Bernd Bischl ◽  
...  

This study analyzed highly correlated, feature-rich datasets from hyperspectral remote sensing data using multiple machine-learning and statistical-learning methods. The effect of filter-based feature-selection methods on predictive performance was compared. In addition, the effect of multiple expert-based and data-driven feature sets, derived from the reflectance data, was investigated. Defoliation of trees (%) was modeled as a function of reflectance, and variable importance was assessed using permutation-based feature importance. Overall, the support vector machine (SVM) outperformed the other learners, such as random forest (RF), extreme gradient boosting (XGBoost), and lasso (L1) and ridge (L2) regression, by at least three percentage points. The combination of certain feature sets showed small increases in predictive performance, while no substantial differences between individual feature sets were observed. For some combinations of learners and feature sets, filter methods achieved better predictive performance than the unfiltered feature sets, while ensemble filters did not have a substantial impact on performance. Permutation-based feature importance estimated features around the red edge to be most important for the models. However, the presence of features in the near-infrared region (800–1000 nm) was essential to achieve the best performance. More training data and replication in similar benchmarking studies are needed for more generalizable conclusions. Filter methods have the potential to be helpful in high-dimensional situations and are able to improve the interpretation of feature effects in fitted models, which is an essential constraint in environmental modeling studies.


Author(s):  
Harsha A K

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detecting malware, such as packet content analysis, are inefficient when dealing with encrypted data. In the absence of actual packet contents, other features such as packet size, arrival time, source and destination addresses, and similar metadata can be used to detect malware. Such information can be used to train machine learning classifiers to separate malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. The algorithms are trained on the training set, and the resulting models are evaluated against the testing set to assess their respective performance. We further attempt to tune the hyperparameters of the algorithms to achieve better results. Random forest and extreme gradient boosting performed exceptionally well in our experiments, with area under the curve values of 0.9928 and 0.9998, respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
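The area under the curve values reported above summarize how well a classifier ranks malicious traffic above benign traffic. As a rough sketch of what that number measures, the Python snippet below computes the ROC AUC directly as the fraction of correctly ordered (positive, negative) pairs. The function name `roc_auc` and the toy labels and scores are assumptions for illustration and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    Tied scores count as half a correct pair.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy values (assumption): 1 = malicious, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The pairwise form is quadratic in the number of samples, so it is only suitable for small examples. Rank-based formulas give the same value more cheaply on large test sets.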


2020 ◽  
Vol 9 (9) ◽  
pp. 507
Author(s):  
Sanjiwana Arjasakusuma ◽  
Sandiaga Swahyu Kusuma ◽  
Stuart Phinn

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving data with high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA) were evaluated in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with tree (XGBtree and XGBdart) and linear (XGBlin) classifiers. The results demonstrated that the combinations BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R² = 0.53 and RMSE = 1.7 m (nRMSE of 18.4% and bias of 0.046 m) for BO-XGBdart and R² = 0.51 and RMSE = 1.8 m (nRMSE of 15.8% and bias of −0.244 m) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection; it could reduce the data by 95%, selecting the 29 most important variables from the initial 516 variables from lidar metrics and hyperspectral data.
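The accuracy figures above (R², RMSE, nRMSE, and bias) can be illustrated with a short Python sketch. The abstract does not state how nRMSE was normalized or which sign convention was used for bias, so the choices below are assumptions: nRMSE is RMSE divided by the mean of the observations, and bias is the mean of predicted minus observed. The function name and the toy heights are also hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """RMSE, nRMSE (% of mean observation), bias, and R² for paired values."""
    n = len(observed)
    residuals = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(r ** 2 for r in residuals)                 # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)     # total sum of squares
    rmse = math.sqrt(ss_res / n)
    return {
        "rmse": rmse,
        "nrmse_percent": 100.0 * rmse / mean_obs,  # assumption: normalized by the mean
        "bias": sum(residuals) / n,                # assumption: predicted minus observed
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy canopy heights in metres (assumption, not data from the study).
heights_observed = [8.0, 10.0, 12.0, 14.0]
heights_predicted = [9.0, 9.0, 13.0, 13.0]

metrics = regression_metrics(heights_observed, heights_predicted)
print(metrics)
```

Other normalizations, such as dividing by the observed range, are also in common use, so nRMSE values from different studies are only comparable when the same convention is applied.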


2020 ◽  
Vol 12 (23) ◽  
pp. 3925
Author(s):  
Ivan Pilaš ◽  
Mateo Gašparović ◽  
Alan Novkinić ◽  
Damir Klobučar

The presented study demonstrates a bi-sensor approach suitable for rapid, precise, and up-to-date mapping of forest canopy gaps over a larger spatial extent. The approach makes use of Unmanned Aerial Vehicle (UAV) red, green and blue (RGB) images on smaller areas for highly precise forest canopy mask creation. Sentinel-2 (S-2) was used as a scaling platform for transferring information from the UAV to a wider spatial extent. Various approaches to improving the predictive performance were examined: (I) the highest R² of a single satellite index was 0.57, (II) the highest R² using multiple features obtained from a single-date S-2 image was 0.624, and (III) the highest R² on the multitemporal set of S-2 images was 0.697. Satellite indices such as Atmospherically Resistant Vegetation Index (ARVI), Infrared Percentage Vegetation Index (IPVI), Normalized Difference Index (NDI45), Pigment-Specific Simple Ratio Index (PSSRa), Modified Chlorophyll Absorption Ratio Index (MCARI), Color Index (CI), Redness Index (RI), and Normalized Difference Turbidity Index (NDTI) were the dominant predictors in most of the Machine Learning (ML) algorithms. The more complex ML algorithms, such as Support Vector Machines (SVM), Random Forest (RF), Stochastic Gradient Boosting (GBM), Extreme Gradient Boosting (XGBoost), and Catboost, provided the best performance on the training set but exhibited weaker generalization capabilities. Therefore, the simpler and more robust Elastic Net (ENET) algorithm was chosen for the final map creation.


2020 ◽  
Vol 8 ◽  
Author(s):  
Yasutaka Kuniyoshi ◽  
Haruka Tokutake ◽  
Natsuki Takahashi ◽  
Azusa Kamura ◽  
Sumie Yasuda ◽  
...  

We constructed an optimal machine learning (ML) method for predicting intravenous immunoglobulin (IVIG) resistance in children with Kawasaki disease (KD) using commonly available clinical and laboratory variables. We retrospectively collected 98 clinical records of hospitalized children with KD (2–109 months of age). We found that 20 (20%) children were resistant to initial IVIG therapy. We trained three ML models (logistic regression, linear support vector machine, and eXtreme gradient boosting) on 10 variables to predict IVIG resistance. We estimated predictive performance using nested 5-fold cross-validation (CV). We also selected variables using the recursive feature elimination method and performed the nested 5-fold CV with the selected variables in the same manner. We compared the ML models with the existing scoring systems regardless of their predictive performance. The area under the receiver operating characteristic curve was in the range of 0.58–0.60 for the all-variable models and 0.60–0.75 for the selected-variable models. The specificities were more than 0.90 and higher than those of existing scoring systems, but the sensitivities were lower. The three ML models based on demographics and routine laboratory variables did not provide reliable performance. This is possibly the first study that has attempted to establish a better predictive model. Additional biomarkers are probably needed to generate an effective prediction model.
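The nested 5-fold cross-validation described above keeps model selection separate from performance estimation: inner folds are drawn only from the outer training portion, so the outer test fold is never used for tuning or variable selection. The Python sketch below shows only how such index splits could be generated. It is an illustration under assumptions, not the authors' code: the function names, the seeds, and the unstratified shuffling are hypothetical, and the sample size of 98 is borrowed from the abstract purely as an example.

```python
import random


def k_fold_indices(indices, k, seed):
    """Shuffle the given indices and split them into k (train, test) pairs."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def nested_cv_splits(n_samples, outer_k=5, inner_k=5, seed=0):
    """Return a list of (outer_train, outer_test, inner_splits) tuples.

    inner_splits only contains indices from outer_train, so tuning and
    feature selection never touch the outer test fold.
    """
    nested = []
    outer = k_fold_indices(range(n_samples), outer_k, seed)
    for fold_no, (outer_train, outer_test) in enumerate(outer):
        inner = k_fold_indices(outer_train, inner_k, seed + 1 + fold_no)
        nested.append((outer_train, outer_test, inner))
    return nested


nested = nested_cv_splits(98)
for outer_train, outer_test, inner in nested:
    print(len(outer_train), len(outer_test), len(inner))
```

In a full pipeline, hyperparameter tuning and recursive feature elimination would be run on the inner splits, and the chosen configuration would then be refit on `outer_train` and scored once on `outer_test`. With a minority class as small as the one reported here, stratified splitting would usually be preferable to the plain shuffling used in this sketch.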


2021 ◽  
Vol 12 ◽  
Author(s):  
Negar Darabi ◽  
Niyousha Hosseinichimeh ◽  
Anthony Noto ◽  
Ramin Zand ◽  
Vida Abedi

Background and Purpose: Hospital readmissions impose a substantial burden on the healthcare system. Reducing readmissions after stroke could improve quality of care, especially since stroke is associated with a high rate of readmission. The goal of this study is to enhance our understanding of the predictors of 30-day readmission after ischemic stroke and to develop models that identify high-risk individuals for targeted interventions. Methods: We used patient-level data from electronic health records (EHR), five machine learning algorithms (random forest, gradient boosting machine, extreme gradient boosting [XGBoost], support vector machine, and logistic regression [LR]), a data-driven feature selection strategy, and adaptive sampling to develop 15 models of 30-day readmission after ischemic stroke. We further identified important clinical variables. Results: We included 3,184 patients with ischemic stroke (mean age: 71 ± 13.90 years, men: 51.06%). Among the 61 clinical variables included in the model, a National Institutes of Health Stroke Scale score above 24, insertion of an indwelling urinary catheter, hypercoagulable state, and percutaneous gastrostomy had the highest importance scores. The model's AUC (area under the curve) for predicting 30-day readmission was 0.74 (95% CI: 0.64–0.78) with a PPV of 0.43 when the XGBoost algorithm was used with ROSE-sampling. The balance between specificity and sensitivity improved through the sampling strategy. The best sensitivity was achieved with LR when optimized with feature selection and ROSE-sampling (AUC: 0.64, sensitivity: 0.53, specificity: 0.69). Conclusions: Machine learning-based models can be designed to predict 30-day readmission after stroke using structured data from EHR. Among the algorithms analyzed, XGBoost with ROSE-sampling had the best performance in terms of AUC, while LR with ROSE-sampling and feature selection had the best sensitivity. Clinical variables highly associated with 30-day readmission could be targeted for personalized interventions. Depending on healthcare systems' resources and criteria, models with optimized performance metrics can be implemented to improve outcomes.


2020 ◽  
Vol 10 (3) ◽  
pp. 1151
Author(s):  
Hanna Kim ◽  
Young-Seob Jeong ◽  
Ah Reum Kang ◽  
Woohyun Jung ◽  
Yang Hoon Chung ◽  
...  

Tachycardia is defined as a heart rate greater than 100 bpm for more than 1 min. Tachycardia often occurs after endotracheal intubation and can cause serious complications in patients with cardiovascular disease. The ability to predict post-intubation tachycardia would help clinicians by flagging a potential event in time for pre-treatment. In this paper, we predict potential post-intubation tachycardia. Given electronic medical records and vital signs collected before tracheal intubation, we predict whether post-intubation tachycardia will occur within 10 min. Of 1931 available patient datasets, 257 remained after excluding those with unsuitable data, such as outliers and inappropriate annotations. Three feature sets were designed using feature selection algorithms, and two additional feature sets were defined by statistical inspection or manual examination. The five feature sets were compared using various machine learning models such as naïve Bayes classifiers, logistic regression, random forest, support vector machines, extreme gradient boosting, and artificial neural networks. Parameters of the models were optimized for each feature set. Using 10-fold cross-validation, we found that a logistic regression model with eight-dimensional hand-crafted features achieved an accuracy of 80.5%, recall of 85.1%, precision of 79.9%, an F1 score of 79.9%, and an area under the receiver operating characteristic curve of 0.85.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Moojung Kim ◽  
Young Jae Kim ◽  
Sung Jin Park ◽  
Kwang Gi Kim ◽  
Pyung Chun Oh ◽  
...  

Background: Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially during the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination. Methods: Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results: The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) had the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM had the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions: The machine learning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Arturo Moncada-Torres ◽  
Marissa C. van Maaren ◽  
Mathijs P. Hendriks ◽  
Sabine Siesling ◽  
Gijs Geleijnse

Cox Proportional Hazards (CPH) analysis is the standard for survival analysis in oncology. Recently, several machine learning (ML) techniques have been adapted for this task. Although they have been shown to yield results at least as good as classical methods, they are often disregarded because of their lack of transparency and little to no explainability, which are key for their adoption in clinical settings. In this paper, we used data from the Netherlands Cancer Registry of 36,658 non-metastatic breast cancer patients to compare the performance of CPH with ML techniques (Random Survival Forests, Survival Support Vector Machines, and Extreme Gradient Boosting [XGB]) in predicting survival using the c-index. We demonstrated that in our dataset, ML-based models can perform at least as well as the classical CPH regression (c-index ∼ 0.63), and in the case of XGB even better (c-index ∼ 0.73). Furthermore, we used Shapley Additive Explanation (SHAP) values to explain the models' predictions. We concluded that the difference in performance can be attributed to XGB's ability to model nonlinearities and complex interactions. We also investigated the impact of specific features on the models' predictions as well as their corresponding insights. Lastly, we showed that explainable ML can generate explicit knowledge of how models make their predictions, which is crucial in increasing the trust and adoption of innovative ML techniques in oncology and healthcare overall.
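The c-index used above measures how often a model ranks pairs of patients in the right order of survival. The Python sketch below shows one common formulation, Harrell's concordance index for right-censored data. It is offered as an illustration only: the abstract does not say which estimator or software was used, and the function name and the toy follow-up data are assumptions. Pairs with identical follow-up times are simply skipped in this simplified version.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index for right-censored survival data.

    A pair (i, j) is comparable when subject i has the shorter follow-up
    time and an observed event. The pair is concordant when subject i also
    has the higher predicted risk. Tied risk scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy follow-up data (assumption, not registry data).
times = [2, 4, 6, 8]          # follow-up time
events = [1, 0, 1, 1]         # 1 = event observed, 0 = censored
risks = [0.9, 0.3, 0.5, 0.6]  # higher = predicted to fail sooner

c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported values of roughly 0.63 and 0.73.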

