Machine learning-based feature importance approach for sensitivity analysis of steel frames

2021 ◽  
Author(s):  
Hyeyoung Koh ◽  
Hannah Beth Blum

This study presents a machine learning-based approach for sensitivity analysis to examine how parameters affect a given structural response while accounting for uncertainty. Reliability-based sensitivity analysis involves repeated evaluations of the performance function incorporating uncertainties to estimate the influence of a model parameter, which can lead to prohibitive computational costs. This challenge is exacerbated for large-scale engineering problems, which often carry a large number of uncertain parameters. The proposed approach is based on feature selection algorithms that rank feature importance and remove redundant predictors during model development, improving model generality and training performance by focusing only on the significant features. The approach allows sensitivity analysis of structural systems to be performed by providing feature rankings with reduced computational effort. It is demonstrated on two designs of a two-bay, two-story planar steel frame with different failure modes: inelastic instability of a single member and progressive yielding. The feature variables in the data are uncertainties including material yield strength, Young's modulus, frame sway imperfection, and residual stress. Monte Carlo sampling is used to generate random realizations of the frames from published distributions of the feature parameters, and the response variable is the frame ultimate strength obtained from finite element analyses. Decision trees are trained to identify important features. Feature rankings are derived with four feature selection techniques: impurity-based importance, permutation importance, SHAP, and Spearman's rank correlation. The predictive performance of the model built on the important features is assessed using the Matthews correlation coefficient, an evaluation metric suited to imbalanced datasets. Finally, the results are compared with those from a reliability-based sensitivity analysis of the same example frames to validate the feature selection approach. As the proposed machine learning-based approach reproduces the results of the reliability-based sensitivity analysis with improved computational efficiency and accuracy, it could be extended to other structural systems.
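The workflow described above can be illustrated with a short, self-contained Python sketch. This is not the study's code: the feature names, the synthetic Monte Carlo data, and the failure rule are invented for illustration, and SHAP is omitted because it requires the separate shap package. The sketch shows impurity-based and permutation rankings from a decision tree, a Spearman ranking, and evaluation with the Matthews correlation coefficient.

```python
# Hedged sketch, not the study's code: a decision tree on synthetic Monte Carlo
# realizations of four uncertain frame parameters, with impurity-based,
# permutation, and Spearman rankings plus the Matthews correlation coefficient.
import numpy as np
from scipy.stats import spearmanr
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = ["yield_strength", "youngs_modulus", "sway_imperfection", "residual_stress"]
X = rng.normal(size=(2000, len(features)))                    # stand-in Monte Carlo samples
y = (X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.normal(size=2000) < -1.5).astype(int)  # rare "failure"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

impurity_rank = dict(zip(features, tree.feature_importances_))            # from the fitted tree
perm = permutation_importance(tree, X_te, y_te, n_repeats=20, random_state=0)
perm_rank = dict(zip(features, perm.importances_mean))                    # model-agnostic ranking
spearman_rank = {f: abs(spearmanr(X[:, i], y).correlation)                # rank correlation with
                 for i, f in enumerate(features)}                         # the failure label

mcc = matthews_corrcoef(y_te, tree.predict(X_te))   # metric suited to the imbalanced labels
print(impurity_rank, perm_rank, spearman_rank, mcc)
```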

2021 ◽  
Author(s):  
Hyeyoung Koh ◽  
Hannah Beth Blum

A machine learning-based feature selection approach is presented to estimate the effect of uncertainties and identify failure modes of structures characterized by a low failure probability and high-dimensional uncertainties. Because structures are designed to fail rarely, a dataset classified by failure status becomes imbalanced, which poses a challenge for predictive modeling with machine learning classifiers. Moreover, to improve the accuracy and efficiency of the model, it is necessary to separate critical factors from redundant ones, especially for a large feature set. This study benchmarks the method for sensitivity analysis on datasets that exacerbate the problems of class imbalance and large input dimensionality. Two planar steel frames with spatially uncorrelated properties between structural members are investigated. Geometric and material properties, such as material yield stress, Young's modulus, frame sway, and residual stress, are treated as uncertainties. Six feature importance techniques, ANOVA, mRMR, Spearman's rank correlation, impurity-based importance, permutation importance, and SHAP, are employed to measure feature importance and identify the parameters most germane to predicting structural failure. Logistic regression and decision tree models are trained on the important feature set, and their predictive performance is evaluated. The use of the feature importance approach for structures with a low probability of failure and a large number of uncertain parameters is validated by results matching the reliability-based sensitivity study and by appropriate predictive accuracy.
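A minimal sketch of two of the ingredients mentioned above, ANOVA-based feature ranking and a classifier trained on an imbalanced failure dataset, is given below. The data and feature count are synthetic placeholders, and mRMR and SHAP are left out because they are not part of scikit-learn.

```python
# Hedged sketch: ANOVA F-test ranking plus a class-weighted logistic regression
# for an imbalanced failure dataset, scored with the Matthews correlation
# coefficient. Data, feature count, and failure rule are invented placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 40))                         # e.g. member-by-member uncertain properties
y = (X[:, :3].sum(axis=1) < -3.0).astype(int)           # rare failures -> imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Rank features by ANOVA F-score and keep the top ten
selector = SelectKBest(score_func=f_classif, k=10).fit(X_tr, y_tr)
top = np.argsort(selector.scores_)[::-1][:10]

# class_weight="balanced" counteracts the rarity of the failure class
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr[:, top], y_tr)
print("MCC:", matthews_corrcoef(y_te, clf.predict(X_te[:, top])))
```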


2021 ◽  
Vol 22 (16) ◽  
pp. 8958
Author(s):  
Phasit Charoenkwan ◽  
Chanin Nantasenamat ◽  
Md. Mehedi Hasan ◽  
Mohammad Ali Moni ◽  
Pietro Lio’ ◽  
...  

Accurate identification of bitter peptides is of great importance for better understanding their biochemical and biophysical properties. To date, machine learning-based methods have become effective approaches for identifying potential bitter peptides from large-scale protein datasets. Although a few machine learning-based predictors have been developed for identifying the bitterness of peptides, their prediction performance could still be improved. In this study, we developed a new predictor (named iBitter-Fuse) for more accurate identification of bitter peptides. In the proposed iBitter-Fuse, we integrated a variety of feature encoding schemes that provide complementary information, namely compositional information and physicochemical properties. To enhance predictive performance, a customized genetic algorithm utilizing a self-assessment report (GA-SAR) was employed to identify informative features, which were then input into a support vector machine (SVM)-based classifier to develop the final model (iBitter-Fuse). Benchmarking experiments based on both 10-fold cross-validation and independent tests indicated that iBitter-Fuse achieved more accurate performance than state-of-the-art methods. To facilitate high-throughput identification of bitter peptides, the iBitter-Fuse web server was established and made freely available online. It is anticipated that iBitter-Fuse will be a useful tool for aiding the discovery and de novo design of bitter peptides.
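For readers unfamiliar with the overall setup, the following hedged sketch builds simple amino-acid composition features, trains an SVM, and scores it with 10-fold cross-validation. The peptide sequences and labels are randomly generated placeholders, and the GA-SAR feature selection and fused encodings of iBitter-Fuse are not reproduced; the example only illustrates the SVM-plus-cross-validation scaffold.

```python
# Hedged sketch, not the authors' pipeline: amino-acid composition (AAC)
# features for placeholder peptides, an SVM classifier, and 10-fold CV.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """Amino-acid composition: fraction of each of the 20 residues in the peptide."""
    return np.array([seq.count(a) / len(seq) for a in AMINO_ACIDS])

rng = np.random.default_rng(2)
peptides = ["".join(rng.choice(list(AMINO_ACIDS), size=10)) for _ in range(200)]  # placeholders
labels = rng.integers(0, 2, size=200)                     # 1 = "bitter" (invented labels)

X = np.vstack([aac(p) for p in peptides])
svm = SVC(kernel="rbf", C=1.0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
print(cross_val_score(svm, X, labels, cv=cv).mean())      # near 0.5 on random labels, as expected
```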


2021 ◽  
Author(s):  
Sebastian Johannes Fritsch ◽  
Konstantin Sharafutdinov ◽  
Moein Einollahzadeh Samadi ◽  
Gernot Marx ◽  
Andreas Schuppert ◽  
...  

BACKGROUND: During the course of the COVID-19 pandemic, a variety of machine learning models were developed to predict different aspects of the disease, such as the long-term course, organ dysfunction, or ICU mortality. The number of training datasets used has increased significantly over time. However, these data now come from different waves of the pandemic, which did not always involve the same therapeutic approaches and which differed in outcomes. The impact of these changes on model development has not yet been studied.
OBJECTIVE: The aim of this investigation was to examine how models trained on data from one wave perform when predicting the other wave's data, and to assess the impact of pooling these datasets. Finally, a method for comparing different datasets with respect to heterogeneity is introduced.
METHODS: We used datasets from waves one and two to develop several models predicting patient mortality. Four classification algorithms were used: logistic regression (LR), support vector machine (SVM), random forest classifier (RF), and AdaBoost classifier (ADA). We also performed mutual prediction, applying each model to the data of the wave not used for training. We then compared the performance of the models when a pooled dataset from the two waves was used. The populations from the different waves were checked for heterogeneity using a convex hull analysis.
RESULTS: 63 patients from wave one (03-06/2020) and 54 from wave two (08/2020-01/2021) were evaluated. For each wave separately, we found models reaching sufficient accuracies, up to 0.79 AUROC (95% CI 0.76-0.81) for SVM on the first wave and up to 0.88 AUROC (95% CI 0.86-0.89) for RF on the second wave. After pooling the data, the AUROC decreased markedly. In the mutual prediction, models trained on the second wave's data, when applied to the first wave's data, predicted non-survivors well but classified survivors insufficiently. The opposite setting (training on the first wave, testing on the second) showed the inverse behaviour, with models correctly classifying survivors and incorrectly predicting non-survivors. The convex hull analysis for the first- and second-wave populations showed a more inhomogeneous distribution of the underlying data than randomly selected sets of patients of the same size.
CONCLUSIONS: Our work demonstrates that a larger dataset is not a universal solution to all machine learning problems in clinical settings. Rather, it shows that inhomogeneous data used to develop models can lead to serious problems. With the convex hull analysis, we offer a solution to this problem: the outcome of such an analysis can raise concerns that pooling different datasets would introduce inhomogeneous patterns preventing better predictive performance.
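The cross-wave evaluation idea can be sketched as follows. The two synthetic "waves" below merely stand in for the real cohorts: their sizes echo the abstract, but the features, outcome rule, and distribution shift are invented, and the convex hull analysis is not reproduced.

```python
# Hedged sketch of the cross-wave evaluation: train each classifier on one wave
# and score it on the other with AUROC. Everything here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

def make_wave(n, shift):
    X = rng.normal(loc=shift, size=(n, 10))                      # shift mimics between-wave heterogeneity
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > shift).astype(int)
    return X, y

X1, y1 = make_wave(63, 0.0)      # "wave one"
X2, y2 = make_wave(54, 1.0)      # "wave two", shifted case mix

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),
    "RF": RandomForestClassifier(random_state=3),
    "ADA": AdaBoostClassifier(random_state=3),
}
for name, model in models.items():
    model.fit(X1, y1)                                            # train on wave one
    auc = roc_auc_score(y2, model.predict_proba(X2)[:, 1])       # test on wave two
    print(name, round(auc, 2))
```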


1989 ◽  
Vol 111 (1) ◽  
pp. 73-80 ◽  
Author(s):  
J. K. Paeng ◽  
J. S. Arora

A basic hypothesis of this paper is that the multiplier methods can be effective and efficient for dynamic response optimization of large scale systems. The methods have been previously shown to be inefficient compared to the primal methods for static response applications. However, they can be more efficient for dynamic response applications because they collapse all time-dependent constraints and the cost function to one functional. This can result in substantial savings in the computational effort during design sensitivity analysis. To investigate this hypothesis, an augmented functional for the dynamic response optimization problem is defined. Design sensitivity analysis for the functional is developed and three example problems are solved to investigate computational aspects of the multiplier methods. It is concluded that multiplier methods can be effective for dynamic response problems but need numerical refinements to avoid convergence difficulties in unconstrained minimization.
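For orientation, a generic augmented functional of the kind alluded to above can be written as follows; this is the standard augmented Lagrangian treatment of a single time-dependent inequality constraint, with multiplier theta(t) and penalty parameter r, and is an illustrative form rather than the paper's exact functional.

```latex
% Generic augmented functional for cost f(b) and a time-dependent inequality
% constraint g(b, z(t), t) <= 0; theta(t) is the multiplier, r the penalty.
\Phi_A(\mathbf{b}) = f(\mathbf{b})
  + \frac{1}{2r}\int_{0}^{T}
      \Big( \big[\max\big(0,\; \theta(t) + r\, g(\mathbf{b}, \mathbf{z}(t), t)\big)\big]^{2}
            - \theta(t)^{2} \Big)\, dt
```

Because all time points of the constraint are absorbed into a single scalar functional, design sensitivity analysis needs only the gradient of this one functional rather than gradients of every time-discretized constraint, which is where the computational savings discussed above arise.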


2021 ◽  
Author(s):  
Herdiantri Sufriyana ◽  
Yu Wei Wu ◽  
Emily Chia-Yu Su

This protocol aims to develop, validate, and deploy a prediction model using high-dimensional data by both human and machine learning. It is intended for clinical prediction by healthcare providers, including but not limited to those using medical histories from electronic health records. The protocol applies diverse approaches to improve both predictive performance and interpretability while maintaining the generalizability of model evaluation. However, some steps require substantial computational capacity; without it, they take considerably longer. The key stages consist of the design of data collection and analysis, feature discovery and quality control, and model development, validation, and deployment.
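As a deliberately simplified illustration of the model development and validation stages only, the sketch below chains a quality-control filter, univariate feature discovery, and a classifier in a scikit-learn pipeline scored by cross-validation. The data, thresholds, and estimator choices are assumptions, not the protocol's prescriptions.

```python
# Hedged sketch of the development/validation stages: quality control, feature
# discovery, and a classifier in one pipeline, evaluated with cross-validated AUROC.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 1000))                 # high-dimensional predictors (e.g. EHR-derived)
y = (X[:, 0] - X[:, 1] + rng.normal(size=500) > 0).astype(int)

pipe = Pipeline([
    ("qc", VarianceThreshold(threshold=0.0)),    # quality control: drop constant features
    ("discover", SelectKBest(f_classif, k=50)),  # feature discovery: keep 50 strongest signals
    ("model", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean())
```

Keeping the selection steps inside the pipeline means they are refit within each cross-validation fold, which preserves the generalizability of the evaluation emphasized in the protocol.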


2020 ◽  
Author(s):  
Patrick Schratz ◽  
Jannes Muenchow ◽  
Eugenia Iturritxa ◽  
José Cortés ◽  
Bernd Bischl ◽  
...  

This study analyzed highly correlated, feature-rich datasets from hyperspectral remote sensing data using multiple machine-learning and statistical-learning methods. The effect of filter-based feature-selection methods on predictive performance was compared, and the effect of multiple expert-based and data-driven feature sets derived from the reflectance data was investigated. Defoliation of trees (%) was modeled as a function of reflectance, and variable importance was assessed using permutation-based feature importance. Overall, the support vector machine (SVM) outperformed the other learners, such as random forest (RF), extreme gradient boosting (XGBoost), lasso (L1) and ridge (L2) regression, by at least three percentage points. Combining certain feature sets yielded small increases in predictive performance, while no substantial differences between individual feature sets were observed. For some combinations of learners and feature sets, filter methods achieved better predictive performance than the unfiltered feature sets, while ensemble filters did not have a substantial impact on performance. Permutation-based feature importance estimated features around the red edge to be the most important for the models. However, the presence of features in the near-infrared region (800 nm - 1000 nm) was essential to achieve the best performances. More training data and replication in similar benchmarking studies are needed for more generalizable conclusions. Filter methods have the potential to be helpful in high-dimensional situations and can improve the interpretation of feature effects in fitted models, which is an essential constraint in environmental modeling studies.
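A hedged sketch of the filter-then-learner setup is shown below: reflectance bands are ranked by a simple univariate filter, the top bands are kept, and an SVM regressor predicts the defoliation percentage. The band count, the choice of filter, and the synthetic data are illustrative and do not correspond to the study's configuration.

```python
# Hedged sketch: filter-based feature selection followed by an SVM regressor
# for a defoliation-like regression target, scored with cross-validated RMSE.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 200))                                   # 200 synthetic reflectance bands
y = 40 + 10 * X[:, 120] + 5 * X[:, 130] + rng.normal(size=300)    # target driven by a few bands

pipe = Pipeline([
    ("filter", SelectKBest(f_regression, k=25)),                  # filter-based feature selection
    ("svm", SVR(kernel="rbf", C=10.0)),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="neg_root_mean_squared_error").mean())
```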


Author(s):  
Amparo Alonso-Betanzos ◽  
Verónica Bolón-Canedo ◽  
Diego Fernández-Francos ◽  
Iago Porto-Díaz ◽  
Noelia Sánchez-Maroño

With the advent of high dimensionality, machine learning researchers are now interested not only in accuracy but also in the scalability of algorithms. When dealing with large databases, pre-processing techniques are required to reduce input dimensionality, and machine learning can take advantage of feature selection, which consists of selecting the relevant features and discarding irrelevant ones with a minimum degradation in performance. In this chapter, we will review the most up-to-date feature selection methods, focusing on their scalability properties. Moreover, we will show how these learning methods are enhanced when applied to large-scale datasets and, finally, present some examples of the application of feature selection to real-world databases.


2021 ◽  
pp. 1-10
Author(s):  
Lei Shu ◽  
Kun Huang ◽  
Wenhao Jiang ◽  
Wenming Wu ◽  
Hongling Liu

Using real-world data directly in machine learning tasks easily leads to poor generalization, since such data are usually high-dimensional and limited in quantity. By learning low-dimensional representations of high-dimensional data, feature selection can retain the features useful for machine learning tasks, and using these features allows machine learning models to be trained effectively. Feature selection from high-dimensional data is therefore a challenge. To address this issue, this paper proposes a novel feature selection method based on a hybrid of an autoencoder and Bayesian methods. First, Bayesian methods are embedded in the proposed autoencoder as a special hidden layer; this is done to increase the precision of selecting non-redundant features. Then, the other hidden layers of the autoencoder are used for non-redundant feature selection. Finally, the proposed method is compared with mainstream feature selection approaches and outperforms them. We find that combining autoencoders with probabilistic correction methods is more meaningful for feature selection than stacking architectures or adding constraints to autoencoders. We also show that stacked autoencoders are more suitable for large-scale feature selection, whereas sparse autoencoders are beneficial when only a small number of features are to be selected. The proposed method thus provides a theoretical reference for analyzing the optimality of feature selection.
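A rough stand-in for the idea, using only scikit-learn, is sketched below: a plain multilayer perceptron is trained to reconstruct its input (an autoencoder-style objective), and each input feature is scored by the magnitude of its first-layer weights. The Bayesian hidden layer described in the abstract is not reproduced, and the data are synthetic.

```python
# Hedged stand-in for the paper's method: an MLP trained to reconstruct its
# input, with features scored by the L2 norm of their encoder weights.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 30))
X[:, :5] += 2.0 * rng.normal(size=(1000, 1))     # first five features share a latent factor
X = StandardScaler().fit_transform(X)

# Train X -> X with a narrow bottleneck acting as the low-dimensional code
ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=6)
ae.fit(X, X)

encoder_weights = ae.coefs_[0]                   # shape: (n_features, n_hidden)
scores = np.linalg.norm(encoder_weights, axis=1) # feature score = encoder weight magnitude
print(np.argsort(scores)[::-1][:5])              # indices of the top-ranked features
```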


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Jianming Ye ◽  
He Huang ◽  
Weiwei Jiang ◽  
Xiaomei Xu ◽  
Chun Xie ◽  
...  

Glioma is one of the most common and deadly malignant brain tumors originating from glial cells. For personalized treatment, an accurate preoperative prognosis for glioma patients is highly desired. Recently, various machine learning-based approaches have been developed to predict the prognosis based on preoperative magnetic resonance imaging (MRI) radiomics, which extracts quantitative features from radiographic images. However, major methodologic challenges remain in optimizing feature extraction and providing rapid information flow in clinical settings. This study investigates two machine learning-based prognosis prediction tasks using radiomic features extracted from preoperative multimodal MRI brain data: (i) prediction of tumor grade (higher-grade vs. lower-grade gliomas) from preoperative MRI scans and (ii) prediction of patient overall survival (OS) in higher-grade gliomas (<12 months vs. >12 months) from preoperative MRI scans. Specifically, these two tasks utilize conventional machine learning-based models built with various classifiers. Moreover, feature selection methods are applied to increase model performance and decrease computational costs. In the experiments, the models are evaluated in terms of their predictive performance and stability using a bootstrap approach. Experimental results show that the choice of classifier and feature selection technique plays a significant role in model performance and stability for both tasks; a variability analysis indicates that the choice of classification method is the most dominant source of performance variation for both tasks.
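The bootstrap stability check can be sketched as follows: a feature selector and classifier are refit on bootstrap resamples of synthetic "radiomic" features, out-of-bag AUROC is recorded for each replicate, and its spread summarizes stability. The feature counts, selector, and classifier are assumptions made for illustration.

```python
# Hedged sketch of a bootstrap stability evaluation for a selector + classifier
# pipeline on synthetic radiomic-like features.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 300))                          # synthetic radiomic features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)   # e.g. higher vs. lower grade

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=7)),
])

aucs = []
for _ in range(50):                                      # bootstrap replicates
    idx = rng.integers(0, len(y), size=len(y))           # resample with replacement
    oob = np.setdiff1d(np.arange(len(y)), idx)           # out-of-bag samples as the test set
    pipe.fit(X[idx], y[idx])
    aucs.append(roc_auc_score(y[oob], pipe.predict_proba(X[oob])[:, 1]))
print(np.mean(aucs), np.std(aucs))                       # spread across replicates = stability
```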

