Machine Learning-Based Sensitivity of Steel Frames with Highly Imbalanced and High-Dimensional Data

2021 ◽  
Author(s):  
Hyeyoung Koh ◽  
Hannah Beth Blum

A machine learning-based feature selection approach is presented to estimate the effect of uncertainties and identify failure modes of structures characterized by a low failure probability and high-dimensional uncertainties. As structures are designed to fail rarely, a dataset classified by failure status becomes imbalanced, which poses a challenge for predictive modeling with machine learning classifiers. Moreover, to improve the accuracy and efficiency of model performance, it is necessary to distinguish critical factors from redundant ones, especially for a large feature set. This study benchmarks a novel method for sensitivity analysis using datasets that exacerbate the problems of class imbalance and large input feature sets. Two planar steel frames with spatially uncorrelated properties between structural members are investigated. Geometric and material properties are considered as uncertainties, including material yield stress, Young's modulus, frame sway, and residual stress. Six feature importance techniques, including ANOVA, mRMR, Spearman's rank correlation, impurity-based, permutation, and SHAP, are employed to measure feature importance and identify parameters germane to the prediction of structural failures. Logistic regression and decision tree models are trained on the important feature set, and their predictive performance is evaluated. The use of the feature importance approach for structures with a low probability of failure and a large number of uncertain parameters is validated by results identical to those of a reliability-based sensitivity study and by appropriate predictive accuracy.
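As a toy illustration of one of the six techniques named above, the following is a minimal sketch of permutation importance: shuffle one feature's column and measure how much the model's score drops. The data, the threshold "model", and the feature names are invented for illustration and are not taken from the study.

```python
# Hypothetical sketch of permutation importance on synthetic frame data.
# The distributions, threshold rule, and feature roles are illustrative only.
import random

random.seed(0)

# Synthetic samples: (yield_stress, youngs_modulus); failure driven by low yield stress.
X = [(random.gauss(350, 25), random.gauss(200, 10)) for _ in range(400)]
y = [1 if ys < 330 else 0 for ys, _ in X]  # 1 = failure (minority class)

def predict(row):
    # Stand-in "model": a threshold rule on the first feature only.
    return 1 if row[0] < 330 else 0

def accuracy(X, y):
    return sum(predict(r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(X, y, feature):
    base = accuracy(X, y)
    col = [r[feature] for r in X]
    random.shuffle(col)  # break the feature-target association
    Xp = [tuple(col[i] if j == feature else v for j, v in enumerate(r))
          for i, r in enumerate(X)]
    return base - accuracy(Xp, y)  # score drop = importance

imp = [permutation_importance(X, y, f) for f in range(2)]
print(imp)  # yield stress should dominate; the unused feature scores zero
```

Because the stand-in model ignores the second feature entirely, its permutation importance is exactly zero, while shuffling the first feature degrades accuracy noticeably.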

2021 ◽  
Author(s):  
Hyeyoung Koh ◽  
Hannah Beth Blum

This study presents a machine learning-based approach for sensitivity analysis to examine how parameters affect a given structural response while accounting for uncertainty. Reliability-based sensitivity analysis involves repeated evaluations of the performance function incorporating uncertainties to estimate the influence of a model parameter, which can lead to prohibitive computational costs. This challenge is exacerbated for large-scale engineering problems, which often carry a large quantity of uncertain parameters. The proposed approach is based on feature selection algorithms that rank feature importance and remove redundant predictors during model development, improving model generality and training performance by focusing only on the significant features. The approach allows performing sensitivity analysis of structural systems by providing feature rankings with reduced computational effort. The proposed approach is demonstrated with two designs of a two-bay, two-story planar steel frame with different failure modes: inelastic instability of a single member and progressive yielding. The feature variables in the data are uncertainties including material yield strength, Young's modulus, frame sway imperfection, and residual stress. The Monte Carlo sampling method is utilized to generate random realizations of the frames from published distributions of the feature parameters, and the response variable is the frame ultimate strength obtained from finite element analyses. Decision trees are trained to identify important features. Feature rankings are derived by four feature selection techniques: impurity-based, permutation, SHAP, and Spearman's correlation. Predictive performance of the model including the important features is discussed using an evaluation metric suited to imbalanced datasets, the Matthews correlation coefficient.
Finally, the results are compared with those from reliability-based sensitivity analysis on the same example frames to show the validity of the feature selection approach. As the proposed machine learning-based approach produces the same results as the reliability-based sensitivity analysis with improved computational efficiency and accuracy, it could be extended to other structural systems.
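The Matthews correlation coefficient mentioned above has a closed form over the confusion matrix, which makes it easy to sketch. The counts below are invented for illustration and mimic the kind of imbalance the study describes (few failures among many safe frames).

```python
# Matthews correlation coefficient (MCC) from confusion-matrix counts.
# Counts are illustrative, not from the study.
import math

def mcc(tp, tn, fp, fn):
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# A highly imbalanced case: 990 safe frames, 10 failures.
print(mcc(tp=8, tn=985, fp=5, fn=2))
```

Unlike plain accuracy, which would be near 99% here simply because negatives dominate, MCC stays sensitive to the handful of minority-class errors, which is why it suits imbalanced failure data.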


2020 ◽  
Author(s):  
John Booth ◽  
Ben Margetts ◽  
William Bryant ◽  
Richard Issitt ◽  
John Ciaran Hutchinson ◽  
...  

Introduction: Sudden unexpected death in infancy (SUDI) represents the commonest presentation of postneonatal death, yet despite full postmortem examination (autopsy), the cause of death is determined in only around 45% of cases, the majority remaining unexplained. In order to aid counselling and understand how to improve the investigation, we explored whether machine learning could be used to derive data-driven insights for prediction of infant autopsy outcome. Methods: A paediatric autopsy database containing >7,000 cases in total, with >300 variables per case, was analysed with cases categorised both by stage of examination (external, internal, and internal with histology) and by autopsy outcome, classified as explained (medical cause of death identified) or unexplained. For the purposes of this study, only cases from infant and child deaths aged ≤2 years were included (N=3,100). Following this, decision tree, random forest, and gradient boosting models were iteratively trained and evaluated for each stage of the postmortem examination and compared using predictive accuracy metrics. Results: Data from 3,100 infant and young child autopsies were included. The naive decision tree model using initial external examination data had a predictive performance of 68% for determining whether a medical cause of death could be identified. Model performance increased when internal examination data were included, and a core set of data items was identified using model feature importance as key variables for determining autopsy outcome. The most effective model was XGBoost, with an overall predictive performance of 80%, identifying age at death and cardiovascular or respiratory histological findings as the most important variables associated with determining cause of death.
Conclusion: This study demonstrates the feasibility of using machine learning models to objectively determine component importance of complex medical procedures, in this case infant autopsy, to inform clinical practice. It further highlights the value of collecting routine clinical procedural data according to defined standards. This approach can be applied to a wide range of clinical and operational healthcare scenarios, providing objective, evidence-based information for uses such as counselling, decision making, and policy development.


2021 ◽  
pp. 1-10
Author(s):  
I. Krug ◽  
J. Linardon ◽  
C. Greenwood ◽  
G. Youssef ◽  
J. Treasure ◽  
...  

Abstract Background Despite a wide range of proposed risk factors and theoretical models, prediction of eating disorder (ED) onset remains poor. This study undertook the first comparison of two machine learning (ML) approaches [penalised logistic regression (LASSO), and prediction rule ensembles (PREs)] to conventional logistic regression (LR) models to enhance prediction of ED onset and differential ED diagnoses from a range of putative risk factors. Method Data were part of a European Project and comprised 1402 participants, 642 ED patients [52% with anorexia nervosa (AN) and 40% with bulimia nervosa (BN)] and 760 controls. The Cross-Cultural Risk Factor Questionnaire, which assesses retrospectively a range of sociocultural and psychological ED risk factors occurring before the age of 12 years (46 predictors in total), was used. Results All three statistical approaches had satisfactory model accuracy, with an average area under the curve (AUC) of 86% for predicting ED onset and 70% for predicting AN v. BN. Predictive performance was greatest for the two regression methods (LR and LASSO), although the PRE technique relied on fewer predictors with comparable accuracy. The individual risk factors differed depending on the outcome classification (EDs v. non-EDs and AN v. BN). Conclusions Even though the conventional LR performed comparably to the ML approaches in terms of predictive accuracy, the ML methods produced more parsimonious predictive models. ML approaches offer a viable way to modify screening practices for ED risk that balance accuracy against participant burden.
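The penalised logistic regression (LASSO) compared above can be sketched with a hand-rolled proximal gradient loop: a gradient step on the logistic loss followed by soft-thresholding, which is what drives uninformative predictors to zero and yields the parsimony the abstract highlights. The toy data, hyperparameters, and loop below are illustrative assumptions, not the study's actual model (which would use an established solver).

```python
# Illustrative sketch of L1-penalised (LASSO) logistic regression via
# proximal gradient descent. Synthetic two-feature data: the first feature
# drives the outcome, the second is pure noise.
import math
import random

random.seed(1)

X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(300)]
y = [1 if x1 + random.gauss(0, 0.3) > 0 else 0 for x1, _ in X]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink toward zero by t.
    return math.copysign(max(abs(w) - t, 0.0), w)

def lasso_logreg(X, y, lam=0.05, lr=0.1, steps=500):
    w = [0.0, 0.0]
    n = len(y)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for row, t in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, row)))
            for j in range(2):
                grad[j] += (p - t) * row[j] / n
        # Gradient step on the logistic loss, then the L1 proximal step.
        w = [soft_threshold(wj - lr * gj, lr * lam)
             for wj, gj in zip(w, grad)]
    return w

w = lasso_logreg(X, y)
print(w)  # the noise feature's weight should be shrunk toward zero
```

The shrinkage step is the mechanism behind the "more parsimonious predictive models" the study reports: predictors whose gradient never exceeds the penalty are held at exactly zero.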


2019 ◽  
Author(s):  
Donald Salami ◽  
Carla Alexandra Sousa ◽  
Maria do Rosário Oliveira Martins ◽  
César Capinha

Abstract The geographical spread of dengue is a global public health concern. This is largely mediated by the importation of dengue from endemic to non-endemic areas via the increasing connectivity of the global air transport network. The dynamic nature and intrinsic heterogeneity of the air transport network make it challenging to predict dengue importation. Here, we explore the capabilities of state-of-the-art machine learning algorithms to predict dengue importation. We trained four machine learning classification algorithms, using six years of historical dengue importation data for 21 countries in Europe, connectivity indices mediating importation, and air transport network centrality measures. Predictive performance of the classifiers was evaluated using the area under the receiver operating characteristic curve, sensitivity, and specificity measures. Finally, we applied practical model-agnostic methods to provide an in-depth explanation of our optimal model's predictions on a global and local scale. Our best performing model achieved high predictive accuracy, with an area under the receiver operating characteristic curve score of 0.94 and a maximized sensitivity score of 0.88. The predictor variables identified as most important were the source country's dengue incidence rate, population size, and volume of air passengers. Network centrality measures, describing the positioning of European countries within the air travel network, were also influential to the predictions. We demonstrated the high predictive performance of a machine learning model in predicting dengue importation and the utility of model-agnostic methods in offering a comprehensive understanding of the reasons behind the predictions. Similar approaches can be utilized in the development of an operational early warning surveillance system for dengue importation.
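The headline metric above, the area under the ROC curve, can be sketched via its rank (Mann-Whitney) formulation: the probability that a randomly chosen positive is scored above a randomly chosen negative. The scores and labels below are invented for illustration, not the dengue data.

```python
# AUC of the ROC curve via the Mann-Whitney (pairwise ranking) formulation.
# Scores and labels are illustrative only.
def roc_auc(scores, labels):
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    # Fraction of positive/negative pairs ranked correctly (ties count half).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
print(roc_auc(scores, labels))
```

A score of 0.94, as reported for the best model, would mean 94% of importation/non-importation pairs are ranked in the right order.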


Author(s):  
Kacper Sokol ◽  
Peter Flach

Understanding data, models and predictions is important for machine learning applications. Due to the limitations of our spatial perception and intuition, analysing high-dimensional data is inherently difficult. Furthermore, black-box models achieving high predictive accuracy are widely used, yet the logic behind their predictions is often opaque. The use of textualisation -- a natural language narrative of selected phenomena -- can tackle these shortcomings. When extended with argumentation theory, we could envisage machine learning models and predictions arguing persuasively for their choices.


2019 ◽  
Vol 2019 ◽  
pp. 1-9 ◽  
Author(s):  
Patricio Wolff ◽  
Manuel Graña ◽  
Sebastián A. Ríos ◽  
Maria Begoña Yarza

Background. Hospital readmission prediction in pediatric hospitals has received little attention. Studies have focused on readmission frequency analysis stratified by disease and demographic/geographic characteristics, but there are no predictive modeling approaches, which may be useful to identify preventable readmissions that constitute a major portion of the cost attributed to readmissions. Objective. To assess the all-cause readmission predictive performance achieved by machine learning techniques in the emergency department of a pediatric hospital in Santiago, Chile. Materials. An all-cause admissions dataset was collected over six consecutive years in a pediatric hospital in Santiago, Chile. The variables collected are the same as those used to determine the child's treatment administrative cost. Methods. Retrospective predictive analysis of 30-day readmission was formulated as a binary classification problem. We report classification results achieved with various model building approaches after data curation and preprocessing for correction of class imbalance. We computed repeated cross-validation (RCV) with a decreasing number of folds to assess performance and sensitivity to the effect of imbalance in the test set and training set size. Results. The increase in recall due to SMOTE class imbalance correction is large and statistically significant. The Naive Bayes (NB) approach achieves the best AUC (0.65); however, the shallow multilayer perceptron has the best PPV and f-score (5.6 and 10.2, resp.). The NB and support vector machine (SVM) approaches give comparable results if we consider AUC, PPV, and f-score rankings for all RCV experiments. The high recall of the deep multilayer perceptron is due to a high false positive ratio. There is no detectable effect of the number of folds in the RCV on the predictive performance of the algorithms. Conclusions.
We recommend the use of Naive Bayes (NB) with Gaussian distribution model as the most robust modeling approach for pediatric readmission prediction, achieving the best results across all training dataset sizes. The results show that the approach could be applied to detect preventable readmissions.
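The SMOTE correction credited above for the recall gains works by synthesising new minority-class samples along line segments between a minority point and one of its minority-class nearest neighbours. The toy 2-D points and helper below are a hypothetical stand-in for the real implementation (in practice, imbalanced-learn's SMOTE), just to show the interpolation idea.

```python
# Minimal SMOTE-style oversampling sketch: new minority samples are convex
# combinations of a minority point and one of its k nearest minority
# neighbours. Points are illustrative 2-D stand-ins for patient features.
import random

random.seed(2)

def smote(minority, n_new, k=3):
    synthetic = []
    for _ in range(n_new):
        a = random.choice(minority)
        # k nearest minority neighbours of a (brute force, 2-D points)
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: (p[0] - a[0]) ** 2 + (p[1] - a[1]) ** 2)[:k]
        b = random.choice(neighbours)
        gap = random.random()
        synthetic.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Toy minority class: four "readmitted" cases in a 2-D feature space.
readmitted = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (1.1, 2.1)]
new_points = smote(readmitted, n_new=8)
print(len(new_points))
```

Because every synthetic point lies between two real minority points, the oversampled class stays inside the minority region rather than duplicating records, which is why recall improves without simply memorising the originals.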


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Hooman Zabeti ◽  
Nick Dexter ◽  
Amir Hosein Safari ◽  
Nafiseh Sedaghat ◽  
Maxwell Libbrecht ◽  
...  

Abstract Motivation Prediction of drug resistance and identification of its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Solving this problem requires a transparent, accurate, and flexible predictive model. The methods currently used for this purpose rarely satisfy all of these criteria. On the one hand, approaches based on testing strains against a catalogue of previously identified mutations often yield poor predictive performance; on the other hand, machine learning techniques typically have higher predictive accuracy, but often lack interpretability and may learn patterns that produce accurate predictions for the wrong reasons. Current interpretable methods may either exhibit a lower accuracy or lack the flexibility needed to generalize them to previously unseen data. Contribution In this paper we propose a novel technique, inspired by group testing and Boolean compressed sensing, which yields highly accurate predictions, interpretable results, and is flexible enough to be optimized for various evaluation metrics at the same time. Results We test the predictive accuracy of our approach on five first-line and seven second-line antibiotics used for treating tuberculosis. We find that it has a higher or comparable accuracy to that of commonly used machine learning models, and is able to identify variants in genes with previously reported association to drug resistance. Our method is intrinsically interpretable, and can be customized for different evaluation metrics. Our implementation is available at github.com/hoomanzabeti/INGOT_DR and can be installed via the Python Package Index (PyPI) as ingotdr. This package is also compatible with most of the tools in the scikit-learn machine learning library.


2022 ◽  
Vol 9 (1) ◽  
pp. 0-0

This article investigates the impact of data-complexity and team-specific characteristics on machine learning competition scores. Data from five real-world binary classification competitions hosted on Kaggle.com were analyzed. The data-complexity characteristics were measured in four aspects: standard measures, sparsity measures, class imbalance measures, and feature-based measures. The results showed that the higher the level of the data-complexity characteristics, the lower the predictive ability of the machine learning model. Our empirical evidence revealed that the imbalance ratio of the target variable was the most important factor and exhibited a nonlinear relationship with the model's predictive ability: the imbalance ratio adversely affected predictive performance once it reached a certain level. However, mixed results were found for the impact of team-specific characteristics, measured by team size, team expertise, and the number of submissions, on team performance. For high-performing teams, these factors had no impact on team score.
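The imbalance ratio identified above as the most important factor is commonly computed as the majority-class count divided by the minority-class count. A minimal sketch, with an invented target column for illustration:

```python
# Class imbalance ratio of a binary target: majority count / minority count.
# The label column below is illustrative, not from the Kaggle competitions.
from collections import Counter

def imbalance_ratio(labels):
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

labels = [0] * 90 + [1] * 10  # 90 negatives, 10 positives
print(imbalance_ratio(labels))
```

A ratio of 1.0 indicates a perfectly balanced target; the article's finding is that predictive performance degrades once this ratio passes a certain level.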


2021 ◽  
Author(s):  
Herdiantri Sufriyana ◽  
Yu Wei Wu ◽  
Emily Chia-Yu Su

Abstract This protocol aims to develop, validate, and deploy a prediction model using high-dimensional data by both human and machine learning. It is intended for clinical prediction by healthcare providers, including but not limited to those using medical histories from electronic health records. This protocol applies diverse approaches to improve both predictive performance and interpretability while maintaining the generalizability of model evaluation. However, some steps require substantial computational capacity; otherwise, they will take a longer time. The key stages consist of the design of data collection and analysis, feature discovery and quality control, and model development, validation, and deployment.



