Feature Importance Methods: Details and Usage Examples

A convolution based computational approach towards DNA N6-methyladenine site identification and motif extraction in rice genome

Scientific Reports ◽

10.1038/s41598-021-89850-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Chowdhury Rafeed Rahman ◽

Ruhul Amin ◽

Swakkhar Shatabda ◽

Md. Sadrul Islam Toaha

Keyword(s):

Operating Characteristic ◽

Adenine Nucleotide ◽

Characteristic Curve ◽

Rice Genome ◽

Plant Genome ◽

Feature Importance ◽

Importance Analysis ◽

Motif Extraction ◽

Site Identification ◽

By Products

Abstract: DNA N6-methylation (6mA) of adenine nucleotides is a post-replication modification responsible for many biological functions. Automated and accurate computational methods can help identify 6mA sites in long genomes, saving significant time and money. Our study develops i6mA-CNN, a convolutional neural network (CNN) based tool capable of identifying 6mA sites in the rice genome. Our model combines multiple types of features: a customized feature vector inspired by PseAAC (Pseudo Amino Acid Composition), multiple one-hot representations, and dinucleotide physicochemical properties. It achieves an auROC (area under the Receiver Operating Characteristic curve) score of 0.98 with an overall accuracy of 93.97% using fivefold cross-validation on a benchmark dataset. Finally, we evaluate our model on three other plant genome 6mA site identification test datasets. The results suggest that the proposed tool generalizes its 6mA site identification ability to plant genomes irrespective of plant species. An algorithm for potential motif extraction and a feature importance analysis procedure are two by-products of this research. The web tool for this research can be found at: https://cutt.ly/dgp3QTR.
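As a rough illustration of what a one-hot representation of a DNA sequence looks like, here is a minimal Python sketch. It is not the encoding used by i6mA-CNN, whose exact representations are not given in this abstract; the function name `one_hot_encode`, the A/C/G/T column order, and the all-zero row for unknown symbols are assumptions made for this example.

```python
# Hypothetical helper for illustration only; not taken from i6mA-CNN.
NUCLEOTIDES = "ACGT"  # assumed column order


def one_hot_encode(sequence):
    """Return one 4-element row per nucleotide.

    Symbols outside A/C/G/T (for example N) map to an all-zero row.
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("GATN")
for base, row in zip("GATN", encoded):
    print(base, row)
```

A matrix of this shape (sequence length by 4) is the kind of input a one-dimensional convolution can slide over.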

Unbiased Measurement of Feature Importance in Tree-Based Methods

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3429445 ◽

2021 ◽

Vol 15 (2) ◽

pp. 1-21

Author(s):

Zhengze Zhou ◽

Giles Hooker

Keyword(s):

Feature Importance

Machine learning augmented predictive and generative model for rupture life in ferritic and austenitic steels

npj Materials Degradation ◽

10.1038/s41529-021-00166-5 ◽

2021 ◽

Vol 5 (1) ◽

Author(s):

Osman Mamun ◽

Madison Wenzlick ◽

Arun Sathanur ◽

Jeffrey Hawk ◽

Ram Devanathan

Keyword(s):

Pearson Correlation ◽

Rupture Life ◽

Model Performance ◽

Austenitic Stainless Steels ◽

Generative Model ◽

Austenitic Steels ◽

Gradient Boosting ◽

Variational Autoencoder ◽

Feature Importance ◽

Boosting Algorithm

Abstract: The Larson–Miller parameter (LMP) offers an efficient and fast scheme to estimate the creep rupture life of alloy materials for high-temperature applications; however, poor generalizability and dependence on the constant C often result in sub-optimal performance. In this work, we show that direct rupture life parameterization, without intermediate LMP parameterization, can be used with a gradient boosting algorithm to train ML models that predict rupture life very accurately in a variety of alloys (Pearson correlation coefficient >0.9 for 9–12% Cr and >0.8 for austenitic stainless steels). In addition, the Shapley value was used to quantify feature importance, making the model interpretable by identifying the effect of various features on the model performance. Finally, a variational autoencoder-based generative model was built by conditioning on the experimental dataset, in order to sample from the learnt joint distribution hypothetical synthetic candidate alloys that are present in neither the 9–12% Cr ferritic–martensitic alloy dataset nor the austenitic stainless steel dataset.
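To make the Shapley value idea concrete, the following Python sketch computes exact Shapley values for a single prediction by enumerating every feature coalition. It is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in rather than the paper's gradient boosting model, "absent" features are replaced by a baseline value (one common convention among several), and exact enumeration is practical only for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are set to their baseline value.
    The cost grows as 2**n, so this suits only small n.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(x):
    # Hypothetical model: two additive terms plus one interaction.
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


instance = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for `instance` and the prediction for `baseline`, and the interaction term is split equally between the two features involved.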

Feature-Weighted Sampling for Proper Evaluation of Classification Models

Applied Sciences ◽

10.3390/app11052039 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2039

Author(s):

Hyunseok Shin ◽

Sejong Oh

Keyword(s):

Random Sampling ◽

Sampling Method ◽

Classification Model ◽

Training Set ◽

Test Set ◽

Feature Importance ◽

Proper Training ◽

Machine Learning Applications ◽

Test Sets

Abstract: In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Random sampling is traditionally used to divide datasets. The problem, however, is that the measured performance of the model varies depending on how the training and test sets are divided. Therefore, in this study, we propose an improved sampling method for the accurate evaluation of a classification model. We first generate numerous candidate train/test sets using the R-value-based sampling method. We then evaluate how similar the distribution of each candidate is to that of the whole dataset, and select the candidate with the smallest distribution difference as the final train/test set. Histograms and feature importance are used to evaluate the similarity of distributions. The proposed method produces more appropriate training and test sets than previous sampling methods, including random and non-random sampling.
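The selection step described above (generate many candidate splits, keep the one whose distributions are closest to the whole dataset) can be sketched in Python as follows. This is a simplified illustration, not the authors' method: candidates come from plain random shuffles instead of R-value-based sampling, similarity is measured with per-feature histograms only (no feature importance term), class labels are ignored, and all function names and parameters are assumptions.

```python
import random


def histogram(values, lo, hi, bins):
    """Normalized histogram of values over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # guard against constant features
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def distribution_difference(subset, full, bins=5):
    """Mean L1 distance between per-feature histograms of subset and full."""
    n_features = len(full[0])
    diff = 0.0
    for f in range(n_features):
        column = [row[f] for row in full]
        lo, hi = min(column), max(column)
        h_full = histogram(column, lo, hi, bins)
        h_sub = histogram([row[f] for row in subset], lo, hi, bins)
        diff += sum(abs(a - b) for a, b in zip(h_full, h_sub))
    return diff / n_features


def best_split(data, test_fraction=0.3, candidates=50, seed=0):
    """Try several random splits and keep the one closest to the full data."""
    rng = random.Random(seed)
    n_test = max(1, int(len(data) * test_fraction))
    best = None
    for _ in range(candidates):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        score = (distribution_difference(train, data)
                 + distribution_difference(test, data))
        if best is None or score < best[0]:
            best = (score, train, test)
    return best


# Synthetic two-feature dataset, for demonstration only.
data_rng = random.Random(1)
data = [[data_rng.gauss(0, 1), data_rng.uniform(0, 10)] for _ in range(100)]
score, train, test = best_split(data)
print(len(train), len(test), round(score, 3))
```

For a classification dataset, a natural extension would be to compute the histograms per class so that class balance is also preserved.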

A Robust Method to Measure the Global Feature Importance of Complex Prediction Models

IEEE Access ◽

10.1109/access.2021.3049412 ◽

2021 ◽

pp. 1-1

Author(s):

Xiaohang Zhang ◽

Ling Wu ◽

Zhengren Li ◽

Huayuan Liu

Keyword(s):

Prediction Models ◽

Global Feature ◽

Robust Method ◽

Feature Importance

AB0652 MACHINE LEARNING TO PREDICT EARLY TNF INHIBITOR USERS IN PATIENTS WITH ANKYLOSING SPONDYLITIS

Annals of the Rheumatic Diseases ◽

10.1136/annrheumdis-2020-eular.3743 ◽

2020 ◽

Vol 79 (Suppl 1) ◽

pp. 1620.1-1621

Author(s):

J. Lee ◽

H. Kim ◽

S. Y. Kang ◽

S. Lee ◽

Y. H. Eun ◽

...

Keyword(s):

Machine Learning ◽

Ankylosing Spondylitis ◽

TNF Inhibitors ◽

TNF Inhibitor ◽

ANN Model ◽

Learning Models ◽

Feature Importance ◽

Importance Analysis ◽

Baseline Characteristics ◽

Machine Learning Models

Background: Tumor necrosis factor (TNF) inhibitors are important drugs for treating patients with ankylosing spondylitis (AS). However, they are not used as a first-line treatment for AS. Over 40% of patients respond insufficiently to the first-line treatment, non-steroidal anti-inflammatory drugs (NSAIDs). If we can predict at an earlier phase who will need TNF inhibitors, adequate treatment can be provided at an appropriate time and potential damage can be avoided. There is no precise predictive model at present. Recently, various machine learning methods have shown strong performance in predictions using clinical data.

Objectives: We aim to generate an artificial neural network (ANN) model to predict early TNF inhibitor users among patients with ankylosing spondylitis.

Methods: The baseline demographic and laboratory data of patients who visited the Samsung Medical Center rheumatology clinic from Dec. 2003 to Sep. 2018 were analyzed. Patients were divided into two groups: early TNF inhibitor users, treated with TNF inhibitors within six months of their follow-up (early-TNF users), and the others (non-early-TNF users). Machine learning models were formulated to predict the early-TNF users from the baseline data. Additionally, feature importance analysis was performed to delineate significant baseline characteristics.

Results: The numbers of early-TNF and non-early-TNF users were 90 and 509, respectively. The best-performing ANN model used 3 hidden layers with 50 hidden nodes each; its performance in predicting early-TNF users (area under the curve (AUC) = 0.75) was superior to that of the logistic regression, support vector machine, and random forest models (AUC = 0.72, 0.65, and 0.71, respectively). Feature importance analysis revealed erythrocyte sedimentation rate (ESR), C-reactive protein (CRP), and height as the top significant baseline characteristics for predicting early-TNF users. Among these characteristics, height was revealed by the machine learning models but not by conventional statistical techniques.

Conclusion: Our model displayed superior performance in predicting early TNF users compared with logistic regression and other machine learning models. Machine learning can be a vital tool in predicting treatment response in various rheumatologic diseases.

Disclosure of Interests: None declared
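Since the models above are compared by AUC, a short Python sketch of how that metric can be computed may help. It uses the rank interpretation of AUC: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The labels and scores below are made up for illustration and have no connection to the study's data; the function name `roc_auc` is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    labels are 0/1; ties between a positive and a negative count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 2 positives, 3 negatives.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

Because it depends only on how cases are ranked, AUC is less sensitive to class imbalance than plain accuracy, which is relevant when one group (here 90 early-TNF users) is much smaller than the other (509).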
