modelBuildR: an R package for model building and feature selection with erroneous classifications

PeerJ ◽

10.7717/peerj.10849 ◽

2021 ◽

Vol 9 ◽

pp. e10849

Author(s):

Maximilian Knoll ◽

Jennifer Furkel ◽

Juergen Debus ◽

Amir Abdollahi

Keyword(s):

Feature Selection ◽

Cross Validation ◽

Model Building ◽

Linear Models ◽

Binary Classification ◽

Ground Truth ◽

R Package ◽

Methylation Array ◽

Survival Difference ◽

Error Probabilities

Background: Model building is a crucial part of omics-based biomedical research to transfer classifications and obtain insights into underlying mechanisms. Feature selection is often based on minimizing the error between model predictions and a given classification (maximizing accuracy). Human ratings/classifications, however, may be error-prone, with discordance rates between experts of 5–15%. We therefore evaluate whether a feature pre-filtering step might improve identification of features associated with the true underlying groups. Methods: Data were simulated for up to 100 samples and up to 10,000 features, 10% of which were associated with the ground truth comprising 2–10 normally distributed populations. Binary and semi-quantitative ratings with varying error probabilities were used as classification. For feature preselection, standard cross-validation (V2) was compared to a novel heuristic (V1) applying univariate testing, multiplicity adjustment and cross-validation on switched dependent (classification) and independent (features) variables. Preselected features were used to train logistic regression/linear models (backward selection, AIC). Predictions were compared against the ground truth (ROC, multiclass-ROC). As a use case, multiple feature selection/classification methods were benchmarked against the novel heuristic to identify prognostically different G-CIMP negative glioblastoma tumors from the TCGA-GBM 450k methylation array data cohort, starting from a fuzzy UMAP-based rough and erroneous separation. Results: V1 yielded higher median AUC ranks for two true groups (ground truth), with smaller differences for true graduated differences (3–10 groups). Lower fractions of models were successfully fit with V1. Median AUCs for binary classification and two true groups were 0.91 (range: 0.54–1.00) for V1 (Benjamini-Hochberg) and 0.70 (0.28–1.00) for V2; 13% (n = 616) of V2 models showed AUCs ≤ 50% for 25 samples and 100 features.
For larger numbers of features and samples, median AUCs were 0.75 (range 0.59–1.00) for V1 and 0.54 (range 0.32–0.75) for V2. In the TCGA-GBM data, modelBuildR allowed the best prognostic separation of patients, with the highest median overall survival difference (7.51 months), followed by a difference of 6.04 months for a random forest-based method. Conclusions: The proposed heuristic is beneficial for the retrieval of features associated with two true groups classified with errors. We provide the R package modelBuildR to simplify (comparative) evaluation/application of the proposed heuristic (http://github.com/mknoll/modelBuildR).
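
The multiplicity adjustment named in the abstract (Benjamini-Hochberg) can be illustrated with a minimal sketch. This is not the modelBuildR implementation: the function name `benjamini_hochberg`, the example p-values and the 0.05 cutoff are assumptions chosen for illustration, and the p-values stand in for the output of whatever univariate test is applied per feature.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-feature p-values from a univariate test.
pvals = [0.01, 0.04, 0.03, 0.005, 0.60]
adj = benjamini_hochberg(pvals)
# Keep features whose adjusted p-value passes an assumed 0.05 cutoff.
selected = [i for i, p in enumerate(adj) if p <= 0.05]
print(selected)
```

Features that survive the cutoff would then be passed on to the model-fitting step; the cutoff itself is a tuning choice rather than something the abstract specifies.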

An Empirical Study of Univariate and Genetic Algorithm-Based Feature Selection in Binary Classification with Microarray Data

Cancer Informatics ◽

10.1177/117693510600200016 ◽

2006 ◽

Vol 2 ◽

pp. 117693510600200 ◽

Cited By ~ 16

Author(s):

Michael Lecocke ◽

Kenneth Hess

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Empirical Study ◽

Selection Bias ◽

Microarray Data ◽

Cross Validation ◽

Binary Classification ◽

Error Rates ◽

Misclassification Error ◽

Univariate Approach

Background: We consider both univariate- and multivariate-based feature selection for the problem of binary classification with microarray data. The idea is to determine whether the more sophisticated multivariate approach leads to better misclassification error rates because of the potential to consider jointly significant subsets of genes (but without overfitting the data). Methods: We present an empirical study in which 10-fold cross-validation is applied externally to both a univariate-based and two multivariate- (genetic algorithm (GA)-) based feature selection processes. These procedures are applied with respect to three supervised learning algorithms and six published two-class microarray datasets. Results: Considering all datasets and learning algorithms, the average 10-fold external cross-validation error rates for the univariate-, single-stage GA-, and two-stage GA-based processes are 14.2%, 14.6%, and 14.2%, respectively. We also find that the optimism bias estimates from the GA analyses were half that of the univariate approach, but the selection bias estimates from the GA analyses were 2.5 times that of the univariate results. Conclusions: We find that the 10-fold external cross-validation misclassification error rates were very comparable. Further, we find that a two-stage GA approach did not demonstrate a significant advantage over a single-stage approach. We also find that the univariate approach had higher optimism bias and lower selection bias compared to both GA approaches.
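
"External" cross-validation means that feature selection is repeated inside every fold, using the training rows only, so the held-out samples never influence which features are chosen. The sketch below illustrates that idea under stated assumptions: the mean-difference filter, the nearest-centroid classifier and all function names are stand-ins invented for this example, not the procedures or learners used in the study.

```python
def kfold_indices(n, k):
    """Split range(n) into k disjoint test folds (round-robin, no shuffling)."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds


def mean_diff_select(X, y, n_keep):
    """Univariate filter: rank features by |mean(class 1) - mean(class 0)|."""
    scores = []
    for j in range(len(X[0])):
        g0 = [row[j] for row, lab in zip(X, y) if lab == 0]
        g1 = [row[j] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(sum(g1) / len(g1) - sum(g0) / len(g0)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:n_keep]


def nearest_centroid_predict(X_train, y_train, x, feats):
    """Assign x to the class whose centroid is closest on the chosen features."""
    dist = {}
    for lab in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == lab]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in feats]
        dist[lab] = sum((x[j] - c) ** 2 for j, c in zip(feats, centroid))
    return min(dist, key=dist.get)


def external_cv_error(X, y, k, n_keep):
    """Misclassification rate with feature selection redone in every fold."""
    errors = 0
    for test_idx in kfold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Selection sees training rows only; this is what makes the CV external.
        feats = mean_diff_select(X_tr, y_tr, n_keep)
        for i in test_idx:
            errors += nearest_centroid_predict(X_tr, y_tr, X[i], feats) != y[i]
    return errors / len(X)


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 1.0], [0.2, 0.0], [0.1, 1.0], [0.3, 0.0],
     [5.0, 0.0], [5.2, 1.0], [5.1, 0.0], [5.3, 1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(external_cv_error(X, y, k=4, n_keep=1))
```

Selecting features once on the full dataset and cross-validating only the classifier would instead leak information from the test folds, which is the source of the selection bias the abstract quantifies.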

GPSeqClus: an r package for sequential clustering of animal location data for model building, model application, and field site investigations

Methods in Ecology and Evolution ◽

10.1111/2041-210x.13572 ◽

2021 ◽

Author(s):

Justin G. Clapp ◽

Joseph D. Holbrook ◽

Daniel J. Thompson

Keyword(s):

Model Building ◽

R Package ◽

Field Site ◽

Model Application ◽

Location Data ◽

Building Model ◽

Site Investigations ◽

Sequential Clustering

A multilayer multimodal detection and prediction model based on explainable artificial intelligence for Alzheimer’s disease

Scientific Reports ◽

10.1038/s41598-021-82098-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Shaker El-Sappagh ◽

Jose M. Alonso ◽

S. M. Riazul Islam ◽

Ahmad M. Sultan ◽

Kyung Sup Kwak

Keyword(s):

Alzheimer's Disease ◽

Cross Validation ◽

Disease Risk ◽

Binary Classification ◽

Fuzzy Rule ◽

Large Set ◽

Detection Model ◽

Multi Class Classification ◽

Clinical Measures

Abstract Alzheimer’s disease (AD) is the most common type of dementia. Its diagnosis and progression detection have been intensively studied. Nevertheless, research studies often have little effect on clinical practice mainly due to the following reasons: (1) Most studies depend mainly on a single modality, especially neuroimaging; (2) diagnosis and progression detection are usually studied separately as two independent problems; and (3) current studies concentrate mainly on optimizing the performance of complex machine learning models, while disregarding their explainability. As a result, physicians struggle to interpret these models, and feel it is hard to trust them. In this paper, we carefully develop an accurate and interpretable AD diagnosis and progression detection model. This model provides physicians with accurate decisions along with a set of explanations for every decision. Specifically, the model integrates 11 modalities of 1048 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) real-world dataset: 294 cognitively normal, 254 stable mild cognitive impairment (MCI), 232 progressive MCI, and 268 AD. It is actually a two-layer model with random forest (RF) as classifier algorithm. In the first layer, the model carries out a multi-class classification for the early diagnosis of AD patients. In the second layer, the model applies binary classification to detect possible MCI-to-AD progression within three years from a baseline diagnosis. The performance of the model is optimized with key markers selected from a large set of biological and clinical measures. Regarding explainability, we provide, for each layer, global and instance-based explanations of the RF classifier by using the SHapley Additive exPlanations (SHAP) feature attribution framework. In addition, we implement 22 explainers based on decision trees and fuzzy rule-based systems to provide complementary justifications for every RF decision in each layer.
Furthermore, these explanations are represented in natural language form to help physicians understand the predictions. The designed model achieves a cross-validation accuracy of 93.95% and an F1-score of 93.94% in the first layer, while it achieves a cross-validation accuracy of 87.08% and an F1-Score of 87.09% in the second layer. The resulting system is not only accurate, but also trustworthy, accountable, and medically applicable, thanks to the provided explanations which are broadly consistent with each other and with the AD medical literature. The proposed system can help to enhance the clinical understanding of AD diagnosis and progression processes by providing detailed insights into the effect of different modalities on the disease risk.
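
The Shapley-value attribution that underlies SHAP can be illustrated with a toy example. The sketch below computes exact Shapley values by enumerating all feature orderings. It does not use the SHAP library and is not the authors' pipeline; the function name `shapley_values`, the toy model and the all-zero baseline are assumptions for illustration, and the enumeration is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(value_fn, x, baseline):
    """Exact Shapley attributions of value_fn(x) relative to value_fn(baseline)."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = value_fn(current)
        # Switch features from baseline to their actual value one at a time
        # and credit each feature with the change it causes.
        for i in order:
            current[i] = x[i]
            new = value_fn(current)
            phi[i] += new - prev
            prev = new
    return [p / len(orderings) for p in phi]


def toy_model(v):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2 * v[0] + v[1] * v[2]


phi = shapley_values(toy_model, x=[1, 2, 3], baseline=[0, 0, 0])
print(phi)
```

The attributions sum to the difference between the prediction and the baseline prediction, and the two interacting features share their joint contribution equally, which is the property that makes per-decision explanations of this kind additive.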

Gene set enrichment analysis for genome-wide DNA methylation data

Genome Biology ◽

10.1186/s13059-021-02388-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jovana Maksimovic ◽

Alicia Oshlack ◽

Belinda Phipson

Keyword(s):

DNA Methylation ◽

Enrichment Analysis ◽

R Package ◽

Gene Set Enrichment Analysis ◽

Methylation Array ◽

Gene Set ◽

Genome Wide ◽

Genome Methylation ◽

Unbiased Gene ◽

Gene Set Testing

Abstract DNA methylation is one of the most commonly studied epigenetic marks, due to its role in disease and development. Illumina methylation arrays have been extensively used to measure methylation across the human genome. Methylation array analysis has primarily focused on preprocessing, normalization, and identification of differentially methylated CpGs and regions. GOmeth and GOregion are new methods for performing unbiased gene set testing following differential methylation analysis. Benchmarking analyses demonstrate GOmeth outperforms other approaches, and GOregion is the first method for gene set testing of differentially methylated regions. Both methods are publicly available in the missMethyl Bioconductor R package.

Wide spectrum feature selection (WiSe) for regression model building

Computers & Chemical Engineering ◽

10.1016/j.compchemeng.2018.10.005 ◽

2019 ◽

Vol 121 ◽

pp. 99-110 ◽

Cited By ~ 5

Author(s):

Ricardo Rendall ◽

Ivan Castillo ◽

Alix Schmidt ◽

Swee-Teng Chin ◽

Leo H. Chiang ◽

...

Keyword(s):

Feature Selection ◽

Regression Model ◽

Model Building ◽

Wide Spectrum ◽

Spectrum Feature

tscount: An R Package for Analysis of Count Time Series Following Generalized Linear Models

Journal of Statistical Software ◽

10.18637/jss.v082.i05 ◽

2017 ◽

Vol 82 (5) ◽

Cited By ~ 28

Author(s):

Tobias Liboschik ◽

Konstantinos Fokianos ◽

Roland Fried

Keyword(s):

Time Series ◽

Generalized Linear Models ◽

Linear Models ◽

R Package ◽

Count Time Series

GsymPoint: An R Package to Estimate the Generalized Symmetry Point, an Optimal Cut-off Point for Binary Classification in Continuous Diagnostic Tests

The R Journal ◽

10.32614/rj-2017-015 ◽

2017 ◽

Vol 9 (1) ◽

pp. 262 ◽

Cited By ~ 1

Author(s):

Mónica López-Ratón ◽

Elisa M. Molanes-López ◽

Emilio Letón ◽

Carmen Cadarso-Suárez

Keyword(s):

Diagnostic Tests ◽

Binary Classification ◽

R Package ◽

Symmetry Point ◽

Generalized Symmetry

LPI-HyADBS: a hybrid framework for lncRNA-protein interaction prediction integrating feature selection and classification

BMC Bioinformatics ◽

10.1186/s12859-021-04485-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Liqian Zhou ◽

Qi Duan ◽

Xiongfei Tian ◽

He Xu ◽

Jianxin Tang ◽

...

Keyword(s):

Feature Selection ◽

Protein Interactions ◽

Cross Validation ◽

RNA Binding ◽

RNA Binding Proteins ◽

Biological Information ◽

Hybrid Framework ◽

Protein Interaction Prediction ◽

Single Dataset ◽

Feature Selection Approach

Abstract Background: Long noncoding RNAs (lncRNAs) have dense linkages with a plethora of important cellular activities. lncRNAs exert their functions by linking with corresponding RNA-binding proteins. Since experimental techniques to detect lncRNA-protein interactions (LPIs) are laborious and time-consuming, a few computational methods have been reported for LPI prediction. However, computation-based LPI identification methods have the following limitations: (1) Most methods were evaluated on a single dataset, so their generalization ability may not have been measured. (2) The majority of methods were validated under cross validation on lncRNA-protein pairs and did not investigate performance under other cross validations, especially cross validation on independent lncRNAs and independent proteins. (3) lncRNAs and proteins carry abundant biological information, and how to select informative features needs further investigation. Results: Under a hybrid framework (LPI-HyADBS) integrating feature selection based on AdaBoost and classification models including a deep neural network (DNN), extreme gradient boosting (XGBoost), and an SVM with a penalty coefficient of misclassification (C-SVM), this work focuses on finding new LPIs. First, five datasets are arranged. Each dataset contains lncRNA sequences, protein sequences, and an LPI network. Second, biological features of lncRNAs and proteins are acquired based on Pyfeat. Third, the obtained features of lncRNAs and proteins are selected based on AdaBoost and concatenated to depict each LPI sample. Fourth, DNN, XGBoost, and C-SVM are used to classify lncRNA-protein pairs based on the concatenated features. Finally, a hybrid framework is developed to integrate the classification results from the above three classifiers.
LPI-HyADBS is compared to six classical LPI prediction approaches (LPI-SKF, LPI-NRLMF, Capsule-LPI, LPI-CNNCP, LPLNP, and LPBNI) on five datasets under 5-fold cross validations on lncRNAs, proteins, lncRNA-protein pairs, and independent lncRNAs and independent proteins. The results show LPI-HyADBS has the best LPI prediction performance under four different cross validations. In particular, LPI-HyADBS obtains better classification ability than the other six approaches on the constructed independent dataset. Case analyses suggest that there is relevance between ZNF667-AS1 and Q15717. Conclusions: Integrating a feature selection approach based on AdaBoost and three classification techniques (DNN, XGBoost, and C-SVM), this work develops a hybrid framework to identify new linkages between lncRNAs and proteins.
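
The difference between cross validation on lncRNA-protein pairs and cross validation on lncRNAs (or proteins) comes down to how folds are formed. A minimal sketch, assuming that "cross validation on lncRNAs" means every pair involving a held-out lncRNA goes to the test fold: the function `group_kfold`, the pair identifiers and the round-robin assignment are hypothetical and are not taken from the LPI-HyADBS code.

```python
def group_kfold(pairs, k, key):
    """Assign pair indices to k folds so that all pairs sharing key(pair)
    land in the same fold (round-robin over sorted group labels)."""
    groups = sorted({key(p) for p in pairs})
    fold_of = {g: i % k for i, g in enumerate(groups)}
    folds = [[] for _ in range(k)]
    for idx, p in enumerate(pairs):
        folds[fold_of[key(p)]].append(idx)
    return folds


# Hypothetical (lncRNA, protein) interaction pairs.
pairs = [("L1", "P1"), ("L1", "P2"), ("L2", "P1"), ("L3", "P2"), ("L3", "P3")]

# Folds grouped by lncRNA: a test lncRNA never appears in the training pairs.
lnc_folds = group_kfold(pairs, k=3, key=lambda p: p[0])
# Folds grouped by protein, for cross validation on proteins.
prot_folds = group_kfold(pairs, k=3, key=lambda p: p[1])
print(lnc_folds, prot_folds)
```

Splitting at the pair level would instead let the same lncRNA appear on both sides of the split, which tends to give a more optimistic estimate than grouping by entity.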

Feature Selection Models Based on Hybrid Firefly Algorithm with Mutation Operator for Network Intrusion Detection

International Journal of Intelligent Engineering and Systems ◽

10.22266/ijies2021.0228.19 ◽

2021 ◽

Vol 14 (1) ◽

pp. 192-202

Author(s):

Karrar Alwan ◽

Ahmed AbuEl-Atta ◽

Hala Zayed ◽

...

Keyword(s):

Feature Selection ◽

Intrusion Detection ◽

Intrusion Detection System ◽

Firefly Algorithm ◽

Detection System ◽

Binary Classification ◽

Feature Reduction ◽

Detection Accuracy ◽

Multi Classification ◽

Modified Firefly Algorithm

Accurate intrusion detection is necessary to preserve network security. However, developing an efficient intrusion detection system is a complex problem due to the nonlinear nature of intrusion attempts, the unpredictable behaviour of network traffic, and the large number of features in the problem space. Hence, selecting the most effective and discriminating features is highly important. Additionally, eliminating irrelevant features can improve the detection accuracy as well as reduce the learning time of machine learning algorithms. However, feature reduction is an NP-hard problem. Therefore, several metaheuristics have been employed to determine the most effective feature subset within reasonable time. In this paper, two intrusion detection models are built based on a modified version of the firefly algorithm to achieve the feature selection task. The first and second models are used for binary and multiclass classification, respectively. The modified firefly algorithm employs a mutation operation to avoid becoming trapped in local optima by enhancing the exploration capabilities of the original firefly algorithm. The significance of the selected features is evaluated using a Naïve Bayes classifier over a standard benchmark dataset, which contains different types of attacks. The obtained results revealed the superiority of the modified firefly algorithm over the original firefly algorithm in terms of classification accuracy and the number of selected features under different scenarios. Additionally, the results confirmed the superiority of the proposed intrusion detection system over other recently proposed systems in both binary classification and multi-classification scenarios. The proposed system has 96.51% and 96.942% detection accuracy in binary classification and multi-classification, respectively. Moreover, the proposed system reduced the number of attributes from 41 to 9 for binary classification and to 10 for multi-classification.

Canopy Top, Height and Photosynthetic Pigment Estimation Using Parrot Sequoia Multispectral Imagery and the Unmanned Aerial Vehicle (UAV): Norway Spruce Forest Case Study

10.20944/preprints202101.0255.v1 ◽

2021 ◽

Author(s):

Veronika Kopačková-Strnadová ◽

Lucie Koucká ◽

Jan Jelenek ◽

Zuzana Lhotakova ◽

Filip Oulehle

Keyword(s):

Norway Spruce ◽

Unmanned Aerial Vehicle ◽

Linear Models ◽

Photosynthetic Pigment ◽

Ground Truth ◽

Needle Age ◽

Vegetation Indexes ◽

Pigment Contents ◽

Aerial Vehicle ◽

UAV Images

Remote sensing is one of the modern methods that have significantly developed over the last two decades and nowadays provides a new means for forest monitoring. High spatial and temporal resolutions are demanded for accurate and timely monitoring of forests. In this study, multi-spectral Unmanned Aerial Vehicle (UAV) images were used to estimate canopy parameters (definition of crown extent, top and height as well as photosynthetic pigment contents). The UAV images in Green, Red, Red-Edge and NIR bands were acquired by a Parrot Sequoia camera over selected sites in two small catchments (Czech Republic) covered dominantly by Norway spruce monocultures. Individual tree extents, together with tree tops and heights, were derived from the Canopy Height Model (CHM). In addition, the following were tested: (i) to what extent a linear relationship can be established between selected vegetation indexes (NDVI and NDVI red edge) derived for individual trees and the corresponding ground truth (e.g., biochemically assessed needle photosynthetic pigment contents), and (ii) whether needle age selection as a ground truth and crown light conditions affect the validity of linear models. The results of the conducted statistical analysis show that the two vegetation indexes (NDVI and NDVI red edge) tested here have the potential to assess photosynthetic pigments in Norway spruce forests at a semi-quantitative level; however, the needle-age selection as a ground truth was revealed to be a very important factor. The only usable results were obtained for linear models when using the 2nd year needle pigment contents as a ground truth. On the other hand, the illumination conditions of the crown proved to have very little effect on the model’s validity. No comparable study conducted on coniferous forest stands was found against which these results could be directly compared.
This shows that there is a further need for studies dealing with a quantitative estimation of the biochemical variables of nature coniferous forests when employing spectral data acquired by the UAV platform at a very high spatial resolution.
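
For reference, the two vegetation indexes can be written as normalized band differences. The sketch below uses the commonly cited definitions, NDVI = (NIR − Red) / (NIR + Red) and a red-edge variant with the Red-Edge band in place of Red. Whether the study used exactly these forms is an assumption, and the function names and reflectance values are hypothetical.

```python
def normalized_difference(band_a, band_b):
    """Generic normalized difference (a - b) / (a + b) of two reflectances."""
    total = band_a + band_b
    if total == 0:
        raise ValueError("band sum is zero; index is undefined")
    return (band_a - band_b) / total


def ndvi(nir, red):
    """NDVI from near-infrared and red reflectance."""
    return normalized_difference(nir, red)


def ndvi_red_edge(nir, red_edge):
    """Red-edge NDVI variant: Red-Edge band used instead of Red (assumed form)."""
    return normalized_difference(nir, red_edge)


# Hypothetical per-tree mean reflectances.
print(ndvi(nir=0.5, red=0.1), ndvi_red_edge(nir=0.5, red_edge=0.3))
```

Per-tree index values of this kind would then serve as the predictor in the linear models against the measured pigment contents.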