Feature selection with the R package MXM

F1000Research ◽  
2019 ◽  
Vol 7 ◽  
pp. 1505 ◽  
Author(s):  
Michail Tsagris ◽  
Ioannis Tsamardinos

Feature (or variable) selection is the process of identifying the minimal set of features with the highest predictive performance on the target variable of interest. Numerous feature selection algorithms have been developed over the years, but only a few have been implemented as publicly available R packages, and those typically offer few options. The R package MXM offers a variety of feature selection algorithms and has unique features that make it advantageous over its competitors: a) it contains feature selection algorithms that can treat numerous types of target variables, including continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts, left censored, etc.; b) it contains a variety of regression models that can be plugged into the feature selection algorithms (for example, with time-to-event data the user can choose among Cox, Weibull, log-logistic or exponential regression); c) it includes an algorithm for detecting multiple solutions (many sets of statistically equivalent features; plainly speaking, two features carry statistically equivalent information when substituting one for the other does not affect the inference or the conclusions); and d) it includes memory-efficient algorithms for high-volume data, i.e., data that cannot be loaded into R (on a machine with 16 GB of RAM, for example, R cannot directly load a dataset of 16 GB; by utilizing the proper package, the data are loaded and feature selection is then performed). In this paper, we qualitatively compare MXM with other relevant feature selection packages and discuss its advantages and disadvantages. Further, we provide a demonstration of MXM’s algorithms using real high-dimensional data from various applications.
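As a rough illustration of the interface the abstract describes, the sketch below runs MXM's SES algorithm on simulated survival data with a Cox-regression-based conditional independence test; the data are synthetic, and the argument and test names should be verified against the current MXM documentation.

```r
# Minimal sketch of survival-target feature selection with MXM,
# assuming the SES() interface and the "censIndCR" (Cox) test.
library(MXM)
library(survival)

set.seed(1)
x <- matrix(rnorm(100 * 50), nrow = 100)      # 50 candidate features
y <- Surv(rexp(100), rbinom(100, 1, 0.7))     # time-to-event target

mod <- SES(target = y, dataset = x, max_k = 3, threshold = 0.05,
           test = "censIndCR")                # "censIndWR" for Weibull
mod@selectedVars                              # indices of selected features
```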


2018 ◽  
Vol 28 (9) ◽  
pp. 2768-2786 ◽  
Author(s):  
Thomas PA Debray ◽  
Johanna AAG Damen ◽  
Richard D Riley ◽  
Kym Snell ◽  
Johannes B Reitsma ◽  
...  

It is widely recommended that any developed (diagnostic or prognostic) prediction model is externally validated in terms of its predictive performance, as measured by calibration and discrimination. When multiple validations have been performed, a systematic review followed by a formal meta-analysis helps to summarize overall performance across multiple settings, and reveals under which circumstances the model performs suboptimally and may need adjustment. We discuss how to undertake meta-analysis of the performance of prediction models with either a binary or a time-to-event outcome. We address how to deal with incomplete availability of study-specific results (performance estimates and their precision), and how to produce summary estimates of the c-statistic, the observed:expected ratio and the calibration slope. Furthermore, we discuss the implementation of frequentist and Bayesian meta-analysis methods, and propose novel empirically based prior distributions to improve estimation of between-study heterogeneity in small samples. Finally, we illustrate all methods using two examples: meta-analysis of the predictive performance of EuroSCORE II and of the Framingham Risk Score. All examples and meta-analysis models have been implemented in our newly developed R package “metamisc”.
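A hedged sketch of the kind of meta-analysis the abstract mentions, pooling study-specific c-statistics with metamisc's valmeta(); the performance estimates below are invented for illustration, and argument names should be checked against the package documentation.

```r
# Pooling c-statistics across validation studies with metamisc;
# the four estimates and standard errors are made up.
library(metamisc)

cstat    <- c(0.72, 0.68, 0.75, 0.70)   # study-specific c-statistics
cstat.se <- c(0.03, 0.05, 0.02, 0.04)   # their standard errors

fit <- valmeta(cstat = cstat, cstat.se = cstat.se,
               slab = paste("Study", 1:4))
fit   # random-effects summary estimate of discrimination
```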


2021 ◽  
Author(s):  
T Butler-Yeoman ◽  
Bing Xue ◽  
Mengjie Zhang

© 2015 IEEE. Feature selection is an important pre-processing step that can reduce the dimensionality of a dataset and increase the accuracy and efficiency of a learning/classification algorithm. However, existing feature selection algorithms, mainly wrappers and filters, have their own advantages and disadvantages. This paper proposes two filter-wrapper hybrid feature selection algorithms based on particle swarm optimisation (PSO). The first algorithm, named FastPSO, combines filter and wrapper evaluations within the PSO search process, with most evaluations performed as filters and a small number as wrappers. The second algorithm, named RapidPSO, further reduces the number of wrapper evaluations. A theoretical analysis of FastPSO and RapidPSO is conducted to investigate their complexity. FastPSO and RapidPSO are compared with a pure wrapper algorithm, WrapperPSO, and a pure filter algorithm, FilterPSO, on nine benchmark datasets of varying difficulty. The experimental results show that both FastPSO and RapidPSO can successfully reduce the number of features and simultaneously increase classification performance over using all features. The two proposed algorithms maintain the high classification performance achieved by WrapperPSO and significantly reduce the computational time, although they select more features. At the same time, they increase the classification accuracy of FilterPSO and reduce the number of features, at a higher computational cost. FastPSO outperformed RapidPSO in terms of classification accuracy and the number of features, but required more computational time, illustrating the trade-off between efficiency and effectiveness.
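The core idea of the hybrid, cheap filter evaluations for most particles with occasional expensive wrapper evaluations, can be sketched as below. This is a toy illustration, not the authors' implementation; the wrapper_prob parameter and the specific filter and wrapper measures are our own assumptions.

```r
# Toy sketch of a hybrid filter/wrapper fitness function for PSO-based
# feature selection (not the paper's code). `mask` is a binary vector
# marking the selected features.
library(class)  # for knn()

evaluate_subset <- function(mask, X, y, wrapper_prob = 0.1) {
  feats <- which(mask == 1)
  if (length(feats) == 0) return(0)
  if (runif(1) < wrapper_prob) {
    # wrapper: 5-fold cross-validated k-NN accuracy (expensive)
    folds <- sample(rep(1:5, length.out = nrow(X)))
    acc <- sapply(1:5, function(k) {
      pred <- knn(X[folds != k, feats, drop = FALSE],
                  X[folds == k, feats, drop = FALSE],
                  y[folds != k], k = 3)
      mean(pred == y[folds == k])
    })
    mean(acc)
  } else {
    # filter: mean absolute correlation with the class label (cheap)
    mean(abs(cor(X[, feats, drop = FALSE], as.numeric(y))))
  }
}
```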


2021 ◽  
Vol 97 (7) ◽  
Author(s):  
Yun-Hee Choi ◽  
Laurent Briollais ◽  
Wenqing He ◽  
Karen Kopciuk

2021 ◽  
Vol 12 ◽  
Author(s):  
Nasim Vahabi ◽  
Caitrin W. McDonough ◽  
Ankit A. Desai ◽  
Larisa H. Cavallari ◽  
Julio D. Duarte ◽  
...  

Background: The development of high-throughput techniques has enabled profiling a large number of biomolecules across a number of molecular compartments. The challenge then becomes to integrate such multimodal Omics data to gain insights into biological processes and into disease onset and progression mechanisms. Further, given the high dimensionality of such data, incorporating prior biological information on interactions between molecular compartments when developing statistical models for data integration is beneficial, especially in settings involving a small number of samples.
Results: We develop a supervised model for time-to-event data (e.g., death, biochemical recurrence) that simultaneously accounts for redundant information within Omics profiles and leverages prior biological associations between them through a multi-block PLS framework. The interactions between data from different molecular compartments (e.g., epigenome, transcriptome, methylome, etc.) were captured by using cis-regulatory quantitative effects in the proposed model. The model, coined Cox-sMBPLS, exhibits superior prediction performance and improved feature selection based on both simulation studies and analysis of data from heart failure patients.
Conclusion: The proposed supervised Cox-sMBPLS model can effectively incorporate prior biological information in the survival prediction system, leading to improved prediction performance and feature selection. It also enables the identification of multi-Omics modules of biomolecules that impact the patients’ survival probability and provides insights into potential relevant risk factors that merit further investigation.
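The paper's Cox-sMBPLS model does not appear to be available as a standard CRAN package, so the sketch below only gestures at the multi-block idea with mixOmics' block.spls(), using deviance residuals from a null Cox model as a continuous surrogate for the survival outcome; this is a loose analogue under our own assumptions, not the authors' method.

```r
# Loose multi-block PLS analogue (not Cox-sMBPLS itself); all data
# are simulated and the surrogate-outcome trick is our assumption.
library(mixOmics)
library(survival)

set.seed(1)
n <- 60
blocks <- list(mRNA   = matrix(rnorm(n * 100), n),
               methyl = matrix(rnorm(n * 80),  n))
surv <- Surv(rexp(n), rbinom(n, 1, 0.6))
y <- residuals(coxph(surv ~ 1), type = "deviance")  # surrogate outcome

fit <- block.spls(X = blocks, Y = matrix(y), ncomp = 2,
                  keepX = list(mRNA = c(10, 10), methyl = c(10, 10)))
selectVar(fit, comp = 1)   # block-wise features on the first component
```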


i-Perception ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 204166952097867
Author(s):  
Sven Panis ◽  
Filipp Schmidt ◽  
Maximilian P. Wolkersdorfer ◽  
Thomas Schmidt

In this Methods article, we discuss and illustrate a unifying, principled way to analyze response time data from psychological experiments, and indeed all other types of time-to-event data. We advocate the general application of discrete-time event history analysis (EHA), a well-established, intuitive longitudinal approach to statistically describe and model the shape of time-to-event distributions. After discussing the theoretical background behind the so-called hazard function of event occurrence in both continuous and discrete time units, we illustrate how to calculate and interpret the descriptive statistics provided by discrete-time EHA using two example data sets (masked priming, visual search). In the case of discrimination data, the hazard analysis of response occurrence can be extended with a microlevel speed-accuracy trade-off analysis. We then discuss different approaches for obtaining inferential statistics. We consider the advantages and disadvantages of a principled use of discrete-time EHA for time-to-event data compared to (a) comparing means with analysis of variance, (b) other distributional methods available in the literature such as delta plots and continuous-time EHA methods, and (c) only fitting parametric distributions or computational models to empirical data. We conclude that statistically controlling for the passage of time during data analysis is as important as experimental control during the design of an experiment for understanding human behavior in our experimental paradigms.
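As a concrete illustration of the descriptive step, the base-R sketch below computes a discrete-time (life-table) hazard estimate from simulated response times, treating responses slower than the last bin as right-censored; the bin width and simulation settings are arbitrary.

```r
# Life-table estimate of the discrete-time hazard of responding,
# h(t) = P(response in bin t | no response before bin t).
set.seed(1)
rt   <- rgamma(500, shape = 4, rate = 0.01)   # simulated RTs in ms
bins <- seq(0, 1000, by = 50)                 # 50-ms bins

bin_id   <- cut(rt, bins, labels = FALSE)
censored <- sum(is.na(bin_id))                # RTs beyond the last bin
events   <- tabulate(bin_id[!is.na(bin_id)], nbins = length(bins) - 1)
at_risk  <- rev(cumsum(rev(c(events, censored))))[seq_along(events)]
hazard   <- events / at_risk

plot(bins[-1], hazard, type = "b",
     xlab = "Time bin (ms)", ylab = "Estimated hazard")
```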


Author(s):  
Sezen Cekic ◽  
Stephen Aichele ◽  
Andreas M. Brandmaier ◽  
Ylva Köhncke ◽  
Paolo Ghisletta

In biostatistics and medical research, longitudinal data are often composed of repeated assessments of a variable together with dichotomous indicators marking an event of interest. Consequently, the joint modeling of longitudinal and time-to-event data has generated much interest in these disciplines over the past decade. In the behavioural sciences, too, we are often interested in relating individual trajectories to discrete events, yet joint modeling is rarely applied there. This tutorial presents an overview and general framework for the joint modeling of longitudinal and time-to-event data, and fully illustrates its application in the context of a behavioural study with the JMbayes R package. In particular, the tutorial discusses practical topics such as model selection and comparison, the choice of joint-model parameterization, and the interpretation of model parameters. Overall, this tutorial aims to introduce the theory of joint modeling didactically and to familiarize novice analysts with the use of the JMbayes package.
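A hedged sketch of the basic JMbayes workflow on its bundled pbc2 data: a linear mixed submodel for the longitudinal marker, a Cox submodel for the event, and jointModelBayes() to combine them; defaults and priors should be checked against the package documentation.

```r
# Joint model of serum bilirubin trajectories and survival (pbc2 data
# shipped with JMbayes).
library(JMbayes)
library(nlme)
library(survival)

# longitudinal submodel: log serum bilirubin over time
lme_fit <- lme(log(serBilir) ~ year, random = ~ year | id, data = pbc2)

# survival submodel on one row per subject; x = TRUE keeps the design matrix
cox_fit <- coxph(Surv(years, status2) ~ drug, data = pbc2.id, x = TRUE)

joint_fit <- jointModelBayes(lme_fit, cox_fit, timeVar = "year")
summary(joint_fit)
```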


Author(s):  
Giorgos Borboudakis ◽  
Ioannis Tsamardinos

Abstract: Most feature selection methods identify only a single solution. This is acceptable for predictive purposes, but is not sufficient for knowledge discovery if multiple solutions exist. We propose a strategy to extend a class of greedy methods to efficiently identify multiple solutions, and show under which conditions it identifies all solutions. We also introduce a taxonomy of features that takes the existence of multiple solutions into account. Furthermore, we explore different definitions of statistical equivalence of solutions, as well as methods for testing equivalence. A novel algorithm for compactly representing and visualizing multiple solutions is also introduced. In experiments we show that (a) the proposed algorithm is significantly more computationally efficient than the TIE* algorithm, the only alternative approach with similar theoretical guarantees, while identifying similar solutions to it, and (b) the identified solutions have similar predictive performance.
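A related implementation note: the MXM package described above exposes multiple statistically equivalent solutions through its SES algorithm, so a quick way to see such solution sets in practice is the hedged sketch below; the data are simulated with a deliberately near-duplicate feature so that more than one signature can appear.

```r
# Inspecting multiple statistically equivalent feature sets with
# MXM's SES (a related implementation, not necessarily the paper's
# exact algorithm).
library(MXM)

set.seed(2)
x <- matrix(rnorm(200 * 30), nrow = 200)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.01)   # near-duplicate of feature 1
y <- x[, 1] + rnorm(200)

mod <- SES(target = y, dataset = x, max_k = 3, threshold = 0.05,
           test = "testIndFisher")
mod@signatures   # each row is one statistically equivalent feature set
```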


2018 ◽  
Vol 18 (3-4) ◽  
pp. 299-321 ◽  
Author(s):  
Andreas Bender ◽  
Andreas Groll ◽  
Fabian Scheipl

Abstract: This tutorial article demonstrates how time-to-event data can be modelled in a very flexible way by taking advantage of advanced inference methods that have recently been developed for generalized additive mixed models. In particular, we describe the necessary pre-processing steps for transforming such data into a suitable format and show how a variety of effects, including a smooth nonlinear baseline hazard, and potentially nonlinear and nonlinearly time-varying effects, can be estimated and interpreted. We also present useful graphical tools for model evaluation and interpretation of the estimated effects. Throughout, we demonstrate this approach using various application examples. The article is accompanied by a new R package called pammtools implementing all of the tools described here.
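The workflow the abstract outlines, transforming the data and then fitting a generalized additive model, can be sketched as follows; this assumes pammtools' documented as_ped() interface and uses the veteran data from the survival package.

```r
# Piece-wise exponential additive model (PAM): transform to PED format
# with as_ped(), then fit with mgcv::gam() using a Poisson likelihood
# and the PED offset.
library(pammtools)
library(mgcv)
library(survival)

data(veteran, package = "survival")
ped <- as_ped(veteran, Surv(time, status) ~ age + celltype)

pam <- gam(ped_status ~ s(tend) + age + celltype,
           data = ped, family = poisson(), offset = offset)
summary(pam)   # s(tend) is the smooth log baseline hazard
```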

