Feature selection with the R package MXM

F1000Research ◽  
2019 ◽  
Vol 7 ◽  
pp. 1505 ◽  
Author(s):  
Michail Tsagris ◽  
Ioannis Tsamardinos

Feature (or variable) selection is the process of identifying the minimal set of features with the highest predictive performance on the target variable of interest. Numerous feature selection algorithms have been developed over the years, but only a few have been implemented as publicly available R packages, and those typically offer few options. The R package MXM offers a variety of feature selection algorithms and has unique features that make it advantageous over its competitors: a) it contains feature selection algorithms that can treat numerous types of target variables, including continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts, left censored, etc.; b) it contains a variety of regression models that can be plugged into the feature selection algorithms (for example, with time-to-event data the user can choose among Cox, Weibull, log-logistic or exponential regression); c) it includes an algorithm for detecting multiple solutions (many sets of statistically equivalent features; plainly speaking, two features carry statistically equivalent information when substituting one for the other does not affect the inference or the conclusions); and d) it includes memory-efficient algorithms for high-volume data, i.e., data that cannot be loaded into R (on a machine with 16 GB of RAM, for example, R cannot directly load a dataset of 16 GB; by utilizing the proper package, the data are loaded and feature selection is then performed). In this paper, we qualitatively compare MXM with other relevant feature selection packages and discuss its advantages and disadvantages. Further, we provide a demonstration of MXM’s algorithms using real high-dimensional data from various applications.
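As a rough illustration of the interface the abstract describes, the sketch below runs MXM's SES algorithm on simulated survival data with a Cox-regression-based conditional independence test; the data are synthetic, and the argument and test names should be verified against the current MXM documentation.

```r
# Minimal sketch of survival-target feature selection with MXM,
# assuming the SES() interface and the "censIndCR" (Cox) test.
library(MXM)
library(survival)

set.seed(1)
x <- matrix(rnorm(100 * 50), nrow = 100)      # 50 candidate features
y <- Surv(rexp(100), rbinom(100, 1, 0.7))     # time-to-event target

mod <- SES(target = y, dataset = x, max_k = 3, threshold = 0.05,
           test = "censIndCR")                # "censIndWR" for Weibull
mod@selectedVars                              # indices of selected features
```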


2018 ◽  
Vol 28 (9) ◽  
pp. 2768-2786 ◽  
Author(s):  
Thomas PA Debray ◽  
Johanna AAG Damen ◽  
Richard D Riley ◽  
Kym Snell ◽  
Johannes B Reitsma ◽  
...  

It is widely recommended that any developed (diagnostic or prognostic) prediction model is externally validated in terms of its predictive performance, as measured by calibration and discrimination. When multiple validations have been performed, a systematic review followed by a formal meta-analysis helps to summarize overall performance across multiple settings, and reveals under which circumstances the model performs suboptimally and may need adjustment. We discuss how to undertake meta-analysis of the performance of prediction models with either a binary or a time-to-event outcome. We address how to deal with incomplete availability of study-specific results (performance estimates and their precision), and how to produce summary estimates of the c-statistic, the observed:expected ratio and the calibration slope. Furthermore, we discuss the implementation of frequentist and Bayesian meta-analysis methods, and propose novel empirically based prior distributions to improve estimation of between-study heterogeneity in small samples. Finally, we illustrate all methods using two examples: meta-analysis of the predictive performance of EuroSCORE II and of the Framingham Risk Score. All examples and meta-analysis models have been implemented in our newly developed R package “metamisc”.
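A hedged sketch of the kind of meta-analysis the abstract mentions, pooling study-specific c-statistics with metamisc's valmeta(); the performance estimates below are invented for illustration, and argument names should be checked against the package documentation.

```r
# Pooling c-statistics across validation studies with metamisc;
# the four estimates and standard errors are made up.
library(metamisc)

cstat    <- c(0.72, 0.68, 0.75, 0.70)   # study-specific c-statistics
cstat.se <- c(0.03, 0.05, 0.02, 0.04)   # their standard errors

fit <- valmeta(cstat = cstat, cstat.se = cstat.se,
               slab = paste("Study", 1:4))
fit   # random-effects summary estimate of discrimination
```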


2021 ◽  
Author(s):  
T Butler-Yeoman ◽  
Bing Xue ◽  
Mengjie Zhang

© 2015 IEEE. Feature selection is an important pre-processing step that can reduce the dimensionality of a dataset and increase the accuracy and efficiency of a learning/classification algorithm. However, existing feature selection algorithms, mainly wrappers and filters, have their own advantages and disadvantages. This paper proposes two filter-wrapper hybrid feature selection algorithms based on particle swarm optimisation (PSO). The first algorithm, named FastPSO, combines filter and wrapper evaluations within the PSO search process, with most evaluations performed as filters and a small number as wrappers. The second algorithm, named RapidPSO, further reduces the number of wrapper evaluations. A theoretical analysis of FastPSO and RapidPSO is conducted to investigate their complexity. FastPSO and RapidPSO are compared with a pure wrapper algorithm, WrapperPSO, and a pure filter algorithm, FilterPSO, on nine benchmark datasets of varying difficulty. The experimental results show that both FastPSO and RapidPSO can successfully reduce the number of features and simultaneously increase classification performance over using all features. The two proposed algorithms maintain the high classification performance achieved by WrapperPSO and significantly reduce the computational time, although they select more features. At the same time, they increase the classification accuracy of FilterPSO and reduce the number of features, at a higher computational cost. FastPSO outperformed RapidPSO in terms of classification accuracy and the number of features, but required more computational time, illustrating the trade-off between efficiency and effectiveness.
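The core idea of the hybrid, cheap filter evaluations for most particles with occasional expensive wrapper evaluations, can be sketched as below. This is a toy illustration, not the authors' implementation; the wrapper_prob parameter and the specific filter and wrapper measures are our own assumptions.

```r
# Toy sketch of a hybrid filter/wrapper fitness function for PSO-based
# feature selection (not the paper's code). `mask` is a binary vector
# marking the selected features.
library(class)  # for knn()

evaluate_subset <- function(mask, X, y, wrapper_prob = 0.1) {
  feats <- which(mask == 1)
  if (length(feats) == 0) return(0)
  if (runif(1) < wrapper_prob) {
    # wrapper: 5-fold cross-validated k-NN accuracy (expensive)
    folds <- sample(rep(1:5, length.out = nrow(X)))
    acc <- sapply(1:5, function(k) {
      pred <- knn(X[folds != k, feats, drop = FALSE],
                  X[folds == k, feats, drop = FALSE],
                  y[folds != k], k = 3)
      mean(pred == y[folds == k])
    })
    mean(acc)
  } else {
    # filter: mean absolute correlation with the class label (cheap)
    mean(abs(cor(X[, feats, drop = FALSE], as.numeric(y))))
  }
}
```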


2021 ◽  
Vol 97 (7) ◽  
Author(s):  
Yun-Hee Choi ◽  
Laurent Briollais ◽  
Wenqing He ◽  
Karen Kopciuk

2021 ◽  
Vol 12 ◽  
Author(s):  
Nasim Vahabi ◽  
Caitrin W. McDonough ◽  
Ankit A. Desai ◽  
Larisa H. Cavallari ◽  
Julio D. Duarte ◽  
...  

Background: The development of high-throughput techniques has enabled profiling a large number of biomolecules across a number of molecular compartments. The challenge then becomes to integrate such multimodal Omics data to gain insights into biological processes and into disease onset and progression mechanisms. Further, given the high dimensionality of such data, incorporating prior biological information on interactions between molecular compartments when developing statistical models for data integration is beneficial, especially in settings involving a small number of samples.
Results: We develop a supervised model for time-to-event data (e.g., death, biochemical recurrence) that simultaneously accounts for redundant information within Omics profiles and leverages prior biological associations between them through a multi-block PLS framework. The interactions between data from different molecular compartments (e.g., epigenome, transcriptome, methylome, etc.) were captured by using cis-regulatory quantitative effects in the proposed model. The model, coined Cox-sMBPLS, exhibits superior prediction performance and improved feature selection based on both simulation studies and analysis of data from heart failure patients.
Conclusion: The proposed supervised Cox-sMBPLS model can effectively incorporate prior biological information in the survival prediction system, leading to improved prediction performance and feature selection. It also enables the identification of multi-Omics modules of biomolecules that impact the patients’ survival probability and provides insights into potential relevant risk factors that merit further investigation.
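The paper's Cox-sMBPLS model does not appear to be available as a standard CRAN package, so the sketch below only gestures at the multi-block idea with mixOmics' block.spls(), using deviance residuals from a null Cox model as a continuous surrogate for the survival outcome; this is a loose analogue under our own assumptions, not the authors' method.

```r
# Loose multi-block PLS analogue (not Cox-sMBPLS itself); all data
# are simulated and the surrogate-outcome trick is our assumption.
library(mixOmics)
library(survival)

set.seed(1)
n <- 60
blocks <- list(mRNA   = matrix(rnorm(n * 100), n),
               methyl = matrix(rnorm(n * 80),  n))
surv <- Surv(rexp(n), rbinom(n, 1, 0.6))
y <- residuals(coxph(surv ~ 1), type = "deviance")  # surrogate outcome

fit <- block.spls(X = blocks, Y = matrix(y), ncomp = 2,
                  keepX = list(mRNA = c(10, 10), methyl = c(10, 10)))
selectVar(fit, comp = 1)   # block-wise features on the first component
```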


i-Perception ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 204166952097867
Author(s):  
Sven Panis ◽  
Filipp Schmidt ◽  
Maximilian P. Wolkersdorfer ◽  
Thomas Schmidt

In this Methods article, we discuss and illustrate a unifying, principled way to analyze response time data from psychological experiments, and indeed all other types of time-to-event data. We advocate the general application of discrete-time event history analysis (EHA), a well-established, intuitive longitudinal approach to statistically describe and model the shape of time-to-event distributions. After discussing the theoretical background behind the so-called hazard function of event occurrence in both continuous and discrete time units, we illustrate how to calculate and interpret the descriptive statistics provided by discrete-time EHA using two example data sets (masked priming, visual search). In the case of discrimination data, the hazard analysis of response occurrence can be extended with a microlevel speed-accuracy trade-off analysis. We then discuss different approaches for obtaining inferential statistics. We consider the advantages and disadvantages of a principled use of discrete-time EHA for time-to-event data compared to (a) comparing means with analysis of variance, (b) other distributional methods available in the literature such as delta plots and continuous-time EHA methods, and (c) only fitting parametric distributions or computational models to empirical data. We conclude that statistically controlling for the passage of time during data analysis is as important as experimental control during the design of an experiment for understanding human behavior in our experimental paradigms.
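As a concrete illustration of the descriptive step, the base-R sketch below computes a discrete-time (life-table) hazard estimate from simulated response times, treating responses slower than the last bin as right-censored; the bin width and simulation settings are arbitrary.

```r
# Life-table estimate of the discrete-time hazard of responding,
# h(t) = P(response in bin t | no response before bin t).
set.seed(1)
rt   <- rgamma(500, shape = 4, rate = 0.01)   # simulated RTs in ms
bins <- seq(0, 1000, by = 50)                 # 50-ms bins

bin_id   <- cut(rt, bins, labels = FALSE)
censored <- sum(is.na(bin_id))                # RTs beyond the last bin
events   <- tabulate(bin_id[!is.na(bin_id)], nbins = length(bins) - 1)
at_risk  <- rev(cumsum(rev(c(events, censored))))[seq_along(events)]
hazard   <- events / at_risk

plot(bins[-1], hazard, type = "b",
     xlab = "Time bin (ms)", ylab = "Estimated hazard")
```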


Author(s):  
Sezen Cekic ◽  
Stephen Aichele ◽  
Andreas M. Brandmaier ◽  
Ylva Köhncke ◽  
Paolo Ghisletta

In biostatistics and medical research, longitudinal data are often composed of repeated assessments of a variable together with dichotomous indicators marking an event of interest. Consequently, the joint modeling of longitudinal and time-to-event data has generated much interest in these disciplines over the past decade. In the behavioural sciences, too, we are often interested in relating individual trajectories to discrete events, yet joint modeling is rarely applied there. This tutorial presents an overview and general framework for the joint modeling of longitudinal and time-to-event data, and fully illustrates its application in the context of a behavioural study with the JMbayes R package. In particular, the tutorial discusses practical topics such as model selection and comparison, the choice of joint-model parameterization, and the interpretation of model parameters. Overall, this tutorial aims to introduce the theory of joint modeling didactically and to familiarize novice analysts with the use of the JMbayes package.
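A hedged sketch of the basic JMbayes workflow on its bundled pbc2 data: a linear mixed submodel for the longitudinal marker, a Cox submodel for the event, and jointModelBayes() to combine them; defaults and priors should be checked against the package documentation.

```r
# Joint model of serum bilirubin trajectories and survival (pbc2 data
# shipped with JMbayes).
library(JMbayes)
library(nlme)
library(survival)

# longitudinal submodel: log serum bilirubin over time
lme_fit <- lme(log(serBilir) ~ year, random = ~ year | id, data = pbc2)

# survival submodel on one row per subject; x = TRUE keeps the design matrix
cox_fit <- coxph(Surv(years, status2) ~ drug, data = pbc2.id, x = TRUE)

joint_fit <- jointModelBayes(lme_fit, cox_fit, timeVar = "year")
summary(joint_fit)
```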


Author(s):  
Giorgos Borboudakis ◽  
Ioannis Tsamardinos

Abstract: Most feature selection methods identify only a single solution. This is acceptable for predictive purposes, but is not sufficient for knowledge discovery if multiple solutions exist. We propose a strategy to extend a class of greedy methods to efficiently identify multiple solutions, and show under which conditions it identifies all solutions. We also introduce a taxonomy of features that takes the existence of multiple solutions into account. Furthermore, we explore different definitions of statistical equivalence of solutions, as well as methods for testing equivalence. A novel algorithm for compactly representing and visualizing multiple solutions is also introduced. In experiments we show that (a) the proposed algorithm is significantly more computationally efficient than the TIE* algorithm, the only alternative approach with similar theoretical guarantees, while identifying similar solutions to it, and (b) the identified solutions have similar predictive performance.
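A related implementation note: the MXM package described above exposes multiple statistically equivalent solutions through its SES algorithm, so a quick way to see such solution sets in practice is the hedged sketch below; the data are simulated with a deliberately near-duplicate feature so that more than one signature can appear.

```r
# Inspecting multiple statistically equivalent feature sets with
# MXM's SES (a related implementation, not necessarily the paper's
# exact algorithm).
library(MXM)

set.seed(2)
x <- matrix(rnorm(200 * 30), nrow = 200)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.01)   # near-duplicate of feature 1
y <- x[, 1] + rnorm(200)

mod <- SES(target = y, dataset = x, max_k = 3, threshold = 0.05,
           test = "testIndFisher")
mod@signatures   # each row is one statistically equivalent feature set
```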


2018 ◽  
Vol 18 (3-4) ◽  
pp. 299-321 ◽  
Author(s):  
Andreas Bender ◽  
Andreas Groll ◽  
Fabian Scheipl

Abstract: This tutorial article demonstrates how time-to-event data can be modelled in a very flexible way by taking advantage of advanced inference methods that have recently been developed for generalized additive mixed models. In particular, we describe the necessary pre-processing steps for transforming such data into a suitable format and show how a variety of effects, including a smooth nonlinear baseline hazard, and potentially nonlinear and nonlinearly time-varying effects, can be estimated and interpreted. We also present useful graphical tools for model evaluation and interpretation of the estimated effects. Throughout, we demonstrate this approach using various application examples. The article is accompanied by a new R package called pammtools implementing all of the tools described here.
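The workflow the abstract outlines, transforming the data and then fitting a generalized additive model, can be sketched as follows; this assumes pammtools' documented as_ped() interface and uses the veteran data from the survival package.

```r
# Piece-wise exponential additive model (PAM): transform to PED format
# with as_ped(), then fit with mgcv::gam() using a Poisson likelihood
# and the PED offset.
library(pammtools)
library(mgcv)
library(survival)

data(veteran, package = "survival")
ped <- as_ped(veteran, Surv(time, status) ~ age + celltype)

pam <- gam(ped_status ~ s(tend) + age + celltype,
           data = ped, family = poisson(), offset = offset)
summary(pam)   # s(tend) is the smooth log baseline hazard
```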

