Collective feature selection to identify crucial epistatic variants

2018
Vol 11 (1)
Author(s): Shefali S. Verma, Anastasia Lucas, Xinyuan Zhang, Yogasudha Veturi, Scott Dudek, ...

Abstract

Background: Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex disease/traits. Detection of epistatic interactions still remains a challenge due to the large number of features and relatively small sample size as input, thus leading to the so-called “short fat data” problem. The efficiency of machine learning methods can be increased by limiting the number of input features. Thus, it is very important to perform variable selection before searching for epistasis. Many methods have been evaluated and proposed to perform feature selection, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach.

Results: Through our simulation study we propose a collective feature selection approach to select features that are in the “union” of the best performing methods. We explored various parametric, non-parametric, and data mining approaches to perform feature selection. We choose our top performing methods to select the union of the resulting variables based on a user-defined percentage of variants selected from each method to take to downstream analysis. Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criteria for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and Gradient boosting, work best under other simulation criteria. Thus, using a collective approach proves to be more beneficial for selecting variables with epistatic effects also in low effect size datasets and different genetic architectures. Following this, we applied our proposed collective feature selection approach to select the top 1% of variables to identify potential interacting variables associated with Body Mass Index (BMI) in ~44,000 samples obtained from Geisinger’s MyCode Community Health Initiative (on behalf of DiscovEHR collaboration).

Conclusions: In this study, we were able to show that selecting variables using a collective feature selection approach could help in selecting true positive epistatic variables more frequently than applying any single method for feature selection via simulation studies. We were able to demonstrate the effectiveness of collective feature selection along with a comparison of many methods in our simulation analysis. We also applied our method to identify non-linear networks associated with obesity.
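The "union of the top percentage from each method" idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the rounding rule for the cut-off, and the score arrays are all assumptions made for illustration. In practice the scores would come from methods such as MDR, Ranger, or gradient boosting.

```python
import numpy as np

def top_fraction_indices(scores, fraction):
    """Return the indices of the highest-scoring features (at least one)."""
    scores = np.asarray(scores, dtype=float)
    # Hypothetical cut-off rule: round the requested fraction to a feature count.
    k = max(1, int(round(fraction * scores.size)))
    # argsort is ascending, so the last k entries are the top k scores.
    return set(np.argsort(scores, kind="stable")[-k:].tolist())

def collective_select(scores_by_method, fraction):
    """Union of the top `fraction` of features chosen by each method."""
    selected = set()
    for scores in scores_by_method.values():
        selected |= top_fraction_indices(scores, fraction)
    return sorted(selected)

# Made-up importance scores for 10 variants from two hypothetical methods.
example_scores = {
    "method_a": [0.90, 0.10, 0.80, 0.00, 0.20, 0.30, 0.05, 0.15, 0.25, 0.35],
    "method_b": [0.10, 0.20, 0.95, 0.00, 0.05, 0.30, 0.15, 0.85, 0.25, 0.35],
}
selected_variants = collective_select(example_scores, fraction=0.2)
```

Because the result is a union, a variant ranked highly by only one method is still carried forward, which matches the motivation that no single method works best in every scenario.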


Author(s): Lindsey M. Kitchell, Francisco J. Parada, Brandi L. Emerick, Tom A. Busey

2012
Vol 19 (2)
pp. 97-111
Author(s): Muhammad Ahmad, Syungyoung Lee, Ihsan Ul Haq, Qaisar Mushtaq

Author(s): Manpreet Kaur, Chamkaur Singh

Educational Data Mining (EDM) is an emerging research area that helps educational institutions improve the performance of their students. Feature Selection (FS) algorithms remove irrelevant data from an educational dataset and hence increase the performance of the classifiers used in EDM techniques. This paper presents an analysis of the performance of feature selection algorithms on a student dataset. It also sets out several problems in its problem formulation, which are left to be resolved in future work. Furthermore, the paper attempts to play a positive role in improving education quality and to guide new researchers in making academic interventions.


2012
Vol 57 (3)
pp. 829-835
Author(s): Z. Głowacz, J. Kozik

The paper describes a procedure for automatically selecting the symptoms that accompany a break in the armature winding coils of a synchronous motor. This procedure, called feature selection, chooses from the full set of features describing the problem the subset that best distinguishes between healthy and damaged states. The features are the amplitudes of the spectral components of the motor current signals. The full spectra of the current signals are treated as multidimensional feature spaces, and their subspaces are tested. Particular subspaces are chosen with the aid of a genetic algorithm, and their quality is assessed using the Mahalanobis distance measure. The algorithm searches for the subspaces for which this distance is greatest. The algorithm is very efficient and, as the research confirmed, leads to good results. The proposed technique has been applied successfully in many other fields of science and technology, including medical diagnostics.
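The subspace-scoring step described above can be sketched as follows. The abstract does not say how the distance is computed, so this sketch assumes it is measured between the two class means using a pooled covariance matrix. It also replaces the genetic algorithm with an exhaustive search over fixed-size subsets, which is only practical for a handful of features. All function names and the synthetic data are hypothetical.

```python
from itertools import combinations
import numpy as np

def mahalanobis_separation(healthy, faulty, subset):
    """Mahalanobis distance between class means, using only `subset` columns."""
    cols = list(subset)
    a, b = healthy[:, cols], faulty[:, cols]
    diff = a.mean(axis=0) - b.mean(axis=0)
    n_a, n_b = a.shape[0], b.shape[0]
    # Pooled covariance of the two classes (an assumption of this sketch).
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))
    pooled = ((n_a - 1) * cov_a + (n_b - 1) * cov_b) / (n_a + n_b - 2)
    # The pseudo-inverse tolerates a singular covariance matrix.
    return float(np.sqrt(diff @ np.linalg.pinv(pooled) @ diff))

def best_subset(healthy, faulty, size):
    """Exhaustive stand-in for the genetic search: best subset of a given size."""
    n_features = healthy.shape[1]
    return max(
        combinations(range(n_features), size),
        key=lambda s: mahalanobis_separation(healthy, faulty, s),
    )

# Synthetic "spectral amplitude" data: only feature 2 differs between states.
rng = np.random.default_rng(0)
healthy_demo = rng.normal(size=(50, 4))
faulty_demo = rng.normal(size=(50, 4))
faulty_demo[:, 2] += 5.0
chosen = best_subset(healthy_demo, faulty_demo, size=1)
```

A genetic algorithm would use the same distance as its fitness function but would explore subsets by mutation and crossover rather than enumerating them all.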

