Collective feature selection to identify crucial epistatic variants

2018
Author(s): Shefali S. Verma, Anastasia Lucas, Xinyuan Zhang, Yogasudha Veturi, Scott Dudek, ...

Abstract

Background: Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex diseases and traits. Detecting epistatic interactions remains a challenge because of the large number of features and the relatively small sample size of the input, which leads to the so-called “short fat data” problem. The efficiency of machine learning methods can be increased by limiting the number of input features, so it is very important to perform variable selection before searching for epistasis. Many methods have been evaluated and proposed for feature selection, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach.

Results: Through our simulation study we propose a collective feature selection approach that selects features in the “union” of the best performing methods. We explored various parametric, non-parametric, and data mining approaches to feature selection. We choose our top performing methods and take the union of the resulting variables, based on a user-defined percentage of variants selected from each method, to downstream analysis. Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criterion for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and Gradient boosting, work best under other simulation criteria. Thus, a collective approach proves more beneficial for selecting variables with epistatic effects, including in low effect size datasets and across different genetic architectures. Following this, we applied our proposed collective feature selection approach to select the top 1% of variables and identify potential interacting variables associated with Body Mass Index (BMI) in ~44,000 samples obtained from Geisinger’s MyCode Community Health Initiative (on behalf of DiscovEHR collaboration).

Conclusions: In this study, we showed via simulation studies that selecting variables with a collective feature selection approach could help select true positive epistatic variables more frequently than applying any single feature selection method. We demonstrated the effectiveness of collective feature selection along with a comparison of many methods in our simulation analysis. We also applied our method to identify non-linear networks associated with obesity.
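The union step described in the abstract can be illustrated with a minimal sketch. It assumes each method has already produced one importance score per variant; the function names, method labels, and scores below are hypothetical and are not taken from the paper, which does not specify an implementation here.

```python
from typing import Dict, List, Set


def top_fraction(scores: List[float], fraction: float) -> Set[int]:
    """Return the indices of the highest-scoring features for one method."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one feature; the rounding rule is an assumption of this sketch.
    keep = max(1, round(fraction * len(scores)))
    # Rank indices by descending score; ties go to the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(ranked[:keep])


def collective_selection(scores_by_method: Dict[str, List[float]], fraction: float) -> List[int]:
    """Union of the top-`fraction` features chosen independently by each method."""
    selected: Set[int] = set()
    for scores in scores_by_method.values():
        selected |= top_fraction(scores, fraction)
    return sorted(selected)


# Hypothetical importance scores for eight variants from three methods.
example_scores = {
    "method_a": [0.9, 0.1, 0.2, 0.3, 0.0, 0.1, 0.05, 0.8],
    "method_b": [0.1, 0.8, 0.2, 0.1, 0.0, 0.3, 0.05, 0.7],
    "method_c": [0.7, 0.2, 0.1, 0.0, 0.6, 0.1, 0.05, 0.3],
}
selected_variants = collective_selection(example_scores, fraction=0.25)
print(selected_variants)  # [0, 1, 4, 7]
```

Each method contributes its own top 25% (two variants here), and a variant favored by only one method, such as index 4, still reaches downstream analysis because the sets are combined by union rather than intersection.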

2020, Vol 10 (22), pp. 8093
Author(s): Jun Wang, Yuanyuan Xu, Hengpeng Xu, Zhe Sun, Zhenglu Yang, ...

Considerable effort has been devoted to feature selection as a means of dimension reduction for various machine learning tasks. Existing feature selection models focus on selecting the most discriminative features for learning targets. However, this strategy is weak in handling two kinds of features, the irrelevant and the redundant ones, which are collectively referred to as noisy features. These features may hamper the construction of optimal low-dimensional subspaces and compromise the learning performance of downstream tasks. In this study, we propose a novel multi-label feature selection approach by embedding label correlations (dubbed ELC) to address these issues. Specifically, we extract label correlations to obtain reliable label space structures and employ them to steer feature selection. In this way, label and feature spaces can be expected to be consistent, and noisy features can be effectively eliminated. An extensive experimental evaluation on public benchmarks validated the superiority of ELC.


Malware poses a serious threat to users. Security researchers have presented various solutions for efficient malware detection, while attackers devise evasion techniques to escape detection systems. The key challenge is that the volume of malware grows every hour, causing serious damage to users’ privacy. Training also takes much longer when unnecessary features are mined. Feature selection is effective in obtaining a distinctive feature set for detecting malware. In this paper, we propose a malware detection system that uses a hybrid feature selection approach to detect malware efficiently with a reduced feature set. Machine learning based classification is performed with eight classifiers on two malware datasets. The experiments were run both with and without feature selection. The empirical results show that classification using the selected feature set and the XGB classifier identifies malware efficiently, with accuracies of 98.9% and 99.26% on the two datasets.


2021, Vol 5 (4), pp. 395
Author(s): Muhammad Aqil Haqeemi Azmi, Cik Feresa Mohd Foozy, Khairul Amin Mohamad Sukri, Nurul Azma Abdullah, Isredza Rahmi A. Hamid, ...

Distributed Denial of Service (DDoS) attacks are dangerous attacks that can disrupt a server, system, or application layer. Such an attack floods the target server with more Internet traffic than it can handle at one time, so the server may stop working. The possibility of such attacks makes the network environment insecure. In recent years, the number of cases related to DDoS attacks has increased, and although there has been a lot of research on DDoS attacks, they still occur. Therefore, this research applies a feature selection approach to detect DDoS attacks using machine learning techniques. In this paper, features are selected from the UNSW-NB 15 dataset using the Information Gain and Data Reduction methods. ANN, Naïve Bayes, and Decision Table algorithms are then used to classify the data, with the selected features, into attack and normal classes. The results are evaluated using the Accuracy, Precision, True Positive, and False Positive measures. Good features were obtained based on the experiments. To check whether the selected features are good, the classification results are compared with past research that used the same UNSW-NB 15 dataset. In conclusion, the accuracy of the ANN, Naïve Bayes, and Decision Table classifiers increased with this feature selection approach compared to the past research.
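As a rough illustration of ranking features by information gain, the sketch below computes the standard entropy-based measure for discrete features. It is not the paper's procedure, which the abstract does not detail; the feature names and toy records are hypothetical and are not drawn from the UNSW-NB 15 dataset.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature: Sequence[str], labels: Sequence[str]) -> float:
    """Reduction in label entropy after splitting the records on a discrete feature."""
    total = len(labels)
    groups: Dict[str, List[str]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy records: two discrete features and a class label.
labels = ["attack", "attack", "normal", "normal"]
toy_features = {
    "feature_informative": ["x", "x", "y", "y"],    # separates the classes perfectly
    "feature_uninformative": ["p", "q", "p", "q"],  # carries no class information
}

gains = {name: information_gain(values, labels) for name, values in toy_features.items()}
# Rank features from most to least informative.
ranked = sorted(gains, key=lambda name: gains[name], reverse=True)
print(ranked)
```

Features with the highest gain would be retained and passed to the classifiers; where to cut the ranking is a separate choice that this sketch leaves open.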

