A Study of Feature Selection and Dimensionality Reduction Methods for Classification-Based Phishing Detection System

2021 ◽  
Vol 11 (1) ◽  
pp. 1-35
Author(s):  
Amit Singh ◽  
Abhishek Tiwari

Phishing was introduced in 1996, and now phishing is the biggest cybercrime challenge. Phishing is an abstract way to deceive users over the internet. Purpose of phishers is to extract the sensitive information of the user. Researchers have been working on solutions of phishing problem, but the parallel evolution of cybercrime techniques have made it a tough nut to crack. Recently, machine learning-based solutions are widely adopted to tackle the menace of phishing. This survey paper studies various feature selection method and dimensionality reduction methods and sees how they perform with machine learning-based classifier. The selection of features is vital for developing a good performance machine learning model. This work is comparing three broad categories of feature selection methods, namely filter, wrapper, and embedded feature selection methods, to reduce the dimensionality of data. The effectiveness of these methods has been assessed on several machine learning classifiers using k-fold cross-validation score, accuracy, precision, recall, and time.

Symmetry ◽  
2020 ◽  
Vol 12 (9) ◽  
pp. 1424 ◽  
Author(s):  
Saleh Alabdulwahab ◽  
BongKyo Moon

The detection accuracy and model building time of machine learning (ML) classifiers are vital aspects for an intrusion detection system (IDS) to predict attacks in real life. Recently, researchers have introduced feature selection methods to increase the detection accuracy and minimize the model building time of a limited number of ML classifiers. Therefore, identifying more ML classifiers with very high detection accuracy and the lowest possible model building time is necessary. In this study, the authors tested six supervised classifiers on a full NSL-KDD training dataset (a benchmark record for Internet traffic) using 10-fold cross-validation in the Weka tool with and without feature selection/reduction methods. The authors aimed to identify more options to outperform and secure classifiers with the highest detection accuracy and lowest model building time. The results show that the feature selection/reduction methods, including the wrapper method in combination with the discretize filter, the filter method in combination with the discretize filter, and the discretize filter, can significantly decrease model building time without compromising detection accuracy. The suggested ML algorithms and feature selection/reduction methods are automated pattern recognition approaches to detect network attacks, which are within the scope of the Symmetry journal.


Electronics ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. 2114
Author(s):  
Murtaza Ahmed Siddiqi ◽  
Wooguil Pak

In recent times, with the advancement in technology and revolution in digital information, networks generate massive amounts of data. Due to the massive and rapid transmission of data, keeping up with security requirements is becoming more challenging. Machine learning (ML)-based intrusion detection systems (IDSs) are considered as one of the most suitable solutions for big data security. Despite the progress in ML, unrelated features can drastically influence the performance of an IDS. Feature selection plays a significant role in improving ML-based IDSs. However, the recent growth of dimensionality in data poses quite a challenge for current feature selection and extraction methods. Due to high data dimensionality, feature selection methods suffer in terms of efficiency and effectiveness. In this paper, we are introducing a new process flow for filter-based feature selection with the help of a transformation technique. Generally, normalization or transformation is implemented before classification. In our proposed model, we implemented and evaluated the effects of normalization before feature selection. To present a clear analysis on the effects of power transformation, five different transformations were implemented and evaluated. Furthermore, we implemented and compared different feature selection methods with the proposed process flow. Results show that compared with existing process flow and feature selection methods, our proposed process flow for feature selection can locate a more relevant set of features with high efficiency and accuracy.


2021 ◽  
Vol 11 ◽  
Author(s):  
Qi Wan ◽  
Jiaxuan Zhou ◽  
Xiaoying Xia ◽  
Jianfeng Hu ◽  
Peng Wang ◽  
...  

ObjectiveTo evaluate the performance of 2D and 3D radiomics features with different machine learning approaches to classify SPLs based on magnetic resonance(MR) T2 weighted imaging (T2WI).Material and MethodsA total of 132 patients with pathologically confirmed SPLs were examined and randomly divided into training (n = 92) and test datasets (n = 40). A total of 1692 3D and 1231 2D radiomics features per patient were extracted. Both radiomics features and clinical data were evaluated. A total of 1260 classification models, comprising 3 normalization methods, 2 dimension reduction algorithms, 3 feature selection methods, and 10 classifiers with 7 different feature numbers (confined to 3–9), were compared. The ten-fold cross-validation on the training dataset was applied to choose the candidate final model. The area under the receiver operating characteristic curve (AUC), precision-recall plot, and Matthews Correlation Coefficient were used to evaluate the performance of machine learning approaches.ResultsThe 3D features were significantly superior to 2D features, showing much more machine learning combinations with AUC greater than 0.7 in both validation and test groups (129 vs. 11). The feature selection method Analysis of Variance(ANOVA), Recursive Feature Elimination(RFE) and the classifier Logistic Regression(LR), Linear Discriminant Analysis(LDA), Support Vector Machine(SVM), Gaussian Process(GP) had relatively better performance. The best performance of 3D radiomics features in the test dataset (AUC = 0.824, AUC-PR = 0.927, MCC = 0.514) was higher than that of 2D features (AUC = 0.740, AUC-PR = 0.846, MCC = 0.404). The joint 3D and 2D features (AUC=0.813, AUC-PR = 0.926, MCC = 0.563) showed similar results as 3D features. Incorporating clinical features with 3D and 2D radiomics features slightly improved the AUC to 0.836 (AUC-PR = 0.918, MCC = 0.620) and 0.780 (AUC-PR = 0.900, MCC = 0.574), respectively.ConclusionsAfter algorithm optimization, 2D feature-based radiomics models yield favorable results in differentiating malignant and benign SPLs, but 3D features are still preferred because of the availability of more machine learning algorithmic combinations with better performance. Feature selection methods ANOVA and RFE, and classifier LR, LDA, SVM and GP are more likely to demonstrate better diagnostic performance for 3D features in the current study.


Malware is a serious threat to individuals and users. The security researchers present various solutions, striving to achieve efficient malware detection. Malware attackers devise detection avoidance techniques to escape from detection systems. The key challenge is that growth of malware increases every hour, leading to large damages to users’ privacy. The training process takes much longer time, mining the unnecessary features. Feature Selection is effective in achieving unique feature set in detecting malware. In this paper, we propose a malware detection system using hybrid feature selection approach to detect malware efficiently with a reduced feature set. Machine learning based classification is performed on eight classifiers with two malware datasets. The experiments were done without and with feature selection. The empirical results show that the classification using selected feature set and XGB classifier identifies malware efficiently with an accuracy of 98.9% and 99.26% for the two datasets.


Author(s):  
Jing Tang ◽  
Minjie Mou ◽  
Yunxia Wang ◽  
Yongchao Luo ◽  
Feng Zhu

Abstract Metaproteomics suffers from the issues of dimensionality and sparsity. Data reduction methods can maximally identify the relevant subset of significant differential features and reduce data redundancy. Feature selection (FS) methods were applied to obtain the significant differential subset. So far, a variety of feature selection methods have been developed for metaproteomic study. However, due to FS’s performance depended heavily on the data characteristics of a given research, the well-suitable feature selection method must be carefully selected to obtain the reproducible differential proteins. Moreover, it is critical to evaluate the performance of each FS method according to comprehensive criteria, because the single criterion is not sufficient to reflect the overall performance of the FS method. Therefore, we developed an online tool named MetaFS, which provided 13 types of FS methods and conducted the comprehensive evaluation on the complex FS methods using four widely accepted and independent criteria. Furthermore, the function and reliability of MetaFS were systematically tested and validated via two case studies. In sum, MetaFS could be a distinguished tool for discovering the overall well-performed FS method for selecting the potential biomarkers in microbiome studies. The online tool is freely available at https://idrblab.org/metafs/.


2021 ◽  
Vol 11 (7) ◽  
pp. 3273
Author(s):  
Joana Morgado ◽  
Tania Pereira ◽  
Francisco Silva ◽  
Cláudia Freitas ◽  
Eduardo Negrão ◽  
...  

The evolution of personalized medicine has changed the therapeutic strategy from classical chemotherapy and radiotherapy to a genetic modification targeted therapy, and although biopsy is the traditional method to genetically characterize lung cancer tumor, it is an invasive and painful procedure for the patient. Nodule image features extracted from computed tomography (CT) scans have been used to create machine learning models that predict gene mutation status in a noninvasive, fast, and easy-to-use manner. However, recent studies have shown that radiomic features extracted from an extended region of interest (ROI) beyond the tumor, might be more relevant to predict the mutation status in lung cancer, and consequently may be used to significantly decrease the mortality rate of patients battling this condition. In this work, we investigated the relation between image phenotypes and the mutation status of Epidermal Growth Factor Receptor (EGFR), the most frequently mutated gene in lung cancer with several approved targeted-therapies, using radiomic features extracted from the lung containing the nodule. A variety of linear, nonlinear, and ensemble predictive classification models, along with several feature selection methods, were used to classify the binary outcome of wild-type or mutant EGFR mutation status. The results show that a comprehensive approach using a ROI that included the lung with nodule can capture relevant information and successfully predict the EGFR mutation status with increased performance compared to local nodule analyses. Linear Support Vector Machine, Elastic Net, and Logistic Regression, combined with the Principal Component Analysis feature selection method implemented with 70% of variance in the feature set, were the best-performing classifiers, reaching Area Under the Curve (AUC) values ranging from 0.725 to 0.737. This approach that exploits a holistic analysis indicates that information from more extensive regions of the lung containing the nodule allows a more complete lung cancer characterization and should be considered in future radiogenomic studies.


Author(s):  
Samrat Kumar Dey ◽  
Md. Mahbubur Rahman

Recent advancements in Software Defined Networking (SDN) makes it possible to overcome the management challenges of traditional network by logically centralizing control plane and decoupling it from forwarding plane. Through centralized controllers, SDN can prevent security breach, but it also brings in new threats and vulnerabilities. Central controller can be a single point of failure. Hence, flow-based anomaly detection system in OpenFlow Controller can secure SDN to a great extent. In this paper, we investigated two different approaches of flow-based intrusion detection system in OpenFlow Controller. The first of which is based on machine-learning algorithm where NSL-KDD dataset with feature selection ensures the accuracy of 82% with Random Forest classifier using Gain Ratio feature selection evaluator. In the later phase, the second approach is combined with Gated Recurrent Unit Long Short-Term Memory based intrusion detection model based on Deep Neural Network (DNN) where we applied an appropriate ANOVA F-Test and Recursive Feature Elimination feature selection method to improve the classifier performance and achieved an accuracy of 88%. Substantial experiments with comparative analysis clearly show that, deep learning would be a better choice for intrusion detection in OpenFlow Controller.


2022 ◽  
Vol 10 (1) ◽  
pp. 0-0

In today's world, machine learning has become a vital part of our lives. When applied to real-world applications, machine learning encounters the difficulty of high dimensional data. Unnecessary and redundant features can be found in data. The performance of classification algorithms employed in prediction is harmed by these superfluous features. The primary step in developing any decision support system is to identify critical features. In this paper, authors have proposed a hybrid feature selection method CFGA by integrating CFS (Correlation feature selection) and GA (genetic algorithm). The efficiency of proposed method is analyzed using Logistic Regression classifier on the scale of accuracy, sensitivity, specificity, precision, F-measure and execution time parameters. Proposed CFGA method is also compared to six other feature selection methods. Results demonstrate that proposed method have increased the performance of the classification system by removing irrelevant and redundant features.


Electronics ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. 114
Author(s):  
Fitriani Muttakin ◽  
Jui-Tang Wang ◽  
Mulyanto Mulyanto ◽  
Jenq-Shiou Leu

Artificial intelligence, particularly machine learning, is the fastest-growing research trend in educational fields. Machine learning shows an impressive performance in many prediction models, including psychosocial education. The capability of machine learning to discover hidden patterns in large datasets encourages researchers to invent data with high-dimensional features. In contrast, not all features are needed by machine learning, and in many cases, high-dimensional features decrease the performance of machine learning. The feature selection method is one of the appropriate approaches to reducing the features to ensure machine learning works efficiently. Various selection methods have been proposed, but research to determine the essential subset feature in psychosocial education has not been established thus far. This research investigated and proposed methods to determine the best feature selection method in the domain of psychosocial education. We used a multi-criteria decision system (MCDM) approach with Additive Ratio Assessment (ARAS) to rank seven feature selection methods. The proposed model evaluated the best feature selection method using nine criteria from the performance metrics provided by machine learning. The experimental results showed that the ARAS is promising for evaluating and recommending the best feature selection method for psychosocial education data using the teacher’s psychosocial risk levels dataset.


Sign in / Sign up

Export Citation Format

Share Document