A Survey on Phishing Detection and The Importance of Feature Selection In Data Mining Classification Algorithms

Issue 4 - Journal of Science and Technology ◽

10.46243/jst.2020.v5.i6.pp11-18 ◽

2020 ◽

pp. 11-18

Keyword(s):

Data Mining ◽

Feature Selection ◽

Support Vector ◽

Classification Algorithms ◽

End User ◽

Preparation Methods ◽

Survey Paper ◽

Vector Machines ◽

Feature Selection Techniques ◽

Phishing Detection

In this era of the Internet, information security is a pressing concern. One of the main threats in the cyber world is the phishing attack, an email or website fraud method that targets a genuine webpage or email and compromises it without the consent of the end user. Various techniques help to classify whether a website or an email is legitimate or fake. The major contributors to the detection of these phishing frauds are the classification algorithms, the feature selection techniques or dataset preparation methods, and the feature extraction, which plays an important role in both detecting and preventing these attacks. This survey paper studies the effect of all these contributors and the approaches applied in recent papers. The classification algorithms implemented include Decision Tree, Random Forest, Support Vector Machines, Logistic Regression, Lazy K Star, Naive Bayes, and J48.

CLASSIFICATION OF HIGH-DIMENSIONAL MICROARRAY DATA WITH A TWO-STEP PROCEDURE VIA A WILCOXON CRITERION AND MULTILAYER PERCEPTRON

International Journal of Computational Intelligence and Applications ◽

10.1142/s1469026811002969 ◽

2011 ◽

Vol 10 (01) ◽

pp. 1-14

Author(s):

VLADIMIR NIKULIN ◽

TIAN-HSIANG HUANG ◽

GEOFFREY J. MCLACHLAN

Keyword(s):

Data Mining ◽

Feature Selection ◽

High Dimensional ◽

Second Step ◽

Support Vector ◽

Step Procedure ◽

Leave One Out ◽

Natural Combination ◽

Feature Selection Techniques

The method presented in this paper is novel as a natural combination of two mutually dependent steps. Feature selection is a key element (first step) in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier such as linear regression, support vector machine or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.
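The first step described above, ranking features by a Wilcoxon-type criterion before handing the survivors to a classifier, can be illustrated with a minimal sketch. It assumes the standard two-sample rank-sum statistic with a normal approximation and no tie correction; this is a stand-in and may differ from the authors' "separation type" criterion. The function names and the toy data are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 0-based positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_score(column, labels):
    """Absolute z-score of the class-1 rank sum (normal approximation)."""
    ranks = average_ranks(column)
    n1 = sum(1 for y in labels if y == 1)
    n0 = len(labels) - n1
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    mean = n1 * (n1 + n0 + 1) / 2
    std = sqrt(n1 * n0 * (n1 + n0 + 1) / 12)
    return abs(rank_sum - mean) / std

def select_features(X, labels, k):
    """Return the indices of the k features with the largest scores."""
    scores = [wilcoxon_score([row[j] for row in X], labels)
              for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Hypothetical toy data: 6 samples, 3 features, binary labels.
X = [
    [1.0, 5.0, 2.0],
    [2.0, 1.0, 1.0],
    [3.0, 4.0, 4.0],
    [7.0, 2.0, 3.0],
    [8.0, 6.0, 5.0],
    [9.0, 3.0, 6.0],
]
labels = [0, 0, 0, 1, 1, 1]
top_two = select_features(X, labels, 2)
```

In a leave-one-out evaluation, the selection would have to be repeated inside each fold on the training samples only, so that the held-out sample never influences which features are kept.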

Identification of Urdu Ghazal Poets using SVM

Mehran University Research Journal of Engineering and Technology ◽

10.22581/muet1982.1904.07 ◽

2020 ◽

Vol 38 (4) ◽

pp. 935-944

Author(s):

Nida Tariq ◽

Iqra Ijaz ◽

Muhammad Kamran Malik ◽

Zubair Malik ◽

Faisal Bukhari

Keyword(s):

Feature Selection ◽

Support Vector ◽

Sentence Structure ◽

Writing Style ◽

K Nearest Neighbors ◽

Chi Square ◽

Urdu Literature ◽

Vector Machines ◽

Two Factors ◽

Feature Selection Techniques

Urdu literature has a rich tradition of poetry, with many forms, one of which is the Ghazal. Urdu poetry structures are mainly of Arabic origin. Its sentence structure is complex and differs from everyday language, which makes it hard to classify. Our research focuses on identifying the poet when given ghazals as input. No one has previously done this type of work. The two main factors that help categorize and classify a given text are its content and writing style. Urdu poets such as Mirza Ghalib, Mir Taqi Mir, Iqbal, and many others each have a distinct writing style and topics of interest. Our model caters to these two factors and classifies ghazals using different classification models such as SVM (Support Vector Machines), Decision Tree, Random Forest, Naïve Bayes, and KNN (K-Nearest Neighbors). Furthermore, we have applied feature selection techniques such as the chi-square model and L1-based feature selection. For experimentation, we prepared a dataset of about 4000 ghazals. We also compared the accuracy of the different classifiers and report the best results for the collected dataset.
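Chi-square feature selection for text, as mentioned above, scores each term by how strongly its presence depends on the class (here, the poet). The following is a minimal sketch of the classical contingency-table form of the statistic. It is not the authors' implementation, and library versions of the chi-square score may be computed differently. The tokens and the labels `poet_a` and `poet_b` are hypothetical placeholders.

```python
from collections import Counter

def chi_square(docs, labels, term):
    """Chi-square statistic between presence of `term` and the class label.

    `docs` is a list of token sets; `labels` holds one class per document.
    """
    n = len(docs)
    class_counts = Counter(labels)
    present = Counter(y for d, y in zip(docs, labels) if term in d)
    n_present = sum(present.values())
    stat = 0.0
    for cls, n_cls in class_counts.items():
        # One cell for "term present" and one for "term absent" per class.
        cells = (
            (present[cls], n_present),
            (n_cls - present[cls], n - n_present),
        )
        for observed, row_total in cells:
            expected = row_total * n_cls / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical toy corpus: each document is the set of tokens in one poem.
docs = [
    {"rose", "wine"},
    {"wine", "night"},
    {"dawn", "self"},
    {"self", "eagle"},
]
labels = ["poet_a", "poet_a", "poet_b", "poet_b"]

vocabulary = sorted(set().union(*docs))
ranked_terms = sorted(vocabulary,
                      key=lambda t: chi_square(docs, labels, t),
                      reverse=True)
```

Terms that appear only in one poet's documents ("wine", "self" in the toy corpus) receive the highest scores, while terms seen once score lower; keeping only the top-ranked terms is the selection step.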

Analysis of the Inducing Factors Involved in Stem Cell Differentiation Using Feature Selection Techniques, Support Vector Machines and Decision Trees

Trends in Applied Intelligent Systems - Lecture Notes in Computer Science ◽

10.1007/978-3-642-13022-9_30 ◽

2010 ◽

pp. 294-305

Author(s):

A. M. Trujillo ◽

Ignacio Rojas ◽

Héctor Pomares ◽

A. Prieto ◽

B. Prieto ◽

...

Keyword(s):

Feature Selection ◽

Stem Cell ◽

Cell Differentiation ◽

Support Vector Machines ◽

Decision Trees ◽

Stem Cell Differentiation ◽

Support Vector ◽

Vector Machines ◽

Inducing Factors ◽

Feature Selection Techniques

Comparative Study on Heart Disease Prediction Using Feature Selection Techniques on Classification Algorithms

Applied Computational Intelligence and Soft Computing ◽

10.1155/2021/5581806 ◽

2021 ◽

Vol 2021 ◽

pp. 1-17

Author(s):

Kaushalya Dissanayake ◽

Md Gapar Md Johar

Keyword(s):

Feature Selection ◽

Heart Disease ◽

Decision Tree ◽

Recursive Feature Elimination ◽

Support Vector ◽

Classification Algorithms ◽

Feature Subset ◽

Decision Tree Classifier ◽

Tree Classifier ◽

Feature Selection Techniques

Heart disease is recognized as one of the leading causes of death worldwide. Biomedical instruments and various hospital systems hold massive quantities of clinical data, so understanding the data related to heart disease is very important for improving prediction accuracy. This article presents an experimental evaluation of the performance of models created using classification algorithms and relevant features selected using various feature selection approaches. In the exploratory analysis, ten feature selection techniques (ANOVA, Chi-square, mutual information, ReliefF, forward feature selection, backward feature selection, exhaustive feature selection, recursive feature elimination, Lasso regression, and Ridge regression) and six classification approaches (decision tree, random forest, support vector machine, K-nearest neighbor, logistic regression, and Gaussian naive Bayes) were applied to the Cleveland heart disease dataset. The feature subset selected by the backward feature selection technique achieved the highest classification accuracy of 88.52%, with a precision of 91.30%, sensitivity of 80.76%, and F-measure of 85.71%, using the decision tree classifier.
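The four reported metrics are all functions of a single confusion matrix, which the short sketch below makes explicit. The counts used here are hypothetical: the abstract does not give a confusion matrix, and these values were chosen only because they reproduce the reported figures to within rounding.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, sensitivity (recall), and F-measure."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        # F-measure is the harmonic mean of precision and sensitivity.
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical counts, not taken from the paper.
metrics = classification_metrics(tp=21, fp=2, fn=5, tn=33)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The F-measure can be checked directly from the reported precision and sensitivity: 2 × 0.9130 × 0.8076 / (0.9130 + 0.8076) ≈ 0.857, which agrees with the stated 85.71%.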

Early Detection of Red Palm Weevil, Rhynchophorus ferrugineus (Olivier), Infestation Using Data Mining

Plants ◽

10.3390/plants10010095 ◽

2021 ◽

Vol 10 (1) ◽

pp. 95

Author(s):

Heba Kurdi ◽

Amal Al-Aldawsari ◽

Isra Al-Turaiki ◽

Abdulrahman S. Aldawood

Keyword(s):

Data Mining ◽

Plant Size ◽

Support Vector ◽

Classification Algorithms ◽

Palm Tree ◽

Rhynchophorus Ferrugineus ◽

Red Palm Weevil ◽

Palm Weevil ◽

Using Data ◽

F Measure

In the past 30 years, the red palm weevil (RPW), Rhynchophorus ferrugineus (Olivier), a pest that is highly destructive to all types of palms, has rapidly spread worldwide. However, detecting infestation with the RPW is highly challenging because symptoms are not visible until the death of the palm tree is inevitable. In addition, the use of automated RPW identification tools to predict infestation is complicated by a lack of RPW datasets. In this study, we assessed the capability of 10 state-of-the-art data mining classification algorithms, Naive Bayes (NB), KSTAR, AdaBoost, bagging, PART, J48 decision tree, multilayer perceptron (MLP), support vector machine (SVM), random forest, and logistic regression, to use plant-size and temperature measurements collected from individual trees to predict RPW infestation in its early stages, before significant damage is caused to the tree. The performance of the classification algorithms was evaluated in terms of accuracy, precision, recall, and F-measure using a real RPW dataset. The experimental results showed that infestations with RPW can be predicted using data mining with an accuracy of up to 93%, precision above 87%, recall equal to 100%, and F-measure greater than 93%. Additionally, we found that temperature and circumference are the most important features for predicting RPW infestation. However, we strongly call for collecting and aggregating more RPW datasets to run more experiments, validate these results, and provide more conclusive findings.

Minimax feature selection problem for constructing a classifier using support vector machines

Computational Mathematics and Mathematical Physics ◽

10.1134/s0965542510050143 ◽

2010 ◽

Vol 50 (5) ◽

pp. 917-925

Author(s):

Yu. V. Goncharov

Keyword(s):

Feature Selection ◽

Support Vector Machines ◽

Selection Problem ◽

Support Vector ◽

Feature Selection Problem ◽

Vector Machines

Evaluation of Feature Selection Techniques in a Multifrequency Large Amplitude Pulse Voltammetric Electronic Tongue

Engineering Proceedings ◽

10.3390/ecsa-7-08242 ◽

2020 ◽

Vol 2 (1) ◽

pp. 62

Author(s):

Luis F. Villamil-Cubillos ◽

Jersson X. Leon-Medina ◽

Maribel Anaya ◽

Diego A. Tibaduiza

Keyword(s):

Feature Selection ◽

Large Amplitude ◽

Electronic Tongue ◽

Supervised Machine Learning ◽

Recursive Feature Elimination ◽

Support Vector ◽

Variance Filter ◽

Supervised Machine Learning Classifiers ◽

Voltammetric Electronic Tongue ◽

Feature Selection Techniques

An electronic tongue is a device composed of a sensor array that takes advantage of the cross-sensitivity of several sensors to classify and quantify liquid substances. In practice, electronic tongues generate a large amount of information that needs to be analyzed correctly to determine which interactions and features are most relevant for distinguishing one substance from another. This work focuses on implementing and validating feature selection methodologies in the liquid classification process of a multifrequency large amplitude pulse voltammetric (MLAPV) electronic tongue. A multi-layer perceptron neural network (MLP NN) and a support vector machine (SVM) were used as supervised machine learning classifiers. Different feature selection techniques were used, such as variance filter, ANOVA F-value, Recursive Feature Elimination, and model-based selection. Both 5-fold cross-validation and GridSearchCV were used to evaluate the performance of the feature selection methodology by testing various configurations and determining the best one. The methodology was validated on an imbalanced MLAPV electronic tongue dataset of 13 different liquid substances, reaching a classification accuracy of 93.85%.
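Of the techniques listed above, the variance filter is the simplest: it discards features whose values barely change across samples, on the grounds that they cannot help separate the classes. The sketch below is a minimal stdlib illustration under that assumption, not the authors' pipeline. The function name, the threshold values, and the toy readings are hypothetical.

```python
from statistics import pvariance

def variance_filter(X, threshold=0.0):
    """Return indices of columns whose population variance exceeds `threshold`."""
    columns = list(zip(*X))  # transpose rows into per-feature columns
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]

# Hypothetical sensor readings: 4 samples, 3 features.
readings = [
    [0.1, 5.0, 1.0],
    [0.1, 7.0, 1.1],
    [0.1, 6.0, 0.9],
    [0.1, 8.0, 1.0],
]

kept_default = variance_filter(readings)                 # drops the constant column
kept_strict = variance_filter(readings, threshold=0.01)  # also drops the near-constant one
```

Because the filter never looks at the class labels, it is unsupervised and cheap, which is why it is often applied before label-aware methods such as the ANOVA F-value or Recursive Feature Elimination. Variance depends on the scale of each feature, so the threshold only makes sense when features are on comparable scales.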

Proposed methodology for creating a classification label: a case study on school performance

Gestão & Produção ◽

10.1590/0104-530x810-13 ◽

2016 ◽

Vol 23 (1) ◽

pp. 177-191

Author(s):

Anderson Roges Teixeira Góes ◽

Maria Teresinha Arns Steiner

Keyword(s):

Data Mining ◽

Support Vector Machines ◽

Knowledge Discovery ◽

Knowledge Discovery In Databases ◽

Support Vector ◽

Vector Machines

Quality in education has been the subject of much discussion, whether in schools and among their managers, in the media, or in the literature. However, a deeper analysis of the literature does not seem to reveal techniques that explore databases in order to obtain classifications of school performance, nor is there a consensus on what “educational quality” is. In this context, this article proposes a methodology that fits within the KDD (Knowledge Discovery in Databases) process for comparatively classifying the performance of educational institutions, based on the scores obtained in the Prova Brasil, one of the components of the Índice de Desenvolvimento da Educação Básica (IDEB, the Basic Education Development Index) in Brazil. To illustrate the methodology, it was applied to the municipal public schools of Araucária, PR, in the metropolitan region of Curitiba, PR, 17 in total, which at the time of the research offered elementary education (Ensino Fundamental), considering the scores obtained by all students in the early years (1st to 5th year of elementary education) and the final years (6th to 9th year of elementary education). In the Data Mining stage, the main stage of the KDD process, three techniques were used comparatively for Pattern Recognition: Artificial Neural Networks, Support Vector Machines, and Genetic Algorithms. These techniques produced satisfactory results in classifying the schools, represented by means of a “Performance Classification Label”. With this label, education managers will have a better basis for deciding which measures to adopt for each school and can define more clearly the targets to be met.

High dimensional data classification and feature selection using support vector machines

European Journal of Operational Research ◽

10.1016/j.ejor.2017.08.040 ◽

2018 ◽

Vol 265 (3) ◽

pp. 993-1004 ◽

Cited By ~ 63

Author(s):

Bissan Ghaddar ◽

Joe Naoum-Sawaya

Keyword(s):

Feature Selection ◽

Support Vector Machines ◽

High Dimensional Data ◽

Data Classification ◽

High Dimensional ◽

Support Vector ◽

Vector Machines

Radar Emitter Signal Recognition Based on Feature Selection and Support Vector Machines

Lecture Notes in Computer Science - Advances in Intelligent Computing ◽

10.1007/11538059_74 ◽

2005 ◽

pp. 707-716 ◽

Cited By ~ 2

Author(s):

Gexiang Zhang ◽

Zhexin Cao ◽

Yajun Gu ◽

Weidong Jin ◽

Laizhao Hu

Keyword(s):

Feature Selection ◽

Support Vector Machines ◽

Support Vector ◽

Signal Recognition ◽

Vector Machines
