A Survey on Phishing Detection and The Importance of Feature Selection In Data Mining Classification Algorithms

In this era of the Internet, information security is a major concern. One of the main threats is the phishing attack, a method of email or website fraud that targets a genuine webpage or email and compromises it without the end user's consent. Various techniques help classify whether a website or an email is legitimate or fake. The major contributors to detecting these frauds include classification algorithms, feature selection techniques or dataset preparation methods, and feature extraction, which plays an important role in both detecting and preventing these attacks. This survey studies the effect of these contributors and the approaches applied in recent papers. The classification algorithms implemented include Decision Tree, Random Forest, Support Vector Machines, Logistic Regression, Lazy K Star, Naive Bayes, and J48.

Author(s):  
VLADIMIR NIKULIN ◽  
TIAN-HSIANG HUANG ◽  
GEOFFREY J. MCLACHLAN

The method presented in this paper is novel as a natural combination of two mutually dependent steps. Feature selection is a key element (first step) in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier such as linear regression, support vector machine or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.
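As a rough illustration of ranking features with a Wilcoxon-type criterion, the sketch below scores each feature by how far its Mann-Whitney U statistic (the pairwise form of the Wilcoxon rank-sum statistic) departs from chance. The abstract does not define its "separation type" criterion, so the scoring formula, the function names, and the toy data are all assumptions made for illustration.

```python
from typing import List, Sequence


def rank_sum_separation(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Return |U / (n_pos * n_neg) - 0.5| for one feature.

    U counts the (pos, neg) pairs in which the positive-class value is larger,
    with ties counted as one half. A score of 0.5 means the classes are
    perfectly separated on this feature; 0.0 means no separation.
    (Assumed scoring rule, not the one from the paper.)
    """
    u = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                u += 1.0
            elif p == n:
                u += 0.5
    return abs(u / (len(pos) * len(neg)) - 0.5)


def select_top_features(X: List[List[float]], y: List[int], k: int) -> List[int]:
    """Return the indices of the k features with the highest separation score."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(rank_sum_separation(pos, neg))
    # Sort by descending score; the sort is stable, so ties keep the lower index.
    return sorted(range(n_features), key=lambda j: -scores[j])[:k]


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.9, 5.0], [1.1, 1.0], [1.2, 4.0]]
y = [0, 0, 0, 1, 1, 1]
top = select_top_features(X, y, k=1)
print(top)
```

The selected columns would then be passed to whichever classifier forms the second step.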


Author(s):  
Nida Tariq ◽  
Iqra Ijaz ◽  
Muhammad Kamran Malik ◽  
Zubair Malik ◽  
Faisal Bukhari

Urdu literature has a rich tradition of poetry with many forms, one of which is the ghazal. Urdu poetic structures are mainly of Arabic origin. Poetry has a complex sentence structure that differs from everyday language, which makes it hard to classify. Our research focuses on identifying the poet of a given ghazal. To our knowledge, no previous work has addressed this task. Two main factors help categorize and classify a given text: its content and its writing style. Urdu poets such as Mirza Ghalib, Mir Taqi Mir, Iqbal, and many others each have a distinct writing style and topics of interest. Our model accounts for both factors and classifies ghazals using several classification models: SVM (Support Vector Machines), Decision Tree, Random Forest, Naïve Bayes, and KNN (K-Nearest Neighbors). We also applied feature selection techniques such as chi-square and L1-based feature selection. For experimentation, we prepared a dataset of about 4000 ghazals. We compared the accuracy of the classifiers and report the best results for the collected dataset.
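To make the chi-square step concrete, the sketch below scores a single term against a single poet using a 2x2 contingency table (term present or absent, poem by that poet or not). The abstract does not describe the authors' tokenization or implementation, so the function name, the one-vs-rest setup, and the placeholder tokens `w1` to `w4` are assumptions.

```python
from typing import List


def chi_square_term_score(docs: List[List[str]], labels: List[str],
                          term: str, target: str) -> float:
    """Chi-square statistic of the table (term present?) x (label == target?)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        is_target = label == target
        if present and is_target:
            a += 1          # term present, poem by the target poet
        elif present:
            b += 1          # term present, poem by another poet
        elif is_target:
            c += 1          # term absent, poem by the target poet
        else:
            d += 1          # term absent, poem by another poet
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0          # a constant row or column carries no information
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical tokenized poems; "w1".."w4" stand in for real words.
docs = [["w1", "w2"], ["w1", "w3"], ["w2", "w4"], ["w3", "w4"]]
labels = ["poet_a", "poet_a", "poet_b", "poet_b"]

score_w1 = chi_square_term_score(docs, labels, "w1", "poet_a")  # only in poet_a
score_w2 = chi_square_term_score(docs, labels, "w2", "poet_a")  # evenly spread
print(score_w1, score_w2)
```

Terms with high scores are more strongly associated with one poet and are the natural candidates to keep as features.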


2021 ◽  
Vol 2021 ◽  
pp. 1-17
Author(s):  
Kaushalya Dissanayake ◽  
Md Gapar Md Johar

Heart disease is recognized as one of the leading causes of death worldwide. Biomedical instruments and hospital systems hold massive quantities of clinical data, so understanding the data related to heart disease is very important for improving prediction accuracy. This article experimentally evaluates the performance of models built with classification algorithms on relevant features selected by various feature selection approaches. Ten feature selection techniques (ANOVA, Chi-square, mutual information, ReliefF, forward feature selection, backward feature selection, exhaustive feature selection, recursive feature elimination, Lasso regression, and Ridge regression) and six classification approaches (decision tree, random forest, support vector machine, K-nearest neighbor, logistic regression, and Gaussian naive Bayes) were applied to the Cleveland heart disease dataset. The feature subset selected by backward feature selection achieved the highest classification accuracy of 88.52%, with precision of 91.30%, sensitivity of 80.76%, and F-measure of 85.71%, using the decision tree classifier.
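The four reported metrics are all derived from the same confusion matrix, which the sketch below makes explicit. The counts used here are hypothetical: they are not given in the abstract and were chosen only because they yield values close to the reported figures, to within rounding.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, precision, sensitivity (recall) and F-measure."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": f_measure,
    }


# Hypothetical confusion-matrix counts (an assumption, not data from the article).
metrics = classification_metrics(tp=21, fp=2, fn=5, tn=33)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Because precision exceeds sensitivity here, the model in this illustration misses more positive cases than it raises false alarms, which matters when the positive class is a disease.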


Plants ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 95
Author(s):  
Heba Kurdi ◽  
Amal Al-Aldawsari ◽  
Isra Al-Turaiki ◽  
Abdulrahman S. Aldawood

In the past 30 years, the red palm weevil (RPW), Rhynchophorus ferrugineus (Olivier), a pest that is highly destructive to all types of palms, has rapidly spread worldwide. However, detecting infestation with the RPW is highly challenging because symptoms are not visible until the death of the palm tree is inevitable. In addition, the use of automated RPW identification tools to predict infestation is complicated by a lack of RPW datasets. In this study, we assessed the capability of 10 state-of-the-art data mining classification algorithms, Naive Bayes (NB), KSTAR, AdaBoost, bagging, PART, J48 Decision tree, multilayer perceptron (MLP), support vector machine (SVM), random forest, and logistic regression, to use plant-size and temperature measurements collected from individual trees to predict RPW infestation in its early stages, before significant damage is caused to the tree. The performance of the classification algorithms was evaluated in terms of accuracy, precision, recall, and F-measure using a real RPW dataset. The experimental results showed that infestations with RPW can be predicted with an accuracy of up to 93%, precision above 87%, recall of 100%, and F-measure greater than 93% using data mining. Additionally, we found that temperature and circumference are the most important features for predicting RPW infestation. However, we strongly call for collecting and aggregating more RPW datasets to run more experiments to validate these results and provide more conclusive findings.
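The reported precision, recall, and F-measure can be checked against each other with the standard definition of the F-measure as the harmonic mean of precision $P$ and recall $R$. The short derivation below uses only the figures stated in the abstract.

```latex
F = \frac{2PR}{P + R}
\qquad\Longrightarrow\qquad
F\big|_{R=1} = \frac{2P}{1 + P}

% 2P/(1+P) is increasing in P, so P > 0.87 implies
F > \frac{2 \times 0.87}{1 + 0.87} = \frac{1.74}{1.87} \approx 0.930
```

A precision above 87% with a recall of 100% therefore implies an F-measure above 93%, which agrees with the value reported.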


2020 ◽  
Vol 2 (1) ◽  
pp. 62
Author(s):  
Luis F. Villamil-Cubillos ◽  
Jersson X. Leon-Medina ◽  
Maribel Anaya ◽  
Diego A. Tibaduiza

An electronic tongue is a device composed of a sensor array that takes advantage of the cross-sensitivity of several sensors to classify and quantify liquid substances. In practice, electronic tongues generate a large amount of information that must be analyzed correctly to determine which interactions and features are most relevant for distinguishing one substance from another. This work focuses on implementing and validating feature selection methodologies in the liquid classification process of a multifrequency large amplitude pulse voltammetric (MLAPV) electronic tongue. A multi-layer perceptron neural network (MLP NN) and a support vector machine (SVM) were used as supervised machine learning classifiers. Several feature selection techniques were used: variance filter, ANOVA F-value, Recursive Feature Elimination, and model-based selection. Both 5-fold cross-validation and GridSearchCV were used to evaluate the performance of the feature selection methodology by testing various configurations and determining the best one. The methodology was validated on an imbalanced MLAPV electronic tongue dataset of 13 different liquid substances, reaching a classification accuracy of 93.85%.
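Two of the filters named above, the variance filter and the ANOVA F-value, are simple enough to sketch with the standard library alone. This is a minimal illustration under assumed names and hypothetical toy readings; it does not reproduce the authors' pipeline, their cross-validation, or their grid search.

```python
from statistics import mean, pvariance
from typing import Dict, List


def variance_filter(X: List[List[float]], threshold: float = 0.0) -> List[int]:
    """Return indices of features whose population variance exceeds threshold."""
    n_features = len(X[0])
    return [j for j in range(n_features)
            if pvariance([row[j] for row in X]) > threshold]


def anova_f_value(values: List[float], labels: List[str]) -> float:
    """One-way ANOVA F statistic of a single feature across the classes."""
    groups: Dict[str, List[float]] = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf")  # classes are perfectly separated with no spread
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical sensor readings: feature 0 is constant, feature 1 tracks the
# class, feature 2 varies but is unrelated to the class.
X = [[1.0, 1.0, 1.0], [1.0, 2.0, 2.0],
     [1.0, 5.0, 1.0], [1.0, 6.0, 2.0],
     [1.0, 9.0, 1.0], [1.0, 10.0, 2.0]]
labels = ["a", "a", "b", "b", "c", "c"]

kept = variance_filter(X)  # drops the constant feature
f_scores = {j: anova_f_value([row[j] for row in X], labels) for j in kept}
print(kept, f_scores)
```

The variance filter ignores the labels and only removes uninformative constants, while the F-value ranks the surviving features by how well they separate the classes.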


2016 ◽  
Vol 23 (1) ◽  
pp. 177-191
Author(s):  
Anderson Roges Teixeira Góes ◽  
Maria Teresinha Arns Steiner

The quality of education has been the subject of much discussion, whether in schools and among their administrators, in the media, or in the literature. However, a deeper analysis of the literature does not seem to reveal techniques that explore databases in order to obtain classifications of school performance, nor is there a consensus on what "educational quality" is. Given this context, this article proposes a methodology that fits within the KDD (Knowledge Discovery in Databases) process for classifying the performance of educational institutions comparatively, based on the scores obtained in Prova Brasil, one of the components of the Índice de Desenvolvimento da Educação Básica (IDEB, the Basic Education Development Index) in Brazil. To illustrate the methodology, it was applied to the municipal public schools of Araucária, PR, in the metropolitan region of Curitiba, PR, 17 in total, which at the time of the research offered elementary education (Ensino Fundamental), considering the scores obtained by all students in the early years (1st to 5th year of elementary education) and the final years (6th to 9th year of elementary education). In the Data Mining stage, the main stage of the KDD process, three techniques were used comparatively for pattern recognition: Artificial Neural Networks, Support Vector Machines, and Genetic Algorithms. These techniques produced satisfactory results in classifying the schools, represented by means of a "Performance Classification Label". With this label, educational administrators will have a better basis for defining the measures to be adopted for each school and can set the goals to be met more clearly.

