Comparison of feature selection techniques in classifying stroke documents

<span>The amount of digital biomedical literature grows that make most of the researchers facing the difficulties to manage and retrieve the required information from the Internet because this task is very challenging. The application of text classification on biomedical literature is one of the solutions in order to solve problem that have been faced by researchers but managing the high dimensionality of data being a common issue on text classification. Therefore, the aim of this research is to compare the techniques that could be used to select the relevant features for classifying biomedical text abstracts. This research focus on Pearson’s Correlation and Information Gain as feature selection techniques for reducing the high dimensionality of data. Towards this effort, we conduct and evaluate several experiments using 100 abstract of stroke documents that retrieved from PubMed database as datasets. This dataset underwent the text pre-processing that is crucial before proceed to feature selection phase. Features selection phase is involving Information Gain and Pearson Correlation technique. Support Vector Machine classifier is used in order to evaluate and compare the effectiveness of two feature selection techniques. For this dataset, Information Gain has outperformed Pearson’s Correlation by 3.3%. This research tends to extract the meaningful features from a subset of stroke documents that can be used for various application especially in diagnose the stroke disease.</span>

Download Full-text

Software Defect Prediction Based on Feature Subset Selection and Ensemble Classification

ECTI Transactions on Computer and Information Technology (ECTI-CIT) ◽

10.37936/ecti-cit.2020142.224489 ◽

2020 ◽

Vol 14 (2) ◽

pp. 213-228

Author(s):

Ahmad A Saifan ◽

Lina Abu-wardih

Keyword(s):

Feature Selection ◽

Ensemble Methods ◽

Feature Subset Selection ◽

Defect Prediction ◽

Software Defect Prediction ◽

Pearson’S Correlation ◽

Ensemble Models ◽

Software Defect ◽

Pearson's Correlation ◽

Feature Selection Techniques

Two primary issues have emerged in the machine learning and data mining community: how to deal with imbalanced data and how to choose appropriate features. These are of particular concern in the software engineering domain, and more specifically the field of software defect prediction. This research highlights a procedure which includes a feature selection technique to single out relevant attributes, and an ensemble technique to handle the class-imbalance issue. In order to determine the advantages of feature selection and ensemble methods we look at two potential scenarios: (1) Ensemble models constructed from the original datasets, without feature selection; (2) Ensemble models constructed from the reduced datasets after feature selection has been applied. Four feature selection techniques are employed: Principal Component Analysis (PCA), Pearson’s correlation, Greedy Stepwise Forward selection, and Information Gain (IG). The aim of this research is to assess the effectiveness of feature selection techniques using ensemble techniques. Five datasets, obtained from the PROMISE software depository, are analyzed; tentative results indicate that ensemble methods can improve the model's performance without the use of feature selection techniques. PCA feature selection and bagging based on K-NN perform better than both bagging based on SVM and boosting based on K-NN and SVM, and feature selection techniques including Pearson’s correlation, Greedy stepwise, and IG weaken the ensemble models’ performance.

Download Full-text

The Impact of Feature Selection Methods for Classifying Arabic Textual Data

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d7163.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 1333-1338

Keyword(s):

Feature Selection ◽

Text Classification ◽

Information Gain ◽

Feature Space ◽

Support Vector ◽

Selection Methods ◽

K Nearest Neighbors ◽

Chi Square ◽

Selection Algorithms ◽

The Impact

Text classification is a vital process due to the large volume of electronic articles. One of the drawbacks of text classification is the high dimensionality of feature space. Scholars developed several algorithms to choose relevant features from article text such as Chi-square (x2 ), Information Gain (IG), and Correlation (CFS). These algorithms have been investigated widely for English text, while studies for Arabic text are still limited. In this paper, we investigated four well-known algorithms: Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree against benchmark Arabic textual datasets, called Saudi Press Agency (SPA) to evaluate the impact of feature selection methods. Using the WEKA tool, we have experimented the application of the four mentioned classification algorithms with and without feature selection algorithms. The results provided clear evidence that the three feature selection methods often improves classification accuracy by eliminating irrelevant features.

Download Full-text

CLASSIFICATION OF HIGH-DIMENSIONAL MICROARRAY DATA WITH A TWO-STEP PROCEDURE VIA A WILCOXON CRITERION AND MULTILAYER PERCEPTRON

International Journal of Computational Intelligence and Applications ◽

10.1142/s1469026811002969 ◽

2011 ◽

Vol 10 (01) ◽

pp. 1-14

Author(s):

VLADIMIR NIKULIN ◽

TIAN-HSIANG HUANG ◽

GEOFFREY J. MCLACHLAN

Keyword(s):

Data Mining ◽

Feature Selection ◽

High Dimensional ◽

Second Step ◽

Support Vector ◽

Step Procedure ◽

Leave One Out ◽

Natural Combination ◽

Feature Selection Techniques

The method presented in this paper is novel as a natural combination of two mutually dependent steps. Feature selection is a key element (first step) in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier such as linear regression, support vector machine or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.

Download Full-text

Support Vector Machine VS Information Gain: Analisis Sentimen Cyberbullying di Twitter Indonesia

Jurnal ULTIMA InfoSys ◽

10.31937/si.v11i2.1740 ◽

2020 ◽

Vol 11 (2) ◽

pp. 107-111

Author(s):

Christevan Destitus ◽

Wella Wella ◽

Suryasari Suryasari

Keyword(s):

Support Vector Machine ◽

Feature Selection ◽

Text Mining ◽

Information Gain ◽

Text Processing ◽

Support Vector ◽

Term Weighting ◽

System Process ◽

Research Stage

This study aims to clarify tweets on twitter using the Support Vector Machine and Information Gain methods. The clarification itself aims to find a hyperplane that separates the negative and positive classes. In the research stage, there is a system process, namely text mining, text processing which has stages of tokenizing, filtering, stemming, and term weighting. After that, a feature selection is made by information gain which calculates the entropy value of each word. After that, clarify based on the features that have been selected and the output is in the form of identifying whether the tweet is bully or not. The results of this study found that the Support Vector Machine and Information Gain methods have sufficiently maximum results.

Download Full-text

Evaluation of Feature Selection Techniques in a Multifrequency Large Amplitude Pulse Voltammetric Electronic Tongue

Engineering Proceedings ◽

10.3390/ecsa-7-08242 ◽

2020 ◽

Vol 2 (1) ◽

pp. 62

Author(s):

Luis F. Villamil-Cubillos ◽

Jersson X. Leon-Medina ◽

Maribel Anaya ◽

Diego A. Tibaduiza

Keyword(s):

Feature Selection ◽

Large Amplitude ◽

Electronic Tongue ◽

Supervised Machine Learning ◽

Recursive Feature Elimination ◽

Support Vector ◽

Variance Filter ◽

Supervised Machine Learning Classifiers ◽

Voltammetric Electronic Tongue ◽

Feature Selection Techniques

An electronic tongue is a device composed of a sensor array that takes advantage of the cross sensitivity property of several sensors to perform classification and quantification in liquid substances. In practice, electronic tongues generate a large amount of information that needs to be correctly analyzed, to define which interactions and features are more relevant to distinguish one substance from another. This work focuses on implementing and validating feature selection methodologies in the liquid classification process of a multifrequency large amplitude pulse voltammetric (MLAPV) electronic tongue. Multi-layer perceptron neural network (MLP NN) and support vector machine (SVM) were used as supervised machine learning classifiers. Different feature selection techniques were used, such as Variance filter, ANOVA F-value, Recursive Feature Elimination and model-based selection. Both 5-fold Cross validation and GridSearchCV were used in order to evaluate the performance of the feature selection methodology by testing various configurations and determining the best one. The methodology was validated in an imbalanced MLAPV electronic tongue dataset of 13 different liquid substances, reaching a 93.85% of classification accuracy.

Download Full-text

Feature Selection using Genetic Programming

Zambia ICT Journal ◽

10.33260/zictjournal.v3i2.62 ◽

2019 ◽

Vol 3 (2) ◽

pp. 11-18

Author(s):

George Mweshi

Keyword(s):

Feature Selection ◽

Genetic Programming ◽

Information Gain ◽

Principal Component ◽

Search Space ◽

Classification Performance ◽

The Other ◽

Searching Strategy ◽

Mining Algorithms ◽

Feature Selection Techniques

Extracting useful and novel information from the large amount of collected data has become a necessity for corporations wishing to maintain a competitive advantage. One of the biggest issues in handling these significantly large datasets is the curse of dimensionality. As the dimension of the data increases, the performance of the data mining algorithms employed to mine the data deteriorates. This deterioration is mainly caused by the large search space created as a result of having irrelevant, noisy and redundant features in the data. Feature selection is one of the various techniques that can be used to remove these unnecessary features. Feature selection consequently reduces the dimension of the data as well as the search space which in turn increases the efficiency and the accuracy of the mining algorithms. In this paper, we investigate the ability of Genetic Programming (GP), an evolutionary algorithm searching strategy capable of automatically finding solutions in complex and large search spaces, to perform feature selection. We implement a basic GP algorithm and perform feature selection on 5 benchmark classification datasets from UCI repository. To test the competitiveness and feasibility of the GP approach, we examine the classification performance of four classifiers namely J48, Naives Bayes, PART, and Random Forests using the GP selected features, all the original features and the features selected by the other commonly used feature selection techniques i.e. principal component analysis, information gain, relief-f and cfs. The experimental results show that not only does GP select a smaller set of features from the original features, classifiers using GP selected features achieve a better classification performance than using all the original features. Furthermore, compared to the other well-known feature selection techniques, GP achieves very competitive results.

Download Full-text

A Survey on Phishing Detection and The Importance of Feature Selection In Data Mining Classification Algorithms

Issue 4 - Journal of Science and Technology ◽

10.46243/jst.2020.v5.i6.pp11-18 ◽

2020 ◽

pp. 11-18

Keyword(s):

Data Mining ◽

Feature Selection ◽

Support Vector ◽

Classification Algorithms ◽

End User ◽

Preparation Methods ◽

Survey Paper ◽

Vector Machines ◽

Feature Selection Techniques ◽

Phishing Detection

: In this era of Internet, the issue of security of information is at its peak. One of the main threats in this cyber world is phishing attacks which is an email or website fraud method that targets the genuine webpage or an email and hacks it without the consent of the end user. There are various techniques which help to classify whether the website or an email is legitimate or fake. The major contributors in the process of detection of these phishing frauds include the classification algorithms, feature selection techniques or dataset preparation methods and the feature extraction that plays an important role in detection as well as in prevention of these attacks. This Survey Paper studies the effect of all these contributors and the approaches that are applied in the study conducted on the recent papers. Some of the classification algorithms that are implemented includes Decision tree, Random Forest , Support Vector Machines, Logistic Regression , Lazy K Star, Naive Bayes and J48 etc.

Download Full-text

The Evaluation of Accuracy Performance in an Enhanced Embedded Feature Selection for Unstructured Text Classification

Iraqi Journal of Science ◽

10.24996/ijs.2020.61.12.28 ◽

2020 ◽

pp. 3397-3407

Author(s):

Nur Syafiqah Mohd Nafis ◽

Suryanti Awang

Keyword(s):

Feature Selection ◽

Text Classification ◽

Training Dataset ◽

Recursive Feature Elimination ◽

High Dimensional ◽

Significant Feature ◽

Support Vector ◽

Svm Classifier ◽

Text Documents ◽

Text Document

Text documents are unstructured and high dimensional. Effective feature selection is required to select the most important and significant feature from the sparse feature space. Thus, this paper proposed an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high dimensional text classificationhis technique has the ability to measure the feature’s importance in a high-dimensional text document. In addition, it aims to increase the efficiency of the feature selection. Hence, obtaining a promising text classification accuracy. TF-IDF act as a filter approach which measures features importance of the text documents at the first stage. SVM-RFE utilized a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subsets at the second stage. This research executes sets of experiments using a text document retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing processes are applied to extract relevant features. After that, the pre-processed features are divided into training and testing datasets. Next, feature selection is implemented on the training dataset by calculating the TF-IDF score for each feature. SVM-RFE is applied for feature ranking as the next feature selection step. Only top-rank features will be selected for text classification using the SVM classifier. Based on the experiments, it shows that the proposed technique able to achieve 98% accuracy that outperformed other existing techniques. In conclusion, the proposed technique able to select the significant features in the unstructured and high dimensional text document.

Download Full-text