scholarly journals Comparison of feature selection techniques in classifying stroke documents

Author(s):  
Nur Syaza Izzati Mohd Rafei ◽  
Rohayanti Hassan ◽  
RD Rohmat Saedudin ◽  
Anis Farihan Mat Raffei ◽  
Zalmiyah Zakaria ◽  
...  

<span>The amount of digital biomedical literature grows that make most of the researchers facing the difficulties to manage and retrieve the required information from the Internet because this task is very challenging. The application of text classification on biomedical literature is one of the solutions in order to solve problem that have been faced by researchers but managing the high dimensionality of data being a common issue on text classification. Therefore, the aim of this research is to compare the techniques that could be used to select the relevant features for classifying biomedical text abstracts. This research focus on Pearson’s Correlation and Information Gain as feature selection techniques for reducing the high dimensionality of data. Towards this effort, we conduct and evaluate several experiments using 100 abstract of stroke documents that retrieved from PubMed database as datasets. This dataset underwent the text pre-processing that is crucial before proceed to feature selection phase. Features selection phase is involving Information Gain and Pearson Correlation technique. Support Vector Machine classifier is used in order to evaluate and compare the effectiveness of two feature selection techniques. For this dataset, Information Gain has outperformed Pearson’s Correlation by 3.3%. This research tends to extract the meaningful features from a subset of stroke documents that can be used for various application especially in diagnose the stroke disease.</span>

Author(s):  
Ahmad A Saifan ◽  
Lina Abu-wardih

Two primary issues have emerged in the machine learning and data mining community: how to deal with imbalanced data and how to choose appropriate features. These are of particular concern in the software engineering domain, and more specifically the field of software defect prediction. This research highlights a procedure which includes a feature selection technique to single out relevant attributes, and an ensemble technique to handle the class-imbalance issue. In order to determine the advantages of feature selection and ensemble methods we look at two potential scenarios: (1) Ensemble models constructed from the original datasets, without feature selection; (2) Ensemble models constructed from the reduced datasets after feature selection has been applied. Four feature selection techniques are employed: Principal Component Analysis (PCA), Pearson’s correlation, Greedy Stepwise Forward selection, and Information Gain (IG). The aim of this research is to assess the effectiveness of feature selection techniques using ensemble techniques. Five datasets, obtained from the PROMISE software depository, are analyzed; tentative results indicate that ensemble methods can improve the model's performance without the use of feature selection techniques. PCA feature selection and bagging based on K-NN perform better than both bagging based on SVM and boosting based on K-NN and SVM, and feature selection techniques including Pearson’s correlation, Greedy stepwise, and IG weaken the ensemble models’ performance.


2019 ◽  
Vol 8 (4) ◽  
pp. 1333-1338

Text classification is a vital process due to the large volume of electronic articles. One of the drawbacks of text classification is the high dimensionality of feature space. Scholars developed several algorithms to choose relevant features from article text such as Chi-square (x2 ), Information Gain (IG), and Correlation (CFS). These algorithms have been investigated widely for English text, while studies for Arabic text are still limited. In this paper, we investigated four well-known algorithms: Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree against benchmark Arabic textual datasets, called Saudi Press Agency (SPA) to evaluate the impact of feature selection methods. Using the WEKA tool, we have experimented the application of the four mentioned classification algorithms with and without feature selection algorithms. The results provided clear evidence that the three feature selection methods often improves classification accuracy by eliminating irrelevant features.


Author(s):  
VLADIMIR NIKULIN ◽  
TIAN-HSIANG HUANG ◽  
GEOFFREY J. MCLACHLAN

The method presented in this paper is novel as a natural combination of two mutually dependent steps. Feature selection is a key element (first step) in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier such as linear regression, support vector machine or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.


2020 ◽  
Vol 11 (2) ◽  
pp. 107-111
Author(s):  
Christevan Destitus ◽  
Wella Wella ◽  
Suryasari Suryasari

This study aims to clarify tweets on twitter using the Support Vector Machine and Information Gain methods. The clarification itself aims to find a hyperplane that separates the negative and positive classes. In the research stage, there is a system process, namely text mining, text processing which has stages of tokenizing, filtering, stemming, and term weighting. After that, a feature selection is made by information gain which calculates the entropy value of each word. After that, clarify based on the features that have been selected and the output is in the form of identifying whether the tweet is bully or not. The results of this study found that the Support Vector Machine and Information Gain methods have sufficiently maximum results.


2020 ◽  
Vol 2 (1) ◽  
pp. 62
Author(s):  
Luis F. Villamil-Cubillos ◽  
Jersson X. Leon-Medina ◽  
Maribel Anaya ◽  
Diego A. Tibaduiza

An electronic tongue is a device composed of a sensor array that takes advantage of the cross sensitivity property of several sensors to perform classification and quantification in liquid substances. In practice, electronic tongues generate a large amount of information that needs to be correctly analyzed, to define which interactions and features are more relevant to distinguish one substance from another. This work focuses on implementing and validating feature selection methodologies in the liquid classification process of a multifrequency large amplitude pulse voltammetric (MLAPV) electronic tongue. Multi-layer perceptron neural network (MLP NN) and support vector machine (SVM) were used as supervised machine learning classifiers. Different feature selection techniques were used, such as Variance filter, ANOVA F-value, Recursive Feature Elimination and model-based selection. Both 5-fold Cross validation and GridSearchCV were used in order to evaluate the performance of the feature selection methodology by testing various configurations and determining the best one. The methodology was validated in an imbalanced MLAPV electronic tongue dataset of 13 different liquid substances, reaching a 93.85% of classification accuracy.


2019 ◽  
Vol 3 (2) ◽  
pp. 11-18
Author(s):  
George Mweshi

Extracting useful and novel information from the large amount of collected data has become a necessity for corporations wishing to maintain a competitive advantage. One of the biggest issues in handling these significantly large datasets is the curse of dimensionality. As the dimension of the data increases, the performance of the data mining algorithms employed to mine the data deteriorates. This deterioration is mainly caused by the large search space created as a result of having irrelevant, noisy and redundant features in the data. Feature selection is one of the various techniques that can be used to remove these unnecessary features. Feature selection consequently reduces the dimension of the data as well as the search space which in turn increases the efficiency and the accuracy of the mining algorithms. In this paper, we investigate the ability of Genetic Programming (GP), an evolutionary algorithm searching strategy capable of automatically finding solutions in complex and large search spaces, to perform feature selection. We implement a basic GP algorithm and perform feature selection on 5 benchmark classification datasets from UCI repository. To test the competitiveness and feasibility of the GP approach, we examine the classification performance of four classifiers namely J48, Naives Bayes, PART, and Random Forests using the GP selected features, all the original features and the features selected by the other commonly used feature selection techniques i.e. principal component analysis, information gain, relief-f and cfs. The experimental results show that not only does GP select a smaller set of features from the original features, classifiers using GP selected features achieve a better classification performance than using all the original features. Furthermore, compared to the other well-known feature selection techniques, GP achieves very competitive results.


: In this era of Internet, the issue of security of information is at its peak. One of the main threats in this cyber world is phishing attacks which is an email or website fraud method that targets the genuine webpage or an email and hacks it without the consent of the end user. There are various techniques which help to classify whether the website or an email is legitimate or fake. The major contributors in the process of detection of these phishing frauds include the classification algorithms, feature selection techniques or dataset preparation methods and the feature extraction that plays an important role in detection as well as in prevention of these attacks. This Survey Paper studies the effect of all these contributors and the approaches that are applied in the study conducted on the recent papers. Some of the classification algorithms that are implemented includes Decision tree, Random Forest , Support Vector Machines, Logistic Regression , Lazy K Star, Naive Bayes and J48 etc.


2020 ◽  
pp. 3397-3407
Author(s):  
Nur Syafiqah Mohd Nafis ◽  
Suryanti Awang

Text documents are unstructured and high dimensional. Effective feature selection is required to select the most important and significant feature from the sparse feature space. Thus, this paper proposed an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high dimensional text classificationhis technique has the ability to measure the feature’s importance in a high-dimensional text document. In addition, it aims to increase the efficiency of the feature selection. Hence, obtaining a promising text classification accuracy. TF-IDF act as a filter approach which measures features importance of the text documents at the first stage. SVM-RFE utilized a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subsets at the second stage. This research executes sets of experiments using a text document retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing processes are applied to extract relevant features. After that, the pre-processed features are divided into training and testing datasets. Next, feature selection is implemented on the training dataset by calculating the TF-IDF score for each feature. SVM-RFE is applied for feature ranking as the next feature selection step. Only top-rank features will be selected for text classification using the SVM classifier. Based on the experiments, it shows that the proposed technique able to achieve 98% accuracy that outperformed other existing techniques. In conclusion, the proposed technique able to select the significant features in the unstructured and high dimensional text document.


Sign in / Sign up

Export Citation Format

Share Document