Spam profiles detection on social networks using computational intelligence methods: The effect of the lingual context

2019 ◽  
pp. 016555151986159 ◽  
Author(s):  
Ala’ M Al-Zoubi ◽  
Ja’far Alqatawna ◽  
Hossam Faris ◽  
Mohammad A Hassonah

In online social networks, spam profiles represent one of the most serious security threats on the Internet; besides spreading undesirable advertisements, they can be exploited by criminals for various purposes. This article addresses the nature and characteristics of spam profiles in a social network such as Twitter to improve spam detection, based on a number of publicly available language-independent features. To investigate the effectiveness of these features in spam detection, four datasets are extracted for four different language contexts (i.e. Arabic, English, Korean and Spanish), and a fifth is formed by combining them all. We conduct our experiments using five classification algorithms well known in the spam detection field, k-Nearest Neighbours (k-NN), Random Forest (RF), Naive Bayes (NB), Decision Tree (DT) (J48) and Multilayer Perceptron (MLP), along with five filter-based feature selection methods, namely Information Gain, Chi-square, ReliefF, Correlation and Significance. The results show oscillating performance of each classifier across the datasets, but improved classification results with feature selection. In addition, detailed analysis and comparisons are carried out on two levels: on the first level, we compare the importance of the selected features among the feature selection methods, whereas on the second level, we observe the relations and the importance of the selected features across all datasets. The findings of this article lead to a better understanding of social spam and to improved detection methods that consider the various important features arising from the different lingual contexts.
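The filter methods named above all score each feature independently of any classifier. As a minimal, stdlib-only sketch of one of them, the Chi-square score of a single binary profile feature against the spam label can be computed from its 2x2 contingency table; the feature name and the toy data below are hypothetical, not drawn from the article's datasets:

```python
from collections import Counter

def chi_square(feature_values, labels):
    """Chi-square statistic for a binary feature against a binary label.

    Builds the 2x2 observed contingency table and compares it with the
    counts expected under independence of feature and label.
    """
    n = len(labels)
    obs = Counter(zip(feature_values, labels))   # observed cell counts
    f_tot = Counter(feature_values)              # feature marginals
    l_tot = Counter(labels)                      # label marginals
    chi2 = 0.0
    for f in (0, 1):
        for l in (0, 1):
            expected = f_tot[f] * l_tot[l] / n
            if expected > 0:
                chi2 += (obs[(f, l)] - expected) ** 2 / expected
    return chi2

# Hypothetical profile feature: 1 = "bio contains a URL"; label: 1 = spam.
has_url = [1, 1, 1, 0, 0, 0, 1, 0]
is_spam = [1, 1, 1, 0, 0, 0, 0, 0]
score = chi_square(has_url, is_spam)
```

Ranking all features by this score and keeping the top-k is the filter step; the chosen classifier is then trained on the reduced feature set.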

2019 ◽  
Vol 8 (4) ◽  
pp. 1333-1338

Text classification is a vital process due to the large volume of electronic articles. One of the drawbacks of text classification is the high dimensionality of the feature space. Scholars have developed several algorithms to choose relevant features from article text, such as Chi-square (χ2), Information Gain (IG) and Correlation (CFS). These algorithms have been investigated widely for English text, while studies for Arabic text are still limited. In this paper, we investigated four well-known algorithms, Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbors (KNN) and Decision Tree, against a benchmark Arabic textual dataset, the Saudi Press Agency (SPA) corpus, to evaluate the impact of feature selection methods. Using the WEKA tool, we experimented with applying the four classification algorithms with and without feature selection. The results provide clear evidence that the three feature selection methods often improve classification accuracy by eliminating irrelevant features.
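As a sketch of one of the investigated methods, Information Gain for a binary term indicator is the reduction in class entropy obtained by splitting the documents on term presence. The minimal implementation below is a language-independent illustration (not the WEKA implementation), with hypothetical class labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term_present, labels):
    """IG of a binary term indicator: class entropy minus the entropy
    remaining after partitioning the documents on term presence."""
    n = len(labels)
    ig = entropy(labels)
    for value in (0, 1):
        subset = [lab for t, lab in zip(term_present, labels) if t == value]
        if subset:
            ig -= len(subset) / n * entropy(subset)
    return ig
```

A term that perfectly separates two balanced classes scores 1.0 bit; a term independent of the class scores 0, which is exactly why IG-style filters discard it.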


2016 ◽  
Vol 26 (03) ◽  
pp. 1750008 ◽  
Author(s):  
Seyyed Hossein Seyyedi ◽  
Behrouz Minaei-Bidgoli

Nowadays, text is one of the most prevalent forms of data, and text classification is a widely used data mining task with various fields of application. One mass-produced instance of text is email. As a communication medium, despite having many advantages, email suffers from a serious problem: the number of spam emails has steadily increased in recent years, causing considerable irritation. Therefore, spam detection has emerged as a separate field of text classification. A primary challenge of text classification, which is more severe in spam detection and impedes the process, is the high dimensionality of the feature space. Various dimension reduction methods have been proposed that produce a lower-dimensional space than the original. These methods fall mainly into two groups: feature selection and feature extraction. This research deals with dimension reduction in the text classification task and performs experiments specifically in the spam detection field. We employ Information Gain (IG) and the Chi-square Statistic (CHI) as well-known feature selection methods. We also propose a new feature extraction method called Sprinkled Semantic Feature Space (SSFS). Furthermore, this paper presents a new hybrid method called IG_SSFS, in which we combine the selection and extraction processes to reap the benefits of both. To evaluate these methods in the spam detection field, experiments are conducted on several well-known email datasets. According to the results, SSFS demonstrated superior effectiveness over the basic selection methods in terms of improving classifier performance, and IG_SSFS further enhanced performance while consuming less processing time.
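The "sprinkling" idea behind SSFS can be sketched in a few lines: class-indicator columns are appended to the training term-document matrix so that a subsequent projection (e.g. SVD) is pulled toward class-discriminative directions. The sketch below shows only the augmentation step, with hypothetical counts; the projection itself and the IG pre-filtering of IG_SSFS are omitted:

```python
def sprinkle(doc_term_rows, labels, classes, strength=1):
    """Append `strength` class-indicator columns per class to each
    training row ("sprinkling"). A dimensionality-reducing projection
    such as SVD would then be fitted on the augmented matrix; at test
    time the sprinkled columns are simply zeros."""
    sprinkled = []
    for row, lab in zip(doc_term_rows, labels):
        extra = []
        for c in classes:
            extra.extend([1 if lab == c else 0] * strength)
        sprinkled.append(row + extra)
    return sprinkled

# Two hypothetical training documents over a 3-term vocabulary.
train = [[2, 0, 1], [0, 3, 0]]
rows = sprinkle(train, ["spam", "ham"], ["spam", "ham"], strength=2)
# rows[0] == [2, 0, 1, 1, 1, 0, 0]
```

The `strength` parameter (how many indicator copies to append) is an illustrative knob, not a parameter named by the paper.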


Author(s):  
Hadeel N. Alshaer ◽  
Mohammed A. Otair ◽  
Laith Abualigah

Feature selection is one of the main problems in the text and data mining domain. This paper presents a comparative study of feature selection methods for Arabic text classification. Five feature selection methods were selected: ICHI square, CHI square, Information Gain, Mutual Information and Wrapper. They were tested with five classification algorithms: Bayes Net, Naive Bayes, Random Forest, Decision Tree and Artificial Neural Networks. An Arabic data collection consisting of 9055 documents was used, and the methods were compared on four criteria: Precision, Recall, F-measure and time to build the model. The results showed that the improved ICHI feature selection achieved almost all of the best results in comparison with the other methods.
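Unlike the filter methods in this comparison, the Wrapper method evaluates candidate feature subsets with a classifier itself. A minimal greedy forward wrapper, here paired for illustration with a leave-one-out 1-nearest-neighbour evaluator (the study's actual classifiers and Arabic data differ), can be sketched as:

```python
def loo_accuracy(rows, labels, feats):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the candidate feature subset `feats`."""
    correct = 0
    for i, row in enumerate(rows):
        best, pred = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue
            d = sum((row[f] - other[f]) ** 2 for f in feats)
            if best is None or d < best:
                best, pred = d, labels[j]
        correct += pred == labels[i]
    return correct / len(rows)

def wrapper_forward(rows, labels, n_features):
    """Greedy forward wrapper: repeatedly add the single feature that
    most improves the evaluator's score; stop when no feature helps."""
    selected, remaining, score = [], set(range(n_features)), 0.0
    while remaining:
        cand = max(remaining,
                   key=lambda f: loo_accuracy(rows, labels, selected + [f]))
        new = loo_accuracy(rows, labels, selected + [cand])
        if new <= score:
            break
        selected.append(cand)
        remaining.remove(cand)
        score = new
    return selected

# Hypothetical data: feature 0 separates the classes; feature 1 is noise.
rows = [[0, 5], [0, 1], [1, 4], [1, 0]]
labels = ["a", "a", "b", "b"]
chosen = wrapper_forward(rows, labels, n_features=2)
```

Because every candidate subset triggers a full re-evaluation, wrappers are far costlier than filters such as CHI or IG, which is the usual trade-off such comparative studies measure via model-building time.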


2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Monalisa Ghosh ◽  
Goutam Sanyal

Sentiment classification, or sentiment analysis, has been acknowledged as an open research domain. In recent years, an enormous amount of research has been performed in this field using a wide range of methodologies. Feature generation and selection are crucial for text mining, as a high-dimensional feature set can degrade the performance of sentiment analysis. This paper investigates the performance of the widely used feature selection methods (IG, Chi-square and Gini Index) with unigram and bigram feature sets on four machine learning classification algorithms (MNB, SVM, KNN and ME). The proposed methods are evaluated on three standard datasets, namely the IMDb movie review dataset and the electronics and kitchen product review datasets. Initially, unigram and bigram features are extracted by applying the n-gram method. In addition, we generate a composite feature vector, CompUniBi (unigram + bigram), which is passed to the feature selection methods Information Gain (IG), Gini Index (GI) and Chi-square (CHI) to obtain an optimal feature subset by assigning a score to each feature. These methods rank the features by their scores, so a prominent feature vector (CompIG, CompGI or CompCHI) can easily be generated for classification. Finally, the machine learning classifiers SVM, MNB, KNN and ME use the prominent feature vector to classify each review document as either positive or negative. Performance is measured by evaluation metrics such as precision, recall and F-measure. Experimental results show that the composite feature vector achieved better performance than unigram features alone, which is encouraging as well as comparable to related research. The highest accuracy was obtained from the combination of Information Gain with SVM.
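The CompUniBi construction can be sketched directly: tokenize a review, emit unigrams and bigrams, and concatenate the two sets (scoring with IG, GI or CHI would follow). The review text below is hypothetical:

```python
def ngrams(tokens, n):
    """Contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def comp_uni_bi(text):
    """CompUniBi-style composite feature set for one review:
    all unigrams followed by all bigrams."""
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)

features = comp_uni_bi("not a good movie")
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

Bigrams are what let negations such as "not good" survive as a single feature, which is the usual motivation for combining them with unigrams before filter-based selection.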


2018 ◽  
Vol 7 (1) ◽  
pp. 57-72
Author(s):  
H.P. Vinutha ◽  
Poornima Basavaraju

Day by day, network security is becoming a more challenging task. Intrusion detection systems (IDSs) are one of the methods used to monitor network activities. Data mining algorithms play a major role in the field of IDS. The NSL-KDD'99 dataset is used to study network traffic patterns, which helps identify possible attacks taking place on the network. The dataset contains 41 attributes and one class attribute categorized as normal, DoS, Probe, R2L and U2R. The proposed methodology aims to reduce the false-positive rate and improve the detection rate by reducing the dimensionality of the dataset, since using all 41 attributes for detection is not good practice. Four feature selection methods, namely Chi-Square, Symmetric Uncertainty (SU), Gain Ratio and Information Gain, are used to evaluate the attributes, and unimportant features are removed to reduce the dimensionality of the data. Ensemble classification techniques, namely Boosting, Bagging, Stacking and Voting, are used to observe the detection rate separately with three base algorithms: Decision Stump, J48 and Random Forest.
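Of the four ensemble schemes, Voting is the simplest to sketch: each base model predicts a class for every connection record and the majority label wins. The per-model predictions below are hypothetical, not results from the NSL-KDD experiments:

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model predictions instance by instance using a
    majority vote, as in a Voting ensemble; ties fall back to the
    first-seen label."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from the three base models on four records.
stump  = ["normal", "DoS", "Probe",  "normal"]
j48    = ["normal", "DoS", "normal", "R2L"]
forest = ["DoS",    "DoS", "Probe",  "R2L"]
final = majority_vote([stump, j48, forest])
# ["normal", "DoS", "Probe", "R2L"]
```

Boosting, Bagging and Stacking differ in how the base models are trained and weighted, but all ultimately combine per-instance predictions in a step analogous to this one.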


2010 ◽  
Vol 9 ◽  
pp. CIN.S3794 ◽  
Author(s):  
Xiaosheng Wang ◽  
Osamu Gotoh

Gene selection is of vital importance in the molecular classification of cancer using high-dimensional gene expression data. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust feature selection methods is crucial. We investigated the properties of a feature selection approach proposed in our previous work, which generalizes the feature selection method based on the depended degree of attributes in rough sets. We compared this method with established methods, namely the depended degree, chi-square, information gain, Relief-F and symmetric uncertainty, and analyzed its properties through a series of classification experiments. The results revealed that our method was superior to the canonical depended-degree-based method in robustness and applicability, and comparable to the other four commonly used methods. More importantly, the method can expose the inherent classification difficulty of different gene expression datasets, reflecting the inherent biology of the specific cancers.
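The rough-set notion underlying the baseline here (the depended, or dependency, degree) can be sketched concisely: the dependency of a decision D on an attribute subset B is the fraction of samples lying in B-indiscernibility classes that are pure with respect to D. A minimal illustration with hypothetical discretized expression values:

```python
from collections import defaultdict

def dependency_degree(rows, labels, attrs):
    """Rough-set dependency degree gamma_B(D): the fraction of samples
    whose B-indiscernibility class (identical values on `attrs`) maps
    to a single decision label."""
    classes = defaultdict(list)
    for row, lab in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        classes[key].append(lab)
    positive = sum(len(labs) for labs in classes.values()
                   if len(set(labs)) == 1)       # consistent classes only
    return positive / len(rows)

# Hypothetical discretized expression of two genes across four samples.
rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = ["tumor", "tumor", "normal", "normal"]
```

Here gene 0 alone determines the decision (degree 1.0) while gene 1 alone is uninformative (degree 0.0); a depended-degree-based selector would rank them accordingly.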


2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Hongyan Zhang ◽  
Lanzhi Li ◽  
Chao Luo ◽  
Congwei Sun ◽  
Yuan Chen ◽  
...  

In efforts to discover disease mechanisms and improve the clinical diagnosis of tumors, it is useful to mine expression profiles for informative genes with definite biological meaning and to build robust classifiers with high precision. In this study, we developed a new method for tumor gene selection, the Chi-square test-based integrated rank gene and direct classifier (χ2-IRG-DC). First, we obtained the weighted integrated rank of gene importance from chi-square tests of single and pairwise gene interactions. Then, we sequentially introduced the ranked genes and removed redundant genes by using leave-one-out cross-validation of the chi-square test-based Direct Classifier (χ2-DC) within the training set to obtain informative genes. Finally, we determined the accuracy on independent test data by utilizing the genes obtained above with χ2-DC. Furthermore, we analyzed the robustness of χ2-IRG-DC by comparing the generalization performance of different models, the efficiency of different feature selection methods, and the accuracy of different classifiers. An independent test on ten multiclass tumor gene expression datasets showed that χ2-IRG-DC could efficiently control overfitting and had higher generalization performance. The informative genes selected by χ2-IRG-DC could dramatically improve the independent test precision of other classifiers; meanwhile, the informative genes selected by other feature selection methods also performed well in χ2-DC.
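The sequential-introduction step can be sketched as follows: genes arrive in chi-square rank order, and a gene is kept only if it raises leave-one-out accuracy on the training set. For illustration the direct classifier is approximated here by a nearest-centroid rule, and the data and ranking are hypothetical, not the paper's χ2-DC or datasets:

```python
def nearest_centroid_predict(train_rows, train_labels, sample, genes):
    """Stand-in 'direct classifier': assign the class whose centroid,
    computed over the candidate genes, is closest to the sample."""
    best_lab, best_d = None, None
    for lab in sorted(set(train_labels)):
        members = [r for r, l in zip(train_rows, train_labels) if l == lab]
        d = sum((sample[g] - sum(m[g] for m in members) / len(members)) ** 2
                for g in genes)
        if best_d is None or d < best_d:
            best_lab, best_d = lab, d
    return best_lab

def loocv_accuracy(rows, labels, genes):
    """Leave-one-out cross-validation accuracy on the training set."""
    hits = 0
    for i in range(len(rows)):
        pred = nearest_centroid_predict(rows[:i] + rows[i + 1:],
                                        labels[:i] + labels[i + 1:],
                                        rows[i], genes)
        hits += pred == labels[i]
    return hits / len(rows)

def sequential_introduce(rows, labels, ranked_genes):
    """Introduce chi-square-ranked genes one at a time, keeping a gene
    only if it improves LOOCV accuracy; redundant genes are dropped."""
    selected, best = [], 0.0
    for g in ranked_genes:
        acc = loocv_accuracy(rows, labels, selected + [g])
        if acc > best:
            selected.append(g)
            best = acc
    return selected

# Hypothetical expression data: gene 0 informative, gene 1 noise.
rows = [[0.0, 1.0], [0.1, 0.0], [1.0, 1.0], [0.9, 0.0]]
labels = ["a", "a", "b", "b"]
kept = sequential_introduce(rows, labels, ranked_genes=[0, 1])
```

The noise gene is introduced second and rejected because it cannot improve on the accuracy already reached, which is the redundancy-removal behaviour the abstract describes.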

