Spam profiles detection on social networks using computational intelligence methods: The effect of the lingual context

2019 ◽  
pp. 016555151986159 ◽  
Author(s):  
Ala’ M Al-Zoubi ◽  
Ja’far Alqatawna ◽  
Hossam Faris ◽  
Mohammad A Hassonah

In online social networks, spam profiles represent one of the most serious security threats on the Internet; besides spreading undesirable advertisements, they can be exploited by criminals for various purposes. This article addresses the nature and characteristics of spam profiles in a social network such as Twitter to improve spam detection, based on a number of publicly available language-independent features. To investigate the effectiveness of these features in spam detection, four datasets are extracted for four different language contexts (i.e. Arabic, English, Korean and Spanish), and a fifth is formed by combining them all. We conduct our experiments using five classification algorithms well known in the spam detection field, k-Nearest Neighbours (k-NN), Random Forest (RF), Naive Bayes (NB), Decision Tree (DT) (J48) and Multilayer Perceptron (MLP), along with five filter-based feature selection methods, namely Information Gain, Chi-square, ReliefF, Correlation and Significance. The results show oscillating performance of each classifier across the datasets, but improved classification results with feature selection. In addition, detailed analysis and comparisons are carried out on two levels: on the first level, we compare the importance of the selected features among the feature selection methods, whereas on the second level, we observe the relations and the importance of the selected features across all datasets. The findings of this article lead to a better understanding of social spam and to improved detection methods that consider the various important features arising from the different lingual contexts.
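The filter methods named above all score each feature independently of any classifier. As a minimal, stdlib-only sketch of one of them, the Chi-square score of a single binary profile feature against the spam label can be computed from its 2x2 contingency table; the feature name and the toy data below are hypothetical, not drawn from the article's datasets:

```python
from collections import Counter

def chi_square(feature_values, labels):
    """Chi-square statistic for a binary feature against a binary label.

    Builds the 2x2 observed contingency table and compares it with the
    counts expected under independence of feature and label.
    """
    n = len(labels)
    obs = Counter(zip(feature_values, labels))   # observed cell counts
    f_tot = Counter(feature_values)              # feature marginals
    l_tot = Counter(labels)                      # label marginals
    chi2 = 0.0
    for f in (0, 1):
        for l in (0, 1):
            expected = f_tot[f] * l_tot[l] / n
            if expected > 0:
                chi2 += (obs[(f, l)] - expected) ** 2 / expected
    return chi2

# Hypothetical profile feature: 1 = "bio contains a URL"; label: 1 = spam.
has_url = [1, 1, 1, 0, 0, 0, 1, 0]
is_spam = [1, 1, 1, 0, 0, 0, 0, 0]
score = chi_square(has_url, is_spam)
```

Ranking all features by this score and keeping the top-k is the filter step; the chosen classifier is then trained on the reduced feature set.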

2019 ◽  
Vol 8 (4) ◽  
pp. 1333-1338

Text classification is a vital process due to the large volume of electronic articles. One of the drawbacks of text classification is the high dimensionality of the feature space. Scholars have developed several algorithms to choose relevant features from article text, such as Chi-square (χ2), Information Gain (IG) and Correlation (CFS). These algorithms have been investigated widely for English text, while studies for Arabic text are still limited. In this paper, we investigated four well-known algorithms, Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbors (KNN) and Decision Tree, against a benchmark Arabic textual dataset, the Saudi Press Agency (SPA) corpus, to evaluate the impact of feature selection methods. Using the WEKA tool, we experimented with applying the four classification algorithms with and without feature selection. The results provide clear evidence that the three feature selection methods often improve classification accuracy by eliminating irrelevant features.
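As a sketch of one of the investigated methods, Information Gain for a binary term indicator is the reduction in class entropy obtained by splitting the documents on term presence. The minimal implementation below is a language-independent illustration (not the WEKA implementation), with hypothetical class labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term_present, labels):
    """IG of a binary term indicator: class entropy minus the entropy
    remaining after partitioning the documents on term presence."""
    n = len(labels)
    ig = entropy(labels)
    for value in (0, 1):
        subset = [lab for t, lab in zip(term_present, labels) if t == value]
        if subset:
            ig -= len(subset) / n * entropy(subset)
    return ig
```

A term that perfectly separates two balanced classes scores 1.0 bit; a term independent of the class scores 0, which is exactly why IG-style filters discard it.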


2016 ◽  
Vol 26 (03) ◽  
pp. 1750008 ◽  
Author(s):  
Seyyed Hossein Seyyedi ◽  
Behrouz Minaei-Bidgoli

Nowadays, text is one of the most prevalent forms of data, and text classification is a widely used data mining task with various fields of application. One mass-produced instance of text is email. As a communication medium, despite having many advantages, email suffers from a serious problem: the number of spam emails has steadily increased in recent years, causing considerable irritation. Therefore, spam detection has emerged as a separate field of text classification. A primary challenge of text classification, which is more severe in spam detection and impedes the process, is the high dimensionality of the feature space. Various dimension reduction methods have been proposed that produce a lower-dimensional space than the original. These methods fall mainly into two groups: feature selection and feature extraction. This research deals with dimension reduction in the text classification task and performs experiments specifically in the spam detection field. We employ Information Gain (IG) and the Chi-square Statistic (CHI) as well-known feature selection methods. We also propose a new feature extraction method called Sprinkled Semantic Feature Space (SSFS). Furthermore, this paper presents a new hybrid method called IG_SSFS, in which we combine the selection and extraction processes to reap the benefits of both. To evaluate these methods in the spam detection field, experiments are conducted on several well-known email datasets. According to the results, SSFS demonstrated superior effectiveness over the basic selection methods in terms of improving classifier performance, and IG_SSFS further enhanced performance while consuming less processing time.
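The "sprinkling" idea behind SSFS can be sketched in a few lines: class-indicator columns are appended to the training term-document matrix so that a subsequent projection (e.g. SVD) is pulled toward class-discriminative directions. The sketch below shows only the augmentation step, with hypothetical counts; the projection itself and the IG pre-filtering of IG_SSFS are omitted:

```python
def sprinkle(doc_term_rows, labels, classes, strength=1):
    """Append `strength` class-indicator columns per class to each
    training row ("sprinkling"). A dimensionality-reducing projection
    such as SVD would then be fitted on the augmented matrix; at test
    time the sprinkled columns are simply zeros."""
    sprinkled = []
    for row, lab in zip(doc_term_rows, labels):
        extra = []
        for c in classes:
            extra.extend([1 if lab == c else 0] * strength)
        sprinkled.append(row + extra)
    return sprinkled

# Two hypothetical training documents over a 3-term vocabulary.
train = [[2, 0, 1], [0, 3, 0]]
rows = sprinkle(train, ["spam", "ham"], ["spam", "ham"], strength=2)
# rows[0] == [2, 0, 1, 1, 1, 0, 0]
```

The `strength` parameter (how many indicator copies to append) is an illustrative knob, not a parameter named by the paper.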


Author(s):  
Hadeel N. Alshaer ◽  
Mohammed A. Otair ◽  
Laith Abualigah

Feature selection is one of the main problems in the text and data mining domain. This paper presents a comparative study of feature selection methods for Arabic text classification. Five feature selection methods were selected: ICHI square, CHI square, Information Gain, Mutual Information and Wrapper. They were tested with five classification algorithms: Bayes Net, Naive Bayes, Random Forest, Decision Tree and Artificial Neural Networks. An Arabic data collection consisting of 9055 documents was used, and the methods were compared on four criteria: Precision, Recall, F-measure and time to build the model. The results showed that the improved ICHI feature selection achieved almost all of the best results in comparison with the other methods.
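Unlike the filter methods in this comparison, the Wrapper method evaluates candidate feature subsets with a classifier itself. A minimal greedy forward wrapper, here paired for illustration with a leave-one-out 1-nearest-neighbour evaluator (the study's actual classifiers and Arabic data differ), can be sketched as:

```python
def loo_accuracy(rows, labels, feats):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the candidate feature subset `feats`."""
    correct = 0
    for i, row in enumerate(rows):
        best, pred = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue
            d = sum((row[f] - other[f]) ** 2 for f in feats)
            if best is None or d < best:
                best, pred = d, labels[j]
        correct += pred == labels[i]
    return correct / len(rows)

def wrapper_forward(rows, labels, n_features):
    """Greedy forward wrapper: repeatedly add the single feature that
    most improves the evaluator's score; stop when no feature helps."""
    selected, remaining, score = [], set(range(n_features)), 0.0
    while remaining:
        cand = max(remaining,
                   key=lambda f: loo_accuracy(rows, labels, selected + [f]))
        new = loo_accuracy(rows, labels, selected + [cand])
        if new <= score:
            break
        selected.append(cand)
        remaining.remove(cand)
        score = new
    return selected

# Hypothetical data: feature 0 separates the classes; feature 1 is noise.
rows = [[0, 5], [0, 1], [1, 4], [1, 0]]
labels = ["a", "a", "b", "b"]
chosen = wrapper_forward(rows, labels, n_features=2)
```

Because every candidate subset triggers a full re-evaluation, wrappers are far costlier than filters such as CHI or IG, which is the usual trade-off such comparative studies measure via model-building time.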


2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Monalisa Ghosh ◽  
Goutam Sanyal

Sentiment classification, or sentiment analysis, has been acknowledged as an open research domain. In recent years, an enormous amount of research has been performed in this field using a wide range of methodologies. Feature generation and selection are crucial for text mining, as a high-dimensional feature set can degrade the performance of sentiment analysis. This paper investigates the performance of the widely used feature selection methods (IG, Chi-square and Gini Index) with unigram and bigram feature sets on four machine learning classification algorithms (MNB, SVM, KNN and ME). The proposed methods are evaluated on three standard datasets, namely the IMDb movie review dataset and the electronics and kitchen product review datasets. Initially, unigram and bigram features are extracted by applying the n-gram method. In addition, we generate a composite feature vector, CompUniBi (unigram + bigram), which is passed to the feature selection methods Information Gain (IG), Gini Index (GI) and Chi-square (CHI) to obtain an optimal feature subset by assigning a score to each feature. These methods rank the features by their scores, so a prominent feature vector (CompIG, CompGI or CompCHI) can easily be generated for classification. Finally, the machine learning classifiers SVM, MNB, KNN and ME use the prominent feature vector to classify each review document as either positive or negative. Performance is measured by evaluation metrics such as precision, recall and F-measure. Experimental results show that the composite feature vector achieved better performance than unigram features alone, which is encouraging as well as comparable to related research. The highest accuracy was obtained from the combination of Information Gain with SVM.
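The CompUniBi construction can be sketched directly: tokenize a review, emit unigrams and bigrams, and concatenate the two sets (scoring with IG, GI or CHI would follow). The review text below is hypothetical:

```python
def ngrams(tokens, n):
    """Contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def comp_uni_bi(text):
    """CompUniBi-style composite feature set for one review:
    all unigrams followed by all bigrams."""
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)

features = comp_uni_bi("not a good movie")
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

Bigrams are what let negations such as "not good" survive as a single feature, which is the usual motivation for combining them with unigrams before filter-based selection.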


2018 ◽  
Vol 7 (1) ◽  
pp. 57-72
Author(s):  
H.P. Vinutha ◽  
Poornima Basavaraju

Day by day, network security is becoming a more challenging task. Intrusion detection systems (IDSs) are one of the methods used to monitor network activities. Data mining algorithms play a major role in the field of IDS. The NSL-KDD'99 dataset is used to study network traffic patterns, which helps identify possible attacks taking place on the network. The dataset contains 41 attributes and one class attribute categorized as normal, DoS, Probe, R2L and U2R. The proposed methodology aims to reduce the false-positive rate and improve the detection rate by reducing the dimensionality of the dataset, since using all 41 attributes for detection is not good practice. Four feature selection methods, namely Chi-Square, Symmetric Uncertainty (SU), Gain Ratio and Information Gain, are used to evaluate the attributes, and unimportant features are removed to reduce the dimensionality of the data. Ensemble classification techniques, namely Boosting, Bagging, Stacking and Voting, are used to observe the detection rate separately with three base algorithms: Decision Stump, J48 and Random Forest.
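Of the four ensemble schemes, Voting is the simplest to sketch: each base model predicts a class for every connection record and the majority label wins. The per-model predictions below are hypothetical, not results from the NSL-KDD experiments:

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model predictions instance by instance using a
    majority vote, as in a Voting ensemble; ties fall back to the
    first-seen label."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from the three base models on four records.
stump  = ["normal", "DoS", "Probe",  "normal"]
j48    = ["normal", "DoS", "normal", "R2L"]
forest = ["DoS",    "DoS", "Probe",  "R2L"]
final = majority_vote([stump, j48, forest])
# ["normal", "DoS", "Probe", "R2L"]
```

Boosting, Bagging and Stacking differ in how the base models are trained and weighted, but all ultimately combine per-instance predictions in a step analogous to this one.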


2010 ◽  
Vol 9 ◽  
pp. CIN.S3794 ◽  
Author(s):  
Xiaosheng Wang ◽  
Osamu Gotoh

Gene selection is of vital importance in the molecular classification of cancer using high-dimensional gene expression data. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust feature selection methods is crucial. We investigated the properties of a feature selection approach proposed in our previous work, which generalizes the feature selection method based on the depended degree of attributes in rough sets. We compared this method with established methods, namely the depended degree, chi-square, information gain, Relief-F and symmetric uncertainty, and analyzed its properties through a series of classification experiments. The results revealed that our method was superior to the canonical depended-degree-based method in robustness and applicability, and comparable to the other four commonly used methods. More importantly, the method can expose the inherent classification difficulty of different gene expression datasets, reflecting the inherent biology of the specific cancers.
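The rough-set notion underlying the baseline here (the depended, or dependency, degree) can be sketched concisely: the dependency of a decision D on an attribute subset B is the fraction of samples lying in B-indiscernibility classes that are pure with respect to D. A minimal illustration with hypothetical discretized expression values:

```python
from collections import defaultdict

def dependency_degree(rows, labels, attrs):
    """Rough-set dependency degree gamma_B(D): the fraction of samples
    whose B-indiscernibility class (identical values on `attrs`) maps
    to a single decision label."""
    classes = defaultdict(list)
    for row, lab in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        classes[key].append(lab)
    positive = sum(len(labs) for labs in classes.values()
                   if len(set(labs)) == 1)       # consistent classes only
    return positive / len(rows)

# Hypothetical discretized expression of two genes across four samples.
rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = ["tumor", "tumor", "normal", "normal"]
```

Here gene 0 alone determines the decision (degree 1.0) while gene 1 alone is uninformative (degree 0.0); a depended-degree-based selector would rank them accordingly.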


2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Hongyan Zhang ◽  
Lanzhi Li ◽  
Chao Luo ◽  
Congwei Sun ◽  
Yuan Chen ◽  
...  

In efforts to discover disease mechanisms and improve the clinical diagnosis of tumors, it is useful to mine expression profiles for informative genes with definite biological meaning and to build robust classifiers with high precision. In this study, we developed a new method for tumor gene selection, the Chi-square test-based integrated rank gene and direct classifier (χ2-IRG-DC). First, we obtained the weighted integrated rank of gene importance from chi-square tests of single and pairwise gene interactions. Then, we sequentially introduced the ranked genes and removed redundant genes by using leave-one-out cross-validation of the chi-square test-based Direct Classifier (χ2-DC) within the training set to obtain informative genes. Finally, we determined the accuracy on independent test data by utilizing the genes obtained above with χ2-DC. Furthermore, we analyzed the robustness of χ2-IRG-DC by comparing the generalization performance of different models, the efficiency of different feature selection methods, and the accuracy of different classifiers. An independent test on ten multiclass tumor gene expression datasets showed that χ2-IRG-DC could efficiently control overfitting and had higher generalization performance. The informative genes selected by χ2-IRG-DC could dramatically improve the independent test precision of other classifiers; meanwhile, the informative genes selected by other feature selection methods also performed well in χ2-DC.
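The sequential-introduction step can be sketched as follows: genes arrive in chi-square rank order, and a gene is kept only if it raises leave-one-out accuracy on the training set. For illustration the direct classifier is approximated here by a nearest-centroid rule, and the data and ranking are hypothetical, not the paper's χ2-DC or datasets:

```python
def nearest_centroid_predict(train_rows, train_labels, sample, genes):
    """Stand-in 'direct classifier': assign the class whose centroid,
    computed over the candidate genes, is closest to the sample."""
    best_lab, best_d = None, None
    for lab in sorted(set(train_labels)):
        members = [r for r, l in zip(train_rows, train_labels) if l == lab]
        d = sum((sample[g] - sum(m[g] for m in members) / len(members)) ** 2
                for g in genes)
        if best_d is None or d < best_d:
            best_lab, best_d = lab, d
    return best_lab

def loocv_accuracy(rows, labels, genes):
    """Leave-one-out cross-validation accuracy on the training set."""
    hits = 0
    for i in range(len(rows)):
        pred = nearest_centroid_predict(rows[:i] + rows[i + 1:],
                                        labels[:i] + labels[i + 1:],
                                        rows[i], genes)
        hits += pred == labels[i]
    return hits / len(rows)

def sequential_introduce(rows, labels, ranked_genes):
    """Introduce chi-square-ranked genes one at a time, keeping a gene
    only if it improves LOOCV accuracy; redundant genes are dropped."""
    selected, best = [], 0.0
    for g in ranked_genes:
        acc = loocv_accuracy(rows, labels, selected + [g])
        if acc > best:
            selected.append(g)
            best = acc
    return selected

# Hypothetical expression data: gene 0 informative, gene 1 noise.
rows = [[0.0, 1.0], [0.1, 0.0], [1.0, 1.0], [0.9, 0.0]]
labels = ["a", "a", "b", "b"]
kept = sequential_introduce(rows, labels, ranked_genes=[0, 1])
```

The noise gene is introduced second and rejected because it cannot improve on the accuracy already reached, which is the redundancy-removal behaviour the abstract describes.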

