Hybrid feature selection approach to identify optimal features of profile metadata to detect social bots in Twitter

2021, Vol 11 (1)
Author(s): Eiman Alothali, Kadhim Hayawi, Hany Alashwal

Abstract: The last few years have revealed that social bots in social networks have become more sophisticated in design, as they adapt their features to avoid detection systems. The ability of these bots to mimic human users convincingly stems from advances in artificial intelligence and chatbots, which allow bots to learn and adjust very quickly. Therefore, finding the optimal features needed to detect them is an area for further investigation. In this paper, we propose a hybrid feature selection (FS) method to evaluate profile metadata features and find these optimal features, which are then evaluated using random forest, naïve Bayes, support vector machines, and neural networks. We found that the cross-validation attribute evaluation performed best compared to other FS methods. Our results show that the random forest classifier with six optimal features achieved the best score of 94.3% for the area under the curve, while maintaining overall 89% accuracy, 83.8% precision, and 83.3% recall for the bot class. We also found that using four features (favorites_count, verified, statuses_count, and average_tweets_per_day) achieves good performance for bot detection (84.1% precision, 81.2% recall).
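A minimal sketch of this kind of evaluation setup, assuming a labeled account table that contains the four named metadata columns; the file name, label column, and classifier settings below are hypothetical and illustrative, not the paper's exact configuration.

```python
# Minimal sketch: evaluating the four named profile-metadata features with a
# random forest, assuming a labeled CSV of accounts (hypothetical file name).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

FEATURES = ["favorites_count", "verified", "statuses_count", "average_tweets_per_day"]

df = pd.read_csv("twitter_accounts.csv")   # hypothetical labeled dataset
X = df[FEATURES].astype(float)             # profile metadata only
y = df["is_bot"]                           # assumed label column: 1 = bot, 0 = human

clf = RandomForestClassifier(n_estimators=200, random_state=42)
# 10-fold cross-validated area under the ROC curve for the bot class
auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
print(f"mean AUC: {auc:.3f}")
```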

F1000Research, 2021, Vol 9, pp. 1255
Author(s): Malik Yousef, Burcu Bakir-Gungor, Amhar Jabeer, Gokhan Goy, Rehman Qureshi, ...

In our earlier study, we proposed a novel feature selection approach, Recursive Cluster Elimination with Support Vector Machines (SVM-RCE), and implemented it in Matlab. Interest in this approach has grown over time, and several researchers have incorporated SVM-RCE into their studies, resulting in a substantial number of scientific publications. This increased interest encouraged us to reconsider how feature selection, particularly in biological datasets, can benefit from considering the relationships among genes during the selection process, which led to our development of SVM-RCE-R. SVM-RCE-R further enhances the capabilities of SVM-RCE by adding a novel user-specified ranking function. This ranking function enables the user to stipulate the weights of accuracy, sensitivity, specificity, f-measure, area under the curve, and precision in the ranking, a flexibility that allows the user to select for greater sensitivity or greater specificity as needed for a specific project. The usefulness of SVM-RCE-R is further supported by the development of the maTE tool, which uses a similar approach to identify microRNA (miRNA) targets. We have also now implemented the SVM-RCE-R algorithm in Knime in order to make it easier to apply. The use of SVM-RCE-R in Knime is simple and intuitive and allows researchers to begin their analysis immediately without having to consult an information technology specialist. The input for the Knime-implemented tool is an Excel file (or text or CSV) with a simple structure, and the output is also an Excel file. The Knime version also incorporates new features not available in SVM-RCE. The results show that the inclusion of the ranking function has a significant impact on the performance of SVM-RCE-R. Some of the clusters that achieve high scores for a specified ranking can also have high scores in other metrics.
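As a rough illustration of how such a user-weighted ranking could be expressed, the sketch below combines the six metrics with user-chosen weights; the weight values, metric dictionary, and function are assumptions for illustration, not the actual SVM-RCE-R Knime interface.

```python
# Minimal sketch of a user-weighted ranking score over per-cluster metrics.
# The weights and example metric values are illustrative assumptions only.
def rank_score(metrics: dict, weights: dict) -> float:
    """Weighted sum of performance metrics for one gene cluster."""
    return sum(weights[name] * metrics[name] for name in weights)

# Weights chosen to favor sensitivity over the other metrics.
weights = {"accuracy": 0.1, "sensitivity": 0.4, "specificity": 0.1,
           "f_measure": 0.1, "auc": 0.2, "precision": 0.1}

cluster_metrics = {"accuracy": 0.91, "sensitivity": 0.88, "specificity": 0.93,
                   "f_measure": 0.90, "auc": 0.95, "precision": 0.92}

print(f"cluster score: {rank_score(cluster_metrics, weights):.3f}")
```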


2019, Vol 8 (4), pp. 5160-5165

Feature selection is a powerful tool to identify the important characteristics of data for prediction. It can therefore help avoid overfitting, improve prediction accuracy, and reduce execution time. Feature selection procedures are particularly important for support vector machines, which are used for prediction on large datasets: the larger the dataset, the more computationally demanding it is to build a predictive model with the support vector classifier. This paper investigates how a feature selection approach based on the analysis of variance (ANOVA) can be optimized for Support Vector Machines (SVMs) to improve their execution time and accuracy. We introduce new conditions on the SVMs prior to running the ANOVA to optimize the performance of the support vector classifier, and we establish the bootstrap procedure as an alternative to cross-validation for model selection. We run our experiments on popular datasets and compare our results to existing modifications of SVMs with feature selection. We propose a number of ANOVA-SVM modifications that are simple to perform while significantly improving the accuracy and computing time of the SVMs compared to existing methods such as the Mixed Integer Linear Feature Selection approach.
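A minimal sketch of ANOVA-based filtering in front of an SVM, using scikit-learn's F-test selector on a synthetic dataset; it illustrates the general ANOVA-SVM idea only and does not reproduce the paper's added conditions or its bootstrap-based model selection.

```python
# Minimal sketch: ANOVA F-test feature filtering ahead of an SVM classifier.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in dataset; the paper uses popular benchmark datasets instead.
X, y = make_classification(n_samples=500, n_features=50, n_informative=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("anova", SelectKBest(f_classif, k=10)),   # keep the 10 features with the highest F-score
    ("svm", SVC(kernel="linear", C=1.0)),
])
print("cv accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```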


2004, Vol 13 (04), pp. 791-800
Author(s): Holger Fröhlich, Olivier Chapelle, Bernhard Schölkopf

The problem of feature selection is a difficult combinatorial task in Machine Learning and of high practical relevance, e.g. in bioinformatics. Genetic Algorithms (GAs) offer a natural way to solve this problem. In this paper we present a special Genetic Algorithm which takes into account the existing bounds on the generalization error for Support Vector Machines (SVMs). This new approach is compared to the traditional method of performing cross-validation and to other existing algorithms for feature selection.
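A minimal sketch of a GA wrapper for SVM feature selection under simplifying assumptions: fitness here is plain cross-validated accuracy rather than the SVM generalization-error bounds the paper exploits, and the dataset is synthetic.

```python
# Minimal GA sketch for SVM feature-subset selection (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=30, n_informative=6, random_state=0)

def fitness(mask):
    """Cross-validated accuracy of a linear SVM on the selected features."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(kernel="linear"), X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(20, X.shape[1]))            # random bit-mask population
for generation in range(15):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]                 # keep the best half
    children = []
    for _ in range(len(pop) - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, X.shape[1])                   # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.05                # bit-flip mutation
        child[flip] ^= 1
        children.append(child)
    pop = np.vstack([parents] + children)

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```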


Author(s): Ahmed Abdullah Farid, Gamal Selim, Hatem Khater

Breast cancer is a significant health issue across the world and the most widely diagnosed cancer in women; early-stage diagnosis and therapy improve patient outcomes. This paper proposes a composite hybrid feature selection set based on an optimized genetic algorithm (CHFS-BOGA) to predict breast cancer. This hybrid feature selection approach combines the advantages of three filter feature selection approaches with an optimized Genetic Algorithm (OGA) to select the best features and improve the performance and scalability of the classification process. We build the OGA by improving the initial population generation and the genetic operators, using the results of the filter approaches as prior information, and by using the C4.5 decision tree classifier as a fitness function instead of probabilistic and random selection. The data are the Wisconsin breast cancer dataset from the UCI machine learning repository, with a total of 569 rows and 32 columns. The dataset was analyzed using the Explorer interface of the Weka open-source data mining software. The results show that the proposed hybrid feature selection approach significantly outperforms the single filter approaches and principal component analysis (PCA) for optimal feature selection, and the selected features are good indicators for prediction. The highest accuracy achieved with the support vector machine (SVM) classifier before applying the proposed system (CHFS-BOGA) was 97.3%. The highest accuracy after (CHFS-BOGA-SVM) was 98.25% on a 70.0% train / remainder test split and 100% on the full training set. Moreover, the area under the receiver operating characteristic (ROC) curve was equal to 1.0. The results showed that the proposed CHFS-BOGA-SVM system was able to accurately classify breast tumors as malignant or benign.
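A minimal sketch of the evaluation setup on scikit-learn's copy of the Wisconsin diagnostic dataset (569 samples), with a simple chi-squared filter standing in for the CHFS-BOGA-selected subset; the filter choice, the value of k, and the SVM settings are assumptions, and CHFS-BOGA itself is not reproduced here.

```python
# Minimal sketch: SVM on the Wisconsin diagnostic data with a simple filter
# selector as a stand-in for the CHFS-BOGA feature subset (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)                  # 569 samples, 30 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=1)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                               # chi2 needs non-negative inputs
    ("filter", SelectKBest(chi2, k=12)),                     # assumed subset size
    ("svm", SVC(kernel="rbf", C=10.0)),
])
pipe.fit(X_tr, y_tr)
print("test accuracy:", pipe.score(X_te, y_te))
```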


Author(s): A. Gaspar-Cunha, F. Mendes, J. Duarte, A. Vieira, B. Ribeiro, ...

In this work a Multi-Objective Evolutionary Algorithm (MOEA) was applied to feature selection for the problem of bankruptcy prediction. The algorithm maximizes the accuracy of the classifier while keeping the number of features low. A two-objective problem, the minimization of the number of features and the maximization of accuracy, was fully analyzed using Logistic Regression (LR) and Support Vector Machine (SVM) classifiers. Simultaneously, the parameters required by both classifiers were optimized, and the validity of the proposed methodology was tested using a database containing the financial statements of 1200 medium-sized private French companies. Extensive tests show that the MOEA is an efficient feature selection approach, and the best results were obtained when both the accuracy and the classifier parameters were optimized. The proposed method can provide useful information for decision makers in characterizing the financial health of a company.
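A minimal sketch of the two-objective trade-off: given candidate feature subsets scored by (number of features, accuracy), keep only the non-dominated ones. The candidate values are made up for illustration, and this shows the Pareto idea only, not the MOEA used in the paper.

```python
# Minimal sketch: extract the Pareto front of (number of features, accuracy) pairs,
# where fewer features and higher accuracy are both preferred.
def pareto_front(candidates):
    """candidates: list of (n_features, accuracy) pairs."""
    front = []
    for n, acc in candidates:
        dominated = any(n2 <= n and a2 >= acc and (n2 < n or a2 > acc)
                        for n2, a2 in candidates)
        if not dominated:
            front.append((n, acc))
    return sorted(front)

# Hypothetical candidate subsets produced by some search procedure.
candidates = [(3, 0.78), (5, 0.84), (8, 0.84), (10, 0.86), (15, 0.85)]
print(pareto_front(candidates))   # -> [(3, 0.78), (5, 0.84), (10, 0.86)]
```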


F1000Research, 2020, Vol 9, pp. 1255
Author(s): Malik Yousef, Burcu Bakir-Gungor, Amhar Jabeer, Gokhan Goy, Rehman Qureshi, ...

In our earlier study, we proposed a novel feature selection approach, Recursive Cluster Elimination with Support Vector Machines (SVM-RCE), and implemented it in Matlab. Interest in this approach has grown over time, and several researchers have incorporated SVM-RCE into their studies, resulting in a substantial number of scientific publications. This increased interest encouraged us to reconsider how feature selection, particularly in biological datasets, can benefit from considering the relationships among genes during the selection process, which led to our development of SVM-RCE-R. The usefulness of SVM-RCE-R is further supported by the development of the maTE tool, which uses a similar approach to identify microRNA (miRNA) targets. We have now implemented the SVM-RCE-R algorithm in Knime in order to make it easier to apply and more accessible to the biomedical community. The use of SVM-RCE-R in Knime is simple and intuitive, allowing researchers to begin their data analysis immediately without having to consult an information technology specialist. The input for the Knime tool is an Excel file (or text or CSV) with a simple structure, and the output is also an Excel file. The Knime version also incorporates new features not available in the previous version. One of these features is a user-specified ranking function that enables the user to provide the weights of accuracy, sensitivity, specificity, f-measure, area under the curve, and precision in the ranking function, allowing the user to select for greater sensitivity or greater specificity as needed. The results show that the ranking function has an impact on the performance of SVM-RCE-R. Some of the clusters that achieve high scores for a specified ranking can also have high scores in other metrics. This finding motivates future studies to suggest the optimal ranking function.

