An Effective Text Classifier using Machine Learning for Identifying Tweets’ Polarity Concerning Terrorist Connotation

Norah AL-Harbi; Amirrudin Bin Kamsin

doi:10.5815/ijitcs.2021.05.02

International Journal of Information Technology and Computer Science ◽

10.5815/ijitcs.2021.05.02 ◽

2021 ◽

Vol 13 (5) ◽

pp. 19-29

Author(s):

Norah AL-Harbi ◽

Amirrudin Bin Kamsin

Keyword(s):

Machine Learning ◽

Social Networking Sites ◽

Classification Accuracy ◽

Arab World ◽

Arabic Language ◽

Machine Learning Algorithms ◽

Language Models ◽

Svm Classifier ◽

The Past ◽

Linear Svm

Over the past few years, terrorist groups in the Arab world have used social networking sites such as Twitter and Facebook to spread terror rapidly. Detecting and suspending such accounts is one way to control the menace to some extent. This research aims to build an effective text classifier that uses machine learning to identify the polarity of tweets automatically. Five classifiers were chosen: AdB_SAMME, AdB_SAMME.R, Linear SVM, NB, and LR. These classifiers were applied to three feature sets: S1 (one word, unigram), S2 (word pair, bigram), and S3 (word triplet, trigram). All five classifiers evaluated S1, S2, and S3 on 346 preprocessed tweets. Feature extraction used tf-idf (term frequency-inverse document frequency), one of the most widely applied weighting schemes. The results were validated by four experts in the Arabic language (three teachers and an educational supervisor in Saudi Arabia) through a questionnaire. The study found that Linear SVM yielded the best result among all classifiers used, with 99.7% classification accuracy on S3. When both classification accuracy and time were considered, the NB classifier on S1 reached 99.4% accuracy, which was comparable with Linear SVM. The Arab world has faced massive terrorist attacks in the past, so the research is highly relevant because of its specific focus on detecting terrorism messages in Arabic. State-of-the-art methods for tweet classification mostly focus on English text, hence the need for machine learning algorithms that detect Arabic terrorism messages. The innovative aspect of the model presented in the current study is that the five best classifiers were selected and applied to the three language models S1, S2, and S3. The comparative analysis, based on classification accuracy and time constraints, identified the best classifiers for sentiment analysis in the Arabic language.
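
The feature sets S1, S2, and S3 described above can be illustrated with a minimal sketch of word n-gram extraction followed by tf-idf weighting. The abstract does not state which tf-idf variant or tokenizer the authors used, so the formula here (relative term frequency times log(N / df)), the whitespace tokenizer, and the toy English corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams (joined by spaces) of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n):
    """Compute tf-idf weights per document for word n-grams of order n.

    tf is the n-gram count divided by the number of n-grams in the document;
    idf is log(N / df). This is one common variant, assumed here.
    """
    grams = [word_ngrams(d, n) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for doc in grams for g in set(doc))
    total_docs = len(docs)
    weights = []
    for doc in grams:
        counts = Counter(doc)
        weights.append({
            g: (c / len(doc)) * math.log(total_docs / df[g])
            for g, c in counts.items()
        })
    return weights


# Hypothetical toy corpus (not taken from the study's 346 tweets).
docs = ["peace and hope for all", "hope for all people", "threat and fear for all"]
for n, name in [(1, "S1"), (2, "S2"), (3, "S3")]:
    print(name, tfidf(docs, n)[0])
```

Under this variant, an n-gram that occurs in every document gets weight zero, which is why longer n-grams such as trigrams tend to produce sparser but more distinctive features.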

Tebyan: Fake News Detection System (Preprint)

10.2196/preprints.35982 ◽

2021 ◽

Author(s):

Lamya Alderywsh ◽

Aseel Aldawood ◽

Ashwag Alasmari ◽

Farah Aldeijy ◽

Ghadah Alqubisy ◽

...

Keyword(s):

Machine Learning ◽

Arab World ◽

Detection System ◽

Learning Algorithms ◽

Performance Measure ◽

Machine Learning Algorithms ◽

Svm Classifier ◽

Fake News ◽

Typical Type ◽

Performance Results

BACKGROUND Fake news spreading via deceptive machine-generated text is a serious threat in technologically advanced societies, including those in the Arab world. In the last decade, Arabic fake news identification has gained increased attention, and numerous detection approaches have shown some ability to find fake news across various data sources. Nevertheless, many existing approaches overlook recent advancements in fake news detection, specifically the incorporation of machine learning algorithms. OBJECTIVE The Tebyan project aims to address the problem of fake news by developing a detection system that employs machine learning algorithms to determine whether news is fake or real in the context of the Arab world. METHODS The project went through numerous phases, using an iterative methodology to develop the system and to contextualize fake news with respect to society's information. It consisted of implementing the machine learning system in Python and collecting genuine and fake news datasets. The study also assesses, through system testing approaches, how information-exchanging behaviors can help find the optimal source for authenticating emerging news. RESULTS The main deliverable of this project is the Tebyan system, which allows users in the community to check the credibility of news in Arabic newspapers. The SVM classifier exhibited, on average, the highest performance, reaching 90% in every performance measure across sources. The second-best algorithm was the linear SVC, which also reached 90% in the performance measure on the typical type of fake information in these societies. CONCLUSIONS The study concludes that building a system with machine learning algorithms in the Python programming language allows rapid measurement of users' perceptions, letting them comment on and rate the credibility result and subscribe to news email services.

Machine Learning in P&C Insurance: A Review for Pricing and Reserving

Risks ◽

10.3390/risks9010004 ◽

2020 ◽

Vol 9 (1) ◽

pp. 4 ◽

Cited By ~ 1

Author(s):

Christopher Blier-Wong ◽

Hélène Cossette ◽

Luc Lamontagne ◽

Etienne Marceau

Keyword(s):

Machine Learning ◽

Insurance Industry ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Actuarial Science ◽

Science Field ◽

The Past ◽

Highly Nonlinear ◽

Computer Scientists ◽

Applications Of Machine Learning

In the past 25 years, computer scientists and statisticians developed machine learning algorithms capable of modeling highly nonlinear transformations and interactions of input features. While actuaries use GLMs frequently in practice, only in the past few years have they begun studying these newer algorithms to tackle insurance-related tasks. In this work, we aim to review the applications of machine learning to the actuarial science field and present the current state of the art in ratemaking and reserving. We first give an overview of neural networks, then briefly outline applications of machine learning algorithms in actuarial science tasks. Finally, we summarize the future trends of machine learning for the insurance industry.

Classification of Diffusion Tensor Metrics for the Diagnosis of a Myelopathic Cord Using Machine Learning

International Journal of Neural Systems ◽

10.1142/s0129065717500368 ◽

2018 ◽

Vol 28 (02) ◽

pp. 1750036 ◽

Cited By ~ 8

Author(s):

Shuqiang Wang ◽

Yong Hu ◽

Yanyan Shen ◽

Hanxiong Li

Keyword(s):

Machine Learning ◽

Surgical Planning ◽

Diffusion Tensor ◽

Mean Value ◽

Machine Learning Algorithms ◽

Support Vector ◽

Svm Classifier ◽

Control Groups ◽

Diffusion Tensor Imaging Dti

In this study, we propose an automated framework that combines diffusion tensor imaging (DTI) metrics with machine learning algorithms to accurately classify control groups and groups with cervical spondylotic myelopathy (CSM) in the spinal cord. A comparison between selected voxel-based classification and mean value-based classification was performed. A support vector machine (SVM) classifier using a selected voxel-based dataset produced an accuracy of 95.73%, a sensitivity of 93.41%, and a specificity of 98.64%. The efficacy of each index of diffusion for classification was also evaluated. Using the proposed approach, myelopathic areas in CSM are detected to provide an accurate reference to assist spine surgeons in surgical planning in complicated cases.

A Novel Fast Training Method for SVM and Its Application in Fault Diagnosis of Service Robot

International Journal of Online Engineering (iJOE) ◽

10.3991/ijoe.v11i6.4846 ◽

2015 ◽

Vol 11 (6) ◽

pp. 4 ◽

Cited By ~ 2

Author(s):

Xianfeng Yuan ◽

Mumin Song ◽

Fengyu Zhou ◽

Yugang Wang ◽

Zhumin Chen

Keyword(s):

Fault Diagnosis ◽

Classification Accuracy ◽

Machine Learning Algorithms ◽

Sensor Data ◽

Service Robot ◽

Support Vector ◽

Svm Classifier ◽

Processing Unit ◽

Training Method ◽

Fast Training

Support Vector Machines (SVMs) are a set of popular machine learning algorithms that have been applied successfully in diverse areas, but for large training data sets the processing time and computational costs are prohibitive. This paper presents a novel fast training method for SVM, which is applied to the fault diagnosis of a service robot. First, sensor data are sampled under different running conditions of the robot, and the samples are divided into training sets and testing sets. Second, the sampled data are preprocessed and a principal component analysis (PCA) model is established for fault feature extraction. Third, the feature vectors are used to train the SVM classifier, which performs the fault diagnosis of the robot. To speed up SVM training, on the one hand, sample reduction is done using the proposed support vectors selection (SVS) algorithm, which can ensure good classification accuracy and generalization capability. On the other hand, we take advantage of the excellent parallel computing abilities of the Graphics Processing Unit (GPU) to pre-calculate the kernel matrix, which avoids recalculation during the cross-validation process. Experimental results illustrate that the proposed method can significantly reduce the training time without decreasing the classification accuracy.
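
The idea of pre-calculating the kernel matrix once and reusing it across cross-validation folds can be sketched as follows. This is a plain CPU illustration, not the authors' GPU implementation. The RBF kernel, the `gamma` value, the contiguous fold layout, and the helper names (`rbf_kernel_matrix`, `submatrix`, `kfold_indices`) are assumptions introduced here.

```python
import math


def rbf_kernel_matrix(X, gamma):
    """Return the full Gram matrix K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = K[j][i] = math.exp(-gamma * sq)  # kernel is symmetric
    return K


def submatrix(K, rows, cols):
    """Slice the precomputed kernel instead of recomputing kernel values."""
    return [[K[i][j] for j in cols] for i in rows]


def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical toy feature vectors (for example, PCA-reduced sensor features).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = rbf_kernel_matrix(X, gamma=0.5)  # computed once, before cross-validation
for train, test in kfold_indices(len(X), k=2):
    K_train = submatrix(K, train, train)  # would be passed to SVM training
    K_test = submatrix(K, test, train)    # would be passed to SVM prediction
    print(len(K_train), "x", len(K_train[0]), "train;",
          len(K_test), "x", len(K_test[0]), "test")
```

Each fold only indexes into `K`, so the pairwise kernel evaluations, which dominate the cost for large training sets, are paid once rather than once per fold.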

Classification of Emotional States in Parkinson’s Disease Patients Using Time, Frequency and Time-Frequency Analysis

10.21203/rs.3.rs-273617/v1 ◽

2021 ◽

Author(s):

Rejith K.N ◽

Kamalraj Subramaniam ◽

Ayyem Pillai Vasudevan Pillai ◽

Roshini T V ◽

Renjith V. Ravi ◽

...

Keyword(s):

Frequency Domain ◽

Classification Accuracy ◽

Machine Learning Algorithms ◽

Signal Frequency ◽

Spectral Energy ◽

Support Vector ◽

Svm Classifier ◽

Time Frequency ◽

Energy Entropy ◽

Teager Energy

In this work, Parkinson’s disease (PD) patients and healthy individuals were categorized with machine learning algorithms. EEG signals associated with six emotions, happiness (E1), sadness (E2), fear (E3), anger (E4), surprise (E5), and disgust (E6), were used for the study. EEG data were collected from 20 PD patients and 20 normal controls using multimodal stimuli. Different features were used to categorize emotional data. Emotion recognition in PD was investigated in three domains, namely time, frequency, and time-frequency, using Entropy, Energy-Entropy, and Teager Energy-Entropy features. Three classifiers, namely K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Probabilistic Neural Network (PNN), were used to observe the classification results. Emotional EEG stimuli such as anger, surprise, happiness, sadness, fear, and disgust were used to categorize PD patients and healthy controls (HC). For each EEG signal, frequency features corresponding to the alpha, beta, and gamma bands were obtained for nine feature extraction methods (Entropy, Energy-Entropy, Teager Energy-Entropy, Spectral Entropy, Spectral Energy-Entropy, Spectral Teager Energy-Entropy, STFT Entropy, STFT Energy-Entropy, and STFT Teager Energy-Entropy). The analysis shows that the entropy feature in the frequency domain performs evenly well (above 80%) for all six emotions with KNN. Classification results show that the selected energy-entropy combination feature in the frequency domain provides the highest accuracy for all emotions except E1 and E2 for the KNN and SVM classifiers, whereas other features give accuracy values above 60% for most emotions. Emotion E1 also gives above 90% classification accuracy for all classifiers in the time domain, and in the frequency domain E1 gives above 90% classification accuracy using the PNN classifier.
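
The abstract does not define its entropy and Teager energy features, so the sketch below shows only the standard building blocks: Shannon entropy of a normalized distribution and the discrete Teager-Kaiser energy operator, psi[n] = x[n]^2 - x[n-1] * x[n+1]. Combining them by taking the entropy of the absolute Teager energy values is an assumption made here for illustration and may differ from the authors' feature definitions.

```python
import math


def shannon_entropy(values):
    """Shannon entropy in bits of non-negative values normalized to sum to 1.

    Assumes at least one value is positive.
    """
    total = sum(values)
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log2(p) for p in probs)


def teager_energy(x):
    """Teager-Kaiser energy operator on the interior samples of x."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Hypothetical short signal segment (not EEG data from the study).
segment = [1, 2, 3, 4]
tke = teager_energy(segment)
# Assumed combination: entropy of the absolute Teager energy profile.
feature = shannon_entropy([abs(e) for e in tke])
print(tke, feature)
```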

Heart Disease Prediction using Machine Learning

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.36372 ◽

2021 ◽

Vol 9 (VI) ◽

pp. 846-852

Author(s):

Prof. Dr. R. Sandhiya

Keyword(s):

Machine Learning ◽

Heart Disease ◽

Data Science ◽

Machine Learning Algorithms ◽

Assessment Process ◽

Disease Prediction ◽

Test Results ◽

Medical Field ◽

Modern Age ◽

Linear Svm

In recent times, the diagnosis of heart disease has become a very critical task in the medical field. In the modern age, one person dies every minute due to heart disease. Data science plays an important role in processing large amounts of data in the health sciences. Since the diagnosis of heart disease is a complex task, the assessment process should be automated to avoid the risks associated with it and to alert the patient in advance. This paper uses the heart disease dataset available in the UCI Machine Learning Repository. The proposed work assesses the risk of heart disease in a patient by applying various data mining methods such as Naive Bayes, Decision Tree, KNN, Linear SVM, RBF SVM, Gaussian Process, Neural Network, AdaBoost, QDA, and Random Forest. This paper provides a comparative study by analyzing the performance of various machine learning algorithms. Test results confirm that the KNN algorithm achieved the highest accuracy, 97%, compared to the other implemented ML algorithms.

Hybridization of Machine Learning Algorithm in Intrusion Detection System

Handbook of Research on Machine and Deep Learning Applications for Cyber Security - Advances in Information Security, Privacy, and Ethics ◽

10.4018/978-1-5225-9611-0.ch008 ◽

2020 ◽

pp. 150-175

Author(s):

Amudha P. ◽

Sivakumari S.

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

Classification Accuracy ◽

Intrusion Detection System ◽

Learning Algorithm ◽

Detection System ◽

Principal Component ◽

Machine Learning Algorithms ◽

Feature Selection Technique ◽

Efficient Manner

In recent years, the field of machine learning has grown very fast, both in the development of techniques and in their application to intrusion detection. The computational complexity of machine learning algorithms increases rapidly as the number of features in the datasets increases. By choosing the significant features, the number of features in the dataset can be reduced, which is critical to improving the classification accuracy and speed of algorithms. Achieving high accuracy and detection rates while lowering false alarm rates is also a major challenge in designing an intrusion detection system. The major motivation of this work is to address these issues by hybridizing machine learning and swarm intelligence algorithms to enhance the performance of intrusion detection systems. It also emphasizes applying principal component analysis as a feature selection technique on intrusion detection datasets to identify the most suitable feature subsets, which may provide high-quality results in a fast and efficient manner.

Towards scaling Twitter for digital epidemiology of birth defects

npj Digital Medicine ◽

10.1038/s41746-019-0170-5 ◽

2019 ◽

Vol 2 (1) ◽

Cited By ~ 4

Author(s):

Ari Z. Klein ◽

Abeed Sarker ◽

Davy Weissenbacher ◽

Graciela Gonzalez-Hernandez

Keyword(s):

Machine Learning ◽

Social Media ◽

Language Processing ◽

Birth Defects ◽

Birth Defect ◽

Learning Algorithms ◽

Class Imbalance ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Svm Classifier

Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes (the leading cause of infant mortality) could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms (feature-engineered and deep learning-based classifiers) that automatically distinguish tweets referring to the user’s pregnancy outcome from tweets that merely mention birth defects. Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F1-score of 0.65 for the “defect” class and 0.51 for the “possible defect” class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.
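
The under-sampling and over-sampling mentioned above can be illustrated with a minimal sketch of the two simplest variants, random under-sampling of the majority class and random over-sampling by duplication. The abstract does not say which sampling methods or target ratios the authors used, so the functions, the balancing targets, and the toy label distribution below are assumptions.

```python
import random
from collections import Counter


def random_undersample(samples, labels, majority, rng):
    """Randomly drop majority-class samples down to the size of the largest other class."""
    counts = Counter(labels)
    target = max(c for lab, c in counts.items() if lab != majority)
    maj_idx = [i for i, lab in enumerate(labels) if lab == majority]
    keep = set(rng.sample(maj_idx, target))
    idx = [i for i, lab in enumerate(labels) if lab != majority or i in keep]
    return [samples[i] for i in idx], [labels[i] for i in idx]


def random_oversample(samples, labels, rng):
    """Duplicate samples at random until every class matches the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    out_s, out_l = list(samples), list(labels)
    for lab, c in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == lab]
        for _ in range(target - c):
            out_s.append(rng.choice(pool))
            out_l.append(lab)
    return out_s, out_l


# Hypothetical toy data: 90% of items belong to the "mention" class.
labels = ["mention"] * 18 + ["defect"] + ["possible defect"]
samples = list(range(len(labels)))  # stand-ins for tweet feature vectors
rng = random.Random(0)

under_s, under_l = random_undersample(samples, labels, "mention", rng)
over_s, over_l = random_oversample(samples, labels, rng)
print(Counter(under_l))
print(Counter(over_l))
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class distribution.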

A Novel Method for Colorectal Cancer Screening Based on Circulating Tumor Cells and Machine Learning

Entropy ◽

10.3390/e23101248 ◽

2021 ◽

Vol 23 (10) ◽

pp. 1248

Author(s):

Eleana Hatzidaki ◽

Aggelos Iliopoulos ◽

Ioannis Papasotiriou

Keyword(s):

Colorectal Cancer ◽

Machine Learning ◽

Flow Cytometry ◽

Cancer Screening ◽

Tumor Cells ◽

Circulating Tumor Cells ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Svm Classifier ◽

Significant Information

Colorectal cancer is one of the most common types of cancer, and it can have a high mortality rate if left untreated or undiagnosed. The fact that CRC becomes symptomatic at advanced stages highlights the importance of early screening. The reference screening method for CRC is colonoscopy, an invasive, time-consuming procedure that requires sedation or anesthesia and is recommended from a certain age and above. The aim of this study was to build a machine learning classifier that can distinguish cancer from non-cancer samples. For this, circulating tumor cells were enumerated using flow cytometry. Their numbers were used as a training set for building an optimized SVM classifier that was subsequently used on a blind set. The SVM classifier’s accuracy on the blind samples was found to be 90.0%, sensitivity was 80.0%, specificity was 100.0%, precision was 100.0% and AUC was 0.98. Finally, in order to test the generalizability of our method, we also compared the performances of different classifiers developed by various machine learning models, using over-sampling datasets generated by the SMOTE algorithm. The results showed that SVM achieved the best performances according to the validation accuracy metric. Overall, our results demonstrate that CTCs enumerated by flow cytometry can provide significant information, which can be used in machine learning algorithms to successfully discriminate between healthy and colorectal cancer patients. The clinical significance of this method could be the development of a simple, fast, non-invasive cancer screening tool based on blood CTC enumeration by flow cytometry and machine learning algorithms.
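
The reported metrics are related through the confusion matrix, as the short sketch below shows. The abstract does not give the size or composition of the blind set, so the counts used here (4 true positives, 1 false negative, 5 true negatives, 0 false positives) are hypothetical, chosen only because they reproduce the reported accuracy, sensitivity, specificity, and precision. AUC needs the classifier's scores and is not covered.

```python
def screening_metrics(tp, fn, tn, fp):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
    }


# Hypothetical counts, not taken from the study.
metrics = screening_metrics(tp=4, fn=1, tn=5, fp=0)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With no false positives, specificity and precision are both 100%, while the single missed cancer sample lowers sensitivity to 80% and accuracy to 90%.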

A Hybrid Feature Selection Method for Improve the Accuracy of Medical Classification Process

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a9624.1111121 ◽

2021 ◽

Vol 11 (1) ◽

pp. 50-55

Author(s):

Maria Mohammad Yousef

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Dimensionality Reduction ◽

Classification Accuracy ◽

Fitness Function ◽

Machine Learning Algorithms ◽

Feature Subset Selection ◽

High Dimensionality ◽

Support Vector ◽

Feature Subset

Medical dataset classification has become one of the biggest problems in data mining research. Every dataset has a given number of features, but some of them can be redundant or even harmful and can disrupt classification; this is known as the high dimensionality problem. Dimensionality reduction during data preprocessing is critical for increasing the performance of machine learning algorithms, and feature subset selection contributes to it with a significant improvement in classification accuracy. In this paper, we propose a new hybrid feature selection approach, a genetic algorithm (GA) assisted by k-Nearest Neighbor (kNN), to deal with high dimensionality in biomedical data classification. The proposed method first combines GA and kNN for feature selection to find the optimal subset of features, with the classification accuracy of kNN used as the GA fitness function. After the best subset of features is selected, a Support Vector Machine (SVM) is used as the classifier. The proposed method was evaluated on five medical datasets from the UCI Machine Learning Repository. The suggested technique performs well on these datasets, achieving higher classification accuracy while using fewer features.
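
A rough sketch of GA-based feature selection with kNN accuracy as the fitness function is given below. The abstract does not specify the encoding, operators, or parameters, so the bit-mask encoding, leave-one-out evaluation, one-point crossover, bit-flip mutation, tie-break toward fewer features, and all parameter values are assumptions, and the toy data is hypothetical. The final SVM stage is omitted.

```python
import random


def knn_loo_accuracy(X, y, mask, k=1):
    """Leave-one-out accuracy of k-NN using only the features where mask[j] == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i in range(len(X)):
        dists = sorted(
            (sum((X[i][j] - X[m][j]) ** 2 for j in cols), y[m])
            for m in range(len(X)) if m != i
        )
        votes = [label for _, label in dists[:k]]
        if max(set(votes), key=votes.count) == y[i]:
            correct += 1
    return correct / len(X)


def ga_select(X, y, rng, pop_size=8, generations=10, mutation_rate=0.1):
    """Tiny genetic algorithm over feature bit masks with k-NN accuracy as fitness."""
    n = len(X[0])

    def fitness(mask):
        # Higher accuracy first; among equals, prefer fewer selected features.
        return (knn_loo_accuracy(X, y, mask), -sum(mask))

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # keep the fitter half (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [[0.0, 0.3, 0.9], [0.2, 0.8, 0.1], [0.1, 0.5, 0.5],
     [10.0, 0.4, 0.8], [10.2, 0.9, 0.2], [10.1, 0.1, 0.6]]
y = [0, 0, 0, 1, 1, 1]
best = ga_select(X, y, random.Random(42))
print(best, knn_loo_accuracy(X, y, best))
```

In a pipeline like the one described, the columns selected by the returned mask would then be passed to the SVM classifier for the final classification.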
