Fake news spreaders profiling using N-grams of various types and SHAP-based feature selection

2021 ◽  
pp. 1-12
Author(s):  
Fazlourrahman Balouchzahi ◽  
Grigori Sidorov ◽  
Hosahalli Lakshmaiah Shashirekha

Complex learning approaches along with complicated and expensive features are not always the best or the only solution for Natural Language Processing (NLP) tasks. Despite huge progress and advancements in learning approaches such as Deep Learning (DL) and Transfer Learning (TL), there are many NLP tasks, such as Text Classification (TC), for which basic Machine Learning (ML) classifiers outperform DL or TL approaches. In addition, an efficient feature engineering step can significantly improve the performance of ML-based systems. To check the efficacy of ML-based systems and feature engineering on TC, this paper explores character, character-sequence, syllable, and word n-grams as well as syntactic n-grams as features, and uses SHapley Additive exPlanations (SHAP) values to select the important features from the collection of extracted features. Voting Classifiers (VC) with soft and hard voting over four ML classifiers, namely Support Vector Machine (SVM) with Linear and Radial Basis Function (RBF) kernels, Logistic Regression (LR), and Random Forest (RF), were trained and evaluated on the Fake News Spreaders Profiling (FNSP) shared task dataset from PAN 2020. This shared task consists of profiling fake news spreaders in English and Spanish. The proposed models achieved an average accuracy of 0.785 across both languages and outperformed the best models submitted to the shared task.
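For illustration only, here is a minimal sketch of the kind of pipeline the abstract describes (char/word n-gram features, SHAP-based feature selection, and a soft-voting ensemble of the four classifiers), assuming scikit-learn and the shap package; the corpus loader `load_fnsp_corpus` and the top-1000 feature cut-off are hypothetical choices, not the authors' exact configuration.

```python
# Minimal sketch: char/word n-gram features, SHAP-based feature selection,
# and a soft-voting ensemble of SVM (linear + RBF), LR, and RF.
# The corpus loader and the SHAP cut-off are illustrative assumptions.
import numpy as np
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import FeatureUnion

texts, labels = load_fnsp_corpus("en")          # hypothetical loader: author feeds + labels

# Character and word n-gram features
vectorizer = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
])
X = vectorizer.fit_transform(texts).toarray()    # dense matrix for the explainer

# SHAP values from a quick RF fit; keep the most influential features
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
sv = shap.TreeExplainer(rf).shap_values(X)
sv = sv[1] if isinstance(sv, list) else sv       # older SHAP returns one array per class
importance = np.abs(sv).mean(axis=0)
if importance.ndim > 1:                          # newer SHAP stacks classes on a last axis
    importance = importance.mean(axis=-1)
keep = importance.argsort()[-1000:]              # top-1000 features (assumed cut-off)
X_sel = X[:, keep]

# Soft-voting ensemble of the four base classifiers
voter = VotingClassifier(
    estimators=[
        ("svm_lin", SVC(kernel="linear", probability=True)),
        ("svm_rbf", SVC(kernel="rbf", probability=True)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    voting="soft",
)
voter.fit(X_sel, labels)
```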

2008 ◽  
Vol 34 (2) ◽  
pp. 193-224 ◽  
Author(s):  
Alessandro Moschitti ◽  
Daniele Pighin ◽  
Roberto Basili

The availability of large-scale data sets of manually annotated predicate-argument structures has recently favored the use of machine learning approaches to the design of automated semantic role labeling (SRL) systems. The main research in this area relates to the design choices for feature representation and for effective decompositions of the task into different learning models. Regarding the former choice, structural properties of full syntactic parses are largely employed, as they represent ways to encode different principles suggested by the linking theory between syntax and semantics. The latter choice relates to several learning schemes over global views of the parses. For example, re-ranking stages operating over alternative predicate-argument sequences of the same sentence have been shown to be very effective. In this article, we propose several kernel functions to model parse tree properties in kernel-based machines, for example, perceptrons or support vector machines. In particular, we define different kinds of tree kernels as general approaches to feature engineering in SRL. Moreover, we extensively experiment with such kernels to investigate their contribution to individual stages of an SRL architecture, both in isolation and in combination with other traditional manually coded features. The results for boundary recognition, classification, and re-ranking stages provide systematic evidence about the significant impact of tree kernels on the overall accuracy, especially when the amount of training data is small. As a conclusive result, tree kernels allow for a general and easily portable feature engineering method that is applicable to a large family of natural language processing tasks.
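As background for the kernels explored here, the following is a compact sketch of a classic convolution tree kernel in the style of Collins and Duffy, the family that these SRL tree kernels build on; it uses nltk.Tree, and the decay factor and toy parses are illustrative assumptions.

```python
# Sketch of a subset-tree convolution kernel (Collins & Duffy style): the kernel value
# is the number of shared tree fragments, with a decay factor penalising large fragments.
from nltk import Tree

LAMBDA = 0.4  # decay factor (assumed value)

def tree_kernel(t1: Tree, t2: Tree) -> float:
    """K(T1, T2) = sum over all node pairs of the (decayed) count of shared fragments."""
    nodes1, nodes2 = list(t1.subtrees()), list(t2.subtrees())

    def production(n):
        # Node label plus the labels/words of its immediate children
        return (n.label(), tuple(c.label() if isinstance(c, Tree) else c for c in n))

    def delta(n1, n2):
        if production(n1) != production(n2):
            return 0.0
        if all(not isinstance(c, Tree) for c in n1):   # pre-terminal: children are words
            return LAMBDA
        prod = LAMBDA
        for c1, c2 in zip(n1, n2):
            if isinstance(c1, Tree) and isinstance(c2, Tree):
                prod *= 1.0 + delta(c1, c2)
        return prod

    return sum(delta(a, b) for a in nodes1 for b in nodes2)

t1 = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V barks)))")
t2 = Tree.fromstring("(S (NP (D the) (N cat)) (VP (V barks)))")
print(tree_kernel(t1, t2))
```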


Author(s):  
Pawar A B ◽  
Jawale M A ◽  
Kyatanavar D N

The use of Natural Language Processing techniques for detecting fake news is analyzed in this research paper. Fake news consists of misleading content spread by unreliable sources that can cause damage to individuals and society. To carry out this analysis, a dataset obtained from the web resource OpenSources.co, largely drawn from Signal Media, is used. TF-IDF of bi-grams is used in combination with PCFG (Probabilistic Context-Free Grammar) features on a set of 11,000 documents extracted as news articles. This set is tested on several classification algorithms, namely SVM (Support Vector Machines), Stochastic Gradient Descent, Bounded Decision Trees, and Gradient Boosting with Random Forests. The experimental analysis found that the combination of Stochastic Gradient Descent with TF-IDF of bi-grams gives an accuracy of 77.2% in detecting fake content, with the PCFG features showing slight recall deficits.
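A rough sketch of the best-performing combination reported above, TF-IDF over bi-grams fed into a Stochastic Gradient Descent classifier, assuming scikit-learn; the loader `load_news_articles` and the hyperparameters are assumptions.

```python
# Sketch of the reported best combination: TF-IDF over bi-grams + SGD classifier.
# Corpus loading and hyperparameters are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

articles, labels = load_news_articles()          # hypothetical loader: texts + real/fake labels

X_train, X_test, y_train, y_test = train_test_split(
    articles, labels, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(ngram_range=(2, 2), min_df=2)   # bi-grams only
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = SGDClassifier(loss="hinge", max_iter=1000, random_state=42)
clf.fit(X_train_vec, y_train)

print("accuracy:", accuracy_score(y_test, clf.predict(X_test_vec)))
```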


Author(s):  
Mohamed Elleuch ◽  
Monji Kherallah

Deep learning algorithms, machine learning algorithms developed in recent years, have been successfully applied in various domains of computer vision, such as face recognition, object detection and image classification. These deep algorithms aim at extracting a high-level representation of the data via multiple layers in a deep hierarchical structure. However, to the authors' knowledge, these deep learning approaches have not been extensively studied for recognizing Arabic Handwritten Script (AHS). In this paper, the authors present a deep learning model based on the Support Vector Machine (SVM), named Deep SVM. This model has an inherent ability to select the data points crucial for classification, yielding good generalization capabilities. The Deep SVM is constructed as a stack of SVMs, allowing features to be extracted/learned automatically from the raw images and classification to be performed as well. A multi-class SVM with an RBF kernel, providing non-linear discriminative classification, was chosen and tested on the Handwritten Arabic Characters Database (HACDB). Simulation results show the effectiveness of the proposed model.
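The abstract does not spell out the exact stacking scheme, so the following is only a loose sketch of the stacked-SVM idea: earlier one-vs-rest SVM layers turn raw pixels into decision-value features, and a final multi-class RBF-kernel SVM classifies them. The loader `load_hacdb_images` and the layer layout are assumptions, not the authors' exact design.

```python
# Loose sketch of a stacked-SVM ("Deep SVM") idea: earlier SVM layers map raw pixels to
# decision-value features, and a final multi-class RBF SVM classifies the learned representation.
from sklearn.svm import LinearSVC, SVC
from sklearn.multiclass import OneVsRestClassifier

X_raw, y = load_hacdb_images()            # hypothetical loader: flattened character images, labels

def svm_layer(X, y):
    """One 'layer': a one-vs-rest linear SVM whose per-class decision values become new features."""
    layer = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=5000)).fit(X, y)
    return layer, layer.decision_function(X)

# Two intermediate layers producing successively more abstract feature maps
layer1, H1 = svm_layer(X_raw, y)
layer2, H2 = svm_layer(H1, y)

# Final multi-class SVM with an RBF kernel on the learned representation
top = SVC(kernel="rbf", C=10.0, gamma="scale").fit(H2, y)
print("training accuracy:", top.score(H2, y))
```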


2019 ◽  
Vol 27 (1) ◽  
pp. 31-38 ◽  
Author(s):  
Youngjun Kim ◽  
Stéphane M Meystre

Abstract Objective Accurate and complete medication-related information is crucial for effective clinical decision support and precise health care. Recognition and reduction of adverse drug events are also central to effective patient care. The goal of this research is the development of a natural language processing (NLP) system to automatically extract medication and adverse drug event information from electronic health records. This effort was part of the 2018 n2c2 shared task on adverse drug events and medication extraction. Materials and Methods The new NLP system implements a stacked generalization based on a search-based structured prediction algorithm for concept extraction. We trained 4 sequential classifiers using a variety of structured learning algorithms. To enhance accuracy, we created a stacked ensemble consisting of these concept extraction models trained on the shared task training data. We implemented a support vector machine model to identify related concepts. Results Experiments with the official test set showed that our stacked ensemble achieved an F1 score of 92.66%. The relation extraction model with given concepts reached a 93.59% F1 score. Our end-to-end system yielded overall micro-averaged recall, precision, and F1 score of 92.52%, 81.88%, and 86.88%, respectively. Our NLP system for adverse drug events and medication extraction ranked within the top 5 of teams participating in the challenge. Conclusion This study demonstrated that a stacked ensemble with a search-based structured prediction algorithm achieved good performance by effectively integrating the output of individual classifiers and could provide a valid solution for other clinical concept extraction tasks.
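The system's search-based structured prediction is more involved than plain stacking, but the core stacked-generalization idea can be sketched with scikit-learn's StackingClassifier; the token-level loader and base learners below are illustrative simplifications, not the authors' models.

```python
# Generic stacked-generalization sketch: several base classifiers are combined by a
# meta-learner trained on their out-of-fold predictions. The real system stacks
# sequential (structured) taggers; this simplification only illustrates the stacking idea.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X_tokens, y_tags = load_token_features()     # hypothetical: per-token features and concept labels

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("svm", LinearSVC(max_iter=5000)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                                    # out-of-fold predictions for the meta-learner
)
stack.fit(X_tokens, y_tags)
```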


2021 ◽  
Vol 11 (19) ◽  
pp. 9292
Author(s):  
Noman Islam ◽  
Asadullah Shaikh ◽  
Asma Qaiser ◽  
Yousef Asiri ◽  
Sultan Almakdi ◽  
...  

In recent years, consuming social media content to keep up with global news while verifying its authenticity has become a considerable challenge. Social media enables us to easily access news anywhere, anytime, but it also gives rise to the spread of fake news, thereby delivering false information, which has a negative impact on society. Therefore, it is necessary to determine whether or not news spreading over social media is real. This helps avoid confusion among social media users and is important for ensuring positive social development. This paper proposes a novel solution for detecting the authenticity of news through natural language processing techniques. Specifically, it proposes a scheme comprising three steps, namely stance detection, author credibility verification, and machine learning-based classification, to verify the authenticity of news. In the last stage of the proposed pipeline, several machine learning techniques are applied, such as decision tree, random forest, logistic regression, and support vector machine (SVM) algorithms. For this study, the fake news dataset was taken from Kaggle. The experimental results show an accuracy of 93.15%, precision of 92.65%, recall of 95.71%, and F1-score of 94.15% for the support vector machine algorithm. The SVM outperforms the second-best classifier, logistic regression, by 6.82%.
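A sketch of the final classification stage, comparing the four listed classifiers on TF-IDF features of the news texts; the loader `load_kaggle_fake_news`, the feature settings, and the split are assumptions.

```python
# Sketch of the last pipeline stage: comparing decision tree, random forest,
# logistic regression and SVM on TF-IDF features of the news texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

texts, labels = load_kaggle_fake_news()          # hypothetical loader: article texts + labels

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=0)
vec = TfidfVectorizer(stop_words="english", max_features=50_000)
X_train_v, X_test_v = vec.fit_transform(X_train), vec.transform(X_test)

for name, clf in [
    ("decision tree", DecisionTreeClassifier()),
    ("random forest", RandomForestClassifier(n_estimators=200)),
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("svm", LinearSVC(max_iter=5000)),
]:
    clf.fit(X_train_v, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test_v), digits=4))
```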


2019 ◽  
Vol 34 (4) ◽  
pp. 323-333 ◽  
Author(s):  
Thin Van Dang ◽  
Vu Duc Nguyen ◽  
Nguyen Van Kiet ◽  
Nguyen Luu Thuy Ngan

Along with the explosion of user reviews on the Internet, sentiment analysis has become one of the trending research topics in the field of natural language processing. In the last five years, many shared tasks were organized to keep track of the progress of sentiment analysis for various languages. In the Fifth International Workshop on Vietnamese Language and Speech Processing (VLSP 2018), the Sentiment Analysis shared task was the first evaluation campaign for the Vietnamese language. In this paper, we describe our system for this shared task. We employ a supervised learning method based on Support Vector Machine classifiers combined with a variety of features. We obtained an F1-score of 61% for both domains, which was ranked highest in the shared task. For the aspect detection subtask, our method achieved F1-scores of 77% and 69% for the restaurant domain and the hotel domain, respectively.
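As an illustration of the aspect detection subtask, a minimal multi-label setup with one linear SVM per aspect, assuming scikit-learn; the loader `load_vlsp2018_reviews`, the feature choice, and the sample review are assumptions rather than the authors' exact feature set.

```python
# Sketch of aspect detection as multi-label classification: one binary linear SVM per aspect
# over word n-gram TF-IDF features. Loader, features and sample text are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

reviews, aspect_sets = load_vlsp2018_reviews("restaurant")   # hypothetical loader

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(aspect_sets)                           # one column per aspect label

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),                     # word uni/bi-gram features
    OneVsRestClassifier(LinearSVC(C=1.0, max_iter=5000)),
)
model.fit(reviews, Y)

# Toy Vietnamese review: "The pho is good but the service is slow."
predicted = mlb.inverse_transform(model.predict(["Phở ngon nhưng phục vụ chậm."]))
print(predicted)
```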


2020 ◽  
Author(s):  
Ghada Alfattni ◽  
Maksim Belousov ◽  
Niels Peek ◽  
Goran Nenadic

BACKGROUND As drug prescriptions are often recorded in free-text clinical narratives, extracting such information is important to support complex health-related tasks. Several natural language processing (NLP) methods have been proposed to extract such information, but still with limited performance. OBJECTIVE This paper describes DrugEx, a system that extracts drugs and their attributes from clinical free-text notes. The study aims to evaluate the feasibility of using NLP and deep learning approaches for extracting and linking drug-associated attributes. It also presents an extensive error analysis of the different methods. This effort was part of the 2018 National NLP Clinical Challenges (n2c2) Shared Task on Adverse Drug Events and Medication Extraction. METHODS The proposed method (DrugEx) consists of a named entity recogniser (NER) to identify drugs and associated attributes, and a relation extraction (RE) component to identify relations between them. For the NER, we explored deep learning-based approaches (i.e. Bi-LSTM-CRFs) with various embeddings (i.e. word, character and semantic-feature embeddings) in order to investigate how different embeddings influence performance. For RE, a rule-based method was implemented and compared with a position-aware LSTM model. The methods were trained and evaluated using the 2018 n2c2 shared-task data. RESULTS Experiments showed that the best model (Bi-LSTM-CRFs with word and character embeddings) achieved lenient micro F-scores of 0.921 for NER, 0.927 for RE and 0.855 for the end-to-end system. CONCLUSIONS The proposed end-to-end system achieves encouraging results and demonstrates the feasibility of using deep learning methods for extracting medication information from free-text data.
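A minimal Bi-LSTM-CRF tagger sketch in PyTorch, assuming the third-party pytorch-crf package for the CRF layer; dimensions and the tag set are illustrative, and the character and semantic-feature embeddings of the full DrugEx system are omitted.

```python
# Minimal Bi-LSTM-CRF sequence tagger: word embeddings -> BiLSTM -> per-tag emissions -> CRF.
# Sizes, the tag set, and the toy batch are illustrative assumptions.
import torch
import torch.nn as nn
from torchcrf import CRF          # third-party package: pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim // 2, batch_first=True, bidirectional=True)
        self.to_tags = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def emissions(self, token_ids):
        out, _ = self.lstm(self.embed(token_ids))
        return self.to_tags(out)

    def loss(self, token_ids, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF
        return -self.crf(self.emissions(token_ids), tags, mask=mask, reduction="mean")

    def predict(self, token_ids, mask):
        # Viterbi decoding of the best tag sequence per sentence
        return self.crf.decode(self.emissions(token_ids), mask=mask)

model = BiLSTMCRF(vocab_size=20_000, num_tags=9)       # e.g. BIO tags for drug attributes
tokens = torch.randint(1, 20_000, (2, 12))             # toy batch: 2 sentences, 12 tokens
tags = torch.randint(0, 9, (2, 12))
mask = torch.ones(2, 12, dtype=torch.bool)
print(model.loss(tokens, tags, mask))
```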


2020 ◽  
Vol 41 (Supplement_2) ◽  
Author(s):  
P Brekke ◽  
I Pilan ◽  
H Husby ◽  
T Gundersen ◽  
F.A Dahl ◽  
...  

Abstract Background Syncope is a commonly occurring presenting symptom in emergency departments. While the majority of episodes are benign, syncope is associated with a worse prognosis in hypertrophic cardiomyopathy, arrhythmia syndromes, heart failure, aortic stenosis and coronary heart disease. Flagging documented syncope in these patients may be crucial to management decisions. Previous studies show that the International Classification of Diseases (ICD) codes for syncope have a sensitivity of around 0.63, leading to a large number of false negatives if patient identification is based on administrative codes. Thus, in order to provide data-driven clinical decision support, and to improve identification of patient cohorts for research, better tools are needed. A recent study manually annotated more than 30,000 patient records in order to develop a natural language processing (NLP) tool, which achieved a sensitivity of 92.2%. Since access to medical records and annotation resources is limited, we aimed to investigate whether an unsupervised machine learning and NLP approach with no manual input could achieve similar performance. Methods Our data consisted of admission notes for adult patients admitted between 2005 and 2016 at a large university hospital in Norway. 500 records from patients with, and 500 without, an "R55 Syncope" ICD code at discharge were drawn at random. The R55 code was considered the "ground truth". Headers containing information about tentative diagnoses were removed from the notes, when present, using regular expressions. The dataset was divided into 70%/15%/15% subsets for training, validation and testing. Baseline identification was performed by simple lexical matching on the term "synkope". We evaluated two linear classifiers, a Support Vector Machine (SVM) and a Linear Regression (LR) model, with a term frequency–inverse document frequency vectorizer, using a bag-of-words approach. In addition, we evaluated a simple convolutional neural network (CNN) consisting of a convolutional layer concatenating filter sizes of 3–5, max pooling and a dropout of 0.5, with randomly initialised word embeddings of 300 dimensions. Results Even a baseline regular expression model achieved a sensitivity of 78% and a specificity of 91% when classifying admission notes as belonging to the syncope class or not. The SVM model and the LR model achieved a sensitivity of 91% and 89%, respectively, and a specificity of 89% and 91%. The CNN model had a sensitivity of 95% and a specificity of 84%. Conclusion With a limited non-English dataset, common NLP and machine learning approaches were able to achieve approximately 90–95% sensitivity for the identification of admission notes related to syncope. Linear classifiers outperformed the CNN model in terms of specificity, as expected with this small dataset. The study demonstrates the feasibility of training document classifiers based on diagnostic codes in order to detect important clinical events.
Figure: ROC curves for SVM and LR models.
Funding Acknowledgement: Type of funding source: Public grant(s) – National budget only. Main funding source(s): The Research Council of Norway.
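The CNN is described precisely enough to sketch: randomly initialised 300-dimensional embeddings, parallel convolutions with filter sizes 3 to 5 concatenated, max-over-time pooling, and dropout of 0.5. The vocabulary size and filter counts below are assumptions.

```python
# Sketch of the described CNN document classifier: 300-d random embeddings,
# parallel convolutions (filter sizes 3-5) concatenated, max-over-time pooling, 0.5 dropout.
import torch
import torch.nn as nn

class SyncopeCNN(nn.Module):
    def __init__(self, vocab_size=30_000, emb_dim=300, n_filters=100, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, kernel_size=k) for k in (3, 4, 5)]
        )
        self.dropout = nn.Dropout(0.5)
        self.out = nn.Linear(3 * n_filters, n_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)          # (batch, emb_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.out(self.dropout(torch.cat(pooled, dim=1)))

model = SyncopeCNN()
logits = model(torch.randint(1, 30_000, (4, 200)))         # toy batch: 4 notes, 200 tokens each
print(logits.shape)                                        # torch.Size([4, 2])
```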


2018 ◽  
Vol 2018 ◽  
pp. 1-14 ◽  
Author(s):  
Nouar AlDahoul ◽  
Aznul Qalid Md Sabri ◽  
Ali Mohammed Mansoor

Human detection in videos plays an important role in various real-life applications. Most traditional approaches depend on utilizing handcrafted features, which are problem-dependent and optimal only for specific tasks. Moreover, they are highly susceptible to dynamical events such as illumination changes, camera jitter, and variations in object sizes. On the other hand, feature learning approaches are cheaper and easier because highly abstract and discriminative features can be produced automatically without the need for expert knowledge. In this paper, we utilize automatic feature learning methods that combine optical flow and three different deep models (i.e., a supervised convolutional neural network (S-CNN), a pretrained CNN feature extractor, and a hierarchical extreme learning machine (H-ELM)) for human detection in videos captured using a nonstatic camera on an aerial platform with varying altitudes. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. The comparison between these models in terms of training and testing accuracy and learning speed is analyzed. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrated that the proposed methods are successful for the human detection task. The pretrained CNN produces an average accuracy of 98.09%. S-CNN produces an average accuracy of 95.6% with soft-max and 91.7% with Support Vector Machines (SVM). H-ELM has an average accuracy of 95.9%. Using a normal Central Processing Unit (CPU), H-ELM training takes 445 seconds. Learning in S-CNN takes 770 seconds with a high-performance Graphical Processing Unit (GPU).
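A sketch of the pretrained-CNN-feature-extractor variant: frames pass through a frozen ImageNet backbone and the pooled features are classified with an SVM. The ResNet-18 backbone, the frame loader, and the use of single RGB frames rather than optical-flow inputs are assumptions (and the weights API assumes a recent torchvision).

```python
# Sketch: frozen pretrained CNN as a feature extractor, followed by an SVM classifier.
# Backbone choice, frame loader, and single-frame input are illustrative assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()                 # keep the 512-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(frames):                     # frames: list of PIL images
    with torch.no_grad():
        batch = torch.stack([preprocess(f) for f in frames])
        return backbone(batch).numpy()

train_frames, train_labels = load_ucf_arg_frames("train")   # hypothetical loader
svm = SVC(kernel="rbf", C=10.0).fit(extract_features(train_frames), train_labels)
```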

