Forward Context-Aware Clickbait Tweet Identification System

2021 ◽  
Vol 12 (2) ◽  
pp. 21-32
Author(s):  
Rajesh Kumar Mundotiya ◽  
Naina Yadav

Clickbait is an elusive challenge that has grown with the prevalence of social media platforms such as Facebook and Twitter, misleading readers who click on headlines. Limited annotated data makes it onerous to design an accurate clickbait identification system. The authors address this problem by proposing a deep learning-based architecture with external knowledge, trained on social media posts and descriptions. Pre-trained ELMo and BERT models provide sentence-level contextual features as knowledge, while an LSTM layer captures word-level contextual features. Training was carried out in separate experiments (a model with ELMo and a model with BERT) with different regularization techniques such as dropout, early stopping, and fine-tuning. The forward context-aware clickbait tweet identification system (FCCTI), combining BERT fine-tuning with an ELMo model using GloVe pre-trained embeddings, is the best model, achieving a clickbait identification accuracy of 0.847 and improving on the previous baseline for this task.
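
A minimal sketch (PyTorch) of the kind of architecture this abstract describes: a sentence-level contextual vector from a pre-trained BERT encoder fused with word-level features from an LSTM over static GloVe-style embeddings. The layer sizes, the fusion by concatenation, and all names below are illustrative assumptions, not the authors' published configuration.

```python
import torch
import torch.nn as nn

class FCCTISketch(nn.Module):
    def __init__(self, vocab_size=30000, word_dim=300, bert_dim=768, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)  # init from GloVe in practice
        self.lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden + bert_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),                                # dropout regularization
            nn.Linear(hidden, 1),
        )

    def forward(self, token_ids, bert_sentence_vec):
        # word-level contextual features from the LSTM's final hidden state
        _, (h_n, _) = self.lstm(self.word_emb(token_ids))
        # fuse with the precomputed sentence-level contextual vector
        fused = torch.cat([h_n[-1], bert_sentence_vec], dim=-1)
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)

model = FCCTISketch()
tokens = torch.randint(0, 30000, (2, 20))   # a toy batch of 2 tweets, 20 tokens each
bert_vec = torch.randn(2, 768)              # e.g., BERT [CLS] sentence embeddings
print(model(tokens, bert_vec).shape)        # torch.Size([2]) clickbait probabilities
```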

2021 ◽  
Author(s):  
Xinghao Yang ◽  
Yongshun Gong ◽  
Weifeng Liu ◽  
James Bailey ◽  
Tianqing Zhu ◽  
...  

Deep learning models are known to be immensely brittle to adversarial image examples, yet their vulnerability in text classification is insufficiently explored. Existing text adversarial attack strategies can be roughly divided into three categories, i.e., character-level, word-level, and sentence-level attacks. Despite the success of recent text attack methods, inducing misclassification with minimal text modifications while simultaneously preserving lexical correctness, syntactic soundness, and semantic consistency remains a challenge. To examine the vulnerability of deep models, we devise a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) approach, which attacks text documents not only at the unigram word level but also at the bigram level to avoid generating meaningless sentences. We also present a hybrid attack strategy that collects substitution words from both synonym and sememe candidates to enrich the potential candidate set. In addition, a Semantic Preservation Optimization (SPO) method is devised to determine the word substitution priority and reduce the perturbation cost. Furthermore, we constrain the SPO with a semantic filter (dubbed SPOF) to improve the semantic similarity between the input text and the adversarial example. To evaluate the effectiveness of our proposed methods, BU-SPO and BU-SPOF, we attack four victim deep learning models trained on three real-world text datasets. Experimental results demonstrate that our approaches achieve the highest semantic consistency and attack success rates while making minimal word modifications compared with competing methods.
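
An illustrative skeleton of a greedy word-substitution attack in the spirit of this line of work: build a candidate set per position (a stub standing in for the synonym + sememe lookup), rank substitutions by how much each one moves the victim model, and stop at the first misclassification. The real SPO optimization and the SPOF semantic filter are more involved; this shows only the control flow, with `victim_prob` and `candidates_for` as assumed helper callables.

```python
def greedy_substitution_attack(words, true_label, victim_prob, candidates_for,
                               max_changes=5):
    words = list(words)
    for _ in range(max_changes):
        best = None  # (probability drop, position, substitute word)
        base = victim_prob(words, true_label)
        for i, w in enumerate(words):
            for cand in candidates_for(w):           # synonym / sememe candidates
                trial = words[:i] + [cand] + words[i + 1:]
                drop = base - victim_prob(trial, true_label)
                if best is None or drop > best[0]:
                    best = (drop, i, cand)
        if best is None or best[0] <= 0:
            break                                    # no substitution helps
        _, i, cand = best
        words[i] = cand                              # apply highest-priority change
        if victim_prob(words, true_label) < 0.5:     # victim no longer confident
            return words                             # adversarial example found
    return None

# Toy demo: a "victim" whose positive score hinges on the word "good".
demo_prob = lambda ws, y: 0.9 if "good" in ws else 0.3
demo_cands = lambda w: ["fine", "nice"] if w == "good" else []
print(greedy_substitution_attack("a good movie".split(), 1, demo_prob, demo_cands))
```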


Author(s):  
Marina Paolanti ◽  
Adriano Mancini ◽  
Emanuele Frontoni ◽  
Andrea Felicetti ◽  
Luca Marinelli ◽  
...  

Abstract Sentiment analysis on social media such as Twitter is a challenging task given data characteristics such as message length, spelling errors, abbreviations, and special characters. Social media sentiment analysis is also a fundamental issue with many applications. This is of particular relevance to the tourism sector, where the characterization of visitor fluxes is a vital issue and sources of geotagged information have already proven promising for tourism-related geographic research. The paper introduces an approach to estimate the sentiment related to Cilento, a well-known tourist destination in Southern Italy. A newly collected dataset of tourism-related tweets forms the basis of our method. We aim to demonstrate and test a deep learning social geodata framework that characterizes spatial, temporal, and demographic tourist flows across the vast territory of this rural tourist region and along its coasts. We applied four specially trained deep neural networks, two word-level and two character-based, to identify and assess sentiment. In contrast to many existing datasets, the actual sentiment carried by texts or hashtags is not automatically assessed in our approach: we manually annotated the whole set to obtain higher dataset quality in terms of accuracy, proving the effectiveness of our method. Moreover, the geographical coding labelling each item of information allows the inferred sentiments to be matched with their geographical locations, yielding an even more nuanced content analysis of the semantic meaning.
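
A minimal character-level sentiment classifier sketch (PyTorch) of the kind the abstract contrasts with word-level networks: characters are embedded and pooled by a 1D convolution, which tolerates the spelling errors and abbreviations typical of tweets. The sizes and the 3-class output are assumptions, not the paper's models.

```python
import torch
import torch.nn as nn

class CharCNNSketch(nn.Module):
    def __init__(self, n_chars=128, emb=32, filters=64, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb)
        self.conv = nn.Conv1d(emb, filters, kernel_size=5, padding=2)
        self.out = nn.Linear(filters, n_classes)   # negative / neutral / positive

    def forward(self, char_ids):                    # (batch, seq_len)
        x = self.emb(char_ids).transpose(1, 2)      # (batch, emb, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values  # max-pool over time
        return self.out(x)

tweet = torch.tensor([[ord(c) for c in "cilento e bellissimo!"]])
print(CharCNNSketch()(tweet).softmax(-1))           # class probabilities
```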


2020 ◽  
Vol 2020 ◽  
pp. 1-10 ◽  
Author(s):  
Hanqian Wu ◽  
Mumu Liu ◽  
Shangbin Zhang ◽  
Zhike Wang ◽  
Siliang Cheng

Online product reviews are proliferating on e-commerce platforms, and mining the aspect-level product information contained in those reviews has great economic benefit. Aspect category classification is a basic task of aspect-level sentiment analysis, which has become a hot research topic in the natural language processing (NLP) field over the last decades. On various e-commerce platforms, user-generated question-answering (QA) reviews have emerged that generally contain much aspect-related product information. Although some researchers have devoted their efforts to aspect category classification for traditional product reviews, existing deep learning-based approaches cannot be readily applied to representing QA-style reviews. Thus, we propose a four-dimensional (4D) textual representation model built on QA interaction-level and hyperinteraction-level representations, modeling text at four levels: word level, sentence level, QA interaction level, and hyperinteraction level. In our experiments, empirical studies on datasets from three domains demonstrate that our proposals perform better than traditional sentence-level representation approaches, especially in the Digit domain.
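
A hedged sketch of the layered idea behind such a 4D representation: word vectors are pooled into sentence vectors, question and answer sentence vectors are matched into an interaction level, and the interactions are pooled again into a single "hyperinteraction" feature for aspect-category classification. The pooling and matching choices below are illustrative stand-ins, not the paper's model.

```python
import torch

def sentence_vec(word_vecs):          # word level -> sentence level
    return word_vecs.mean(dim=0)

def qa_interaction(q_sents, a_sents): # sentence level -> QA interaction level
    # one interaction vector per (question sentence, answer sentence) pair
    return torch.stack([torch.cat([q, a, q * a])
                        for q in q_sents for a in a_sents])

def hyperinteraction(inter):          # interaction level -> hyperinteraction level
    return inter.max(dim=0).values    # keep the strongest interaction features

dim = 50
q = [sentence_vec(torch.randn(8, dim)) for _ in range(2)]   # 2 question sentences
a = [sentence_vec(torch.randn(12, dim)) for _ in range(3)]  # 3 answer sentences
feat = hyperinteraction(qa_interaction(q, a))
print(feat.shape)   # torch.Size([150]); fed to an aspect-category classifier
```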


Minerals ◽  
2020 ◽  
Vol 10 (9) ◽  
pp. 809 ◽  
Author(s):  
Natsuo Okada ◽  
Yohei Maekawa ◽  
Narihiro Owada ◽  
Kazutoshi Haga ◽  
Atsushi Shibayama ◽  
...  

In mining operations, an ore is separated into its constituents through mineral processing methods such as flotation. Identifying the types of minerals contained in the ore in advance greatly aids faster and more efficient mineral processing. The human eye can recognize visual information in three wavelength regions: red, green, and blue. With hyperspectral imaging, high-resolution spectral data containing information from the visible light region to the near-infrared region can be obtained. Using deep learning, the features of the hyperspectral data can be extracted and learned, and the spectral pattern that is unique to each mineral can be identified and analyzed. In this paper, we propose an automatic mineral identification system that can identify mineral types before the mineral processing stage by combining hyperspectral imaging and deep learning. Using this technique, it is possible to quickly identify the types of minerals contained in rocks with a non-destructive method. In our experiments, deep learning applied to red, green, and blue (RGB) images of the minerals achieved an identification accuracy of approximately 30%, while deep learning analysis of the hyperspectral data identified the mineral species with a high accuracy of over 90%.
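
A minimal sketch of per-pixel mineral classification from a hyperspectral spectrum with a 1D CNN (PyTorch). The band count, layer sizes, and number of mineral classes are assumptions; the point is that the network learns patterns along the spectral axis rather than from 3-channel RGB values.

```python
import torch
import torch.nn as nn

n_bands, n_minerals = 204, 6          # e.g., VNIR band count; both values assumed

model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=7, padding=3),  # local spectral patterns
    nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Conv1d(16, 32, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                     # pool over the spectral axis
    nn.Flatten(),
    nn.Linear(32, n_minerals),
)

spectra = torch.randn(4, 1, n_bands)  # 4 pixels, 1 "channel", 204 spectral bands
print(model(spectra).argmax(dim=1))   # predicted mineral class per pixel
```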


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Xiang Li

To address the colloquial, irregular, and diverse character of German social media texts, this paper proposes a multilevel feature representation method that combines word-level features, such as German morphology and slang, with sentence-level features, such as special symbols and English-translated sentiment information, and builds a deep learning model for German sentiment classification based on the self-attention mechanism. Compared with existing studies, this model not only yields the clearest improvement but also has better feature extraction and classification ability for German sentiment.
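
A hedged sketch of fusing word-level and sentence-level features through self-attention (PyTorch): word embeddings pass through multi-head self-attention, then a sentence-level feature vector (standing in for the special-symbol and translated-sentiment cues the abstract mentions) is concatenated before the classifier. All dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MultilevelSentiment(nn.Module):
    def __init__(self, vocab=20000, dim=128, sent_feats=8, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim + sent_feats, n_classes)

    def forward(self, ids, sent_level):
        x = self.emb(ids)
        x, _ = self.attn(x, x, x)              # self-attention over word features
        x = x.mean(dim=1)                      # pool to one text vector
        return self.out(torch.cat([x, sent_level], dim=-1))

m = MultilevelSentiment()
ids = torch.randint(0, 20000, (2, 30))         # a toy batch of 2 posts
print(m(ids, torch.randn(2, 8)).shape)         # torch.Size([2, 2]) class logits
```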


2019 ◽  
Vol 28 (3) ◽  
pp. 399-408 ◽  
Author(s):  
Anupam Jamatia ◽  
Amitava Das ◽  
Björn Gambäck

Abstract This article addresses language identification at the word level in Indian social media corpora taken from Facebook, Twitter and WhatsApp posts that exhibit code-mixing between English-Hindi, English-Bengali, as well as a blend of both language pairs. Code-mixing is a fusion of multiple languages previously associated mainly with spoken language, but which social media users also deploy when communicating in rather casual registers. The coarse nature of code-mixed social media text makes language identification challenging. Here, the performance of deep learning on this task is compared to feature-based learning, with two Recurrent Neural Network techniques, Long Short-Term Memory (LSTM) and bidirectional LSTM, contrasted with a Conditional Random Fields (CRF) classifier. The results show the deep learners outperforming the CRF, with the bidirectional LSTM demonstrating the best language identification performance.
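
A minimal BiLSTM token-tagger sketch (PyTorch) for word-level language identification in code-mixed text: each token receives a label such as English, Hindi, or Bengali. The tag inventory and sizes are assumptions; the CRF baseline in the article would instead score whole label sequences over hand-built features.

```python
import torch
import torch.nn as nn

tags = ["en", "hi", "bn", "other"]              # assumed tag inventory

class LangIDTagger(nn.Module):
    def __init__(self, vocab=20000, dim=100, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.bilstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, len(tags))  # per-token tag scores

    def forward(self, ids):
        h, _ = self.bilstm(self.emb(ids))       # contextual vector per token
        return self.out(h)                      # (batch, seq_len, n_tags)

tagger = LangIDTagger()
ids = torch.randint(0, 20000, (1, 6))           # one 6-token code-mixed post
print(tagger(ids).argmax(-1))                   # a language tag per token
```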


2019 ◽  
Author(s):  
Joseph Tassone ◽  
Peizhi Yan ◽  
Mackenzie Simpson ◽  
Chetan Mendhe ◽  
Vijay Mago ◽  
...  

BACKGROUND The collection and examination of social media has become a useful mechanism for studying the mental activity and behavior tendencies of users. OBJECTIVE Through the analysis of a collected set of Twitter data, a model will be developed for predicting positively referenced, drug-related tweets. From this, trends and correlations can be determined. METHODS Twitter tweets and attribute data were collected and processed using topic-pertaining keywords, such as drug slang and use conditions (methods of drug consumption). Potential candidates were preprocessed, resulting in a dataset of 3,696,150 rows. The predictive classification power of multiple methods was compared, including logistic regression, decision trees, and CNN-based classifiers. For the latter, a deep learning approach was implemented to screen and analyze the semantic meaning of the tweets. RESULTS The logistic regression and decision tree models utilized 12,142 data points for training and 1,041 data points for testing. The logistic regression models displayed accuracies of 54.56% and 57.44%, respectively, with an AUC of 0.58. The decision tree improved on this, with an accuracy of 63.40% and an AUC of 0.68. All these values implied low predictive capability with little to no discrimination. Conversely, the two CNN-based classifiers tested presented a substantial improvement. The first was trained with 2,661 manually labeled samples, while the other also included synthetically generated tweets, culminating in 12,142 samples. Their accuracy scores were 76.35% and 82.31%, with AUCs of 0.90 and 0.91. Using association rule mining in conjunction with the CNN-based classifier showed a high likelihood of keywords such as “smoke”, “cocaine”, and “marijuana” triggering a drug-positive classification. CONCLUSIONS Predictive analysis without a CNN is limited and possibly fruitless. Attribute-based models presented little predictive capability and were not suitable for analyzing this type of data. The semantic meaning of the tweets needed to be utilized, giving the CNN-based classifier an advantage over the other solutions. Additionally, commonly mentioned drugs corresponded with frequently used illicit substances, demonstrating the practical usefulness of this system. Lastly, the synthetically augmented set produced higher scores, improving the predictive capability. CLINICALTRIAL None
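
A small sketch of the association-rule idea the abstract pairs with the CNN: for each keyword, estimate the confidence of the rule "keyword in tweet -> drug-positive label" from labeled (or CNN-labeled) tweets. The toy data below is invented purely to show the computation.

```python
tweets = [
    ("gonna smoke tonight", 1),
    ("smoke from the wildfire", 0),
    ("cocaine is destroying him", 1),
    ("love this marijuana playlist", 1),
]

def rule_confidence(keyword, labeled_tweets):
    # confidence of the rule: keyword present -> drug-positive label
    hits = [label for text, label in labeled_tweets if keyword in text.split()]
    return sum(hits) / len(hits) if hits else 0.0   # P(positive | keyword)

for kw in ("smoke", "cocaine", "marijuana"):
    print(kw, rule_confidence(kw, tweets))
```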


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB, and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four of the 26 pre-processing techniques improve classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first dataset, but significantly worse on the second dataset.
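
A hedged example of the kind of Arabic tweet pre-processing such studies evaluate. The paper's specific 26 techniques are not reproduced here; the three below (stripping diacritics, removing tatweel elongation, normalizing alef variants) are common choices shown purely for illustration.

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")   # fathatan .. sukun
TATWEEL = "\u0640"

def normalize_arabic(text):
    text = DIACRITICS.sub("", text)           # strip short-vowel diacritics
    text = text.replace(TATWEEL, "")          # remove elongation character
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> alef
    return text

print(normalize_arabic("الصِّحَّةُ"))          # -> الصحة
```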

