Topic-BERT: Detecting harmful information from social media

2021 ◽  
pp. 1-10
Author(s):  
Wang Gao ◽  
Hongtao Deng ◽  
Xun Zhu ◽  
Yuan Fang

Harmful information identification is a critical research topic in natural language processing. Existing approaches focus either on rule-based methods or on identifying harmful text in normal (long) documents. In this paper, we propose a BERT-based model, called Topic-BERT, to identify harmful information in social media. Firstly, Topic-BERT feeds additional information into BERT as input to alleviate the sparseness of short texts. The GPU-DMM topic model is used to capture the hidden topics of short texts for attention weight calculation. Secondly, the proposed model divides harmful short-text identification into two stages, in which labels of different granularity are identified by two similar sub-models. Finally, we conduct extensive experiments on a real-world social media dataset to evaluate our model. Experimental results demonstrate that our model significantly improves classification performance compared with baseline methods.
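
The fusion step can be pictured as a small attention module. Below is a minimal PyTorch sketch, not the authors' code, of using a precomputed document-topic distribution (standing in for GPU-DMM output) as a query that attends over BERT token states; all layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TopicAttentionFusion(nn.Module):
    """Pools BERT token states with attention weights driven by a topic vector."""
    def __init__(self, hidden_size=768, num_topics=50):
        super().__init__()
        self.topic_proj = nn.Linear(num_topics, hidden_size)  # map topics into BERT space

    def forward(self, token_states, topic_dist, attention_mask):
        # token_states: (batch, seq_len, hidden) from BERT's last layer
        # topic_dist:   (batch, num_topics) document-topic probabilities (e.g., GPU-DMM)
        query = self.topic_proj(topic_dist).unsqueeze(1)        # (batch, 1, hidden)
        scores = (token_states * query).sum(-1)                 # dot-product attention
        scores = scores.masked_fill(attention_mask == 0, -1e9)  # ignore padding
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)   # (batch, seq_len, 1)
        return (weights * token_states).sum(1)                  # topic-aware pooled vector

fusion = TopicAttentionFusion()
pooled = fusion(torch.randn(2, 12, 768), torch.rand(2, 50), torch.ones(2, 12))
print(pooled.shape)  # torch.Size([2, 768])
```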

2021 ◽  
Vol 10 (7) ◽  
pp. 474
Author(s):  
Bingqing Wang ◽  
Bin Meng ◽  
Juan Wang ◽  
Siyu Chen ◽  
Jian Liu

Social media data contains information expressed in real time, including text and geographical location. As a new data source for crowd behavior research in the era of big data, it can reflect aspects of residents' behavior. In this study, a text classification model based on BERT and the Transformers framework was constructed and used to classify and extract more than 210,000 residents' festival activities from the 1.13 million Sina Weibo (Chinese "Twitter") posts collected in Beijing in 2019. On this basis, word frequency statistics, part-of-speech analysis, topic modeling, sentiment analysis and other methods were used to perceive different types of festival activities and to quantitatively analyze the spatial differences between different types of festivals. The results show that traditional culture significantly influences residents' festivals, reflecting residents' motivation to participate in festivals as well as how they participate and express their emotions. There are apparent spatial differences in residents' participation in festival activities: the main festival activities are concentrated in the central area within the Fifth Ring Road of Beijing, whereas expressions of feeling during festivals are mainly distributed outside the Fifth Ring Road. The research integrates natural language processing technology, topic model analysis, spatial statistical analysis, and other techniques. It also broadens the application field of social media data, especially text data, providing a new research paradigm for studying residents' festival activities and adding residents' perception of festivals. The research results provide a basis for the design and management of the Chinese festival system.
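
For readers unfamiliar with the classification step, the following is a hedged sketch of the kind of BERT fine-tuning pipeline the study describes, using the Hugging Face Transformers library; the checkpoint bert-base-chinese and the label set are illustrative choices, not details confirmed by the paper.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

LABELS = ["festival_activity", "other"]  # hypothetical label set

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(LABELS))

def classify(posts):
    batch = tokenizer(posts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return [LABELS[i] for i in logits.argmax(-1).tolist()]

# Example Weibo-style post: "Admiring the moon and eating mooncakes with family."
print(classify(["中秋节和家人一起赏月吃月饼"]))
```

Note that the classification head here is untrained; in the study the model would first be fine-tuned on labelled posts before extraction at scale.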


Information ◽  
2020 ◽  
Vol 11 (2) ◽  
pp. 79 ◽  
Author(s):  
Xiaoyu Han ◽  
Yue Zhang ◽  
Wenkai Zhang ◽  
Tinglei Huang

Relation extraction is a vital task in natural language processing. It aims to identify the relationship between two specified entities in a sentence. Besides the information contained in the sentence, additional information about the entities has been verified to be helpful for relation extraction. However, additional information such as entity types obtained by NER (Named Entity Recognition) or descriptions provided by a knowledge base has its limitations. Nevertheless, there exists another way to provide additional information that can overcome these limitations in Chinese relation extraction: since Chinese characters usually have explicit meanings and can carry more information than English letters, we suggest that the characters that constitute the entities can provide additional information helpful for the relation extraction task, especially on large-scale datasets. This assumption has never been verified before, the main obstacle being the lack of large-scale Chinese relation datasets. In this paper, we first generate a large-scale Chinese relation extraction dataset based on a Chinese encyclopedia. Second, we propose an attention-based model using the characters that compose the entities. The results on the generated dataset show that these characters provide useful information for the Chinese relation extraction task. Using this information, the attention mechanism can recognize the crucial part of the sentence that expresses the relation. The proposed model outperforms the other baseline models on our Chinese relation extraction dataset.
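
As a rough illustration of the idea (assumptions only, not the paper's architecture), the sketch below lets a pooled embedding of the characters inside the two entities act as an attention query over sentence encoder states.

```python
import torch
import torch.nn as nn

class EntityCharAttention(nn.Module):
    """Entity-character query attending over sentence encoder states."""
    def __init__(self, hidden=128, char_vocab=6000):
        super().__init__()
        # character embedding dim equals hidden so dot-product attention is defined
        self.char_emb = nn.Embedding(char_vocab, hidden)

    def forward(self, sent_states, entity_char_ids):
        # sent_states:     (batch, seq_len, hidden), e.g. BiLSTM outputs
        # entity_char_ids: (batch, n_chars) ids of characters inside both entities
        query = self.char_emb(entity_char_ids).mean(dim=1, keepdim=True)
        weights = torch.softmax((sent_states * query).sum(-1), dim=-1)
        return (weights.unsqueeze(-1) * sent_states).sum(1)  # relation feature

model = EntityCharAttention()
feat = model(torch.randn(2, 20, 128), torch.randint(0, 6000, (2, 5)))
print(feat.shape)  # torch.Size([2, 128])
```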


2021 ◽  
Author(s):  
Abul Hasan ◽  
Mark Levene ◽  
David Weston ◽  
Renate Fromson ◽  
Nicolas Koslover ◽  
...  

BACKGROUND: The COVID-19 pandemic has created a pressing need to integrate information from disparate sources in order to assist decision makers. Social media is important in this respect; however, to make sense of the textual information it provides and to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect.

OBJECTIVE: This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity and prevalence of the disease.

METHODS: The text processing pipeline first extracts COVID-19 symptoms and related concepts, such as severity, duration, negations, and body parts, from patients' posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are used separately to build support vector machine models that triage patients into three categories and diagnose them for COVID-19.

RESULTS: We report macro- and micro-averaged F1 scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19 when the models are trained on human-labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using labels predicted by the concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. We also highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones.

CONCLUSIONS: Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, providing additional information on the severity and prevalence of the disease through the eyes of social media.
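
A toy end-to-end sketch of this pipeline shape, under stated assumptions, is given below: sklearn-crfsuite tags symptom and severity concepts, and an SVM classifies the post from a bag of the extracted concepts. The feature template, tag set, and triage label names are hypothetical stand-ins, not the paper's.

```python
import sklearn_crfsuite
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

def feats(tokens, i):
    return {"w": tokens[i].lower(), "prev": tokens[i - 1].lower() if i else "<s>"}

# Toy stand-ins for annotated posts: BIO concept tags plus a triage label.
posts = [["I", "have", "a", "severe", "cough"],
         ["mild", "headache", "but", "feeling", "fine"]]
tags = [["O", "O", "O", "B-SEVERITY", "B-SYMPTOM"],
        ["B-SEVERITY", "B-SYMPTOM", "O", "O", "O"]]
triage = ["consult_doctor", "self_care"]  # hypothetical category names

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([[feats(p, i) for i in range(len(p))] for p in posts], tags)

# Stage 2: represent each post by its extracted concept tokens, then run an SVM.
def concept_text(post):
    pred = crf.predict([[feats(post, i) for i in range(len(post))]])[0]
    return " ".join(w for w, t in zip(post, pred) if t != "O")

vec = CountVectorizer()
X = vec.fit_transform([concept_text(p) for p in posts])
svm = SVC(kernel="linear").fit(X, triage)
print(svm.predict(vec.transform([concept_text(["severe", "cough", "today"])])))
```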


2018 ◽  
Vol 24 (2) ◽  
pp. 221-264 ◽  
Author(s):  
SABINE GRÜNDER-FAHRER ◽  
ANTJE SCHLAF ◽  
GREGOR WIEDEMANN ◽  
GERHARD HEYER

Social media are an emerging new paradigm in interdisciplinary research in crisis informatics. They bring many opportunities as well as challenges to all fields of application and research involved in using social media content for improved disaster management. Using the Central European flooding of 2013 as our case study, we optimize and apply methods from the fields of natural language processing and unsupervised machine learning to investigate the thematic and temporal structure of German social media communication. By means of topic model analysis, we investigate what kind of content was shared on social media during the event. On this basis, we furthermore investigate the development of topics over time and apply temporal clustering techniques to automatically identify characteristic phases of communication. From the results we want, first, to reveal properties of social media content and show what potential social media have for improving disaster management in Germany. Second, we are concerned with the methodological issue of finding and adapting natural language processing methods suitable for analysing social media data in order to obtain information relevant for disaster management. With respect to the first, application-oriented focal point, our study reveals the high potential of social media content in the factual, organizational and psychological dimensions of the disaster and during all stages of the disaster management life cycle. Interestingly, there appear to be systematic differences in thematic profile between the platforms Facebook and Twitter and between different stages of the event. In the context of our methodological investigation, we claim that when topic model analysis is combined with appropriate optimization techniques, it shows high applicability for thematic and temporal social media analysis in disaster management.
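
As a concrete, simplified illustration of the topic-over-time analysis (not the study's tuned setup), the sketch below fits a gensim LDA model on toy flood-related posts and tracks per-day average topic shares; large shifts between days would suggest phase boundaries.

```python
from gensim import corpora, models
import numpy as np

days = {
    "day1": ["water level rising evacuation ordered", "sandbags needed volunteers"],
    "day2": ["donations clothes shelter opened", "thank you helpers cleanup starts"],
}
texts = [doc.split() for docs in days.values() for doc in docs]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Average topic distribution per day; a jump marks a candidate phase boundary.
for day, docs in days.items():
    dists = [dict(lda.get_document_topics(dictionary.doc2bow(d.split()),
                                          minimum_probability=0)) for d in docs]
    mean = np.mean([[d.get(k, 0) for k in range(2)] for d in dists], axis=0)
    print(day, mean.round(2))
```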


2019 ◽  
Vol 17 (2) ◽  
pp. 241-249
Author(s):  
Yangyang Li ◽  
Bo Liu

Shortness and sparsity, together with synonyms and homonyms, are the main obstacles in short-text classification. In recent years, research on short-text classification has focused on expanding short texts but has rarely guaranteed the validity of the expanded words. This study proposes a new method to weaken these effects without external knowledge. The proposed method analyses short texts using a topic model based on Latent Dirichlet Allocation (LDA), represents each short text by a vector space model, and presents a new way to adjust the vectors of short texts. In the experiments, two open short-text data sets composed of Google News articles and web search snippets are used to evaluate the classification performance and demonstrate the effectiveness of our method.
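
The general adjustment idea can be sketched as follows; the re-weighting rule here is an illustrative assumption, not the authors' exact formula. We infer a short text's LDA topics, then boost each word's weight by how strongly the text's topics support that word.

```python
from gensim import corpora, models
import numpy as np

texts = [["apple", "releases", "new", "phone"],
         ["senate", "passes", "budget", "bill"],
         ["phone", "camera", "review"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
topic_word = lda.get_topics()  # (num_topics, vocab): p(word | topic)

def adjusted_vector(tokens):
    bow = dictionary.doc2bow(tokens)
    doc_topics = dict(lda.get_document_topics(bow, minimum_probability=0))
    vec = np.zeros(len(dictionary))
    for word_id, count in bow:
        # term weight scaled by how much the doc's topics support this word
        support = sum(p * topic_word[k, word_id] for k, p in doc_topics.items())
        vec[word_id] = count * (1 + support)
    return vec

print(adjusted_vector(["phone", "review"]).round(3))
```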


Information ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 312 ◽  
Author(s):  
Asma Baccouche ◽  
Sadaf Ahmed ◽  
Daniel Sierra-Sosa ◽  
Adel Elmaghraby

Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded in detecting spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, as these messages resemble real communications. From the Natural Language Processing (NLP) perspective, deep learning models are a good alternative for classifying text after preprocessing. In particular, Long Short-Term Memory (LSTM) networks are among the models that perform well on binary and multi-label text classification problems. In this paper, an approach merging two different data sources, one intended for spam in social media posts and the other for fraud classification in emails, is presented. We designed a multi-label LSTM model and trained it on the joint datasets, including text with common bigrams extracted from each independent dataset. The experimental results show that our proposed model is capable of identifying malicious text regardless of its source. The LSTM model trained on the merged dataset outperforms the models trained independently on each dataset.
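
A minimal Keras sketch of such a multi-label LSTM classifier is shown below. The two labels (spam, fraud) follow the paper's setting, while the vocabulary size, sequence length, and all hyperparameters are illustrative guesses, and random arrays stand in for the merged, bigram-filtered datasets.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, MAXLEN, NUM_LABELS = 20000, 100, 2  # labels: [spam, fraud]

model = tf.keras.Sequential([
    layers.Embedding(VOCAB, 128),
    layers.LSTM(64),
    layers.Dense(NUM_LABELS, activation="sigmoid"),  # independent per-label probs
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Random arrays stand in for the tokenized, merged training data.
X = np.random.randint(1, VOCAB, size=(32, MAXLEN))
y = np.random.randint(0, 2, size=(32, NUM_LABELS))
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X[:1], verbose=0).round(2))  # per-label probabilities
```

The sigmoid output with binary cross-entropy is what makes the model multi-label: each label probability is predicted independently, so a text can be flagged as both spam and fraud.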


Sensors ◽  
2019 ◽  
Vol 19 (17) ◽  
pp. 3728 ◽  
Author(s):  
Zhou ◽  
Wang ◽  
Sun ◽  
Sun

Text representation is one of the key tasks in the field of natural language processing (NLP). Traditional feature extraction and weighting methods often use the bag-of-words (BoW) model, which may lead to a lack of semantic information as well as problems of high dimensionality and high sparsity. At present, a popular way to address these problems is to utilize deep learning methods. In this paper, feature weighting, word embeddings, and topic models are combined to propose an unsupervised text representation method named the feature, probability, and word embedding method. The main idea is to use the word embedding technique Word2Vec to obtain word vectors and then combine them with feature-weighted TF-IDF and the topic model LDA. Compared with traditional feature engineering, the proposed method not only increases the expressive ability of the vector space model but also reduces the dimensionality of the document vector. Besides this, it can be used to address the insufficient information, high dimensionality, and high sparsity of BoW. We apply the proposed method to the task of text categorization and verify its validity.
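
Under simplifying assumptions, the combination can be sketched like this: a document vector built from TF-IDF-weighted Word2Vec embeddings, concatenated with the document's LDA topic distribution. Dimensions and corpora here are toy-sized, not the paper's settings.

```python
import numpy as np
from gensim import corpora
from gensim.models import Word2Vec, LdaModel
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [["cats", "purr", "softly"], ["dogs", "bark", "loudly"],
         ["cats", "and", "dogs", "play"]]
w2v = Word2Vec(texts, vector_size=20, min_count=1, seed=0)
tfidf = TfidfVectorizer(analyzer=lambda t: t).fit(texts)  # texts are pre-tokenized
dictionary = corpora.Dictionary(texts)
lda = LdaModel([dictionary.doc2bow(t) for t in texts], num_topics=2,
               id2word=dictionary, random_state=0)

def doc_vector(tokens):
    weights = dict(zip(tfidf.get_feature_names_out(),
                       tfidf.transform([tokens]).toarray()[0]))
    emb = np.mean([w2v.wv[w] * weights.get(w, 0) for w in tokens], axis=0)
    topics = np.array([p for _, p in lda.get_document_topics(
        dictionary.doc2bow(tokens), minimum_probability=0)])
    return np.concatenate([emb, topics])  # 20 + 2 dims instead of |V|

print(doc_vector(["cats", "play"]).shape)
```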


Author(s):  
Pankaj Gupta ◽  
Yatin Chaudhary ◽  
Florian Buettner ◽  
Hinrich Schütze

We address two challenges in topic models. (1) Context information around words helps in determining their actual meaning, e.g., "networks" used in the context of artificial neural networks vs. biological neuron networks. Generative topic models infer topic-word distributions while taking no or only little context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language-modeling fashion; the proposed model is named iDocNADE. (2) Due to the small number of word occurrences (i.e., lack of context) in short texts and data sparsity in corpora of few documents, applying topic models to such texts is challenging. Therefore, we propose a simple and efficient way of incorporating external knowledge into neural autoregressive topic models: we use word embeddings as a distributional prior. The proposed variants are named DocNADEe and iDocNADEe. We present novel neural autoregressive topic model variants that consistently outperform state-of-the-art generative topic models in terms of generalization, interpretability (topic coherence) and applicability (retrieval and classification) over 7 long-text and 8 short-text datasets from diverse domains.
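
A schematic numpy sketch of the embedding-prior idea follows (forward direction only, not the released DocNADE code): the autoregressive hidden state for position i accumulates learned topic-word columns plus fixed pretrained embeddings for the preceding words, with the mixing weight as a hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 100, 16                  # vocab size, hidden size (toy values)
W = rng.normal(size=(H, V))     # DocNADE word-to-hidden weights (learned)
E = rng.normal(size=(H, V))     # pretrained word embeddings (fixed prior)
c = np.zeros(H)                 # hidden bias
lam = 0.5                       # prior mixing weight (hyperparameter)

def hidden_states(doc):
    """h_i conditions only on the preceding words, as in p(v_i | v_<i)."""
    hs, acc = [], c.copy()
    for v in doc:
        hs.append(np.maximum(acc, 0))        # hidden state before seeing word v
        acc = acc + W[:, v] + lam * E[:, v]  # embedding-prior update
    return np.stack(hs)

print(hidden_states([3, 17, 42]).shape)  # one hidden state per position
```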


2020 ◽  
Vol 17 (12) ◽  
pp. 5477-5482
Author(s):  
Shaik Rahamat Basha ◽  
M. Surya Bhupal Rao ◽  
P. Kiran Kumar Reddy ◽  
G. Ravi Kumar

Online social media are a huge source of regular communication, since most people in the world today use these services to stay in contact with each other in their daily lives. This research addresses emotion recognition from text messages; the majority of such research uses machine learning methods. Natural language processing (NLP) techniques are used to extract information from text written by human beings. Human emotions may be expressed when reading or writing a message, and since human life is filled with a variety of emotions, people express many different emotions in text. This analysis helps us understand how social media researchers use text processing and text mining methods to classify key data themes. Our experiments performed text-based mining on the two main social networks in the world, Facebook and Twitter. In this proposed study, we categorized human feelings such as joy, fear, love, anger, surprise, sadness and thankfulness, and compared our results using various machine learning methods.
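
As a hedged illustration of the kind of comparison described, the sketch below trains several classic scikit-learn classifiers on TF-IDF features over a toy emotion-labelled sample; a real study would use a large labelled Facebook/Twitter corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["I am so happy today", "This is terrifying", "I love my family",
         "Why does this always happen", "Thank you so much everyone"]
labels = ["joy", "fear", "love", "anger", "thankfulness"]

X = TfidfVectorizer().fit_transform(texts)
for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X[:1]))  # sanity check on training data
```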


2021 ◽  
Author(s):  
Wei Wang ◽  
Bing Guo ◽  
Yan Shen ◽  
Han Yang ◽  
Yaosen Chen ◽  
...  

Recently, some statistical topic modeling approaches based on LDA have been applied to supervised document classification, where the model generation procedure incorporates prior knowledge to improve classification performance. However, these customizations of topic modeling are limited by the cumbersome derivation of a specific inference algorithm for each modification. In this paper, we propose a new supervised topic modeling approach for document classification, Neural Labeled LDA (NL-LDA), which builds on the VAE framework and designs a special generative network to incorporate prior information. The proposed model supports semi-supervised learning based on the manifold and low-density assumptions. Meanwhile, NL-LDA has a consistent and concise inference method for both semi-supervised learning and prediction. Quantitative experimental results demonstrate that our model delivers outstanding performance on supervised document classification relative to the compared approaches, including traditional statistical and neural topic models. Notably, the proposed model supports both single-label and multi-label document classification. NL-LDA also performs significantly well on semi-supervised classification, especially with small amounts of labeled data. Further comparisons with related work indicate that our model is competitive with state-of-the-art topic modeling approaches on semi-supervised classification.
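
The named ingredients can be pictured with a compact PyTorch sketch: a VAE whose decoder reconstructs a bag-of-words from a latent topic vector while a label head supervises those topics. The architecture below is an illustrative reduction, not the paper's exact generative network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedTopicVAE(nn.Module):
    def __init__(self, vocab=2000, topics=20, labels=5):
        super().__init__()
        self.enc = nn.Linear(vocab, 128)
        self.mu = nn.Linear(128, topics)
        self.logvar = nn.Linear(128, topics)
        self.dec = nn.Linear(topics, vocab)   # topic-to-word reconstruction
        self.clf = nn.Linear(topics, labels)  # label head supervising topics

    def forward(self, bow):
        h = F.relu(self.enc(bow))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        theta = torch.softmax(z, dim=-1)      # document-topic proportions
        return self.dec(theta), self.clf(theta), mu, logvar

def loss(recon, bow, logits, y, mu, logvar):
    nll = -(bow * F.log_softmax(recon, dim=-1)).sum(-1).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    # unlabeled documents would simply drop the supervised term below
    return nll + kl + F.cross_entropy(logits, y)
```

Dropping the cross-entropy term for unlabeled documents is what allows the same objective to cover the semi-supervised setting described above.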

