A Comparative Study of Word Embeddings for the Construction of a Social Media Expert Filter

2021, pp. 196-208
Author(s): Jose A. Diaz-Garcia, M. Dolores Ruiz, Maria J. Martin-Bautista

2021, Vol 8 (1)
Author(s): Yahya Albalawi, Jim Buckley, Nikola S. Nikolov

Abstract: This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and an accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best traditional classifier we trained is comparable to that of the deep learning methods on the first data set, but significantly worse on the second data set.
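
The embedding-plus-BLSTM setup described here follows a common pattern: load pre-trained word vectors, build an embedding matrix over the tweet vocabulary, and feed padded token sequences to a bidirectional LSTM. The following is a minimal sketch of that pattern, not the authors' code; the embedding file path (standing in for the Mazajak vectors), the sequence length, and the hyperparameters are assumptions.

```python
# Minimal sketch: pre-trained word vectors as the input layer of a BLSTM
# tweet classifier (binary: health-related vs. not). The embedding path,
# dimensions, and hyperparameters are illustrative assumptions.
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models

texts = ["tweet text 1", "tweet text 2"]   # pre-processed Arabic tweets
labels = np.array([1, 0])                  # 1 = health-related

# Tokenize the tweets and pad them to a fixed sequence length.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=50)

# Load pre-trained embeddings (hypothetical local path to CBOW vectors).
wv = KeyedVectors.load_word2vec_format("mazajak_cbow.bin", binary=True)
dim = wv.vector_size

# Build an embedding matrix aligned with the tokenizer's word index.
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, dim))
for word, idx in tokenizer.word_index.items():
    if word in wv:
        embedding_matrix[idx] = wv[word]

# BLSTM classifier with the frozen pre-trained embeddings as input layer.
model = models.Sequential([
    layers.Embedding(input_dim=embedding_matrix.shape[0], output_dim=dim,
                     weights=[embedding_matrix], trainable=False),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=3, batch_size=32)
```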


2021
Author(s): Hansi Hettiarachchi, Mariam Adedoyin-Olowe, Jagdev Bhogal, Mohamed Medhat Gaber

Abstract: Social media is becoming a primary medium to discuss what is happening around the world. Therefore, the data generated by social media platforms contain rich information which describes the ongoing events. Further, the timeliness associated with these data is capable of facilitating immediate insights. However, considering the dynamic nature and high volume of data production in social media data streams, it is impractical to filter the events manually, and therefore automated event detection mechanisms are invaluable to the community. Apart from a few notable exceptions, most previous research on automated event detection has focused only on statistical and syntactical features in the data and has lacked the involvement of underlying semantics, which are important for effective information retrieval from text since they represent the connections between words and their meanings. In this paper, we propose a novel method termed Embed2Detect for event detection in social media by combining the characteristics of word embeddings with hierarchical agglomerative clustering. The adoption of word embeddings gives Embed2Detect the capability to incorporate powerful semantic features into event detection and overcome a major limitation inherent in previous approaches. We evaluated our method on two recent real-world social media data sets, representing the sports and political domains, and also compared the results to several state-of-the-art methods. The obtained results show that Embed2Detect is capable of effective and efficient event detection and that it outperforms recent event detection methods. For the sports data set, Embed2Detect achieved a 27% higher F-measure than the best-performing baseline, and for the political data set, the increase was 29%.
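
Embed2Detect itself combines self-learned word embeddings with hierarchical agglomerative clustering plus change detection between time windows; the sketch below only illustrates the clustering building block on toy data, assuming gensim and SciPy, and is not the authors' implementation.

```python
# Simplified illustration of the clustering building block: learn word
# embeddings from a time window of tweets, then group vocabulary terms with
# hierarchical agglomerative clustering so that clusters of related terms can
# be inspected as candidate events. (The full Embed2Detect method adds
# window-to-window change detection on top of this.)
import numpy as np
from gensim.models import Word2Vec
from scipy.cluster.hierarchy import linkage, fcluster

# Tokenized tweets from one time window (toy example).
window_tweets = [
    ["goal", "messi", "scores", "amazing"],
    ["messi", "goal", "penalty"],
    ["election", "vote", "results"],
    ["vote", "count", "election", "night"],
]

# Learn window-specific word embeddings.
model = Word2Vec(window_tweets, vector_size=50, window=5, min_count=1, epochs=50)
vocab = list(model.wv.index_to_key)
vectors = np.array([model.wv[w] for w in vocab])

# Hierarchical agglomerative clustering: cosine distance, average linkage.
Z = linkage(vectors, method="average", metric="cosine")
cluster_ids = fcluster(Z, t=0.5, criterion="distance")

# Print candidate event clusters (groups of semantically related terms).
for cid in sorted(set(cluster_ids)):
    terms = [w for w, c in zip(vocab, cluster_ids) if c == cid]
    print(f"cluster {cid}: {terms}")
```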


Author(s): P. Tamije Selvy, V. Suriya Prakash, S. Shriram, N. Vimalesh

The number of social media users has increased rapidly in recent years, and both valuable and non-valuable information shared on social media can reach many people in a short period of time; the valuable information shared there can therefore be used for many types of analysis. In this paper, tweets shared about a disaster are collected and an alert system is built. The alert system notifies users after checking the received data against a centralized database. The paper also gives a comparative study of the algorithms used to extract data from social media, reporting the accuracy rates of the different algorithms that can be used for text mining.
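
The comparative part of such a study can be outlined with a standard TF-IDF pipeline. The sketch below is an assumed setup, not the authors' implementation: the classifier choices, the toy tweets, and the keyword set standing in for the centralized database are all illustrative.

```python
# Sketch of a comparative accuracy check for disaster-tweet classification,
# plus a naive alert check against a hypothetical centralized keyword store.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy labelled tweets: 1 = genuine disaster report, 0 = unrelated.
tweets = [
    "flood waters rising downtown",
    "earthquake felt near the coast",
    "massive wildfire spreading fast",
    "storm damage reported on the highway",
    "great movie night with friends",
    "new phone released today",
    "coffee and sunshine this morning",
]
labels = [1, 1, 1, 1, 0, 0, 0]

# Compare several classifiers on the same TF-IDF features.
for name, clf in [("NaiveBayes", MultinomialNB()),
                  ("SVM", LinearSVC()),
                  ("LogReg", LogisticRegression(max_iter=1000))]:
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, tweets, labels, cv=3, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.2f}")

# Naive alert step: flag a tweet if it matches known disaster keywords
# (a stand-in for the centralized database lookup described in the paper).
disaster_keywords = {"flood", "earthquake", "wildfire", "storm"}

def should_alert(tweet: str) -> bool:
    return any(word in disaster_keywords for word in tweet.lower().split())

print(should_alert("flood warning issued for the river valley"))  # True
```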


2021, Vol 12 (3), pp. 111-128
Author(s): Aljohara Fahad Al Saud

Identifying language affiliation among the children of immigrant families is crucial for their language identity. This study aimed to determine the role played by Arab families in the Kingdom of Saudi Arabia, Austria, and Britain in attaining language affiliation among their children. It also aimed to identify the challenges facing families living in these countries in achieving language affiliation among their children. The study population consisted of all the families living in the Kingdom of Saudi Arabia, in addition to all the Arab families living in Austria and Britain; the study sample included 120 parents. The researcher adopted the descriptive-analytical approach and used a questionnaire as the study tool. The study reached several results. First, the role played by families in the Kingdom of Saudi Arabia, Austria, and the United Kingdom in attaining language affiliation among their children received a high degree of response. Second, the challenges facing the activation of the family's role in attaining their children's language affiliation received a high degree of response in the Kingdom of Saudi Arabia and Austria, while in Britain they received a very high degree of response. The study recommended involving all family members in finding different and creative ways of practicing their native language and activating the role of social media in developing the language affiliation of children.


2020
Author(s): Mohammed Ibrahim, Susan Gauch, Omar Salman, Mohammed Alqahatani

BACKGROUND: Clear language makes communication easier between any two parties. A layman may have difficulty communicating with a professional due to not understanding the specialized terms common to the domain. In healthcare, it is rare to find a layman knowledgeable in medical jargon, which can lead to a poor understanding of their condition and/or treatment. To bridge this gap, several professional vocabularies and ontologies have been created to map laymen's medical terms to professional medical terms and vice versa.
OBJECTIVE: Many of the existing vocabularies are built manually or semi-automatically, requiring large investments of time and human effort and consequently resulting in slow growth of these vocabularies. In this paper, we present an automatic method to enrich laymen's vocabularies that can be applied to vocabularies in any domain.
METHODS: Our entirely automatic approach uses machine learning, specifically Global Vectors for Word Representation (GloVe), on a corpus collected from a social media healthcare platform to extend and enhance consumer health vocabularies (CHV). Our approach further improves the CHV by incorporating synonyms and hyponyms from the WordNet ontology. The basic GloVe and our novel algorithms incorporating WordNet were evaluated using two laymen datasets from the National Library of Medicine (NLM): the Open-Access Consumer Health Vocabulary (OAC CHV) and the MedlinePlus Healthcare Vocabulary.
RESULTS: The results show that GloVe was able to find new laymen terms with an F-score of 48.44%. Furthermore, our enhanced GloVe approach outperformed basic GloVe with an average F-score of 61%, a relative improvement of 25%.
CONCLUSIONS: This paper presents an automatic approach to enriching consumer health vocabularies using GloVe word embeddings and an auxiliary lexical source, WordNet. Our approach was evaluated using healthcare text downloaded from MedHelp.org, a healthcare social media platform, together with two standard laymen vocabularies, OAC CHV and MedlinePlus. We used the WordNet ontology to expand the healthcare corpus by including synonyms, hyponyms, and hypernyms for each CHV layman term occurrence in the corpus. Given a seed term selected from a concept in the ontology, we measured our algorithms' ability to automatically extract synonyms for those terms that appeared in the ground-truth concept. We found that enhanced GloVe outperformed GloVe with a relative improvement of 25% in the F-score.
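
The enrichment step can be sketched as follows: find candidate laymen terms near a seed term in the embedding space, then expand with WordNet relations. Note that this is a simplification of the paper's method, which folds WordNet into the training corpus itself rather than into the candidate set, and it uses a generic pre-trained GloVe model from gensim's downloader instead of vectors trained on the MedHelp corpus; the seed term is an arbitrary example.

```python
# Sketch of CHV-style enrichment: nearest neighbours of a seed term in a
# GloVe space, expanded with WordNet synonyms and hyponyms. A generic
# pre-trained GloVe model stands in for domain-trained vectors.
import gensim.downloader as api
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# Load a small generic GloVe model (assumption: stand-in for domain vectors).
glove = api.load("glove-wiki-gigaword-50")

seed = "hypertension"  # arbitrary example of a professional term

# Step 1: candidate laymen terms = nearest neighbours in the embedding space.
candidates = {term for term, _ in glove.most_similar(seed, topn=10)}

# Step 2: expand the candidate set with WordNet synonyms and hyponyms.
for synset in wn.synsets(seed):
    for lemma in synset.lemmas():
        candidates.add(lemma.name().replace("_", " "))
    for hyponym in synset.hyponyms():
        for lemma in hyponym.lemmas():
            candidates.add(lemma.name().replace("_", " "))

print(sorted(candidates))
```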

