A Comparative Study of Word Embeddings for the Construction of a Social Media Expert Filter

2021, pp. 196-208
Author(s): Jose A. Diaz-Garcia, M. Dolores Ruiz, Maria J. Martin-Bautista

2021, Vol 8 (1)
Author(s): Yahya Albalawi, Jim Buckley, Nikola S. Nikolov

Abstract: This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and an accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best traditional classifier we trained is comparable to that of the deep learning methods on the first data set, but significantly worse on the second data set.
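
The embedding-plus-BLSTM setup described here follows a common pattern: load pre-trained word vectors, build an embedding matrix over the tweet vocabulary, and feed padded token sequences to a bidirectional LSTM. The following is a minimal sketch of that pattern, not the authors' code; the embedding file path (standing in for the Mazajak vectors), the sequence length, and the hyperparameters are assumptions.

```python
# Minimal sketch: pre-trained word vectors as the input layer of a BLSTM
# tweet classifier (binary: health-related vs. not). The embedding path,
# dimensions, and hyperparameters are illustrative assumptions.
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models

texts = ["tweet text 1", "tweet text 2"]   # pre-processed Arabic tweets
labels = np.array([1, 0])                  # 1 = health-related

# Tokenize the tweets and pad them to a fixed sequence length.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=50)

# Load pre-trained embeddings (hypothetical local path to CBOW vectors).
wv = KeyedVectors.load_word2vec_format("mazajak_cbow.bin", binary=True)
dim = wv.vector_size

# Build an embedding matrix aligned with the tokenizer's word index.
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, dim))
for word, idx in tokenizer.word_index.items():
    if word in wv:
        embedding_matrix[idx] = wv[word]

# BLSTM classifier with the frozen pre-trained embeddings as input layer.
model = models.Sequential([
    layers.Embedding(input_dim=embedding_matrix.shape[0], output_dim=dim,
                     weights=[embedding_matrix], trainable=False),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=3, batch_size=32)
```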


2021
Author(s): Hansi Hettiarachchi, Mariam Adedoyin-Olowe, Jagdev Bhogal, Mohamed Medhat Gaber

Abstract: Social media is becoming a primary medium to discuss what is happening around the world. Therefore, the data generated by social media platforms contain rich information which describes the ongoing events. Further, the timeliness associated with these data is capable of facilitating immediate insights. However, considering the dynamic nature and high volume of data production in social media data streams, it is impractical to filter the events manually, and therefore automated event detection mechanisms are invaluable to the community. Apart from a few notable exceptions, most previous research on automated event detection has focused only on statistical and syntactical features in the data and has lacked the involvement of underlying semantics, which are important for effective information retrieval from text since they represent the connections between words and their meanings. In this paper, we propose a novel method termed Embed2Detect for event detection in social media by combining the characteristics of word embeddings with hierarchical agglomerative clustering. The adoption of word embeddings gives Embed2Detect the capability to incorporate powerful semantic features into event detection and overcome a major limitation inherent in previous approaches. We evaluated our method on two recent real-world social media data sets, representing the sports and political domains, and also compared the results to several state-of-the-art methods. The obtained results show that Embed2Detect is capable of effective and efficient event detection and that it outperforms recent event detection methods. For the sports data set, Embed2Detect achieved a 27% higher F-measure than the best-performing baseline, and for the political data set, the increase was 29%.
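
Embed2Detect itself combines self-learned word embeddings with hierarchical agglomerative clustering plus change detection between time windows; the sketch below only illustrates the clustering building block on toy data, assuming gensim and SciPy, and is not the authors' implementation.

```python
# Simplified illustration of the clustering building block: learn word
# embeddings from a time window of tweets, then group vocabulary terms with
# hierarchical agglomerative clustering so that clusters of related terms can
# be inspected as candidate events. (The full Embed2Detect method adds
# window-to-window change detection on top of this.)
import numpy as np
from gensim.models import Word2Vec
from scipy.cluster.hierarchy import linkage, fcluster

# Tokenized tweets from one time window (toy example).
window_tweets = [
    ["goal", "messi", "scores", "amazing"],
    ["messi", "goal", "penalty"],
    ["election", "vote", "results"],
    ["vote", "count", "election", "night"],
]

# Learn window-specific word embeddings.
model = Word2Vec(window_tweets, vector_size=50, window=5, min_count=1, epochs=50)
vocab = list(model.wv.index_to_key)
vectors = np.array([model.wv[w] for w in vocab])

# Hierarchical agglomerative clustering: cosine distance, average linkage.
Z = linkage(vectors, method="average", metric="cosine")
cluster_ids = fcluster(Z, t=0.5, criterion="distance")

# Print candidate event clusters (groups of semantically related terms).
for cid in sorted(set(cluster_ids)):
    terms = [w for w, c in zip(vocab, cluster_ids) if c == cid]
    print(f"cluster {cid}: {terms}")
```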


Author(s): P. Tamije Selvy, V. Suriya Prakash, S. Shriram, N. Vimalesh

The number of social media users has increased rapidly in recent years, and both valuable and non-valuable information shared on social media can reach many people in a short period of time; the valuable information shared there can therefore be used for many types of analysis. In this paper, tweets shared about a disaster are collected and an alert system is built. The alert system notifies users after checking the received data against a centralized database. The paper also gives a comparative study of the algorithms used to extract data from social media, reporting the accuracy rates of the different algorithms that can be used for text mining.
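
The comparative part of such a study can be outlined with a standard TF-IDF pipeline. The sketch below is an assumed setup, not the authors' implementation: the classifier choices, the toy tweets, and the keyword set standing in for the centralized database are all illustrative.

```python
# Sketch of a comparative accuracy check for disaster-tweet classification,
# plus a naive alert check against a hypothetical centralized keyword store.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy labelled tweets: 1 = genuine disaster report, 0 = unrelated.
tweets = [
    "flood waters rising downtown",
    "earthquake felt near the coast",
    "massive wildfire spreading fast",
    "storm damage reported on the highway",
    "great movie night with friends",
    "new phone released today",
    "coffee and sunshine this morning",
]
labels = [1, 1, 1, 1, 0, 0, 0]

# Compare several classifiers on the same TF-IDF features.
for name, clf in [("NaiveBayes", MultinomialNB()),
                  ("SVM", LinearSVC()),
                  ("LogReg", LogisticRegression(max_iter=1000))]:
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, tweets, labels, cv=3, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.2f}")

# Naive alert step: flag a tweet if it matches known disaster keywords
# (a stand-in for the centralized database lookup described in the paper).
disaster_keywords = {"flood", "earthquake", "wildfire", "storm"}

def should_alert(tweet: str) -> bool:
    return any(word in disaster_keywords for word in tweet.lower().split())

print(should_alert("flood warning issued for the river valley"))  # True
```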


2021, Vol 12 (3), pp. 111-128
Author(s): Aljohara Fahad Al Saud

Identifying language affiliation among the children of immigrant families is crucial for their language identity. This study aimed to determine the role played by Arab families in the Kingdom of Saudi Arabia, Austria, and Britain in attaining language affiliation among their children. It also aimed to identify the challenges facing families living in these countries in achieving language affiliation among their children. The study population consisted of all the families living in the Kingdom of Saudi Arabia, in addition to all the Arab families living in Austria and Britain; the study sample included 120 parents. The researcher adopted the descriptive-analytical approach and used a questionnaire as the study tool. The study reached several results. First, the role played by families in the Kingdom of Saudi Arabia, Austria, and the United Kingdom in attaining language affiliation among their children received a high degree of response. Second, the challenges facing the activation of the family's role in attaining their children's language affiliation received a high degree of response in the Kingdom of Saudi Arabia and Austria, while in Britain they received a very high degree of response. The study recommended involving all family members in finding different and creative ways of practicing their native language and activating the role of social media in developing the language affiliation of children.


2020
Author(s): Mohammed Ibrahim, Susan Gauch, Omar Salman, Mohammed Alqahatani

BACKGROUND: Clear language makes communication easier between any two parties. A layman may have difficulty communicating with a professional due to not understanding the specialized terms common to the domain. In healthcare, it is rare to find a layman knowledgeable in medical jargon, which can lead to a poor understanding of their condition and/or treatment. To bridge this gap, several professional vocabularies and ontologies have been created to map laymen's medical terms to professional medical terms and vice versa.
OBJECTIVE: Many of the existing vocabularies are built manually or semi-automatically, requiring large investments of time and human effort and consequently resulting in slow growth of these vocabularies. In this paper, we present an automatic method to enrich laymen's vocabularies that can be applied to vocabularies in any domain.
METHODS: Our entirely automatic approach uses machine learning, specifically Global Vectors for Word Representation (GloVe), on a corpus collected from a social media healthcare platform to extend and enhance consumer health vocabularies (CHV). Our approach further improves the CHV by incorporating synonyms and hyponyms from the WordNet ontology. The basic GloVe and our novel algorithms incorporating WordNet were evaluated using two laymen datasets from the National Library of Medicine (NLM): the Open-Access Consumer Health Vocabulary (OAC CHV) and the MedlinePlus Healthcare Vocabulary.
RESULTS: The results show that GloVe was able to find new laymen terms with an F-score of 48.44%. Furthermore, our enhanced GloVe approach outperformed basic GloVe with an average F-score of 61%, a relative improvement of 25%.
CONCLUSIONS: This paper presents an automatic approach to enriching consumer health vocabularies using GloVe word embeddings and an auxiliary lexical source, WordNet. Our approach was evaluated using healthcare text downloaded from MedHelp.org, a healthcare social media platform, together with two standard laymen vocabularies, OAC CHV and MedlinePlus. We used the WordNet ontology to expand the healthcare corpus by including synonyms, hyponyms, and hypernyms for each CHV layman term occurrence in the corpus. Given a seed term selected from a concept in the ontology, we measured our algorithms' ability to automatically extract synonyms for those terms that appeared in the ground-truth concept. We found that enhanced GloVe outperformed GloVe with a relative improvement of 25% in the F-score.
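
The enrichment step can be sketched as follows: find candidate laymen terms near a seed term in the embedding space, then expand with WordNet relations. Note that this is a simplification of the paper's method, which folds WordNet into the training corpus itself rather than into the candidate set, and it uses a generic pre-trained GloVe model from gensim's downloader instead of vectors trained on the MedHelp corpus; the seed term is an arbitrary example.

```python
# Sketch of CHV-style enrichment: nearest neighbours of a seed term in a
# GloVe space, expanded with WordNet synonyms and hyponyms. A generic
# pre-trained GloVe model stands in for domain-trained vectors.
import gensim.downloader as api
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# Load a small generic GloVe model (assumption: stand-in for domain vectors).
glove = api.load("glove-wiki-gigaword-50")

seed = "hypertension"  # arbitrary example of a professional term

# Step 1: candidate laymen terms = nearest neighbours in the embedding space.
candidates = {term for term, _ in glove.most_similar(seed, topn=10)}

# Step 2: expand the candidate set with WordNet synonyms and hyponyms.
for synset in wn.synsets(seed):
    for lemma in synset.lemmas():
        candidates.add(lemma.name().replace("_", " "))
    for hyponym in synset.hyponyms():
        for lemma in hyponym.lemmas():
            candidates.add(lemma.name().replace("_", " "))

print(sorted(candidates))
```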

