Towards Robust Word Embeddings for Noisy Texts

2020 ◽  
Vol 10 (19) ◽  
pp. 6893
Author(s):  
Yerai Doval ◽  
Jesús Vilares ◽  
Carlos Gómez-Rodríguez

Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts in the form of tweets and other types of non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in which we introduce the concept of bridge-words, which are artificial words added to the model to strengthen the similarity between standard words and their noisy variants. Our new embeddings outperform baseline models on noisy texts on a wide range of evaluation tasks, both intrinsic and extrinsic, while retaining good performance on standard texts. To the best of our knowledge, this is the first explicit approach to dealing with these types of noisy texts at the word embedding level that goes beyond support for out-of-vocabulary words.
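As a rough illustration of the bridge-word idea, the sketch below adds an artificial token, shared by a standard word and its noisy variants, to the skipgram training pairs. The normalization rule (lowercasing and collapsing repeated characters), the `<b:...>` token format, and the function names are assumptions made for this example; the abstract does not say how the authors construct bridge-words.

```python
import re


def bridge_word(token: str) -> str:
    """Hypothetical bridge-word: lowercase and collapse runs of repeated characters."""
    collapsed = re.sub(r"(.)\1+", r"\1", token.lower())
    return f"<b:{collapsed}>"


def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs, adding one extra pair per bridge-word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Ordinary skipgram pair.
            pairs.append((target, tokens[j]))
            # Same context, but attributed to the shared artificial token.
            pairs.append((bridge_word(target), tokens[j]))
    return pairs


pairs = skipgram_pairs(["my", "frieeend", "rocks"], window=1)
```

Because "friend" and "frieeend" map to the same artificial token under this rule, contexts seen with either spelling also update that shared token. A rule this crude also merges unrelated words (for example "good" and "god"), so it is only meant to show the mechanism.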

2021 ◽  
Vol 15 (02) ◽  
pp. 263-290
Author(s):  
Renjith P. Ravindran ◽  
Kavi Narayana Murthy

Word embeddings have recently become a vital part of many Natural Language Processing (NLP) systems. Word embeddings are a suite of techniques that represent words in a language as vectors in an n-dimensional real space that has been shown to encode a significant amount of syntactic and semantic information. When used in NLP systems, these representations have resulted in improved performance across a wide range of NLP tasks. However, it is not clear how syntactic properties interact with the more widely studied semantic properties of words. Nor is it clear which factors in the modeling formulation encourage embedding spaces to pick up more of the syntactic behavior of words as opposed to their semantic behavior. We investigate several aspects of word embedding spaces and modeling assumptions that maximize syntactic coherence, that is, the degree to which words with similar syntactic properties form distinct neighborhoods in the embedding space. We do so in order to understand which of the existing models maximize syntactic coherence, making them a more reliable source for extracting syntactic category (POS) information. Our analysis shows that the syntactic coherence of S-CODE is superior to that of other more popular and more recent embedding techniques such as Word2vec, fastText, GloVe and LexVec, when measured under compatible parameter settings. Our investigation also gives deeper insights into the geometry of the embedding space with respect to syntactic coherence, and how this is influenced by context size, frequency of words, and dimensionality of the embedding space.
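One simple way to make the notion of syntactic coherence concrete is to check how often a word's nearest neighbours share its POS tag. The sketch below does this on toy data; the measure, the function names, and the vectors are assumptions for illustration and are not necessarily the metric used by the authors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pos_neighbour_agreement(vectors, pos, k=1):
    """Average fraction of each word's k nearest neighbours that share its POS tag."""
    total = 0.0
    for word, vec in vectors.items():
        neighbours = sorted(
            (other for other in vectors if other != word),
            key=lambda other: cosine(vec, vectors[other]),
            reverse=True,
        )[:k]
        total += sum(pos[n] == pos[word] for n in neighbours) / k
    return total / len(vectors)


# Toy 2-dimensional embeddings and tags (hypothetical data).
toy_vectors = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "run": [0.1, 1.0], "jump": [0.2, 0.9]}
toy_pos = {"cat": "NOUN", "dog": "NOUN", "run": "VERB", "jump": "VERB"}
score = pos_neighbour_agreement(toy_vectors, toy_pos, k=1)
```

A score near 1 means that same-category words cluster together in this toy space; a score near the proportion expected by chance means the space carries little category information under this measure.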


Author(s):  
Ziqian Zeng ◽  
Yichun Yin ◽  
Yangqiu Song ◽  
Ming Zhang

Word embeddings have attracted a lot of attention. On social media, each user’s language use can be significantly affected by the user’s friends. In this paper, we propose a socialized word embedding algorithm that considers both a user’s personal characteristics of language use and the user’s social relationships on social media. To incorporate personal characteristics, we propose to use a user vector to represent each user. Then, for each user, the word embeddings are trained on that user’s corpus by combining the global word vectors and the local user vector. To incorporate social relationships, we add a regularization term that imposes similarity between two friends. In this way, we can train the global word vectors and user vectors jointly. To demonstrate the effectiveness of the approach, we used the latest large-scale Yelp data to train our vectors, and designed several experiments to show how user vectors affect the results.
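The two ingredients described above can be sketched in a few lines. Here the user-specific word vector is assumed to be the element-wise sum of the global word vector and the user vector, and the social regularizer is assumed to be a squared-distance penalty between friends' user vectors. Both choices, and all names and values, are assumptions for illustration; the abstract only states that the vectors are combined and that similarity between friends is imposed.

```python
def user_word_vector(word_vec, user_vec):
    """Assumed combination: element-wise sum of global word vector and user vector."""
    return [w + u for w, u in zip(word_vec, user_vec)]


def friend_regularizer(user_vecs, friendships, lam=0.1):
    """Penalty (lam / 2) * sum over friend pairs of squared Euclidean distance."""
    penalty = 0.0
    for a, b in friendships:
        penalty += sum((x - y) ** 2 for x, y in zip(user_vecs[a], user_vecs[b]))
    return 0.5 * lam * penalty


# Hypothetical user vectors: ann and bob are close, cat is far from both.
user_vecs = {"ann": [1.0, 0.0], "bob": [0.0, 0.0], "cat": [3.0, 4.0]}
close_penalty = friend_regularizer(user_vecs, [("ann", "bob")], lam=1.0)
far_penalty = friend_regularizer(user_vecs, [("bob", "cat")], lam=1.0)
```

Adding such a penalty to the training loss pulls the vectors of connected users toward each other, which is one plausible reading of "impose similarity between two friends".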


2020 ◽  
Author(s):  
Sarah Delanys ◽  
Farah Benamara ◽  
Véronique Moriceau ◽  
François Olivier ◽  
Josiane Mothe

BACKGROUND: With the advent of digital technology, and specifically user-generated content on social media, new ways have emerged to study the possible stigmatization of people in relation to mental health. Several studies have examined the discourse about psychiatric pathologies on Twitter, considering mostly tweets in English and a limited number of psychiatric disorder terms. This paper proposes the first study to analyze the use of a wide range of psychiatric terms in tweets in French. OBJECTIVE: Our aim is to study how generic, nosographic and therapeutic psychiatric terms are used on Twitter in French. More specifically, our study has three complementary goals: (1) to analyze the types of psychiatric word use, namely medical, misuse and irrelevant, (2) to analyze the polarity conveyed in the tweets that use these terms (positive/negative/neutral), and (3) to compare the frequency of these terms to those observed in related work (mainly in English). METHODS: Our study was conducted on a corpus of tweets in French posted between 01/01/2016 and 12/31/2018 and collected using dedicated keywords. The corpus was manually annotated by clinical psychiatrists following a multilayer annotation scheme that includes the type of word use and the opinion orientation of the tweet. Two analyses were performed: first a qualitative analysis to measure the reliability of the manual annotation, then a quantitative analysis considering mainly term frequency in each layer and exploring the interactions between layers. RESULTS: The first result is a resource in the form of an annotated dataset. The initial dataset is composed of 22,579 tweets in French containing at least one of the selected psychiatric terms. From this set, experts in psychiatry annotated 3,040 randomly selected tweets, which constitute the resource resulting from our work. The second result is the analysis of the annotations; it shows that terms are misused in 45.3% of the tweets and that their associated polarity is negative in 86.2% of the cases. When considering the three types of term use, 59.5% of the tweets are associated with a negative polarity. Misused terms related to psychotic disorders (55.5%) are more frequent than those related to mood disorders (26.5%). CONCLUSIONS: Some psychiatric terms are misused in the corpora we studied, which is consistent with the results reported in related work in other languages. Thanks to the great diversity of terms studied, this work highlights a disparity in the representations and ways of using psychiatric terms. Moreover, our study can help psychiatrists become aware of how these terms are used in new communication media such as social networks, which are widely used. The study is reproducible thanks to the framework and guidelines we produced, so it could be repeated in order to analyze the evolution of term usage. While the newly built dataset is a valuable resource for other analytical studies, it could also serve to train machine learning algorithms to automatically identify stigma in social media.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processings applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another only for testing the generality of our models. Our results point to the conclusion that only four out of the 26 pre-processings improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Sakthi Kumar Arul Prakash ◽  
Conrad Tucker

This work investigates the ability to classify misinformation in online social media networks in a manner that avoids the need for ground truth labels. Rather than approach the classification problem as a task for humans or machine learning algorithms, this work leverages user–user and user–media (i.e., media likes) interactions to infer the type of information (fake vs. authentic) being spread, without needing to know the actual details of the information itself. To study the inception and evolution of user–user and user–media interactions over time, we create an experimental platform that mimics the functionality of real-world social media networks. We develop a graphical model that considers the evolution of this network topology to model the uncertainty (entropy) propagation when fake and authentic media disseminate across the network. The creation of a real-world social media network enables a wide range of hypotheses to be tested pertaining to users, their interactions with other users, and with media content. The discovery that the entropy of user–user and user–media interactions approximates fake and authentic media likes enables us to classify fake media in an unsupervised manner.
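For readers unfamiliar with entropy as a measure of uncertainty, the sketch below computes the Shannon entropy of a list of interaction events, such as the media item targeted by each like. Representing interactions as a flat list of labels is an assumption made for this example; the authors' graphical model of entropy propagation over the network is not reproduced here.

```python
import math
from collections import Counter


def shannon_entropy(events):
    """Shannon entropy, in bits, of the empirical distribution of the events."""
    counts = Counter(events)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical like logs: every like on one item vs. likes spread over four items.
concentrated = shannon_entropy(["m1", "m1", "m1", "m1"])
spread = shannon_entropy(["m1", "m2", "m3", "m4"])
```

Entropy is zero when all interactions target the same item and grows as they spread evenly over more items, which is the sense in which it quantifies uncertainty.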


2021 ◽  
Author(s):  
Hansi Hettiarachchi ◽  
Mariam Adedoyin-Olowe ◽  
Jagdev Bhogal ◽  
Mohamed Medhat Gaber

Social media is becoming a primary medium to discuss what is happening around the world. Therefore, the data generated by social media platforms contain rich information that describes ongoing events. Further, the timeliness associated with these data is capable of facilitating immediate insights. However, considering the dynamic nature and high volume of data production in social media data streams, it is impractical to filter the events manually, and therefore automated event detection mechanisms are invaluable to the community. Apart from a few notable exceptions, most previous research on automated event detection has focused only on statistical and syntactical features in data and has lacked the involvement of underlying semantics, which are important for effective information retrieval from text since they represent the connections between words and their meanings. In this paper, we propose a novel method termed Embed2Detect for event detection in social media that combines the characteristics of word embeddings with hierarchical agglomerative clustering. The adoption of word embeddings gives Embed2Detect the capability to incorporate powerful semantic features into event detection and overcome a major limitation inherent in previous approaches. We evaluated our method on two recent real social media data sets, which represent the sports and political domains, and also compared the results to several state-of-the-art methods. The results show that Embed2Detect is capable of effective and efficient event detection and that it outperforms recent event detection methods. For the sports data set, Embed2Detect achieved a 27% higher F-measure than the best-performing baseline, and for the political data set the increase was 29%.
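To show what hierarchical agglomerative clustering over word embeddings looks like, the sketch below groups toy word vectors using average linkage and cosine distance, and stops merging once the closest pair of clusters is farther apart than a threshold. The linkage choice, the threshold, and the vectors are assumptions for illustration; this is not the Embed2Detect pipeline itself, which the abstract does not describe in detail.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglomerative_clusters(vectors, threshold=0.5):
    """Average-linkage clustering; stops when the closest pair exceeds threshold."""
    clusters = [[word] for word in vectors]

    def linkage(c1, c2):
        dists = [cosine_distance(vectors[a], vectors[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        dist, i, j = min(
            ((linkage(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda item: item[0],
        )
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical 2-dimensional embeddings for sports and politics vocabulary.
word_vectors = {
    "goal": [1.0, 0.1], "penalty": [0.9, 0.2],
    "vote": [0.1, 1.0], "ballot": [0.2, 0.9],
}
clusters = agglomerative_clusters(word_vectors, threshold=0.5)
```

On this toy input the sports words and the politics words end up in separate clusters, because the average distance between the two groups is above the threshold.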


2018 ◽  
Vol 15 (4) ◽  
pp. 29-44 ◽  
Author(s):  
Yi Zhao ◽  
Chong Wang ◽  
Jian Wang ◽  
Keqing He

With the rapid growth of web services on the internet, web service discovery has become a hot topic in services computing. Faced with heterogeneous and unstructured service descriptions, many service clustering approaches have been proposed to promote web service discovery, and many other approaches have leveraged auxiliary features to enhance the classical LDA model and achieve better clustering performance. However, these extended LDA approaches still have limitations in handling data sparsity and noise words. This article proposes a novel web service clustering approach that incorporates word embedding into LDA, leveraging relevant words obtained through word embedding to improve the performance of web service clustering. Specifically, the semantically relevant words of service keywords by Word2vec were used to train the word embeddings and then incorporated into the LDA training process. Finally, experiments conducted on a real-world dataset published on ProgrammableWeb show that the authors' proposed approach can achieve better clustering performance than several classical approaches.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Elisa C. Baek ◽  
Matthew Brook O’Donnell ◽  
Christin Scholz ◽  
Rui Pei ◽  
Javier O. Garcia ◽  
...  

Word-of-mouth recommendations influence a wide range of choices and behaviors. What takes place in the mind of recommendation receivers that determines whether they will be successfully influenced? Prior work suggests that brain systems implicated in assessing the value of stimuli (i.e., subjective valuation) and understanding others’ mental states (i.e., mentalizing) play key roles. The current study used neuroimaging and natural language classifiers to extend these findings in a naturalistic context and tested the extent to which the two systems work together or independently in responding to social influence. First, we show that in response to text-based social media recommendations, activity in both the brain’s valuation system and mentalizing system was associated with greater likelihood of opinion change. Second, participants were more likely to update their opinions in response to negative, compared to positive, recommendations, with activity in the mentalizing system scaling with the negativity of the recommendations. Third, decreased functional connectivity between valuation and mentalizing systems was associated with opinion change. Results highlight the role of brain regions involved in mentalizing and positive valuation in recommendation propagation, and further show that mentalizing may be particularly key in processing negative recommendations, whereas the valuation system is relevant in evaluating both positive and negative recommendations.


2021 ◽  
Vol 5 ◽  
pp. 130-146
Author(s):  
Mikhail Bashkirov ◽  

The figure of “shaman warrior” Aleksandr Gabyshev from Yakutsk became the object of attention in social media in 2019–2020. Interest in Gabyshev was sparked both by the goal he declared (“to drive President Vladimir Putin out of the Kremlin”) and by his peculiar personality. This article draws on a wide range of materials gathered in the course of research for a visual documentary about Gabyshev. The worldview of the “shaman warrior” was a paradoxical tangle of the native Yakut culture and the Russian Orthodox culture. In many ways Gabyshev adhered to the line of behavior typical of “blessed fools” in the Russian Orthodox tradition. Indeed, his behavior and personal image could be seen as grounded in a sequence of contradictions that seemed meaningless and illogical in the context of the shamanic tradition. Yet aspects both of neoshamanism and of “blessed foolishness” were important assets that let him creatively develop his personal identity.


2016 ◽  
Vol 6 (2) ◽  
pp. 71
Author(s):  
Aouyporn Suphasawat ◽  
Sirichai Hongsanguansri ◽  
Patcharin Seree ◽  
Ouaychai Rotjananirunkit

The purpose of this study is to investigate the relationship between internet usage behavior and academic achievement among elementary school students in grades 4-6 in Bangkok. The researchers employed multi-stage sampling to recruit 297 participants. The data were gathered via the following tests: 1) intelligence tests, namely Colored Progressive Matrices (CPM) for students aged 5-11 years or Standard Progressive Matrices (SPM) for those aged 12 years and above, and 2) an academic achievement test, namely the Wide Range Achievement Test Thai Edition (WRAT-Thai). The findings revealed that time spent on the internet is negatively correlated with students’ reading achievement (r = -.24, p < .001), spelling achievement (r = -.26, p < .001), and math achievement (r = -.20, p = .001). More surprisingly, academic-related internet usage was also found to be negatively correlated with math achievement (r = -.20, p < 0.05). Meanwhile, internet usage for social media is negatively correlated with academic achievement in math and reading, (r = -.20, p = .001) and (r = -.13, p < .05), respectively. Moreover, internet usage for entertainment was found to have a negative correlation with academic achievement in reading, spelling and math, (r = -.25, p < .001), (r = -.27, p < .001) and (r = -.21, p < .001), respectively. Internet usage for online business, however, showed no correlation with academic achievement. The study concluded that daily internet usage does have an effect on academic achievement in math. Moreover, when used for entertainment and social media, internet usage can have a negative effect on academic achievement in reading and writing.

