Towards Robust Word Embeddings for Noisy Texts

Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts such as tweets and other types of non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in which we introduce the concept of bridge-words: artificial words added to the model to strengthen the similarity between standard words and their noisy variants. Our new embeddings outperform baseline models on noisy texts across a wide range of evaluation tasks, both intrinsic and extrinsic, while retaining good performance on standard texts. To the best of our knowledge, this is the first explicit approach to dealing with these types of noisy texts at the word embedding level that goes beyond support for out-of-vocabulary words.

Syntactic Coherence in Word Embedding Spaces

International Journal of Semantic Computing ◽

10.1142/s1793351x21500057 ◽

2021 ◽

Vol 15 (02) ◽

pp. 263-290

Author(s):

Renjith P. Ravindran ◽

Kavi Narayana Murthy

Keyword(s):

Language Processing ◽

Real Space ◽

Word Embedding ◽

Word Embeddings ◽

Wide Range ◽

Reliable Source ◽

Improved Performance ◽

Syntactic Behavior ◽

Syntactic Properties ◽

Embedding Spaces

Word embeddings have recently become a vital part of many Natural Language Processing (NLP) systems. Word embeddings are a suite of techniques that represent words in a language as vectors in an n-dimensional real space, a representation that has been shown to encode a significant amount of syntactic and semantic information. When used in NLP systems, these representations have resulted in improved performance across a wide range of NLP tasks. However, it is not clear how syntactic properties interact with the more widely studied semantic properties of words, nor which factors in the modeling formulation encourage embedding spaces to pick up more of the syntactic behavior, as opposed to the semantic behavior, of words. We investigate several aspects of word embedding spaces and modeling assumptions that maximize syntactic coherence, that is, the degree to which words with similar syntactic properties form distinct neighborhoods in the embedding space. We do so in order to understand which of the existing models maximize syntactic coherence, making them a more reliable source for extracting syntactic category (POS) information. Our analysis shows that the syntactic coherence of S-CODE is superior to that of other more popular and more recent embedding techniques such as Word2vec, fastText, GloVe and LexVec, when measured under compatible parameter settings. Our investigation also gives deeper insights into the geometry of the embedding space with respect to syntactic coherence, and how this is influenced by context size, frequency of words, and dimensionality of the embedding space.
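
One way to make the idea of syntactic coherence concrete is a neighborhood-purity score: for each word, the fraction of its k nearest neighbors (by cosine similarity) that carry the same POS tag. The sketch below is only an illustration under that assumption; it is not the measure used in the paper, and the function names and the toy vectors and tags are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def neighborhood_purity(vectors, pos_tags, k=2):
    """Average fraction of each word's k nearest neighbors sharing its POS tag.

    This is an assumed, illustrative proxy for syntactic coherence.
    """
    scores = []
    for word, vec in vectors.items():
        others = [w for w in vectors if w != word]
        # Rank all other words by similarity to the current word.
        others.sort(key=lambda w: cosine(vec, vectors[w]), reverse=True)
        neighbors = others[:k]
        same = sum(1 for w in neighbors if pos_tags[w] == pos_tags[word])
        scores.append(same / k)
    return sum(scores) / len(scores)

# Toy, hand-made 2-d vectors (hypothetical data, not from the paper).
toy_vectors = {
    "cat": [1.0, 0.1], "dog": [0.9, 0.2], "tree": [0.8, 0.0],
    "run": [0.1, 1.0], "eat": [0.2, 0.9], "jump": [0.0, 0.8],
}
toy_tags = {
    "cat": "NOUN", "dog": "NOUN", "tree": "NOUN",
    "run": "VERB", "eat": "VERB", "jump": "VERB",
}

purity = neighborhood_purity(toy_vectors, toy_tags, k=2)
```

In this toy space the nouns and verbs sit in separate regions, so the score reaches its maximum of 1.0; a space that mixes categories within neighborhoods would score lower.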

Socialized Word Embeddings

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/547 ◽

2017 ◽

Cited By ~ 1

Author(s):

Ziqian Zeng ◽

Yichun Yin ◽

Yangqiu Song ◽

Ming Zhang

Keyword(s):

Social Media ◽

Social Relationship ◽

Large Scale ◽

Language Use ◽

Personal Characteristics ◽

Word Embedding ◽

Word Embeddings ◽

Regularization Term ◽

Local User

Word embeddings have attracted a lot of attention. On social media, each user’s language use can be significantly affected by the user’s friends. In this paper, we propose a socialized word embedding algorithm that considers both a user’s personal characteristics of language use and the user’s social relationships on social media. To incorporate personal characteristics, we propose to use a user vector to represent each user. Then, for each user, the word embeddings are trained on that user’s corpus by combining the global word vectors and the local user vector. To incorporate social relationships, we add a regularization term to impose similarity between two friends. In this way, we can train the global word vectors and user vectors jointly. To demonstrate the effectiveness of the approach, we used the latest large-scale Yelp data to train our vectors, and designed several experiments to show how user vectors affect the results.
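
The two ingredients described above can be sketched in a few lines. The abstract does not specify how the global word vector and the user vector are combined, or the exact form of the regularization term, so this sketch assumes element-wise addition and a squared Euclidean penalty between the vectors of befriended users. All names and values are hypothetical.

```python
def personalized_vector(word_vec, user_vec):
    """Combine a global word vector with a user vector.

    Element-wise addition is an assumption made for illustration.
    """
    return [w + u for w, u in zip(word_vec, user_vec)]

def friend_regularizer(user_vecs, friendships, strength=1.0):
    """Penalty that grows when befriended users have dissimilar vectors.

    Assumed form: strength * sum over friend pairs of 0.5 * ||u_a - u_b||^2.
    """
    penalty = 0.0
    for a, b in friendships:
        diff = [x - y for x, y in zip(user_vecs[a], user_vecs[b])]
        penalty += 0.5 * sum(d * d for d in diff)
    return strength * penalty

# Hypothetical toy data.
user_vecs = {
    "alice": [0.5, 0.0],
    "bob": [0.5, 1.0],
    "carol": [2.0, 0.0],
}
friendships = [("alice", "bob")]  # carol has no friends in this toy graph

alice_word = personalized_vector([1.0, 2.0], user_vecs["alice"])
penalty = friend_regularizer(user_vecs, friendships)
```

Adding such a penalty to the embedding training loss would pull the vectors of friends toward each other during joint training, which is the effect the abstract attributes to its regularization term.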

RepPer: Perception of Psychiatric Disorders on Twitter in French (Preprint)

10.2196/preprints.18539 ◽

2020 ◽

Author(s):

Sarah Delanys ◽

Farah Benamara ◽

Véronique Moriceau ◽

François Olivier ◽

Josiane Mothe

Keyword(s):

Social Media ◽

Psychiatric Disorders ◽

Digital Technology ◽

Psychotic Disorders ◽

Negative Polarity ◽

Machine Learning Algorithms ◽

Annotation Scheme ◽

Word Use ◽

Wide Range ◽

Initial Dataset

BACKGROUND With the advent of digital technology, and specifically user-generated content on social media, new ways have emerged for studying the possible stigmatization of people in relation to mental health. Several studies have examined the discourse about psychiatric pathologies on Twitter, considering mostly tweets in English and a limited number of psychiatric disorder terms. This paper proposes the first study to analyze the use of a wide range of psychiatric terms in tweets in French. OBJECTIVE Our aim is to study how generic, nosographic, and therapeutic psychiatric terms are used on Twitter in French. More specifically, our study has three complementary goals: (1) to analyze the types of psychiatric word use, namely medical, misuse, and irrelevant; (2) to analyze the polarity conveyed in the tweets that use these terms (positive/negative/neutral); and (3) to compare the frequency of these terms to those observed in related work (mainly in English). METHODS Our study was conducted on a corpus of tweets in French posted between 01/01/2016 and 12/31/2018 and collected using dedicated keywords. The corpus was manually annotated by clinical psychiatrists following a multilayer annotation scheme that includes the type of word use and the opinion orientation of the tweet. Two analyses were performed: first, a qualitative analysis to measure the reliability of the manual annotation; then, a quantitative analysis considering mainly term frequency in each layer and exploring the interactions between them. RESULTS One of the first results is a resource in the form of an annotated dataset. The initial dataset is composed of 22,579 tweets in French containing at least one of the selected psychiatric terms. From this set, experts in psychiatry annotated 3,040 randomly selected tweets, which correspond to the resource resulting from our work. The second result is the analysis of the annotations; it shows that terms are misused in 45.3% of the tweets and that their associated polarity is negative in 86.2% of the cases. When considering the three types of term use, 59.5% of the tweets are associated with a negative polarity. Misused terms related to psychotic disorders (55.5%) are more frequent than those related to mood disorders (26.5%). CONCLUSIONS Some psychiatric terms are misused in the corpora we studied, which is consistent with the results reported in related work in other languages. Thanks to the great diversity of studied terms, this work highlighted a disparity in the representations and ways of using psychiatric terms. Moreover, our study is important for helping psychiatrists become aware of term use in new communication media such as social networks, which are widely used. This study has the major advantage of being reproducible thanks to the framework and guidelines we produced, so the study could be repeated in order to analyze the evolution of term usage. While the newly built dataset is a valuable resource for other analytical studies, it could also serve to train machine learning algorithms to automatically identify stigma in social media.

Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Journal Of Big Data ◽

10.1186/s40537-021-00488-w ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Yahya Albalawi ◽

Jim Buckley ◽

Nikola S. Nikolov

Keyword(s):

Social Media ◽

Deep Learning ◽

Comprehensive Evaluation ◽

Classification Problem ◽

Data Sets ◽

Word Embeddings ◽

Data Set ◽

Lower Accuracy ◽

Health Related ◽

The Impact

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another only for testing the generality of our models. Our results point to the conclusion that only four out of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model with F1 score of 75.2% and accuracy of 90.7% compared to F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.
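
Because the abstract reports both F1 score and accuracy, and the two can rank models differently, it may help to see how each is computed from a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not taken from the paper.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) for a binary classifier from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Hypothetical imbalanced test set: 10 health-related tweets out of 100.
# The classifier finds only half of them but raises no false alarms.
acc, f1 = accuracy_and_f1(tp=5, fp=0, fn=5, tn=90)
```

With these counts the accuracy is 0.95 while F1 is about 0.67: on an imbalanced set, the many true negatives lift accuracy, whereas F1 ignores them and is pulled down by the missed positives.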

Classification of unlabeled online media

Scientific Reports ◽

10.1038/s41598-021-85608-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Sakthi Kumar Arul Prakash ◽

Conrad Tucker

Keyword(s):

Social Media ◽

Real World ◽

Graphical Model ◽

Ground Truth ◽

Classification Problem ◽

Machine Learning Algorithms ◽

Social Media Networks ◽

Online Social Media ◽

Wide Range

This work investigates the ability to classify misinformation in online social media networks in a manner that avoids the need for ground truth labels. Rather than approach the classification problem as a task for humans or machine learning algorithms, this work leverages user–user and user–media (i.e., media likes) interactions to infer the type of information (fake vs. authentic) being spread, without needing to know the actual details of the information itself. To study the inception and evolution of user–user and user–media interactions over time, we create an experimental platform that mimics the functionality of real-world social media networks. We develop a graphical model that considers the evolution of this network topology to model the uncertainty (entropy) propagation when fake and authentic media disseminate across the network. The creation of a real-world social media network enables a wide range of hypotheses to be tested pertaining to users, their interactions with other users, and with media content. The discovery that the entropy of user–user and user–media interactions approximates fake and authentic media likes enables us to classify fake media in an unsupervised learning manner.

Embed2Detect: temporally clustered embedded words for event detection in social media

Machine Learning ◽

10.1007/s10994-021-05988-7 ◽

2021 ◽

Author(s):

Hansi Hettiarachchi ◽

Mariam Adedoyin-Olowe ◽

Jagdev Bhogal ◽

Mohamed Medhat Gaber

Keyword(s):

Social Media ◽

Event Detection ◽

High Volume ◽

Detection Methods ◽

Word Embeddings ◽

Agglomerative Clustering ◽

Data Set ◽

Social Media Data ◽

Social Media Platforms ◽

Media Data

Social media is becoming a primary medium to discuss what is happening around the world. Therefore, the data generated by social media platforms contain rich information which describes the ongoing events. Further, the timeliness associated with these data is capable of facilitating immediate insights. However, considering the dynamic nature and high volume of data production in social media data streams, it is impractical to filter the events manually, and therefore automated event detection mechanisms are invaluable to the community. Apart from a few notable exceptions, most previous research on automated event detection has focused only on statistical and syntactical features in data and lacked the involvement of underlying semantics, which are important for effective information retrieval from text since they represent the connections between words and their meanings. In this paper, we propose a novel method termed Embed2Detect for event detection in social media by combining the characteristics of word embeddings and hierarchical agglomerative clustering. The adoption of word embeddings gives Embed2Detect the capability to incorporate powerful semantic features into event detection and overcome a major limitation inherent in previous approaches. We evaluated our method on two recent real social media data sets, which represent the sports and political domains, and also compared the results to several state-of-the-art methods. The obtained results show that Embed2Detect is capable of effective and efficient event detection and that it outperforms recent event detection methods. For the sports data set, Embed2Detect achieved a 27% higher F-measure than the best-performing baseline, and for the political data set, the increase was 29%.

Incorporating LDA With Word Embedding for Web Service Clustering

International Journal of Web Services Research ◽

10.4018/ijwsr.2018100102 ◽

2018 ◽

Vol 15 (4) ◽

pp. 29-44 ◽

Cited By ~ 4

Author(s):

Yi Zhao ◽

Chong Wang ◽

Jian Wang ◽

Keqing He

Keyword(s):

Web Service ◽

Service Discovery ◽

Word Embedding ◽

The Internet ◽

Word Embeddings ◽

Training Process ◽

Web Service Discovery ◽

Processing Data ◽

Clustering Approach ◽

Service Clustering

With the rapid growth of web services on the internet, web service discovery has become a hot topic in services computing. Faced with heterogeneous and unstructured service descriptions, many service clustering approaches have been proposed to promote web service discovery, and many other approaches have leveraged auxiliary features to enhance the classical LDA model to achieve better clustering performance. However, these extended LDA approaches still have limitations in handling data sparsity and noise words. This article proposes a novel web service clustering approach that incorporates LDA with word embedding, leveraging relevant words obtained through word embedding to improve the performance of web service clustering. Specifically, the semantically relevant words of service keywords obtained with Word2vec were used to train the word embeddings and then incorporated into the LDA training process. Finally, experiments conducted on a real-world dataset published on ProgrammableWeb show that the authors' proposed approach can achieve better clustering performance than several classical approaches.

Activity in the brain’s valuation and mentalizing networks is associated with propagation of online recommendations

Scientific Reports ◽

10.1038/s41598-021-90420-2 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Elisa C. Baek ◽

Matthew Brook O’Donnell ◽

Christin Scholz ◽

Rui Pei ◽

Javier O. Garcia ◽

...

Keyword(s):

Social Media ◽

Word Of Mouth ◽

Mental States ◽

Brain Regions ◽

Prior Work ◽

Opinion Change ◽

Wide Range ◽

The Mind ◽

Mentalizing System

Word-of-mouth recommendations influence a wide range of choices and behaviors. What takes place in the mind of recommendation receivers that determines whether they will be successfully influenced? Prior work suggests that brain systems implicated in assessing the value of stimuli (i.e., subjective valuation) and understanding others’ mental states (i.e., mentalizing) play key roles. The current study used neuroimaging and natural language classifiers to extend these findings in a naturalistic context and tested the extent to which the two systems work together or independently in responding to social influence. First, we show that in response to text-based social media recommendations, activity in both the brain’s valuation system and mentalizing system was associated with greater likelihood of opinion change. Second, participants were more likely to update their opinions in response to negative, compared to positive, recommendations, with activity in the mentalizing system scaling with the negativity of the recommendations. Third, decreased functional connectivity between valuation and mentalizing systems was associated with opinion change. Results highlight the role of brain regions involved in mentalizing and positive valuation in recommendation propagation, and further show that mentalizing may be particularly key in processing negative recommendations, whereas the valuation system is relevant in evaluating both positive and negative recommendations.

“Shaman Warrior” Aleksandr Gabyshev: Identity at the Intersection of Two Cultures [“Shaman-voin” Aleksandr Gabyshev: identichnost’ na styke dvukh kul’tur]

Этнографическое обозрение ◽

10.31857/s086954150017418-9 ◽

2021 ◽

Vol 5 ◽

pp. 130-146

Author(s):

Mikhail Bashkirov

Keyword(s):

Social Media ◽

Personal Identity ◽

Research Work ◽

Russian Orthodox ◽

Two Cultures ◽

Wide Range ◽

Vladimir Putin

The figure of “shaman warrior” Aleksandr Gabyshev from Yakutsk became the object of attention in social media in 2019–2020. The interest in Gabyshev was sparked both by the goal he declared (“to drive President Vladimir Putin out of the Kremlin”) and by his peculiar personality. This article draws on a wide range of materials gathered in the course of research work on a visual documentary about Gabyshev. The worldview of the “shaman warrior” was a paradoxical tangle of the native Yakut culture and the Russian Orthodox culture. In many ways Gabyshev adhered to the line of behavior typical of “blessed fools” in the Russian Orthodox tradition. Indeed, his behavior and personal image could be seen as grounded in a sequence of contradictions that seemed meaningless and illogical in the context of the shamanic tradition. Yet aspects both of neoshamanism and of “blessed foolishness” were important assets that let him creatively develop his personal identity.

The Study of Internet Use and Academic Achievement of Elementary Students in Bangkok

Journal of Educational and Developmental Psychology ◽

10.5539/jedp.v6n2p71 ◽

2016 ◽

Vol 6 (2) ◽

pp. 71

Author(s):

Aouyporn Suphasawat ◽

Sirichai Hongsanguansri ◽

Patcharin Seree ◽

Ouaychai Rotjananirunkit

Keyword(s):

Academic Achievement ◽

Social Media ◽

Math Achievement ◽

Achievement Test ◽

Internet Usage ◽

School Students ◽

Wide Range Achievement Test ◽

Multi Stage ◽

Wide Range ◽

Negative Effect

The purpose of this study is to investigate the relationship between internet usage behavior and academic achievement among elementary school students in grades 4-6 in Bangkok. The researcher employed multi-stage sampling to recruit 297 participants. The data were gathered via the following tests: 1) intelligence tests, namely Colored Progressive Matrices (CPM) for students aged 5-11 years or Standard Progressive Matrices (SPM) for those aged 12 years and above, and 2) an academic achievement test, namely the Wide Range Achievement Test Thai Edition (WRAT-Thai). The findings revealed that time spent on the internet is negatively correlated with students’ reading achievement (r = -.24, p < .001), spelling achievement (r = -.26, p < .001), and math achievement (r = -.20, p = .001). More surprisingly, academic-related internet usage was also found to be negatively correlated with math achievement (r = -.20, p < 0.05). Meanwhile, internet usage for social media is correlated with academic achievement in math and reading, (r = -.20, p = .001) and (r = -.13, p < .05), respectively. Moreover, internet usage for entertainment was found to have a negative correlation with academic achievement in reading, spelling and math, (r = -.25, p < .001), (r = -.27, p < .001) and (r = -.21, p < .001), respectively. Internet usage for online business, however, showed no correlation with academic achievement. The study concluded that daily internet usage does have an effect on academic achievement in math. Moreover, when used for entertainment and social media, internet usage can have a negative effect on academic achievement in reading and writing.
