Word Embeddings for Semantic Resemblance of Substantial Text Data: A Comparative Study

We propose a multilingual model to recognize Big Five personality traits from text data in four languages: English, Spanish, Dutch and Italian. Our analysis shows that words with similar semantic meaning in different languages do not necessarily correspond to the same personality traits. Therefore, we propose a personality alignment method, GlobalTrait, which has a mapping for each trait from the source language to the target language (English), such that words that correlate positively with each trait are close together in the multilingual vector space. Using these aligned embeddings for training, we can transfer personality-related training features from high-resource languages such as English to low-resource languages, and obtain better multilingual results than with simple monolingual or unaligned multilingual embeddings. Averaged across the three non-English languages, the F-score rises from 65 to 73.4 (+8.4) when moving from our monolingual model to a multilingual CNN trained on personality-aligned embeddings. We also show relatively good performance in the regression tasks, and better classification results when evaluating our model on a separate Chinese dataset.
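The idea of mapping source-language vectors into the English space can be illustrated with a generic linear map learned from seed word pairs. This is only a minimal sketch under stated assumptions, not the GlobalTrait method itself: the function name `fit_linear_map`, the plain gradient-descent fit, and the two-dimensional toy vectors are all hypothetical and chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def fit_linear_map(source, target, lr=0.1, steps=500):
    """Learn W so that source @ W approximates target (least squares).

    Hypothetical helper: plain gradient descent on 0.5 * ||source @ W - target||^2.
    """
    dim_in, dim_out = len(source[0]), len(target[0])
    w = [[0.0] * dim_out for _ in range(dim_in)]
    source_t = transpose(source)
    for _ in range(steps):
        pred = matmul(source, w)
        residual = [[p - t for p, t in zip(p_row, t_row)]
                    for p_row, t_row in zip(pred, target)]
        grad = matmul(source_t, residual)
        w = [[w_ij - lr * g_ij for w_ij, g_ij in zip(w_row, g_row)]
             for w_row, g_row in zip(w, grad)]
    return w


# Toy seed pairs (assumed data): each source vector and the target-space
# vector it should land on. Here the target space is a 90-degree rotation.
source_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_vecs = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]

w = fit_linear_map(source_vecs, target_vecs)

# Map an unseen source-language vector into the target space.
mapped = matmul([[2.0, 1.0]], w)[0]
print(mapped)
```

In a trait-specific setting, one such map could be fitted per trait from words that correlate with that trait, but how the seed pairs are chosen is left open here.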

Comparative study of deep learning models for sentiment analysis

International Journal of Engineering & Technology ◽ 10.14419/ijet.v7i4.24459 ◽ 2018 ◽ Vol 7 (2.14) ◽ pp. 5726

Author(s): Oumaima Hourrane ◽ El Habib Benlahmar ◽ Ahmed Zellou

Keyword(s): Deep Learning ◽ Comparative Study ◽ Sentiment Analysis ◽ Language Processing ◽ Word Embeddings ◽ Human Language ◽ Learning Models ◽ Automatic Learning ◽ Learning Capability ◽ The Web

Sentiment analysis is one of the newer and more engaging areas of natural language processing, having emerged alongside community sites on the web. Taking advantage of the amount of information now available, research and industry have been seeking ways to automatically analyze the sentiments expressed in texts. The challenges for this task are the ambiguity of human language and the lack of labeled data. To address them, sentiment analysis has been combined with deep learning, whose models are effective because of their automatic learning capability. In this paper, we provide a comparative study on the IMDB movie review dataset: we compare word embeddings and deep learning models for sentiment analysis and report broad empirical results for those interested in applying deep learning to sentiment analysis in real-world settings.

A Comparative Study on Feature Selection Techniques for Multi-cluster Text Data

Harmony Search and Nature Inspired Optimization Algorithms - Advances in Intelligent Systems and Computing ◽ 10.1007/978-981-13-0761-4_21 ◽ 2018 ◽ pp. 203-215

Author(s): Ananya Gupta ◽ Shahin Ara Begum

Keyword(s): Feature Selection ◽ Comparative Study ◽ Text Data ◽ Feature Selection Techniques

Query Reformulation Based on Word Embeddings: A Comparative Study

Security Informatics and Law Enforcement - Technology Development for Security Practitioners ◽ 10.1007/978-3-030-69460-9_3 ◽ 2021 ◽ pp. 41-55

Author(s): Panos Panagiotou ◽ George Kalpakis ◽ Theodora Tsikrika ◽ Stefanos Vrochidis ◽ Ioannis Kompatsiaris

Keyword(s): Comparative Study ◽ Query Reformulation ◽ Word Embeddings

A Brief Study of Approaches to Text Feature Selection

Modern Technologies for Big Data Classification and Clustering - Advances in Data Mining and Database Management ◽ 10.4018/978-1-5225-2805-0.ch009 ◽ 2018 ◽ pp. 216-243

Author(s): Ravindra Babu Tallamaraju ◽ Manas Kirti

Keyword(s): Feature Selection ◽ Comparative Study ◽ Social Networking ◽ Text Data ◽ Social Networking Services ◽ Storage Devices ◽ Text Feature ◽ Wide Range ◽ Efficiency And Effectiveness ◽ Language Text

With the falling cost of storage devices, increasing amounts of data are being stored and processed to extract intelligence. Classification and clustering have been the two major approaches to generating data abstractions. Over the last few years, text has come to dominate the types of data shared and stored. Sources of such datasets include mobile data, e-commerce, and a wide range of continuously expanding social networking services. Within each of these sources, the nature of the data differs drastically, from formal language text to Twitter or SMS slang, which calls for different ways of processing the data to produce meaningful summaries. Such summaries can be used effectively for business advantage. Processing such data requires identifying an appropriate set of features, for both efficiency and effectiveness. In this chapter, we discuss approaches to text feature selection and present a comparative study.

An End-to-End Efficient Lucene-Based Framework of Document/Information Retrieval

International Journal of Information Retrieval Research ◽ 10.4018/ijirr.289950 ◽ 2022 ◽ Vol 12 (1) ◽ pp. 0-0

Keyword(s): Big Data ◽ Information Retrieval ◽ Query Expansion ◽ Industrial Revolution ◽ Document Retrieval ◽ Word Embeddings ◽ Text Data ◽ Digital World ◽ Stage System ◽ End To End

In the context of big data and the Industry 4.0 era, improving the efficiency of document/information retrieval frameworks so that they can handle the ever-growing volume of text data in an increasingly digital world is a must. This article describes a two-stage system for document/information retrieval. First, a Lucene-based document retrieval tool is implemented, and a couple of query expansion techniques using a comparable corpus (Wikipedia) and word embeddings are proposed and tested. Second, a retention-fidelity summarization protocol is applied to the retrieved documents to create a short, accurate, and fluent extract of a single longer retrieved document (or of a set of top retrieved documents). The results show that using word embeddings is an excellent way to achieve higher precision and retrieve more accurate documents. The summaries obtained also satisfy the retention and fidelity criteria for relevant summaries.
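Embedding-based query expansion, as mentioned in the first stage above, can be sketched as adding each query term's nearest neighbours in the embedding space. The following is a minimal illustration, not the article's implementation and not Lucene code: the function `expand_query`, the parameters `k` and `min_sim`, and the toy two-dimensional vectors are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(terms, embeddings, k=2, min_sim=0.5):
    """Append up to k nearest neighbours of each query term.

    Hypothetical helper: neighbours below min_sim are ignored, and terms
    without an embedding are kept in the query but not expanded.
    """
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue
        scored = sorted(
            ((cosine(embeddings[term], vec), word)
             for word, vec in embeddings.items()
             if word != term and word not in expanded),
            reverse=True,
        )
        expanded.extend(word for sim, word in scored[:k] if sim >= min_sim)
    return expanded


# Toy embeddings (assumed values, for illustration only).
toy_embeddings = {
    "car": [0.9, 0.1],
    "automobile": [0.85, 0.15],
    "vehicle": [0.8, 0.3],
    "banana": [0.1, 0.9],
}

expanded = expand_query(["car"], toy_embeddings, k=2)
print(expanded)
```

The expanded term list would then be passed to the retrieval engine as a broader query; the similarity threshold guards against pulling in unrelated terms when a word has few close neighbours.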