Efficient Weighted Semantic Score Based on the Huffman Coding Algorithm and Knowledge Bases for Word Sequences Embedding

2020 · Vol 16 (2) · pp. 126-142
Author(s): Nada Ben-Lhachemi, El Habib Nfaoui

Learning text representations forms a core component of numerous natural language processing applications. Word embedding is a type of text representation that allows words with similar meanings to have similar representations. Word embedding techniques categorize semantic similarities between linguistic items based on their distributional properties in large samples of text data. Although these techniques are very efficient, handling semantic and pragmatic ambiguity with high accuracy is still a challenging research task. In this article, we propose a new feature, a semantic score, which handles ambiguities between words. We use external knowledge bases and the Huffman Coding algorithm to compute this score, which captures the semantic relatedness between all fragments composing a given text. We combine this feature with word embedding methods to improve text representation. We evaluate our method on a hashtag recommendation system for Twitter, where text is noisy and short. The experimental results demonstrate that, compared with state-of-the-art algorithms, our method achieves good results.
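
The authors' implementation is not reproduced here, but the Huffman component can be illustrated. Below is a minimal sketch of one plausible reading: build a Huffman tree over corpus word frequencies and use each word's code length as an information-content weight (rare words get longer codes, hence larger weights). The function name huffman_code_lengths is illustrative, and the knowledge-base relatedness part of the authors' score is omitted.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Build a Huffman tree over word frequencies and return each
    word's code length (its depth in the tree); rarer words get
    longer codes, i.e. higher information content."""
    heap = [(f, i, [w]) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    depth = {w: 0 for w in freqs}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, ws1 = heapq.heappop(heap)
        f2, _, ws2 = heapq.heappop(heap)
        for w in ws1 + ws2:        # every word under the merged node
            depth[w] += 1          # sinks one level deeper
        heapq.heappush(heap, (f1 + f2, tiebreak, ws1 + ws2))
        tiebreak += 1
    return depth

corpus = "the cat sat on the mat the cat".split()
lengths = huffman_code_lengths(Counter(corpus))
total = sum(lengths.values())
weights = {w: l / total for w, l in lengths.items()}  # per-word semantic weight
print(weights)   # frequent "the" gets a smaller weight than rare "on" or "sat"
```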

Sentiment classification is one of the best-known and most popular domains of machine learning and natural language processing: an algorithm is developed to understand the opinion of an entity in a manner similar to human beings. This research article presents work in that vein. Concepts from natural language processing are applied to text representation, and a novel word embedding model is then proposed for effective classification of the data. TF-IDF and common BoW models are considered for representing the text data; the importance of these models is discussed in the respective sections. The proposal is tested on the IMDB dataset, with a 50% training / 50% testing split and three random shufflings of the data used for evaluation of the model.
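
As a point of reference for the evaluation protocol described above, here is a minimal sketch of a TF-IDF sentiment classifier on IMDB-style data. The path aclImdb/train and the logistic-regression classifier are assumptions; the abstract does not specify either.

```python
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumes the IMDB reviews are unpacked at this (hypothetical) path.
reviews = load_files("aclImdb/train", categories=["pos", "neg"], encoding="utf-8")

# 50% train / 50% test, repeated over three random shufflings as in the text.
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(
        reviews.data, reviews.target, test_size=0.5, shuffle=True,
        random_state=seed)
    vec = TfidfVectorizer()            # swap in CountVectorizer() for plain BoW
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_tr), y_tr)
    print(seed, accuracy_score(y_te, clf.predict(vec.transform(X_te))))
```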


2019 · Vol 9 (21) · pp. 4701
Author(s): Qicai Wang, Peiyu Liu, Zhenfang Zhu, Hongxia Yin, Qiuyue Zhang, ...

As a core task of natural language processing and information retrieval, automatic text summarization is widely applied in many fields. Two approaches to the task currently exist: abstractive and extractive. On this basis, we propose a novel hybrid extractive-abstractive model that combines BERT (Bidirectional Encoder Representations from Transformers) word embeddings with reinforcement learning. Firstly, we convert the human-written abstractive summaries into ground-truth labels. Secondly, we use BERT word embeddings as the text representation and pre-train the two sub-models separately. Finally, the extraction network and the abstraction network are bridged by reinforcement learning. To verify the performance of the model, we compare it with currently popular automatic text summarization models on the CNN/Daily Mail dataset, using the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics as the evaluation method. Extensive experimental results show that the accuracy of the model is noticeably improved.
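
The trained, RL-bridged model itself cannot be reconstructed from the abstract, but its first ingredient, using BERT embeddings as the text representation for extraction, can be sketched. The scoring rule below (rank sentences by cosine similarity to the mean document embedding) is a crude stand-in for the paper's learned extractor, not the authors' method.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    """Mean-pooled BERT sentence embeddings, masking out padding."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch).last_hidden_state       # (batch, seq, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (out * mask).sum(1) / mask.sum(1)

sents = ["The cabinet met on Monday.", "It approved the new budget.",
         "Rain is expected tomorrow."]
vecs = embed(sents)
doc = vecs.mean(0, keepdim=True)                    # whole-document embedding
scores = torch.nn.functional.cosine_similarity(vecs, doc)
print(sents[int(scores.argmax())])                  # crude one-sentence "summary"
```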


Author(s): Arpita Roy, Shimei Pan

Word embedding, a process to automatically learn the mathematical representations of words from unlabeled text corpora, has gained a lot of attention recently. Since words are the basic units of a natural language, the more precisely we can represent the morphological, syntactic and semantic properties of words, the better we can support downstream Natural Language Processing (NLP) tasks. Since traditional word embeddings are mainly designed to capture the semantic relatedness between co-occurring words in a predefined context, they may not be effective in encoding other information that is important for different NLP applications. In this survey, we summarize the recent advances in incorporating extra knowledge to enhance word embedding. We also identify the limitations of existing work and point out a few promising future directions.
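
One concrete technique from this line of work, retrofitting (Faruqui et al., 2015), shows the flavor of knowledge injection such surveys cover: pre-trained vectors are nudged toward their neighbors in a semantic lexicon while staying close to their distributional estimates. A minimal sketch with toy vectors follows, using the simplified update with unit weights.

```python
import numpy as np

def retrofit(vectors, lexicon, iterations=10):
    """Pull each word's vector toward the (current) vectors of its
    lexicon neighbours while anchoring it to its original estimate."""
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for w, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new]
            if not nbrs:
                continue
            # weight 1 for the original vector, 1 per neighbour edge
            new[w] = (vectors[w] + sum(new[n] for n in nbrs)) / (1 + len(nbrs))
    return new

vecs = {"happy": np.array([1.0, 0.0]), "glad": np.array([0.0, 1.0]),
        "sad": np.array([-1.0, 0.0])}
lexicon = {"happy": ["glad"], "glad": ["happy"]}   # toy synonym lexicon
print(retrofit(vecs, lexicon)["happy"])            # moved toward "glad"
```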


2021 · pp. 35-48
Author(s): A.A. Reybandt, A.N. Areseniev, T.G. Maximova, ...

The article demonstrates the design and implementation of a data aggregation algorithm for a future recommendation system in the field of personalized nutrition. The work draws on theoretical material on machine learning methods in natural language processing, as well as tutorials on building classification models with the Keras library. A distinctive feature of the classifier implemented in this project is that it accepts images and text data as input simultaneously, yielding more accurate and balanced predictions. The implementation of the designed data aggregation algorithm for the recommendation system in the field of personalized nutrition is considered in detail, and the tools and approaches chosen at the various stages of aggregation are reviewed. Metrics for evaluating the model's predictions on geographic-label classification and on the analysis of average user-review sentiment are defined, and the results are visualized. Predicted geo tags and extracted comment sentiments were added to the main data frame as additional features.
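
The abstract names Keras and a classifier that takes images and text simultaneously. Below is a minimal sketch of such a two-input model; the shapes, vocabulary size, and the ten-class output (e.g. geographic labels) are illustrative assumptions, not the authors' architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

text_in = layers.Input(shape=(100,), name="tokens")     # padded token ids
img_in = layers.Input(shape=(64, 64, 3), name="image")

# Text branch: embed tokens, then average over the sequence.
t = layers.Embedding(input_dim=20000, output_dim=64)(text_in)
t = layers.GlobalAveragePooling1D()(t)

# Image branch: a small convolutional encoder.
i = layers.Conv2D(16, 3, activation="relu")(img_in)
i = layers.GlobalAveragePooling2D()(i)

# Fuse both modalities before the classification head.
x = layers.concatenate([t, i])
out = layers.Dense(10, activation="softmax")(x)         # e.g. 10 geo-label classes

model = Model(inputs=[text_in, img_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```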


Author(s): Haoran Huang, Qi Zhang, Xuanjing Huang

In this study, we investigated the problem of recommending usernames when people attempt to use the "@" sign to mention other people in Twitter-like social media. With the extremely rapid development of social networking services, this problem has received considerable attention in recent years. Previous methods have studied the problem from different aspects. Because most Twitter-like microblogging services limit the length of posts, statistical learning methods may be affected by the problems of word sparseness and synonyms. Although recent progress in neural word embedding methods has advanced the state of the art in many natural language processing tasks, the benefits of word embedding have not been taken into consideration for this problem. In this work, we proposed a novel end-to-end memory network architecture to perform this task. We incorporated the interests of users through an external memory, and a hierarchical attention mechanism was applied to better model those interests. The experimental results on a dataset we collected from Twitter demonstrated that the proposed method could outperform state-of-the-art approaches.
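
To make the memory-network idea concrete, here is a sketch of a single attention hop: the post being written attends over embeddings of a candidate user's past posts (the external memory) and scores that candidate. All dimensions are illustrative, and the paper's full model stacks such hops under a hierarchical attention mechanism.

```python
import torch
import torch.nn.functional as F

d = 64
query = torch.randn(1, d)           # embedding of the post being written
memory = torch.randn(20, d)         # embeddings of 20 past posts by a candidate

attn = F.softmax(query @ memory.T, dim=-1)      # (1, 20) attention over memory
user_repr = attn @ memory                       # attended user-interest vector
score = F.cosine_similarity(query, user_repr)   # relevance of this candidate
print(float(score))
```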


Sensors · 2019 · Vol 19 (17) · pp. 3728
Author(s): Zhou, Wang, Sun, Sun

Text representation is one of the key tasks in the field of natural language processing (NLP). Traditional feature extraction and weighting methods often use the bag-of-words (BoW) model, which may lead to a lack of semantic information as well as the problems of high dimensionality and high sparsity. At present, a popular idea for solving these problems is to utilize deep learning methods. In this paper, feature weighting, word embedding, and topic models are combined to propose an unsupervised text representation method named the feature, probability, and word embedding method. The main idea is to use the word embedding technique Word2Vec to obtain word vectors, and then combine these with TF-IDF feature weighting and the LDA topic model. Compared with traditional feature engineering, the proposed method not only increases the expressive ability of the vector space model but also reduces the dimensionality of the document vector. Besides this, it can be used to address BoW's problems of insufficient semantic information, high dimensionality, and high sparsity. We use the proposed method for the task of text categorization and verify its validity.
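
The described combination can be sketched directly with gensim: a document vector is the TF-IDF-weighted average of its Word2Vec word vectors, concatenated with its LDA topic distribution. The toy corpus, vector size, and topic count below are illustrative, and the paper's exact weighting scheme may differ.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, TfidfModel, Word2Vec

docs = [["cat", "sat", "mat"], ["dog", "bit", "man"], ["cat", "dog", "pets"]]
dictionary = Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]

w2v = Word2Vec(docs, vector_size=16, min_count=1, seed=0)
tfidf = TfidfModel(bows)
lda = LdaModel(bows, num_topics=2, id2word=dictionary, random_state=0)

def doc_vector(doc, bow):
    weights = dict(tfidf[bow])                   # term id -> tf-idf weight
    wv = sum(weights.get(dictionary.token2id[w], 0.0) * w2v.wv[w] for w in doc)
    wv /= (sum(weights.values()) or 1.0)         # tf-idf-weighted average
    topics = np.zeros(2)
    for t, p in lda.get_document_topics(bow, minimum_probability=0.0):
        topics[t] = p
    return np.concatenate([wv, topics])          # dense, low-dimensional

print(doc_vector(docs[0], bows[0]).shape)        # (18,)
```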


2020 · Vol 10 (12) · pp. 4386
Author(s): Sandra Rizkallah, Amir F. Atiya, Samir Shaheen

Embedding the words of a dictionary as vectors in a space has become an active research field, due to its many uses in natural language processing applications. Distances between the vectors should reflect the relatedness between the corresponding words. The problem with existing word embedding methods is that they often fail to distinguish between synonymous, antonymous, and unrelated word pairs. Meanwhile, polarity detection is crucial for applications such as sentiment analysis. In this work we propose an embedding approach that is designed to capture polarity. The approach is based on embedding the word vectors into a sphere, whereby the dot product between any two vectors represents their similarity. Vectors corresponding to synonymous words would be close to each other on the sphere, while a word and its antonym would lie at opposite poles of the sphere. The approach used to design the vectors is a simple relaxation algorithm. The proposed word embedding is successful in distinguishing between synonyms, antonyms, and unrelated word pairs. It achieves results that are better than those of some of the state-of-the-art techniques and competes well with the others.
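
The authors' exact relaxation rule is not given in the abstract; the sketch below only illustrates the geometry with one plausible rule: pull synonym vectors together, push each antonym toward the other's antipode, and renormalize onto the unit sphere after every pass.

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["good", "great", "bad"]
vecs = {w: v / np.linalg.norm(v)
        for w, v in zip(words, rng.normal(size=(3, 8)))}
synonym_pairs = [("good", "great")]
antonym_pairs = [("good", "bad")]

for _ in range(100):
    for a, b in synonym_pairs:                 # pull synonyms together
        vecs[a] = vecs[a] + 0.1 * (vecs[b] - vecs[a])
        vecs[b] = vecs[b] + 0.1 * (vecs[a] - vecs[b])
    for a, b in antonym_pairs:                 # push antonyms to antipodes
        vecs[a] = vecs[a] + 0.1 * (-vecs[b] - vecs[a])
        vecs[b] = vecs[b] + 0.1 * (-vecs[a] - vecs[b])
    for w in vecs:                             # project back onto the sphere
        vecs[w] = vecs[w] / np.linalg.norm(vecs[w])

# Dot products approach +1 for synonyms and -1 for antonyms.
print(vecs["good"] @ vecs["great"], vecs["good"] @ vecs["bad"])
```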


2009 · Vol 34 · pp. 443-498
Author(s): E. Gabrilovich, S. Markovitch

Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization and on computing the degree of semantic relatedness between fragments of natural language text. Using ESA results in significant improvements over the previous state of the art in both tasks. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
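
ESA's core computation is easy to sketch: represent a text by its TF-IDF similarity to every concept (Wikipedia article), and compare two texts by the cosine of their concept vectors. The three "articles" below are toy stand-ins for Wikipedia; real ESA uses the full encyclopedia with an inverted term-concept index.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

concepts = {
    "Cat":      "cats are small domesticated felines that purr",
    "Dog":      "dogs are domesticated canines kept as pets",
    "Computer": "a computer executes programs and processes data",
}
vec = TfidfVectorizer().fit(concepts.values())
concept_matrix = vec.transform(concepts.values())    # concept x term

def esa(text):
    """Project a text into the explicit space of concepts."""
    return vec.transform([text]) @ concept_matrix.T  # 1 x concept vector

a, b = esa("domesticated cats purr"), esa("dogs are pets")
print(cosine_similarity(a, b))                       # semantic relatedness
```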


2020 · Vol 14
Author(s): Khoirom Motilal Singh, Laiphrakpam Dolendro Singh, Themrichon Tuithung

Background: Data in the form of text, audio, images and video are used everywhere in our modern scientific world. These data are stored in physical storage, cloud storage and other storage devices. Some of these data are very sensitive and require efficient security both in storage and in transmission from sender to receiver. Objective: With the increase in data transfer operations, sufficient space is also required to store these data. Many researchers have been working to develop different encryption schemes, yet many limitations remain in their work. There is always a need for encryption schemes with smaller cipher data, faster execution time and low computation cost. Methods: A text encryption scheme based on Huffman coding and the ElGamal cryptosystem is proposed. Initially, the text data are converted to the corresponding binary bits using Huffman coding. Next, the binary bits are grouped and converted into large integer values, which are used as the input to the ElGamal cryptosystem. Results: Encryption and decryption are performed successfully; the data size is reduced by Huffman coding, and the ElGamal cryptosystem provides strong security with a smaller key size. Conclusion: Simulation results and performance analysis indicate that our encryption algorithm outperforms the existing algorithms under consideration.
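
The described pipeline can be sketched end to end: Huffman-encode the text to bits, pack the bits into a large integer, and encrypt that integer with ElGamal. The toy 127-bit Mersenne prime and generator below are for illustration only; a real deployment would use a vetted library with a 2048-bit or larger group, and longer messages would be split into integer blocks smaller than p.

```python
import heapq, random
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table for the characters of `text`."""
    heap = [(f, i, {c: ""}) for i, (c, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {c: "0" + code for c, code in c1.items()}
        merged.update({c: "1" + code for c, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Toy ElGamal parameters (127-bit Mersenne prime), NOT secure in practice.
p, g = 2**127 - 1, 3
x = random.randrange(2, p - 2)           # private key
h = pow(g, x, p)                         # public key

text = "hello huffman"
codes = huffman_codes(text)
bits = "".join(codes[c] for c in text)   # compressed bitstring
m = int("1" + bits, 2)                   # leading 1 preserves leading zeros

k = random.randrange(2, p - 2)
c1, c2 = pow(g, k, p), (m * pow(h, k, p)) % p       # encrypt
m2 = (c2 * pow(c1, p - 1 - x, p)) % p               # decrypt (Fermat inverse)
assert m2 == m   # recover the bitstring, then invert the code table for text
```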


2021 · Vol 54 (2) · pp. 1-37
Author(s): Dhivya Chandrasekaran, Vijay Mago

Estimating the semantic similarity between text data is one of the challenging open research problems in the field of Natural Language Processing (NLP). The versatility of natural language makes it difficult to define rule-based methods for determining semantic similarity measures. To address this issue, various semantic similarity methods have been proposed over the years. This survey article traces the evolution of such methods, from traditional NLP techniques such as kernel-based methods to the most recent work on transformer-based models, categorizing them based on their underlying principles as knowledge-based, corpus-based, deep neural network-based, and hybrid methods. Discussing the strengths and weaknesses of each method, this survey provides a comprehensive view of existing systems, giving new researchers a basis to experiment with and develop innovative ideas to address the issue of semantic similarity.
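
Whatever the family of method, corpus-based and neural approaches typically reduce the final comparison to vector geometry: each text becomes a vector, and relatedness is their cosine. A trivial illustration, with made-up vectors standing in for real sentence embeddings:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1 for parallel vectors, 0 for orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up embeddings; in practice these come from LSA, averaged Word2Vec
# vectors, or a transformer sentence encoder.
sent_a = np.array([0.8, 0.1, 0.3])
sent_b = np.array([0.7, 0.2, 0.4])
print(cosine(sent_a, sent_b))
```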

