Word-embedding Based Text Vectorization Using Clustering

2021 ◽  
Vol 28 (3) ◽  
pp. 292-311
Author(s):  
Vitaly I. Yuferev ◽  
Nikolai A. Razin

It is known that in natural language processing tasks, representing texts as fixed-length vectors using word-embedding models works well when the vectorized texts are short; the longer the compared texts, the worse the approach performs. This is because, when word-embedding models are used, information is lost in converting the vector representations of the words that make up a text into a vector representation of the entire text, which usually has the same dimension as the vector of a single word. This paper proposes an alternative way of using pre-trained word-embedding models for text vectorization. The essence of the proposed method is to merge semantically similar elements of the dictionary of the existing text corpus by clustering their embeddings, producing a new dictionary smaller than the original one, each element of which corresponds to one cluster. The original corpus of texts is reformulated in terms of this new dictionary, after which vectorization is performed on the reformulated texts using one of the dictionary approaches (TF-IDF was used in this work). The resulting vector representation of a text can additionally be enriched with vectors of words of the original dictionary, obtained by reducing the dimension of their embeddings for each cluster. The paper describes a series of experiments to determine the optimal parameters of the method and compares the proposed approach with other text vectorization methods on a text ranking problem: averaging word embeddings with and without TF-IDF weighting, as well as vectorization based on TF-IDF coefficients.
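A minimal sketch of the described pipeline, assuming toy random embeddings in place of a pre-trained word-embedding model; the corpus, cluster count, and cluster names are illustrative choices, not values from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "a dog lay on the rug",
    "stocks rose on strong earnings",
]

# Build the dictionary and look up an embedding for each word.
vocab = sorted({w for doc in corpus for w in doc.split()})
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in vocab}  # stand-in for word2vec/GloVe

# 1. Cluster the embeddings of dictionary words; each cluster becomes
#    one element of the new, smaller dictionary.
n_clusters = 5
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels = km.fit_predict(np.stack([embeddings[w] for w in vocab]))
word2cluster = {w: f"c{label}" for w, label in zip(vocab, labels)}

# 2. Reformulate the corpus in terms of cluster ids.
reformulated = [" ".join(word2cluster[w] for w in doc.split()) for doc in corpus]

# 3. Vectorize the reformulated texts with TF-IDF.
tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(reformulated)
print(doc_vectors.shape)  # (3, number of clusters actually used)
```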

2020 ◽  
Vol 9 (05) ◽  
pp. 25039-25046 ◽  
Author(s):  
Rahul C Kore ◽  
Prachi Ray ◽  
Priyanka Lade ◽  
Amit Nerurkar

Reading legal documents is tedious and sometimes requires domain knowledge related to the document. It is hard to read a full legal document without missing key sentences. With the increasing number of legal documents, it would be convenient to extract the essential information from a document without having to go through it in full. The purpose of this study is to make a large legal document understandable within a short time. Summarization gives the reader flexibility and convenience. Using vector representations of words, text ranking algorithms, and similarity techniques, this study gives a way to extract the highest-ranked sentences. Summarization produces a result that covers the most vital information of the document in a concise manner. The paper shows how different natural language processing concepts can be combined to produce the desired result and spare readers from going through the whole complex document. This study presents the steps required to achieve this aim and elaborates on the algorithms used at every step of the process.
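A hedged sketch of one common instantiation of such a pipeline (TextRank-style): sentence vectors, pairwise similarity, graph ranking, then the top sentences. Bag-of-words vectors stand in here for the word embeddings the paper uses:

```python
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The lessee shall pay rent monthly.",
    "Late payment incurs a penalty of two percent.",
    "The agreement may be terminated with notice.",
    "Notice must be given thirty days in advance.",
]

# Represent each sentence as a vector (pre-trained embeddings in practice).
vectors = CountVectorizer().fit_transform(sentences).toarray()

# Rank sentences with PageRank over the cosine-similarity graph.
sim = cosine_similarity(vectors)
np.fill_diagonal(sim, 0.0)
scores = nx.pagerank(nx.from_numpy_array(sim))

# Keep the top-k highest-ranked sentences, restored to original order.
k = 2
top = sorted(sorted(scores, key=scores.get, reverse=True)[:k])
print(" ".join(sentences[i] for i in top))
```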


2020 ◽  
Vol 10 (12) ◽  
pp. 4386 ◽  
Author(s):  
Sandra Rizkallah ◽  
Amir F. Atiya ◽  
Samir Shaheen

Embedding the words of a dictionary as vectors in a space has become an active research field, due to its many uses in natural language processing applications. Distances between the vectors should reflect the relatedness of the corresponding words. The problem with existing word embedding methods is that they often fail to distinguish between synonymous, antonymous, and unrelated word pairs; meanwhile, polarity detection is crucial for applications such as sentiment analysis. In this work we propose an embedding approach designed to capture polarity. The approach is based on embedding the word vectors on a sphere, whereby the dot product between any two vectors represents their similarity. Vectors of synonymous words lie close to each other on the sphere, while a word and its antonym lie at opposite poles. The vectors are designed with a simple relaxation algorithm. The proposed word embedding successfully distinguishes between synonyms, antonyms, and unrelated word pairs. It achieves results better than some state-of-the-art techniques and competes well with the others.
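An illustrative sketch of one relaxation step under the stated constraints, with synonyms attracting and antonyms pushed toward opposite poles of the unit sphere; the update rule and step size are assumptions, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
words = ["good", "great", "bad"]
vecs = {w: v / np.linalg.norm(v)
        for w, v in ((w, rng.normal(size=10)) for w in words)}

synonyms = [("good", "great")]
antonyms = [("good", "bad")]
lr = 0.1

for _ in range(200):
    for a, b in synonyms:   # move synonym vectors toward each other
        vecs[a] += lr * (vecs[b] - vecs[a])
        vecs[b] += lr * (vecs[a] - vecs[b])
    for a, b in antonyms:   # move antonym vectors toward each other's negation
        vecs[a] += lr * (-vecs[b] - vecs[a])
        vecs[b] += lr * (-vecs[a] - vecs[b])
    for w in vecs:          # project back onto the unit sphere
        vecs[w] /= np.linalg.norm(vecs[w])

print(round(float(vecs["good"] @ vecs["great"]), 2))  # close to +1 (synonyms)
print(round(float(vecs["good"] @ vecs["bad"]), 2))    # close to -1 (antonyms)
```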


2019 ◽  
Vol 5 ◽  
pp. e195 ◽  
Author(s):  
Prashanth Gurunath Shivakumar ◽  
Panayiotis Georgiou

Word vector representations are a crucial part of natural language processing (NLP) and human-computer interaction. In this paper, we propose a novel word vector representation, Confusion2Vec, motivated by human speech production and perception, that encodes representational ambiguity. Humans employ both acoustic similarity cues and contextual cues to decode information, and we focus on a model that incorporates both sources of information. The representational ambiguity of acoustics, which manifests itself in word confusions, is often resolved by both humans and machines through contextual cues. Beyond acoustic perception, a range of representational ambiguities can emerge in various domains, such as morphological transformations, word segmentation, and paraphrasing for NLP tasks like machine translation. In this work, we present a case study applying the idea to the automatic speech recognition (ASR) task, where the word representational ambiguities/confusions stem from acoustic similarity. We present several techniques for training representations that encode acoustic perceptual similarity; we term the result Confusion2Vec and learn it on unsupervised data generated from ASR confusion networks or lattice-like structures. Evaluations of Confusion2Vec are formulated to gauge acoustic similarity in addition to semantic-syntactic and word similarity. Confusion2Vec is able to model word confusions efficiently without compromising the semantic-syntactic word relations, thus effectively enriching the word vector space with extra task-relevant ambiguity information. We provide an intuitive exploration of the two-dimensional Confusion2Vec space using principal component analysis of the embedding, relating it to semantic, syntactic, and acoustic relationships. We show that the new space preserves the semantic/syntactic relationships while robustly encoding acoustic similarities. The potential of the new vector representation and its ability to exploit the uncertainty information associated with the lattice are demonstrated through small examples relating to the task of ASR error correction.
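A hedged sketch of one way to expose acoustic confusions to a word2vec-style learner: sample alternative paths through an ASR confusion network and train on them. This only approximates the idea; the paper's actual training schemes differ. Assumes gensim:

```python
import random
from gensim.models import Word2Vec

# A confusion network: one set of competing word hypotheses per time slot.
confusion_net = [
    ["i", "eye"],
    ["scream"],
    ["for", "four"],
    ["ice"],
    ["cream"],
]

random.seed(0)
paths = [[random.choice(slot) for slot in confusion_net] for _ in range(500)]

# Acoustically confusable words ("for"/"four", "i"/"eye") now share contexts
# across sampled paths, so their vectors end up close to each other,
# alongside ordinary co-occurrence structure.
model = Word2Vec(paths, vector_size=16, window=2, min_count=1, sg=1, epochs=5)
print(model.wv.similarity("for", "four"))
```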


Word embedding, in simple terms, can be defined as representing text in the form of vectors. Vector representations of text help in finding similarities, because words that regularly appear in similar contexts tend to appear in close proximity in the vector space. The motivating factor behind such a numerical representation of a text corpus is that it can be manipulated arithmetically just like any other vector. Deep learning and neural networks are not new at all; both concepts have been around for decades, but they were long held back by the unavailability and inaccessibility of computational power. Deep learning is now used effectively in natural language processing, thanks to improvements in techniques such as word embedding, mobile enablement, and attention. The paper discusses the two popular Word2Vec word-embedding models (CBOW and Skip-gram), shows how they can be used for deep learning, and compares them. The implementation steps of the Skip-gram model are also discussed, as are open challenges for the Word2Vec model.
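A minimal Skip-gram sketch using gensim, where `sg=1` selects Skip-gram over CBOW; the corpus and hyperparameters are illustrative:

```python
from gensim.models import Word2Vec

corpus = [
    ["natural", "language", "processing", "with", "word", "embeddings"],
    ["word", "embeddings", "map", "words", "to", "vectors"],
    ["similar", "words", "get", "similar", "vectors"],
]

# sg=1: Skip-gram (predict context from the center word); sg=0 would be CBOW.
model = Word2Vec(corpus, vector_size=32, window=2, min_count=1, sg=1, epochs=50)

# Words sharing contexts drift toward each other in the vector space.
print(model.wv.most_similar("word", topn=3))
```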


2020 ◽  
Vol 10 (17) ◽  
pp. 5996
Author(s):  
Xiaojun Kang ◽  
Bing Li ◽  
Hong Yao ◽  
Qingzhong Liang ◽  
Shengwen Li ◽  
...  

A sememe is the smallest semantic unit for describing real-world concepts, and sememes improve the interpretability and performance of natural language processing (NLP). To keep sememe descriptions accurate, the underlying knowledge base needs to be continuously updated, which is time-consuming and labor-intensive. Sememe prediction can assign sememes to unlabeled words and is valuable for automatically building and/or updating sememe knowledge bases (KBs). Existing methods are overly dependent on the quality of the word embedding vectors, so accurate sememe prediction remains a challenge. To address this problem, this study proposes a novel model that improves sememe prediction by introducing synonyms. The model scores candidate sememes from synonyms by combining distances of words in the embedding vector space, and derives an attention-based strategy to dynamically balance the two kinds of knowledge, from the synonymous word set and from the word embedding vectors. A series of experiments shows that the proposed model significantly improves sememe prediction accuracy. The model provides a methodological reference for commonsense KB updating and the embedding of commonsense knowledge.
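A hedged sketch of synonym-based sememe scoring: candidate sememes come from a word's synonyms, weighted by embedding similarity. The paper's attention-based balancing is reduced here to a single interpolation weight `alpha`; the toy embeddings, KB, and synonym set are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
emb = {w: rng.normal(size=20) for w in ["happy", "glad", "joyful", "sad"]}
sememe_kb = {                      # known word -> sememe labels
    "glad": {"emotion", "positive"},
    "joyful": {"emotion", "positive", "intense"},
    "sad": {"emotion", "negative"},
}
synonyms = {"happy": ["glad", "joyful"]}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_sememes(word, alpha=0.7):
    """Vote for candidate sememes; alpha balances synonym evidence against
    plain embedding-neighbor evidence (a stand-in for learned attention)."""
    scores = {}
    for syn in synonyms.get(word, []):
        for s in sememe_kb[syn]:
            scores[s] = scores.get(s, 0.0) + alpha * cos(emb[word], emb[syn])
    for other, sememes in sememe_kb.items():
        for s in sememes:
            scores[s] = scores.get(s, 0.0) + (1 - alpha) * cos(emb[word], emb[other])
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score_sememes("happy"))
```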


2017 ◽  
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similar to the Word2vec models, where vectors of closely related words lie in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can then be encoded as vectors by summing the vectors of their individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can easily be combined with ProtVec, which applies the same Word2vec concept to protein sequences, resulting in a proteochemometric approach that is alignment-independent and can thus also be used for proteins with low sequence similarity.
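A hedged sketch of the compound-encoding step only: a molecule becomes the sum of its substructure vectors. The substructure identifiers below are made up for illustration; in Mol2vec they are Morgan-algorithm identifiers, and the vectors come from a Word2vec model trained on the compound corpus:

```python
import numpy as np

rng = np.random.default_rng(3)
substructure_vecs = {sid: rng.normal(size=100)
                     for sid in ["847433064", "2246728737", "864942730"]}

def encode_compound(substructure_ids):
    """Sum the vectors of a compound's substructures (unknown ids skipped)."""
    vecs = [substructure_vecs[s] for s in substructure_ids if s in substructure_vecs]
    return np.sum(vecs, axis=0) if vecs else np.zeros(100)

# An identifier may repeat when several atoms share the same environment.
compound = ["847433064", "2246728737", "2246728737"]
vec = encode_compound(compound)
print(vec.shape)  # (100,) -> ready as input to a supervised property model
```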


2021 ◽  
Vol 15 (3) ◽  
pp. 1-19
Author(s):  
Wei Wang ◽  
Feng Xia ◽  
Jian Wu ◽  
Zhiguo Gong ◽  
Hanghang Tong ◽  
...  

While scientific collaboration is critical for a scholar, some collaborators can be more significant than others, e.g., lifetime collaborators. It has been shown that lifetime collaborators have more influence on a scholar's academic performance, yet little research has investigated the prediction of such special relationships in academic networks. To this end, we propose Scholar2vec, a novel neural network embedding for representing scholar profiles. First, our approach creates a scholar's research interest vector from textual information, such as demographics, research, and influence. After bridging research interests with a collaboration network, vector representations of scholars can be obtained with graph learning. Meanwhile, since scholars are characterized by various attributes, we propose incorporating four types of scholar attributes when learning scholar vectors. Finally, the early-stage similarity sequence based on Scholar2vec is used to predict lifetime collaborators with machine learning methods. Extensive experiments on two real-world datasets show that Scholar2vec outperforms state-of-the-art methods in lifetime collaborator prediction. Our work presents a new way to measure the similarity between two scholars by vector representation, bridging network embedding and academic relationship mining.
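A hedged sketch of the final prediction step: an early-stage similarity sequence between two scholars' yearly vectors, fed to an off-the-shelf classifier. The scholar vectors would come from the graph-learning stage; random vectors and labels stand in here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

def similarity_sequence(vecs_a, vecs_b):
    """Cosine similarity per early-career year for a scholar pair."""
    return [float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in zip(vecs_a, vecs_b)]

# Toy data: 200 scholar pairs, 5 early-career years, 32-dim vectors per year.
X = np.array([similarity_sequence(rng.normal(size=(5, 32)),
                                  rng.normal(size=(5, 32))) for _ in range(200)])
y = rng.integers(0, 2, size=200)  # 1 = pair became lifetime collaborators

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:3]))
```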


2018 ◽  
Vol 25 (6) ◽  
pp. 726-733
Author(s):  
Maria S. Karyaeva ◽  
Pavel I. Braslavski ◽  
Valery A. Sokolov

The ability to identify semantic relations between words has made the word2vec model widely used in NLP tasks. The idea of word2vec is based on a simple rule: higher similarity is reached when two words share a similar context. Each word can be represented as a vector, so the closest vectors can be interpreted as similar words, which makes it possible to establish semantic relations (synonymy, hypernymy, hyponymy, and other relations) by automatic extraction. Extracting semantic relations by hand is a time-consuming and biased task that requires the help of experts. Unfortunately, the word2vec model provides an associative list of words that does not consist of related words only. In this paper, we show some additional criteria that may be applicable to solving this problem. Observations and experiments with well-known characteristics, such as word frequency and position in the associative list, can help improve the extraction of semantic relations for the Russian language using word embeddings. In the experiments, a word2vec model trained on the Flibusta corpus is used, with pairs from Wiktionary serving as examples of semantic relationships. Semantically related words are applicable to thesauri, ontologies, and intelligent systems for natural language processing.
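A hedged sketch of the filtering idea: prune a word2vec associative list using word frequency and list position as additional criteria. The tiny corpus, model settings, and thresholds are illustrative assumptions, not the paper's setup:

```python
from collections import Counter
from gensim.models import Word2Vec

corpus = [
    ["большой", "дом", "стоит", "у", "реки"],
    ["огромный", "дом", "стоит", "у", "озера"],
    ["большой", "огромный", "высокий", "дом"],
] * 50

model = Word2Vec(corpus, vector_size=16, window=2, min_count=1, sg=1, epochs=20)
freq = Counter(w for sent in corpus for w in sent)

def related_candidates(word, topn=10, min_freq=50, max_rank=5):
    """Keep only neighbors that are frequent enough and high in the list."""
    out = []
    for rank, (cand, sim) in enumerate(model.wv.most_similar(word, topn=topn)):
        if freq[cand] >= min_freq and rank < max_rank:
            out.append((cand, round(sim, 2)))
    return out

print(related_candidates("большой"))
```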


Vector representations of language have been shown to be useful in a number of natural language processing tasks. In this paper, we investigate the effectiveness of word vector representations for the problem of sentiment analysis. In particular, we target three sub-tasks: sentiment word extraction, detection of the polarity of sentiment words, and text sentiment prediction. We investigate the effectiveness of vector representations over different text data and evaluate the quality of domain-dependent vectors. Vector representations are used to compute various vector-based features, and systematic experiments are conducted to demonstrate their effectiveness. Using simple vector-based features achieves better results for app text sentiment analysis.
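A hedged sketch of one feature from this family: averaging a text's word vectors and feeding the result to a linear classifier. Toy embeddings and labels stand in for the paper's domain-dependent vectors and data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
emb = {w: rng.normal(size=25) for w in
       ["great", "app", "love", "crash", "slow", "bad", "it", "this"]}

def avg_vector(text):
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(25)

texts = ["love this great app", "bad app crash", "great app", "slow bad crash"]
labels = [1, 0, 1, 0]  # 1 = positive sentiment

X = np.stack([avg_vector(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([avg_vector("love this app")]))
```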

