Confusion2Vec: towards enriching vector space word representations with representational ambiguities

2019, Vol. 5, pp. e195
Author(s): Prashanth Gurunath Shivakumar, Panayiotis Georgiou

Word vector representations are a crucial part of natural language processing (NLP) and human-computer interaction. In this paper, we propose a novel word vector representation, Confusion2Vec, motivated by human speech production and perception, that encodes representational ambiguity. Humans employ both acoustic similarity cues and contextual cues to decode information, and we focus on a model that incorporates both sources of information. The representational ambiguity of acoustics, which manifests itself in word confusions, is often resolved by both humans and machines through contextual cues. Beyond acoustic perception, a range of representational ambiguities can emerge in various domains, such as morphological transformations, word segmentation, and paraphrasing for NLP tasks like machine translation. In this work, we present a case study applying the model to the automatic speech recognition (ASR) task, where word representational ambiguities/confusions are related to acoustic similarity. We present several techniques to train this acoustic-perceptual-similarity representation, termed Confusion2Vec, on data generated without supervision from ASR confusion networks or lattice-like structures. Appropriate evaluations are formulated for gauging acoustic similarity in addition to semantic-syntactic and word similarity evaluations. Confusion2Vec is able to model word confusions efficiently without compromising the semantic-syntactic word relations, thus effectively enriching the word vector space with extra task-relevant ambiguity information. We provide an intuitive exploration of the two-dimensional Confusion2Vec space using principal component analysis of the embedding, relating it to semantic, syntactic, and acoustic relationships. We show that the new space preserves the semantic/syntactic relationships while robustly encoding acoustic similarities. The potential of the new vector representation and its ability to utilize the uncertainty information associated with the lattice is demonstrated through small examples relating to the task of ASR error correction.
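To make the training idea concrete, here is a minimal sketch (not the authors' implementation) that approximates intra-confusion training by sampling paths through a toy ASR confusion network and fitting a standard skip-gram model with gensim; the network, its posteriors, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch: approximate Confusion2Vec-style training by sampling
# paths through an ASR confusion network and training a skip-gram model on them.
import random
from gensim.models import Word2Vec

# A confusion network: a list of time bins, each bin a list of
# (word, posterior) alternatives produced by the ASR decoder.
confusion_net = [
    [("i", 0.9), ("eye", 0.1)],
    [("scream", 0.6), ("scheme", 0.4)],
    [("for", 0.8), ("four", 0.2)],
    [("ice", 0.7), ("eyes", 0.3)],
    [("cream", 1.0)],
]

def sample_paths(net, n_samples=200):
    """Sample word sequences, drawing one alternative per bin by posterior."""
    paths = []
    for _ in range(n_samples):
        path = []
        for bin_ in net:
            words, probs = zip(*bin_)
            path.append(random.choices(words, weights=probs, k=1)[0])
        paths.append(path)
    return paths

sentences = sample_paths(confusion_net)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=20)
# Confusable words now share contexts, so their vectors move closer:
print(model.wv.similarity("scream", "scheme"))
```

Because confusable alternatives occupy the same slot across sampled paths, they end up sharing contexts, which is one way to fold acoustic confusability into the embedding space.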

2021, Vol. 28 (3), pp. 292-311
Author(s): Vitaly I. Yuferev, Nikolai A. Razin

It is known that in natural language processing tasks, representing texts as fixed-length vectors using word-embedding models works well only when the vectorized texts are short. The longer the texts being compared, the worse the approach works. This is because converting the vector representations of a text's constituent words into a vector representation of the entire text, which usually has the same dimension as a single word vector, loses information. This paper proposes an alternative way of using pre-trained word-embedding models for text vectorization. The essence of the proposed method is to merge semantically similar elements of the dictionary of the existing text corpus by clustering their embeddings, forming a new dictionary, smaller than the original, in which each element corresponds to one cluster. The original corpus of texts is reformulated in terms of this new dictionary, after which vectorization is performed on the reformulated texts using one of the dictionary approaches (TF-IDF in this work). The resulting vector representation of a text can be further enriched using the word vectors of the original dictionary, obtained by reducing the dimension of their embeddings within each cluster. The paper describes a series of experiments to determine the optimal parameters of the method and compares the proposed approach with other text vectorization methods on a text ranking problem: averaging word embeddings with and without TF-IDF weighting, and vectorization based on TF-IDF coefficients alone.
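A minimal sketch of the described pipeline, assuming scikit-learn is available; the toy corpus, the random stand-in embeddings, and the cluster count are all hypothetical, and a real application would use a pre-trained word-embedding model.

```python
# Sketch: cluster the vocabulary's embeddings, reformulate texts in cluster
# tokens, then vectorize the reformulated corpus with TF-IDF.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["cheap flight to paris", "inexpensive plane ticket to france",
         "recipe for apple pie"]

# Hypothetical pre-trained embeddings for the corpus vocabulary.
vocab = sorted({w for t in texts for w in t.split()})
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in vocab}  # stand-in for real vectors

# 1. Cluster vocabulary embeddings into a smaller "dictionary" of clusters.
X = np.array([emb[w] for w in vocab])
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
word2cluster = {w: f"c{c}" for w, c in zip(vocab, labels)}

# 2. Reformulate each text in terms of cluster tokens.
reformulated = [" ".join(word2cluster[w] for w in t.split()) for t in texts]

# 3. Vectorize the reformulated corpus with TF-IDF.
tfidf = TfidfVectorizer().fit_transform(reformulated)
print(tfidf.shape)  # (n_texts, n_used_clusters)
```

With real embeddings, semantically similar words such as "cheap" and "inexpensive" would fall into one cluster, so the first two texts would share TF-IDF features despite having no surface words in common.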


Author(s): Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov

Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram; words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and it allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
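The subword idea can be illustrated with a short sketch: a word vector is the sum of its character n-gram vectors, so even out-of-vocabulary words receive a representation. The hashed bucket table below is an untrained stand-in; in the real model these vectors are learned with the skipgram objective.

```python
# Sketch of subword composition: word vector = sum of character n-gram vectors.
import numpy as np

DIM, BUCKETS = 50, 2**20
rng = np.random.default_rng(0)
ngram_table = rng.normal(scale=0.1, size=(BUCKETS, DIM))  # stand-in for trained vectors

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' plus the full token, with boundary markers."""
    w = f"<{word}>"
    grams = {w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)
    return grams

def word_vector(word):
    """Sum of the (hashed) n-gram vectors; works for unseen words too."""
    return sum(ngram_table[hash(g) % BUCKETS] for g in char_ngrams(word))

v = word_vector("unbelievable")  # never needs the word in a vocabulary
print(v.shape)
```

Hashing n-grams into a fixed number of buckets keeps the parameter count bounded regardless of vocabulary size, which is part of what makes the method fast on large corpora.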


Author(s): Md. Rajib Hossain, Mohammed Moshiul Hoque

Distributional word vector representation, or word embedding, has become an essential ingredient in many natural language processing (NLP) tasks such as machine translation, document classification, information retrieval, and question answering. Investigation of embedding models helps to reduce the feature space and improves the capture of semantic as well as syntactic relations in text. This paper presents three embedding techniques (Word2Vec, GloVe, and FastText) with different hyperparameters, implemented on a Bengali corpus of 180 million words. The performance of the embedding techniques is evaluated both extrinsically and intrinsically. Extrinsic performance was evaluated by text classification, which achieved a maximum accuracy of 96.48%. Intrinsic performance was evaluated by word similarity (semantic, syntactic, and relatedness) and word analogy tasks. The maximum Pearson correlation (r̂) reached 60.66% for semantic similarity and 71.64% for syntactic similarity, while relatedness reached 79.80%. The semantic word analogy task achieved 44.00% accuracy, and the syntactic word analogy task 36.00%.
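As an illustration of the intrinsic evaluation, the sketch below computes the Pearson correlation between model cosine similarities and human ratings; the rating triples and the random stand-in embeddings are hypothetical, not the paper's data.

```python
# Sketch of the word-similarity evaluation: Pearson correlation between
# model cosine similarities and human judgements.
import numpy as np
from scipy.stats import pearsonr

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def evaluate_similarity(pairs, emb):
    """pairs: [(word1, word2, human_score), ...]; emb: word -> vector."""
    model_scores, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 in emb and w2 in emb:
            model_scores.append(cosine(emb[w1], emb[w2]))
            human_scores.append(score)
    r, _ = pearsonr(model_scores, human_scores)
    return r

# Toy usage with random vectors standing in for trained Bengali embeddings.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["রাজা", "রানী", "বই"]}
pairs = [("রাজা", "রানী", 8.5), ("রাজা", "বই", 1.2), ("রানী", "বই", 1.0)]
print(evaluate_similarity(pairs, emb))
```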


2017
Author(s): Sabrina Jaeger, Simone Fulle, Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can then be encoded as vectors by summing the vectors of their individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can be easily combined with ProtVec, which applies the same Word2vec concept to protein sequences, resulting in a proteochemometric approach that is alignment-independent and can thus also be easily used for proteins with low sequence similarities.
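The compound-encoding step can be sketched as follows; the substructure-identifier "sentences" are toy stand-ins (Mol2vec derives them from Morgan fingerprints, typically via RDKit), and the hyperparameters are illustrative.

```python
# Sketch: a compound vector is the sum of its substructure vectors from a
# Word2vec-style model trained on substructure-identifier "sentences".
import numpy as np
from gensim.models import Word2Vec

# Each "sentence" lists the substructure identifiers of one compound.
corpus = [
    ["847433064", "2245384272", "864942730"],
    ["847433064", "864662311", "2246728737"],
    ["2245384272", "864942730", "2246728737"],
]
model = Word2Vec(corpus, vector_size=100, window=10, min_count=1, sg=1, epochs=50)

def compound_vector(substructures, model):
    """Sum the vectors of known substructures to encode the whole compound."""
    vecs = [model.wv[s] for s in substructures if s in model.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(compound_vector(corpus[0], model).shape)  # (100,)
```

Summation yields a dense, fixed-length compound representation, in contrast to sparse fingerprint bit vectors, which is the drawback the abstract highlights.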


2021, Vol. 11 (3), pp. 359
Author(s): Katharina Hogrefe, Georg Goldenberg, Ralf Glindemann, Madleen Klonowski, Wolfram Ziegler

Assessment of semantic processing capacities often relies on verbal tasks, which are, however, sensitive to impairments at several language processing levels. Especially for persons with aphasia, there is a strong need for a tool that measures semantic processing skills independently of verbal abilities. Furthermore, in order to assess a patient's potential for using alternative means of communication in cases of severe aphasia, semantic processing should be assessed in different nonverbal conditions. The Nonverbal Semantics Test (NVST) is a tool that captures semantic processing capacities through three tasks: Semantic Sorting, Drawing, and Pantomime. The main aim of the current study was to investigate the relationship between the NVST and measures of standard neurolinguistic assessment. Fifty-one persons with aphasia caused by left-hemisphere brain damage were administered the NVST as well as the Aachen Aphasia Test (AAT). A principal component analysis (PCA) was conducted across all AAT and NVST subtests. The analysis resulted in a two-factor model that captured 69% of the variance of the original data, with all linguistic tasks loading high on one factor and the NVST subtests loading high on the other. These findings suggest that nonverbal tasks assessing semantic processing capacities should be administered alongside standard neurolinguistic aphasia tests.
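For readers unfamiliar with the analysis, a hedged sketch of a two-component PCA over a subtest score matrix is shown below; the simulated scores and the verbal/nonverbal split are purely illustrative, not the study's data.

```python
# Sketch: PCA across subtest scores, keeping two components and inspecting
# loadings to see which subtests group together.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_patients = 51
# Simulated scores: columns 0-4 "linguistic" subtests, 5-7 "nonverbal" subtests.
verbal = rng.normal(size=(n_patients, 1))
nonverbal = rng.normal(size=(n_patients, 1))
scores = np.hstack([verbal + 0.3 * rng.normal(size=(n_patients, 5)),
                    nonverbal + 0.3 * rng.normal(size=(n_patients, 3))])

pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(scores))
print("explained variance:", pca.explained_variance_ratio_.sum())
print("loadings:\n", pca.components_.T.round(2))  # rows = subtests
```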


2021, Vol. 15 (3), pp. 1-19
Author(s): Wei Wang, Feng Xia, Jian Wu, Zhiguo Gong, Hanghang Tong, ...

While scientific collaboration is critical for a scholar, some collaborators can be more significant than others, e.g., lifetime collaborators. It has been shown that lifetime collaborators are more influential on a scholar's academic performance, yet little research has investigated the prediction of such special relationships in academic networks. To this end, we propose Scholar2vec, a novel neural network embedding for representing scholar profiles. First, our approach creates a scholar's research-interest vector from textual information, such as demographics, research, and influence. After bridging research interests with a collaboration network, vector representations of scholars can be obtained with graph learning. Meanwhile, since scholars are characterized by various attributes, we propose to incorporate four types of scholar attributes when learning scholar vectors. Finally, the early-stage similarity sequence based on Scholar2vec is used to predict lifetime collaborators with machine learning methods. Extensive experiments on two real-world datasets show that Scholar2vec outperforms state-of-the-art methods in lifetime collaborator prediction. Our work presents a new way to measure the similarity between two scholars by vector representation, bridging network embedding and academic relationship mining.
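A minimal sketch of the final prediction step, assuming scholar vectors per early-career year are already available; the simulated vectors, labels, and classifier choice are illustrative stand-ins.

```python
# Sketch: an early-stage similarity sequence between two scholars' yearly
# vectors is used as the feature vector for a binary classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def similarity_sequence(vecs_a, vecs_b):
    """Cosine similarity per early-career year between two scholars."""
    sims = []
    for a, b in zip(vecs_a, vecs_b):
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return np.array(sims)

# Simulated data: 200 scholar pairs, 5 early-career yearly vectors each.
n_pairs, n_years, dim = 200, 5, 32
X = np.array([similarity_sequence(rng.normal(size=(n_years, dim)),
                                  rng.normal(size=(n_years, dim)))
              for _ in range(n_pairs)])
y = rng.integers(0, 2, size=n_pairs)  # 1 = became lifetime collaborators

clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))
```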


2021, Vol. 13 (4), pp. 2292
Author(s): Aneta Ptak-Chmielewska, Agnieszka Chłoń-Domińczak

Micro, small and medium enterprises (MSMEs) represent more than 99% of enterprises in Europe. Therefore, knowledge about this sector, including its spatial context, is important to understand the patterns of economic and social development. The main goal of this article is an analysis of spatial conditions and the situation of MSMEs at the local level using combined sources of information. This includes data collected in the Social Insurance Institution and tax registers in Poland, which provide information on the employment, wages, revenues, and taxes paid by MSMEs at the local level, as well as contextual statistical information. The data are used for a diagnosis of spatial circumstances and a discussion of conditions influencing the status of the MSME sector in a selected region (voivodeship) in Poland. Taxonomic methods, including factor analysis and clustering based on k-means and Kohonen self-organizing maps (SOM), were used for selecting significant information and grouping the local units according to the situation of the MSMEs. Principal component analysis revealed eight factors, and five clusters of local units were distinguished using these factors. These include two clusters with a high share of rural local units and two clusters with a high share of rural-urban and urban local units. Additionally, there was an outstanding cluster with only two dominant urban local units. The factors show differences between clusters in the situation of the MSME sector and in infrastructure. Different spatial conditions in different regions influence the situation of MSMEs.
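A compact sketch of the taxonomy pipeline, factors from principal components followed by k-means on the factor scores, is given below; the indicator matrix and the number of local units are simulated assumptions, with only the eight-factor/five-cluster setup taken from the article.

```python
# Sketch: reduce local-unit indicators to factor scores, then cluster.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
indicators = rng.normal(size=(314, 25))  # 314 hypothetical local units, 25 indicators

scores = PCA(n_components=8).fit_transform(StandardScaler().fit_transform(indicators))
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scores)
print(np.bincount(clusters))  # cluster sizes
```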


2011, Vol. 14 (2), pp. 101-123
Author(s): Małgorzata Koszewska

An overview of the Western European literature shows that one of the most distinct consumption trends noted in recent years is a globally increasing environmental and social awareness. The issue of consumers' behaviours and attitudes towards "socially responsible products" has been gaining importance in the Polish economy as well. This article evaluates the development prospects of ethical and ecological consumption in Poland vis-a-vis Western European countries. The comparative analysis in the article uses primary sources of information, i.e., interviews with a representative sample of Polish adults, as well as secondary sources of information. A factor analysis, or more precisely a principal component analysis, allowed Polish consumers to be divided into groups that were typologically homogeneous with respect to their sensitivity to various aspects of business ethics and ecology.


2020, Vol. 8, pp. 311-329
Author(s): Kushal Arora, Aishik Chakraborty, Jackie C. K. Cheung

In this paper, we propose LexSub, a novel approach towards unifying lexical and distributional semantics. We inject knowledge about lexical-semantic relations into distributional word embeddings by defining subspaces of the distributional vector space in which a lexical relation should hold. Our framework can handle symmetric attract and repel relations (e.g., synonymy and antonymy, respectively), as well as asymmetric relations (e.g., hypernymy and meronymy). In a suite of intrinsic benchmarks, we show that our model outperforms previous approaches on relatedness tasks and on hypernymy classification and detection, while being competitive on word similarity tasks. It also outperforms previous systems on extrinsic classification tasks that benefit from exploiting lexical relational cues. We perform a series of analyses to understand the behaviors of our model. Code is available at https://github.com/aishikchakraborty/LexSub.
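To illustrate the subspace idea, here is a hedged PyTorch sketch that learns one projection under margin losses so that synonyms attract and antonyms repel in the projected space; the margins, loss form, and toy pairs are illustrative and not the paper's exact formulation.

```python
# Sketch: learn a linear subspace in which synonyms attract (high cosine)
# and antonyms repel (low cosine), via margin losses on frozen embeddings.
import torch
import torch.nn.functional as F

dim, subdim = 100, 20
torch.manual_seed(0)
emb = torch.randn(1000, dim)                      # frozen distributional embeddings
W = torch.randn(dim, subdim, requires_grad=True)  # subspace for this relation
opt = torch.optim.Adam([W], lr=1e-2)

synonyms = torch.randint(0, 1000, (64, 2))  # toy attract pairs (indices)
antonyms = torch.randint(0, 1000, (64, 2))  # toy repel pairs

for step in range(100):
    sa = F.cosine_similarity(emb[synonyms[:, 0]] @ W, emb[synonyms[:, 1]] @ W)
    sr = F.cosine_similarity(emb[antonyms[:, 0]] @ W, emb[antonyms[:, 1]] @ W)
    # Attract: push synonym similarity above a margin; repel: push antonyms below.
    loss = F.relu(0.8 - sa).mean() + F.relu(sr - 0.2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```

Keeping the base embeddings frozen and confining the relation to a learned projection is what lets the distributional space stay intact while the subspace encodes the lexical constraint.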


2017
Author(s): Dat Duong, Wasi Uddin Ahmad, Eleazar Eskin, Kai-Wei Chang, Jingyi Jessica Li

The Gene Ontology (GO) database contains GO terms that describe biological functions of genes. Previous methods for comparing GO terms have relied on the fact that GO terms are organized into a tree structure; in this paradigm, the locations of two GO terms in the tree dictate their similarity score. In this paper, we introduce two new solutions to this problem that focus instead on the definitions of the GO terms, applying neural-network-based techniques from the natural language processing (NLP) domain. The first method does not rely on the GO tree, whereas the second depends on it indirectly. In our first approach, we compare two GO definitions by treating them as two unordered sets of words; word similarity is estimated by a word embedding model that maps words into an N-dimensional space. In our second approach, we account for the word ordering within a sentence, using a sentence encoder to embed GO definitions into vectors and estimate how likely one definition entails another. We validate our methods in two ways. In the first experiment, we test the model's ability to differentiate a true protein-protein network from a randomly generated network. In the second experiment, we test the model in identifying orthologs from randomly matched genes in human, mouse, and fly. In both experiments, a hybrid of the NLP and GO-tree-based methods achieves the best classification accuracy.
Availability: github.com/datduong/NLPMethods2CompareGOterms
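The first approach can be sketched as a symmetrized best-match similarity between the two definitions' word sets; the toy embedding table below stands in for a trained word embedding model.

```python
# Sketch: compare two GO definitions as unordered word sets, scoring each
# word by its best match in the other definition and averaging.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in
       "regulation of cell cycle division process growth".split()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def definition_similarity(def_a, def_b, emb):
    """Symmetrized average best-match similarity between two word sets."""
    a = [emb[w] for w in def_a.split() if w in emb]
    b = [emb[w] for w in def_b.split() if w in emb]
    best_a = np.mean([max(cosine(x, y) for y in b) for x in a])
    best_b = np.mean([max(cosine(y, x) for x in a) for y in b])
    return (best_a + best_b) / 2

print(definition_similarity("regulation of cell cycle",
                            "cell division process", emb))
```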

