Frequency estimates for statistical word similarity measures

Author(s):  
Egidio Terra ◽  
C. L. A. Clarke

Author(s):  
Niloofer Shanavas ◽  
Hui Wang ◽  
Zhiwei Lin ◽  
Glenn Hawe

Abstract
Automatic text classification using machine learning is significantly affected by the text representation model. The structural information in text is necessary for natural language understanding, which is usually ignored in vector-based representations. In this paper, we present a graph kernel-based text classification framework which utilises the structural information in text effectively through the weighting and enrichment of a graph-based representation. We introduce weighted co-occurrence graphs to represent text documents, which weight the terms and their dependencies based on their relevance to text classification. We propose a novel method to automatically enrich the weighted graphs using semantic knowledge in the form of a word similarity matrix. The similarity between enriched graphs, knowledge-driven graph similarity, is calculated using a graph kernel. The semantic knowledge in the enriched graphs ensures that the graph kernel goes beyond exact matching of terms and patterns to compute the semantic similarity of documents. In the experiments on sentiment classification and topic classification tasks, our knowledge-driven similarity measure significantly outperforms the baseline text similarity measures on five benchmark text classification datasets.
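A minimal sketch of the pipeline described above, under assumed illustrative choices: the term-weighting, the enrichment rule, and the kernel below are stand-ins, not the authors' exact formulations.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(tokens, window=2, term_weight=None):
    """Weighted co-occurrence graph: nodes are terms, edge weights
    accumulate (optionally term-weighted) co-occurrence counts
    within a sliding window."""
    weights = term_weight or {}
    graph = defaultdict(float)
    for i, t in enumerate(tokens):
        for u in tokens[i + 1:i + window]:
            if t != u:
                graph[frozenset((t, u))] += weights.get(t, 1.0) * weights.get(u, 1.0)
    return graph

def enrich(graph, sim, threshold=0.7):
    """Enrich the graph with semantic knowledge: connect term pairs whose
    similarity in an external word-similarity matrix (dict of dicts) is high."""
    terms = {t for edge in graph for t in edge}
    enriched = dict(graph)
    for a, b in combinations(terms, 2):
        s = sim.get(a, {}).get(b, 0.0)
        if s >= threshold:
            edge = frozenset((a, b))
            enriched[edge] = enriched.get(edge, 0.0) + s
    return enriched

def edge_kernel(g1, g2):
    """A simple graph kernel (dot product over shared weighted edges),
    standing in for the paper's kernel."""
    return sum(w * g2[e] for e, w in g1.items() if e in g2)
```

With enrichment, two documents that use different but similar terms come to share edges, so the kernel captures semantic matches rather than only exact ones.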


2014 ◽  
Vol 17 ◽  
Author(s):  
Joana Acha ◽  
Itziar Laka ◽  
Josu Landa ◽  
Pello Salaburu

Abstract
This article presents EHME, the frequency dictionary of Basque structure, an online program that enables researchers in psycholinguistics to extract word and nonword stimuli based on a broad range of statistics concerning the properties of Basque words. The database consists of 22.7 million tokens, and the available properties include morphological structure frequency and word-similarity measures, in addition to classical indexes: word frequency, orthographic structure, orthographic similarity, bigram and biphone frequency, and syllable-based measures. Measures are indexed at the lemma, morpheme and word level. We include reliability and validation analyses. The application is freely available, and enables the user to extract words that meet concrete statistical criteria, as well as to obtain statistical characteristics from a list of words.
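Two of the classical indexes listed above are easy to illustrate from a tokenised corpus; the sketch below shows relative word frequency and character-bigram counts (EHME is an online application, and its internals are not reproduced here).

```python
from collections import Counter

def word_frequencies(tokens):
    """Relative word frequency over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def bigram_frequencies(tokens):
    """Character-bigram counts pooled over all word tokens."""
    bigrams = Counter()
    for w in tokens:
        bigrams.update(w[i:i + 2] for i in range(len(w) - 1))
    return bigrams
```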


Computers ◽  
2020 ◽  
Vol 9 (4) ◽  
pp. 81
Author(s):  
Sebastião Pais ◽  
Gaël Dias

In this work, we present a new unsupervised and language-independent methodology to detect relations of textual generality. For this, we introduce a particular case of Textual Entailment (TE), namely Textual Entailment by Generality (TEG). TE aims to capture primary semantic inference needs across applications in Natural Language Processing (NLP). Since 2005, in the TE Recognition (RTE) task, systems have been asked to automatically judge whether the meaning of a portion of text, the Text (T), entails the meaning of another text, the Hypothesis (H). The novel approaches and improvements demonstrated in successive RTE Challenges signal renewed interest in a deeper understanding of the core phenomena involved in TE. In line with this direction, we focus on one particular case of entailment. Texts exhibit different kinds of entailment, arising from different types of implicative reasoning (lexical, syntactic, common-sense based); here we address only TEG, defined as an entailment from a specific statement towards a relatively more general one. We therefore write T →G H whenever the premise T entails the hypothesis H and H is also more general than T. We propose an unsupervised and language-independent method to recognize TEG in a pair ⟨T,H⟩ that holds an entailment relation. To this end, we introduce an Informative Asymmetric Measure (IAM) called Simplified Asymmetric InfoSimba (AISs), which we combine with different Asymmetric Association Measures (AAMs). The main contribution of our study is to highlight the importance of this inference mechanism. Consequently, the new annotation data seem to be a valuable resource for the community.
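The directional decision can be sketched as follows, with a plain conditional probability standing in for the paper's AISs measure (the stand-in is an assumption; only the direction-comparison logic mirrors the description above).

```python
def cond_prob(x, y, cooc, freq):
    """P(x | y): a simple asymmetric association between words x and y.
    `cooc` maps word pairs to joint counts, `freq` maps words to counts."""
    return cooc.get((x, y), 0) / freq[y] if freq.get(y) else 0.0

def generality_score(a_words, b_words, cooc, freq):
    """Mean asymmetric association from the words of A towards those of B."""
    pairs = [(x, y) for x in a_words for y in b_words]
    return sum(cond_prob(x, y, cooc, freq) for x, y in pairs) / max(len(pairs), 1)

def entails_by_generality(t_words, h_words, cooc, freq):
    """Given an entailing pair (T, H), declare T ->G H when the association
    is stronger in the T-to-H direction, i.e. H reads as more general."""
    return (generality_score(t_words, h_words, cooc, freq)
            > generality_score(h_words, t_words, cooc, freq))
```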


2009 ◽  
Vol 35 (3) ◽  
pp. 435-461 ◽  
Author(s):  
Maayan Zhitomirsky-Geffet ◽  
Ido Dagan

This article presents a novel bootstrapping approach for improving the quality of feature vector weighting in distributional word similarity. The method was motivated by attempts to utilize distributional similarity for identifying the concrete semantic relationship of lexical entailment. Our analysis revealed that a major reason for the rather loose semantic similarity obtained by distributional similarity methods is insufficient quality of the word feature vectors, caused by deficient feature weighting. This observation led to the definition of a bootstrapping scheme which yields improved feature weights, and hence higher quality feature vectors. The underlying idea of our approach is that features which are common to similar words are also most characteristic for their meanings, and thus should be promoted. This idea is realized via a bootstrapping step applied to an initial standard approximation of the similarity space. The superior performance of the bootstrapping method was assessed in two different experiments, one based on direct human gold-standard annotation and the other based on an automatically created disambiguation dataset. These results are further supported by applying a novel quantitative measurement of the quality of feature weighting functions. Improved feature weighting also allows massive feature reduction, which indicates that the most characteristic features for a word are indeed concentrated at the top ranks of its vector. Finally, experiments with three prominent similarity measures and two feature weighting functions showed that the bootstrapping scheme is robust and is independent of the original functions over which it is applied.
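One bootstrapping step of this kind can be sketched as follows, under assumed choices (cosine over the initial vectors, a fixed neighbourhood size k, and a linear boost); the paper's actual weighting functions differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse feature vectors (dicts)."""
    num = sum(u[f] * v[f] for f in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def bootstrap_weights(vectors, k=10, boost=2.0):
    """Promote features shared with a word's top-k neighbours under the
    initial similarity space; returns reweighted vectors."""
    reweighted = {}
    for w, vec in vectors.items():
        neighbours = sorted((v for v in vectors if v != w),
                            key=lambda v: cosine(vec, vectors[v]),
                            reverse=True)[:k]
        support = {f: sum(f in vectors[n] for n in neighbours) for f in vec}
        reweighted[w] = {f: wt * (1 + boost * support[f] / max(k, 1))
                         for f, wt in vec.items()}
    return reweighted
```

Features that no neighbour shares keep their original weight, while widely shared (characteristic) features are promoted, which is the intuition stated above.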


1999 ◽  
Vol 5 (2) ◽  
pp. 157-170
Author(s):  
JEONG-MI CHO ◽  
JUNGYUN SEO ◽  
GIL CHANG KIM

This paper presents a system for automatic verb sense disambiguation in Korean using a small corpus and a Machine-Readable Dictionary (MRD). The system learns a set of typical uses, listed in the MRD usage examples for each sense of a polysemous verb, from verb-object co-occurrences acquired from the corpus. The paper addresses the problem of data sparseness in two ways. First, it extends word similarity measures from direct co-occurrences to co-occurrences of co-occurring words, so that word similarities can be computed through shared co-occurrence clusters even for words that never co-occur directly. Second, it acquires IS-A relations of nouns from the MRD definitions, which makes it possible to roughly cluster nouns by identifying the IS-A relationship. With these methods, two words may be considered similar even if they do not share any word elements. Experiments show that the method can learn from a very small training corpus, achieving over 86% correct disambiguation without any restriction on a word's senses.
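The first remedy, comparing words through the words they co-occur with rather than through direct co-occurrence, can be sketched as a neighbourhood-overlap measure (an illustrative simplification, not the paper's exact formulation).

```python
def second_order_similarity(a, b, cooc):
    """Similarity of a and b via overlap of their co-occurrence
    neighbourhoods; `cooc` maps a word to the set of words it
    co-occurs with. a and b need never co-occur directly."""
    na, nb = cooc.get(a, set()), cooc.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0
```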


2010 ◽  
Vol 16 (4) ◽  
pp. 359-389 ◽  
Author(s):  
LILI KOTLERMAN ◽  
IDO DAGAN ◽  
IDAN SZPEKTOR ◽  
MAAYAN ZHITOMIRSKY-GEFFET

Abstract
Distributional word similarity is most commonly perceived as a symmetric relation. Yet, directional relations are abundant in lexical semantics and in many Natural Language Processing (NLP) settings that require lexical inference, making symmetric similarity measures less suitable for their identification. This paper investigates the nature of directional (asymmetric) similarity measures that aim to quantify distributional feature inclusion. We identify desired properties of such measures for lexical inference, specify a particular measure based on Average Precision that addresses these properties, and demonstrate the empirical benefit of directional measures for two different NLP datasets.
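An Average-Precision-style inclusion score can be sketched as below. It follows the spirit of the description (rank u's features by weight and reward their early inclusion in v's feature set) but is not the paper's exact measure.

```python
def ap_inclusion(u, v):
    """Directional feature-inclusion score: u, v map features to weights.
    Asymmetric by construction: ap_inclusion(u, v) != ap_inclusion(v, u)."""
    ranked = sorted(u, key=u.get, reverse=True)
    hits, ap = 0, 0.0
    for r, f in enumerate(ranked, start=1):
        if f in v:
            hits += 1
            ap += hits / r  # precision at rank r, accumulated at each hit
    return ap / len(ranked) if ranked else 0.0
```

A high score means the salient features of u are largely included among v's features, the pattern expected when u is more specific than (and thus entails) v.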


2012 ◽  
Vol 44 (3) ◽  
pp. 656-674 ◽  
Author(s):  
Brent Kievit-Kylar ◽  
Michael N. Jones

Author(s):  
Bojan Furlan ◽  
Vladimir Sivački ◽  
Davor Jovanović ◽  
Boško Nikolić

This paper presents methods for measuring the semantic similarity of texts, evaluating several approaches based on existing similarity measures: on one side, word similarity computed from large text corpora; on the other, a commonsense knowledge base. A large fraction of the information available today, on the Web and elsewhere, consists of short text snippets (e.g. abstracts of scientific documents, image captions or product descriptions) in which commonsense knowledge plays an important role, so we focus on computing the similarity between two sentences or two short paragraphs by extending existing measures with information from the ConceptNet knowledge base. Since extensive research has also been done in the field of corpus-based semantic similarity, we additionally evaluated existing solutions with some modifications. Through experiments performed on a paraphrase data set, we demonstrate that some of the proposed approaches can improve the measurement of semantic similarity for short texts.
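One common way to lift word similarity to sentence similarity is best-match alignment, sketched below with a pluggable word_sim function that could be corpus-based or backed by a knowledge base such as ConceptNet (an illustrative baseline, not the authors' extended measures).

```python
def sentence_similarity(s1, s2, word_sim):
    """Average, over both directions, of each word's best match in the
    other sentence; s1 and s2 are token lists, word_sim a pairwise
    similarity function in [0, 1]."""
    def directed(a, b):
        return sum(max(word_sim(x, y) for y in b) for x in a) / len(a)
    if not s1 or not s2:
        return 0.0
    return (directed(s1, s2) + directed(s2, s1)) / 2
```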

