Using Latent Semantic Analysis and the Predication Algorithm to Improve Extraction of Meanings from a Diagnostic Corpus

2009 · Vol 12 (2) · pp. 424-440
Author(s):  
Guillermo Jorge-Botana ◽  
Ricardo Olmos ◽  
José Antonio León

There is currently widespread interest in indexing and extracting taxonomic information from large text collections. An example is the automatic categorization of informally written medical or psychological diagnoses, followed by the extraction of epidemiological information or even of the terms and structures needed to formulate guiding questions as a heuristic tool for helping doctors. Vector space models have been successfully used to this end (Lee, Cimino, Zhu, Sable, Shanker, Ely & Yu, 2006; Pakhomov, Buntrock & Chute, 2006). In this study we use a computational model known as Latent Semantic Analysis (LSA) on a diagnostic corpus with the aim of retrieving definitions (in the form of lists of semantic neighbors) of common structures it contains (e.g., “storm phobia”, “dog phobia”) or of less common structures that might be formed by logical combinations of categories and diagnostic symptoms (e.g., “gun personality” or “germ personality”). Various problems commonly arise when recovering content with vector space models in the quest to bring definitions into line with the meaning of these structures and to make them in some way representative. We propose some approaches that bypass these problems, such as Kintsch's (2001) predication algorithm and some corrections to the way lists of neighbors are obtained, which have already been tested on semantic spaces in a non-specific domain (Jorge-Botana, León, Olmos & Hassan-Montero, under review). The results support the idea that the predication algorithm may also be useful for extracting more precise meanings of certain structures from scientific corpora, and that the introduction of some corrections based on vector length may increase its efficiency on non-representative terms.
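As an illustration of the retrieval step described above, the following is a minimal sketch of Kintsch's (2001) predication algorithm over a precomputed LSA space. The lsa_vectors dictionary, the parameter values, and the neighbor counts are assumptions for demonstration, not the authors' actual implementation or settings (their vector-length corrections to the neighbor lists are also not reproduced here).

# Minimal sketch of Kintsch's (2001) predication algorithm, assuming a
# precomputed LSA space given as `lsa_vectors`, a dict of term -> vector.
# All names and parameter values are illustrative assumptions.
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def predicate(pred, arg, lsa_vectors, n_neighbors=20, k=5):
    """Bias a predicate's meaning toward its argument by averaging in
    the predicate neighbors that are most similar to the argument."""
    p, a = lsa_vectors[pred], lsa_vectors[arg]
    # 1. n nearest neighbors of the predicate (excluding both terms)
    neighbors = sorted(
        (t for t in lsa_vectors if t not in (pred, arg)),
        key=lambda t: cosine(lsa_vectors[t], p),
        reverse=True,
    )[:n_neighbors]
    # 2. keep the k of those that are closest to the argument
    relevant = sorted(neighbors, key=lambda t: cosine(lsa_vectors[t], a),
                      reverse=True)[:k]
    # 3. centroid of predicate, argument, and the selected neighbors
    vecs = [p, a] + [lsa_vectors[t] for t in relevant]
    return np.mean(vecs, axis=0)

# The resulting vector can then be compared (by cosine) against the whole
# vocabulary to list the semantic neighbors defining e.g. "dog phobia".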

2019 · Vol 10 (1) · pp. 29
Author(s):  
Yulius Denny Prabowo ◽  
Tedi Lesmana Marselino ◽  
Meylisa Suryawiguna

Extracting information from a large amount of structured data requires expensive computing. The Vector Space Model method works by mapping words into a continuous vector space in which semantically similar words are mapped to nearby points. The Vector Space Model assumes that words appearing in the same context share the same semantic meaning. In practice there are two different approaches: count-based methods (e.g., Latent Semantic Analysis) and predictive methods (e.g., the Neural Probabilistic Language Model). This study aims to apply the Word2Vec method using the Continuous Bag of Words approach to the Indonesian language. The research data were obtained by crawling several online news portals. The expected result of the research is a mapping of Indonesian word vectors based on the data used.

Keywords: vector space model, word to vector, Indonesian vector space model.
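The predictive approach named above can be made concrete with a short sketch: training CBOW vectors with gensim's Word2Vec. The corpus file name and the hyperparameter values below are illustrative assumptions, not the study's reported settings.

# Sketch of training CBOW word vectors with gensim's Word2Vec.
# "news_corpus_id.txt" (one sentence per line of crawled Indonesian
# news text) and all hyperparameters are assumptions for demonstration.
from gensim.models import Word2Vec

with open("news_corpus_id.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the vector space
    window=5,          # context words on each side of the target
    min_count=5,       # ignore rare words
    sg=0,              # sg=0 selects CBOW (sg=1 would be skip-gram)
)

# Semantically similar words should be mapped to nearby vectors:
print(model.wv.most_similar("ekonomi", topn=10))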


Author(s):  
М.А. Нокель ◽  
Н.В. Лукашевич

The results of experiments on adding bigrams to topic models and taking into account the similarity between bigrams and unigrams are presented. A novel PLSA-SIM algorithm is proposed as a modification of the original PLSA (Probabilistic Latent Semantic Analysis) algorithm. The proposed algorithm incorporates bigrams and takes into account the similarity between them and their unigram components. Various word association measures are analyzed for selecting top-ranked bigrams and integrating them into topic models. As target text collections, articles from Russian electronic banking magazines, the English parts of the parallel corpora Europarl and JRC-Acquis, and the English digital archive of research papers in computational linguistics (ACL Anthology) are chosen. The computational experiments show that there exists a subgroup of the tested measures that rank bigrams in such a way that their inclusion in the PLSA-SIM algorithm significantly improves the quality of the resulting topic models for all collections. A novel unsupervised iterative algorithm named PLSA-ITER is also proposed for adding the most relevant bigrams. The computational experiments show a further improvement in the quality of topic models compared to the original PLSA algorithm.
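A minimal sketch of the bigram-selection idea described above: score candidate bigrams with an association measure (a simplified PMI here, one of several possible measures) and merge the top-ranked ones into the token stream before topic modeling. This illustrates the preprocessing step only; it is not the authors' PLSA-SIM implementation, and all parameter values are assumptions.

# Rank candidate bigrams by a (simplified) PMI association measure and
# rewrite documents so selected bigrams become single tokens. Purely
# illustrative; not the paper's PLSA-SIM algorithm.
import math
from collections import Counter

def top_bigrams(docs, top_n=1000, min_count=5):
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:                      # doc: list of tokens
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    total = sum(unigrams.values())
    def pmi(pair):
        (w1, w2), n12 = pair
        return math.log((n12 * total) / (unigrams[w1] * unigrams[w2]))
    scored = [p for p in bigrams.items() if p[1] >= min_count]
    scored.sort(key=pmi, reverse=True)
    return {b for b, _ in scored[:top_n]}

def merge_bigrams(doc, selected):
    """Rewrite a token list so selected bigrams become single tokens."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in selected:
            out.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out

# Usage: selected = top_bigrams(docs); docs = [merge_bigrams(d, selected)
# for d in docs]; then run the topic model on the rewritten documents.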


2015 · Vol 1 (44) · pp. 5
Author(s):  
Anatoly Dmitrievich Khomonenko ◽  
Sergej Vjacheslavovich Logashev ◽  
Sergey Aleksandrovich Krasnov

Author(s):  
Esther Vlieger ◽  
Loet Leydesdorff

A step-by-step introduction is provided on how to generate a semantic map from a collection of messages (full texts, paragraphs, or statements) using freely available software and/or SPSS for the relevant statistics and the visualization. The techniques are discussed in the various theoretical contexts of (i) linguistics (e.g., Latent Semantic Analysis), (ii) sociocybernetics and social systems theory (e.g., the communication of meaning), and (iii) communication studies (e.g., framing and agenda-setting). The authors distinguish between the communication of information in the network space (social network analysis) and the communication of meaning in the vector space. The vector space can be considered an architecture generated by the network of relations in the network space; words are then not only related, but also positioned. These positions are expected rather than observed, and therefore one can communicate meaning. Knowledge can be generated when these meanings can recursively be communicated and thereby further codified.
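For readers who prefer code, a rough Python analogue of the mapping procedure might look as follows; the toy messages and the MDS projection are illustrative choices, not the authors' prescribed freeware/SPSS toolchain.

# Rough analogue of the semantic-mapping procedure: build a
# word-message occurrence matrix, take cosine similarities among words
# (the vector space), and project them to 2-D to position the words.
# Messages and the projection method are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS

messages = [
    "the communication of meaning in the vector space",
    "the communication of information in the network space",
    "words are not only related but also positioned",
]

vec = CountVectorizer()
X = vec.fit_transform(messages).T.toarray()   # rows = words, cols = messages

sims = cosine_similarity(X)                   # word-by-word vector space

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(1 - sims)
for word, (x, y) in zip(vec.get_feature_names_out(), coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")      # positions on the semantic map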


2014 · Vol 9 (1) · pp. 67-106
Author(s):  
Guillermo Jorge-Botana ◽  
Ricardo Olmos

The role that the diversity of a word’s contexts plays in lexical access is currently an object of research. Vector-space models such as Latent Semantic Analysis (LSA) are useful for examining this role. Having an objective, discrete model of lexical representation allows us to objectify parameters in order to define contextual focalization in a more measurable way. In the first part of our study, we investigate whether certain empirical data on ambiguity can be modeled by means of an exclusively symbolic single-representation model such as LSA and an excitatory-inhibitory mechanism such as the Construction-Integration framework. Our observations support the idea that some ambiguity effects could be explained by the contextual distribution using such a model. In the second part, we put abstract and concrete words to the test. Our LSA model (exclusively symbolic) and the excitatory-inhibitory mechanism can also explain the penalty paid by abstract words as they activate other words through semantic similarity, and the advantage of concrete words in naming and semantic judgments, though they do not account for the advantage of concrete words in lexical decision tasks. The results of this second part are then discussed within the framework of the embodied/symbolic view of language processing.
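A toy sketch of the excitatory-inhibitory settling invoked above: word nodes linked by similarity weights excite each other while competing senses inhibit one another, so the context gradually selects one reading of an ambiguous word. The network, the connection values, and the update rule are invented for illustration; this is not the paper's Construction-Integration implementation.

# Toy Construction-Integration-style settling: "river" plus two senses
# of "bank". Weights are invented for illustration; in a real model they
# would come from LSA cosine similarities.
import numpy as np

nodes = ["river", "bank_shore", "bank_money"]
W = np.array([
    [0.0,  0.7,  0.1],   # "river" excites the shore sense strongly
    [0.7,  0.0, -0.5],   # competing senses inhibit each other
    [0.1, -0.5,  0.0],
])

act = np.ones(len(nodes)) / len(nodes)        # uniform initial activation
for _ in range(100):
    act = np.clip(act + 0.1 * (W @ act), 0, None)  # excite / inhibit
    act /= act.sum()                               # keep normalized

for n, a in zip(nodes, act):
    print(f"{n}: {a:.2f}")
# After settling, "bank_shore" dominates "bank_money" in this context.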


2011 · Vol 3 (1) · pp. 28-50
Author(s):  
Esther Vlieger ◽  
Loet Leydesdorff

A step-by-step introduction is provided on how to generate a semantic map from a collection of messages (full texts, paragraphs, or statements) using freely available software and/or SPSS for the relevant statistics and the visualization. The techniques are discussed in the various theoretical contexts of (i) linguistics (e.g., Latent Semantic Analysis), (ii) sociocybernetics and social systems theory (e.g., the communication of meaning), and (iii) communication studies (e.g., framing and agenda-setting). We distinguish between the communication of information in the network space (social network analysis) and the communication of meaning in the vector space. The vector space can be considered as the space in which the network of relations spans an architecture; words then are not only related, but also positioned. These positions are expected rather than observed, and therefore one can communicate meaning.

