Using Latent Semantic Analysis and the Predication Algorithm to Improve Extraction of Meanings from a Diagnostic Corpus

2009 · Vol 12 (2) · pp. 424-440
Author(s):  
Guillermo Jorge-Botana ◽  
Ricardo Olmos ◽  
José Antonio León

There is currently widespread interest in indexing and extracting taxonomic information from large text collections. An example is the automatic categorization of informally written medical or psychological diagnoses, followed by the extraction of epidemiological information or even of the terms and structures needed to formulate guiding questions as a heuristic tool for helping doctors. Vector space models have been successfully used to this end (Lee, Cimino, Zhu, Sable, Shanker, Ely & Yu, 2006; Pakhomov, Buntrock & Chute, 2006). In this study we use a computational model known as Latent Semantic Analysis (LSA) on a diagnostic corpus with the aim of retrieving definitions (in the form of lists of semantic neighbors) of common structures it contains (e.g., “storm phobia”, “dog phobia”) or of less common structures that might be formed by logical combinations of categories and diagnostic symptoms (e.g., “gun personality” or “germ personality”). Various problems commonly arise when recovering content with vector space models in the quest to bring definitions into line with the meaning of these structures and to make them in some way representative. We propose some approaches that bypass these problems, such as Kintsch's (2001) predication algorithm and some corrections to the way lists of neighbors are obtained, which have already been tested on semantic spaces in a non-specific domain (Jorge-Botana, León, Olmos & Hassan-Montero, under review). The results support the idea that the predication algorithm may also be useful for extracting more precise meanings of certain structures from scientific corpora, and that the introduction of some corrections based on vector length may increase its efficiency on non-representative terms.
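As an illustration of the retrieval step described above, the following is a minimal sketch of Kintsch's (2001) predication algorithm over a precomputed LSA space. The lsa_vectors dictionary, the parameter values, and the neighbor counts are assumptions for demonstration, not the authors' actual implementation or settings (their vector-length corrections to the neighbor lists are also not reproduced here).

# Minimal sketch of Kintsch's (2001) predication algorithm, assuming a
# precomputed LSA space given as `lsa_vectors`, a dict of term -> vector.
# All names and parameter values are illustrative assumptions.
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def predicate(pred, arg, lsa_vectors, n_neighbors=20, k=5):
    """Bias a predicate's meaning toward its argument by averaging in
    the predicate neighbors that are most similar to the argument."""
    p, a = lsa_vectors[pred], lsa_vectors[arg]
    # 1. n nearest neighbors of the predicate (excluding both terms)
    neighbors = sorted(
        (t for t in lsa_vectors if t not in (pred, arg)),
        key=lambda t: cosine(lsa_vectors[t], p),
        reverse=True,
    )[:n_neighbors]
    # 2. keep the k of those that are closest to the argument
    relevant = sorted(neighbors, key=lambda t: cosine(lsa_vectors[t], a),
                      reverse=True)[:k]
    # 3. centroid of predicate, argument, and the selected neighbors
    vecs = [p, a] + [lsa_vectors[t] for t in relevant]
    return np.mean(vecs, axis=0)

# The resulting vector can then be compared (by cosine) against the whole
# vocabulary to list the semantic neighbors defining e.g. "dog phobia".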

2019 · Vol 10 (1) · pp. 29
Author(s):  
Yulius Denny Prabowo ◽  
Tedi Lesmana Marselino ◽  
Meylisa Suryawiguna

Extracting information from a large amount of structured data requires expensive computing. The Vector Space Model method works by mapping words into a continuous vector space in which semantically similar words are mapped to nearby points. The Vector Space Model assumes that words appearing in the same context share the same semantic meaning. In practice there are two different approaches: count-based methods (e.g., Latent Semantic Analysis) and predictive methods (e.g., the Neural Probabilistic Language Model). This study aims to apply the Word2Vec method using the Continuous Bag of Words approach to the Indonesian language. The research data were obtained by crawling several online news portals. The expected result of the research is a mapping of Indonesian word vectors based on the data used.

Keywords: vector space model, word to vector, Indonesian vector space model.
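The predictive approach named above can be made concrete with a short sketch: training CBOW vectors with gensim's Word2Vec. The corpus file name and the hyperparameter values below are illustrative assumptions, not the study's reported settings.

# Sketch of training CBOW word vectors with gensim's Word2Vec.
# "news_corpus_id.txt" (one sentence per line of crawled Indonesian
# news text) and all hyperparameters are assumptions for demonstration.
from gensim.models import Word2Vec

with open("news_corpus_id.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the vector space
    window=5,          # context words on each side of the target
    min_count=5,       # ignore rare words
    sg=0,              # sg=0 selects CBOW (sg=1 would be skip-gram)
)

# Semantically similar words should be mapped to nearby vectors:
print(model.wv.most_similar("ekonomi", topn=10))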


Author(s):  
М.А. Нокель ◽  
Н.В. Лукашевич

The results of experiments on adding bigrams to topic models and taking into account the similarity between bigrams and unigrams are presented. A novel PLSA-SIM algorithm is proposed as a modification of the original PLSA (Probabilistic Latent Semantic Analysis) algorithm. The proposed algorithm incorporates bigrams and takes into account the similarity between them and their unigram components. Various word association measures are analyzed for selecting top-ranked bigrams and integrating them into topic models. As target text collections, articles from Russian electronic banking magazines, the English parts of the parallel corpora Europarl and JRC-Acquis, and the English digital archive of research papers in computational linguistics (ACL Anthology) are chosen. The computational experiments show that there exists a subgroup of the tested measures that rank bigrams in such a way that their inclusion in the PLSA-SIM algorithm significantly improves the quality of the resulting topic models for all collections. A novel unsupervised iterative algorithm named PLSA-ITER is also proposed for adding the most relevant bigrams. The computational experiments show a further improvement in the quality of topic models compared to the original PLSA algorithm.
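A minimal sketch of the bigram-selection idea described above: score candidate bigrams with an association measure (a simplified PMI here, one of several possible measures) and merge the top-ranked ones into the token stream before topic modeling. This illustrates the preprocessing step only; it is not the authors' PLSA-SIM implementation, and all parameter values are assumptions.

# Rank candidate bigrams by a (simplified) PMI association measure and
# rewrite documents so selected bigrams become single tokens. Purely
# illustrative; not the paper's PLSA-SIM algorithm.
import math
from collections import Counter

def top_bigrams(docs, top_n=1000, min_count=5):
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:                      # doc: list of tokens
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    total = sum(unigrams.values())
    def pmi(pair):
        (w1, w2), n12 = pair
        return math.log((n12 * total) / (unigrams[w1] * unigrams[w2]))
    scored = [p for p in bigrams.items() if p[1] >= min_count]
    scored.sort(key=pmi, reverse=True)
    return {b for b, _ in scored[:top_n]}

def merge_bigrams(doc, selected):
    """Rewrite a token list so selected bigrams become single tokens."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in selected:
            out.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out

# Usage: selected = top_bigrams(docs); docs = [merge_bigrams(d, selected)
# for d in docs]; then run the topic model on the rewritten documents.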


2015 · Vol 1 (44) · pp. 5
Author(s):  
Anatoly Dmitrievich Khomonenko ◽  
Sergej Vjacheslavovich Logashev ◽  
Sergey Aleksandrovich Krasnov

Author(s):  
Esther Vlieger ◽  
Loet Leydesdorff

A step-by-step introduction is provided on how to generate a semantic map from a collection of messages (full texts, paragraphs, or statements) using freely available software and/or SPSS for the relevant statistics and the visualization. The techniques are discussed in the various theoretical contexts of (i) linguistics (e.g., Latent Semantic Analysis), (ii) sociocybernetics and social systems theory (e.g., the communication of meaning), and (iii) communication studies (e.g., framing and agenda-setting). The authors distinguish between the communication of information in the network space (social network analysis) and the communication of meaning in the vector space. The vector space can be considered an architecture generated by the network of relations in the network space; words are then not only related, but also positioned. These positions are expected rather than observed, and therefore one can communicate meaning. Knowledge can be generated when these meanings can recursively be communicated and thereby further codified.
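For readers who prefer code, a rough Python analogue of the mapping procedure might look as follows; the toy messages and the MDS projection are illustrative choices, not the authors' prescribed freeware/SPSS toolchain.

# Rough analogue of the semantic-mapping procedure: build a
# word-message occurrence matrix, take cosine similarities among words
# (the vector space), and project them to 2-D to position the words.
# Messages and the projection method are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS

messages = [
    "the communication of meaning in the vector space",
    "the communication of information in the network space",
    "words are not only related but also positioned",
]

vec = CountVectorizer()
X = vec.fit_transform(messages).T.toarray()   # rows = words, cols = messages

sims = cosine_similarity(X)                   # word-by-word vector space

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(1 - sims)
for word, (x, y) in zip(vec.get_feature_names_out(), coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")      # positions on the semantic map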


2014 · Vol 9 (1) · pp. 67-106
Author(s):  
Guillermo Jorge-Botana ◽  
Ricardo Olmos

The role that the diversity of a word’s contexts plays in lexical access is currently an object of research. Vector-space models such as Latent Semantic Analysis (LSA) are useful for examining this role. Having an objective, discrete model of lexical representation allows us to objectify parameters in order to define contextual focalization in a more measurable way. In the first part of our study, we investigate whether certain empirical data on ambiguity can be modeled by means of an exclusively symbolic single-representation model such as LSA and an excitatory-inhibitory mechanism such as the Construction-Integration framework. Our observations support the idea that some ambiguity effects could be explained by the contextual distribution using such a model. In the second part, we put abstract and concrete words to the test. Our LSA model (exclusively symbolic) and the excitatory-inhibitory mechanism can also explain the penalty paid by abstract words as they activate other words through semantic similarity, and the advantage of concrete words in naming and semantic judgments, though they do not account for the advantage of concrete words in lexical decision tasks. The results of this second part are then discussed within the framework of the embodied/symbolic view of language processing.
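A toy sketch of the excitatory-inhibitory settling invoked above: word nodes linked by similarity weights excite each other while competing senses inhibit one another, so the context gradually selects one reading of an ambiguous word. The network, the connection values, and the update rule are invented for illustration; this is not the paper's Construction-Integration implementation.

# Toy Construction-Integration-style settling: "river" plus two senses
# of "bank". Weights are invented for illustration; in a real model they
# would come from LSA cosine similarities.
import numpy as np

nodes = ["river", "bank_shore", "bank_money"]
W = np.array([
    [0.0,  0.7,  0.1],   # "river" excites the shore sense strongly
    [0.7,  0.0, -0.5],   # competing senses inhibit each other
    [0.1, -0.5,  0.0],
])

act = np.ones(len(nodes)) / len(nodes)        # uniform initial activation
for _ in range(100):
    act = np.clip(act + 0.1 * (W @ act), 0, None)  # excite / inhibit
    act /= act.sum()                               # keep normalized

for n, a in zip(nodes, act):
    print(f"{n}: {a:.2f}")
# After settling, "bank_shore" dominates "bank_money" in this context.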


2011 · Vol 3 (1) · pp. 28-50
Author(s):  
Esther Vlieger ◽  
Loet Leydesdorff

A step-by-step introduction is provided on how to generate a semantic map from a collection of messages (full texts, paragraphs, or statements) using freely available software and/or SPSS for the relevant statistics and the visualization. The techniques are discussed in the various theoretical contexts of (i) linguistics (e.g., Latent Semantic Analysis), (ii) sociocybernetics and social systems theory (e.g., the communication of meaning), and (iii) communication studies (e.g., framing and agenda-setting). We distinguish between the communication of information in the network space (social network analysis) and the communication of meaning in the vector space. The vector space can be considered as the space in which the network of relations spans an architecture; words then are not only related, but also positioned. These positions are expected rather than observed, and therefore one can communicate meaning.

