Text Clustering Based on Domain Ontology and Latent Semantic Analysis

One key step in text mining is the categorization of texts, i.e., grouping texts of the same or similar content together so that texts of different content can be distinguished. However, traditional word-frequency-based statistical approaches, such as the vector space model (VSM), fail to reflect the complex meaning of texts. This paper introduces a domain ontology and constructs a new conceptual vector space model in the pre-processing stage of text clustering, replacing the initial lexicon-text matrix used in latent semantic analysis with a concept-text matrix. In the clustering analysis stage, the model adopts semantic similarity, which partially overcomes the difficulty of accurately and effectively evaluating text similarity when only the frequency of words and phrases is taken into account. Experimental results indicate that this method helps improve text clustering results.
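The idea of replacing term counts with concept counts can be illustrated with a minimal sketch. The toy ontology, the example sentences, and the function names below are assumptions made for illustration and do not come from the paper; the latent semantic analysis step (a truncated SVD of the concept-text matrix) is omitted to keep the example short.

```python
import math
from collections import Counter

# Hypothetical toy ontology mapping surface terms to domain concepts
# (an assumption for illustration, not the ontology used in the paper).
ONTOLOGY = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "physician": "doctor",
    "doctor": "doctor",
    "surgeon": "doctor",
}

def term_vector(text):
    """Plain bag-of-words counts (one column of a term-text matrix)."""
    return Counter(text.lower().split())

def concept_vector(text):
    """Concept counts (one column of a concept-text matrix).
    Words that are not in the ontology are dropped."""
    tokens = text.lower().split()
    return Counter(ONTOLOGY[t] for t in tokens if t in ONTOLOGY)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc_a = "the physician bought a car"
doc_b = "a surgeon drives an automobile"

# The two documents share only the word "a" at the term level,
# but they mention the same concepts (doctor, vehicle).
term_sim = cosine(term_vector(doc_a), term_vector(doc_b))
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))
print(f"term-level similarity:    {term_sim:.2f}")
print(f"concept-level similarity: {concept_sim:.2f}")
```

In this toy case the concept-level similarity is much higher than the term-level one, which is the effect the abstract attributes to the concept-text matrix.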

Pembentukan Vector Space Model Bahasa Indonesia Menggunakan Metode Word to Vector (Building an Indonesian Vector Space Model Using the Word to Vector Method)

Jurnal Buana Informatika ◽ 10.24002/jbi.v10i1.2053 ◽ 2019 ◽ Vol 10 (1) ◽ pp. 29

Author(s): Yulius Denny Prabowo ◽ Tedi Lesmana Marselino ◽ Meylisa Suryawiguna

Keyword(s): Vector Space ◽ Latent Semantic Analysis ◽ Semantic Analysis ◽ Language Model ◽ Vector Space Model ◽ Online News ◽ Bag Of Words ◽ Space Model ◽ Language Research ◽ Bahasa Indonesia

Extracting information from a large amount of structured data is computationally expensive. The vector space model maps words into a continuous vector space in which semantically similar words are placed close to one another. It assumes that words appearing in the same context have the same semantic meaning. There are two different implementation approaches: count-based methods (e.g., Latent Semantic Analysis) and predictive methods (e.g., the Neural Probabilistic Language Model). This study aims to apply the Word2Vec method with the Continuous Bag of Words approach to the Indonesian language. The research data were obtained by crawling several online news portals. The expected result is a vector mapping of Indonesian words based on the data used.

Keywords: vector space model, word to vector, Indonesian vector space model.
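As a rough illustration of the Continuous Bag of Words setup, the sketch below builds the (context, target) training pairs that a CBOW model learns from: the words around a position are used to predict the word at that position. The function name `cbow_pairs`, the window size, and the example sentence are assumptions for illustration; the network that is actually trained on these pairs is not shown.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for CBOW training.
    The context is up to `window` words on each side of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip single-word inputs that have no context
            pairs.append((context, target))
    return pairs

# Hypothetical Indonesian sentence: "I read online news every morning".
sentence = "saya membaca berita online setiap pagi".split()
pairs = cbow_pairs(sentence, window=1)

for context, target in pairs:
    print(context, "->", target)
```

A larger window gives each target more context words, at the cost of mixing in words that are less closely related to it.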

Exploring similarity between academic paper and patent based on Latent Semantic Analysis and Vector Space Model

2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD) ◽ 10.1109/fskd.2015.7382045 ◽ 2015 ◽ Cited By ~ 4

Author(s): Hongjiao Xu ◽ Wen Zeng ◽ Jie Gui ◽ Peng Qu ◽ Xiaohua Zhu ◽ ...

Keyword(s): Vector Space ◽ Latent Semantic Analysis ◽ Semantic Analysis ◽ Vector Space Model ◽ Space Model ◽ Academic Paper

Text Clustering Based on Domain Ontology and Latent Semantic Analysis

2010 International Conference on Asian Language Processing ◽ 10.1109/ialp.2010.55 ◽ 2010 ◽ Cited By ~ 3

Author(s): Yaxiong Li ◽ Jianqiang Zhang ◽ Dan Hu

Keyword(s): Latent Semantic Analysis ◽ Semantic Analysis ◽ Text Clustering ◽ Domain Ontology

Knowledge-based vector space model for text clustering

Knowledge and Information Systems ◽ 10.1007/s10115-009-0256-5 ◽ 2009 ◽ Vol 25 (1) ◽ pp. 35-55 ◽ Cited By ~ 51

Author(s): Liping Jing ◽ Michael K. Ng ◽ Joshua Z. Huang

Keyword(s): Vector Space ◽ Vector Space Model ◽ Text Clustering ◽ Space Model ◽ Knowledge Based

A CONCEPT VECTOR SPACE MODEL FOR SEMANTIC KERNELS

International Journal of Artificial Intelligence Tools ◽ 10.1142/s0218213009000123 ◽ 2009 ◽ Vol 18 (02) ◽ pp. 239-272 ◽ Cited By ~ 1

Author(s): SUJEEVAN ASEERVATHAM

Keyword(s): Vector Space ◽ Language Processing ◽ Text Categorization ◽ Semantic Analysis ◽ Similarity Measures ◽ Vector Space Model ◽ Inner Product ◽ Support Vector ◽ Linear Kernel ◽ Space Model

Kernels are widely used in Natural Language Processing as similarity measures within inner-product based learning methods like the Support Vector Machine. The Vector Space Model (VSM) is extensively used for the spatial representation of the documents. However, it is purely a statistical representation. In this paper, we present a Concept Vector Space Model (CVSM) representation which uses linguistic prior knowledge to capture the meanings of the documents. We also propose a linear kernel and a latent kernel for this space. The linear kernel takes advantage of the linguistic concepts whereas the latent kernel combines statistical and linguistic concepts. Indeed, the latter kernel uses latent concepts extracted by the Latent Semantic Analysis (LSA) in the CVSM. The kernels were evaluated on a text categorization task in the biomedical domain. The Ohsumed corpus, well known for being difficult to categorize, was used. The results have shown that the CVSM improves performance compared to the VSM.

Summarization of text clustering based vector space model

2009 IEEE 10th International Conference on Computer-Aided Industrial Design & Conceptual Design ◽ 10.1109/caidcd.2009.5375265 ◽ 2009

Author(s): Mingzhen Chen ◽ Yu Song

Keyword(s): Vector Space ◽ Vector Space Model ◽ Text Clustering ◽ Space Model

Enhancing GSOM text clustering with Latent Semantic Analysis

2010 Fifth International Conference on Information and Automation for Sustainability ◽ 10.1109/iciafs.2010.5715702 ◽ 2010 ◽ Cited By ~ 2

Author(s): S Matharage ◽ D Alahakoon

Keyword(s): Latent Semantic Analysis ◽ Semantic Analysis ◽ Text Clustering

Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet

Future Internet ◽ 10.3390/fi11050114 ◽ 2019 ◽ Vol 11 (5) ◽ pp. 114 ◽ Cited By ~ 5

Author(s): Korawit Orkphol ◽ Wu Yang

Keyword(s): Vector Space ◽ Language Processing ◽ Semantic Analysis ◽ Word Sense Disambiguation ◽ Vector Space Model ◽ Word Embedding ◽ Cosine Similarity ◽ Word Sense ◽ Lexical Database ◽ Space Model

Words have different meanings (i.e., senses) depending on the context. Identifying the correct sense is an important and challenging task in natural language processing. An intuitive approach is to select the sense whose definition has the highest similarity to the context, using the definitions provided by WordNet, a large lexical database of English. In this database, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms interlinked through conceptual-semantic and lexical relations. Traditional unsupervised approaches compute similarity by counting overlapping words between the context and sense definitions, which must match exactly. Similarity should instead be computed based on how words are related rather than on exact overlap, by representing the context and sense definitions in a vector space model and analyzing distributional semantic relationships among them using latent semantic analysis (LSA). As a text corpus grows, however, LSA consumes much more memory and is not flexible enough to train on a huge corpus. A word-embedding approach has an advantage here. Word2vec is a popular word-embedding approach that represents words in a fixed-size vector space through either the skip-gram or continuous bag-of-words (CBOW) model. Word2vec also captures semantic and syntactic word similarities from a huge corpus more effectively than LSA. Our method uses Word2vec to construct a context sentence vector and sense definition vectors, then gives each word sense a score using cosine similarity between those sentence vectors. The sense definition is also expanded with sense relations retrieved from WordNet. If the score is not higher than a specific threshold, it is combined with the probability of that sense's distribution learned from a large sense-tagged corpus, SEMCOR. The possible answer senses are those with high scores. Our method achieves a result (50.9%, or 48.7% without the probability of sense distribution) that is higher than the baselines (i.e., original, simplified, adapted and LSA Lesk) and outperforms many unsupervised systems participating in the SENSEVAL-3 English lexical sample task.
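The scoring step described in this abstract can be sketched as follows, under stated assumptions: each sentence vector is taken as the average of its word vectors, and the sense whose definition vector has the highest cosine similarity to the context vector is chosen. The three-dimensional vectors, the sense labels, and the function names are hypothetical; real Word2vec vectors would be learned from a corpus, and the threshold and SEMCOR sense-distribution fallback are not modelled here.

```python
import math

# Hypothetical 3-dimensional word vectors (assumed values for illustration only).
VECTORS = {
    "money":   [0.9, 0.1, 0.0],
    "deposit": [0.8, 0.2, 0.0],
    "loan":    [0.9, 0.0, 0.1],
    "river":   [0.0, 0.9, 0.1],
    "water":   [0.1, 0.8, 0.1],
    "shore":   [0.0, 0.9, 0.2],
}

def sentence_vector(words):
    """Average the vectors of the known words; unknown words are ignored."""
    known = [VECTORS[w] for w in words if w in VECTORS]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(known) for column in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_sense(context_words, sense_definitions):
    """Score every sense definition against the context and return the best one."""
    context_vec = sentence_vector(context_words)
    scores = {
        sense: cosine(context_vec, sentence_vector(definition))
        for sense, definition in sense_definitions.items()
    }
    return max(scores, key=scores.get), scores

# Two hypothetical senses of "bank", each with a short definition word list.
senses = {
    "bank (financial institution)": ["money", "loan"],
    "bank (river side)": ["river", "shore"],
}
chosen, scores = best_sense(["deposit", "money"], senses)
print(chosen)
```

Because the comparison is made between vectors rather than between exact word matches, the context word "deposit" contributes to the score even though it does not appear in either definition.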

An information-theoretic, vector-space-model approach to cross-language information retrieval

Natural Language Engineering ◽ 10.1017/s1351324910000185 ◽ 2011 ◽ Vol 17 (1) ◽ pp. 37-70 ◽ Cited By ~ 7

Author(s): PETER A. CHEW ◽ BRETT W. BADER ◽ STEPHEN HELMREICH ◽ AHMED ABDELALI ◽ STEPHEN J. VERZI

Keyword(s): Information Retrieval ◽ Vector Space ◽ Computational Linguistics ◽ Semantic Analysis ◽ Statistical Machine Translation ◽ Vector Space Model ◽ Standard Approach ◽ Eigenvalue Decomposition ◽ Information Theoretic ◽ Space Model

In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a ‘standard’ approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do.

Chinese-English Cross-Language Text Clustering Algorithm Based on Latent Semantic Analysis

10.22323/1.300.0007 ◽ 2018

Author(s): Huihong Lan ◽ Jinde Huang

Keyword(s): Latent Semantic Analysis ◽ Clustering Algorithm ◽ Semantic Analysis ◽ Text Clustering ◽ Cross Language ◽ Language Text
