Pembentukan Vector Space Model Bahasa Indonesia Menggunakan Metode Word to Vector (Constructing an Indonesian Vector Space Model Using the Word-to-Vector Method)

2019 ◽  
Vol 10 (1) ◽  
pp. 29
Author(s):  
Yulius Denny Prabowo ◽  
Tedi Lesmana Marselino ◽  
Meylisa Suryawiguna

Extracting information from a large collection of structured data requires expensive computation. The Vector Space Model works by mapping words into a continuous vector space in which semantically similar words are mapped to nearby points. The model assumes that words appearing in the same contexts share the same semantic meaning. In practice there are two different approaches: count-based methods (e.g., Latent Semantic Analysis) and predictive methods (e.g., the Neural Probabilistic Language Model). This study applies the Word2Vec method with the Continuous Bag of Words (CBOW) approach to the Indonesian language. The research data were obtained by crawling several online news portals. The expected result of the research is a vector mapping of Indonesian words based on the data used.

Keywords: vector space model, word to vector, Indonesian vector space model.
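A minimal sketch of the training step this abstract describes, using gensim's Word2Vec with the CBOW architecture; the corpus file name, its one-tokenized-sentence-per-line format, and the hyperparameter values are assumptions for illustration, not the study's actual settings.

```python
# Train a CBOW Word2Vec model on crawled Indonesian news text (assumed
# to be stored one tokenized, lowercased sentence per line).
from gensim.models import Word2Vec

with open("berita_crawl.txt", encoding="utf-8") as f:  # hypothetical file
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context words on each side of the target word
    min_count=5,      # ignore rare words
    sg=0,             # sg=0 selects the CBOW architecture
)

# Semantically similar words should be mapped to nearby vectors.
print(model.wv.most_similar("pemerintah", topn=5))
```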

Author(s):  
Lucian Nicolae Vintan ◽  
Daniel Ionel Morariu ◽  
Radu George Cretulescu ◽  
Maria Vintan

In this paper we present a new approach to document representation for use in classification and/or clustering algorithms. We start from the classical "bag-of-words" representation but augment each word with its corresponding part of speech. We thus introduce a new concept called hyper-vectors, in which each document is represented in a hyper-space whose dimensions are the different part-of-speech components. Within each dimension the document is represented using the Vector Space Model (VSM). In this work we use only five parts of speech: noun, verb, adverb, adjective, and others. Each dimension of the hyper-space carries a different weight. To compute the similarity between two documents we have developed a new hyper-cosine formula. Some classification experiments are presented as validation cases.
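A sketch of the hyper-vector idea under one plausible reading: each document is split into five part-of-speech dimensions, each holding its own VSM vector, and similarity is a weighted combination of per-dimension cosines. The dimension weights and the combination rule are illustrative assumptions, not the paper's exact hyper-cosine formula.

```python
import math
from collections import Counter

POS_DIMS = ["noun", "verb", "adverb", "adjective", "other"]
# Assumed weights; the paper learns/assigns its own per-dimension weights.
POS_WEIGHTS = {"noun": 0.4, "verb": 0.25, "adverb": 0.1,
               "adjective": 0.15, "other": 0.1}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyper_vector(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs from a POS tagger."""
    hv = {dim: Counter() for dim in POS_DIMS}
    for word, pos in tagged_tokens:
        hv[pos if pos in POS_DIMS else "other"][word] += 1
    return hv

def hyper_cosine(hv1, hv2):
    # Weighted sum of the per-dimension cosine similarities.
    return sum(POS_WEIGHTS[d] * cosine(hv1[d], hv2[d]) for d in POS_DIMS)

doc1 = hyper_vector([("cat", "noun"), ("runs", "verb"), ("fast", "adverb")])
doc2 = hyper_vector([("cat", "noun"), ("sleeps", "verb")])
print(hyper_cosine(doc1, doc2))
```

The point of the split is that a noun overlap and an adverb overlap no longer count the same: each part of speech contributes to the final similarity in proportion to its weight.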


SinkrOn ◽  
2021 ◽  
Vol 6 (1) ◽  
pp. 69-79
Author(s):  
Bita Parga Zen ◽  
Irwan Susanto ◽  
Dian Finaliamartha

Advances in information technology have made internet use a matter of broad public interest. Online news sites are one technology that has developed as a means of disseminating the latest information around the world. In terms of sheer numbers, readers have ample access to the information they want; however, the volume of information collected leads to an information explosion and possible redundancy. A search system is one solution that is expected to help users find information relevant to an input query. The methods commonly used for this are TF-IDF and the VSM (Vector Space Model), applied here to weight and rank a collection of kompas.com news documents about the COVID-19 vaccine. The pipeline first tokenizes the text to separate it into terms, then applies stopword removal (filtering) to discard uninformative words, which usually include conjunctions and the like, and then stems each sentence to reduce inflected words to their base forms. The TF-IDF and VSM calculations were then carried out, and the final ranking is: news document 3 (DOC 3) with a weight of 5.914226424; news document 2 (DOC 2) with a weight of 1.767692186; news document 5 (DOC 5) with a weight of 1.550165096; news document 4 (DOC 4) with a weight of 1.17141223; and, last, news document 1 (DOC 1) with a weight of 0.5244103739.
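A minimal sketch of the TF-IDF + VSM pipeline described above, using scikit-learn. The document texts and query are placeholders, the built-in preprocessing stands in for the tokenizing/stopword-removal/stemming steps, and cosine ranking is one common VSM variant (the abstract's weights suggest summed query-term weights rather than cosine, so treat this as the general technique, not the paper's exact computation).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "dokumen berita 1 tentang vaksin covid",        # placeholder texts
    "dokumen berita 2 tentang distribusi vaksin",
    "dokumen berita 3 vaksin covid kompas",
]
query = "vaksin covid"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)   # TF-IDF document weights
query_vec = vectorizer.transform([query])     # query in the same space

# Rank documents by cosine similarity to the query vector.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
for i in scores.argsort()[::-1]:
    print(f"DOC {i + 1}: {scores[i]:.4f}")
```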


2009 ◽  
Vol 18 (02) ◽  
pp. 239-272 ◽  
Author(s):  
SUJEEVAN ASEERVATHAM

Kernels are widely used in Natural Language Processing as similarity measures within inner-product-based learning methods such as the Support Vector Machine. The Vector Space Model (VSM) is extensively used for the spatial representation of documents; however, it is a purely statistical representation. In this paper, we present a Concept Vector Space Model (CVSM) representation which uses linguistic prior knowledge to capture the meanings of documents. We also propose a linear kernel and a latent kernel for this space. The linear kernel takes advantage of the linguistic concepts, whereas the latent kernel combines statistical and linguistic concepts: it uses latent concepts extracted by Latent Semantic Analysis (LSA) in the CVSM. The kernels were evaluated on a text categorization task in the biomedical domain, using the Ohsumed corpus, well known for being difficult to categorize. The results show that the CVSM improves performance compared to the VSM.
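A sketch of the two kernels' shapes: a linear kernel taking inner products directly in the concept space, and a latent kernel taking inner products after LSA (truncated SVD) projects the concept space onto latent concepts. The concept-extraction step is abstracted away; the matrix here is a random placeholder for document-by-concept counts.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Rows = documents, columns = linguistic concepts; in practice these
# counts come from mapping document terms to prior-knowledge concepts.
X = rng.poisson(1.0, size=(20, 50)).astype(float)

def linear_kernel(A, B):
    # Inner products in the linguistic-concept vector space.
    return A @ B.T

svd = TruncatedSVD(n_components=10)   # latent concepts via LSA
Z = svd.fit_transform(X)

def latent_kernel(Za, Zb):
    # Inner products in the LSA-reduced space, combining statistical
    # (latent) structure with the linguistic concepts.
    return Za @ Zb.T

K_lin = linear_kernel(X, X)
K_lat = latent_kernel(Z, Z)
print(K_lin.shape, K_lat.shape)   # both (20, 20) Gram matrices
```

Either Gram matrix can be handed to a kernel method such as an SVM as a precomputed kernel.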


2019 ◽  
Vol 11 (5) ◽  
pp. 114 ◽  
Author(s):  
Korawit Orkphol ◽  
Wu Yang

Words have different meanings (i.e., senses) depending on the context, and disambiguating the correct sense is an important and challenging task for natural language processing. An intuitive approach is to select the sense whose definition is most similar to the context, using definitions from WordNet, a large lexical database of English in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms interlinked through conceptual-semantic and lexical relations. Traditional unsupervised approaches compute similarity by counting words that overlap exactly between the context and the sense definitions. Similarity should instead be computed from how words are related: the context and sense definitions can be represented in a vector space model and their distributional semantic relationships analyzed with latent semantic analysis (LSA). However, as a corpus of text grows, LSA consumes much more memory and does not scale flexibly to training on a huge corpus. A word-embedding approach has an advantage here. Word2vec is a popular word-embedding approach that represents words in a fixed-size vector space through either the skip-gram or continuous bag-of-words (CBOW) model, and it captures semantic and syntactic word similarities from a huge corpus more effectively than LSA. Our method uses Word2vec to construct a context sentence vector and sense definition vectors, then scores each word sense by the cosine similarity between those sentence vectors. The sense definitions are also expanded with sense relations retrieved from WordNet. If a score does not exceed a specific threshold, it is combined with the probability of that sense's distribution learned from SemCor, a large sense-tagged corpus. The highest-scoring senses are taken as the possible answers. Our method's result (50.9%, or 48.7% without the probability of the sense distribution) is higher than the baselines (the original, simplified, adapted, and LSA Lesk) and outperforms many unsupervised systems that participated in the SENSEVAL-3 English lexical sample task.
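A sketch of the scoring step: context and gloss sentences are mapped to vectors by averaging their word embeddings, compared with cosine similarity, and low scores are combined with a sense prior. The model file name, the threshold value, and the multiplicative combination rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical pretrained embeddings in word2vec binary format.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def sentence_vector(tokens):
    # Average the embeddings of in-vocabulary tokens.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def score_sense(context_tokens, gloss_tokens, sense_prior, threshold=0.5):
    """gloss_tokens: the sense definition, expanded with WordNet relations;
    sense_prior: that sense's probability estimated from SemCor."""
    score = cosine(sentence_vector(context_tokens),
                   sentence_vector(gloss_tokens))
    if score <= threshold:
        # Back off to the sense distribution when the signal is weak.
        score = score * sense_prior
    return score
```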


2011 ◽  
Vol 17 (1) ◽  
pp. 37-70 ◽  
Author(s):  
PETER A. CHEW ◽  
BRETT W. BADER ◽  
STEPHEN HELMREICH ◽  
AHMED ABDELALI ◽  
STEPHEN J. VERZI

In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a ‘standard’ approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do.
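For a concrete feel of information-theoretic term weighting, here is a sketch of log-entropy weighting, a standard scheme in LSA-style models in which a term's global weight shrinks as its distribution over documents approaches uniform (i.e., as it carries less information). This illustrates the general idea only; it is not necessarily the exact variant used in the article.

```python
import numpy as np

def log_entropy(counts):
    """counts: term-by-document matrix of raw frequencies."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    # p[i, j]: probability that an occurrence of term i falls in doc j.
    p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    # Global weight in [0, 1]: 1 for a term confined to one document,
    # near 0 for a term spread uniformly across all documents.
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    # Local weight log(1 + tf), scaled by the global weight.
    return np.log(counts + 1.0) * global_w[:, None]

X = [[3, 0, 0],   # term concentrated in one document: full weight
     [1, 1, 1],   # term spread uniformly: weight driven to zero
     [0, 2, 1]]
print(log_entropy(X))
```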


2020 ◽  
Author(s):  
Lasarus Pelipus Malese

As the value of information grows and the number of information sources multiplies, so does the need to find information suited to one's purposes quickly. Information Retrieval is the search for information (usually documents) based on a query (user input), with the aim of satisfying the user's need from the available documents. Two important aspects of Information Retrieval as applied in search engine design are the representation of information and a measure that quantifies the similarity between two objects. Information can be represented as objects in various heterogeneous forms and models, so a search for a desired information object may map onto several information objects judged relevant. The relevance of two pieces of information is measured from the presence of keywords and their weights. A logical consequence is that, in searching for a desired object, there is uncertainty between the keywords the user supplies in the query and the keywords present in the documents. This study focuses on the effectiveness of the vector model, using the Porter stemming algorithm to reduce words to their base forms, term-frequency weighting to determine the importance of each index term in a document, and the cosine similarity function to measure the similarity between a query and a document. Testing the average precision and recall shows that all forms of document search query achieve 100% precision, meaning that searches by content, by title, and by whole document all have good precision, whereas the recall results differ: searching by document-content query achieves the highest recall at 90%, searching by document-title query yields 70% recall, and searching by whole-document query is lowest at 33% recall.
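A minimal sketch of the precision/recall evaluation described above: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The document IDs are placeholders chosen to reproduce the 100%-precision, 90%-recall case.

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# E.g., a content-based query that retrieves only relevant documents
# (precision 100%) but misses one relevant document (recall 90%).
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"],
    relevant=["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"],
)
print(f"precision={p:.0%} recall={r:.0%}")
```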


2014 ◽  
Vol 556-562 ◽  
pp. 3536-3540
Author(s):  
Ya Xiong Li ◽  
Deng Pan

One key step in text mining is the categorization of texts, i.e., putting texts with the same or similar content into one group so as to distinguish texts with different content. However, traditional word-frequency-based statistical approaches, such as the VSM, fail to reflect the complex meanings of texts. This paper introduces domain ontology and constructs a new conceptual vector space model in the pre-processing stage of text clustering, substituting a concept-text matrix for the initial matrix (the lexicon-text matrix) of latent semantic analysis. In the clustering analysis stage, the model adopts semantic similarity, partially overcoming the difficulty of accurately and effectively evaluating text similarity when only the frequencies of words and/or phrases are taken into account. Experimental results indicate that this method helps improve text clustering results.
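A sketch of the pre-processing idea: map terms to domain-ontology concepts to build a concept-text matrix, then apply LSA to it and cluster the reduced representations. The toy term-to-concept mapping, the corpus, and the cluster count are illustrative placeholders.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Toy domain ontology: each term maps to a concept.
TERM_TO_CONCEPT = {"car": "vehicle", "truck": "vehicle",
                   "dog": "animal", "cat": "animal",
                   "apple": "fruit", "pear": "fruit"}
CONCEPTS = sorted(set(TERM_TO_CONCEPT.values()))

def concept_vector(tokens):
    # Count concept occurrences instead of raw word occurrences.
    vec = np.zeros(len(CONCEPTS))
    for t in tokens:
        if t in TERM_TO_CONCEPT:
            vec[CONCEPTS.index(TERM_TO_CONCEPT[t])] += 1
    return vec

docs = [["car", "truck", "car"], ["dog", "cat"],
        ["apple", "pear", "pear"], ["truck", "car"]]
X = np.array([concept_vector(d) for d in docs])    # concept-text matrix

Z = TruncatedSVD(n_components=2).fit_transform(X)  # LSA on concepts
labels = KMeans(n_clusters=3, n_init=10).fit_predict(Z)
print(labels)   # documents about the same concept land in one cluster
```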

