Pembentukan Vector Space Model Bahasa Indonesia Menggunakan Metode Word to Vector (Constructing an Indonesian Vector Space Model Using the Word-to-Vector Method)

2019 ◽  
Vol 10 (1) ◽  
pp. 29
Author(s):  
Yulius Denny Prabowo ◽  
Tedi Lesmana Marselino ◽  
Meylisa Suryawiguna

Extracting information from a large collection of structured data requires expensive computation. The Vector Space Model works by mapping words into a continuous vector space in which semantically similar words are mapped to nearby points. The model assumes that words appearing in the same contexts share the same semantic meaning. In practice there are two different approaches: count-based methods (e.g., Latent Semantic Analysis) and predictive methods (e.g., the Neural Probabilistic Language Model). This study applies the Word2Vec method with the Continuous Bag of Words (CBOW) approach to the Indonesian language. The research data were obtained by crawling several online news portals. The expected result of the research is a vector mapping of Indonesian words based on the data used.

Keywords: vector space model, word to vector, Indonesian vector space model.
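A minimal sketch of the training step this abstract describes, using gensim's Word2Vec with the CBOW architecture; the corpus file name, its one-tokenized-sentence-per-line format, and the hyperparameter values are assumptions for illustration, not the study's actual settings.

```python
# Train a CBOW Word2Vec model on crawled Indonesian news text (assumed
# to be stored one tokenized, lowercased sentence per line).
from gensim.models import Word2Vec

with open("berita_crawl.txt", encoding="utf-8") as f:  # hypothetical file
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context words on each side of the target word
    min_count=5,      # ignore rare words
    sg=0,             # sg=0 selects the CBOW architecture
)

# Semantically similar words should be mapped to nearby vectors.
print(model.wv.most_similar("pemerintah", topn=5))
```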

Author(s):  
Lucian Nicolae Vintan ◽  
Daniel Ionel Morariu ◽  
Radu George Cretulescu ◽  
Maria Vintan

In this paper we present a new approach to document representation for use in classification and/or clustering algorithms. We start from the classical "bag-of-words" representation but augment each word with its corresponding part of speech. We thus introduce a new concept called hyper-vectors, in which each document is represented in a hyper-space whose dimensions are the different part-of-speech components. Within each dimension the document is represented using the Vector Space Model (VSM). In this work we use only five parts of speech: noun, verb, adverb, adjective, and others. Each dimension of the hyper-space carries a different weight. To compute the similarity between two documents we have developed a new hyper-cosine formula. Some classification experiments are presented as validation cases.
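A sketch of the hyper-vector idea under one plausible reading: each document is split into five part-of-speech dimensions, each holding its own VSM vector, and similarity is a weighted combination of per-dimension cosines. The dimension weights and the combination rule are illustrative assumptions, not the paper's exact hyper-cosine formula.

```python
import math
from collections import Counter

POS_DIMS = ["noun", "verb", "adverb", "adjective", "other"]
# Assumed weights; the paper learns/assigns its own per-dimension weights.
POS_WEIGHTS = {"noun": 0.4, "verb": 0.25, "adverb": 0.1,
               "adjective": 0.15, "other": 0.1}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyper_vector(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs from a POS tagger."""
    hv = {dim: Counter() for dim in POS_DIMS}
    for word, pos in tagged_tokens:
        hv[pos if pos in POS_DIMS else "other"][word] += 1
    return hv

def hyper_cosine(hv1, hv2):
    # Weighted sum of the per-dimension cosine similarities.
    return sum(POS_WEIGHTS[d] * cosine(hv1[d], hv2[d]) for d in POS_DIMS)

doc1 = hyper_vector([("cat", "noun"), ("runs", "verb"), ("fast", "adverb")])
doc2 = hyper_vector([("cat", "noun"), ("sleeps", "verb")])
print(hyper_cosine(doc1, doc2))
```

The point of the split is that a noun overlap and an adverb overlap no longer count the same: each part of speech contributes to the final similarity in proportion to its weight.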


SinkrOn ◽  
2021 ◽  
Vol 6 (1) ◽  
pp. 69-79
Author(s):  
Bita Parga Zen ◽  
Irwan Susanto ◽  
Dian Finaliamartha

Advances in information technology have made internet use a matter of broad public interest. Online news sites are one technology that has developed as a means of disseminating the latest information around the world. In terms of sheer numbers, readers have ample access to the information they want; however, the volume of information collected leads to an information explosion and possible redundancy. A search system is one solution that is expected to help users find information relevant to an input query. The methods commonly used for this are TF-IDF and the VSM (Vector Space Model), applied here to weight and rank a collection of kompas.com news documents about the COVID-19 vaccine. The pipeline first tokenizes the text to separate it into terms, then applies stopword removal (filtering) to discard uninformative words, which usually include conjunctions and the like, and then stems each sentence to reduce inflected words to their base forms. The TF-IDF and VSM calculations were then carried out, and the final ranking is: news document 3 (DOC 3) with a weight of 5.914226424; news document 2 (DOC 2) with a weight of 1.767692186; news document 5 (DOC 5) with a weight of 1.550165096; news document 4 (DOC 4) with a weight of 1.17141223; and, last, news document 1 (DOC 1) with a weight of 0.5244103739.
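A minimal sketch of the TF-IDF + VSM pipeline described above, using scikit-learn. The document texts and query are placeholders, the built-in preprocessing stands in for the tokenizing/stopword-removal/stemming steps, and cosine ranking is one common VSM variant (the abstract's weights suggest summed query-term weights rather than cosine, so treat this as the general technique, not the paper's exact computation).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "dokumen berita 1 tentang vaksin covid",        # placeholder texts
    "dokumen berita 2 tentang distribusi vaksin",
    "dokumen berita 3 vaksin covid kompas",
]
query = "vaksin covid"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)   # TF-IDF document weights
query_vec = vectorizer.transform([query])     # query in the same space

# Rank documents by cosine similarity to the query vector.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
for i in scores.argsort()[::-1]:
    print(f"DOC {i + 1}: {scores[i]:.4f}")
```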


2009 ◽  
Vol 18 (02) ◽  
pp. 239-272 ◽  
Author(s):  
SUJEEVAN ASEERVATHAM

Kernels are widely used in Natural Language Processing as similarity measures within inner-product-based learning methods such as the Support Vector Machine. The Vector Space Model (VSM) is extensively used for the spatial representation of documents; however, it is a purely statistical representation. In this paper, we present a Concept Vector Space Model (CVSM) representation which uses linguistic prior knowledge to capture the meanings of documents. We also propose a linear kernel and a latent kernel for this space. The linear kernel takes advantage of the linguistic concepts, whereas the latent kernel combines statistical and linguistic concepts: it uses latent concepts extracted by Latent Semantic Analysis (LSA) in the CVSM. The kernels were evaluated on a text categorization task in the biomedical domain, using the Ohsumed corpus, well known for being difficult to categorize. The results show that the CVSM improves performance compared to the VSM.
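A sketch of the two kernels' shapes: a linear kernel taking inner products directly in the concept space, and a latent kernel taking inner products after LSA (truncated SVD) projects the concept space onto latent concepts. The concept-extraction step is abstracted away; the matrix here is a random placeholder for document-by-concept counts.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Rows = documents, columns = linguistic concepts; in practice these
# counts come from mapping document terms to prior-knowledge concepts.
X = rng.poisson(1.0, size=(20, 50)).astype(float)

def linear_kernel(A, B):
    # Inner products in the linguistic-concept vector space.
    return A @ B.T

svd = TruncatedSVD(n_components=10)   # latent concepts via LSA
Z = svd.fit_transform(X)

def latent_kernel(Za, Zb):
    # Inner products in the LSA-reduced space, combining statistical
    # (latent) structure with the linguistic concepts.
    return Za @ Zb.T

K_lin = linear_kernel(X, X)
K_lat = latent_kernel(Z, Z)
print(K_lin.shape, K_lat.shape)   # both (20, 20) Gram matrices
```

Either Gram matrix can be handed to a kernel method such as an SVM as a precomputed kernel.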


2019 ◽  
Vol 11 (5) ◽  
pp. 114 ◽  
Author(s):  
Korawit Orkphol ◽  
Wu Yang

Words have different meanings (i.e., senses) depending on the context, and disambiguating the correct sense is an important and challenging task for natural language processing. An intuitive approach is to select the sense whose definition is most similar to the context, using definitions from WordNet, a large lexical database of English in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms interlinked through conceptual-semantic and lexical relations. Traditional unsupervised approaches compute similarity by counting words that overlap exactly between the context and the sense definitions. Similarity should instead be computed from how words are related: the context and sense definitions can be represented in a vector space model and their distributional semantic relationships analyzed with latent semantic analysis (LSA). However, as a corpus of text grows, LSA consumes much more memory and does not scale flexibly to training on a huge corpus. A word-embedding approach has an advantage here. Word2vec is a popular word-embedding approach that represents words in a fixed-size vector space through either the skip-gram or continuous bag-of-words (CBOW) model, and it captures semantic and syntactic word similarities from a huge corpus more effectively than LSA. Our method uses Word2vec to construct a context sentence vector and sense definition vectors, then scores each word sense by the cosine similarity between those sentence vectors. The sense definitions are also expanded with sense relations retrieved from WordNet. If a score does not exceed a specific threshold, it is combined with the probability of that sense's distribution learned from SemCor, a large sense-tagged corpus. The highest-scoring senses are taken as the possible answers. Our method's result (50.9%, or 48.7% without the probability of the sense distribution) is higher than the baselines (the original, simplified, adapted, and LSA Lesk) and outperforms many unsupervised systems that participated in the SENSEVAL-3 English lexical sample task.
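A sketch of the scoring step: context and gloss sentences are mapped to vectors by averaging their word embeddings, compared with cosine similarity, and low scores are combined with a sense prior. The model file name, the threshold value, and the multiplicative combination rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical pretrained embeddings in word2vec binary format.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def sentence_vector(tokens):
    # Average the embeddings of in-vocabulary tokens.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def score_sense(context_tokens, gloss_tokens, sense_prior, threshold=0.5):
    """gloss_tokens: the sense definition, expanded with WordNet relations;
    sense_prior: that sense's probability estimated from SemCor."""
    score = cosine(sentence_vector(context_tokens),
                   sentence_vector(gloss_tokens))
    if score <= threshold:
        # Back off to the sense distribution when the signal is weak.
        score = score * sense_prior
    return score
```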


2011 ◽  
Vol 17 (1) ◽  
pp. 37-70 ◽  
Author(s):  
PETER A. CHEW ◽  
BRETT W. BADER ◽  
STEPHEN HELMREICH ◽  
AHMED ABDELALI ◽  
STEPHEN J. VERZI

In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a ‘standard’ approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do.
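For a concrete feel of information-theoretic term weighting, here is a sketch of log-entropy weighting, a standard scheme in LSA-style models in which a term's global weight shrinks as its distribution over documents approaches uniform (i.e., as it carries less information). This illustrates the general idea only; it is not necessarily the exact variant used in the article.

```python
import numpy as np

def log_entropy(counts):
    """counts: term-by-document matrix of raw frequencies."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    # p[i, j]: probability that an occurrence of term i falls in doc j.
    p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    # Global weight in [0, 1]: 1 for a term confined to one document,
    # near 0 for a term spread uniformly across all documents.
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    # Local weight log(1 + tf), scaled by the global weight.
    return np.log(counts + 1.0) * global_w[:, None]

X = [[3, 0, 0],   # term concentrated in one document: full weight
     [1, 1, 1],   # term spread uniformly: weight driven to zero
     [0, 2, 1]]
print(log_entropy(X))
```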


2020 ◽  
Author(s):  
Lasarus Pelipus Malese

As the value of information grows and the number of information sources multiplies, so does the need to find information suited to one's purposes quickly. Information Retrieval is the search for information (usually documents) based on a query (user input), with the aim of satisfying the user's need from the available documents. Two important aspects of Information Retrieval as applied in search engine design are the representation of information and a measure that quantifies the similarity between two objects. Information can be represented as objects in various heterogeneous forms and models, so a search for a desired information object may map onto several information objects judged relevant. The relevance of two pieces of information is measured from the presence of keywords and their weights. A logical consequence is that, in searching for a desired object, there is uncertainty between the keywords the user supplies in the query and the keywords present in the documents. This study focuses on the effectiveness of the vector model, using the Porter stemming algorithm to reduce words to their base forms, term-frequency weighting to determine the importance of each index term in a document, and the cosine similarity function to measure the similarity between a query and a document. Testing the average precision and recall shows that all forms of document search query achieve 100% precision, meaning that searches by content, by title, and by whole document all have good precision, whereas the recall results differ: searching by document-content query achieves the highest recall at 90%, searching by document-title query yields 70% recall, and searching by whole-document query is lowest at 33% recall.
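A minimal sketch of the precision/recall evaluation described above: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The document IDs are placeholders chosen to reproduce the 100%-precision, 90%-recall case.

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# E.g., a content-based query that retrieves only relevant documents
# (precision 100%) but misses one relevant document (recall 90%).
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"],
    relevant=["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"],
)
print(f"precision={p:.0%} recall={r:.0%}")
```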


2014 ◽  
Vol 556-562 ◽  
pp. 3536-3540
Author(s):  
Ya Xiong Li ◽  
Deng Pan

One key step in text mining is the categorization of texts, i.e., putting texts with the same or similar content into one group so as to distinguish texts with different content. However, traditional word-frequency-based statistical approaches, such as the VSM, fail to reflect the complex meanings of texts. This paper introduces domain ontology and constructs a new conceptual vector space model in the pre-processing stage of text clustering, substituting a concept-text matrix for the initial matrix (the lexicon-text matrix) of latent semantic analysis. In the clustering analysis stage, the model adopts semantic similarity, partially overcoming the difficulty of accurately and effectively evaluating text similarity when only the frequencies of words and/or phrases are taken into account. Experimental results indicate that this method helps improve text clustering results.
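A sketch of the pre-processing idea: map terms to domain-ontology concepts to build a concept-text matrix, then apply LSA to it and cluster the reduced representations. The toy term-to-concept mapping, the corpus, and the cluster count are illustrative placeholders.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Toy domain ontology: each term maps to a concept.
TERM_TO_CONCEPT = {"car": "vehicle", "truck": "vehicle",
                   "dog": "animal", "cat": "animal",
                   "apple": "fruit", "pear": "fruit"}
CONCEPTS = sorted(set(TERM_TO_CONCEPT.values()))

def concept_vector(tokens):
    # Count concept occurrences instead of raw word occurrences.
    vec = np.zeros(len(CONCEPTS))
    for t in tokens:
        if t in TERM_TO_CONCEPT:
            vec[CONCEPTS.index(TERM_TO_CONCEPT[t])] += 1
    return vec

docs = [["car", "truck", "car"], ["dog", "cat"],
        ["apple", "pear", "pear"], ["truck", "car"]]
X = np.array([concept_vector(d) for d in docs])    # concept-text matrix

Z = TruncatedSVD(n_components=2).fit_transform(X)  # LSA on concepts
labels = KMeans(n_clusters=3, n_init=10).fit_predict(Z)
print(labels)   # documents about the same concept land in one cluster
```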

