An information-theoretic, vector-space-model approach to cross-language information retrieval
Abstract
In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a ‘standard’ approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back several decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that it is no coincidence that our proposals yield empirical improvements: they increase the theoretical transparency of VSM approaches to IR, and thereby help shed light on why aspects of these approaches work as they do.
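To make the ‘standard’ baseline concrete, the following is a minimal sketch (in Python, using NumPy and SciPy) of an LSA pipeline with information-theoretic term weighting: log-entropy weighting, a classic IT-based scheme commonly paired with LSA, followed by a truncated SVD of the term-by-document matrix. The toy matrix, rank, and function names here are illustrative assumptions, not the article's actual data or implementation.

import numpy as np
from scipy.sparse.linalg import svds

def log_entropy_weight(counts):
    """Log-entropy weighting: local weight log(1 + tf), global weight
    derived from the entropy of each term's distribution over documents
    (terms concentrated in few documents carry more information)."""
    n_docs = counts.shape[1]
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    p = counts / totals  # p_ij: share of term i's occurrences in document j
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    # global weight in [0, 1]: 1 minus normalized entropy of the term
    g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return g[:, None] * np.log1p(counts)

# toy term-by-document matrix (rows = terms, columns = documents)
X = np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 1, 2, 2],
    [0, 0, 4, 1],
], dtype=float)

W = log_entropy_weight(X)

# rank-k truncated SVD projects documents into the latent semantic space
k = 2
U, s, Vt = svds(W, k=k)
doc_vectors = (s[:, None] * Vt).T  # one k-dimensional row per document
print(doc_vectors)

In a cross-language setting of the kind the article describes, the same machinery applies once the matrix (or, for PARAFAC2, the term-by-document-by-language tensor) spans documents in several languages; the clustering evaluation would then operate on the resulting document vectors.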