Clustering of Authors’ Texts of English Fiction in the Vector Space of Semantic Fields

2014 ◽  
Vol 14 (3) ◽  
pp. 25-36
Author(s):  
Bohdan Pavlyshenko

This paper analyzes whether an author’s idiolect can be differentiated in the space of semantic fields. It also analyzes the clustering of text documents in the vector space of semantic fields and in a semantic space with an orthogonal basis. The analysis showed that a vector space model built on semantic fields is efficient for cluster analysis of authors’ texts in English fiction. The distribution of authors’ texts across the cluster structure revealed areas of the semantic space that represent the idiolects of individual authors; these areas correspond to clusters in which a single author dominates. Clusters in which the texts of several authors dominate can be regarded as areas of semantic similarity between authors’ styles. SVD factorization of the semantic fields matrix makes it possible to reduce significantly the dimension of the semantic space used in the cluster analysis. Clustering in the semantic field vector space can therefore be efficient for the comparative analysis of authors’ styles and idiolects. The clusters of some authors’ idiolects are semantically invariant: they do not depend on changes in the basis of the semantic space or in the clustering method.
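The representation described above can be illustrated with a minimal sketch: each text is mapped to a vector of relative semantic-field frequencies, and texts are compared by cosine similarity. The lexicon `SEMANTIC_FIELDS`, the helper names, and the sample texts are hypothetical; the abstract does not list the fields or the clustering algorithm actually used, and the SVD step is not shown here.

```python
import math
from collections import Counter

# Hypothetical semantic-field lexicon (field name -> member words).
# The fields used in the paper are not given in the abstract.
SEMANTIC_FIELDS = {
    "motion": {"run", "walk", "ride", "travel"},
    "emotion": {"love", "fear", "joy", "grief"},
    "nature": {"river", "forest", "sea", "storm"},
}
FIELD_ORDER = sorted(SEMANTIC_FIELDS)  # fixed axis order of the vector space


def field_vector(text):
    """Return the relative frequency of each semantic field in the text."""
    tokens = text.lower().split()
    counts = Counter()
    for token in tokens:
        for field, words in SEMANTIC_FIELDS.items():
            if token in words:
                counts[field] += 1
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[field] / total for field in FIELD_ORDER]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vec_a = field_vector("love and fear ride the storm")
vec_b = field_vector("grief and joy travel the sea")
vec_c = field_vector("walk run ride travel")
```

In this toy data, `vec_a` and `vec_b` have the same field profile even though they share no content words, while `vec_c` lies entirely on the motion axis. A clustering algorithm applied to such vectors would group texts by field profile rather than by exact vocabulary.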

2020 ◽  
Vol 11 (2) ◽  
pp. 268-277
Author(s):  
Susanti Susanti ◽  
Muhammad Azmi ◽  
Edwar Ali ◽  
Rahmaddeni Rahmaddeni ◽  
Yansyah Saputra Wijaya

The development of information technology in the current era of globalization is changing every aspect of our lives, and its influence cannot be avoided. Obtaining the data and information we want is not easy, given how much information is available for various purposes and in various styles of presentation. For searching data on a computer, whether online or offline, many methods have been developed that improve search results, which also increases user satisfaction in finding information. A commonly used search method is the Boolean Model. Another is the Vector Space Model (VSM), a model used to measure the similarity between a document and a keyword query. The authors therefore compare the two methods in terms of search speed (time) and number of findings. Speed is measured as the search time of each method. The results show that the Boolean model is faster, by a difference of 30 to 50 seconds. For the number of findings, the vector space model returns the same findings as the Boolean model with the OR operator, whereas with the AND operator, and with a combination of AND and OR, the number of findings differs from that of the vector space model.

Keywords: Comparison, Boolean Model, Vector Space Model, Search, Text Documents
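The reported relationship between the two result sets can be reproduced on a toy collection. The sketch below is not the authors' implementation: the documents in `DOCS`, the function names, and the use of raw term frequencies with cosine similarity are all assumptions made for illustration, and no timing is measured.

```python
import math
from collections import Counter

# Hypothetical toy collection.
DOCS = {
    "d1": "boolean model search",
    "d2": "vector space model",
    "d3": "document search engine",
    "d4": "cooking recipes",
}


def tokenize(text):
    return text.lower().split()


def boolean_search(query_terms, operator):
    """Return ids of documents matching all ('and') or any ('or') query terms."""
    match = all if operator == "and" else any
    return {
        doc_id
        for doc_id, text in DOCS.items()
        if match(term in tokenize(text) for term in query_terms)
    }


def vsm_search(query_terms):
    """Rank documents with non-zero cosine similarity to the query."""
    query_vec = Counter(query_terms)
    query_norm = math.sqrt(sum(v * v for v in query_vec.values()))
    scores = {}
    for doc_id, text in DOCS.items():
        doc_vec = Counter(tokenize(text))
        dot = sum(query_vec[t] * doc_vec[t] for t in query_vec)
        if dot > 0:
            doc_norm = math.sqrt(sum(v * v for v in doc_vec.values()))
            scores[doc_id] = dot / (query_norm * doc_norm)
    return sorted(scores, key=scores.get, reverse=True)


query = ["model", "search"]
or_hits = boolean_search(query, "or")
and_hits = boolean_search(query, "and")
vsm_hits = vsm_search(query)
```

Any document sharing at least one term with the query gets a non-zero cosine score, so an unthresholded VSM search retrieves the same set as Boolean OR; the AND query is stricter and returns only `d1` here. The VSM additionally orders the results, placing `d1` first.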


2014 ◽  
Vol 596 ◽  
pp. 79-82
Author(s):  
Zhong Qiu Ding

The restoration of regular paper fragments can be solved by edge matching. Edge matching relies primarily on the length and location of the breaks in the writing; however, the edges of some regular paper fragments may be blank, so restoration errors are inevitable. In such cases, manual intervention based on the article content is needed to eliminate the errors. Before the mathematical model is established, the regular paper fragments are binarized with Matlab; a vector space model is then established and an edge contrast matrix is constructed, and the order of the fragments is obtained through Q cluster analysis. Finally, the original picture can be obtained by splicing and restoring the fragments in that order.


Author(s):  
Đorđe Petrović ◽  
Milena Stanković

Text mining depends to a great extent on the various text preprocessing techniques. The preprocessing methods and tools used to prepare texts for further mining can be divided into those which are and those which are not language-dependent. The subject matter of this research was the analysis of the influence of these methods and tools on further text mining. We first focused on the analysis of the influence on the reduction of the vector space model for the multidimensional representation of text documents. We then analyzed the influence on calculating text similarity, which is the focus of this research. The conclusion we reached is that the implementation of various text preprocessing methods in the Serbian language, which are used for the reduction of the vector space model for the multidimensional representation of text documents, achieves the required results. However, the implementation of various text preprocessing methods specific to the Serbian language for the purpose of calculating text similarity can lead to great differences in the results.


Author(s):  
Anthony Anggrawan ◽  
Azhari

Information searching based on a user’s query, with the aim of finding the documents that match the user’s needs, is known as Information Retrieval. This research uses the Vector Space Model method to determine the similarity percentage of each student’s assignment. The application is built with PHP and a MySQL database. The results are presented by ranking the similarity of the documents to the query, with a mean average precision of 0.874. This value indicates how closely the application agrees with the examination done by experts; it was obtained from an evaluation with 5 queries compared against 25 sample documents. The time needed to compute similarity depends on the number of assignments submitted: the more assignments, the longer the computation takes.
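Mean average precision, the metric quoted above, can be computed as in the following sketch. It uses the standard definition (precision averaged at the rank of each relevant document, then averaged over queries); whether the authors used exactly this variant is not stated in the abstract, and the rankings and relevance judgments below are invented for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k taken at each rank k that holds a relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical system rankings and expert relevance judgments for two queries.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```

A value close to 1 means the relevant documents are concentrated at the top of each ranking; relevant documents that are never retrieved contribute zero to the sum but still count in the denominator.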


2018 ◽  
Vol 9 (2) ◽  
pp. 97-105
Author(s):  
Richard Firdaus Oeyliawan ◽  
Dennis Gunawan

A library is one of the facilities that provide information and knowledge resources and act as an academic aid for readers. The huge number of books a library holds usually makes it difficult for readers to find the books they need. Universitas Multimedia Nusantara uses the Senayan Library Management System (SLiMS) as its library catalogue. SLiMS has many features that help readers, but it still has no recommendation feature to help readers find books relevant to a specific book they choose. An application has been developed that uses the Vector Space Model to represent documents as vectors. Recommendations in this application are based on the similarity of the book descriptions. In the testing phase, using a one-language sample of relevant books, the F-Measure obtained is 55% with a cosine similarity threshold of 0.1. The book descriptions and the variety of languages affect the F-Measure obtained. Index Terms: Book Recommendation, Porter Stemmer, SLiMS Universitas Multimedia Nusantara, TF-IDF, Vector Space Model
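A description-based recommender of this kind can be sketched as follows. This is not the SLiMS application: the book descriptions in `BOOKS`, the function names, and the TF-IDF weighting `tf * log(N / df)` are assumptions, and the Porter stemming step named in the index terms is omitted. Only the cosine threshold of 0.1 comes from the abstract.

```python
import math
from collections import Counter

# Hypothetical book descriptions.
BOOKS = {
    "b1": "information retrieval and search engines",
    "b2": "modern information retrieval",
    "b3": "italian cooking for beginners",
}


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors with weight tf * log(N / df) (assumed variant)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized.values() for t in set(tokens))
    return {
        doc_id: {t: c * math.log(n_docs / doc_freq[t]) for t, c in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def recommend(book_id, docs, threshold=0.1):
    """Return other books whose similarity to book_id is at least the threshold."""
    vectors = tfidf_vectors(docs)
    target = vectors[book_id]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != book_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, score in scored if score >= threshold]


similar_to_b1 = recommend("b1", BOOKS)
```

With these descriptions, `b2` shares two terms with `b1` and clears the 0.1 threshold, while `b3` shares none and scores zero. Raising the threshold trades recall for precision, which is one reason the F-Measure depends on the chosen cutoff.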


1985 ◽  
Vol 8 (2) ◽  
pp. 253-267
Author(s):  
S.K.M. Wong ◽  
Wojciech Ziarko

In information retrieval, it is common to model index terms and documents as vectors in a suitably defined vector space. The main difficulty with this approach is that the explicit representation of term vectors is not known a priori. For this reason, the vector space model adopted by Salton for the SMART system treats the terms as a set of orthogonal vectors. In such a model it is often necessary to adopt a separate, corrective procedure to take into account the correlations between terms. In this paper, we propose a systematic method (the generalized vector space model) to compute term correlations directly from an automatic indexing scheme. We also demonstrate how such correlations can be included with minimal modification in existing vector-based information retrieval systems.
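The effect of dropping the orthogonality assumption can be shown with a small sketch. It compares the plain dot product (terms orthogonal) with a bilinear form that inserts a term-term correlation matrix `G` between query and document. Estimating `G` from document co-occurrence, as done here, is a simplifying assumption and not the construction proposed in the paper, which the abstract does not describe; the terms and weights are hypothetical.

```python
TERMS = ["car", "automobile", "flower"]

# Hypothetical binary index: rows are documents, columns follow TERMS.
DOC_TERM = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]


def term_correlations(doc_term):
    """Co-occurrence estimate of term correlations: G[i][j] = sum_d w_di * w_dj."""
    n_terms = len(doc_term[0])
    return [
        [sum(row[i] * row[j] for row in doc_term) for j in range(n_terms)]
        for i in range(n_terms)
    ]


def orthogonal_similarity(query, doc):
    """Classic dot product: different terms never contribute to each other."""
    return sum(q * d for q, d in zip(query, doc))


def correlated_similarity(query, doc, correlations):
    """Bilinear form q^T G d: correlated terms contribute to the score."""
    return sum(
        query[i] * correlations[i][j] * doc[j]
        for i in range(len(query))
        for j in range(len(doc))
    )


G = term_correlations(DOC_TERM)
query = [1, 0, 0]            # asks for "car"
doc_automobile = [0, 1, 0]   # mentions only "automobile"
doc_flower = [0, 0, 1]       # mentions only "flower"
```

Under the orthogonal model the query "car" scores zero against a document that only mentions "automobile". With `G` in place, the two terms are linked through their co-occurrence and the score becomes positive, while the unrelated "flower" document still scores zero.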

