Predicting Marathi News Class using Semantic Entity Driven Clustering Approach

2021, Vol 23 (4), pp. 1-13
Author(s): Jatinderkumar R. Saini, Prafulla Bharat Bafna

Document management is a pressing need of the current era, and managing documents in regional languages is a significant and largely unexplored area. A Marathi corpus consisting of news articles is processed to form the Group Entity Document Matrix for Marathi (GEDMM), the Vector Space Model for Marathi (VSMM), and the Hysynset Vector Space Model for Marathi (HSVSMM). GEDMM uses entity groups extracted using a conditional random field (CRF). VSMM is constructed from frequent terms using TF-IDF. HSVSMM uses synsets built from hypernyms-hyponyms and synonyms. GEDMM and HSVSMM apply dimension reduction by selecting significant feature groups. Hierarchical agglomerative clustering (HAC) is applied, and a dendrogram is produced to visualize the clusters. Performance is analyzed using several measures: entropy, purity, misclassification error, and accuracy. The clusters produced using GEDMM show the minimum entropy and the highest purity. A random forest classifier is then applied, and its results are evaluated using misclassification error and accuracy.
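The abstract does not give formulas for its cluster-quality measures. As a rough illustration, the sketch below computes purity and entropy under their commonly used definitions, which are assumed here rather than taken from the paper. The cluster assignments and class labels are hypothetical.

```python
from collections import Counter
from math import log2


def purity(clusters, labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    majority_total = 0
    for c in set(clusters):
        members = [lab for k, lab in zip(clusters, labels) if k == c]
        majority_total += Counter(members).most_common(1)[0][1]
    return majority_total / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the class entropy (in bits) inside each cluster."""
    result = 0.0
    for c in set(clusters):
        members = [lab for k, lab in zip(clusters, labels) if k == c]
        size = len(members)
        h = -sum((n / size) * log2(n / size) for n in Counter(members).values())
        result += (size / len(labels)) * h
    return result


# Hypothetical example: six news items, two clusters, two true classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sports", "sports", "politics", "politics", "politics", "politics"]
print(purity(clusters, labels), entropy(clusters, labels))
```

Under these definitions a better clustering has higher purity and lower entropy, which matches the direction in which the abstract reports the GEDMM result.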


Author(s): Anthony Anggrawan, Azhari

Information Retrieval is the task of searching for information based on a user's query, with the aim of finding the documents that meet the user's need. This research uses the Vector Space Model to determine the similarity percentage of each student's assignment. The application is built with PHP and a MySQL database. The results are presented by ranking the documents by their similarity to the query, with a mean average precision of 0.874. This value shows how closely the application agrees with the assessment made by experts; it was obtained from an evaluation of 5 queries against 25 sample documents. The time needed to compute similarities depends on the number of assignments submitted: the more assignments are compared, the longer the computation takes.
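The reported figure is a mean average precision over several queries. The sketch below shows how that metric is usually computed from a ranked result list and a set of expert-judged relevant documents. It is written in Python for brevity (the application itself uses PHP), and the document identifiers and relevance judgments are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical judgments for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
print(mean_average_precision(runs))
```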


2018, Vol 9 (2), pp. 97-105
Author(s): Richard Firdaus Oeyliawan, Dennis Gunawan

A library is a facility that provides information and knowledge resources and supports readers academically. Because a library holds a huge number of books, readers often have difficulty finding the ones they need. Universitas Multimedia Nusantara uses the Senayan Library Management System (SLiMS) as its library catalogue. SLiMS has many features that help readers, but it still lacks a recommendation feature that helps readers find books relevant to a specific book they have chosen. The application has been developed using the Vector Space Model to represent each document as a vector. Recommendations are based on the similarity of the books' descriptions. In the testing phase, using a one-language sample of relevant books, the F-Measure obtained is 55% with a cosine similarity threshold of 0.1. The book descriptions and the variety of languages affect the F-Measure obtained. Index Terms: Book Recommendation, Porter Stemmer, SLiMS Universitas Multimedia Nusantara, TF-IDF, Vector Space Model
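A minimal sketch of threshold-based recommendation over TF-IDF vectors follows. It assumes whitespace tokenization, a `log(N/df) + 1` IDF weighting, and no stemming, none of which is confirmed by the abstract (the index terms suggest the actual application applies a Porter stemmer). The book descriptions are hypothetical; only the 0.1 threshold comes from the abstract.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (term -> weight) for a list of texts."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed weighting: raw term frequency times (log(N/df) + 1).
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(index, descriptions, threshold=0.1):
    """Indices of other books whose similarity to book `index` meets the threshold."""
    vectors = tfidf_vectors(descriptions)
    target = vectors[index]
    scored = [(cosine(target, v), i) for i, v in enumerate(vectors) if i != index]
    return [i for score, i in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical book descriptions.
descriptions = [
    "introduction to machine learning algorithms",
    "machine learning with python",
    "history of european art",
]
print(recommend(0, descriptions))
```

A low threshold such as 0.1 admits books that share only a few weighted terms with the chosen one, which favors recall over precision.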


1985, Vol 8 (2), pp. 253-267
Author(s): S.K.M. Wong, Wojciech Ziarko

In information retrieval, it is common to model index terms and documents as vectors in a suitably defined vector space. The main difficulty with this approach is that the explicit representation of term vectors is not known a priori. For this reason, the vector space model adopted by Salton for the SMART system treats the terms as a set of orthogonal vectors. In such a model it is often necessary to adopt a separate, corrective procedure to take into account the correlations between terms. In this paper, we propose a systematic method (the generalized vector space model) to compute term correlations directly from the automatic indexing scheme. We also demonstrate how such correlations can be included with minimal modification in existing vector-based information retrieval systems.
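To make the role of term correlations concrete, the sketch below compares a query-document similarity computed with orthogonal terms against one computed with a term correlation matrix. It does not reproduce the paper's method for deriving correlations; the matrix `correlated` and the two-term vocabulary are hypothetical values chosen for illustration.

```python
def bilinear_similarity(q, d, g):
    """Unnormalized similarity sum_i sum_j q[i] * g[i][j] * d[j].

    g[i][j] is the inner product of term vectors i and j; with the identity
    matrix this reduces to the ordinary dot product of q and d.
    """
    return sum(q[i] * g[i][j] * d[j] for i in range(len(q)) for j in range(len(d)))


# Vocabulary (hypothetical): index 0 = "car", index 1 = "automobile".
identity = [[1.0, 0.0], [0.0, 1.0]]    # terms treated as orthogonal
correlated = [[1.0, 0.6], [0.6, 1.0]]  # assumed correlation between the two terms

query = [1.0, 0.0]     # query mentions only "car"
document = [0.0, 1.0]  # document mentions only "automobile"

print(bilinear_similarity(query, document, identity))
print(bilinear_similarity(query, document, correlated))
```

With orthogonal terms the query and document share nothing, while the correlated case assigns them a nonzero similarity. This is the kind of effect that a separate corrective procedure would otherwise have to supply.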

