Graph-based Representation for Sentence Similarity Measure: A Comparative Analysis

2018, Vol 7 (2.14), pp. 32
Author(s):  
Siti Sakira Kamaruddin ◽  
Yuhanis Yusof ◽  
Nur Azzah Abu Bakar ◽  
Mohamed Ahmed Tayie ◽  
Ghaith Abdulsattar A.Jabbar Alkubaisi

Textual data are a rich source of knowledge; hence, sentence comparison has become one of the important tasks in text mining. Most previous work in text comparison is performed at the document level, and research suggests that comparing text at the sentence level is a non-trivial problem. One reason is that two sentences can convey the same meaning with entirely dissimilar words. This paper presents the results of a comparative analysis of three representation schemes, i.e. term frequency-inverse document frequency, Latent Semantic Analysis, and graph-based representation, using three similarity measures, i.e. Cosine, Dice coefficient, and Jaccard similarity, to compare the similarity of sentences. Results reveal that the graph-based representation and the Jaccard similarity measure outperform the others in terms of precision, recall, and F-measure.
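The three similarity measures compared above have simple set- and vector-based definitions. A minimal sketch over tokenized sentences (illustrative data, not from the paper's corpus):

```python
import math
from collections import Counter

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over the two token sets
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def dice(a, b):
    # 2|A ∩ B| / (|A| + |B|)
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def cosine(a, b):
    # cosine over raw term-frequency vectors
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    return dot / (math.sqrt(sum(v * v for v in ca.values()))
                  * math.sqrt(sum(v * v for v in cb.values())))

s1 = "the cat sat on the mat".split()
s2 = "a cat was sitting on a mat".split()
print(jaccard(s1, s2))  # 3 shared of 8 distinct tokens -> 0.375
```

Note that the set-based measures (Jaccard, Dice) ignore term frequency, while cosine here uses raw counts; in practice any of the three can also be applied to TF-IDF-weighted vectors.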

2010, Vol 29-32, pp. 2620-2626
Author(s):  
Jing Li Zhou ◽  
Xue Jun Nie ◽  
Lei Hua Qin ◽  
Jian Feng Zhu

This paper proposes a novel fuzzy similarity measure based on the relationships between terms and categories. A term-category matrix represents these relationships: each element denotes the membership degree of a term in a category, computed from term frequency-inverse document frequency and the fuzzy relationships between documents and categories. The fuzzy similarity accounts for documents that belong to multiple categories and is computed using fuzzy operators. Experimental results show that the proposed fuzzy similarity surpasses other common similarity measures both in the reliability of the derived document clustering results and in clustering accuracy.
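A toy sketch of the term-category idea: the membership matrix below is hypothetical, and the fuzzy max/min operators are common stand-ins, not necessarily the paper's exact formulation.

```python
import numpy as np

# Hypothetical term-category membership matrix M: M[t, c] is the degree
# to which term t belongs to category c (e.g. normalized tf-idf mass).
M = np.array([
    [0.9, 0.1],   # term "goal"   -> mostly category 0 (sports)
    [0.2, 0.8],   # term "stock"  -> mostly category 1 (finance)
    [0.5, 0.5],   # term "market" -> shared by both categories
])

def doc_membership(term_ids):
    # fuzzy OR (elementwise max) aggregates term memberships
    # into a document's category-membership vector
    return M[term_ids].max(axis=0)

def fuzzy_similarity(d1, d2):
    # max-min composition: overlap of the two fuzzy membership vectors
    return np.minimum(d1, d2).max()

a = doc_membership([0, 2])   # document using "goal", "market"
b = doc_membership([1, 2])   # document using "stock", "market"
print(fuzzy_similarity(a, b))
```

Because membership vectors are per-category rather than per-term, a document can belong to several categories at once, which is exactly the situation the proposed measure is designed to handle.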


Author(s):  
Guido Giunti ◽  
Maëlick Claes ◽  
Enrique Dorronzoro Zubiete ◽  
Octavio Rivera-Romero ◽  
Elia Gabarron

Introduction: Multiple sclerosis (MS) is one of the world’s most common neurologic disorders. Social media have been proposed as a way to maintain and even increase social interaction for people with MS. The objective of this work is to identify and compare the topics discussed on Twitter during the first wave of the COVID-19 pandemic. Methods: Data were collected using the Twitter API between 9/2/2019 and 13/5/2020. SentiStrength was used to analyze the data, with the day the pandemic was declared used as a turning point. Term frequency-inverse document frequency (tf-idf) was computed for each unigram, and the gains in tf-idf value were calculated. A comparative analysis of the relevance of words and categories across the datasets was performed. Results: The original dataset contained over 610k tweets; our final dataset had 147,963 tweets. After the 10th of March, some categories gained relevance in positive tweets (“Healthcare professional”, “Chronic conditions”, “Condition burden”), while in negative tweets “Emotional aspects” became more relevant and “COVID-19” emerged as a new topic. Conclusions: Our work provides insight into how COVID-19 has changed the online discourse of people with MS.


Author(s):  
MV Shivaani

Comparative analysis commands special attention in financial analysis, as it facilitates understanding not only of year-on-year changes but also of trends in a company’s performance and position. It is often a go-to tool for competitor analysis. In this note, I illustrate the use of R, its allied packages, and textual analysis algorithms to extend comparative analysis to the ‘unstructured’ information presented in the MD&A section of annual reports. For this use case, I consider two giant tech rivals, Apple and Amazon, and present a comparative analysis of their MD&A sections using Cosine and Jaccard similarity measures. I also compare the most important words based on tf-idf and sentiments for each company and across the two companies. When supplemented with financial information, comparative analysis can offer novel insights for analysts, managers, researchers, and academics, and is a valuable tool to include in accounting curricula.


2021, Vol 7 (2), pp. 153
Author(s):  
Yunita Maulidia Sari ◽  
Nenden Siti Fatonah

The rapid development of technology makes it easier for us to find the information we need. Problems arise when that information becomes overwhelming: the more information a learning module contains, the longer its text, and the more time it takes to grasp the module’s core content. One solution for quickly extracting the essence of an entire module is to read a summary of it, and a fast way to obtain such a summary is automatic text summarization. Automatic text summarization produces, from one or more documents, a text that conveys the important information of the original source and is automatically no longer than half the length of the original. This study aims to produce automatic summaries of Indonesian-language learning modules and to measure the accuracy of summaries generated with the Cross Latent Semantic Analysis (CLSA) method. The data consist of 10 learning-module files written by lecturers at Universitas Mercu Buana: 5 files in .docx format and 5 in .pdf format. The study applies Term Frequency-Inverse Document Frequency (TF-IDF) for term weighting and Cross Latent Semantic Analysis (CLSA) for text summarization. Summarization accuracy was evaluated by comparing manual (human) summaries against the system-generated summaries. The evaluation yielded the highest average f-measure, precision, and recall at a compression rate of 20%, with values of 0.3853, 0.432, and 0.3715, respectively.
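TF-IDF term weighting, the first stage of the pipeline above, can be sketched in a few lines. The toy corpus and the smoothing-free idf variant shown here are illustrative choices, not necessarily the paper's exact formula:

```python
import math
from collections import Counter

docs = [
    "text summarization extracts key sentences",
    "latent semantic analysis finds hidden topics",
    "tf idf weights terms by rarity across sentences",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# document frequency: number of documents containing each term
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc):
    # tf = relative frequency in the document; idf = log(N / df)
    tf = Counter(doc)
    return {t: (c / len(doc)) * math.log(N / df[t]) for t, c in tf.items()}

weights = tfidf(tokenized[0])
```

Terms unique to one document (like "summarization") get the full idf weight log(3), while "sentences", which appears in two documents, is down-weighted to log(3/2); a term appearing in every document would score zero.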


In this study we propose an automatic single-document text summarization technique that combines Latent Semantic Analysis (LSA) with a diversity constraint. The technique uses query-based sentence ranking; since we are not working in an information retrieval (IR) setting with a user-supplied query, we generate the query using TF-IDF (Term Frequency-Inverse Document Frequency), taking the terms with the highest IDF to form the query vector. LSA uses vectorial semantics to analyze the relationships between documents in a corpus, or between sentences within a document and the key terms they carry, by producing a set of concepts that interconnect the documents and terms; it thus represents the latent structure of documents. Latent Semantic Indexing (LSI) is used to select sentences from the document and rank them by score. Traditionally, the highest-scoring sentences are chosen for the summary, but here we also calculate the diversity between the chosen sentences when producing the final summary, since a good summary should have a maximum level of diversity. The proposed technique is evaluated on OpinosisDataset1.0.
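The LSA core of such a pipeline is a truncated SVD of a term-by-sentence matrix. The sketch below uses illustrative weights and one common LSA sentence score (length of the sentence vector in the scaled topic space); it is not the authors' exact CLSA or diversity procedure:

```python
import numpy as np

# toy term-by-sentence matrix A (rows: terms, cols: sentences),
# e.g. TF-IDF weights; the values here are illustrative
A = np.array([
    [0.9, 0.0, 0.3],
    [0.0, 0.8, 0.1],
    [0.5, 0.4, 0.0],
    [0.0, 0.0, 0.7],
])

# SVD: A = U @ diag(S) @ Vt; columns of Vt index sentences,
# rows of Vt are latent topics
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # number of latent topics kept

# score each sentence by its length in the singular-value-scaled
# topic space, then rank sentences for extraction
scores = np.sqrt(((S[:k, None] * Vt[:k]) ** 2).sum(axis=0))
ranking = np.argsort(-scores)
```

A diversity constraint would then be applied on top of `ranking`, e.g. skipping a candidate sentence that is too similar to sentences already selected.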


2021
Author(s):  
Alvin Subakti ◽  
Hendri Murfi ◽  
Nora Hariadi

Abstract Text clustering is the task of grouping a set of texts so that texts in the same group are more similar to each other than to those in other groups. Grouping text manually requires a significant amount of time and labor; therefore, automation utilizing machine learning is necessary. The standard method used to represent textual data is Term Frequency Inverse Document Frequency (TFIDF). However, TFIDF cannot consider the position and context of a word in a sentence. The Bidirectional Encoder Representations from Transformers (BERT) model can produce text representations that incorporate the position and context of a word in a sentence. This research analyzed the performance of the BERT model as a data representation for text. Moreover, various feature extraction and normalization methods were also applied to the BERT representations. To examine the performance of BERT, we use four clustering algorithms, i.e., k-means clustering, eigenspace-based fuzzy c-means, deep embedded clustering, and improved deep embedded clustering. Our simulations show that BERT outperforms the standard TFIDF method in 28 out of 36 metrics. Furthermore, different feature extraction and normalization methods produced varied performances; their usage must be adapted to the text clustering algorithm used.
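Once texts are encoded as vectors (whether TFIDF or pooled BERT embeddings), clustering operates on those vectors alone. A minimal k-means sketch over stand-in embeddings, assuming the encoder has already been run (random blobs substitute for real BERT outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for sentence embeddings (e.g. pooled BERT vectors);
# two well-separated blobs so the clustering is unambiguous
X = np.vstack([rng.normal(0, 0.1, (10, 8)),
               rng.normal(3, 0.1, (10, 8))])

def kmeans(X, k, iters=20):
    # initialize centers at k distinct data points
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean)
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute centers as cluster means
        centers = np.vstack([X[labels == j].mean(0) for j in range(k)])
    return labels

labels = kmeans(X, 2)
```

In the paper's setting, the quality of `labels` is what differs between TFIDF and BERT inputs: the clustering algorithm is identical, so any gain is attributable to the representation.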


Author(s):  
Pedram Vahdani Amoli ◽  
Omid Sojoodi Sh.

In this paper a novel method is proposed for scientific document clustering. The proposed method is a summarization-based hybrid algorithm comprising a preprocessing phase, in which unimportant words that occur frequently in the text are removed. This reduces the amount of data for clustering; moreover, frequent items cause overlap between clusters, which degrades cluster separation. After the preprocessing phase, Term Frequency/Inverse Document Frequency (TFIDF) is calculated for all words and stems to score them within the document. Text summarization is then performed at the sentence level, and document clustering is finally done according to the calculated TFIDF scores. The hybrid pipeline, from preprocessing through document clustering, yields a fast and efficient clustering method, evaluated on 400 English texts extracted from scientific databases covering 11 different topics. The proposed method is compared with the CSSA, SMTC, and Max-Capture methods. The results demonstrate the proficiency of the proposed scheme in terms of computation time and efficiency using the F-measure criterion.


2020
Author(s):  
Mete Eminağaoğlu ◽  
Yılmaz Gökşen

Accurate, efficient, and fast processing of textual data and classification of electronic documents have become key factors in knowledge management and related businesses in today’s world. Text mining, information retrieval, and document classification systems have a strong positive impact on digital libraries and electronic content management, e-marketing, electronic archives, customer relationship management, decision support systems, and copyright infringement and plagiarism detection, which directly affect economies, businesses, and organizations. In this study, we propose a new similarity measure that can be used with the k-nearest neighbors (k-NN) and Rocchio algorithms, which are among the well-known algorithms for document classification, information retrieval, and other text mining purposes. We tested our novel similarity measure on structured textual data sets and compared the results with standard distance metrics and similarity measures such as Cosine similarity, Euclidean distance, and the Pearson correlation coefficient. The promising results show that the proposed similarity measure could serve as an alternative within suitable algorithms, methods, and models for text mining, document classification, and relevant knowledge management systems.

Keywords: text mining, document classification, similarity measures, k-NN, Rocchio algorithm
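The reason a new similarity measure slots directly into k-NN is that the algorithm only needs a ranking of training documents by similarity to the query. A sketch with a pluggable similarity function (cosine shown as the default; the data and labels are hypothetical, and the paper's proposed measure would simply replace `sim`):

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two dense document vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))

def knn_predict(train, query, k=3, sim=cosine):
    # rank training documents by similarity to the query vector,
    # then take a majority vote over the top-k labels
    ranked = sorted(train, key=lambda item: sim(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [
    ([1.0, 0.0, 0.2], "sports"),
    ([0.9, 0.1, 0.0], "sports"),
    ([0.0, 1.0, 0.3], "finance"),
    ([0.1, 0.9, 0.4], "finance"),
]
print(knn_predict(train, [0.95, 0.05, 0.1], k=3))  # -> "sports"
```

Swapping in a different `sim` (Euclidean-based, Pearson, or the proposed measure) changes only the ranking step, which is what makes such comparisons straightforward.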

