Graph-based Representation for Sentence Similarity Measure: A Comparative Analysis

2018, Vol 7 (2.14), pp. 32
Author(s):  
Siti Sakira Kamaruddin ◽  
Yuhanis Yusof ◽  
Nur Azzah Abu Bakar ◽  
Mohamed Ahmed Tayie ◽  
Ghaith Abdulsattar A.Jabbar Alkubaisi

Textual data are a rich source of knowledge; hence, sentence comparison has become one of the important tasks in text mining. Most previous work in text comparison is performed at the document level, and research suggests that comparing text at the sentence level is a non-trivial problem. One reason is that two sentences can convey the same meaning with entirely dissimilar words. This paper presents the results of a comparative analysis of three representation schemes, i.e. term frequency-inverse document frequency, Latent Semantic Analysis, and graph-based representation, using three similarity measures, i.e. Cosine, Dice coefficient, and Jaccard similarity, to compare the similarity of sentences. Results reveal that the graph-based representation and the Jaccard similarity measure outperform the others in terms of precision, recall, and F-measure.
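The three similarity measures compared above have simple set- and vector-based definitions. A minimal sketch over tokenized sentences (illustrative data, not from the paper's corpus):

```python
import math
from collections import Counter

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over the two token sets
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def dice(a, b):
    # 2|A ∩ B| / (|A| + |B|)
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def cosine(a, b):
    # cosine over raw term-frequency vectors
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    return dot / (math.sqrt(sum(v * v for v in ca.values()))
                  * math.sqrt(sum(v * v for v in cb.values())))

s1 = "the cat sat on the mat".split()
s2 = "a cat was sitting on a mat".split()
print(jaccard(s1, s2))  # 3 shared of 8 distinct tokens -> 0.375
```

Note that the set-based measures (Jaccard, Dice) ignore term frequency, while cosine here uses raw counts; in practice any of the three can also be applied to TF-IDF-weighted vectors.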

2010, Vol 29-32, pp. 2620-2626
Author(s):  
Jing Li Zhou ◽  
Xue Jun Nie ◽  
Lei Hua Qin ◽  
Jian Feng Zhu

This paper proposes a novel fuzzy similarity measure based on the relationships between terms and categories. A term-category matrix represents these relationships: each element denotes the membership degree of a term in a category, computed from term frequency-inverse document frequency and the fuzzy relationships between documents and categories. The fuzzy similarity accounts for documents that belong to multiple categories and is computed using fuzzy operators. Experimental results show that the proposed fuzzy similarity surpasses other common similarity measures both in the reliability of the derived document clustering results and in clustering accuracy.
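A toy sketch of the term-category idea: the membership matrix below is hypothetical, and the fuzzy max/min operators are common stand-ins, not necessarily the paper's exact formulation.

```python
import numpy as np

# Hypothetical term-category membership matrix M: M[t, c] is the degree
# to which term t belongs to category c (e.g. normalized tf-idf mass).
M = np.array([
    [0.9, 0.1],   # term "goal"   -> mostly category 0 (sports)
    [0.2, 0.8],   # term "stock"  -> mostly category 1 (finance)
    [0.5, 0.5],   # term "market" -> shared by both categories
])

def doc_membership(term_ids):
    # fuzzy OR (elementwise max) aggregates term memberships
    # into a document's category-membership vector
    return M[term_ids].max(axis=0)

def fuzzy_similarity(d1, d2):
    # max-min composition: overlap of the two fuzzy membership vectors
    return np.minimum(d1, d2).max()

a = doc_membership([0, 2])   # document using "goal", "market"
b = doc_membership([1, 2])   # document using "stock", "market"
print(fuzzy_similarity(a, b))
```

Because membership vectors are per-category rather than per-term, a document can belong to several categories at once, which is exactly the situation the proposed measure is designed to handle.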


Author(s):  
Guido Giunti ◽  
Maëlick Claes ◽  
Enrique Dorronzoro Zubiete ◽  
Octavio Rivera-Romero ◽  
Elia Gabarron

Introduction: Multiple sclerosis (MS) is one of the world’s most common neurologic disorders. Social media have been proposed as a way to maintain and even increase social interaction for people with MS. The objective of this work is to identify and compare the topics discussed on Twitter during the first wave of the COVID-19 pandemic. Methods: Data were collected using the Twitter API between 9/2/2019 and 13/5/2020. SentiStrength was used to analyze the data, with the day the pandemic was declared used as a turning point. Term frequency-inverse document frequency (tf-idf) was computed for each unigram, and the gains in tf-idf value were calculated. A comparative analysis of the relevance of words and categories across the datasets was performed. Results: The original dataset contained over 610k tweets; our final dataset had 147,963 tweets. After the 10th of March, some categories gained relevance in positive tweets (“Healthcare professional”, “Chronic conditions”, “Condition burden”), while in negative tweets “Emotional aspects” became more relevant and “COVID-19” emerged as a new topic. Conclusions: Our work provides insight into how COVID-19 has changed the online discourse of people with MS.


Author(s):  
MV Shivaani

Comparative analysis commands special attention in financial analysis, as it facilitates understanding not only of year-on-year changes but also of trends in a company’s performance and position. It is often a go-to tool for competitor analysis. In this note, I illustrate the use of R, its allied packages, and textual analysis algorithms to extend comparative analysis to the ‘unstructured’ information presented in the MD&A section of annual reports. For this use case, I consider two giant tech rivals, Apple and Amazon, and present a comparative analysis of their MD&A sections using Cosine and Jaccard similarity measures. I also compare the most important words based on tf-idf and sentiments for each company and across the two companies. When supplemented with financial information, comparative analysis can offer novel insights for analysts, managers, researchers, and academics, and is a valuable tool to include in accounting curricula.


2021, Vol 7 (2), pp. 153
Author(s):  
Yunita Maulidia Sari ◽  
Nenden Siti Fatonah

The rapid development of technology makes it easier for us to find the information we need. Problems arise when that information becomes overwhelming: the more information a learning module contains, the longer its text, and the more time it takes to grasp the module’s core content. One solution for quickly extracting the essence of an entire module is to read a summary of it, and a fast way to obtain such a summary is automatic text summarization. Automatic text summarization produces, from one or more documents, a text that conveys the important information of the original source and is automatically no longer than half the length of the original. This study aims to produce automatic summaries of Indonesian-language learning modules and to measure the accuracy of summaries generated with the Cross Latent Semantic Analysis (CLSA) method. The data consist of 10 learning-module files written by lecturers at Universitas Mercu Buana: 5 files in .docx format and 5 in .pdf format. The study applies Term Frequency-Inverse Document Frequency (TF-IDF) for term weighting and Cross Latent Semantic Analysis (CLSA) for text summarization. Summarization accuracy was evaluated by comparing manual (human) summaries against the system-generated summaries. The evaluation yielded the highest average f-measure, precision, and recall at a compression rate of 20%, with values of 0.3853, 0.432, and 0.3715, respectively.
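TF-IDF term weighting, the first stage of the pipeline above, can be sketched in a few lines. The toy corpus and the smoothing-free idf variant shown here are illustrative choices, not necessarily the paper's exact formula:

```python
import math
from collections import Counter

docs = [
    "text summarization extracts key sentences",
    "latent semantic analysis finds hidden topics",
    "tf idf weights terms by rarity across sentences",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# document frequency: number of documents containing each term
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc):
    # tf = relative frequency in the document; idf = log(N / df)
    tf = Counter(doc)
    return {t: (c / len(doc)) * math.log(N / df[t]) for t, c in tf.items()}

weights = tfidf(tokenized[0])
```

Terms unique to one document (like "summarization") get the full idf weight log(3), while "sentences", which appears in two documents, is down-weighted to log(3/2); a term appearing in every document would score zero.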


In this study we propose an automatic single-document text summarization technique that combines Latent Semantic Analysis (LSA) with a diversity constraint. The technique uses query-based sentence ranking; since we are not working in an information retrieval (IR) setting with a user-supplied query, we generate the query using TF-IDF (Term Frequency-Inverse Document Frequency), taking the terms with the highest IDF to form the query vector. LSA uses vectorial semantics to analyze the relationships between documents in a corpus, or between sentences within a document and the key terms they carry, by producing a set of concepts that interconnect the documents and terms; it thus represents the latent structure of documents. Latent Semantic Indexing (LSI) is used to select sentences from the document and rank them by score. Traditionally, the highest-scoring sentences are chosen for the summary, but here we also calculate the diversity between the chosen sentences when producing the final summary, since a good summary should have a maximum level of diversity. The proposed technique is evaluated on OpinosisDataset1.0.
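The LSA core of such a pipeline is a truncated SVD of a term-by-sentence matrix. The sketch below uses illustrative weights and one common LSA sentence score (length of the sentence vector in the scaled topic space); it is not the authors' exact CLSA or diversity procedure:

```python
import numpy as np

# toy term-by-sentence matrix A (rows: terms, cols: sentences),
# e.g. TF-IDF weights; the values here are illustrative
A = np.array([
    [0.9, 0.0, 0.3],
    [0.0, 0.8, 0.1],
    [0.5, 0.4, 0.0],
    [0.0, 0.0, 0.7],
])

# SVD: A = U @ diag(S) @ Vt; columns of Vt index sentences,
# rows of Vt are latent topics
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # number of latent topics kept

# score each sentence by its length in the singular-value-scaled
# topic space, then rank sentences for extraction
scores = np.sqrt(((S[:k, None] * Vt[:k]) ** 2).sum(axis=0))
ranking = np.argsort(-scores)
```

A diversity constraint would then be applied on top of `ranking`, e.g. skipping a candidate sentence that is too similar to sentences already selected.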


2021
Author(s):  
Alvin Subakti ◽  
Hendri Murfi ◽  
Nora Hariadi

Abstract Text clustering is the task of grouping a set of texts so that texts in the same group are more similar to each other than to those in other groups. Grouping text manually requires a significant amount of time and labor; therefore, automation utilizing machine learning is necessary. The standard method used to represent textual data is Term Frequency Inverse Document Frequency (TFIDF). However, TFIDF cannot consider the position and context of a word in a sentence. The Bidirectional Encoder Representations from Transformers (BERT) model can produce text representations that incorporate the position and context of a word in a sentence. This research analyzed the performance of the BERT model as a data representation for text. Moreover, various feature extraction and normalization methods were also applied to the BERT representations. To examine the performance of BERT, we use four clustering algorithms, i.e., k-means clustering, eigenspace-based fuzzy c-means, deep embedded clustering, and improved deep embedded clustering. Our simulations show that BERT outperforms the standard TFIDF method in 28 out of 36 metrics. Furthermore, different feature extraction and normalization methods produced varied performances; their usage must be adapted to the text clustering algorithm used.
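Once texts are encoded as vectors (whether TFIDF or pooled BERT embeddings), clustering operates on those vectors alone. A minimal k-means sketch over stand-in embeddings, assuming the encoder has already been run (random blobs substitute for real BERT outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for sentence embeddings (e.g. pooled BERT vectors);
# two well-separated blobs so the clustering is unambiguous
X = np.vstack([rng.normal(0, 0.1, (10, 8)),
               rng.normal(3, 0.1, (10, 8))])

def kmeans(X, k, iters=20):
    # initialize centers at k distinct data points
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean)
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute centers as cluster means
        centers = np.vstack([X[labels == j].mean(0) for j in range(k)])
    return labels

labels = kmeans(X, 2)
```

In the paper's setting, the quality of `labels` is what differs between TFIDF and BERT inputs: the clustering algorithm is identical, so any gain is attributable to the representation.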


Author(s):  
Pedram Vahdani Amoli ◽  
Omid Sojoodi Sh.

In this paper a novel method is proposed for scientific document clustering. The proposed method is a summarization-based hybrid algorithm comprising a preprocessing phase, in which unimportant words that occur frequently in the text are removed. This reduces the amount of data for clustering; moreover, frequent items cause overlap between clusters, which degrades cluster separation. After the preprocessing phase, Term Frequency/Inverse Document Frequency (TFIDF) is calculated for all words and stems to score them within the document. Text summarization is then performed at the sentence level, and document clustering is finally done according to the calculated TFIDF scores. The hybrid pipeline, from preprocessing through document clustering, yields a fast and efficient clustering method, evaluated on 400 English texts extracted from scientific databases covering 11 different topics. The proposed method is compared with the CSSA, SMTC, and Max-Capture methods. The results demonstrate the proficiency of the proposed scheme in terms of computation time and efficiency using the F-measure criterion.


2020
Author(s):  
Mete Eminağaoğlu ◽  
Yılmaz Gökşen

Accurate, efficient, and fast processing of textual data and classification of electronic documents have become key factors in knowledge management and related businesses in today’s world. Text mining, information retrieval, and document classification systems have a strong positive impact on digital libraries and electronic content management, e-marketing, electronic archives, customer relationship management, decision support systems, and copyright infringement and plagiarism detection, which directly affect economies, businesses, and organizations. In this study, we propose a new similarity measure that can be used with the k-nearest neighbors (k-NN) and Rocchio algorithms, which are among the well-known algorithms for document classification, information retrieval, and other text mining purposes. We tested our novel similarity measure on structured textual data sets and compared the results with standard distance metrics and similarity measures such as Cosine similarity, Euclidean distance, and the Pearson correlation coefficient. The promising results show that the proposed similarity measure could serve as an alternative within suitable algorithms, methods, and models for text mining, document classification, and relevant knowledge management systems.

Keywords: text mining, document classification, similarity measures, k-NN, Rocchio algorithm
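The reason a new similarity measure slots directly into k-NN is that the algorithm only needs a ranking of training documents by similarity to the query. A sketch with a pluggable similarity function (cosine shown as the default; the data and labels are hypothetical, and the paper's proposed measure would simply replace `sim`):

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two dense document vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))

def knn_predict(train, query, k=3, sim=cosine):
    # rank training documents by similarity to the query vector,
    # then take a majority vote over the top-k labels
    ranked = sorted(train, key=lambda item: sim(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [
    ([1.0, 0.0, 0.2], "sports"),
    ([0.9, 0.1, 0.0], "sports"),
    ([0.0, 1.0, 0.3], "finance"),
    ([0.1, 0.9, 0.4], "finance"),
]
print(knn_predict(train, [0.95, 0.05, 0.1], k=3))  # -> "sports"
```

Swapping in a different `sim` (Euclidean-based, Pearson, or the proposed measure) changes only the ranking step, which is what makes such comparisons straightforward.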

