Improving Document Similarity Calculation Using Cosine-Similarity Graphs

<p><em>Document similarity can be measured and used to find similar documents within a document collection (corpus). In a small corpus, measuring document similarity is inexpensive, but in a larger corpus, comparing a document against every other document can be time-consuming. Clustering can reduce the number of documents that a query document has to be compared against, which saves time. This research examines the effect of clustering on document similarity measurement and evaluates its performance. The corpus consisted of 2,049 undergraduate theses written by Politeknik Statistika STIS students between 2007 and 2016. The documents were represented with a bag-of-words model and clustered using k-means. Similarity was measured with cosine similarity. In the simulation, clustering into 3 clusters required longer preparation time (17.32%) but gave faster query processing (77.88%), with an accuracy of 0.98. Clustering into 5 clusters required longer preparation time (31.10%) but gave faster query processing (83.79%), with an accuracy of 0.86. Clustering into 7 clusters required longer preparation time (45.10%) but gave faster query processing (85.30%), with an accuracy of 0.98.</em></p>
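The idea in this abstract, clustering first so that a query is compared only against one cluster, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the farthest-first seeding, and the use of cosine similarity for cluster assignment (rather than Euclidean distance) are assumptions made to keep the example short and deterministic.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as a term-frequency vector (bag of words)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Centroid of a group of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, k, iterations=10):
    """Toy k-means that assigns each document to its most similar centroid."""
    # Assumed seeding: start from the first document, then repeatedly add
    # the document least similar to the seeds chosen so far.
    seeds = [0]
    while len(seeds) < k:
        seeds.append(min(
            (i for i in range(len(vectors)) if i not in seeds),
            key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds),
        ))
    centroids = [dict(vectors[s]) for s in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = mean_vector(members)
    return centroids, labels


def query(text, vectors, centroids, labels):
    """Rank only the documents in the cluster nearest to the query."""
    q = bag_of_words(text)
    nearest = max(range(len(centroids)), key=lambda c: cosine(q, centroids[c]))
    candidates = [i for i, label in enumerate(labels) if label == nearest]
    return sorted(((cosine(q, vectors[i]), i) for i in candidates), reverse=True)
```

The clustering step is the extra preparation cost, and `query` then scores only the members of one cluster instead of the whole corpus. A relevant document that was assigned to a different cluster is never scored, which is one plausible reason accuracy can fall below 1.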

Document Clustering based on Phrase and Single Term Similarity using Neo4j

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽ 10.35940/ijitee.c9050.109320 ◽ 2020 ◽ Vol 9 (3) ◽ pp. 3188-3192

Keyword(s): Cosine Similarity ◽ Experimental Results ◽ Document Representation ◽ Document Similarity ◽ Informative Feature ◽ Clustering Technique ◽ Density Based Clustering ◽ Single Term ◽ Term Similarity

Document similarity measures generally rely on single-term similarity, such as cosine similarity. To achieve better document similarity, phrases, which are a more informative feature, can be used along with single terms. The Document Index Graph (DIG) representation model is used to find phrases shared across the corpus. The DIG model constructs the graph incrementally and, as each document is inserted, finds the phrases it shares with previously inserted documents. The similarity between documents depends mainly on the number of shared phrases and on single-term similarity; this combination is known as hybrid similarity. The hybrid similarities are used with the well-known density-based clustering technique DBSCAN to assess their effect on cluster quality. Experimental results show that hybrid similarity gives a more accurate degree of document similarity and produces more cohesive clusters.
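A hybrid similarity of this kind can be sketched as a weighted blend of a phrase-based score and a single-term cosine score. The sketch below does not build a Document Index Graph: it treats word bigrams as phrases and measures their Jaccard overlap, and the function names and the weight `alpha=0.7` are illustrative assumptions, not values taken from the paper.

```python
import math
from collections import Counter


def term_cosine(tokens_a, tokens_b):
    """Single-term cosine similarity over raw term frequencies."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def phrases(tokens, n=2):
    """Set of word n-grams, a simple stand-in for DIG shared phrases."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def phrase_overlap(tokens_a, tokens_b, n=2):
    """Jaccard overlap of the two phrase sets (assumed phrase score)."""
    pa, pb = phrases(tokens_a, n), phrases(tokens_b, n)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def hybrid_similarity(doc_a, doc_b, alpha=0.7):
    """Blend phrase and single-term similarity; alpha is an assumed weight."""
    ta, tb = doc_a.lower().split(), doc_b.lower().split()
    return alpha * phrase_overlap(ta, tb) + (1 - alpha) * term_cosine(ta, tb)
```

Two documents that contain the same words in a different order get a term cosine of 1 but a lower hybrid score, because fewer of their phrases match. A pairwise distance such as `1 - hybrid_similarity(a, b)` could then be passed to a density-based clusterer that accepts precomputed distances.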

Research on document similarity calculation and detection based on deep learning

Journal of Physics Conference Series ◽ 10.1088/1742-6596/1757/1/012007 ◽ 2021 ◽ Vol 1757 (1) ◽ pp. 012007

Author(s): Cui Xing ◽ Yan Yang ◽ Jian Luo

Keyword(s): Deep Learning ◽ Document Similarity ◽ Similarity Calculation

Application of the Deep Learning Algorithm and Similarity Calculation Model in Optimization of Personalized Online Teaching System of English Course

Computational Intelligence and Neuroscience ◽ 10.1155/2021/8249625 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-11

Author(s): Yuan Fang ◽ Jingning Li

Keyword(s): Deep Learning ◽ Collaborative Filtering ◽ Online Teaching ◽ Calculation Model ◽ Cosine Similarity ◽ Personalized Recommendation ◽ Functional Modules ◽ Recommendation Algorithm ◽ Teaching System ◽ Similarity Calculation

This study provides an in-depth study and analysis of English course recommendation techniques through a combination of bee colony algorithm and neural network algorithm. In this study, the acquired text is trained with a document vector by a deep learning model and combined with a collaborative filtering method to recommend suitable courses for users. Based on the analysis of the current research status and development of the technology related to course resource recommendation, the deep learning technology is applied to the course resource recommendation based on the current problems of sparse data and low accuracy of the course recommendation. For the problem that the importance of learning resources to users changes with time, this study proposes to fuse the time information into the neural collaborative filtering algorithm through the clustering classification algorithm and proposes a deep learning-based course resource recommendation algorithm to better recommend the course that users want to learn at this stage promptly. Secondly, the course cosine similarity calculation model is improved for the course recommendation algorithm. Considering the impact of the number of times users rate courses and the time interval between users rating different courses on the course similarity calculation, the contribution of active users to the cosine similarity is reduced and a time decay penalty is given to users rating courses at different periods. By improving the hybrid recommendation algorithm and similarity calculation model, the error value, recall, and accuracy of course recommendation results outperform other algorithmic models. 
The requirements analysis identifies the personalized online teaching system with rural primary and secondary school students as the main service target and then designs the overall architecture and functional modules of the recommendation system and the database table structure to implement the user registration, login, and personal center functional modules, course publishing, popular recommendation, personalized recommendation, Q&A, and rating functional modules.
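The abstract describes reducing the contribution of active users to the course cosine similarity and adding a time-decay penalty, but it does not give the formula. One common way to express both ideas in item-based collaborative filtering is sketched below. The logarithmic activity penalty, the `1 / (1 + alpha * gap)` decay, the parameter `alpha`, and the function name are assumptions for illustration, not the authors' model.

```python
import math
from collections import defaultdict
from itertools import combinations


def course_similarity(ratings, alpha=0.1):
    """ratings: iterable of (user, course, day) tuples; returns {(i, j): sim}."""
    by_user = defaultdict(dict)   # user -> {course: day the user rated it}
    raters = defaultdict(set)     # course -> set of users who rated it
    for user, course, day in ratings:
        by_user[user][course] = day
        raters[course].add(user)

    co = defaultdict(float)       # weighted co-rating counts per course pair
    for user, rated in by_user.items():
        # Users who rate many courses contribute less to each pair.
        activity_penalty = 1.0 / math.log(1 + len(rated))
        for i, j in combinations(sorted(rated), 2):
            # Ratings given far apart in time contribute less.
            time_penalty = 1.0 / (1 + alpha * abs(rated[i] - rated[j]))
            co[(i, j)] += activity_penalty * time_penalty

    # Cosine-style normalisation by the number of raters of each course.
    return {
        (i, j): weight / math.sqrt(len(raters[i]) * len(raters[j]))
        for (i, j), weight in co.items()
    }
```

With both penalties set to 1, this reduces to the plain cosine similarity between the binary rater sets of two courses. Each pair is stored once, with the course names in sorted order.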

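Analysis of the Plagiarism Level of Undergraduate Thesis Documents Using the Cosine Similarity Method and TF-IDF Weighting (original title: ANALISIS TINGKAT PLAGIASI DOKUMEN SKRIPSI DENGAN METODE COSINE SIMILARITY DAN PEMBOBOTAN TF-IDF)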

TEKNIMEDIA: Teknologi Informasi dan Multimedia ◽ 10.46764/teknimedia.v2i2.51 ◽ 2022 ◽ Vol 2 (2) ◽ pp. 90-95

Author(s): Muhammad Azmi

Keyword(s): Cosine Similarity ◽ Plagiarism Detection ◽ Document Similarity ◽ Final Project ◽ The Right ◽ Modify Technique ◽ Similarity Method

Plagiarism is the act of duplicating or imitating the work of others and presenting it as one's own without the author's permission or without citing the source. Plagiarism is not difficult to commit: a document produced by applying a copy-paste-modify technique to part or all of another document can be considered the result of plagiarism or duplication. The practice occurs because students are accustomed to taking the writing of others without citing the source, sometimes copying it in its entirety. Plagiarism is mostly carried out by students, especially when completing a final project or thesis. One way to curb plagiarism is through prevention and detection. Plagiarism detection based on document similarity is one way to detect both copy-and-paste plagiarism and disguised plagiarism. One suitable method is to analyze the level of document plagiarism using the cosine similarity method with TF-IDF weighting. This research produces an application that can compute the similarity value of a document under test. Test results show that the manual calculations agree with the algorithm as implemented in the application. Use of the Literature Library is quite effective in the stemming process. Calculations that use stemming yield a higher similarity value than calculations without stemming.
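The combination of TF-IDF weighting, cosine similarity, and stemming can be illustrated with a small sketch. It is not the application described in the abstract: `crude_stem` is a hypothetical suffix stripper for English standing in for a real stemmer, and the smoothed IDF formula is an assumed variant chosen so that terms shared by all documents keep a non-zero weight.

```python
import math
from collections import Counter


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strip a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize(text, stem=False):
    """Lowercase, split on whitespace, and optionally stem each token."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens] if stem else tokens


def tfidf_vectors(docs, stem=False):
    """Return one sparse TF-IDF vector (dict) per document."""
    tokenized = [tokenize(d, stem) for d in docs]
    n = len(tokenized)
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Assumed smoothed IDF: log((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse TF-IDF vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0
```

Comparing `"students checked the reports"` with `"student checking the report"` shows the effect reported in the abstract: without stemming the two texts share only the word "the", while with stemming every token matches and the similarity rises.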
