Plagiarism detection using document similarity based on distributed representation

Plagiarism tends to occur when students face assignments with pressing deadlines, because it is seen as the fastest way to complete the work. For this reason, the author built a plagiarism detection system that uses the Winnowing algorithm to search for similar documents. The documents tested are Indonesian journal articles in .doc, .docx, and/or .txt format. Similarity is calculated in two stages: first, a document fingerprint is generated with the Winnowing algorithm; second, the fingerprints are compared using the Jaccard similarity coefficient. The system was developed using an iterative waterfall model. The main objective of the project is to determine the level of plagiarism. By displaying the percentage of similarity for a journal article before it is published, the system is expected to help prevent both intentional and unintentional plagiarism.
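
The two stages described above (Winnowing fingerprint, then Jaccard coefficient) can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the k-gram length `k=5`, the window size `window=4`, the MD5-based hash, and all function names are assumptions chosen for the example, and the fingerprint keeps only hash values rather than (hash, position) pairs.

```python
import hashlib


def kgram_hashes(text, k=5):
    """Hash every k-gram of the normalised text (lowercase, alphanumerics only)."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    hashes = []
    for i in range(len(cleaned) - k + 1):
        gram = cleaned[i:i + k]
        digest = hashlib.md5(gram.encode("utf-8")).hexdigest()
        hashes.append(int(digest[:8], 16))  # first 32 bits are enough for a sketch
    return hashes


def winnow(hashes, window=4):
    """Keep the minimum hash of every sliding window as the document fingerprint."""
    fingerprint = set()
    if not hashes:
        return fingerprint
    if len(hashes) < window:
        fingerprint.add(min(hashes))
        return fingerprint
    for start in range(len(hashes) - window + 1):
        fingerprint.add(min(hashes[start:start + window]))
    return fingerprint


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def document_similarity(text_a, text_b, k=5, window=4):
    """Similarity in [0, 1] between two texts based on their winnowed fingerprints."""
    fp_a = winnow(kgram_hashes(text_a, k), window)
    fp_b = winnow(kgram_hashes(text_b, k), window)
    return jaccard(fp_a, fp_b)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox sleeps under the lazy cat"
    print(f"similarity: {document_similarity(a, b):.2%}")
```

Multiplying the returned value by 100 gives a percentage of the kind the abstract describes. Larger `k` and `window` values make the fingerprint smaller and less sensitive to short coincidental matches.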

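ANALISIS TINGKAT PLAGIASI DOKUMEN SKRIPSI DENGAN METODE COSINE SIMILARITY DAN PEMBOBOTAN TF-IDF (English: Analysis of the Plagiarism Level of Thesis Documents Using the Cosine Similarity Method and TF-IDF Weighting)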

TEKNIMEDIA: Teknologi Informasi dan Multimedia ◽

10.46764/teknimedia.v2i2.51 ◽

2022 ◽

Vol 2 (2) ◽

pp. 90-95

Author(s):

Muhammad Azmi

Keyword(s):

Cosine Similarity ◽

Plagiarism Detection ◽

Document Similarity ◽

Final Project ◽

The Right ◽

Modify Technique ◽

Similarity Method

Plagiarism is the act of duplicating or imitating another person's work and presenting it as one's own without the author's permission or without citing the source. Plagiarism is not difficult to commit: applying a copy-paste-modify technique to part or all of a document is enough for the result to count as plagiarism or duplication. The practice arises because students are accustomed to taking other people's writing without citing its source, sometimes copying it in full and verbatim. It is most common among students, especially when completing a final project or thesis. One way to curb plagiarism is through prevention and detection. Plagiarism detection based on document similarity is one way to detect copy-and-paste plagiarism and disguised plagiarism. One suitable method is to analyze the level of document plagiarism using the Cosine Similarity method with TF-IDF weighting. This research produces an application that computes the similarity value of the documents under test. Test results show that manual calculations agree with the algorithm as implemented in the application. Use of the Literature Library is quite effective in the stemming process. Calculations that use stemming yield higher similarity values than calculations without stemming.
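
As a rough illustration of the approach named above, the sketch below weights terms with TF-IDF and compares two documents by cosine similarity. It is not the application described in the abstract: the tokenizer, the smoothed IDF formula `log((1 + N) / (1 + df)) + 1`, and the function names are assumptions made for the example, and no stemming is applied.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    # Smoothed IDF (an assumption of this sketch) so that no weight is zero.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = [
        "Plagiarism detection uses document similarity.",
        "Document similarity is used for plagiarism detection.",
        "Cats sleep all day.",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine_similarity(vecs[0], vecs[1]))
    print(cosine_similarity(vecs[0], vecs[2]))
```

In this sketch "uses" and "used" count as different terms. Mapping both to a common stem before weighting would increase the overlap between the first two documents, which is consistent with the abstract's observation that stemming raises the similarity value.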

Towards Neural Situation Evolution Modeling: Learning a Distributed Representation for Predicting Complex Event Sequences

2020 IEEE 23rd International Conference on Information Fusion (FUSION) ◽

10.23919/fusion45008.2020.9190165 ◽

2020 ◽

Author(s):

Andrea Salfinger ◽

Lauro Snidaro

Keyword(s):

Distributed Representation ◽

Event Sequences ◽

Evolution Modeling

Plagiarism Detection and Avoidance Consequences in Academic World

Journal of Advanced Research in Library and Information Science ◽

10.24321/2395.2288.201706 ◽

2017 ◽

Vol 04 (04) ◽

pp. 6-13

Author(s):

Akhandanand Shukla

Keyword(s):

Plagiarism Detection ◽

Academic World

Analysis of Stylometric Features and Segmentation Strategies in Intrinsic Plagiarism Detection System

Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) ◽

10.29207/resti.v4i5.2486 ◽

2020 ◽

Vol 4 (5) ◽

pp. 988-997

Author(s):

Sylvia Putri Gunawan ◽

Lucia Dwi Krisnawati ◽

Antonius Rachmat Chrismanto

Keyword(s):

Detection System ◽

Plagiarism Detection ◽

Development System ◽

Intrinsic Plagiarism Detection

Two different paradigms in the field of plagiarism detection have resulted in External Plagiarism Detection (EPD) and Intrinsic Plagiarism Detection (IPD) systems. The more commonly applied system is EPD, whose algorithm makes a heuristic comparison between a suspicious document and the documents in a corpus. In contrast, an IPD algorithm is given only the suspicious document and should be able to find plagiarized sections by looking for text segments written in a different style. Previous research on Indonesian texts has addressed only the development of EPD systems. This research therefore experiments with and analyzes stylometric features and segmentation strategies for building an IPD system for Indonesian texts. The experimental results show that paragraph segmentation performs better, scoring 0.92 for macro-averaged accuracy and 0.54 for macro-averaged F1. The stylometric features achieving the highest F1 and accuracy scores are the frequency of punctuation, the average paragraph length, and the type-token ratio.
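
The three features named above can be computed per paragraph segment along the following lines. This is only a sketch under stated assumptions: the abstract does not give exact feature definitions, so punctuation frequency is taken here as punctuation characters per character, paragraph length as the number of words, and paragraphs are assumed to be separated by blank lines. The function names are hypothetical.

```python
import string


def segment_paragraphs(text):
    """Split a document into paragraph segments at blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def stylometric_features(paragraph):
    """Compute three simple style features for one paragraph segment."""
    words = [tok.strip(string.punctuation).lower() for tok in paragraph.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    n_punct = sum(1 for ch in paragraph if ch in string.punctuation)
    n_chars = len(paragraph)
    return {
        # Share of characters that are punctuation marks.
        "punctuation_frequency": n_punct / n_chars if n_chars else 0.0,
        # Length of the paragraph in words.
        "paragraph_length": len(words),
        # Distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }


if __name__ == "__main__":
    document = "The cat, the dog.\n\nA wholly different style; longer clauses: see?"
    for segment in segment_paragraphs(document):
        print(stylometric_features(segment))
```

An intrinsic detector would then look for segments whose feature values deviate from the rest of the document; how that deviation is scored is not specified in the abstract and is left out here.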

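Pembobotan Berdasarkan Tingkat Kesamaan Semantik pada Metode Fuzzy Semi-Supervised Co-Clustering untuk Pengelompokkan Dokumen Teks (English: Weighting Based on Semantic Similarity Level in the Fuzzy Semi-Supervised Co-Clustering Method for Text Document Clustering)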

Jurnal ULTIMATICS ◽

10.31937/ti.v6i2.333 ◽

2014 ◽

Vol 6 (2) ◽

pp. 46-51

Author(s):

Galang Amanda Dwi P. ◽

Gregorius Edwadr ◽

Agus Zainal Arifin

Keyword(s):

Supervised Learning ◽

Semantic Similarity ◽

The Other ◽

Classification Result ◽

Document Similarity ◽

The Matrix ◽

Index Terms ◽

Membership Value ◽

Degree Of Similarity

Nowadays, a large amount of information cannot be reached by readers because text-based documents are misclassified. Misclassified data can also lead readers to the wrong information. The method proposed in this paper aims to classify documents into the correct group. Each document has a membership value in several different classes. The degree of similarity between two documents is measured by their semantic similarity. In practice, no document is entirely unrelated to another, although the relationship may be close to 0. The method calculates the similarity between two documents by taking into account the similarity of their words and the words' synonyms. Once all inter-document similarity values have been obtained, a matrix is created and then used as a semi-supervised factor. The output of the method is the membership value of each document in each class; the largest membership value indicates the group to which the document is assigned. The classification result computed by the method is good, at 90%. Index Terms - Fuzzy co-clustering, Heuristic, Semantic Similarity, Semi-supervised learning.
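
The idea of a synonym-aware similarity matrix can be illustrated with the toy sketch below. It does not reproduce the paper's similarity measure or its fuzzy co-clustering step: as a stand-in, each word is mapped to a canonical synonym through a small hand-written dictionary (hypothetical), and documents are compared with a Jaccard coefficient over the resulting word sets. All names are assumptions for the example.

```python
def canonical(word, synonyms):
    """Map a word to its canonical synonym, or keep it if it has none."""
    return synonyms.get(word, word)


def semantic_similarity(doc_a, doc_b, synonyms):
    """Jaccard overlap of two documents after synonym normalisation."""
    words_a = {canonical(w, synonyms) for w in doc_a.lower().split()}
    words_b = {canonical(w, synonyms) for w in doc_b.lower().split()}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def similarity_matrix(docs, synonyms):
    """Square matrix whose entry [i][j] is the similarity of docs i and j."""
    return [[semantic_similarity(a, b, synonyms) for b in docs] for a in docs]


if __name__ == "__main__":
    # Hypothetical synonym dictionary: every entry points to one canonical form.
    synonyms = {"car": "automobile", "auto": "automobile"}
    docs = ["fast car", "fast auto", "green tree"]
    for row in similarity_matrix(docs, synonyms):
        print(row)
```

A matrix of this kind is symmetric with ones on the diagonal. In the method described above it would serve as the semi-supervised factor for the clustering step, which this sketch does not implement.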
