PERANCANGAN DAN PENERAPAN ALGORITMA RIZKI TANJUNG 24 (RTG24) UNTUK KOMPARASI KATA PADA FILE TEXT

Compiler ◽

10.28989/compiler.v3i1.68 ◽

2014 ◽

Vol 3 (1) ◽

Author(s):

Rizki Tanjung ◽

Haruno Sajati ◽

Dwi Nugraheny

Keyword(s):

String Matching ◽

Plagiarism Detection ◽

Text Documents ◽

Text Document ◽

Basic Word ◽

Root Word

Plagiarism is the act of taking another person's essay or work and presenting it as one's own. Plagiarism of text is very common and difficult to prevent, so many systems have been built to assist in detecting plagiarism in text documents. At its core, plagiarism detection in text documents is string matching. This motivated an algorithm, RTG24, which is implemented in the RTG24 Comparison file.txt application. The documents to be compared must be .txt (plain-text) files, and every word in them must appear in the Indonesian dictionary. The RTG24 algorithm works by counting the words that are the same or similar between two documents. It has four stages: parsing, filtering, stemming, and comparison. Parsing breaks every sentence in the document down into individual words, and filtering removes unimportant particles. Stemming then reduces every word to its basic (root) word, which simplifies the comparison between the two documents. After parsing, filtering, and stemming, the words of each document are placed in an array and the two arrays are compared, which yields the percentage of similarity between the two documents.
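The four stages named in the abstract can be sketched as follows. This is a minimal illustration, not the RTG24 implementation: the particle list, the tiny root-word dictionary, the affix lists, and the similarity formula (shared distinct stems over all distinct stems) are all assumptions, since the abstract does not specify them.

```python
import re

# Hypothetical particle list; the paper's actual list is not given.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "itu", "ini"}

# Hypothetical miniature root-word dictionary and affix lists.
ROOT_WORDS = {"makan", "baca", "tulis", "buku", "nasi"}
PREFIXES = ("", "di", "me", "mem", "ber")
SUFFIXES = ("", "kan", "an", "nya", "i")


def parse(text):
    """Parsing: lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def filter_words(words):
    """Filtering: drop particles that carry little meaning."""
    return [w for w in words if w not in STOPWORDS]


def stem(word):
    """Stemming: strip one prefix and/or suffix if that yields a known root."""
    if word in ROOT_WORDS:
        return word
    for prefix in PREFIXES:
        for suffix in SUFFIXES:
            if word.startswith(prefix) and word.endswith(suffix):
                core = word[len(prefix): len(word) - len(suffix)]
                if core in ROOT_WORDS:
                    return core
    return word  # unknown words are left unchanged


def preprocess(text):
    """Run parsing, filtering, and stemming; return the stems as a list."""
    return [stem(w) for w in filter_words(parse(text))]


def similarity_percent(text_a, text_b):
    """Comparison: percentage of distinct stems shared by both documents."""
    a, b = set(preprocess(text_a)), set(preprocess(text_b))
    if not a and not b:
        return 100.0
    return 100.0 * len(a & b) / len(a | b)
```

For example, `preprocess("Buku itu dibaca dan makanan")` drops the particles and reduces the remaining words to their roots. A production stemmer for Indonesian would need a full dictionary and more careful affix rules than this sketch uses.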

Text Documents Plagiarism Detection using Rabin-Karp and Jaro-Winkler Distance Algorithms

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v5.i2.pp462-471 ◽

2017 ◽

Vol 5 (2) ◽

pp. 462 ◽

Cited By ~ 3

Author(s):

Brinardi Leonardo ◽

Seng Hansun

Keyword(s):

Detection System ◽

String Matching ◽

Experimental Results ◽

Plagiarism Detection ◽

Text Documents ◽

Matching Algorithm ◽

Text Document ◽

Different Types ◽

The University

Universities regard plagiarism as fraud: taking someone else's ideas or writing without citing the source and claiming it as one's own. Plagiarism detection systems generally apply a string matching algorithm to text documents to find words shared between documents. Several algorithms can be used for string matching, among them Rabin-Karp and Jaro-Winkler Distance. Rabin-Karp is well suited to matching multiple string patterns, while Jaro-Winkler Distance has an advantage in running time. A plagiarism detection application was developed and tested on different document types: doc, docx, pdf, and txt. The experimental results show that both algorithms can be used to detect plagiarism in these documents, but Rabin-Karp is much more effective and faster when processing documents larger than 1000 KB.
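The two algorithms compared in this abstract can be sketched in textbook form. The code below is an illustration only, not the authors' application: the function names, the hash base and modulus, and the restriction to equal-length patterns are assumptions made for the example.

```python
def rabin_karp_find_all(text, patterns, base=256, mod=1_000_000_007):
    """Return sorted (index, pattern) pairs for every occurrence of any pattern.

    All patterns must share one length so a single rolling hash serves them all.
    """
    patterns = list(patterns)
    if not patterns:
        return []
    m = len(patterns[0])
    if m == 0 or any(len(p) != m for p in patterns):
        raise ValueError("patterns must be non-empty and of equal length")
    if len(text) < m:
        return []

    def full_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    by_hash = {}
    for p in set(patterns):
        by_hash.setdefault(full_hash(p), []).append(p)

    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    h = full_hash(text[:m])
    matches = []
    for i in range(len(text) - m + 1):
        if h in by_hash:
            window = text[i:i + m]
            # Compare the actual strings to rule out hash collisions.
            matches.extend((i, p) for p in by_hash[h] if p == window)
        if i + m < len(text):
            # Roll the hash: drop text[i], append text[i + m].
            h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return sorted(matches)


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro-Winkler similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    used1, used2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    seq1 = [c for c, u in zip(s1, used1) if u]
    seq2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    jaro = (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)
```

Rabin-Karp reports exact occurrences of patterns, whereas Jaro-Winkler returns a graded similarity score between two strings, so a detection system would typically apply them to different units (for instance, fixed-length word sequences versus individual words or short phrases).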

Aplikasi Pendeteksi Tingkat Kesamaan Dokumen Teks: Algoritma Rabin Karp Vs. Winnowing

Digital Zone Jurnal Teknologi Informasi dan Komunikasi ◽

10.31849/digitalzone.v9i1.1242 ◽

2018 ◽

Vol 9 (1) ◽

pp. 82-93

Author(s):

Sugiono Sugiono ◽

Herwin Herwin ◽

Hamdani Hamdani ◽

Erlin Erlin

Keyword(s):

Word Processing ◽

Processing Time ◽

Code Of Conduct ◽

Scientific Writing ◽

Plagiarism Detection ◽

Text Similarity ◽

Text Documents ◽

Processing Application ◽

Text Document ◽

Copy And Paste

Copying and pasting text documents often occurs in scientific writing without credit being given to the owner of the text. This breach of the code of conduct is made easy by the copy-and-paste facility of word processing applications. The purpose of this study is to build an application that can detect the degree of similarity between text documents, by first comparing the reliability of two text similarity algorithms, Rabin-Karp and Winnowing. The comparison is based on two variables: detection capability and processing time. The results show that the Winnowing algorithm outperforms Rabin-Karp in both accuracy and processing time. Keywords: Rabin-Karp, Winnowing, Plagiarism Detection, Text Similarity
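For readers unfamiliar with Winnowing, the sketch below shows the usual scheme: hash every k-gram of the normalized text, slide a window of w hashes over the sequence, and keep the minimum of each window as a fingerprint. The parameter values, the use of CRC-32 as the hash, and the Jaccard similarity at the end are assumptions for illustration; the paper's settings are not stated in the abstract.

```python
import zlib


def normalize(text):
    """Keep only letters and digits, lower-cased."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def kgram_hashes(text, k):
    """Hash every k-character substring of the normalized text."""
    s = normalize(text)
    return [zlib.crc32(s[i:i + k].encode("utf-8")) for i in range(len(s) - k + 1)]


def winnow(text, k=5, w=4):
    """Return the set of fingerprint hashes selected by winnowing.

    Positions are discarded here because only set overlap is needed.
    """
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def similarity(text_a, text_b, k=5, w=4):
    """Jaccard similarity of the two fingerprint sets (assumed measure)."""
    fa, fb = winnow(text_a, k, w), winnow(text_b, k, w)
    union = fa | fb
    if not union:
        return 0.0
    return len(fa & fb) / len(union)
```

With these settings, any two documents that share a normalized substring of at least w + k - 1 = 8 characters are guaranteed to share at least one fingerprint, which is the property that makes the method useful for copy detection.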

Aplikasi Pengecekan Dokumen Digital Tugas Mahasiswa Berbasis Website

Jurnal Buana Informatika ◽

10.24002/jbi.v11i2.3706 ◽

2020 ◽

Vol 11 (2) ◽

pp. 93

Author(s):

Latius Hermawan ◽

Maria Bellaniar Ismiati

Keyword(s):

String Matching ◽

Plagiarism Detection ◽

Matching Method ◽

Text Documents ◽

Matching Algorithm ◽

The Common ◽

Copy And Paste ◽

Online Sources

Abstract. Website-Based Application for Checking Students' Digital Assignments. Technology today is no longer only about computers; it has advanced to smartphones and other devices. At UKMC, technology already helps a great deal with daily work. However, the university has no application for checking students' digital assignments for plagiarism, even though students sometimes plagiarize online sources when working on assignments. Assignments are easily produced by copying and pasting without citing the reference, because students tend to think practically when working on them. Plagiarism is strictly prohibited in education. A plagiarism detection application therefore needs to be built. The application applies a string-matching algorithm to text documents to find the words common to several documents. Matching a document against other documents in this way yields an output that reports how similar the text documents are. Testing showed that the application can help lecturers and students reduce the level of plagiarism. Keywords: Application, Plagiarism, Digital, Assignment

An Improved B-hill Climbing Optimization Technique for Solving the Text Documents Clustering Problem

Current Medical Imaging Formerly Current Medical Imaging Reviews ◽

10.2174/1573405614666180903112541 ◽

2020 ◽

Vol 16 (4) ◽

pp. 296-306 ◽

Cited By ~ 3

Author(s):

Laith Mohammad Abualigah ◽

Essam Said Hanandeh ◽

Ahamad Tajudin Khader ◽

Mohammed Abdallh Otair ◽

Shishir Kumar Shandilya

Keyword(s):

Optimization Technique ◽

Document Clustering ◽

Text Clustering ◽

Hill Climbing ◽

Text Documents ◽

Clustering Problem ◽

Text Document ◽

Text Information ◽

Amount Of Knowledge ◽

The Hill

Background: With the increasing volume of text documents on Internet pages, handling such a large amount of information becomes very complex. Text clustering is a common optimization problem in which a large amount of text information is organized into a set of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, β-hill climbing, for the text document clustering problem; it models β-hill climbing so that similar documents are partitioned into the same cluster. Methods: The β parameter is the primary innovation of the β-hill climbing technique. It was introduced to balance local and global search. Local search methods such as k-medoid and k-means have been applied successfully to text document clustering. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics, taken from the Laboratory of Computational Intelligence (LABIC). The results show that the proposed β-hill climbing achieved better results than the original hill climbing technique on the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
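The role of the β operator can be illustrated with a small sketch. It follows the general β-hill climbing scheme (a neighborhood move, then a random reset of each position with probability β, then greedy acceptance) and is not the paper's implementation. In particular, the objective used here, the within-cluster sum of squared distances on small numeric vectors, is an assumption that stands in for the paper's text-similarity objective, and all names and default values are hypothetical.

```python
import random


def sse(points, assignment, k):
    """Within-cluster sum of squared distances to each cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        if not members:
            continue
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


def beta_hill_climbing(points, k, beta=0.05, iterations=500, seed=0):
    """Return (assignment, cost, history) after a fixed number of iterations."""
    rng = random.Random(seed)
    n = len(points)
    current = [rng.randrange(k) for _ in range(n)]
    current_cost = sse(points, current, k)
    history = [current_cost]
    for _ in range(iterations):
        candidate = current[:]
        # Neighborhood move (local search): reassign one random document.
        candidate[rng.randrange(n)] = rng.randrange(k)
        # Beta operator (global search): reset each position with probability beta.
        for i in range(n):
            if rng.random() < beta:
                candidate[i] = rng.randrange(k)
        cost = sse(points, candidate, k)
        # Greedy acceptance: keep the candidate only if it is not worse.
        if cost <= current_cost:
            current, current_cost = candidate, cost
        history.append(current_cost)
    return current, current_cost, history
```

Setting `beta=0` reduces the loop to plain hill climbing, which makes the balance between local and global search that the abstract describes easy to experiment with.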

Text Document Summarization Using POS tagging for Kannada Text Documents

2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence) ◽

10.1109/confluence51648.2021.9377106 ◽

2021 ◽

Author(s):

Jayashree R ◽

Basavaraj S Anami ◽

Poornima B K

Keyword(s):

Text Documents ◽

Document Summarization ◽

Pos Tagging ◽

Text Document

Development of the documents comparison module for an electronic document management system

Information Technology and Nanotechnology ◽

10.18287/1613-0073-2019-2416-527-533 ◽

2019 ◽

pp. 527-533

Author(s):

M A Mikheev ◽

P Y Yakimov

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Document Management ◽

Electronic Document ◽

Text Documents ◽

Text Document ◽

Document Management System ◽

Optical Character ◽

Electronic Document Management ◽

Scanned Image

The article addresses the problem of comparing document versions in electronic document management systems. Comparable systems were reviewed, and the process of comparing text documents was studied. To recognize the text in a scanned image, optical character recognition was chosen, implemented with the Tesseract library. The Myers algorithm is applied to compare the recognized texts. The text document comparison module was implemented in software using these solutions.
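As a rough illustration of the comparison step, the sketch below computes the length of the shortest edit script between two sequences with the greedy O(ND) formulation of the Myers algorithm. It counts insertions and deletions only and does not recover the script itself; the function name is hypothetical, and the OCR stage is not shown.

```python
def myers_edit_distance(a, b):
    """Length of the shortest insert/delete script turning sequence a into b."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] holds the furthest x reached so far on diagonal k (where k = x - y).
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # step down: insert an element of b
            else:
                x = v[k - 1] + 1    # step right: delete an element of a
            y = x - k
            # Follow the diagonal while the elements match (no edit cost).
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

The function accepts any indexable sequences, so it can compare two texts character by character or, after splitting them with `splitlines()`, line by line as a diff tool would.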

The Evaluation of Accuracy Performance in an Enhanced Embedded Feature Selection for Unstructured Text Classification

Iraqi Journal of Science ◽

10.24996/ijs.2020.61.12.28 ◽

2020 ◽

pp. 3397-3407

Author(s):

Nur Syafiqah Mohd Nafis ◽

Suryanti Awang

Keyword(s):

Feature Selection ◽

Text Classification ◽

Training Dataset ◽

Recursive Feature Elimination ◽

High Dimensional ◽

Significant Feature ◽

Support Vector ◽

Svm Classifier ◽

Text Documents ◽

Text Document

Text documents are unstructured and high dimensional. Effective feature selection is required to select the most important and significant features from the sparse feature space. This paper therefore proposes an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high-dimensional text classification. The technique can measure a feature's importance in a high-dimensional text document. It also aims to make feature selection more efficient and thereby obtain promising text classification accuracy. In the first stage, TF-IDF acts as a filter approach that measures the importance of features in the text documents. In the second stage, SVM-RFE uses a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subsets. The research runs sets of experiments on text documents retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing is applied to extract relevant features, after which the pre-processed features are divided into training and testing datasets. Feature selection is then performed on the training dataset by calculating the TF-IDF score of each feature, and SVM-RFE is applied for feature ranking as the next feature selection step. Only top-ranked features are selected for text classification with the SVM classifier. In the experiments, the proposed technique achieved 98% accuracy, outperforming other existing techniques. In conclusion, the proposed technique is able to select the significant features in unstructured and high-dimensional text documents.
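The first (filter) stage can be sketched with a plain TF-IDF computation. The weighting variant (relative term frequency times the natural log of N/df), the ranking of features by their summed weight, and the function names are assumptions for illustration; the abstract does not give these details, and the SVM-RFE stage is not shown.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights


def top_features(docs, keep):
    """Filter step: keep the terms with the highest summed TF-IDF weight."""
    totals = Counter()
    for doc_weights in tf_idf(docs):
        for term, weight in doc_weights.items():
            totals[term] += weight
    # Break ties alphabetically so the result is deterministic.
    ranked = sorted(totals, key=lambda term: (-totals[term], term))
    return ranked[:keep]
```

A term that occurs in every document gets weight zero under this variant, which is why the filter tends to discard very common words before the more expensive ranking stage runs.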

Assessment of Twitter Data Clusters with Cosine-Based Validation Metrics Using Hybrid Topic Models

Ingénierie des systèmes d information ◽

10.18280/isi.250606 ◽

2020 ◽

Vol 25 (6) ◽

pp. 755-769

Author(s):

Noorullah R. Mohammed ◽

Moulana Mohammed

Keyword(s):

Data Clustering ◽

Topic Models ◽

Cluster Validity ◽

Text Documents ◽

Text Data ◽

Validity Assessment ◽

Text Document ◽

Cluster Validity Indices ◽

Validity Indices ◽

Data Clusters

Text data clustering organizes a set of text documents into the desired number of coherent and meaningful sub-clusters. Modeling the text documents in terms of derived topics is a vital task in text data clustering. Each tweet is treated as a text document, and various topic models are used to model the tweets. In existing topic models, the clustering tendency of tweets is initially assessed with Euclidean dissimilarity features. The cosine metric gives a more informative assessment, especially for text clustering. This paper therefore develops a novel cosine-based internal and external validity assessment of cluster tendency to improve the computational efficiency of tweet data clustering. In the experiments, tweet clustering results are evaluated using cluster validity indices. The experiments show that the cosine-based internal and external validity metrics outperform the others on benchmark and Twitter-based datasets.
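One reason the cosine metric suits text better than Euclidean distance is that it ignores document length. The sketch below compares the two on raw term-count vectors; the bag-of-words representation and the function names are assumptions for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def cosine_dissimilarity(a, b):
    """1 - cosine similarity of two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance between two term-count mappings."""
    terms = a.keys() | b.keys()
    return math.sqrt(sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in terms))


short = Counter("rain in spain".split())
long_ = Counter("rain in spain rain in spain".split())

# Same content repeated twice: cosine sees no difference, Euclidean does.
print(round(cosine_dissimilarity(short, long_), 6))
print(round(euclidean_distance(short, long_), 6))
```

Two tweets with the same word proportions but different lengths are therefore treated as identical by the cosine measure, while the Euclidean measure separates them.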

Latent Dirichlet Allocation and POS Tags Based Method for External Plagiarism Detection

Scholarly Ethics and Publishing ◽

10.4018/978-1-5225-8057-7.ch015 ◽

2019 ◽

pp. 319-336

Author(s):

Ali Daud ◽

Jamal Ahmad Khan ◽

Jamal Abdul Nasir ◽

Rabeeh Ayaz Abbasi ◽

Naif Radi Aljohani ◽

...

Keyword(s):

Latent Dirichlet Allocation ◽

Plagiarism Detection ◽

Text Documents ◽

Parts Of Speech ◽

Stop Word ◽

Processing Step ◽

Syntactic Information ◽

N Gram ◽

Basic Hypothesis ◽

Dirichlet Allocation

In this article we present a new semantic and syntactic method for external plagiarism detection. In the proposed approach, latent Dirichlet allocation (LDA) and parts of speech (POS) tags are used together to detect plagiarism between the sample and a number of source documents. The basic hypothesis is that considering both semantic and syntactic information between two text documents may improve the performance of the plagiarism detection task. Our method has two steps: a pre-processing step, in which we detect the topics of the sentences in the documents using LDA and convert each sentence into an array of POS tags, and a post-processing step, in which the suspicious cases are verified purely on the basis of semantic rules. For two types of external plagiarism (copy and random obfuscation), we empirically compare our approach with the state-of-the-art N-gram based and stop-word N-gram based methods and observe significant improvements.
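The stop-word N-gram baseline mentioned in this abstract can be sketched briefly. The idea is to reduce each document to its sequence of stop words, which tends to survive when content words are replaced, and then compare N-grams of that sequence. The stop-word list, the value of n, and the containment measure below are assumptions for illustration. The sketch shows only this baseline, not the LDA and POS-tag method the article proposes.

```python
import re

# Hypothetical short stop-word list; published baselines use a fixed list
# of the most frequent English words.
STOP_WORDS = {
    "the", "of", "and", "a", "in", "to", "is", "was", "it", "for",
    "with", "he", "be", "on", "i", "that", "by", "at", "you",
}


def stopword_ngrams(text, n=3):
    """Set of n-grams over the document's stop-word sequence."""
    words = re.findall(r"[a-z']+", text.lower())
    seq = [w for w in words if w in STOP_WORDS]
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def containment(suspicious, source, n=3):
    """Fraction of the suspicious document's stop-word n-grams found in the source."""
    grams = stopword_ngrams(suspicious, n)
    if not grams:
        return 0.0
    return len(grams & stopword_ngrams(source, n)) / len(grams)
```

For instance, "The dog lay on the rug in the castle of the queen" keeps exactly the stop-word skeleton of "The cat sat on the mat in the house of the king", so the containment score is 1.0 even though every content word differs.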

Rabin Karp And Winnowing Algorithm For Statistics Of Text Document Plagiarism Detection

2019 7th International Conference on Cyber and IT Service Management (CITSM) ◽

10.1109/citsm47753.2019.8965422 ◽

2019 ◽

Cited By ~ 1

Author(s):

Dedi Leman ◽

Maulia Rahman ◽

Frans Ikorasaki ◽

Bob Subhan Riza ◽

Muhammad Barkah Akbbar

Keyword(s):

Plagiarism Detection ◽

Text Document
