Plagiarism Detection in Students' Theses Using The Cosine Similarity Method

SinkrOn ◽  
2021 ◽  
Vol 5 (2) ◽  
pp. 305-313
Author(s):  
Oppi Anda Resta ◽  
Addin Aditya ◽  
Febry Eka Purwiantono

The main requirement for students to graduate is the completion of a final scientific paper, and one factor determining the quality of a student's scientific work is its uniqueness and innovation. This research applies data mining methods to detect similarities in the titles, abstracts, or topics of students' final scientific papers so that plagiarism can be prevented. The cosine similarity method is combined with preprocessing and TF-IDF weighting to calculate the level of similarity between the title and abstract of a student's final scientific paper and the existing final-project repository; the result is then compared against a threshold value to decide whether the work is accepted or rejected. Based on the test and training data processed with the TF-IDF method, the similarity between the training documents and the test documents is 8%, which indicates that the student thesis is still classified as unique and does not contain plagiarized content. The findings of this study can help the university manage the administration of student theses so that plagiarism does not occur. Further study of additional methods is needed to increase accuracy and make the system run faster and more optimally.
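The pipeline described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' code: the tokenized documents, the IDF smoothing, and the 0.8 acceptance threshold are all assumptions for the example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Smoothed IDF so terms appearing in every document still get a small weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine_similarity(a, b):
    """Cosine similarity between two sparse TF-IDF vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical documents after preprocessing (lowercasing, tokenization).
repo_doc = "plagiarism detection thesis cosine similarity method".split()
new_doc = "sentiment analysis of training comments with knn".split()

vecs = tf_idf_vectors([repo_doc, new_doc])
score = cosine_similarity(vecs[0], vecs[1])
THRESHOLD = 0.8  # illustrative; the paper does not state its threshold value
decision = "rejected" if score >= THRESHOLD else "accepted"
```

Identical documents score 1.0 and documents with no shared vocabulary score 0.0, so the threshold cleanly separates near-duplicates from unique submissions.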

2019 ◽  
Vol 10 (2) ◽  
Author(s):  
Dhamayanti Dhamayanti ◽  
Lidia Permata Sari

ABSTRACT

A thesis is the final project that students must complete to finish their studies at Indo Global Mandiri University in Palembang. Thesis data processing and storage, especially in the Information Systems department, is still done conventionally, so similarity in the titles, or even the contents, of student theses is difficult to detect. This difficulty allows students to easily and freely plagiarize proposal preparation and thesis reports from beginning to end without the knowledge of the lecturer or the department. Plagiarism is a shortcut that steals ideas, takes the work of others, and claims it as one's own without citing the original source. This research addresses the plagiarism problem in the Information Systems department by building an application that can detect plagiarism in thesis titles and contents, specifically for that department. The plagiarism detection application is built using the cosine similarity method. Cosine similarity is a method for calculating the level of similarity between two objects; in document-similarity testing, it showed a high degree of accuracy. It computes a similarity value by comparing documents word by word and is one of the most popular techniques for measuring text similarity. The plagiarism detection application, implemented in PHP with MySQL as the database, can help reduce plagiarism in thesis titles and contents in the Information Systems department.

Keywords: Plagiarism, Plagiarism Detection Application, Cosine Similarity, PHP
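The word-by-word comparison this abstract describes amounts to cosine similarity over raw term counts. A minimal Python sketch follows (illustrative only; the application itself was written in PHP, and the example titles are invented):

```python
import math
from collections import Counter

def count_cosine(text_a, text_b):
    """Word-by-word cosine similarity on raw term counts, with no weighting."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

# Two hypothetical thesis titles sharing three of four words each.
sim = count_cosine("deteksi plagiasi judul skripsi",
                   "aplikasi deteksi plagiasi skripsi")
```

Here three shared words out of four per title give a dot product of 3 over norms of 2 × 2, i.e. a similarity of 0.75.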


2021 ◽  
Vol 5 (2) ◽  
pp. 726
Author(s):  
Indra Mawanta ◽  
T S Gunawan ◽  
Wanayumini Wanayumini

Deli Husada Health Institute is a health campus that has been established for 34 years and currently has 30,000 students. Every year, each final-year student submits a final project to the study program and must first propose its title. To reduce similarity among final-project titles, the study program usually checks them manually, which has proven ineffective: many titles, and consequently many final-project reports, end up looking much the same. Given these conditions, a sentence-similarity test of final-project titles was carried out using the cosine similarity method with TF-IDF weighting at the Deli Husada Delitua Health Institute campus. Testing the training data against itself showed that 43% of the submitted titles were not eligible to be submitted again because of high similarity to existing final-project titles, while 53% were eligible. The average processing time was 0.12117 minutes.
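The title-screening step above can be sketched as a best-match search over a repository of existing titles. This is an assumed workflow, not the authors' implementation; the repository titles and the 0.6 threshold are invented for illustration.

```python
import math
from collections import Counter

def _cos(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def screen_title(title, repo_titles, threshold=0.6):
    """Return (best_score, eligible): a proposed title is ineligible when its
    best cosine match in the repository reaches the threshold (illustrative)."""
    vec = Counter(title.lower().split())
    best_score = max(_cos(vec, Counter(t.lower().split())) for t in repo_titles)
    return best_score, best_score < threshold

# Hypothetical repository of previously accepted titles.
repo = ["sistem informasi akademik berbasis web",
        "deteksi plagiarisme dengan cosine similarity"]
best_score, eligible = screen_title(
    "deteksi plagiarisme judul dengan cosine similarity", repo)
```

The proposed title shares five of its six words with the second repository entry, so it scores well above the threshold and is flagged as ineligible.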


2022 ◽  
Vol 2 (2) ◽  
pp. 90-95
Author(s):  
Muhammad Azmi

Plagiarism is the act of duplicating or imitating the work of others and claiming it as one's own without the author's permission or a citation of the source. Plagiarism is not difficult to commit: with a copy-paste-modify technique applied to part or all of a document, the document becomes a duplicate. The practice occurs because students are accustomed to taking the writings of others without citing the original source, sometimes copying them in their entirety and exactly. Plagiarism is mostly committed by students, especially when completing a final project or thesis. One way to combat it is through prevention and detection. Plagiarism detection based on the concept of document similarity is one way to catch both copy-and-paste plagiarism and disguised plagiarism; an appropriate method is to analyze the level of document plagiarism using the cosine similarity method with TF-IDF weighting. This research produces an application that computes the similarity value of the documents to be tested. Testing shows agreement between manual calculations and the algorithm as implemented in the application. The stemming library used proved quite effective in the stemming process: calculations that use stemming yield a higher similarity value than calculations without stemming.
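The closing claim, that stemming raises the measured similarity, can be demonstrated with a toy example. The crude suffix stripper below is a stand-in for a real stemming library (it is not the library the paper used), and the two documents are invented:

```python
import math
from collections import Counter

def toy_stem(word):
    """Very crude suffix stripper standing in for a real stemming library."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def cosine_counts(tokens_a, tokens_b):
    """Cosine similarity over raw token counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

doc_a = "students copied papers".split()
doc_b = "student copies paper".split()

plain = cosine_counts(doc_a, doc_b)
stemmed = cosine_counts([toy_stem(w) for w in doc_a],
                        [toy_stem(w) for w in doc_b])
```

Without stemming the two documents share no exact token and score 0.0; after stemming both reduce to {student, copi, paper} and score 1.0, matching the paper's observation.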


2019 ◽  
Vol 8 (1) ◽  
pp. 27-35
Author(s):  
Jans Hendry ◽  
Aditya Rachman ◽  
Dodi Zulherman

In this study, a system was developed to help detect the accuracy of Quran recitation of Surah Al-Kautsar, based on the correctness of the number and pronunciation of words in one complete surah. The system depends heavily on the accuracy of word segmentation based on signal envelopes. Mel Frequency Cepstrum Coefficients (MFCC) were used for feature extraction, while the cosine similarity method was used to detect reading accuracy. Of 60 data items, 30 were used for training and the rest for testing; each set of 30 contained 15 correct readings and 15 incorrect readings. System accuracy was measured by word-for-word recognition, yielding 100% recall and 98.96% precision on the training words, and 100% recall and 99.65% precision on the test words. For the reading of the surah as a whole, 15 correct readings and 14 incorrect readings were recognized correctly.
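Matching an observed feature vector against reference templates by cosine similarity can be sketched as follows. The four-dimensional vectors are hypothetical stand-ins for real MFCC feature vectors, and the template names are invented; this is not the authors' recognizer.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical 4-D stand-ins for MFCC vectors of reference pronunciations.
templates = {
    "word_correct": [1.0, 0.5, -0.2, 0.1],
    "word_wrong":   [-0.8, 0.1, 0.9, -0.3],
}
observed = [0.9, 0.6, -0.1, 0.2]

# Classify the observed segment as the template it is most similar to.
label = max(templates, key=lambda k: cosine(observed, templates[k]))
```

The observed vector points in nearly the same direction as the "word_correct" template, so it is labelled as a correct pronunciation.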


Kilat ◽  
2019 ◽  
Vol 8 (1) ◽  
Author(s):  
Riki Ruli A. Siregar ◽  
Zuhdiyyah Ulfah Siregar ◽  
Rakhmat Arianto

Analyzing and classifying comment data by reading and sorting negative comments one by one in Ms. Excel is not effective when the data to be processed is large. This study therefore applies sentiment analysis to comment data using the K-Nearest Neighbor (KNN) method. The comments used were written by participants of training courses at Udiklat Jakarta. The comment data is preprocessed, the words are weighted using Term Frequency-Inverse Document Frequency, and the similarity between the training data and test data is calculated with cosine similarity. Sentiment analysis determines whether each comment is positive or negative, and the comments are then classified into four categories: instructors, materials, facilities, and infrastructure. The result is a system that can classify comment data automatically with an accuracy of 94.23%.
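KNN with cosine similarity can be sketched in a few lines. For brevity this example uses raw term counts rather than the TF-IDF weighting the study applies, and the labelled training comments are invented stand-ins for the Udiklat Jakarta data:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical labelled training comments.
train = [
    ("great instructor clear material", "positive"),
    ("excellent material and clear explanation", "positive"),
    ("poor facilities broken projector", "negative"),
    ("broken chairs poor room", "negative"),
]

def knn_predict(text, train, k=3):
    """Label a comment by majority vote among its k most cosine-similar neighbours."""
    query = Counter(text.lower().split())
    sims = sorted(((cosine(query, Counter(t.split())), label) for t, label in train),
                  reverse=True)
    top = [label for _, label in sims[:k]]
    return max(set(top), key=top.count)
```

A query such as "great clear material" overlaps only the positive neighbours, so two of its three nearest neighbours vote positive.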


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Young Jae Kim ◽  
Jang Pyo Bae ◽  
Jun-Won Chung ◽  
Dong Kyun Park ◽  
Kwang Gi Kim ◽  
...  

Colorectal cancer occurs in the gastrointestinal tract and is the third most common of the 27 major types of cancer in South Korea and worldwide. Colorectal polyps are known to increase the potential for developing colorectal cancer, and detected polyps need to be resected to reduce that risk. This research improved the performance of polyp classification through fine-tuning of a Network-in-Network (NIN) after applying a model pre-trained on the ImageNet database. Random shuffling was performed 20 times on 1,000 colonoscopy images; each shuffle divides the data into 800 training images and 200 test images, and accuracy is evaluated on the 200 test images in each of the 20 experiments. Three compared methods were constructed from AlexNet by transferring weights trained on three different state-of-the-art databases; a plain AlexNet without transfer learning was also compared. The accuracy of the proposed method was higher, with statistical significance, than that of the four other state-of-the-art methods, and showed an 18.9% improvement over the plain AlexNet baseline. The area under the curve was approximately 0.930 ± 0.020, and the recall rate was 0.929 ± 0.029. Such an automatic algorithm can assist endoscopists in identifying adenomatous polyps, given its high recall rate and accuracy, and can enable the timely resection of polyps at an early stage.
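The repeated-shuffling evaluation protocol described above (20 shuffles of 1,000 images into 800/200 train/test splits) can be sketched as follows. The fixed seed and the function name are assumptions for the example; the paper does not describe its shuffling code.

```python
import random

def shuffled_splits(items, n_rounds=20, train_size=800, seed=0):
    """Repeatedly shuffle the data and split it into train/test sets,
    as in the 20-experiment evaluation protocol."""
    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    pool = list(items)
    splits = []
    for _ in range(n_rounds):
        rng.shuffle(pool)
        # Slicing copies, so each round's split is independent of later shuffles.
        splits.append((pool[:train_size], pool[train_size:]))
    return splits

# Stand-in for 1,000 colonoscopy image identifiers.
splits = shuffled_splits(range(1000))
```

Each of the 20 rounds yields a disjoint 800/200 partition of the same 1,000 items, so accuracy can be averaged across rounds.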


2021 ◽  
Author(s):  
Octavian Dumitru ◽  
Gottfried Schwarz ◽  
Mihai Datcu ◽  
Dongyang Ao ◽  
Zhongling Huang ◽  
...  

During the last years, much progress has been made with machine learning algorithms. Typical application fields of machine learning include many technical and commercial applications as well as Earth science analyses, where indirect and distorted detector data most often have to be converted to well-calibrated scientific data, a prerequisite for a correct understanding of the desired physical quantities and their relationships.

However, the provision of sufficient calibrated data is not enough for the testing, training, and routine processing of most machine learning applications. In principle, one also needs a clear strategy for the selection of necessary and useful training data and an easily understandable quality control of the finally desired parameters.

At first glance, one could guess that this problem can be solved by a careful selection of representative test data covering many typical cases as well as some counterexamples; these test data can then be used to train the internal parameters of a machine learning application. At second glance, however, many researchers have found that simply stacking up plain examples is not the best choice for many scientific applications.

To get improved machine learning results, we concentrated on the analysis of satellite images depicting the Earth's surface under various conditions such as the selected instrument type, spectral bands, and spatial resolution. In our case, such data are routinely provided by the freely accessible European Sentinel satellite products (e.g., Sentinel-1 and Sentinel-2). Our basic work then included investigations of how some additional processing steps, linked with the selected training data, can provide better machine learning results.

To this end, we analysed and compared three different approaches to find machine learning strategies for the joint selection and processing of training data for our Earth observation images:

- One can optimize the training data selection by adapting the data selection to the specific instrument, target, and application characteristics [1].
- As an alternative, one can dynamically generate new training parameters with Generative Adversarial Networks; this is comparable to the role of a sparring partner in boxing [2].
- One can also use a hybrid semi-supervised approach for Synthetic Aperture Radar images with limited labelled data. The method is split into polarimetric scattering classification, topic modelling for scattering labels, unsupervised constraint learning, and supervised label prediction with constraints [3].

We applied these strategies in the ExtremeEarth sea-ice monitoring project (http://earthanalytics.eu/). As a result, we can demonstrate for which application cases these three strategies provide a promising alternative to a simple conventional selection of available training data.

[1] C.O. Dumitru et al., "Understanding Satellite Images: A Data Mining Module for Sentinel Images", Big Earth Data, 2020, 4(4), pp. 367-408.
[2] D. Ao et al., "Dialectical GAN for SAR Image Translation: From Sentinel-1 to TerraSAR-X", Remote Sensing, 2018, 10(10), pp. 1-23.
[3] Z. Huang et al., "HDEC-TFA: An Unsupervised Learning Approach for Discovering Physical Scattering Properties of Single-Polarized SAR Images", IEEE Transactions on Geoscience and Remote Sensing, 2020, pp. 1-18.


2021 ◽  
Vol 10 (1) ◽  
pp. 105
Author(s):  
I Gusti Ayu Purnami Indryaswari ◽  
Ida Bagus Made Mahendra

Many Indonesian people, especially in Bali, raise pigs as livestock. Pigs are susceptible to various diseases, and many cases of pig deaths from disease have caused losses to breeders. The author therefore built an Android-based application that can predict the type of disease in pigs by applying the C4.5 algorithm, a classification algorithm that derives rules used for prediction. In this study, 50 training data sets covering 8 types of pig disease and 31 disease symptoms were input into the system, so that the resulting Android application can predict the type of disease in a pig. Testing on 15 test data sets produced an accuracy of 86.7%. The application features, built with the Kotlin programming language and an SQLite database, ran as expected during testing.
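At the heart of C4.5 is choosing the symptom attribute that best separates the disease labels. The sketch below computes plain information gain on an invented two-symptom table (C4.5 actually ranks attributes by gain ratio, i.e. gain divided by split information; plain gain is shown for brevity, and the symptom and disease names are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting the rows on one symptom attribute."""
    base = entropy(labels)
    n = len(rows)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(label)
    return base - sum(len(subset) / n * entropy(subset)
                      for subset in by_value.values())

# Hypothetical symptom table: (fever, appetite_loss) -> disease label.
rows = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "no")]
labels = ["hog_cholera", "hog_cholera", "worms", "worms"]

gain_fever = information_gain(rows, labels, 0)      # perfectly separates labels
gain_appetite = information_gain(rows, labels, 1)   # uninformative split
```

Here "fever" splits the labels perfectly (gain 1.0 bit) while "appetite_loss" tells us nothing (gain 0.0), so C4.5 would branch on fever first.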


Author(s):  
Yanxiang Yu ◽  
◽  
Chicheng Xu ◽  
Siddharth Misra ◽  
Weichang Li ◽  
...  

Compressional and shear sonic traveltime logs (DTC and DTS, respectively) are crucial for subsurface characterization and seismic-well ties. However, these two logs are often missing or incomplete in many oil and gas wells; therefore, many petrophysical and geophysical workflows include sonic-log synthetization or pseudo-log generation based on multivariate regression or rock-physics relations. From March 1, 2020 to May 7, 2020, the SPWLA PDDA SIG hosted a contest aiming to predict the DTC and DTS logs from seven "easy-to-acquire" conventional logs using machine-learning methods (GitHub, 2020). In the contest, a total of 20,525 data points with half-foot resolution from three wells were collected to train regression models using machine-learning techniques. Each data point had seven features, consisting of the conventional "easy-to-acquire" logs: caliper, neutron porosity, gamma ray (GR), deep resistivity, medium resistivity, photoelectric factor, and bulk density, as well as the two sonic logs (DTC and DTS) as the target. A separate data set of 11,089 samples from a fourth well was then used as the blind test set. The prediction performance of the model was evaluated using the root mean square error (RMSE) over both target logs as the metric:

RMSE = sqrt( (1 / (2m)) * sum_{i=1}^{m} [ (DTC_pred^i - DTC_true^i)^2 + (DTS_pred^i - DTS_true^i)^2 ] )

In the benchmark model (Yu et al., 2020), we used a Random Forest regressor with minimal preprocessing of the training data set; an RMSE score of 17.93 was achieved on the test data set. The top five models from the contest, on average, beat the performance of our benchmark model by 27% in RMSE score. In this paper, we review these five solutions, including preprocessing techniques and the different machine-learning models used, including neural networks, long short-term memory (LSTM), and ensemble trees. We found that data cleaning and clustering were critical for improving performance in all models.
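The contest metric above averages the squared errors of both target logs before taking the square root. A direct Python transcription (the example predictions are invented):

```python
import math

def joint_rmse(dtc_pred, dtc_true, dts_pred, dts_true):
    """Joint RMSE over DTC and DTS: sqrt((1/(2m)) * sum of both squared-error series)."""
    m = len(dtc_true)
    total = (sum((p - t) ** 2 for p, t in zip(dtc_pred, dtc_true))
             + sum((p - t) ** 2 for p, t in zip(dts_pred, dts_true)))
    return math.sqrt(total / (2 * m))

# Two hypothetical depth samples, each off by 1 unit in both DTC and DTS.
score = joint_rmse([100.0, 102.0], [101.0, 101.0],
                   [180.0, 183.0], [181.0, 182.0])
```

With every prediction off by exactly 1, the metric evaluates to sqrt(4 / 4) = 1.0, which is a quick sanity check for any reimplementation.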

