Constructing an Academic Thai Plagiarism Corpus for Benchmarking Plagiarism Detection Systems

2018, Vol 18 (3), pp. 186-202
Author(s): Supawat Taerungruang, Wirote Aroonmanakun

Author(s): Beata Bielska, Mateusz Rutkowski

The article analyses the phenomenon of copying (plagiarism) in higher education. The analyses are based on a quantitative questionnaire survey conducted in 2019 at one of the Polish universities. Plagiarism is discussed here both as an element of the learning process and as a subject of public practices. The article presents students’ definitions of plagiarism, their strategies for unclear or difficult situations, their experiences with plagiarism, and their opinions on how serious and widespread the phenomenon is. Focusing on the non-plagiarism norm, that is, the rule that students are not allowed to plagiarize, we have identified two strategies that students adopt to redefine it. The first is withdrawing for fear of making a mistake (omitting the norm), which means not using referencing in unclear situations, e.g. when data about the source of information are absent. The second is reducing the scope of the norm’s applicability (limiting the norm): students distinguish areas where the non-plagiarism norm must be observed closely from those where it matters less. For example, respondents classify works as credit-level or diploma-level texts, and in credit-level work they “can” sometimes plagiarize because the detection rate is poor and the consequences are not severe. The presented results are particularly significant for interpreting plagiarism in an international context (no uniform definition of plagiarism) and for policies aimed at limiting the scale of the phenomenon (plagiarism detection systems).


Author(s): Samuel P. M. Choi, Sze Sing Lam

Academic plagiarism is regarded as a serious offense, and much past effort has been devoted to building stand-alone plagiarism detection systems for a specific language. This paper proposes a new information retrieval-based plagiarism detection algorithm that handles multilingual documents and enables seamless integration with learning management systems. The proposed algorithm employs information retrieval and sequence matching techniques to identify suspected plagiarized sentences and permits parametric control to reduce both false-positive and false-negative results. The full-featured implementation, called iChecker, not only quickly identifies suspected plagiarized works but also eases academics' effort in evaluating the severity of the offence by providing a quantified measure. iChecker is currently adopted by over 300 courses (some with several hundred students) and has obtained satisfactory results. From 2012 to 2016, iChecker processed and verified a total of 276,943 documents in English, Traditional Chinese and Simplified Chinese text.
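The general idea of sentence-level sequence matching with a tunable threshold can be illustrated with a minimal sketch. This is not iChecker's algorithm, which the abstract does not specify. The function names, the naive sentence splitter, and the default threshold of 0.8 are assumptions made for illustration, and Python's standard `difflib.SequenceMatcher` stands in for whatever matching technique the authors use.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def suspected_sentences(submission, source, threshold=0.8):
    """Return (submission_sentence, source_sentence, ratio) tuples for every
    submission sentence whose best match in the source reaches the threshold."""
    hits = []
    source_sents = split_sentences(source)
    for sent in split_sentences(submission):
        best_ratio, best_src = 0.0, None
        for src in source_sents:
            # Character-level similarity in [0, 1]; case is ignored.
            ratio = SequenceMatcher(None, sent.lower(), src.lower()).ratio()
            if ratio > best_ratio:
                best_ratio, best_src = ratio, src
        if best_src is not None and best_ratio >= threshold:
            hits.append((sent, best_src, best_ratio))
    return hits


source = "Plagiarism is a serious offense. Detection tools help teachers."
submission = "Plagiarism is a serious offence. I wrote this part myself."
hits = suspected_sentences(submission, source, threshold=0.8)
```

Raising the threshold trades false positives for false negatives, which is one plausible reading of the "parametric control" the abstract mentions. The fraction of flagged sentences could serve as a rough severity score, though the abstract does not say how iChecker quantifies severity.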


Author(s): Alaa Altheneyan, Mohamed El Bachir Menai

Paraphrase identification is a natural language processing (NLP) problem that involves the determination of whether two text segments have the same meaning. Various NLP applications rely on a solution to this problem, including automatic plagiarism detection, text summarization, machine translation (MT), and question answering. The methods for identifying paraphrases found in the literature fall into two main classes: similarity-based methods and classification methods. This paper presents a critical study and an evaluation of existing methods for paraphrase identification and its application to automatic plagiarism detection. It presents the classes of paraphrase phenomena, the main methods, and the sets of features used by each particular method. All the methods and features used are discussed and enumerated in a table for easy comparison. Their performances on benchmark corpora are also discussed and compared via tables. Automatic plagiarism detection is presented as an application of paraphrase identification. The performances on benchmark corpora of existing plagiarism detection systems able to detect paraphrases are compared and discussed. The main outcome of this study is the identification of word overlap, structural representations, and MT measures as feature subsets that lead to the best performance results for support vector machines in both paraphrase identification and plagiarism detection on corpora. The performance results achieved by deep learning techniques highlight that these techniques are the most promising research direction in this field.
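To make the notion of a word-overlap feature concrete, here is a minimal sketch of one common similarity-based measure, the Jaccard overlap between the word sets of two segments. It is only an illustration: the abstract does not say which overlap measures the surveyed methods use, and the function names and the 0.5 decision threshold are assumptions.

```python
import re


def tokens(text):
    # Lowercased set of word tokens (assumption: simple \w+ tokenization).
    return set(re.findall(r"\w+", text.lower()))


def jaccard_overlap(a, b):
    """Shared words divided by all distinct words across both segments."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # Convention chosen here for two empty segments.
    return len(ta & tb) / len(ta | tb)


def is_paraphrase(a, b, threshold=0.5):
    # Similarity-based decision: compare the overlap score with a threshold.
    return jaccard_overlap(a, b) >= threshold


score = jaccard_overlap("The cat sat on the mat", "The cat sat on a mat")
```

In a classification method, a score like this would be one feature among several (alongside structural and MT-based measures) fed to a learner such as a support vector machine, rather than being thresholded directly.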


Author(s): Hadj Ahmed Bouarara, Reda Mohamed Hamou, Amine Rahmani, Abdelmalek Amine

Plagiarism cases are increasing day after day and have become a crucial problem in the modern world, driven by the quantity of textual information available on the web and the development of communication means such as email. This paper presents two plagiarism detection systems. The first is a boosting system based on machine learning algorithms (C4.5 decision tree and k-nearest neighbour) and composed of three steps (text pre-processing, first detection, and second detection). The second uses a genetic algorithm based on an initial population generated from the dataset, a fixed fitness function, and reproduction rules (selection, crossover, and mutation). For their experiments, the authors used the PAN 09 benchmark and a set of validation measures (precision, recall, F-measure, FNR, FPR, and entropy), varying the configuration of each system, and compared their results with the performance of other approaches found in the literature. Finally, a visualisation service was developed that provides a graphical view of the results using two methods (3D cube and cobweb), with zooming and rotation allowing both detailed and global views. The authors aim to improve the quality of plagiarism detection systems and the preservation of copyright.
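Most of the validation measures named in this abstract have standard definitions in terms of confusion-matrix counts. The sketch below computes precision, recall, F-measure, FNR, and FPR from those counts. It assumes a binary plagiarized/not-plagiarized decision and the balanced F1 form of the F-measure. Entropy is omitted because the abstract does not define how it is computed. The function name and the example counts are hypothetical.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard measures from counts of true/false positives and negatives."""

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)   # flagged cases that are truly plagiarized
    recall = ratio(tp, tp + fn)      # plagiarized cases that were flagged
    f_measure = ratio(2 * precision * recall, precision + recall)
    fnr = ratio(fn, fn + tp)         # plagiarized cases that were missed
    fpr = ratio(fp, fp + tn)         # original cases wrongly flagged
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fnr": fnr,
        "fpr": fpr,
    }


# Hypothetical counts, for illustration only (not results from the paper).
metrics = detection_metrics(tp=8, fp=2, tn=85, fn=5)
```

Recall and FNR always sum to 1 when at least one plagiarized case exists, so reporting both mainly serves readability.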


2013, Vol 39 (4), pp. 917-947
Author(s): Alberto Barrón-Cedeño, Marta Vila, M. Martí, Paolo Rosso

Although paraphrasing is the linguistic mechanism underlying many plagiarism cases, little attention has been paid to its analysis in the framework of automatic plagiarism detection. Therefore, state-of-the-art plagiarism detectors find it difficult to detect cases of paraphrase plagiarism. In this article, we analyze the relationship between paraphrasing and plagiarism, paying special attention to which paraphrase phenomena underlie acts of plagiarism and which of them are detected by plagiarism detection systems. With this aim in mind, we created the P4P corpus, a new resource that uses a paraphrase typology to annotate a subset of the PAN-PC-10 corpus for automatic plagiarism detection. The results of the Second International Competition on Plagiarism Detection were analyzed in the light of this annotation. The presented experiments show that (i) more complex paraphrase phenomena and a high density of paraphrase mechanisms make plagiarism detection more difficult, (ii) lexical substitutions are the paraphrase mechanisms used the most when plagiarizing, and (iii) paraphrase mechanisms tend to shorten the plagiarized text. For the first time, the paraphrase mechanisms behind plagiarism have been analyzed, providing critical insights for the improvement of automatic plagiarism detection systems.

