Paraphrase type identification for plagiarism detection using contexts and word embeddings

Author(s):  
Faisal Alvi ◽  
Mark Stevenson ◽  
Paul Clough

Abstract: Paraphrase types have been proposed by researchers as the paraphrasing mechanisms underlying acts of plagiarism. Synonymous substitution, word reordering and insertion/deletion have been identified as some of the common paraphrasing strategies used by plagiarists. However, similarity reports generated by most plagiarism detection systems provide only a similarity score and matching sections of text with their possible sources. In this research we propose methods to identify two important paraphrase types, synonymous substitution and word reordering, in paraphrased, plagiarised sentence pairs. We propose a three-stage approach that uses context matching and pretrained word embeddings for identifying synonymous substitution and word reordering. Our results indicate that the use of the Smith Waterman Algorithm for Plagiarism Detection and ConceptNet Numberbatch pretrained word embeddings produces the best performance in terms of $F_1$ scores. This research can be used to complement similarity reports generated by currently available plagiarism detection systems by incorporating methods to identify paraphrase types for plagiarism detection.
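To make the idea of aligning a plagiarised sentence with its source more concrete, here is a minimal sketch of word-level Smith-Waterman local alignment. It is not the authors' three-stage pipeline: the scoring parameters (`match`, `mismatch`, `gap`), the function name and the toy sentences are assumptions chosen for illustration. Aligned positions whose words differ are reported as candidate synonymous substitutions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment of two token lists.

    Returns (best_score, aligned_pairs). The scoring values are
    illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming table; local alignment floors cells at 0.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # skip a word of a
                score[i][j - 1] + gap,       # skip a word of b
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Hypothetical sentence pair, for illustration only.
source = "the quick brown fox jumps".split()
suspect = "the fast brown fox jumps".split()

best, pairs = smith_waterman(source, suspect)
# Aligned positions with different words are substitution candidates.
substitutions = [(s, t) for s, t in pairs if s != t]
print(best, substitutions)
```

In a fuller system, each candidate pair could then be checked against pretrained word embeddings (for example by cosine similarity) to decide whether the substitution is plausibly synonymous; that step is omitted here because it requires loading an embedding file.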

Author(s):  
Beata Bielska ◽  
Mateusz Rutkowski

Abstract: The article offers analyses of the phenomenon of copying (plagiarism) in higher education. The analyses were based on a quantitative survey using questionnaires, conducted in 2019 at one of the Polish universities. Plagiarism is discussed here both as an element of the learning process and a subject of public practices. The article presents students’ definitions of plagiarism, their strategies for unclear or difficult situations, their experiences with plagiarism and their opinions on how serious and widespread this phenomenon is. Focusing on the non-plagiarism norm, that is, the rule that students are not allowed to plagiarize, we have identified two strategies that students adopt to redefine it. The first is withdrawing in fear of making a mistake (omitting the norm), which means not using referencing in unclear situations, e.g. when the data about the source of information are absent. The second is reducing the scope of the norm's applicability (limiting the norm): students distinguish areas where the non-plagiarism norm must be observed more closely from those where it is not so important. For example, respondents classify works as credit-level and diploma-level texts, and in credit-level work they “can” sometimes plagiarize, since the detection rate is poor and the consequences are not severe. The presented results are particularly significant for interpreting plagiarism in an international context (no uniform definition of plagiarism) and for policies aimed at limiting the scale of the phenomenon (plagiarism detection systems).


2020 ◽  
Vol 27 (12) ◽  
pp. 1894-1902 ◽  
Author(s):  
Lana Yeganova ◽  
Sun Kim ◽  
Qingyu Chen ◽  
Grigory Balasanov ◽  
W John Wilbur ◽  
...  

Abstract
Objective: In a biomedical literature search, the link between a query and a document is often not established, because they use different terms to refer to the same concept. Distributional word embeddings are frequently used for detecting related words by computing the cosine similarity between them. However, previous research has not established either the best embedding methods for detecting synonyms among related word pairs or how effective such methods may be.
Materials and Methods: In this study, we first create the BioSearchSyn set, a manually annotated set of synonyms, to assess and compare 3 widely used word-embedding methods (word2vec, fastText, and GloVe) in their ability to detect synonyms among related pairs of words. We demonstrate the shortcomings of the cosine similarity score between word embeddings for this task: the same scores have very different meanings for the different methods. To address the problem, we propose utilizing pool adjacent violators (PAV), an isotonic regression algorithm, to transform a cosine similarity into a probability of 2 words being synonyms.
Results: Experimental results using the BioSearchSyn set as a gold standard reveal which embedding methods have the best performance in identifying synonym pairs. The BioSearchSyn set also allows converting cosine similarity scores into probabilities, which provides a uniform interpretation of the synonymy score over different methods.
Conclusions: We introduced the BioSearchSyn corpus of 1000 term pairs, which allowed us to identify the best embedding method for detecting synonymy for biomedical search. Using the proposed method, we created PubTermVariants2.0: a large, automatically extracted set of synonym pairs that have augmented PubMed searches since the spring of 2019.
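The calibration step described above can be illustrated with a small, self-contained sketch of pool adjacent violators. This is a generic textbook version of PAV, not the authors' code; the function names and the toy cosine scores and labels are assumptions made up for the example. Pairs are sorted by cosine similarity, and PAV then fits a non-decreasing sequence to the 0/1 synonym labels, which can be read as an estimated probability of synonymy at each score.

```python
def pav(values):
    """Pool adjacent violators: non-decreasing least-squares fit to values."""
    blocks = []  # each block is [mean, number_of_points]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    fitted = []
    for mean, n in blocks:
        fitted.extend([mean] * n)
    return fitted


def calibrate(cosines, labels):
    """Sort pairs by cosine score and fit monotone probabilities to labels."""
    order = sorted(range(len(cosines)), key=lambda k: cosines[k])
    sorted_scores = [cosines[k] for k in order]
    probs = pav([labels[k] for k in order])
    return sorted_scores, probs


# Hypothetical annotated word pairs: cosine score and 1 if synonyms, else 0.
cosines = [0.9, 0.2, 0.5, 0.4, 0.7]
labels = [1, 0, 0, 1, 1]

scores, probs = calibrate(cosines, labels)
for s, p in zip(scores, probs):
    print(f"cosine={s:.1f} -> estimated P(synonym)={p:.2f}")
```

Because the fitted values depend only on the ordering of scores and the labels, the same procedure can be applied separately to each embedding method, which is what makes the resulting probabilities comparable across methods.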


Author(s):  
Samuel P. M. Choi ◽  
Sze Sing Lam

Academic plagiarism is regarded as a serious offense, and much past effort has been devoted to building stand-alone plagiarism detection systems for a specific language. This paper proposes a new information retrieval-based plagiarism detection algorithm that handles multilingual documents and enables seamless integration with learning management systems. The proposed algorithm employs information retrieval and sequence matching techniques to identify suspected plagiarized sentences and permits parametric control to reduce both false-positive and false-negative results. The full-featured implementation, called iChecker, not only quickly identifies suspected plagiarized works but also eases academics' effort in evaluating the severity of the offense through a quantified measure. Currently, iChecker is used in over 300 courses (some with several hundred students) and has obtained satisfactory results. From 2012 to 2016, iChecker processed and verified a total of 276,943 documents in English, Traditional Chinese and Simplified Chinese.
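As a rough illustration of sentence-level sequence matching with a tunable threshold, the sketch below uses Python's standard `difflib.SequenceMatcher`. It is not the iChecker algorithm, whose details are not given here: the function name, the whitespace tokenisation, the `threshold` parameter and the sample sentences are all assumptions. Raising the threshold trades false positives for false negatives, which is one simple form of the parametric control mentioned above.

```python
from difflib import SequenceMatcher


def flag_suspect_sentences(suspect_sents, source_sents, threshold=0.8):
    """Return (suspect, best_source, ratio) for sentences at or above threshold.

    Similarity is difflib's ratio over lower-cased whitespace tokens;
    both the tokenisation and the threshold are illustrative assumptions.
    """
    flagged = []
    for sus in suspect_sents:
        sus_tokens = sus.lower().split()
        best_src, best_ratio = None, 0.0
        for src in source_sents:
            ratio = SequenceMatcher(None, sus_tokens, src.lower().split()).ratio()
            if ratio > best_ratio:
                best_src, best_ratio = src, ratio
        if best_src is not None and best_ratio >= threshold:
            flagged.append((sus, best_src, best_ratio))
    return flagged


# Hypothetical data, for illustration only.
sources = ["Plagiarism is a serious academic offense."]
submission = [
    "Plagiarism is a serious academic offence.",
    "The weather was nice today.",
]

flagged = flag_suspect_sentences(submission, sources)
for sus, src, ratio in flagged:
    print(f"{ratio:.2f}  {sus!r}  <-  {src!r}")
```

Whitespace tokenisation would not work for Chinese text, which has no spaces between words; a multilingual system would need a language-appropriate segmenter or character-level matching.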


Author(s):  
Alaa Altheneyan ◽  
Mohamed El Bachir Menai

Paraphrase identification is a natural language processing (NLP) problem that involves the determination of whether two text segments have the same meaning. Various NLP applications rely on a solution to this problem, including automatic plagiarism detection, text summarization, machine translation (MT), and question answering. The methods for identifying paraphrases found in the literature fall into two main classes: similarity-based methods and classification methods. This paper presents a critical study and an evaluation of existing methods for paraphrase identification and its application to automatic plagiarism detection. It presents the classes of paraphrase phenomena, the main methods, and the sets of features used by each particular method. All the methods and features used are discussed and enumerated in a table for easy comparison. Their performances on benchmark corpora are also discussed and compared via tables. Automatic plagiarism detection is presented as an application of paraphrase identification. The performances on benchmark corpora of existing plagiarism detection systems able to detect paraphrases are compared and discussed. The main outcome of this study is the identification of word overlap, structural representations, and MT measures as feature subsets that lead to the best performance results for support vector machines in both paraphrase identification and plagiarism detection on corpora. The performance results achieved by deep learning techniques highlight that these techniques are the most promising research direction in this field.
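Word overlap, one of the feature subsets named above, can be sketched as n-gram Jaccard similarity between two sentences. This is only one of several possible overlap measures and is not claimed to be the exact feature set used by any surveyed system; the function names and the whitespace tokenisation are assumptions. Values like these would typically be passed, together with other features, to a classifier such as a support vector machine.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(x, y):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


def overlap_features(s1, s2):
    """Unigram and bigram Jaccard overlap between two sentences."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    return {
        "unigram_jaccard": jaccard(ngrams(t1, 1), ngrams(t2, 1)),
        "bigram_jaccard": jaccard(ngrams(t1, 2), ngrams(t2, 2)),
    }


# Hypothetical sentence pair, for illustration only.
features = overlap_features(
    "the committee approved the proposal",
    "the proposal was approved by the committee",
)
print(features)
```

Unigram overlap is insensitive to word order, while bigram overlap drops when words are reordered, so comparing the two gives a crude signal of reordering.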

