Using word semantic concepts for plagiarism detection in text documents

Author(s):  
Chia-Yang Chang ◽  
Shie-Jue Lee ◽  
Chih-Hung Wu ◽  
Chih-Feng Liu ◽  
Ching-Kuan Liu
Author(s):  
Ali Daud ◽  
Jamal Ahmad Khan ◽  
Jamal Abdul Nasir ◽  
Rabeeh Ayaz Abbasi ◽  
Naif Radi Aljohani ◽  
...  

In this article we present a new semantic- and syntactic-based method for external plagiarism detection. In the proposed approach, latent Dirichlet allocation (LDA) and part-of-speech (POS) tags are used together to detect plagiarism between a suspicious document and a number of source documents. The basic hypothesis is that considering both semantic and syntactic information between two text documents may improve the performance of the plagiarism detection task. Our method consists of two steps: a pre-processing step, in which we detect the topics of the sentences in the documents using LDA and convert each sentence into an array of POS tags, and a post-processing step, in which the suspicious cases are verified purely on the basis of semantic rules. For two types of external plagiarism (copy and random obfuscation), we empirically compare our approach to the state-of-the-art N-gram-based and stop-word N-gram-based methods and observe significant improvements.
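The stop-word N-gram baseline the authors compare against can be sketched in a few lines: function words are kept in order (content words dropped), and shared stop-word n-grams between two documents flag possible reuse. The stop-word list below is a tiny illustrative subset, and the choice of n=3 and Jaccard scoring are assumptions for the sketch, not the paper's exact configuration.

```python
# Sketch of a stop-word n-gram comparison: keep only function words,
# form overlapping n-grams, and score the overlap between two documents.

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "it",
             "for", "with", "he", "be", "on", "i", "that", "by", "at"}

def stopword_ngrams(text, n=3):
    """Keep only stop-words, in order, and emit overlapping n-grams."""
    kept = [w for w in text.lower().split() if w in STOPWORDS]
    return {tuple(kept[i:i + n]) for i in range(len(kept) - n + 1)}

def overlap_score(doc_a, doc_b, n=3):
    """Jaccard overlap of stop-word n-gram sets; higher suggests reuse."""
    a, b = stopword_ngrams(doc_a, n), stopword_ngrams(doc_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Because content words are discarded, this signature survives synonym substitution, which is why it is a strong baseline against random obfuscation.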


Author(s):  
Brinardi Leonardo ◽  
Seng Hansun

Plagiarism is an act regarded by universities as fraud: taking someone else's ideas or writings without citing the references and claiming them as one's own. A plagiarism detection system generally implements a string matching algorithm over text documents to search for common words between documents. Among the algorithms used for string matching are the Rabin-Karp and Jaro-Winkler Distance algorithms. The Rabin-Karp algorithm is well suited to the problem of matching multiple string patterns, while the Jaro-Winkler Distance algorithm has advantages in terms of running time. A plagiarism detection application was developed and tested on different types of documents, i.e., doc, docx, pdf, and txt. The experimental results show that both algorithms can be used to perform plagiarism detection on those documents, but in terms of effectiveness, the Rabin-Karp algorithm is more effective and faster when detecting documents larger than 1000 KB.
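The two algorithms being compared can be sketched as follows. The rolling-hash parameters (base 256, a small prime modulus) are illustrative choices, not the ones used in the paper's application.

```python
# Rabin-Karp: hash a sliding window and verify candidates on hash match.

def rabin_karp(text, pattern, base=256, mod=1_000_003):
    """Return the start indices of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)          # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # on a hash match, compare the actual text to rule out collisions
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:                     # roll the window one character
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return hits

# Jaro-Winkler: edit-style similarity in [0, 1] that rewards common prefixes.

def jaro_winkler(s1, s2, p=0.1):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):            # count matches within the window
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    a = [c for i, c in enumerate(s1) if match1[i]]
    b = [c for j, c in enumerate(s2) if match2[j]]
    t = sum(x != y for x, y in zip(a, b)) / 2     # half-transpositions
    jaro = (m / len1 + m / len2 + (m - t) / m) / 3
    l = 0                                 # common prefix length, capped at 4
    while l < min(4, len1, len2) and s1[l] == s2[l]:
        l += 1
    return jaro + l * p * (1 - jaro)
```

The structural difference explains the abstract's findings: Rabin-Karp scans for exact shared substrings in expected linear time per pattern, which scales well to large documents, while Jaro-Winkler scores pairwise word similarity and is better suited to short strings.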


2020 ◽  
Vol 17 (9) ◽  
pp. 3867-3872
Author(s):  
Aniv Chakravarty ◽  
Jagadish S. Kallimani

Text summarization is an active field of research with the goal of providing short and meaningful gists of large amounts of text. Extractive text summarization methods, in which text is extracted from the documents to build summaries, have been extensively studied. Multi-document collections vary widely in format, domain, and topic. With recent advances in technology and the use of neural networks for text generation, research interest in abstractive text summarization has increased significantly. Graph-based methods that handle semantic information have shown significant results. Given a set of English text documents, we use an abstractive method and predicate-argument structures to retrieve the necessary text information and pass it through a neural network for text generation. Recurrent neural networks are a subtype of recursive neural networks that try to predict the next element of a sequence based on the current state and the information from previous states. The use of neural networks also allows the generation of summaries for long sentences. This paper implements a semantics-based filtering approach using a similarity matrix while keeping all stop-words. The similarity is calculated using semantic concepts and Jiang–Conrath similarity, and a recurrent neural network with an attention mechanism is used to generate the summary. The ROUGE score is used for measuring accuracy, precision, and recall.
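The similarity-matrix filtering step can be illustrated with a toy sketch: score every sentence pair, then keep the sentences whose average similarity to the rest is highest. Plain token-overlap (Jaccard) stands in here for the WordNet-based Jiang–Conrath measure, which needs an external lexical database; note that stop-words are deliberately not removed, as the abstract specifies.

```python
# Toy similarity-matrix filter: rank sentences by mean pairwise similarity
# and keep the top ones in their original order.

def jaccard(s1, s2):
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_sentences(sentences, keep=2):
    """Keep the `keep` sentences most similar, on average, to all others."""
    n = len(sentences)
    sim = [[jaccard(sentences[i], sentences[j]) for j in range(n)]
           for i in range(n)]                  # the similarity matrix
    score = [sum(row) - 1.0 for row in sim]    # drop self-similarity of 1.0
    ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

In the full system, the surviving sentences would then be fed to the attention-based RNN rather than returned directly.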


Text summarization is an area of research with the goal of producing short text from huge text documents. Extractive text summarization methods have been extensively studied by many researchers. Multi-document collections vary widely in format, domain, and topic. With the application of neural networks to text generation, research interest in abstractive text summarization has increased significantly. In this article, the approach is attempted for the English and Telugu languages. Recurrent neural networks are a subtype of recursive neural networks that try to predict the next element of a sequence based on the current state and the information from previous states. The use of neural networks also allows the generation of summaries for long sentences. The work implements semantics-based filtering using a similarity matrix while keeping all stop-words. The similarity is calculated using semantic concepts and Jiang similarity, and a recurrent neural network (RNN) with an attention mechanism is used to generate the summary. The ROUGE score is used for measuring the performance of the applied method on Telugu and English.


Compiler ◽  
2014 ◽  
Vol 3 (1) ◽  
Author(s):  
Rizki Tanjung ◽  
Haruno Sajati ◽  
Dwi Nugraheny

Plagiarism is the act of taking the essay or work of others and presenting it as one's own. Plagiarism of text is very common and difficult to avoid, so many systems have been created to assist in detecting plagiarism in text documents. At its core, detecting plagiarism in text documents means performing string matching. This gave rise to the idea of building an algorithm, implemented in the RTG24 file comparison application for .txt files. Documents to be compared must be .txt (plain text) files, and every word contained in a document must be in the Indonesian dictionary. The RTG24 algorithm works by determining the number of same or similar words between the two text documents. The algorithm has several stages: parsing, filtering, stemming, and comparison. Parsing is the stage where every sentence in a document is broken down into individual words; filtering is the stage that removes unimportant particles. In the next stage, stemming, the base word or root word of every word is found; this is done to simplify and facilitate the comparison between the two documents. After parsing, filtering, and stemming, the documents are loaded into arrays for the comparison between the two documents, so that the percentage of similarity between them can be determined.
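The four-stage pipeline described above can be sketched as follows. The particle list and the suffix-stripping "stemmer" are deliberately simplified illustrations; the actual application relies on an Indonesian dictionary, which this sketch does not reproduce.

```python
# Toy sketch of the RTG24 stages: parse -> filter -> stem -> compare.

import re

PARTICLES = {"yang", "dan", "di", "ke", "dari", "pun", "lah", "kah"}

def parse(text):
    """Break the text down into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())

def filter_words(words):
    """Drop unimportant particles."""
    return [w for w in words if w not in PARTICLES]

def stem(word):
    """Crude root-word guess: strip one common Indonesian suffix."""
    for suffix in ("kan", "an", "nya", "i"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def similarity_percent(doc_a, doc_b):
    """Percentage of shared stemmed words between the two documents."""
    a = {stem(w) for w in filter_words(parse(doc_a))}
    b = {stem(w) for w in filter_words(parse(doc_b))}
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)
```

Stemming before comparison is what lets the algorithm count inflected forms of the same root as "similar words" rather than misses.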


2020 ◽  
Vol 17 (9) ◽  
pp. 4106-4110
Author(s):  
Mausumi Goswami ◽  
B. S. Purkayastha

Unstructured data is utilized in many major applications; an estimated 80% of the data generated by various business applications is unstructured. Unstructured data cannot be directly processed to generate information. A few major applications that use AI are recommendation systems, sentiment analysis of customers' emotions, finding duplicate content through plagiarism detection, and organizing documents based on requirements. The origins of such data include unstructured text on the World Wide Web, sensor data, digital images, videos, sound, the results of scientific experiments, and user profiles for marketing. Information retrieval from huge text datasets is quite challenging; this is caused by the various characteristics associated with natural languages and is a major concern in text mining. Before computational techniques are applied to documents, it is important to make the documents ready for processing. Document preprocessing is one such method applied to text documents, and it plays a vital role in document grouping. In this paper, four feature selection techniques are implemented and empirical investigation results are included. The evaluation of the grouping outcomes is used to assess the effectiveness of each feature selection technique.
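The abstract does not name its four feature selection techniques, so as a hypothetical stand-in, here is a sketch of one common baseline: document-frequency (DF) thresholding, which keeps only terms that appear in enough documents to be useful for grouping and drops rare noise terms.

```python
# DF thresholding sketch: build a vocabulary of terms that occur in at
# least min_df documents, then project each document onto that vocabulary.

from collections import Counter

def df_select(docs, min_df=2):
    """Return the vocabulary of terms appearing in at least min_df docs."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))   # count each doc once per term
    return {term for term, count in df.items() if count >= min_df}

def to_features(doc, vocab):
    """Represent a document as the sorted selected terms it contains."""
    return sorted(set(doc.lower().split()) & vocab)
```

Pruning singleton terms like this shrinks the feature space sharply before clustering, which is the role preprocessing plays in the document grouping the paper evaluates.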


2021 ◽  
Vol 9 (1) ◽  
pp. 283-300
Author(s):  
Tedo Vrbanec ◽  
Ana Meštrović

The article gives an overview of the plagiarism domain, with a focus on academic plagiarism. It defines plagiarism, explains the origin of the term, and covers plagiarism-related terms. It identifies the extent of the plagiarism domain and then focuses on the subdomain of text-document plagiarism, for which it surveys current classifications and taxonomies and then proposes a more comprehensive classification according to several criteria: origin and purpose, technical implementation, consequence, complexity of detection, and the number of linguistic sources. The article suggests a new classification of academic plagiarism; describes the sorts and methods of plagiarism, its types and categories, and the approaches and phases of plagiarism detection; and classifies methods and algorithms for plagiarism detection. The title of the article explicitly targets the academic community, but it is sufficiently general and interdisciplinary to be useful to many other professionals, such as software developers, linguists, and librarians.

