Cross-lingual Text Reuse Detection Using Translation Plus Monolingual Analysis for English-Urdu Language Pair

Author(s): Iqra Muneer, Rao Muhammad Adeel Nawab

Cross-Lingual Text Reuse Detection (CLTRD) has recently attracted the attention of the research community because a large amount of digital text is readily available for reuse in multiple languages through online digital repositories. In addition, efficient machine translation systems are freely available to translate text from one language into another, which makes it easy to reuse text across languages and, consequently, difficult to detect. In the literature, the most prominent and widely used approach for CLTRD is Translation plus Monolingual Analysis (T+MA). For the English-Urdu language pair, T+MA has so far been used only with lexical approaches, namely N-gram Overlap, Longest Common Subsequence, and Greedy String Tiling, which shows that T+MA has not been thoroughly explored for this pair. To fill this gap, this study presents an in-depth and detailed comparison of 26 approaches based on T+MA for the English-Urdu language pair. These include semantic similarity approaches (semantic tagger based and WordNet-based approaches), a probabilistic approach (Kullback-Leibler distance), monolingual word embedding-based approaches (Siamese recurrent architecture), and monolingual sentence transformer-based approaches. The evaluation was carried out on the CLEU benchmark corpus for both the binary and the ternary classification tasks. Our extensive experimentation shows that our proposed approach, a combination of the 26 approaches, obtained F1 scores of 0.77 and 0.61 for the binary and ternary classification tasks, respectively, outperforming the previously reported approaches [41] (F1 = 0.73 for the binary and F1 = 0.55 for the ternary classification task) on the CLEU corpus.
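
To make the T+MA pipeline concrete, the following is a minimal sketch of its two stages for one of the lexical baselines (word n-gram containment). The function names, whitespace tokenization, and the stubbed translation step are illustrative assumptions, not the paper's implementation.

```python
# A sketch of T+MA: translate first, then apply a monolingual measure.
# translate_ur_to_en is a stub; any MT system could fill that role.

def word_ngrams(text, n):
    """Return the set of word n-grams of a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def containment(source, suspicious, n=1):
    """Fraction of the suspicious document's n-grams found in the source."""
    src, sus = word_ngrams(source, n), word_ngrams(suspicious, n)
    return len(src & sus) / len(sus) if sus else 0.0

def translate_ur_to_en(urdu_text):
    """Placeholder for the machine translation stage of T+MA."""
    raise NotImplementedError("plug in an MT system here")

# After translation, the problem becomes monolingual:
# score = containment(english_source, translate_ur_to_en(urdu_suspicious))
```

Once the translation stage reduces the problem to a monolingual one, any of the 26 similarity measures compared in the study can be applied to the translated pair in the same way.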

2019, Vol 2019, pp. 1-11
Author(s): Israr Haneef, Rao Muhammad Adeel Nawab, Ehsan Ullah Munir, Imran Sarwar Bajwa

Cross-lingual plagiarism occurs when the source (or original) text is in one language and the plagiarized text is in another. In recent years, cross-lingual plagiarism detection has attracted the attention of the research community because a large amount of digital text is easily accessible in many languages through online digital repositories, and machine translation systems are readily available, making it easier to commit cross-lingual plagiarism and harder to detect it. Developing and evaluating cross-lingual plagiarism detection systems requires standard evaluation resources. The majority of earlier studies have developed cross-lingual plagiarism corpora for English and other European language pairs. For the Urdu-English language pair, however, the problem has not been thoroughly explored, although a large amount of digital text is readily available in Urdu and the language is spoken in many countries (particularly Pakistan, India, and Bangladesh). To fill this gap, this paper presents a large benchmark cross-lingual corpus for the Urdu-English language pair. The proposed corpus contains 2,395 source-suspicious document pairs (540 automatically translated, 539 artificially paraphrased, 508 manually paraphrased, and 808 nonplagiarized). Furthermore, it contains three types of cross-lingual examples, namely artificial (automatic translation and artificially paraphrased), simulated (manually paraphrased), and real (nonplagiarized), which have not previously been reported together in the development of cross-lingual corpora. A detailed analysis of the proposed corpus was carried out using n-gram overlap and longest common subsequence approaches. Using word unigrams, mean similarity scores of 1.00, 0.68, 0.52, and 0.22 were obtained for automatically translated, artificially paraphrased, manually paraphrased, and nonplagiarized documents, respectively. These results show that the documents in the proposed corpus were created using different obfuscation techniques, which makes the dataset more realistic and challenging. We believe the corpus developed in this study will help foster research on Urdu, an under-resourced language, and will be useful in the development, comparison, and evaluation of cross-lingual plagiarism detection systems for the Urdu-English language pair. The corpus is free and publicly available for research purposes.
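
The word-unigram scores above come from similarity measures of this kind. Below is a minimal sketch of one of them, longest common subsequence similarity over word tokens, assuming whitespace tokenization and normalization by the suspicious document's length; the paper's exact preprocessing and normalization may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    computed with standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_similarity(source, suspicious):
    """LCS length normalized by the suspicious document's token count."""
    a, b = source.lower().split(), suspicious.lower().split()
    return lcs_length(a, b) / len(b) if b else 0.0
```

Under a measure like this, a score near 1.00 (as for the automatically translated pairs) indicates near-verbatim reuse, while the progressively lower scores reflect increasing levels of paraphrasing.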


2018, Vol 27 (4), pp. 555-563
Author(s): M. Priya, R. Kalpana

Abstract: Powerful search mechanisms are required to meet the needs of search engine users probing voluminous web databases. Matching query keywords with a probabilistic approach is attractive in many application areas, such as spell checking and data cleaning, because it allows approximate search. A probabilistic approach with maximum likelihood estimation is commonly used to handle such real-world problems; however, it suffers from overfitting. In this paper, a rule-based approach to keyword searching is presented. The process consists of two phases, a rule generation phase and a learning phase. The rule generation phase uses a new technique called N-Gram based Edit distance (NGE) to build the rule dictionary, and a Turing machine model is implemented to describe rule generation with the NGE technique. In the learning phase, a log model with maximum a posteriori estimation is used to select the best rule. When evaluated in real time, our system produces strong results in terms of efficiency and accuracy.
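
The abstract does not spell out the NGE technique or the log model, so the sketch below only illustrates the general shape of such a pipeline: candidates within a bounded edit distance of the query are generated, then a maximum a posteriori criterion (a log prior plus a distance-based log-likelihood penalty) selects the best match. All names, the penalty term, and the probabilities are illustrative assumptions, not the paper's method.

```python
import math

def edit_distance(a, b):
    """Classic Levenshtein distance, space-optimized dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def best_correction(query, dictionary, prior, max_dist=2):
    """MAP-style selection: log prior of the candidate word plus a crude
    log-likelihood that penalizes candidates by edit distance to the query."""
    best, best_score = None, float("-inf")
    for word in dictionary:
        d = edit_distance(query, word)
        if d > max_dist:
            continue
        score = math.log(prior.get(word, 1e-9)) - d  # -d as an assumed penalty
        if score > best_score:
            best, best_score = word, score
    return best

# Hypothetical usage with a toy prior over dictionary words:
# best_correction("serach", {"search", "sear", "reach"},
#                 {"search": 0.6, "sear": 0.1, "reach": 0.3})
```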

