A Survey of Cross-Lingual Plagiarism Detection using Natural Language Processing

Author(s):  
Dr. K. S. Aravind
2020 ◽  
pp. 016555152096278
Author(s):  
Rouzbeh Ghasemi ◽  
Seyed Arad Ashrafi Asli ◽  
Saeedeh Momtazi

With the advent of deep neural models in natural language processing tasks, having a large amount of training data plays an essential role in achieving accurate models. Creating valid training data, however, is a challenging issue in many low-resource languages. This problem results in a significant gap between the accuracy of the natural language processing tools available for low-resource languages and those available for rich-resource languages. To address this problem for the sentiment analysis task in the Persian language, we propose a cross-lingual deep learning framework that benefits from the training data available in English. We deploy cross-lingual embeddings to model sentiment analysis as a transfer learning task that transfers a model from a rich-resource language to low-resource ones. Our model is flexible enough to use any cross-lingual word embedding model and any deep architecture for text classification. Our experiments on the English Amazon dataset and the Persian Digikala dataset, using two different embedding models and four different classification networks, show the superiority of the proposed model over state-of-the-art monolingual techniques. Based on our experiments, the performance of Persian sentiment analysis improves by 22% with static embeddings and by 9% with dynamic embeddings. Our proposed model is general and language-independent; that is, it can be used for any low-resource language once a cross-lingual embedding is available for the source–target language pair. Moreover, by benefiting from word-aligned cross-lingual embeddings, the only data required for a reliable cross-lingual embedding is a bilingual dictionary, which is available between almost all languages and English, a natural choice of source language.
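The core of this transfer recipe (fit a classifier on source-language vectors, then reuse it unchanged on the target language) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the fastText-style vector format, the file names, and the toy review data are placeholders for illustration, not the paper's actual embedding models or networks.

```python
# Minimal sketch of cross-lingual sentiment transfer. Assumes both
# vector files are already aligned into ONE shared space (e.g. via a
# bilingual dictionary, as the abstract suggests). Paths are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

def load_vectors(path):
    """Read word vectors in the plain-text fastText format."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "<count> <dim>" header line
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def embed(text, vectors, dim=300):
    """Average the vectors of in-vocabulary tokens (zeros if none hit)."""
    hits = [vectors[t] for t in text.lower().split() if t in vectors]
    return np.mean(hits, axis=0) if hits else np.zeros(dim, dtype=np.float32)

en_vectors = load_vectors("wiki.en.aligned.vec")  # hypothetical path
fa_vectors = load_vectors("wiki.fa.aligned.vec")  # hypothetical path

# Placeholder training data standing in for the English Amazon reviews.
english_reviews = ["great product, works perfectly", "awful, broke in a day"]
english_labels = [1, 0]

# Train on English vectors ...
X_train = np.stack([embed(t, en_vectors) for t in english_reviews])
clf = LogisticRegression(max_iter=1000).fit(X_train, english_labels)

# ... and apply the SAME classifier to Persian (e.g. Digikala) reviews:
# the shared embedding space is what carries the model across languages.
persian_reviews = ["کیفیت عالی بود"]  # placeholder Persian review
X_test = np.stack([embed(t, fa_vectors) for t in persian_reviews])
print(clf.predict(X_test))
```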


2016 ◽  
Vol 55 ◽  
pp. 1-15
Author(s):  
Marta R. Costa-jussà ◽  
Srinivas Bangalore ◽  
Patrik Lambert ◽  
Lluís Màrquez ◽  
Elena Montiel-Ponsoda

With the increasingly global nature of our everyday interactions, the need for multilingual technologies to support efficient and effective information access and communication cannot be overemphasized. Computational modeling of language has been the focus of Natural Language Processing, a subdiscipline of Artificial Intelligence. One of the current challenges for this discipline is to design methodologies and algorithms that are cross-language in order to create multilingual technologies rapidly. The goal of this JAIR special issue on Cross-Language Algorithms and Applications (CLAA) is to present leading research in this area, with emphasis on developing unifying themes that could lead to the development of the science of multi- and cross-lingualism. In this introduction, we provide the reader with the motivation for this special issue and summarize the contributions of the papers that have been included. The selected papers cover a broad range of cross-lingual technologies including machine translation, domain and language adaptation for sentiment analysis, cross-language lexical resources, dependency parsing, information retrieval and knowledge representation. We anticipate that this special issue will serve as an invaluable resource for researchers interested in topics of cross-lingual natural language processing.


2018 ◽  
Vol 18 (1) ◽  
pp. 18-24
Author(s):  
Sri Reski Anita Muhsini

The implementation of semantic similarity measurement plays a very important role in several areas of Natural Language Processing (NLP), where its results often serve as the basis for further NLP tasks. One application is measuring multilingual semantic similarity between words. This measurement is motivated by the fact that many information retrieval systems today must deal with multilingual texts or documents. A pair of words is considered semantically similar if the two words share the same meaning or concept. In this study, semantic similarity between words was computed across two different languages, English and Spanish. The corpus used is the Europarl Parallel Corpus for English and Spanish. The word contexts are drawn from the Swadesh list, and the resulting semantic similarity scores are compared against the SemEval 2017 Cross-lingual Semantic Similarity gold-standard dataset to measure their correlation. The test results show that the PMI method achieves a correlation of 0.5781 for Pearson correlation and 0.5762 for Spearman correlation. From these results it can be concluded that measuring cross-lingual semantic similarity using Pointwise Mutual Information (PMI) yields the best correlation. For future work, the researchers recommend using other datasets to establish how effective the Pointwise Mutual Information (PMI) method is at measuring cross-lingual semantic similarity between words.
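The PMI computation at the heart of this approach can be illustrated over aligned sentence pairs. This is a hedged sketch: the toy corpus and the sentence-level co-occurrence window are illustrative assumptions, not the paper's exact Europarl setup.

```python
# Hedged sketch: PMI between an English word and a Spanish word,
# estimated from aligned sentence pairs (as in a parallel corpus such
# as Europarl). Tokenization by whitespace is a simplifying assumption.
import math

def pmi(en_word, es_word, aligned_pairs):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ) over sentence pairs."""
    n = len(aligned_pairs)
    c_en = c_es = c_both = 0
    for en_sent, es_sent in aligned_pairs:
        in_en = en_word in en_sent.lower().split()
        in_es = es_word in es_sent.lower().split()
        c_en += in_en
        c_es += in_es
        c_both += in_en and in_es
    if not (c_en and c_es and c_both):
        return float("-inf")  # words never (co-)occur: PMI undefined
    return math.log2((c_both / n) / ((c_en / n) * (c_es / n)))

# Toy stand-in for aligned English-Spanish sentence pairs.
aligned_pairs = [
    ("the dog barks", "el perro ladra"),
    ("the cat sleeps", "el gato duerme"),
    ("the dog sleeps", "el perro duerme"),
]
print(pmi("dog", "perro", aligned_pairs))  # high PMI: they always co-occur
```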


Author(s):  
Nikhil Paymode ◽  
Rahul Yadav ◽  
Sudarshan Vichare ◽  
Suvarna Bhoir

Plagiarism is a major problem for companies, schools, colleges, and anyone who publishes documents on the web. In schools and colleges, many students write their assignments and experiments by copying other documents. Using this system, teachers and examiners can check whether a document or answer sheet was written by the respective student or copied from someone else. To check for plagiarism, the system takes two or more documents as input and, using string matching algorithms, NLP (natural language processing) techniques, and the NLTK (Natural Language Toolkit), produces an output. The output is a score in the interval 0 to 1, where 1 means the documents are exactly similar and 0 means nothing is similar (the document is unique). A score strictly between 0 and 1 indicates that only part of the document is similar. The main objective of the system is to identify plagiarized content accurately and efficiently, including content with similar meanings and concepts. Because it is very easy to copy data from different sources, including the internet, papers, books, and newspapers, plagiarism detection is needed to improve student learning. To solve this problem, a student program plagiarism detection approach based on Natural Language Processing is proposed.
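The abstract does not name its specific string matching algorithm, so the sketch below stands in with difflib's SequenceMatcher ratio over NLTK word tokens, one common choice that likewise yields a score between 0 and 1.

```python
# Hedged sketch of a 0-to-1 plagiarism score of the kind the abstract
# describes. SequenceMatcher is a stand-in for the unnamed string
# matching algorithm; NLTK supplies the tokenization.
from difflib import SequenceMatcher

import nltk
nltk.download("punkt", quiet=True)      # tokenizer models, first run only
nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK versions
from nltk.tokenize import word_tokenize

def plagiarism_score(doc_a: str, doc_b: str) -> float:
    """Return a similarity score in [0, 1]; 1 = identical, 0 = disjoint."""
    tokens_a = [t.lower() for t in word_tokenize(doc_a)]
    tokens_b = [t.lower() for t in word_tokenize(doc_b)]
    return SequenceMatcher(None, tokens_a, tokens_b).ratio()

original = "The quick brown fox jumps over the lazy dog."
suspect = "A quick brown fox jumped over the lazy dog."
print(f"score = {plagiarism_score(original, suspect):.2f}")  # partial match
```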


Author(s):  
Dr. Kamlesh Sharma ◽  
◽  
Nidhi Garg ◽  
Arun Pandey ◽  
Daksh Yadav ◽  
...  

Plagiarism is the act of using another person’s words, ideas, or information without giving credit to that person, presenting them as your own. With the development of technology in recent years, acts of plagiarism have increased significantly. Fortunately, plagiarism detection techniques are available and are improving day by day at detecting attempts to plagiarize content in education. Software such as Turnitin, iThenticate, and SafeAssign is available on the market and does a good job in this context. But the problem is not fully solved yet: these tools still do not detect when another writer’s statements are rephrased in other words. This paper primarily focuses on detecting plagiarism in a suspicious document based on the meaning and linguistic variation of its content. The techniques used in this context are based on natural language processing. In this paper, we present how semantic analysis and syntax-driven parsing can be used to detect plagiarism.
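As one concrete illustration of meaning-based matching (the paper's exact semantic analysis and parser are not specified here), the sketch below uses WordNet synonym sets to credit a rephrased sentence that literal string matching would score poorly.

```python
# Hedged sketch: flag rephrasing via WordNet synonyms, which exact
# string matching misses. An illustration of the idea only, not the
# paper's actual semantic/syntactic pipeline.
import nltk
nltk.download("wordnet", quiet=True)  # WordNet data, first run only
from nltk.corpus import wordnet as wn

def synonyms(word: str) -> set:
    """All lemma names appearing in any synset of `word`."""
    return {lemma.name().lower()
            for syn in wn.synsets(word)
            for lemma in syn.lemmas()}

def semantic_overlap(sent_a: str, sent_b: str) -> float:
    """Fraction of words in sent_a matched literally or by synonym."""
    words_a = sent_a.lower().split()
    words_b = set(sent_b.lower().split())
    hits = sum(1 for w in words_a
               if w in words_b or synonyms(w) & words_b)
    return hits / len(words_a)

original = "the author committed a grave error"
reworded = "the writer made a serious mistake"
# Literal overlap is low, but synonym pairs (author/writer,
# grave/serious, error/mistake) push the semantic score up.
print(semantic_overlap(original, reworded))
```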


Author(s):  
Toluwase Victor Asubiaro ◽  
Ebelechukwu Gloria Igwe

African languages, including those native to Nigeria, are low-resource languages because they lack basic computing resources such as language-dependent hardware keyboards. Speakers of these low-resource languages are therefore unfairly deprived of information access on the internet. There is no information about the level of progress that has been made on the computation of Nigerian languages. Hence, this chapter presents a state-of-the-art review of natural language processing for Nigerian languages. The review reveals that only four Nigerian languages, Hausa, Ibibio, Igbo, and Yoruba, have been significantly studied in published NLP papers. Creating alternatives to the hardware keyboard is one of the most popular research areas, and means such as automatic diacritics restoration, virtual keyboards, and optical character recognition have been explored. There was also an inclination towards speech and computational morphological analysis. Resource development and knowledge representation modeling of the languages using rapid resource development and cross-lingual methods are recommended.


2021 ◽  
Vol 11 (5) ◽  
pp. 1974 ◽  
Author(s):  
Chanhee Lee ◽  
Kisu Yang ◽  
Taesun Whang ◽  
Chanjun Park ◽  
Andrew Matteson ◽  
...  

Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Even though language modeling is unsupervised, and collecting data for it is thus relatively inexpensive, it is still a challenging process for languages with limited resources. This results in great technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data-efficient training of pretrained language models. This is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves the data efficiency by a factor of up to 32 compared to monolingual training.
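A hedged sketch of the selective-reuse idea using Hugging Face Transformers: keep the English-pretrained transformer body, attach language-specific embedding parameters sized for a Korean vocabulary, and train only those at first. The tokenizer path is hypothetical, and the paper's implicit translation layers are omitted here.

```python
# Hedged sketch of cross-lingual post-training: reuse the body of an
# English-pretrained RoBERTa, learn language-specific (embedding)
# parameters for the low-resource language, and freeze the rest first.
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Assume a Korean tokenizer trained separately (path is hypothetical).
ko_tokenizer = RobertaTokenizerFast.from_pretrained("./ko-tokenizer")

# Resize the embedding matrix to the Korean vocabulary; newly added
# rows are randomly initialized, while the transformer body keeps its
# English-pretrained weights.
model.resize_token_embeddings(len(ko_tokenizer))

# Phase 1: train ONLY the new language-specific parameters.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True  # MLM output head adapts alongside

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-4)
# ...run masked-language-model post-training on Korean text, then
# unfreeze the body (phase 2) with a smaller learning rate.
```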

