Cross Lingual Snippet Generation Using Snippet Translation System

Author(s):  
Pintu Lohar ◽  
Pinaki Bhaskar ◽  
Santanu Pal ◽  
Sivaji Bandyopadhyay
2020 ◽  
pp. 016555152091267 ◽  
Author(s):  
Kazuhiro Seki

This article studies cross-lingual text similarity using neural machine translation models. A straightforward approach based on machine translation is to use translated text so as to make the problem monolingual. Another possible approach is to use intermediate states of machine translation models as recently proposed in the related work, which could avoid propagation of translation errors. We aim at improving both approaches independently and then combine the two types of information, that is, translations and intermediate states, in a learning-to-rank framework to compute cross-lingual text similarity. To evaluate the effectiveness and generalisability of our approach, we conduct empirical experiments on English–Japanese and English–Hindi translation corpora for a cross-lingual sentence retrieval task. It is demonstrated that our approach using translations and intermediate states outperforms other neural network–based approaches and is even comparable with a strong baseline based on a state-of-the-art machine translation system.


2016 ◽  
Author(s):  
Raj Dabre ◽  
Yevgeniy Puzikov ◽  
Fabien Cromieres ◽  
Sadao Kurohashi

2021 ◽  
Vol 7 ◽  
pp. e681
Author(s):  
Salim Sazzed

Bengali is a low-resource language that lacks tools and resources for various natural language processing (NLP) tasks, such as sentiment analysis or profanity identification. In Bengali, only the translated versions of English sentiment lexicons are available. Moreover, no dictionary exists for detecting profanity in Bengali social media text. This study introduces a Bengali sentiment lexicon, BengSentiLex, and a Bengali swear lexicon, BengSwearLex. For creating BengSentiLex, a cross-lingual methodology is proposed that utilizes a machine translation system, a review corpus, two English sentiment lexicons, pointwise mutual information (PMI), and supervised machine learning (ML) classifiers in various stages. A semi-automatic methodology is presented to develop BengSwearLex that leverages an obscene corpus, word embedding, and part-of-speech (POS) taggers. The performance of BengSentiLex is compared with that of the translated English lexicons on three evaluation datasets. BengSentiLex achieves 5%–50% improvement over the translated lexicons. For identifying profanity, BengSwearLex achieves document-level coverage of around 85% in the evaluation dataset. The experimental results imply that BengSentiLex and BengSwearLex are effective resources for classifying sentiment and identifying profanity in Bengali social media content, respectively.
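
The abstract above names pointwise mutual information (PMI) as one ingredient of lexicon building but does not give the formula used. As a rough illustration only, the sketch below scores each word by PMI(word, pos) - PMI(word, neg) over a labeled review corpus. The function name `pmi_polarity`, the add-one smoothing, the document-level counting, and the toy English data are all assumptions made for this example, not details taken from the paper.

```python
import math
from collections import Counter


def pmi_polarity(docs, smoothing=1.0):
    """Score words by PMI(word, pos) - PMI(word, neg).

    docs: list of (tokens, label) pairs, label being "pos" or "neg".
    Counts are per document; `smoothing` is a hypothetical add-k constant
    that keeps the logarithm finite for words seen in only one class.
    """
    word_label = Counter()   # (word, label) -> number of documents
    label_count = Counter()  # label -> number of documents
    vocab = set()
    for tokens, label in docs:
        label_count[label] += 1
        for word in set(tokens):
            word_label[(word, label)] += 1
            vocab.add(word)
    if not label_count["pos"] or not label_count["neg"]:
        raise ValueError("need at least one document of each class")

    scores = {}
    for word in vocab:
        pos = word_label[(word, "pos")] + smoothing
        neg = word_label[(word, "neg")] + smoothing
        # p(word) cancels in the difference of the two PMI terms, leaving
        # log2( p(word | pos) / p(word | neg) ).
        scores[word] = math.log2(
            (pos / label_count["pos"]) / (neg / label_count["neg"])
        )
    return scores


# Toy corpus (hypothetical data, English for readability).
reviews = [
    (["good", "movie"], "pos"),
    (["bad", "movie"], "neg"),
    (["good", "story"], "pos"),
    (["bad", "story"], "neg"),
]
polarity = pmi_polarity(reviews)
```

Words with a clearly positive or negative score would be candidates for the lexicon; words near zero, such as "movie" here, carry no polarity signal.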


2018 ◽  
Vol 45 (4) ◽  
pp. 443-459 ◽  
Author(s):  
Nava Ehsan ◽  
Azadeh Shakery ◽  
Frank Wm Tompa

Fast and easy access to a wide range of documents in various languages, in conjunction with the wide availability of translation and editing tools, has led to the need to develop effective tools for detecting cross-lingual plagiarism. Given a suspicious document, cross-lingual plagiarism detection comprises two main subtasks: retrieving documents that are candidate sources for that document and analysing those candidates one by one to determine their similarity to the suspicious document. In this article, we examine the second subtask, also called the detailed analysis subtask, where the goal is to align plagiarised fragments from source and suspicious documents in different languages. Our proposed approach has two main steps: the first step tries to find candidate plagiarised fragments and focuses on high recall, followed by a more precise similarity analysis based on dynamic text alignment that will filter the results by finding alignments between the identified fragments. With these two steps, the proximity of the terms will be considered at different levels of granularity. In both steps, our approach uses a dictionary to obtain translations of individual terms instead of using a machine translation system to convert longer passages from one language to another. We used a weighting scheme to distinguish among multiple translations of the terms. Experimental results show that our method outperforms the methods used by the systems that achieved the best results in the PAN-2012 and PAN-2014 competitions.


2019 ◽  
Vol 2019 ◽  
pp. 1-7
Author(s):  
ShaoLin Zhu ◽  
Xiao Li ◽  
YaTing Yang ◽  
Lei Wang ◽  
ChengGang Mi

Machine translation needs a large number of parallel sentence pairs to achieve good translation performance. However, the lack of parallel corpora heavily limits machine translation for low-resource language pairs. We propose a novel method that combines continuous word embeddings with deep learning to obtain parallel sentences. Since parallel sentences are invaluable for low-resource language pairs, we introduce a cross-lingual semantic representation to induce bilingual signals. Our experiments show that we can achieve promising results even when external resources are lacking for low-resource languages. Finally, we construct a state-of-the-art machine translation system for a low-resource language pair.
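
The abstract above does not describe its model in detail. As a much simpler stand-in for the general idea of mining parallel sentences from word embeddings, the sketch below averages word vectors per sentence and pairs each source sentence with its most similar target sentence by cosine similarity. It assumes both vocabularies are already mapped into one shared cross-lingual space; the function names, the threshold, and the toy two-dimensional vectors are hypothetical.

```python
import math


def sentence_vector(tokens, embeddings):
    """Average the vectors of known tokens; None if no token is known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_pairs(src_sents, tgt_sents, src_emb, tgt_emb, threshold=0.8):
    """Return (src_index, tgt_index, similarity) for each source sentence
    whose best-matching target sentence reaches `threshold`."""
    tgt_vecs = [sentence_vector(t, tgt_emb) for t in tgt_sents]
    pairs = []
    for i, sent in enumerate(src_sents):
        src_vec = sentence_vector(sent, src_emb)
        if src_vec is None:
            continue
        best_j, best_sim = None, threshold
        for j, tgt_vec in enumerate(tgt_vecs):
            if tgt_vec is None:
                continue
            sim = cosine(src_vec, tgt_vec)
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            pairs.append((i, best_j, best_sim))
    return pairs


# Toy shared-space embeddings (hypothetical values).
src_emb = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
tgt_emb = {"gato": [0.9, 0.1], "perro": [0.1, 0.9]}
mined = mine_pairs([["cat"], ["dog"]], [["perro"], ["gato"]], src_emb, tgt_emb)
```

In practice the threshold trades precision for recall, and a learned sentence encoder would normally replace the plain average used here.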


Information ◽  
2020 ◽  
Vol 11 (10) ◽  
pp. 492
Author(s):  
Aishan Wumaier ◽  
Cuiyun Xu ◽  
Zaokere Kadeer ◽  
Wenqi Liu ◽  
Yingbo Wang ◽  
...  

The recognition and translation of organization names (ONs) is challenging due to the complex structures and high variability involved. ONs consist not only of common generic words but also names, rare words, abbreviations and business and industry jargon. ONs are a sub-class of named entity (NE) phrases, which convey key information in text. As such, the correct translation of ONs is critical for machine translation and cross-lingual information retrieval. The existing Chinese–Uyghur neural machine translation systems have performed poorly when applied to ON translation tasks. As there are no publicly available Chinese–Uyghur ON translation corpora, an ON translation corpus is developed here, which includes 191,641 ON translation pairs. A word segmentation approach involving characterization, tagged characterization, byte pair encoding (BPE) and syllabification is proposed here for ON translation tasks. A recurrent neural network (RNN) attention framework and transformer are adapted here for ON translation tasks with different sequence granularities. The experimental results indicate that the transformer model not only outperforms the RNN attention model but also benefits from the proposed word segmentation approach. In addition, a Chinese–Uyghur ON translation system is developed here to automatically generate new translation pairs. This work significantly improves Chinese–Uyghur ON translation and can be applied to improve Chinese–Uyghur machine translation and cross-lingual information retrieval. It can also easily be extended to other agglutinative languages.
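
Byte pair encoding (BPE), one of the segmentation granularities named in the abstract above, can be illustrated with a minimal sketch of the standard merge-learning procedure. This is a generic toy version, not the paper's pipeline: the function names, the `</w>` end-of-word marker, the tie-breaking rule, and the English example words are assumptions for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (with repeats)."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_freq = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freq[pair] += freq
        if not pair_freq:
            break
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(pair_freq, key=lambda p: (pair_freq[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy training data (hypothetical).
merges = learn_bpe(["bank", "bank", "bank", "banking"], num_merges=3)
segments = apply_bpe("banks", merges)
```

Character-level segmentation corresponds to running zero merges; increasing `num_merges` moves the sequence granularity toward whole words.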


2012 ◽  
Author(s):  
Xin Liu ◽  
Xiaobin Zhou ◽  
Jianjun Zhu ◽  
Jing-Jen Wang

Author(s):  
Akitoshi Okumura ◽  
Ken-ichi Iso ◽  
Shin-ichi Doi ◽  
Kiyoshi Yamabana ◽  
Ken Hanazawa ◽  
...  
