Comparable Multilingual Patents as Large-Scale Parallel Corpora

Author(s):  
Bin Lu ◽  
Ka Po Chow ◽  
Benjamin K. Tsou


2003 ◽  
Vol 29 (3) ◽  
pp. 349-380 ◽  
Author(s):  
Philip Resnik ◽  
Noah A. Smith

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
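To make the content-based enhancement concrete, the sketch below shows one simple way a translational-equivalence score for a candidate document pair could be computed from a word-to-word bilingual lexicon. The lexicon, tokenization, and symmetric averaging are illustrative assumptions, not STRAND's exact measure.

```python
# A minimal sketch of a content-based translational-equivalence score:
# given a candidate document pair and a bilingual word-to-word lexicon,
# measure how much of the text is mutually translatable. Illustrative
# assumptions throughout, not the authors' exact formulation.

def content_similarity(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with at least one lexicon translation
    present in the target document (and vice versa), averaged."""
    src_set, tgt_set = set(src_tokens), set(tgt_tokens)

    def coverage(tokens, other_set, lookup):
        hits = sum(1 for t in tokens if other_set & lookup.get(t, set()))
        return hits / len(tokens) if tokens else 0.0

    # Reverse the lexicon for the target-to-source direction.
    reverse = {}
    for s, translations in lexicon.items():
        for t in translations:
            reverse.setdefault(t, set()).add(s)

    return 0.5 * (coverage(src_tokens, tgt_set, lexicon)
                  + coverage(tgt_tokens, src_set, reverse))

# Hypothetical usage with a toy English-French lexicon:
lexicon = {"house": {"maison"}, "red": {"rouge"}, "the": {"la", "le"}}
score = content_similarity(["the", "red", "house"],
                           ["la", "maison", "rouge"], lexicon)
print(f"{score:.2f}")  # 1.00 for this fully covered toy pair
```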


Author(s):  
Shizhe Chen ◽  
Qin Jin ◽  
Jianlong Fu

Neural machine translation models suffer from the lack of large-scale parallel corpora. In contrast, we humans can learn multilingual translations even without parallel texts, by grounding our languages in the external world. To mimic such human learning behavior, we employ images as pivots to enable zero-resource translation learning. However, a picture tells a thousand words, which makes multilingual sentences pivoted by the same image noisy as mutual translations and thus hinders translation model learning. In this work, we propose a progressive learning approach for image-pivoted zero-resource machine translation. Since words are less diverse when grounded in the image, we first learn word-level translation with image pivots, and then progress to sentence-level translation by using the learned word translations to suppress noise in image-pivoted multilingual sentences. Experimental results on two widely used image-pivot translation datasets, IAPR-TC12 and Multi30k, show that the proposed approach significantly outperforms other state-of-the-art methods.
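As a rough illustration of the word-level pivoting step, the sketch below grounds each word in the mean embedding of the images it co-occurs with and ranks cross-lingual candidates by cosine similarity. The embedding source and the toy English-German words are assumptions for illustration, not the paper's actual model.

```python
# A minimal sketch of word-level image pivoting: words in two languages
# represented by the mean visual embedding of their co-occurring images,
# so near neighbours across languages become candidate translations.

import numpy as np

def word_image_embedding(image_vecs):
    """Ground a word as the mean of the image vectors it appears with."""
    return np.mean(image_vecs, axis=0)

def pivot_translations(src_emb, tgt_emb, top_k=1):
    """For each source word, rank target words by cosine similarity
    of their image-grounded embeddings."""
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = normalize(src_emb) @ normalize(tgt_emb).T
    return np.argsort(-sims, axis=1)[:, :top_k]

# Hypothetical usage with toy 2-dimensional image embeddings:
src = np.array([word_image_embedding([[1.0, 0.0], [0.9, 0.1]])])  # "dog"
tgt = np.array([word_image_embedding([[0.95, 0.05]]),             # "Hund"
                word_image_embedding([[0.0, 1.0]])])              # "Katze"
print(pivot_translations(src, tgt))  # [[0]]: "dog" aligns with "Hund"
```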


Author(s):  
Ellie Pavlick ◽  
Matt Post ◽  
Ann Irvine ◽  
Dmitry Kachaev ◽  
Chris Callison-Burch

We present a large-scale study of the languages spoken by bilingual workers on Mechanical Turk (MTurk). We establish a methodology for determining the language skills of anonymous crowd workers that is more robust than simple surveying. We validate workers’ self-reported language skill claims by measuring their ability to correctly translate words, and by geolocating workers to see if they reside in countries where the languages are likely to be spoken. Rather than posting a one-off survey, we posted paid tasks consisting of 1,000 assignments to translate a total of 10,000 words in each of 100 languages. Our study ran for several months and was highly visible on the MTurk crowdsourcing platform, increasing the chances that bilingual workers would complete it. Our study was useful both to create bilingual dictionaries and to act as a census of the bilingual speakers on MTurk. We use this data to recommend languages with the largest speaker populations as good candidates for other researchers who want to develop crowdsourced, multilingual technologies. To further demonstrate the value of creating data via crowdsourcing, we hire workers to create bilingual parallel corpora in six Indian languages, and use them to train statistical machine translation systems.
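A minimal sketch of the two validation signals described above: checking a worker's word translations against a gold bilingual dictionary, and checking whether the worker's geolocated country plausibly matches the claimed language. The threshold, the toy country map, and the function names are illustrative assumptions, not the study's exact criteria.

```python
# Two validation signals for self-reported bilingual skill:
# (1) accuracy against a gold dictionary, (2) geolocation plausibility.
# All names and thresholds below are illustrative assumptions.

def translation_accuracy(answers, gold):
    """answers: {source_word: worker_translation};
    gold: {source_word: set of acceptable translations}.
    Returns the fraction of answers judged correct."""
    correct = sum(1 for w, t in answers.items()
                  if t.strip().lower() in gold.get(w, set()))
    return correct / len(answers) if answers else 0.0

# Toy map from language to countries where it is plausibly spoken.
LANGUAGE_COUNTRIES = {"tamil": {"IN", "LK", "SG"}, "tagalog": {"PH"}}

def plausible_speaker(language, country_code, accuracy, min_accuracy=0.7):
    """Accept the self-reported skill only if both signals agree."""
    in_region = country_code in LANGUAGE_COUNTRIES.get(language, set())
    return accuracy >= min_accuracy and in_region

acc = translation_accuracy({"water": "thanneer"}, {"water": {"thanneer"}})
print(plausible_speaker("tamil", "IN", acc))  # True for this toy case
```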


2010 ◽  
Vol 55 (2) ◽  
pp. 387-408 ◽  
Author(s):  
Chunshen Zhu ◽  
Po-Ching Yip

This article reports on a pilot project designed to construct a platform for large-scale teaching of translation or bilingual training at the tertiary level. The programme, ClinkNotes, has the potential to accommodate parallel corpora of any language pair, although the primary data used in this project are in English and Chinese. The report begins with a brief overview of the development of the corpus-based approach to translation studies in relation to that of translation teaching as a profession. It then describes the actual design (i.e., the theoretical framework, the methodology of annotation, and the simple execution of the software programme), and how it helps to cater to the pressing needs of the profession. The prospects for further development of the programme are also discussed.
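Since the report attaches pedagogical annotations to specific translation pairs, one hypothetical record for a single annotated unit might look like the sketch below; the field names are illustrative, not ClinkNotes' actual schema.

```python
# A hypothetical data model for one annotated parallel unit in a
# translation-teaching corpus. Field names are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class AnnotatedPair:
    source: str                                  # source-language segment
    target: str                                  # target-language segment
    notes: list = field(default_factory=list)    # pedagogical annotations

pair = AnnotatedPair(
    source="He kicked the bucket.",
    target="他去世了。",
    notes=["Idiom rendered by a plain equivalent; the bucket image is dropped."],
)
```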


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Rui Wang

Relying on large-scale parallel corpora, neural machine translation has achieved great success for certain language pairs. However, acquiring a high-quality parallel corpus remains one of the main difficulties in machine translation research. To address this problem, this paper proposes an unsupervised domain-adaptive neural machine translation method that can be trained on only two unrelated monolingual corpora and still obtain good translation results. The paper first measures how well each translation rule matches the input by attaching topic information to the rules and dynamically computing the similarity between each rule and the document being translated during decoding. Second, through the joint training of multiple tasks, the source language can learn useful semantic and structural information, in the process of translation into the target language, from a monolingual corpus in a third language that is parallel to neither of the two languages involved. Experimental results show that this approach obtains better results than traditional statistical machine translation.
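As a sketch of the first idea, the snippet below scores each translation rule by the cosine similarity between its topic vector and the topic distribution of the document being decoded. How those topic vectors are estimated (e.g., with a topic model over the training corpus) and the toy rules shown are assumptions, not the paper's exact procedure.

```python
# Score translation rules by topic similarity to the current document,
# as one could do dynamically during decoding. Toy inputs throughout.

import numpy as np

def rule_topic_score(rule_topics, doc_topics):
    """Cosine similarity between a rule's topic vector and the document's."""
    num = float(np.dot(rule_topics, doc_topics))
    denom = np.linalg.norm(rule_topics) * np.linalg.norm(doc_topics)
    return num / denom if denom else 0.0

doc = np.array([0.7, 0.2, 0.1])                   # document topic distribution
rules = {"bank -> 银行": np.array([0.1, 0.8, 0.1]),
         "bank -> 河岸": np.array([0.8, 0.1, 0.1])}
best = max(rules, key=lambda r: rule_topic_score(rules[r], doc))
print(best)  # "bank -> 河岸": its topic profile matches the document's
```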


Author(s):  
Mingtong Liu ◽  
Erguang Yang ◽  
Deyi Xiong ◽  
Yujie Zhang ◽  
Chen Sheng ◽  
...  

Paraphrase generation is of great importance to many downstream tasks in natural language processing. Recent efforts have focused on generating paraphrases in specific syntactic forms, which generally relies heavily on manually annotated paraphrase data that are not easily available for many languages and domains. In this paper, we propose a novel end-to-end framework that leverages existing large-scale bilingual parallel corpora to generate paraphrases under the control of syntactic exemplars. To train one model over the two languages of a parallel corpus, we embed sentences from both languages into the same content and style spaces with shared content and style encoders, using cross-lingual word embeddings. We propose an adversarial discriminator to disentangle the content and style spaces, and employ a latent variable to model the syntactic style of a given exemplar in order to guide the two decoders during generation. Additionally, we introduce cycle and masking learning schemes to train the model efficiently. Experiments and analyses demonstrate that the proposed model, trained only on bilingual parallel data, is capable of generating diverse paraphrases with the desired syntactic styles. Fine-tuning the trained model on a small paraphrase corpus makes it substantially outperform state-of-the-art paraphrase generation models trained on a larger paraphrase dataset.
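The following PyTorch skeleton sketches the architecture described above: shared content and style encoders over cross-lingually embedded sentences, with the exemplar's style latent conditioning the decoder at every step. Dimensions and layer choices are assumptions, and the adversarial discriminator and cycle/masking losses are omitted; this is a structural sketch, not the authors' implementation.

```python
# Structural sketch: shared content/style encoders and a style-conditioned
# decoder for exemplar-controlled paraphrase generation.

import torch
import torch.nn as nn

class ParaphraseModel(nn.Module):
    def __init__(self, emb_dim=300, hid=512, vocab=32000):
        super().__init__()
        # Shared across both languages via cross-lingual word embeddings.
        self.content_enc = nn.GRU(emb_dim, hid, batch_first=True)
        self.style_enc = nn.GRU(emb_dim, hid, batch_first=True)
        self.decoder = nn.GRU(emb_dim + hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, sent_emb, exemplar_emb, dec_inputs):
        _, content = self.content_enc(sent_emb)   # content of the sentence
        _, style = self.style_enc(exemplar_emb)   # exemplar's syntactic style
        # Condition every decoder step on the exemplar's style vector.
        style_rep = style.transpose(0, 1).expand(-1, dec_inputs.size(1), -1)
        dec_in = torch.cat([dec_inputs, style_rep], dim=-1)
        hidden, _ = self.decoder(dec_in, content)
        return self.out(hidden)                   # per-step vocab logits
```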


Author(s):  
Linqing Chen ◽  
Junhui Li ◽  
Zhengxian Gong ◽  
Xiangyu Duan ◽  
Boxing Chen ◽  
...  

Document context-aware machine translation remains challenging due to the lack of large-scale document-level parallel corpora. To make full use of source-side monolingual documents for context-aware NMT, we propose a Pre-training approach with Global Context (PGC). In particular, we first propose a novel self-supervised pre-training task, which contains two training objectives: (1) reconstructing the original sentence from a corrupted version; (2) generating a gap sentence from its left and right neighbouring sentences. Then we design a universal model for PGC, which consists of a global context encoder, a sentence encoder, and a decoder, with an architecture similar to typical context-aware NMT models. We evaluate the effectiveness and generality of our pre-trained PGC model by adapting it to various downstream context-aware NMT models. Detailed experimentation on four different translation tasks demonstrates that our PGC approach significantly improves the translation performance of context-aware NMT. For example, based on the state-of-the-art SAN model, we achieve average improvements of 1.85 BLEU and 1.59 Meteor across the four translation tasks.
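To illustrate the two pre-training objectives, the sketch below builds (input, target) examples from a monolingual document given as a list of sentences. The random-masking corruption scheme and the special tokens are assumptions for illustration; the paper's exact corruption may differ.

```python
# Building self-supervised examples for the two PGC objectives from a
# monolingual document. Masking scheme and tokens are illustrative.

import random

MASK = "<mask>"

def corrupt(sentence, p=0.3, rng=random):
    """Objective 1: corrupt a sentence; the model reconstructs the original."""
    tokens = sentence.split()
    return " ".join(MASK if rng.random() < p else t for t in tokens)

def make_examples(doc_sentences):
    """Yield (input, target) pairs for both pre-training objectives."""
    for i, sent in enumerate(doc_sentences):
        yield corrupt(sent), sent                  # objective 1: reconstruction
        if 0 < i < len(doc_sentences) - 1:         # objective 2: gap sentence
            context = doc_sentences[i - 1] + " <sep> " + doc_sentences[i + 1]
            yield context, sent

doc = ["She opened the door.", "The room was dark.", "She lit a candle."]
for inp, tgt in make_examples(doc):
    print(inp, "->", tgt)
```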


1999 ◽  
Vol 173 ◽  
pp. 243-248
Author(s):  
D. Kubáček ◽  
A. Galád ◽  
A. Pravda

The unusual short-period comet 29P/Schwassmann-Wachmann 1 has inspired many observers to explain its unpredictable outbursts. In this paper, large-scale structures and features in the inner part of the coma during time periods around outbursts are studied. CCD images were taken at Whipple Observatory, Mt. Hopkins, in 1989 and at the Astronomical Observatory, Modra, from 1995 to 1998. Photographic plates of the comet were taken at Harvard College Observatory, Oak Ridge, from 1974 to 1982. The plates were digitized first so that the same image-processing techniques could be applied to optimize the visibility of features in the coma during outbursts. Outbursts and coma structures show various shapes.


1994 ◽  
Vol 144 ◽  
pp. 29-33
Author(s):  
P. Ambrož

The large-scale coronal structures observed during sporadically visible solar eclipses were compared with the numerically extrapolated field-line structures of the coronal magnetic field. A characteristic relationship between the observed structures of coronal plasma and the magnetic field-line configurations was determined. The long-term evolution of large-scale coronal structures, inferred from photospheric magnetic observations over the course of the 11- and 22-year solar cycles, is described. Some known parameters, such as the source-surface radius and the coronal rotation rate, are discussed and interpreted. A relation between the evolution of the large-scale photospheric magnetic field and the rearrangement of coronal structure is demonstrated.

