Sentence Level Alignment of Digitized Books Parallel Corpora

Informatica ◽  
2018 ◽  
Vol 29 (4) ◽  
pp. 693-710
Author(s):  
Algirdas Laukaitis ◽  
Darius Plikynas ◽  
Egidijus Ostasius


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Michael Adjeisah ◽  
Guohua Liu ◽  
Douglas Omwenga Nyabuga ◽  
Richard Nuetey Nortey ◽  
Jinling Song

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains challenging. This research contributes to the field with a low-resource English-Twi translation system based on filtered synthetic parallel corpora. It is often difficult to establish what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we apply an unsupervised measure to each sentence pair using squared Mahalanobis distance, a filtering technique that predicts sentence parallelism. Additionally, we make extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel data demonstrate that injecting a pseudo-parallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach yields substantial gains in BLEU and TER scores.
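The abstract does not specify which features the Mahalanobis filter operates on; as a minimal sketch, squared Mahalanobis distance filtering of sentence pairs might look like the following, where the per-pair feature vectors are invented toy numbers (in practice they could encode signals such as length ratio or lexical overlap):

```python
import numpy as np

def squared_mahalanobis(features):
    """Squared Mahalanobis distance of each row from the sample mean."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse guards against singular covariance
    diffs = features - mu
    return np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)

def filter_pairs(pairs, features, threshold):
    """Keep sentence pairs whose feature vector lies within the distance threshold."""
    d2 = squared_mahalanobis(np.asarray(features, dtype=float))
    return [pair for pair, d in zip(pairs, d2) if d <= threshold]

# Toy feature vectors, one per sentence pair; the last pair is an obvious outlier.
pairs = ["pair-a", "pair-b", "pair-c", "pair-d", "pair-e"]
features = [[1.0, 0.9], [1.1, 0.8], [0.9, 0.9], [1.0, 0.8], [50.0, -5.0]]
kept = filter_pairs(pairs, features, threshold=2.5)
```

Pairs far from the bulk of the distribution under the learned covariance are predicted to be non-parallel and dropped; the threshold would be tuned on held-out data.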


Author(s):  
Liangchen Wei ◽  
Zhi-Hong Deng

Cross-language learning allows one to use training data from one language to build models for another language. Many traditional approaches require word-level alignments from parallel corpora; in this paper, we define a general bilingual training objective function that requires only a sentence-level parallel corpus. We propose a variational autoencoding approach for training bilingual word embeddings. The variational model introduces a continuous latent variable to explicitly model the underlying semantics of the parallel sentence pairs and to guide the generation of the sentence pairs. Our model restricts the bilingual word embeddings to represent words in exactly the same continuous vector space. Empirical results on the task of cross-lingual document classification show that our method is effective.
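The authors' model is a variational autoencoder, which is beyond a short sketch; the following toy illustration (all vocabulary, data, and dimensions invented) shows only the weaker core idea of sentence-level-only supervision: averaged word embeddings of parallel sentences are pulled together in one shared space, with no word-level alignment ever supplied:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy vocabularies for the two languages.
vocab_en = {"dog": 0, "cat": 1, "runs": 2}
vocab_fr = {"chien": 0, "chat": 1, "court": 2}
dim = 4

# Both languages' word vectors live in the same continuous space.
E_en = rng.normal(scale=0.1, size=(len(vocab_en), dim))
E_fr = rng.normal(scale=0.1, size=(len(vocab_fr), dim))

# Sentence-level supervision only: parallel sentences, no word alignment.
parallel = [(["dog", "runs"], ["chien", "court"]),
            (["cat", "runs"], ["chat", "court"])]

def sent_vec(E, vocab, words):
    """Sentence representation: mean of its word vectors."""
    return E[[vocab[w] for w in words]].mean(axis=0)

# Gradient descent on 0.5 * ||sent_vec(en) - sent_vec(fr)||^2 per pair.
lr = 0.5
for _ in range(200):
    for en, fr in parallel:
        diff = sent_vec(E_en, vocab_en, en) - sent_vec(E_fr, vocab_fr, fr)
        for w in en:
            E_en[vocab_en[w]] -= lr * diff / len(en)
        for w in fr:
            E_fr[vocab_fr[w]] += lr * diff / len(fr)
```

After training, parallel sentence representations coincide in the shared space; the variational model additionally regularizes this space through a latent semantic variable, which this sketch omits.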


Author(s):  
Shizhe Chen ◽  
Qin Jin ◽  
Jianlong Fu

The neural machine translation model has suffered from the lack of large-scale parallel corpora. In contrast, we humans can learn multi-lingual translations even without parallel texts, by grounding our languages in the external world. To mimic such human learning behavior, we employ images as pivots to enable zero-resource translation learning. However, a picture is worth a thousand words, which makes multi-lingual sentences pivoted by the same image noisy as mutual translations and thus hinders translation model learning. In this work, we propose a progressive learning approach for image-pivoted zero-resource machine translation. Since words are less diverse when grounded in the image, we first learn word-level translation with image pivots, and then progress to sentence-level translation by utilizing the learned word translations to suppress noise in image-pivoted multi-lingual sentences. Experimental results on two widely used image-pivot translation datasets, IAPR-TC12 and Multi30k, show that the proposed approach significantly outperforms other state-of-the-art methods.
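A minimal sketch of the word-level pivoting step, under the assumption (ours, not the paper's) that grounding is summarized as word-image co-occurrence counts: words of two languages are scored as translation candidates by the cosine similarity of their image profiles. All words, images, and counts are invented:

```python
import numpy as np

# Invented word-image grounding counts: rows = words, columns = images.
words_l1 = ["dog", "ball", "sky"]
words_l2 = ["hund", "ball_de", "himmel"]
A = np.array([[3.0, 1.0, 0.0],   # how often each L1 word describes each image
              [0.0, 4.0, 1.0],
              [0.0, 0.0, 2.0]])
B = np.array([[2.0, 0.0, 0.0],   # the same counts for L2 words
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 5.0]])

# Two words are translation candidates when they ground in the same images:
# score them by cosine similarity of their image-count profiles.
A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
scores = A_n @ B_n.T

translation = {w: words_l2[int(np.argmax(scores[i]))]
               for i, w in enumerate(words_l1)}
```

In the paper's progressive scheme, word pairs learned this way would then help filter the much noisier sentence pairs that merely share an image.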


2015 ◽  
pp. 67-78
Author(s):  
Violetta Koseska-Toszewa

About Certain Semantic Annotation in Parallel Corpora

The semantic notation analyzed in this work belongs to the second of the semantic theory streams presented here: direct approach semantics. We used this stream in our work on the Bulgarian-Polish Contrastive Grammar. Our semantic notation distinguishes quantificational meanings of names and predicates, and indicates aspectual and temporal meanings of verbs. It relies on logical scope-based quantification and on the contemporary theory of processes known as "Petri nets". Thanks to it, we can distinguish precisely between a language form and its contents, e.g. a perfective verb form has two meanings: an event, or a sequence of events and states finally ended with an event. An imperfective verb form also has two meanings: a state, or a sequence of states and events finally ended with a state. In turn, names are quantified universally or existentially when they are "undefined", and uniquely (using the iota operator) when they are "defined". A fact worth emphasizing is the possibility of quantifying not only names but also the predicate, in which case quantification concerns time and aspect. This is a novum in elaborating sentence-level semantics in parallel corpora. For this reason, our semantic annotation is manual. We hope that it will raise the interest of computer scientists working on automatic methods for processing the given natural languages. Semantic annotation defined as in this work will facilitate contrastive studies of natural languages, which in turn will verify the results of those studies and will certainly facilitate human and machine translations.
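The perfective/imperfective distinction described above can be illustrated with a trivial sketch; representing the net's alternation of states ("S") and events ("E") as a plain list is our simplification, not the authors' notation:

```python
def aspect_of(sequence):
    """Classify a state/event sequence ("S"/"E") by its final element:
    perfective readings end with an event, imperfective ones with a state."""
    assert sequence and all(token in ("S", "E") for token in sequence)
    return "perfective" if sequence[-1] == "E" else "imperfective"
```

For example, a state followed by a terminating event yields a perfective reading, while an event followed by an ongoing state yields an imperfective one.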


2015 ◽  
pp. 211-236
Author(s):  
Violetta Koseska-Toszewa ◽  
Roman Roszko

On Semantic Annotation in Clarin-PL Parallel Corpora

In this article, the authors present a proposal for semantic annotation in the Clarin-PL parallel corpora: Polish-Bulgarian-Russian and Polish-Lithuanian. Semantic annotation of quantification is a novum in developing sentence-level semantics in multilingual parallel corpora. This is why our semantic annotation is manual. The authors hope it will be of interest to IT specialists working on automatic processing of the given natural languages. Semantic annotation as defined here will make contrastive studies of natural languages more efficient, which in turn will help verify the results of those studies and will certainly improve human and machine translations.


2021 ◽  
Vol 12 ◽  
Author(s):  
Kanglong Liu ◽  
Muhammad Afzaal

Based on a corpus-driven analysis of two translated versions of Hongloumeng (one by David Hawkes and the other by Xianyi Yang and Gladys Yang) in parallel corpora, this article investigates the use of lexical bundles in an attempt to trace the stylistic features and differences in the translations produced by the respective translators. The Hongloumeng corpus is aligned at the sentence level so that the source text co-occurs with the two corresponding translations. For this purpose, three-word and four-word lexical bundles were first extracted and then analyzed with respect to the functional classification proposed by Biber et al. (2004). The results of the study show that Hawkes' translation is embedded with a greater number and variety of lexical bundles than the one by the Yang couple. The study also identified differences between the two versions which can be traced back to the translators' deployment of different translation strategies, which appear in turn to be influenced by the translators' language backgrounds, the translation skopos and settings, and the social, political, and ideological milieu in which the translations were produced.
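Lexical bundle extraction itself is straightforward contiguous n-gram counting; a minimal sketch (the tokenized sample text is invented, and real bundle studies additionally apply frequency-per-million and dispersion thresholds across texts):

```python
from collections import Counter

def lexical_bundles(tokens, n, min_freq=2):
    """Count contiguous n-word sequences and keep the recurrent ones."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= min_freq}

text = ("on the other hand he said that "
        "on the other hand it was said that").split()
bundles = lexical_bundles(text, n=4)
```

The surviving four-word bundles would then be assigned to functional categories (stance, discourse organizers, referential) following Biber et al. (2004).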


2004 ◽  
Vol 30 (2) ◽  
pp. 181-204 ◽  
Author(s):  
Sonja Nießen ◽  
Hermann Ney

In statistical machine translation, correspondences between the words in the source and the target language are learned from parallel corpora, and often little or no linguistic knowledge is used to structure the underlying models. In particular, existing statistical systems for machine translation often treat different inflected forms of the same lemma as if they were independent of one another. The bilingual training data can be better exploited by explicitly taking into account the interdependencies of related inflected forms. We propose the construction of hierarchical lexicon models on the basis of equivalence classes of words. In addition, we introduce sentence-level restructuring transformations which aim at the assimilation of word order in related sentences. We have systematically investigated the amount of bilingual training data required to maintain an acceptable quality of machine translation. The combination of the suggested methods for improving translation quality in frameworks with scarce resources has been successfully tested: We were able to reduce the amount of bilingual training data to less than 10% of the original corpus, while losing only 1.6% in translation quality. The improvement of the translation results is demonstrated on two German-English corpora taken from the Verbmobil task and the Nespole! task.
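A minimal sketch of the equivalence-class idea, with an invented toy lemmatization table standing in for the morphological analysis the authors rely on: inflected forms are grouped under their lemma, so that lexicon statistics can be shared across related forms instead of treating each form as independent:

```python
from collections import defaultdict

# Invented toy lemmatization table for a few German inflected forms.
lemma_of = {"gehe": "gehen", "gehst": "gehen", "geht": "gehen",
            "Hauses": "Haus", "Häuser": "Haus"}

def equivalence_classes(vocabulary):
    """Group inflected forms into equivalence classes keyed by lemma;
    unknown forms fall back to a singleton class of their own."""
    classes = defaultdict(set)
    for form in vocabulary:
        classes[lemma_of.get(form, form)].add(form)
    return dict(classes)

classes = equivalence_classes(["gehe", "geht", "Hauses", "Häuser", "und"])
```

A hierarchical lexicon model could then smooth a sparse form-level translation probability with the pooled counts of the form's whole class, which is what makes the drastic reduction in bilingual training data tolerable.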


2020 ◽  
Vol 63 (7) ◽  
pp. 2281-2292
Author(s):  
Ying Zhao ◽  
Xinchun Wu ◽  
Hongjun Chen ◽  
Peng Sun ◽  
Ruibo Xie ◽  
...  

Purpose: This exploratory study aimed to investigate the potential impact of sentence-level comprehension and sentence-level fluency on passage comprehension of deaf students in elementary school. Method: A total of 159 deaf students, 65 students (mean age = 13.46 years) in Grades 3 and 4 and 94 students (mean age = 14.95 years) in Grades 5 and 6, were assessed for nonverbal intelligence, vocabulary knowledge, sentence-level comprehension, sentence-level fluency, and passage comprehension. Group differences were examined using t tests, whereas the predictive and mediating mechanisms were examined using regression modeling. Results: The regression analyses showed that the effect of sentence-level comprehension on passage comprehension was not significant, whereas sentence-level fluency was an independent predictor in Grades 3-4. Sentence-level comprehension and fluency contributed significant variance to passage comprehension in Grades 5-6. Sentence-level fluency fully mediated the influence of sentence-level comprehension on passage comprehension in Grades 3-4, playing a partial mediating role in Grades 5-6. Conclusions: The relative contributions of sentence-level comprehension and fluency to deaf students' passage comprehension varied, and sentence-level fluency mediated the relationship between sentence-level comprehension and passage comprehension.
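The mediation logic reported above can be illustrated with a toy regression sketch on synthetic data (coefficients invented; the data are constructed so that X affects Y only through the mediator M, i.e. full mediation): the total effect of X on Y is large, while the direct effect after controlling for M is near zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic full-mediation data: X -> M -> Y, no direct X -> Y path.
X = rng.normal(size=n)                       # sentence-level comprehension
M = 0.8 * X + rng.normal(scale=0.5, size=n)  # sentence-level fluency (mediator)
Y = 0.9 * M + rng.normal(scale=0.5, size=n)  # passage comprehension

def ols(y, *cols):
    """Least-squares coefficients (intercept first) of y on the given columns."""
    A = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(A, y, rcond=None)[0]

total_effect = ols(Y, X)[1]      # effect of X on Y, ignoring the mediator
direct_effect = ols(Y, X, M)[1]  # effect of X on Y, controlling for the mediator
```

Full mediation shows up as total_effect near 0.8 × 0.9 = 0.72 while direct_effect shrinks toward zero; a partial mediation pattern, as in Grades 5-6, would leave direct_effect significantly nonzero.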


2017 ◽  
Vol 24 (1) ◽  
pp. 46-72
Author(s):  
Jacob Tootalian

Ben Jonson's early plays show a marked interest in prose as a counterpoint to the blank verse norm of the Renaissance stage. This essay presents a digital analysis of Jonson's early mixed-mode plays and his two later full-prose comedies. It examines this selection of the Jonsonian corpus using DocuScope, a piece of software that catalogs sentence-level features of texts according to a series of rhetorical categories, highlighting the distinctive linguistic patterns associated with Jonson's verse and prose. Verse tends to employ abstract, morally and emotionally charged language, while prose is more often characterized by expressions that are socially explicit, interrogative, and interactive. In the satirical economy of these plays, Jonson's characters usually adopt verse when they articulate censorious judgements, descending into prose when they wade into the intractable banter of the vicious world. Surprisingly, the prosaic signature that Jonson fashioned in his earlier drama persisted in the two later full-prose comedies. The essay presents readings of Every Man Out of his Humour and Bartholomew Fair, illustrating how the tension between verse and prose that motivated the satirical dynamics of the mixed-mode plays was released in the full-prose comedies. Jonson's final experiments with theatrical prose dramatize the exhaustion of the satirical impulse by submerging his characters almost entirely in the prosaic world of interactive engagement.

