Sentence Level Alignment of Digitized Books Parallel Corpora

Informatica ◽  
2018 ◽  
Vol 29 (4) ◽  
pp. 693-710
Author(s):  
Algirdas Laukaitis ◽  
Darius Plikynas ◽  
Egidijus Ostasius


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Michael Adjeisah ◽  
Guohua Liu ◽  
Douglas Omwenga Nyabuga ◽  
Richard Nuetey Nortey ◽  
Jinling Song

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains challenging. This research contributes to the field with a low-resource English-Twi translation system based on filtered synthetic parallel corpora. It is often difficult to establish what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we apply an unsupervised measure to each sentence pair using squared Mahalanobis distance, a filtering technique that predicts sentence parallelism. Additionally, we make extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel data demonstrate that injecting a pseudo-parallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach yields substantial gains in BLEU and TER scores.
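The abstract does not specify which features the Mahalanobis filter operates on; as a minimal sketch, squared Mahalanobis distance filtering of sentence pairs might look like the following, where the per-pair feature vectors are invented toy numbers (in practice they could encode signals such as length ratio or lexical overlap):

```python
import numpy as np

def squared_mahalanobis(features):
    """Squared Mahalanobis distance of each row from the sample mean."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse guards against singular covariance
    diffs = features - mu
    return np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)

def filter_pairs(pairs, features, threshold):
    """Keep sentence pairs whose feature vector lies within the distance threshold."""
    d2 = squared_mahalanobis(np.asarray(features, dtype=float))
    return [pair for pair, d in zip(pairs, d2) if d <= threshold]

# Toy feature vectors, one per sentence pair; the last pair is an obvious outlier.
pairs = ["pair-a", "pair-b", "pair-c", "pair-d", "pair-e"]
features = [[1.0, 0.9], [1.1, 0.8], [0.9, 0.9], [1.0, 0.8], [50.0, -5.0]]
kept = filter_pairs(pairs, features, threshold=2.5)
```

Pairs far from the bulk of the distribution under the learned covariance are predicted to be non-parallel and dropped; the threshold would be tuned on held-out data.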


Author(s):  
Liangchen Wei ◽  
Zhi-Hong Deng

Cross-language learning allows one to use training data from one language to build models for another language. Many traditional approaches require word-level alignments from parallel corpora; in this paper, we define a general bilingual training objective function that requires only a sentence-level parallel corpus. We propose a variational autoencoding approach for training bilingual word embeddings. The variational model introduces a continuous latent variable to explicitly model the underlying semantics of the parallel sentence pairs and to guide the generation of the sentence pairs. Our model restricts the bilingual word embeddings to represent words in exactly the same continuous vector space. Empirical results on the task of cross-lingual document classification show that our method is effective.
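The authors' model is a variational autoencoder, which is beyond a short sketch; the following toy illustration (all vocabulary, data, and dimensions invented) shows only the weaker core idea of sentence-level-only supervision: averaged word embeddings of parallel sentences are pulled together in one shared space, with no word-level alignment ever supplied:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy vocabularies for the two languages.
vocab_en = {"dog": 0, "cat": 1, "runs": 2}
vocab_fr = {"chien": 0, "chat": 1, "court": 2}
dim = 4

# Both languages' word vectors live in the same continuous space.
E_en = rng.normal(scale=0.1, size=(len(vocab_en), dim))
E_fr = rng.normal(scale=0.1, size=(len(vocab_fr), dim))

# Sentence-level supervision only: parallel sentences, no word alignment.
parallel = [(["dog", "runs"], ["chien", "court"]),
            (["cat", "runs"], ["chat", "court"])]

def sent_vec(E, vocab, words):
    """Sentence representation: mean of its word vectors."""
    return E[[vocab[w] for w in words]].mean(axis=0)

# Gradient descent on 0.5 * ||sent_vec(en) - sent_vec(fr)||^2 per pair.
lr = 0.5
for _ in range(200):
    for en, fr in parallel:
        diff = sent_vec(E_en, vocab_en, en) - sent_vec(E_fr, vocab_fr, fr)
        for w in en:
            E_en[vocab_en[w]] -= lr * diff / len(en)
        for w in fr:
            E_fr[vocab_fr[w]] += lr * diff / len(fr)
```

After training, parallel sentence representations coincide in the shared space; the variational model additionally regularizes this space through a latent semantic variable, which this sketch omits.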


Author(s):  
Shizhe Chen ◽  
Qin Jin ◽  
Jianlong Fu

The neural machine translation model has suffered from the lack of large-scale parallel corpora. In contrast, we humans can learn multi-lingual translations even without parallel texts, by grounding our languages in the external world. To mimic such human learning behavior, we employ images as pivots to enable zero-resource translation learning. However, a picture is worth a thousand words, which makes multi-lingual sentences pivoted by the same image noisy as mutual translations and thus hinders translation model learning. In this work, we propose a progressive learning approach for image-pivoted zero-resource machine translation. Since words are less diverse when grounded in the image, we first learn word-level translation with image pivots, and then progress to sentence-level translation by utilizing the learned word translations to suppress noise in image-pivoted multi-lingual sentences. Experimental results on two widely used image-pivot translation datasets, IAPR-TC12 and Multi30k, show that the proposed approach significantly outperforms other state-of-the-art methods.
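A minimal sketch of the word-level pivoting step, under the assumption (ours, not the paper's) that grounding is summarized as word-image co-occurrence counts: words of two languages are scored as translation candidates by the cosine similarity of their image profiles. All words, images, and counts are invented:

```python
import numpy as np

# Invented word-image grounding counts: rows = words, columns = images.
words_l1 = ["dog", "ball", "sky"]
words_l2 = ["hund", "ball_de", "himmel"]
A = np.array([[3.0, 1.0, 0.0],   # how often each L1 word describes each image
              [0.0, 4.0, 1.0],
              [0.0, 0.0, 2.0]])
B = np.array([[2.0, 0.0, 0.0],   # the same counts for L2 words
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 5.0]])

# Two words are translation candidates when they ground in the same images:
# score them by cosine similarity of their image-count profiles.
A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
scores = A_n @ B_n.T

translation = {w: words_l2[int(np.argmax(scores[i]))]
               for i, w in enumerate(words_l1)}
```

In the paper's progressive scheme, word pairs learned this way would then help filter the much noisier sentence pairs that merely share an image.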


2015 ◽  
pp. 67-78
Author(s):  
Violetta Koseska-Toszewa

About Certain Semantic Annotation in Parallel Corpora

The semantic notation analyzed in this work belongs to the second of the semantic theory streams presented here: direct approach semantics. We used this stream in our work on the Bulgarian-Polish Contrastive Grammar. Our semantic notation distinguishes quantificational meanings of names and predicates, and indicates aspectual and temporal meanings of verbs. It relies on logical scope-based quantification and on the contemporary theory of processes known as "Petri nets". Thanks to it, we can distinguish precisely between a language form and its contents, e.g. a perfective verb form has two meanings: an event, or a sequence of events and states finally ended with an event. An imperfective verb form also has two meanings: a state, or a sequence of states and events finally ended with a state. In turn, names are quantified universally or existentially when they are "undefined", and uniquely (using the iota operator) when they are "defined". A fact worth emphasizing is the possibility of quantifying not only names but also the predicate, in which case quantification concerns time and aspect. This is a novum in elaborating sentence-level semantics in parallel corpora. For this reason, our semantic annotation is manual. We hope that it will raise the interest of computer scientists working on automatic methods for processing the given natural languages. Semantic annotation defined as in this work will facilitate contrastive studies of natural languages, which in turn will verify the results of those studies and will certainly facilitate human and machine translations.
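The perfective/imperfective distinction described above can be illustrated with a trivial sketch; representing the net's alternation of states ("S") and events ("E") as a plain list is our simplification, not the authors' notation:

```python
def aspect_of(sequence):
    """Classify a state/event sequence ("S"/"E") by its final element:
    perfective readings end with an event, imperfective ones with a state."""
    assert sequence and all(token in ("S", "E") for token in sequence)
    return "perfective" if sequence[-1] == "E" else "imperfective"
```

For example, a state followed by a terminating event yields a perfective reading, while an event followed by an ongoing state yields an imperfective one.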


2015 ◽  
pp. 211-236
Author(s):  
Violetta Koseska-Toszewa ◽  
Roman Roszko

On Semantic Annotation in Clarin-PL Parallel Corpora

In this article, the authors present a proposal for semantic annotation in the Clarin-PL parallel corpora: Polish-Bulgarian-Russian and Polish-Lithuanian. Semantic annotation of quantification is a novum in developing sentence-level semantics in multilingual parallel corpora. This is why our semantic annotation is manual. The authors hope it will be of interest to IT specialists working on automatic processing of the given natural languages. Semantic annotation as defined here will make contrastive studies of natural languages more efficient, which in turn will help verify the results of those studies and will certainly improve human and machine translations.


2021 ◽  
Vol 12 ◽  
Author(s):  
Kanglong Liu ◽  
Muhammad Afzaal

Based on a corpus-driven analysis of two translated versions of Hongloumeng (one by David Hawkes and the other by Xianyi Yang and Gladys Yang) in parallel corpora, this article investigates the use of lexical bundles in an attempt to trace the stylistic features and differences in the translations produced by the respective translators. The Hongloumeng corpus is aligned at the sentence level so that the source text co-occurs with the two corresponding translations. For this purpose, three-word and four-word lexical bundles were first extracted and then analyzed with respect to the functional classification proposed by Biber et al. (2004). The results of the study show that Hawkes' translation is embedded with a greater number and variety of lexical bundles than the one by the Yang couple. The study also identified differences between the two versions which can be traced back to the translators' deployment of different translation strategies, which appear in turn to be influenced by the translators' language backgrounds, the translation skopos and settings, and the social, political, and ideological milieu in which the translations were produced.
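Lexical bundle extraction itself is straightforward contiguous n-gram counting; a minimal sketch (the tokenized sample text is invented, and real bundle studies additionally apply frequency-per-million and dispersion thresholds across texts):

```python
from collections import Counter

def lexical_bundles(tokens, n, min_freq=2):
    """Count contiguous n-word sequences and keep the recurrent ones."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= min_freq}

text = ("on the other hand he said that "
        "on the other hand it was said that").split()
bundles = lexical_bundles(text, n=4)
```

The surviving four-word bundles would then be assigned to functional categories (stance, discourse organizers, referential) following Biber et al. (2004).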


2004 ◽  
Vol 30 (2) ◽  
pp. 181-204 ◽  
Author(s):  
Sonja Nießen ◽  
Hermann Ney

In statistical machine translation, correspondences between the words in the source and the target language are learned from parallel corpora, and often little or no linguistic knowledge is used to structure the underlying models. In particular, existing statistical systems for machine translation often treat different inflected forms of the same lemma as if they were independent of one another. The bilingual training data can be better exploited by explicitly taking into account the interdependencies of related inflected forms. We propose the construction of hierarchical lexicon models on the basis of equivalence classes of words. In addition, we introduce sentence-level restructuring transformations which aim at the assimilation of word order in related sentences. We have systematically investigated the amount of bilingual training data required to maintain an acceptable quality of machine translation. The combination of the suggested methods for improving translation quality in frameworks with scarce resources has been successfully tested: We were able to reduce the amount of bilingual training data to less than 10% of the original corpus, while losing only 1.6% in translation quality. The improvement of the translation results is demonstrated on two German-English corpora taken from the Verbmobil task and the Nespole! task.
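A minimal sketch of the equivalence-class idea, with an invented toy lemmatization table standing in for the morphological analysis the authors rely on: inflected forms are grouped under their lemma, so that lexicon statistics can be shared across related forms instead of treating each form as independent:

```python
from collections import defaultdict

# Invented toy lemmatization table for a few German inflected forms.
lemma_of = {"gehe": "gehen", "gehst": "gehen", "geht": "gehen",
            "Hauses": "Haus", "Häuser": "Haus"}

def equivalence_classes(vocabulary):
    """Group inflected forms into equivalence classes keyed by lemma;
    unknown forms fall back to a singleton class of their own."""
    classes = defaultdict(set)
    for form in vocabulary:
        classes[lemma_of.get(form, form)].add(form)
    return dict(classes)

classes = equivalence_classes(["gehe", "geht", "Hauses", "Häuser", "und"])
```

A hierarchical lexicon model could then smooth a sparse form-level translation probability with the pooled counts of the form's whole class, which is what makes the drastic reduction in bilingual training data tolerable.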


2020 ◽  
Vol 63 (7) ◽  
pp. 2281-2292
Author(s):  
Ying Zhao ◽  
Xinchun Wu ◽  
Hongjun Chen ◽  
Peng Sun ◽  
Ruibo Xie ◽  
...  

Purpose: This exploratory study aimed to investigate the potential impact of sentence-level comprehension and sentence-level fluency on passage comprehension of deaf students in elementary school. Method: A total of 159 deaf students, 65 students (mean age = 13.46 years) in Grades 3 and 4 and 94 students (mean age = 14.95 years) in Grades 5 and 6, were assessed for nonverbal intelligence, vocabulary knowledge, sentence-level comprehension, sentence-level fluency, and passage comprehension. Group differences were examined using t tests, whereas the predictive and mediating mechanisms were examined using regression modeling. Results: The regression analyses showed that the effect of sentence-level comprehension on passage comprehension was not significant, whereas sentence-level fluency was an independent predictor in Grades 3-4. Sentence-level comprehension and fluency contributed significant variance to passage comprehension in Grades 5-6. Sentence-level fluency fully mediated the influence of sentence-level comprehension on passage comprehension in Grades 3-4, playing a partial mediating role in Grades 5-6. Conclusions: The relative contributions of sentence-level comprehension and fluency to deaf students' passage comprehension varied, and sentence-level fluency mediated the relationship between sentence-level comprehension and passage comprehension.
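The mediation logic reported above can be illustrated with a toy regression sketch on synthetic data (coefficients invented; the data are constructed so that X affects Y only through the mediator M, i.e. full mediation): the total effect of X on Y is large, while the direct effect after controlling for M is near zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic full-mediation data: X -> M -> Y, no direct X -> Y path.
X = rng.normal(size=n)                       # sentence-level comprehension
M = 0.8 * X + rng.normal(scale=0.5, size=n)  # sentence-level fluency (mediator)
Y = 0.9 * M + rng.normal(scale=0.5, size=n)  # passage comprehension

def ols(y, *cols):
    """Least-squares coefficients (intercept first) of y on the given columns."""
    A = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(A, y, rcond=None)[0]

total_effect = ols(Y, X)[1]      # effect of X on Y, ignoring the mediator
direct_effect = ols(Y, X, M)[1]  # effect of X on Y, controlling for the mediator
```

Full mediation shows up as total_effect near 0.8 × 0.9 = 0.72 while direct_effect shrinks toward zero; a partial mediation pattern, as in Grades 5-6, would leave direct_effect significantly nonzero.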


2017 ◽  
Vol 24 (1) ◽  
pp. 46-72
Author(s):  
Jacob Tootalian

Ben Jonson's early plays show a marked interest in prose as a counterpoint to the blank verse norm of the Renaissance stage. This essay presents a digital analysis of Jonson's early mixed-mode plays and his two later full-prose comedies. It examines this selection of the Jonsonian corpus using DocuScope, a piece of software that catalogs sentence-level features of texts according to a series of rhetorical categories, highlighting the distinctive linguistic patterns associated with Jonson's verse and prose. Verse tends to employ abstract, morally and emotionally charged language, while prose is more often characterized by expressions that are socially explicit, interrogative, and interactive. In the satirical economy of these plays, Jonson's characters usually adopt verse when they articulate censorious judgements, descending into prose when they wade into the intractable banter of the vicious world. Surprisingly, the prosaic signature that Jonson fashioned in his earlier drama persisted in the two later full-prose comedies. The essay presents readings of Every Man Out of his Humour and Bartholomew Fair, illustrating how the tension between verse and prose that motivated the satirical dynamics of the mixed-mode plays was released in the full-prose comedies. Jonson's final experiments with theatrical prose dramatize the exhaustion of the satirical impulse by submerging his characters almost entirely in the prosaic world of interactive engagement.

