Subwords-Only Alternatives to fastText for Morphologically Rich Languages

2021 ◽  
Vol 47 (1) ◽  
pp. 56-66
Author(s):  
Tsolak Ghukasyan ◽  
Yeva Yeshilbashyan ◽  
Karen Avetisyan


Author(s):  
Rashmini Naranpanawa ◽  
Ravinga Perera ◽  
Thilakshi Fonseka ◽  
Uthayasanker Thayasivam

Neural machine translation (NMT) performs considerably better than statistical machine translation (SMT) when abundant parallel corpora are available. However, vanilla NMT operates mainly at the word level with a fixed vocabulary, so low-resource, morphologically rich languages such as Sinhala are particularly affected by the out-of-vocabulary (OOV) and rare-word problems. Recent advances in subword techniques have opened up opportunities for low-resource communities by enabling open-vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system based on the transformer and explore standard subword techniques on top of it to identify which subword approach works best for the English-Sinhala language pair. Our models demonstrate that subword segmentation strategies combined with state-of-the-art NMT can perform remarkably well when translating English sentences into a morphologically rich language, even without a large parallel corpus.
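As a minimal sketch of the kind of subword segmentation explored in work like this, the snippet below trains a BPE model with the SentencePiece library and segments a sentence into open-vocabulary subword units. The corpus file name, vocabulary size, and sample sentence are placeholder assumptions, not the authors' actual pipeline or configuration.

```python
# Illustrative sketch, not the paper's exact pipeline: train a BPE subword model
# with SentencePiece and segment text into open-vocabulary subword units.
# The corpus file name and vocab_size are placeholder assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.si",        # target-side (e.g. Sinhala) text, one sentence per line
    model_prefix="si_bpe",
    vocab_size=8000,
    model_type="bpe",        # "unigram" is another common choice
)

sp = spm.SentencePieceProcessor(model_file="si_bpe.model")
pieces = sp.encode("a sentence from the training data", out_type=str)
print(pieces)                # subword pieces, e.g. ['▁a', '▁sent', 'ence', ...]
```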


2018 ◽  
Vol 6 ◽  
pp. 451-465 ◽  
Author(s):  
Daniela Gerz ◽  
Ivan Vulić ◽  
Edoardo Ponti ◽  
Jason Naradowsky ◽  
Roi Reichart ◽  
...  

Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary, consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-level informed models should be particularly effective for morphologically-rich languages (MRLs) that exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically-rich languages. Our code and data sets are publicly available.
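To illustrate the general idea of folding subword-level information into word vectors, the toy sketch below composes a word representation from a word embedding plus the embeddings of its character n-grams (fastText-style). This is only an illustration of the family of techniques; it is not the specific injection method proposed in the paper, and the dimensions and n-gram range are arbitrary.

```python
# Toy illustration (not the paper's method): combine a word-level embedding with
# character n-gram embeddings so that rare or unseen words still receive
# informative representations.
import numpy as np

DIM = 50
rng = np.random.default_rng(0)
word_emb = {}    # word -> vector
ngram_emb = {}   # character n-gram -> vector

def char_ngrams(word, n_min=3, n_max=5):
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def lookup(table, key):
    # Lazily initialise embeddings; in a real model these are trained parameters.
    if key not in table:
        table[key] = rng.normal(scale=0.1, size=DIM)
    return table[key]

def subword_aware_vector(word):
    grams = char_ngrams(word)
    vec = lookup(word_emb, word) + sum(lookup(ngram_emb, g) for g in grams)
    return vec / (1 + len(grams))   # average the word and n-gram contributions

print(subword_aware_vector("unremarkably")[:5])
```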


Author(s):  
Roberts Darģis ◽  
Ilze Auzin̦a ◽  
Kristīne Levāne-Petrova ◽  
Inga Kaija

This paper presents a detailed error annotation schema for morphologically rich languages. The described approach is used to create the Latvian Language Learner corpus (LaVA), which is part of the ongoing project Development of Learner corpus of Latvian: methods, tools and applications. An advanced multi-token error annotation schema is not needed, because the annotated texts are written by beginner-level (A1 and A2) learners who use simple syntactic structures. Instead, the schema focuses on in-depth categorization of spelling and word-formation errors. It will work best for languages with relatively free word order and rich morphology.
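To make the flavor of such token-level error annotation concrete, here is a toy sketch of how a single learner error might be recorded. The field names, the English example words, and the category labels ("spelling", "word_formation") are illustrative placeholders and not the actual LaVA schema.

```python
# Toy illustration (not the actual LaVA schema): a single-token error annotation
# recording the learner's form, the target form, and a coarse error category.
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    token: str        # form written by the learner
    correction: str   # target (correct) form
    category: str     # e.g. "spelling" or "word_formation" (placeholder labels)

annotations = [
    ErrorAnnotation(token="runned", correction="ran", category="word_formation"),
    ErrorAnnotation(token="recieve", correction="receive", category="spelling"),
]

for a in annotations:
    print(f"{a.token} -> {a.correction} ({a.category})")
```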


2017 ◽  
Vol 108 (1) ◽  
pp. 257-269 ◽  
Author(s):  
Nasser Zalmout ◽  
Nizar Habash

Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from using different tokenization schemes for different words within the same text, and also for different target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian and Chinese. Our results show that different target languages indeed require different source-language schemes, and that a context-variable tokenization scheme can outperform a context-constant scheme, with a statistically significant improvement of about 1.4 BLEU points.
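The sketch below only illustrates the contrast between context-constant and context-variable tokenization: each source word can be segmented under more than one scheme, and the scheme is chosen per word and per target language. The scheme names, the toy segmenter, and the selection rule are hypothetical placeholders, not the paper's actual Arabic schemes or trained classifier.

```python
# Hypothetical sketch of context-variable tokenization scheme selection.

def segment(word, scheme):
    # Toy segmenters: the "aggressive" scheme splits off a leading clitic-like "w".
    if scheme == "aggressive" and word.startswith("w") and len(word) > 3:
        return ["w+", word[1:]]
    return [word]

def choose_scheme(word, context, target_lang):
    # Toy rule standing in for a trained, context-aware classifier: segment more
    # aggressively for some targets and only for sentence-initial words.
    if target_lang in {"en", "zh"} and context.get("position") == 0:
        return "aggressive"
    return "simple"

sentence = ["wktb", "alwld", "rsAlp"]      # toy Buckwalter-style transliteration
tokens = []
for i, w in enumerate(sentence):
    scheme = choose_scheme(w, {"position": i}, target_lang="en")
    tokens.extend(segment(w, scheme))
print(tokens)                              # ['w+', 'ktb', 'alwld', 'rsAlp']
```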


Author(s):  
Malka Rappaport Hovav

Theories of argument realization typically associate verbs with an argument structure and provide algorithms for mapping that argument structure to its morphosyntactic realization. A major challenge to such theories comes from the fact that most verbs have more than one option for argument realization. Sometimes a particular range of realization options for a verb is systematic in that it is consistently available to a relatively well-defined class of verbs; it is then considered to be one of a set of recognized argument alternations. Often, but not always, these argument alternations are associated with morphological marking. An examination of cross-linguistic patterns of morphology associated with the causative alternation and the dative alternation reveals that the alternation is not directly encoded in the morphology. For both alternations, understanding the morphological patterns requires an understanding of the interaction between the semantics of the verb and the construction the verb is integrated into. Strikingly, similar interactions between the verb and the construction are found in languages that do not mark the alternations morphologically, and the patterns of morphological marking in morphologically rich languages can shed light on the appropriate analysis of the alternations in languages that do not mark them morphologically.


2012 ◽  
Vol 39 (16) ◽  
pp. 12709-12718 ◽  
Author(s):  
Krunoslav Zubrinic ◽  
Damir Kalpic ◽  
Mario Milicevic

2017 ◽  
Vol 43 (2) ◽  
pp. 311-347 ◽  
Author(s):  
Miguel Ballesteros ◽  
Chris Dyer ◽  
Yoav Goldberg ◽  
Noah A. Smith

We introduce a greedy transition-based parser that learns to represent parser states using recurrent neural networks. Our primary innovation, which enables us to do this efficiently, is a new control structure for sequential neural networks: the stack long short-term memory unit (stack LSTM). Like the conventional stack data structures used in transition-based parsers, it allows elements to be pushed to or popped from the top of the stack in constant time, but in addition it maintains a continuous-space embedding of the stack contents. Our model captures three facets of the parser's state: (i) unbounded look-ahead into the buffer of incoming words, (ii) the complete history of transition actions taken by the parser, and (iii) the complete contents of the stack of partially built tree fragments, including their internal structures. In addition, we compare two different word representations: (i) standard word vectors based on look-up tables and (ii) character-based models of words. Although standard word embedding models work well in all languages, the character-based models improve the handling of out-of-vocabulary words, particularly in morphologically rich languages. Finally, we discuss the use of dynamic oracles in training the parser. During training, dynamic oracles alternate between sampling parser states from the training data and from the model as it is being learned, making the model more robust to the kinds of errors it will make at test time. Training our model with dynamic oracles yields a linear-time greedy parser with very competitive performance.
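A minimal PyTorch sketch of the stack LSTM control structure described above: the stack is a list of LSTM states, push advances the cell from the current top, pop discards the top, and the embedding of the full stack contents is always available from the current top state. This is an illustrative reimplementation under those assumptions, not the authors' code.

```python
# Illustrative stack LSTM (not the authors' implementation): a stack whose
# summary embedding is the LSTM hidden state at its current top.
import torch
import torch.nn as nn

class StackLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim, hidden_dim)
        empty = (torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim))
        self.states = [empty]          # stack of (h, c); index 0 = empty stack

    def push(self, x):
        # Advance the LSTM from the current top state with the new element x.
        h, c = self.cell(x.unsqueeze(0), self.states[-1])
        self.states.append((h, c))

    def pop(self):
        # Discard the top state; the previous embedding becomes current again.
        self.states.pop()

    def embedding(self):
        # Continuous-space summary of the entire stack contents.
        return self.states[-1][0].squeeze(0)

stack = StackLSTM(input_dim=8, hidden_dim=16)
stack.push(torch.randn(8))
stack.push(torch.randn(8))
stack.pop()
print(stack.embedding().shape)   # torch.Size([16])
```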

