Entry Generation by Analogy – Encoding New Words for Morphological Lexicons

Author(s):  
Krister Lindén

Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names, or compounds of such words. To add new words to a lexicon, we need to indicate their base form and inflectional paradigm. In this article, we evaluate a combination of corpus-based and lexicon-based methods for assigning the base form and inflectional paradigm to new words in Finnish, Swedish and English finite-state transducer lexicons. The methods have been implemented with the open-source Helsinki Finite-State Technology (Lindén & al., 2009). As an entry generator often produces numerous suggestions, it is important that the best suggestions be among the first few; otherwise it may become more efficient to create the entries by hand. By combining the probabilities calculated from corpus data and from lexical data, we get a more precise combined model. The combined method has 77-81% precision and 89-97% recall, i.e., on average the first correctly generated entry is found as the first or second candidate for the test languages. A further study demonstrated that a native speaker could revise suggestions from the entry generator at a speed of 300-400 entries per hour.
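The abstract does not say how the corpus-based and lexicon-based probabilities are combined. As a minimal sketch, assuming the two estimates are treated as independent and multiplied (summed in log space), candidate entries could be ranked as below. The function names, the smoothing floor, and the candidate labels are hypothetical and are not taken from the article.

```python
import math
from typing import Dict, List, Optional, Set


def rank_candidates(corpus_prob: Dict[str, float],
                    lexicon_prob: Dict[str, float],
                    floor: float = 1e-9) -> List[str]:
    """Rank candidate entries by the product of two probability estimates.

    Assumption: the estimates are combined by a plain product; a candidate
    missing from one model receives a small floor probability instead of zero.
    """
    candidates = set(corpus_prob) | set(lexicon_prob)

    def score(candidate: str) -> float:
        p_corpus = max(corpus_prob.get(candidate, 0.0), floor)
        p_lexicon = max(lexicon_prob.get(candidate, 0.0), floor)
        return math.log(p_corpus) + math.log(p_lexicon)

    # Highest combined score first; ties are broken alphabetically.
    return sorted(candidates, key=lambda c: (-score(c), c))


def first_correct_rank(ranked: List[str], correct: Set[str]) -> Optional[int]:
    """Return the 1-based position of the first correct candidate, if any."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in correct:
            return position
    return None
```

Averaging `first_correct_rank` over a test set would give the kind of "first or second candidate" figure the abstract reports, under the assumptions stated above.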

Author(s):  
Lawrence B Adewole ◽  
Adebayo O Adetunmbi ◽  
Boniface K Alese ◽  
Samuel A Oluwadare

Recent methodologies in machine translation depend on the availability of large language corpora. The web, being a repository for text and other multimedia content, is a viable source for such data. However, text cleaning is needed as a pre-processing step, since foreign words are inevitably part of the harvested text. A dictionary lookup approach can be adopted for languages with a comprehensive lexicon, while manual cleaning is applied in other cases. Developing a full-coverage lexicon for the Yoruba language is a cumbersome task because new words can be formed as a result of elision, assimilation and contraction. In this paper, the morphology of the Yorùbá language was studied and modelled as a Finite State Machine (FSM) which accepts a word and returns true if the goal state is reached and false otherwise. The FSM model was implemented in Java. A Yorùbá dictionary containing 10,443 distinct words in their base form (i.e. without diacritics) and an English dictionary with 64,150 distinct words were parsed through the finite state machine. In addition, 58 web pages sourced from the internet were subjected to classification by the system. Classification of entries from the Yoruba dictionary as valid Yoruba words gave 99.99% accuracy, while the classification of entries from the English dictionary as non-Yoruba words gave 94.07% accuracy. Also, using the threshold of 90% valid Yoruba words in a webpage, all 58 webpages were correctly classified. The results show that the approach can reliably be applied in automatic harvesting of a Yoruba monolingual corpus from the internet.
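The two mechanisms described above, a finite state machine that returns true when a goal state is reached and a page-level threshold on the share of valid words, can be sketched roughly as follows. This is an illustration in Python rather than the authors' Java implementation. The transition table in `toy_acceptor` is a made-up two-letter language and does not model Yorùbá morphology, and treating the 90% threshold as inclusive is an assumption.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


class WordAcceptor:
    """Deterministic finite-state acceptor over the characters of a word."""

    def __init__(self, transitions: Dict[Tuple[str, str], str],
                 start: str, goals: Set[str]) -> None:
        self.transitions = transitions
        self.start = start
        self.goals = goals

    def accepts(self, word: str) -> bool:
        """Return True if reading the whole word ends in a goal state."""
        state = self.start
        for ch in word:
            next_state = self.transitions.get((state, ch))
            if next_state is None:  # no transition defined: reject
                return False
            state = next_state
        return state in self.goals


def toy_acceptor() -> WordAcceptor:
    """Hypothetical acceptor for the toy language (ba)+, for illustration only."""
    transitions = {("q0", "b"): "q1", ("q1", "a"): "q2", ("q2", "b"): "q1"}
    return WordAcceptor(transitions, start="q0", goals={"q2"})


def is_target_language_page(words: Iterable[str],
                            accepts: Callable[[str], bool],
                            threshold: float = 0.90) -> bool:
    """Classify a page by the fraction of its words that the acceptor accepts."""
    word_list = list(words)
    if not word_list:
        return False
    valid = sum(1 for w in word_list if accepts(w))
    return valid / len(word_list) >= threshold
```

For example, `is_target_language_page(page_words, toy_acceptor().accepts)` classifies a tokenized page, where `page_words` stands for whatever tokenizer output is available.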


2007 ◽  
Vol 18 (04) ◽  
pp. 859-871
Author(s):  
MARTIN ŠIMŮNEK ◽  
BOŘIVOJ MELICHAR

A border of a string is a prefix of the string that is simultaneously its suffix. It is one of the basic keystones of stringology, used as a part of many algorithms in pattern matching, molecular biology, computer-assisted music analysis and other fields. The paper offers an automata-theoretical description of Iliopoulos's ALL_BORDERS algorithm. The algorithm finds all borders of a string with don't care symbols. We show that the ALL_BORDERS algorithm is an implementation of a finite state transducer of a specific form. We describe how such a transducer can be constructed and what the input string should look like. The described transducer finds the set of lengths of all borders. Last but not least, we define approximate borders and show how to find all approximate borders of a string under the Hamming distance. Our solution to this problem is again based on transducers. This allows us to use an analogy with automata-based pattern matching methods. Finally, we discuss conditions under which the same principle can be used for other distance measures.
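The definitions above can be made concrete with a brute-force sketch. It is neither the ALL_BORDERS algorithm nor the transducer construction from the paper. It only compares each prefix with the suffix of the same length. The choice of `*` as the don't care symbol, the restriction to non-empty proper borders, and the reading of "approximate border" as "at most `k` mismatching positions" are assumptions made for illustration.

```python
from typing import List


def symbols_match(a: str, b: str, wildcard: str = "*") -> bool:
    """Two symbols match if they are equal or either one is a don't care."""
    return a == b or a == wildcard or b == wildcard


def hamming_mismatches(u: str, v: str, wildcard: str = "*") -> int:
    """Count mismatching positions between two strings of equal length."""
    return sum(1 for a, b in zip(u, v) if not symbols_match(a, b, wildcard))


def border_lengths(s: str, k: int = 0, wildcard: str = "*") -> List[int]:
    """Return lengths of non-empty proper borders of s.

    With k = 0 these are exact borders (up to don't care symbols);
    with k > 0 a prefix and suffix may differ in at most k positions.
    Runs in quadratic time, unlike the automata-based approach.
    """
    n = len(s)
    return [
        length
        for length in range(1, n)
        if hamming_mismatches(s[:length], s[n - length:], wildcard) <= k
    ]
```

For instance, `border_lengths("abacaba")` yields `[1, 3]` (the borders `a` and `aba`), and `border_lengths("abcabd", k=1)` yields `[1, 3]` although the string has no exact border.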


2019 ◽  
Vol 2 (1) ◽  
Author(s):  
Jeffrey Micher

We present a method for building a morphological generator from the output of an existing analyzer for Inuktitut, in the absence of a two-way finite state transducer which would normally provide this functionality. We make use of a sequence-to-sequence neural network which “translates” underlying Inuktitut morpheme sequences into surface character sequences. The neural network uses only the previous and the following morphemes as context. We report a morpheme accuracy of approximately 86%. We are able to increase this accuracy slightly by passing deep morphemes directly to the output for unknown morphemes. We do not see significant improvement when increasing the training data set size, and we postulate possible causes for this.


Author(s):  
Cyril Allauzen ◽  
Michael Riley ◽  
Johan Schalkwyk ◽  
Wojciech Skut ◽  
Mehryar Mohri

2010 ◽  
Author(s):  
Lluís-F. Hurtado ◽  
Joaquin Planells ◽  
Encarna Segarra ◽  
Emilio Sanchis ◽  
David Griol
