Entry Generation by Analogy – Encoding New Words for Morphological Lexicons

Language software applications regularly encounter new words, e.g., acronyms, technical terminology, loan words, names, or compounds of such words. To add a new word to a lexicon, we need to indicate its base form and inflectional paradigm. In this article, we evaluate a combination of corpus-based and lexicon-based methods for assigning the base form and inflectional paradigm to new words in Finnish, Swedish and English finite-state transducer lexicons. The methods have been implemented with the open-source Helsinki Finite-State Technology (Lindén et al., 2009). As an entry generator often produces numerous suggestions, it is important that the best suggestions be among the first few; otherwise it may be more efficient to create the entries by hand. By combining the probabilities calculated from corpus data and from lexical data, we get a more precise combined model. The combined method has 77-81% precision and 89-97% recall, i.e., the first correctly generated entry is on average found as the first or second candidate for the test languages. A further study demonstrated that a native speaker could revise suggestions from the entry generator at a speed of 300-400 entries per hour.
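The abstract does not say how the two probability sources are combined. As a minimal sketch only, assuming the simplest possible combination (a product of the two probabilities with a small floor for unseen candidates), ranking entry candidates could look like the following. The function name, the floor value, and the paradigm labels are hypothetical and are not taken from the article or from Helsinki Finite-State Technology.

```python
from typing import Dict, List, Tuple

# A candidate entry is a (base form, paradigm label) pair.
# The paradigm labels used below are hypothetical placeholders.
Candidate = Tuple[str, str]


def rank_candidates(
    lexicon_prob: Dict[Candidate, float],
    corpus_prob: Dict[Candidate, float],
    floor: float = 1e-6,
) -> List[Tuple[Candidate, float]]:
    """Rank candidates by the product of two probability estimates.

    Assumption: the two sources are treated as independent, and a
    candidate missing from one source receives the small `floor` value
    instead of zero so that it is not eliminated outright.
    """
    candidates = set(lexicon_prob) | set(corpus_prob)
    scored = [
        (c, lexicon_prob.get(c, floor) * corpus_prob.get(c, floor))
        for c in candidates
    ]
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Two competing analyses of an unknown word form (illustrative numbers).
lexicon = {("blog", "NOUN_A"): 0.6, ("blogs", "NOUN_B"): 0.4}
corpus = {("blog", "NOUN_A"): 0.7, ("blogs", "NOUN_B"): 0.1}
ranking = rank_candidates(lexicon, corpus)
best_candidate = ranking[0][0]
```

With a ranking of this kind, the quality criterion described in the abstract amounts to how often the correct entry appears among the first few items of `ranking`.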

Token Validation in Automatic Corpus Gathering for Yorùbá Language

FUOYE Journal of Engineering and Technology ◽

10.46792/fuoyejet.v2i1.85 ◽

2017 ◽

Vol 2 (1) ◽

Cited By ~ 2

Author(s):

Lawrence B Adewole ◽

Adebayo O Adetunmbi ◽

Boniface K Alese ◽

Samuel A Oluwadare

Keyword(s):

Finite State Machine ◽

State Machine ◽

The Internet ◽

Web Pages ◽

Goal State ◽

New Words ◽

Base Form ◽

Processing Step ◽

Finite State

Recent methodologies in machine translation depend on the availability of large language corpora. The web, being a repository for text and other multimedia content, is a viable source for such data. However, text cleaning is needed as a pre-processing step, since foreign words are inevitably part of the harvested text. A dictionary lookup approach can be adopted for languages with a comprehensive lexicon, while manual cleaning is applied in other cases. Developing a full-coverage lexicon for the Yorùbá language is a cumbersome task because new words can be formed as a result of elision, assimilation and contraction. In this paper, the morphology of the Yorùbá language was studied and modelled as a Finite State Machine (FSM) which accepts a word and returns true if the goal state is reached and false otherwise. The FSM model was implemented in Java. A Yorùbá dictionary containing 10,443 distinct words in their base form (i.e. without diacritics) and an English dictionary with 64,150 distinct words were passed through the finite state machine. In addition, 58 web pages sourced from the internet were subjected to classification by the system. Classification of entries from the Yorùbá dictionary as valid Yorùbá words gave 99.99% accuracy, while classification of entries from the English dictionary as non-Yorùbá words gave 94.07% accuracy. Also, using a threshold of 90% valid Yorùbá words in a web page, all 58 web pages were correctly classified. The results obtained show that the approach can reliably be applied in automatic harvesting of a Yorùbá monolingual corpus from the internet.
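The abstract does not give the states or transitions of the authors' machine (which was written in Java), so the following Python sketch is only a toy illustration of the two ideas it describes: a word-level FSM that returns true when an accepting state is reached, and a page-level decision using the 90% threshold. The transition rules are a deliberately simplified assumption (a reduced alphabet without c, q, v, x, z; every consonant followed by a vowel, except the digraph "gb" and a nasal "n"), and all function names are hypothetical.

```python
import re
import unicodedata

VOWELS = set("aeiou")
CONSONANTS = set("bdfghjklmnprstwy")  # assumed alphabet: no c, q, v, x, z
ACCEPTING = {"vowel", "nasal"}        # assumed goal states
THRESHOLD = 0.90                      # page-level threshold from the abstract


def strip_diacritics(text: str) -> str:
    """Lowercase the text and drop combining marks (tone marks, under-dots)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    ).lower()


def step(state, ch):
    """Return the next state, or None if the toy machine rejects."""
    if ch in VOWELS:
        return "vowel"
    if ch not in CONSONANTS:
        return None                   # letter outside the assumed alphabet
    if state == "cons":
        return None                   # a plain consonant must precede a vowel
    if state == "g":
        return "cons" if ch == "b" else None  # only the digraph "gb"
    if state == "nasal" and ch == "n":
        return None
    if ch == "g":
        return "g"
    if ch == "n":
        return "nasal"
    return "cons"


def accepts(word: str) -> bool:
    """True if the toy FSM ends in an accepting state for this word."""
    state = "start"
    for ch in strip_diacritics(word):
        state = step(state, ch)
        if state is None:
            return False
    return state in ACCEPTING


def is_valid_page(text: str, threshold: float = THRESHOLD) -> bool:
    """True if at least `threshold` of the page's tokens are accepted."""
    tokens = re.findall(r"[a-z]+", strip_diacritics(text))
    if not tokens:
        return False
    valid = sum(1 for token in tokens if accepts(token))
    return valid / len(tokens) >= threshold
```

A toy machine like this also accepts some English words that happen to fit the pattern, which is in line with the abstract reporting less than perfect accuracy on the English dictionary.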

BORDERS AND FINITE AUTOMATA

International Journal of Foundations of Computer Science ◽

10.1142/s0129054107005029 ◽

2007 ◽

Vol 18 (04) ◽

pp. 859-871

Author(s):

MARTIN ŠIMŮNEK ◽

BOŘIVOJ MELICHAR

Keyword(s):

Pattern Matching ◽

Hamming Distance ◽

Finite Automata ◽

Music Analysis ◽

Theoretical Description ◽

Specific Form ◽

Distance Measures ◽

Computer Assisted ◽

Finite State ◽

Finite State Transducer

A border of a string is a prefix of the string that is simultaneously its suffix. It is one of the basic building blocks of stringology, used as part of many algorithms in pattern matching, molecular biology, computer-assisted music analysis and other areas. The paper offers an automata-theoretical description of Iliopoulos's ALL_BORDERS algorithm, which finds all borders of a string with don't care symbols. We show that the ALL_BORDERS algorithm is an implementation of a finite state transducer of a specific form. We describe how such a transducer can be constructed and what the input string should look like. The described transducer finds the set of lengths of all borders. Last but not least, we define approximate borders and show how to find all approximate borders of a string under the Hamming distance. Our solution to this problem is again based on transducers, which allows us to use an analogy with automata-based pattern matching methods. Finally, we discuss conditions under which the same principle can be used for other distance measures.
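To make the definitions concrete, here is a minimal sketch that computes border lengths for a plain string (no don't care symbols). It does not implement ALL_BORDERS or the transducer construction from the paper: exact borders are found with the standard prefix function known from Knuth-Morris-Pratt matching, and approximate borders with a naive check of every length. Only proper, non-empty borders are reported, and the function names are hypothetical.

```python
from typing import List


def border_lengths(s: str) -> List[int]:
    """Lengths of all proper, non-empty borders of s, in increasing order."""
    n = len(s)
    if n == 0:
        return []
    # pi[i] = length of the longest proper border of s[:i + 1]
    pi = [0] * n
    for i in range(1, n):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    # Every border of s is a border of its longest border, so following
    # the chain pi[n-1], pi[pi[n-1]-1], ... enumerates all of them.
    lengths = []
    k = pi[n - 1]
    while k > 0:
        lengths.append(k)
        k = pi[k - 1]
    return sorted(lengths)


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def approximate_border_lengths(s: str, k: int) -> List[int]:
    """Lengths l such that the prefix and suffix of length l differ in at
    most k positions (naive check, quadratic in len(s))."""
    n = len(s)
    return [
        length
        for length in range(1, n)
        if hamming(s[:length], s[n - length:]) <= k
    ]
```

For example, `border_lengths("abacaba")` returns `[1, 3]`, corresponding to the borders "a" and "aba". With `k = 0` the approximate version reduces to the exact one; for "abaab", allowing one mismatch adds length 1 to the exact result `[2]`.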

Hidden Semi-Markov Model Based Speech Recognition System using Weighted Finite-State Transducer

2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings ◽

10.1109/icassp.2006.1659950 ◽

2006 ◽

Cited By ~ 5

Author(s):

K. Oura ◽

Heiga Zen ◽

Y. Nankaku ◽

Akinobu Lee ◽

K. Tokuda

Keyword(s):

Speech Recognition ◽

Markov Model ◽

Recognition System ◽

Speech Recognition System ◽

Model Based ◽

Finite State ◽

Finite State Transducer

Bootstrapping a Neural Morphological Generator from Morphological Analyzer Output for Inuktitut

10.33011/computel.v2i.455 ◽

2019 ◽

Vol 2 (1) ◽

Author(s):

Jeffrey Micher

Keyword(s):

Neural Network ◽

Training Data ◽

Data Set ◽

Set Size ◽

The Neural Network ◽

Surface Character ◽

Finite State ◽

Character Sequences ◽

Finite State Transducer

We present a method for building a morphological generator from the output of an existing analyzer for Inuktitut, in the absence of a two-way finite state transducer, which would normally provide this functionality. We use a sequence-to-sequence neural network that "translates" underlying Inuktitut morpheme sequences into surface character sequences. The neural network uses only the previous and the following morphemes as context. We report a morpheme accuracy of approximately 86%. We are able to increase this accuracy slightly by passing the deep (underlying) form of unknown morphemes directly to the output. We do not see significant improvement when increasing the training data set size, and we postulate possible causes for this.
