Tibetan Unknown Word Identification from News Corpora for Supporting Lexicon-based Tibetan Word Segmentation

Author(s):  
Minghua Nuo ◽  
Huidan Liu ◽  
Congjun Long ◽  
Jian Wu
2011 ◽  
Vol 474-476 ◽  
pp. 460-465
Author(s):  
Bo Sun ◽  
Sheng Hui Huang ◽  
Xiao Hua Liu

Unknown word is a kind of word that is not included in the sub_word vocabulary, but must be cut out by the word segmentation program. Peoples’ names, place names and translated names are the major unknown words.Unknown Chinese words is a difficult problem in natural language processing, and also contributed to the low rate of correct segmention. This paper introduces the finite multi-list method that using the word fragments’ capability to composite a word and the location in the word tree to process the unknown Chinese words.The experiment recall is 70.67% ,the correct rate is 43.65% .The result of the experiment shows that unknown Chinese word identification based on the finite multi-list method is feasible.


2016 ◽  
Vol 23 (3) ◽  
pp. 235-266 ◽  
Author(s):  
Mo Shen ◽  
Daisuke Kawahara ◽  
Sadao Kurohashi

2011 ◽  
Vol 109 ◽  
pp. 612-616 ◽  
Author(s):  
Dun Li ◽  
Wei Tu ◽  
Lei Shi

New word identification is one of the difficult problems of the Chinese information processing. This paper presents a new method to identify new words. First of all, the text is segmented using N-Gram; then PPM is used to identify the new words which are in the text; finally, the new identified words are added to update the dictionary using LRU. Compared with three well-known word segmentation systems, the experimental results show that this method can improve the precision and recall rate of new word identification to a certain extent.


2013 ◽  
Vol 39 (1) ◽  
pp. 121-160 ◽  
Author(s):  
Yoav Goldberg ◽  
Michael Elhadad

We present a constituency parsing system for Modern Hebrew. The system is based on the PCFG-LA parsing method of Petrov et al. 2006 , which is extended in various ways in order to accommodate the specificities of Hebrew as a morphologically rich language with a small treebank. We show that parsing performance can be enhanced by utilizing a language resource external to the treebank, specifically, a lexicon-based morphological analyzer. We present a computational model of interfacing the external lexicon and a treebank-based parser, also in the common case where the lexicon and the treebank follow different annotation schemes. We show that Hebrew word-segmentation and constituency-parsing can be performed jointly using CKY lattice parsing. Performing the tasks jointly is effective, and substantially outperforms a pipeline-based model. We suggest modeling grammatical agreement in a constituency-based parser as a filter mechanism that is orthogonal to the grammar, and present a concrete implementation of the method. Although the constituency parser does not make many agreement mistakes to begin with, the filter mechanism is effective in fixing the agreement mistakes that the parser does make. These contributions extend outside of the scope of Hebrew processing, and are of general applicability to the NLP community. Hebrew is a specific case of a morphologically rich language, and ideas presented in this work are useful also for processing other languages, including English. The lattice-based parsing methodology is useful in any case where the input is uncertain. Extending the lexical coverage of a treebank-derived parser using an external lexicon is relevant for any language with a small treebank.


Sign in / Sign up

Export Citation Format

Share Document