Tibetan Unknown Word Identification from News Corpora for Supporting Lexicon-based Tibetan Word Segmentation

Identification of Chinese Unknown Word Based on Finite Multi-List Method

Key Engineering Materials ◽

10.4028/www.scientific.net/kem.474-476.460 ◽

2011 ◽

Vol 474-476 ◽

pp. 460-465

Author(s):

Bo Sun ◽

Sheng Hui Huang ◽

Xiao Hua Liu

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Word Identification ◽

Difficult Problem ◽

Word Segmentation ◽

Chinese Word ◽

Unknown Word ◽

List Method ◽

Low Rate

An unknown word is a word that is not included in the sub_word vocabulary but must still be cut out by the word segmentation program. Personal names, place names and translated names are the major types of unknown words. Unknown Chinese words are a difficult problem in natural language processing and contribute to the low rate of correct segmentation. This paper introduces the finite multi-list method, which uses a word fragment's capability to compose a word and its location in the word tree to process unknown Chinese words. In the experiment, recall is 70.67% and the correct rate is 43.65%. The results show that unknown Chinese word identification based on the finite multi-list method is feasible.
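To make the notion of an unknown word concrete, the following is a minimal sketch of how a lexicon-based segmenter breaks an out-of-vocabulary name into single-character fragments. It assumes forward maximum matching as the segmenter, which the abstract does not specify, and it is not the finite multi-list method itself. The function names, the toy lexicon and the `max_len` parameter are all hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a fixed lexicon (assumed baseline)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def unknown_fragments(tokens, lexicon):
    """Join runs of two or more adjacent single characters that are not in the lexicon."""
    runs, current = [], []
    for tok in tokens:
        if len(tok) == 1 and tok not in lexicon:
            current.append(tok)
        else:
            if len(current) > 1:
                runs.append("".join(current))
            current = []
    if len(current) > 1:
        runs.append("".join(current))
    return runs


# Toy lexicon (hypothetical): the personal name 王小明 is deliberately missing.
lexicon = {"在", "北京", "工作"}
tokens = forward_max_match("王小明在北京工作", lexicon)
candidates = unknown_fragments(tokens, lexicon)
```

Here the name is split into three single characters because no lexicon entry covers it, and the run of fragments becomes a candidate unknown word. A real identifier would then have to decide which such runs actually form words.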


Chinese Unknown Word Identification Based on Local Bigram Model

International Journal of Computer Processing Of Languages ◽

10.1142/s0219427905001286 ◽

2005 ◽

Vol 18 (03) ◽

pp. 185-196 ◽

Cited By ~ 3

Author(s):

ZHUORAN WANG ◽

TING LIU

Keyword(s):

Word Identification ◽

Unknown Word


Chinese Word Segmentation and Unknown Word Extraction by Mining Maximized Substring

Journal of Natural Language Processing ◽

10.5715/jnlp.23.235 ◽

2016 ◽

Vol 23 (3) ◽

pp. 235-266 ◽

Cited By ~ 2

Author(s):

Mo Shen ◽

Daisuke Kawahara ◽

Sadao Kurohashi

Keyword(s):

Word Segmentation ◽

Chinese Word ◽

Chinese Word Segmentation ◽

Unknown Word


Chinese unknown word identification using character-based tagging and chunking

Proceedings of the conference on SIGGRAPH 2004 course notes - GRAPH '04 ◽

10.3115/1075178.1075215 ◽

2003 ◽

Cited By ~ 5

Author(s):

Goh Chooi Ling ◽

Masayuki Asahara ◽

Yuji Matsumoto

Keyword(s):

Word Identification ◽

Unknown Word


Chinese New Word Identification Using N-Gram and PPM Models

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.109.612 ◽

2011 ◽

Vol 109 ◽

pp. 612-616 ◽

Cited By ~ 1

Author(s):

Dun Li ◽

Wei Tu ◽

Lei Shi

Keyword(s):

Information Processing ◽

Word Identification ◽

Recall Rate ◽

Word Segmentation ◽

Experimental Results ◽

New Method ◽

New Words ◽

Chinese Information Processing ◽

N Gram

New word identification is one of the difficult problems in Chinese information processing. This paper presents a new method for identifying new words. First, the text is segmented using an N-gram model; then PPM is used to identify the new words in the text; finally, the newly identified words are added to the dictionary, which is updated using LRU. Compared with three well-known word segmentation systems, the experimental results show that this method can improve the precision and recall rate of new word identification to a certain extent.
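The last step, updating the dictionary with LRU, can be sketched as a bounded lexicon that evicts the least recently used entry when a new word arrives. This is only an illustration under stated assumptions: the abstract does not give the capacity, what counts as a "use", or the data structure, so the class name `LRULexicon`, its `touch` method and the capacity of 3 are hypothetical.

```python
from collections import OrderedDict


class LRULexicon:
    """Bounded word list that drops the least recently used entry when full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._words = OrderedDict()  # oldest use first, most recent use last

    def __contains__(self, word):
        return word in self._words

    def touch(self, word):
        """Record a use of `word`, inserting it if new.

        Returns the evicted word if the insertion pushed the lexicon over
        capacity, otherwise None.
        """
        if word in self._words:
            self._words.move_to_end(word)  # mark as most recently used
            return None
        self._words[word] = True
        if len(self._words) > self.capacity:
            evicted, _ = self._words.popitem(last=False)  # drop oldest entry
            return evicted
        return None

    def words(self):
        """Words ordered from least to most recently used."""
        return list(self._words)


# Hypothetical stream of newly identified words; 微博 is seen twice.
lex = LRULexicon(capacity=3)
evictions = [lex.touch(w) for w in ["微博", "网购", "微博", "团购", "给力"]]
```

Because 微博 is used again before the lexicon fills up, it survives, and 网购 is the entry evicted when 给力 is added.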


Unknown word identification for science and technology project based on rule model

Future Communication Technology ◽

10.2495/icct130681 ◽

2014 ◽

Author(s):

Yuehua Lv ◽

Jianhai Lin ◽

Renke Wu

Keyword(s):

Word Identification ◽

Science And Technology ◽

Unknown Word ◽

Technology Project ◽

Rule Model ◽

Science And Technology Project


Word Segmentation, Unknown-word Resolution, and Morphological Agreement in a Hebrew Parsing System

Computational Linguistics ◽

10.1162/coli_a_00137 ◽

2013 ◽

Vol 39 (1) ◽

pp. 121-160 ◽

Cited By ~ 7

Author(s):

Yoav Goldberg ◽

Michael Elhadad

Keyword(s):

Computational Model ◽

Word Segmentation ◽

General Applicability ◽

Unknown Word ◽

Grammatical Agreement ◽

Hebrew Word ◽

Language Resource ◽

Modern Hebrew ◽

The Common ◽

Filter Mechanism

We present a constituency parsing system for Modern Hebrew. The system is based on the PCFG-LA parsing method of Petrov et al. (2006), which is extended in various ways in order to accommodate the specificities of Hebrew as a morphologically rich language with a small treebank. We show that parsing performance can be enhanced by utilizing a language resource external to the treebank, specifically, a lexicon-based morphological analyzer. We present a computational model of interfacing the external lexicon and a treebank-based parser, also in the common case where the lexicon and the treebank follow different annotation schemes. We show that Hebrew word-segmentation and constituency-parsing can be performed jointly using CKY lattice parsing. Performing the tasks jointly is effective, and substantially outperforms a pipeline-based model. We suggest modeling grammatical agreement in a constituency-based parser as a filter mechanism that is orthogonal to the grammar, and present a concrete implementation of the method. Although the constituency parser does not make many agreement mistakes to begin with, the filter mechanism is effective in fixing the agreement mistakes that the parser does make. These contributions extend outside of the scope of Hebrew processing, and are of general applicability to the NLP community. Hebrew is a specific case of a morphologically rich language, and ideas presented in this work are useful also for processing other languages, including English. The lattice-based parsing methodology is useful in any case where the input is uncertain. Extending the lexical coverage of a treebank-derived parser using an external lexicon is relevant for any language with a small treebank.
