A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation

Author(s): Zhou GuoDong

2016 · Vol 23 (3) · pp. 235-266
Author(s): Mo Shen, Daisuke Kawahara, Sadao Kurohashi

2009 · Vol 35 (4) · pp. 505-512
Author(s): Zhongguo Li, Maosong Sun

We present a Chinese word segmentation model learned from punctuation marks, which are perfect word delimiters. The learning is aided by a manually segmented corpus. Our method is considerably more effective than previous methods at unknown word recognition, a step toward addressing one of the toughest problems in Chinese word segmentation.
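The core observation behind learning from punctuation can be sketched in a few lines: a punctuation mark is a guaranteed word boundary, so the character immediately before it must end a word and the character immediately after it must begin one, yielding reliable training labels without manual annotation. The function name, label names, and punctuation set below are illustrative assumptions, not the paper's implementation.

```python
# Punctuation marks are perfect word delimiters: characters adjacent
# to them receive certain position labels "for free".
PUNCT = set("，。！？；：、（）")  # illustrative subset of Chinese punctuation

def labels_from_punctuation(text):
    """Return (index, label) pairs that punctuation makes certain:
    the character before a mark ends a word ("END"); the character
    after a mark begins a word ("BEGIN")."""
    labels = []
    for i, ch in enumerate(text):
        if ch in PUNCT:
            if i > 0 and text[i - 1] not in PUNCT:
                labels.append((i - 1, "END"))
            if i + 1 < len(text) and text[i + 1] not in PUNCT:
                labels.append((i + 1, "BEGIN"))
    return labels

# "天气很好，我们出去" -- the comma at index 4 certifies that
# index 3 ends a word and index 5 begins one.
print(labels_from_punctuation("天气很好，我们出去"))
```

Labels harvested this way from large raw corpora can then supplement a manually segmented corpus when training a character-tagging segmenter.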


2019 · Vol 25 (2) · pp. 239-255
Author(s): Yuzhi Liang, Min Yang, Jia Zhu, S. M. Yiu

Abstract: Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words by spaces. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be cast as a part-of-speech (POS)-style tagging problem: we segment a corpus by assigning each character a label indicating its position within a word (e.g., "B" for word beginning, "E" for word end, etc.). Chinese word segmentation seems to be well studied; machine learning models such as conditional random fields (CRFs) and bi-directional long short-term memory (LSTM) networks have shown outstanding performance on this task. However, segmentation accuracy drops significantly when the same approaches are applied to out-of-domain cases, where high-quality in-domain training data are not available. An example of such an out-of-domain application is new word detection in Chinese microblogs, for which high-quality corpora are limited. In this paper, we focus on out-of-domain Chinese new word detection. We first design a new method, Edge Likelihood (EL), for Chinese word boundary detection. We then propose a domain-independent Chinese new word detector (DICND); in the proposed framework, each Chinese character is represented as a low-dimensional vector whose values are segmentation-related features of the character.
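The character-position tagging scheme described above can be made concrete with a short sketch using the common four-tag set B/M/E/S (begin, middle, end, single-character word); the tag set and the toy sentence are illustrative, not taken from the paper.

```python
# Convert between word segmentations and per-character position labels:
# B = word beginning, M = word middle, E = word end, S = single-char word.
def words_to_tags(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Recover a segmentation from characters and their position labels."""
    words, current = [], ""
    for ch, t in zip(chars, tags):
        current += ch
        if t in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that does not end on E/S
        words.append(current)
    return words

# Toy segmented sentence: 我 / 喜欢 / 自然语言
words = ["我", "喜欢", "自然语言"]
tags = words_to_tags(words)
print(tags)                                 # ['S', 'B', 'E', 'B', 'M', 'M', 'E']
print(tags_to_words("".join(words), tags))  # ['我', '喜欢', '自然语言']
```

A sequence labeler such as a CRF or a bi-directional LSTM then predicts one such tag per character, and the predicted tag sequence is decoded back into words.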


Author(s):  
Meishan Zhang ◽  
Guohong Fu ◽  
Nan Yu

State-of-the-art Chinese word segmentation systems typically exploit supervised models trained on a standard manually annotated corpus, achieving performance over 95% on a similar standard test corpus. However, performance may drop significantly when the same models are applied to Chinese microtext. One major challenge is the presence of informal words in microtext. Previous studies show that informal word detection can be helpful for microtext processing. In this work, we investigate it in the neural setting, proposing a joint segmentation model that simultaneously integrates the detection of informal words. In addition, we automatically generate a training corpus for the joint model from existing corpora. Experimental results show that the proposed model is highly effective for segmentation of Chinese microtext.


2015
Author(s): Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Xuanjing Huang
