A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation

Author(s): Zhou GuoDong

2016 · Vol 23 (3) · pp. 235-266
Author(s): Mo Shen, Daisuke Kawahara, Sadao Kurohashi

2009 · Vol 35 (4) · pp. 505-512
Author(s): Zhongguo Li, Maosong Sun

We present a Chinese word segmentation model learned from punctuation marks, which are perfect word delimiters. The learning is aided by a manually segmented corpus. Our method is considerably more effective than previous methods at unknown word recognition, a step toward addressing one of the toughest problems in Chinese word segmentation.
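The core observation behind learning from punctuation can be sketched in a few lines: a punctuation mark is a guaranteed word boundary, so the character immediately before it must end a word and the character immediately after it must begin one, yielding reliable training labels without manual annotation. The function name, label names, and punctuation set below are illustrative assumptions, not the paper's implementation.

```python
# Punctuation marks are perfect word delimiters: characters adjacent
# to them receive certain position labels "for free".
PUNCT = set("，。！？；：、（）")  # illustrative subset of Chinese punctuation

def labels_from_punctuation(text):
    """Return (index, label) pairs that punctuation makes certain:
    the character before a mark ends a word ("END"); the character
    after a mark begins a word ("BEGIN")."""
    labels = []
    for i, ch in enumerate(text):
        if ch in PUNCT:
            if i > 0 and text[i - 1] not in PUNCT:
                labels.append((i - 1, "END"))
            if i + 1 < len(text) and text[i + 1] not in PUNCT:
                labels.append((i + 1, "BEGIN"))
    return labels

# "天气很好，我们出去" -- the comma at index 4 certifies that
# index 3 ends a word and index 5 begins one.
print(labels_from_punctuation("天气很好，我们出去"))
```

Labels harvested this way from large raw corpora can then supplement a manually segmented corpus when training a character-tagging segmenter.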


2019 · Vol 25 (2) · pp. 239-255
Author(s): Yuzhi Liang, Min Yang, Jia Zhu, S. M. Yiu

Abstract: Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words by spaces. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be cast as a part-of-speech (POS)-style tagging problem: we segment a corpus by assigning each character a label indicating its position within a word (e.g., "B" for word beginning, "E" for word end, etc.). Chinese word segmentation seems to be well studied; machine learning models such as conditional random fields (CRFs) and bi-directional long short-term memory (LSTM) networks have shown outstanding performance on this task. However, segmentation accuracy drops significantly when the same approaches are applied to out-of-domain cases, where high-quality in-domain training data are not available. An example of such an out-of-domain application is new word detection in Chinese microblogs, for which high-quality corpora are limited. In this paper, we focus on out-of-domain Chinese new word detection. We first design a new method, Edge Likelihood (EL), for Chinese word boundary detection. We then propose a domain-independent Chinese new word detector (DICND); in the proposed framework, each Chinese character is represented as a low-dimensional vector whose values are segmentation-related features of the character.
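The character-position tagging scheme described above can be made concrete with a short sketch using the common four-tag set B/M/E/S (begin, middle, end, single-character word); the tag set and the toy sentence are illustrative, not taken from the paper.

```python
# Convert between word segmentations and per-character position labels:
# B = word beginning, M = word middle, E = word end, S = single-char word.
def words_to_tags(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Recover a segmentation from characters and their position labels."""
    words, current = [], ""
    for ch, t in zip(chars, tags):
        current += ch
        if t in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that does not end on E/S
        words.append(current)
    return words

# Toy segmented sentence: 我 / 喜欢 / 自然语言
words = ["我", "喜欢", "自然语言"]
tags = words_to_tags(words)
print(tags)                                 # ['S', 'B', 'E', 'B', 'M', 'M', 'E']
print(tags_to_words("".join(words), tags))  # ['我', '喜欢', '自然语言']
```

A sequence labeler such as a CRF or a bi-directional LSTM then predicts one such tag per character, and the predicted tag sequence is decoded back into words.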


Author(s):  
Meishan Zhang ◽  
Guohong Fu ◽  
Nan Yu

State-of-the-art Chinese word segmentation systems typically exploit supervised models trained on a standard manually annotated corpus, achieving performance over 95% on a similar standard test corpus. However, performance may drop significantly when the same models are applied to Chinese microtext. One major challenge is the presence of informal words in microtext. Previous studies show that informal word detection can be helpful for microtext processing. In this work, we investigate it in the neural setting, proposing a joint segmentation model that simultaneously integrates the detection of informal words. In addition, we automatically generate a training corpus for the joint model from existing corpora. Experimental results show that the proposed model is highly effective for segmentation of Chinese microtext.


2015
Author(s): Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Xuanjing Huang
