Segmenting Chinese Microtext: Joint Informal-Word Detection and Segmentation with Neural Networks

Author(s):  
Meishan Zhang ◽  
Guohong Fu ◽  
Nan Yu

State-of-the-art Chinese word segmentation systems typically exploit supervised models trained on a standard manually annotated corpus, achieving performance above 95% on a similar standard test corpus. However, performance may drop significantly when the same models are applied to Chinese microtext. One major challenge is the presence of informal words in microtext. Previous studies show that informal word detection can be helpful for microtext processing. In this work, we investigate it in the neural setting by proposing a joint segmentation model that simultaneously detects informal words. In addition, we automatically generate a training corpus for the joint model using existing corpora. Experimental results show that the proposed model is highly effective for the segmentation of Chinese microtext.
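As an illustration of what "joint" segmentation and informal-word detection can mean at the data level, the sketch below encodes both decisions in a single per-character tag. This is an assumption made for illustration only, not the model or label scheme described in the paper: the B/M/E/S boundary tags, the `-I`/`-F` suffixes, the function names `encode_joint_tags` and `decode_joint_tags`, and the example sentence are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical joint label scheme (an assumption, not the paper's scheme):
#   boundary part: B (begin), M (middle), E (end), S (single-character word)
#   suffix part:   I (informal word), F (formal word)


def encode_joint_tags(words: List[Tuple[str, bool]]) -> List[Tuple[str, str]]:
    """Map (word, is_informal) pairs to (character, joint_tag) pairs."""
    tagged: List[Tuple[str, str]] = []
    for word, is_informal in words:
        suffix = "I" if is_informal else "F"
        if len(word) == 1:
            bounds = ["S"]
        else:
            bounds = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        tagged.extend((ch, f"{b}-{suffix}") for ch, b in zip(word, bounds))
    return tagged


def decode_joint_tags(tagged: List[Tuple[str, str]]) -> List[Tuple[str, bool]]:
    """Recover (word, is_informal) pairs from (character, joint_tag) pairs."""
    words: List[Tuple[str, bool]] = []
    current = ""
    suffix = "F"
    for ch, tag in tagged:
        bound, suffix = tag.split("-")
        current += ch
        if bound in ("E", "S"):  # a word ends at this character
            words.append((current, suffix == "I"))
            current = ""
    if current:  # keep a dangling, unterminated word rather than drop it
        words.append((current, suffix == "I"))
    return words


# Illustrative sentence: "稀饭" is used here as an informal spelling of "喜欢".
example = [("我", False), ("稀饭", True), ("你", False)]
tags = encode_joint_tags(example)
```

With a scheme like this, a single sequence labeller predicts word boundaries and informality together, so evidence for one decision can inform the other. A pipeline that segments first and detects informal words afterwards would not share that information.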

2014 ◽  
Vol 513-517 ◽  
pp. 683-686 ◽  
Author(s):  
Dai Yuan Zhang ◽  
Yan Xu

With the continuous development of information technology, word segmentation has become an important step in conveniently handling the growing volume of information. Unlike English, where words are separated by spaces, Chinese text is written continuously with no spaces between words, which makes word segmentation difficult. In this article, we analyze the different properties of words to derive an encoding that can be used for training and combine it with the first kind of spline weight function neural network. By training on a large number of existing encoded rules, we obtain a learning method that can segment sentences correctly.


2015 ◽  
Vol 41 (1) ◽  
pp. 119-147 ◽  
Author(s):  
Wenbin Jiang ◽  
Yajuan Lü ◽  
Liang Huang ◽  
Qun Liu

Manually annotated corpora are indispensable resources, yet for many annotation tasks, such as the creation of treebanks, there exist multiple corpora with different and incompatible annotation guidelines. This leads to an inefficient use of human expertise, but it could be remedied by integrating knowledge across corpora with different annotation guidelines. In this article we describe the problem of annotation adaptation and the intrinsic principles of the solutions, and present a series of successively enhanced models that can automatically adapt the divergence between different annotation formats. We evaluate our algorithms on the tasks of Chinese word segmentation and dependency parsing. For word segmentation, where there are no universal segmentation guidelines because of the lack of morphology in Chinese, we perform annotation adaptation from the much larger People's Daily corpus to the smaller but more popular Penn Chinese Treebank. For dependency parsing, we perform annotation adaptation from the Penn Chinese Treebank to a semantics-oriented Dependency Treebank, which is annotated using significantly different annotation guidelines. In both experiments, automatic annotation adaptation brings significant improvement, achieving state-of-the-art performance despite the use of purely local features in training.


2014 ◽  
Vol 602-605 ◽  
pp. 3469-3473
Author(s):  
Fan Jin Mai ◽  
Shi Tong Wu ◽  
Lai Yue Wang

In order to cope with the increasing size of the training corpus and to meet the requirements of incremental learning, this paper introduces a feature selection algorithm for the maximum entropy model into Chinese word segmentation research, and designs and implements a Chinese word segmentation system based on incremental learning. The experimental results show that the system gradually improves its segmentation accuracy during incremental learning without spending time on retraining.

