An Empirical Study of Automatic Chinese Word Segmentation for Spoken Language Understanding and Named Entity Recognition

Author(s):  
Wencan Luo ◽  
Fan Yang
2005 ◽  
Vol 31 (4) ◽  
pp. 531-574 ◽  
Author(s):  
Jianfeng Gao ◽  
Mu Li ◽  
Chang-Ning Huang ◽  
Andi Wu

This article presents a pragmatic approach to Chinese word segmentation. It differs from most previous approaches mainly in three respects. First, while theoretical linguists have defined Chinese words using various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Second, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e., morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard that is application-independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different natural language processing applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg, which will be described in detail. It consists of two components: (1) a generic segmenter that is based on the framework of linear mixture models and provides a unified approach to the five fundamental features of word-level Chinese language processing: lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification; and (2) a set of output adaptors for adapting the output of (1) to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets.
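The unified treatment described in this abstract, where lexicon words and unknown words compete as candidates under one linear score, can be illustrated with a minimal sketch. This is not MSRSeg: the lexicon, the scores, the weights, the factoid pattern, and all function names below are hypothetical values chosen for illustration, and the search is a plain dynamic program over word candidates.

```python
import math
import re

# Hypothetical toy lexicon with made-up log-scores (not taken from the article).
LEXICON = {
    "我": -2.0, "们": -4.0, "我们": -2.5,
    "在": -2.0, "北京": -3.0, "北": -5.0, "京": -5.0,
}

# Hypothetical weights of the linear model, one per word class.
WEIGHTS = {"lexicon": 1.0, "factoid": 1.0, "unknown": 1.0}

# Assumed factoid pattern: digits optionally followed by the character for "year".
FACTOID = re.compile(r"[0-9]+年?")
FACTOID_SCORE = -3.0
UNKNOWN_SCORE = -10.0
MAX_LEN = 6  # longest candidate word considered


def candidates(text, start):
    """Yield (end, word_class, score) for each candidate word starting at `start`."""
    for end in range(start + 1, min(len(text), start + MAX_LEN) + 1):
        piece = text[start:end]
        if piece in LEXICON:
            yield end, "lexicon", LEXICON[piece]
        elif FACTOID.fullmatch(piece):
            yield end, "factoid", FACTOID_SCORE
    # Fallback so that every position can be covered by some candidate.
    yield start + 1, "unknown", UNKNOWN_SCORE


def segment(text):
    """Return the highest-scoring list of (word, word_class) pairs."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any segmentation of text[:i]
    back = [None] * (n + 1)       # back[i]: (start, word_class) of the last word
    best[0] = 0.0
    for start in range(n):
        for end, cls, score in candidates(text, start):
            total = best[start] + WEIGHTS[cls] * score
            if total > best[end]:
                best[end] = total
                back[end] = (start, cls)
    words, pos = [], n
    while pos > 0:
        start, cls = back[pos]
        words.append((text[start:pos], cls))
        pos = start
    return words[::-1]


print(segment("我们2005年在北京"))
# [('我们', 'lexicon'), ('2005年', 'factoid'), ('在', 'lexicon'), ('北京', 'lexicon')]
```

Because every candidate, whatever its class, is scored on the same scale, known-word segmentation and unknown-word detection are resolved in a single search rather than in separate passes. An application-specific output adaptor, as mentioned in the abstract, would be a separate post-processing step and is not sketched here.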


2020 ◽  
Vol 16 (3) ◽  
pp. 87-107
Author(s):  
Na Deng ◽  
Caiquan Xiong

In the retrieval and mining of traditional Chinese medicine (TCM) patents, Chinese word segmentation and named entity recognition are key steps. However, the many aliases of traditional Chinese medicines pose great challenges to both tasks in TCM patents, which directly degrades the results of patent mining. Because a comprehensive thesaurus of Chinese herbal medicine names is lacking, traditional thesaurus-based Chinese word segmentation and named entity recognition are not suitable for medicine identification in TCM patents. To address this, a modified and serialized co-training method is proposed that uses the linguistic and structural characteristics of TCM patent texts to recognize medicine names in TCM patent abstracts. Experiments show that this method maintains high accuracy with relatively low time complexity. In addition, the method can be extended to the recognition of other named entities in TCM patents, such as disease names and preparation methods.


2018 ◽  
Vol 27 (03) ◽  
pp. 1850009
Author(s):  
Changsu Lee ◽  
Youngjoong Ko

Intelligent personal assistant software, such as Apple’s Siri and Samsung’s S-Voice, is now widely used. One of the core modules of this kind of software is the spoken language understanding (SLU) module, which predicts the user’s intention in order to determine the system’s actions. The SLU module usually consists of several recognition components connected in a pipeline framework. In contrast, the proposed SLU module uses a novel technique that simultaneously recognizes four components, namely named entity, speech-act, target, and operation, using conditional random fields. In the experiments, the proposed simultaneous recognition technique achieved a relative improvement of up to approximately 2.2% and was approximately 15% faster than a pipeline framework. A significance test showed that the improvement was statistically significant (p-value smaller than 0.01).

