Improving Thai word segmentation with Named Entity Recognition

This article presents a pragmatic approach to Chinese word segmentation. It differs from most previous approaches mainly in three respects. First, while theoretical linguists have defined Chinese words using various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Second, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e., morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard that is application-independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different natural language processing applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg, which will be described in detail. It consists of two components: (1) a generic segmenter that is based on the framework of linear mixture models and provides a unified approach to the five fundamental features of word-level Chinese language processing: lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification; and (2) a set of output adaptors for adapting the output of (1) to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets.
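The unified idea in the abstract, where known words and several kinds of unknown words compete in a single search, can be illustrated with a small sketch. This is not MSRSeg itself: the lexicon, the log-scores, the per-class weights, and the function names below are all hypothetical, and unspaced English stands in for unsegmented text. Each candidate word class contributes a weighted score (a linear mixture), and dynamic programming picks the best-scoring segmentation.

```python
import math
import re

# Hypothetical toy lexicon with made-up log-scores (not from the article).
LEXICON = {"new": -2.0, "york": -2.5, "newyork": -6.0, "in": -1.5}
FACTOID = re.compile(r"\d+")  # digit strings stand in for the "factoid" class

# Assumed mixture weights, one per candidate word class.
WEIGHTS = {"lexicon": 1.0, "factoid": 0.8, "unknown": 1.0}


def candidates(text, start, max_len=8):
    """Yield (end, word_class, class_score) for substrings beginning at start."""
    for end in range(start + 1, min(len(text), start + max_len) + 1):
        piece = text[start:end]
        if piece in LEXICON:
            yield end, "lexicon", LEXICON[piece]
        if FACTOID.fullmatch(piece):
            yield end, "factoid", -3.0
    # Fallback: any single character may be an unlisted word, at a high cost.
    yield start + 1, "unknown", -10.0


def segment(text):
    """Return the best-scoring segmentation as a list of (word, class) pairs."""
    # best[i] = (score of best path ending at i, back-pointer (start, class))
    best = [(-math.inf, None)] * (len(text) + 1)
    best[0] = (0.0, None)
    for start in range(len(text)):
        for end, cls, score in candidates(text, start):
            total = best[start][0] + WEIGHTS[cls] * score
            if total > best[end][0]:
                best[end] = (total, (start, cls))
    # Follow the back-pointers from the end of the text to the beginning.
    words, pos = [], len(text)
    while pos > 0:
        start, cls = best[pos][1]
        words.append((text[start:pos], cls))
        pos = start
    return words[::-1]


print(segment("newyorkin2005"))
```

Because every class is scored in the same search, a digit string such as "2005" is recognized as a factoid in the same pass that splits the lexicon words, rather than in a separate post-processing step. An output adaptor, in the sense of the abstract, would then rewrite this result for a particular application standard, for example by merging "new" and "york" into one unit.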

Isarn Dharma Word Segmentation Using a Statistical Approach with Named Entity Recognition

ACM Transactions on Asian and Low-Resource Language Information Processing, 2020, Vol 19 (2), pp. 1-16

DOI: 10.1145/3359990

Cited By ~ 1

Author(s): Sittichai Somsap, Pusadee Seresangtakul

Keyword(s): Statistical Approach, Named Entity Recognition, Word Segmentation, Entity Recognition, Named Entity

Neural Chinese Named Entity Recognition via CNN-LSTM-CRF and Joint Training with Word Segmentation

The World Wide Web Conference (WWW '19), 2019

DOI: 10.1145/3308558.3313743

Cited By ~ 6

Author(s): Fangzhao Wu, Junxin Liu, Chuhan Wu, Yongfeng Huang, Xing Xie

Keyword(s): Named Entity Recognition, Word Segmentation, Entity Recognition, Named Entity, Joint Training

Simultaneous Character-Cluster-Based Word Segmentation and Named Entity Recognition in Thai Language

Knowledge, Information, and Creativity Support Systems - Lecture Notes in Computer Science, 2011, pp. 216-225

DOI: 10.1007/978-3-642-24788-0_20

Cited By ~ 2

Author(s): Nattapong Tongtep, Thanaruk Theeramunkong

Keyword(s): Named Entity Recognition, Word Segmentation, Entity Recognition, Thai Language, Named Entity

Correcting Word Segmentation and Part-of-Speech Tagging Errors for Chinese Named Entity Recognition

The Internet Challenge: Technology and Applications, 2002, pp. 29-36

DOI: 10.1007/978-94-010-0494-7_4

Cited By ~ 1

Author(s): Tianfang Yao, Wei Ding, Gregor Erbach

Keyword(s): Named Entity Recognition, Word Segmentation, Entity Recognition, Named Entity, Part Of Speech Tagging, Part Of Speech, Speech Tagging

Serialized Co-Training-Based Recognition of Medicine Names for Patent Mining and Retrieval

International Journal of Data Warehousing and Mining, 2020, Vol 16 (3), pp. 87-107

DOI: 10.4018/ijdwm.2020070105

Author(s): Na Deng, Caiquan Xiong

Keyword(s): Structural Characteristics, Named Entity Recognition, Word Segmentation, Entity Recognition, Chinese Word, Chinese Word Segmentation, Preparation Methods, Named Entity, Patent Mining, Language Characteristics

In the retrieval and mining of traditional Chinese medicine (TCM) patents, a key step is Chinese word segmentation and named entity recognition. However, the many aliases of traditional Chinese medicines pose a serious challenge to both tasks in TCM patents, which directly degrades the results of patent mining. Because no comprehensive thesaurus of Chinese herbal medicine names exists, traditional thesaurus-based Chinese word segmentation and named entity recognition are not suitable for identifying medicines in TCM patents. To address this, a modified and serialized co-training method is proposed that uses the language characteristics and structural characteristics of TCM patent texts to recognize medicine names in TCM patent abstracts. Experiments show that this method maintains high accuracy at relatively low time complexity. The method can also be extended to other named entities in TCM patents, such as disease names and preparation methods.
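The abstract does not spell out the modified and serialized variant, so the following is only a minimal sketch of generic co-training, the technique it builds on. Everything here is an assumption made for illustration: the two views (a name suffix and a context cue word), the count-based classifiers, the confidence threshold, and the English toy data. Two classifiers, one per view, are trained on a small seed set; confident predictions from either view are added to the labeled set, so each view gradually learns from examples the other view labeled.

```python
from collections import Counter

# Hypothetical seed examples: ((name_suffix, context_cue), label).
SEED = [
    (("root", "comprises"), "medicine"),
    (("itis", "treats"), "disease"),
]
# Hypothetical unlabeled candidates: (name_suffix, context_cue).
UNLABELED = [
    ("root", "contains"),
    ("bark", "contains"),
    ("itis", "cures"),
    ("oma", "cures"),
]


def train(examples):
    """Count how often each feature value co-occurs with each label."""
    counts = {}
    for feature, label in examples:
        counts.setdefault(feature, Counter())[label] += 1
    return counts


def predict(counts, feature):
    """Return (label, confidence), or (None, 0.0) for an unseen feature."""
    if feature not in counts:
        return None, 0.0
    label, n = counts[feature].most_common(1)[0]
    return label, n / sum(counts[feature].values())


def co_train(labeled, unlabeled, rounds=3, threshold=0.9):
    """Grow the labeled set using two single-view classifiers."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model_a = train([(x[0], y) for x, y in labeled])  # suffix view
        model_b = train([(x[1], y) for x, y in labeled])  # context view
        remaining = []
        for x in pool:
            label_a, conf_a = predict(model_a, x[0])
            label_b, conf_b = predict(model_b, x[1])
            # Trust whichever view is more confident about this example.
            label, conf = (label_a, conf_a) if conf_a >= conf_b else (label_b, conf_b)
            if conf >= threshold:
                labeled.append((x, label))
            else:
                remaining.append(x)
        if len(remaining) == len(pool):  # nothing new was labeled; stop early
            break
        pool = remaining
    return labeled, pool


labeled, pool = co_train(SEED, UNLABELED)
print(dict(labeled))
```

In this toy run the suffix view labels ("root", "contains") in the first round, which teaches the context view that "contains" signals a medicine; in the second round the context view can then label ("bark", "contains") even though the suffix "bark" never appeared in the seed set.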

Developing and Deploying Algorithms for Information Extraction using Classification Measures for Named Entity Recognition

International Journal of Computer Sciences and Engineering, 2018, Vol 6 (10), pp. 235-248

DOI: 10.26438/ijcse/v6i10.235248

Author(s): Rehan Khan, A.J. Singh

Keyword(s): Information Extraction, Named Entity Recognition, Entity Recognition, Named Entity
