Domain-Adapted Word Segmentation for an Out-of-Domain Language Modeling

Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop ◽

10.1007/978-1-4614-1335-6_9 ◽

2011 ◽

pp. 63-73

Author(s):

Euisok Chung ◽

Hyung-Bae Jeon ◽

Jeon-Gue Park ◽

Yun-Keun Lee

Keyword(s):

Language Modeling ◽

Word Segmentation

Ergodic multigram HMM integrating word segmentation and class tagging for Chinese language modeling

1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings ◽

10.1109/icassp.1996.540324 ◽

2002 ◽

Author(s):

Hubert Hin-Cheung Law ◽

Chorkin Chan

Keyword(s):

Chinese Language ◽

Language Modeling ◽

Word Segmentation

A Study of Multilingual Toxic Text Detection Approaches under Imbalanced Sample Distribution

Information ◽

10.3390/info12050205 ◽

2021 ◽

Vol 12 (5) ◽

pp. 205

Author(s):

Guizhe Song ◽

Degen Huang ◽

Zhifeng Xiao

Keyword(s):

State Of The Art ◽

Contextual Information ◽

Language Modeling ◽

Word Segmentation ◽

Loss Functions ◽

Fusion Model ◽

Sample Distribution ◽

Fusion Strategy ◽

Training Models ◽

Processing Steps

Multilingual characteristics, lack of annotated data, and imbalanced sample distribution are the three main challenges for toxic comment analysis in a multilingual setting. This paper proposes a multilingual toxic text classifier which adopts a novel fusion strategy that combines different loss functions and multiple pre-training models. Specifically, the proposed learning pipeline starts with a series of pre-processing steps, including translation, word segmentation, purification, text digitization, and vectorization, to convert word tokens to a vectorized form suitable for the downstream tasks. Two models, multilingual bidirectional encoder representation from transformers (MBERT) and XLM-RoBERTa (XLM-R), are employed for pre-training through Masking Language Modeling (MLM) and Translation Language Modeling (TLM), which incorporate semantic and contextual information into the models. We train six base models and fuse them to obtain three fusion models using the F1 scores as the weights. The models are evaluated on the Jigsaw Multilingual Toxic Comment dataset. Experimental results show that the best fusion model outperforms the two state-of-the-art models, MBERT and XLM-R, in F1 score by 5.05% and 0.76%, respectively, verifying the effectiveness and robustness of the proposed fusion strategy.
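
The abstract above says the base models are fused "using the F1 scores as the weights" but does not give the formula. The sketch below shows one plausible reading: a weighted average of each model's predicted toxicity probabilities, with weights proportional to each model's validation F1. The function name `fuse_predictions`, the normalization step, the 0.5 decision threshold, and all numeric values are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Sequence


def fuse_predictions(
    model_probs: Sequence[Sequence[float]],
    f1_scores: Sequence[float],
) -> List[float]:
    """Average per-model probabilities with weights proportional to F1.

    model_probs[m][i] is model m's predicted probability for sample i.
    This is an assumed fusion rule, not the paper's confirmed method.
    """
    if len(model_probs) != len(f1_scores):
        raise ValueError("need exactly one F1 score per model")
    total = sum(f1_scores)
    if total <= 0:
        raise ValueError("F1 scores must sum to a positive value")

    # Normalize so the weights sum to 1 and the result stays in [0, 1].
    weights = [f1 / total for f1 in f1_scores]

    n_samples = len(model_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs))
        for i in range(n_samples)
    ]


# Hypothetical probabilities and F1 scores, for illustration only.
probs_a = [0.9, 0.2, 0.6]  # e.g. an MBERT-based classifier
probs_b = [0.7, 0.4, 0.1]  # e.g. an XLM-R-based classifier
fused = fuse_predictions([probs_a, probs_b], f1_scores=[0.75, 0.25])

# Assumed decision threshold of 0.5 for the toxic / non-toxic label.
labels = [int(p >= 0.5) for p in fused]
```

With this rule, a model with a higher validation F1 pulls the fused score toward its own prediction, which is one simple way to let stronger base models dominate without discarding the weaker ones.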

Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling

Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - ACL-IJCNLP '09 ◽

10.3115/1687878.1687894 ◽

2009 ◽

Author(s):

Daichi Mochihashi ◽

Takeshi Yamada ◽

Naonori Ueda

Keyword(s):

Language Modeling ◽

Word Segmentation

Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling

10.18653/v1/d18-1531 ◽

2018 ◽

Author(s):

Zhiqing Sun ◽

Zhi-Hong Deng

Keyword(s):

Language Modeling ◽

Word Segmentation

Combining Language Modeling and Discriminative Classification for Word Segmentation

Computational Linguistics and Intelligent Text Processing - Lecture Notes in Computer Science ◽

10.1007/978-3-642-00382-0_14 ◽

2009 ◽

pp. 170-182

Author(s):

Dekang Lin

Keyword(s):

Language Modeling ◽

Word Segmentation

Joint n-gram Chinese language modeling with an application to Chinese word segmentation

2012 International Conference on Audio, Language and Image Processing ◽

10.1109/icalip.2012.6376633 ◽

2012 ◽

Author(s):

Xin He ◽

Zhijian Ou ◽

Jiasong Sun

Keyword(s):

Chinese Language ◽

Language Modeling ◽

Word Segmentation ◽

Chinese Word ◽

Chinese Word Segmentation

Word Segmentation in the "Real World" of Conversational Speech

PsycEXTRA Dataset ◽

10.1037/e527342012-594 ◽

2007 ◽

Author(s):

Joseph D. W. Stephens ◽

Mark A. Pitt

Keyword(s):

Word Segmentation ◽

Conversational Speech

Does Prior Orthographic Experience Influence L1 and L2 Word Segmentation?

PsycEXTRA Dataset ◽

10.1037/e537052012-687 ◽

2004 ◽

Author(s):

Jyotsna Vaid ◽

Hsin-Chin Chen ◽

Francisco E. Martinez ◽

Chaitra Rao

Keyword(s):

Word Segmentation

Improving word segmentation by simultaneously learning phonotactics

Proceedings of the Twelfth Conference on Computational Natural Language Learning - CoNLL '08 ◽

10.3115/1596324.1596336 ◽

2008 ◽

Author(s):

Daniel Blanchard ◽

Jeffrey Heinz

Keyword(s):

Word Segmentation

Gated Recursive Neural Network for Chinese Word Segmentation

10.3115/v1/p15-1168 ◽

2015 ◽

Author(s):

Xinchi Chen ◽

Xipeng Qiu ◽

Chenxi Zhu ◽

Xuanjing Huang

Keyword(s):

Neural Network ◽

Word Segmentation ◽

Chinese Word ◽

Chinese Word Segmentation ◽

Recursive Neural Network
