Supervised Ensemble Learning for Vietnamese Tokenization
Vietnamese tokenization is a fundamental and challenging problem, and algorithms that solve it underpin many natural language processing applications. In this paper, we investigate Vietnamese tokenization and propose a supervised ensemble learning (SEL) framework together with an SEL-based tokenization (SELT) algorithm. Supported by a syllable-syllable frequency index data structure, the SELT algorithm combines multiple weak tokenizers into a strong tokenizer. Within the SEL framework, we also investigate how to construct a weak tokenizer efficiently. We suggest two prediction methods for selecting a suitable dictionary, and we efficiently implement two weak tokenizers using a simple dictionary-based tokenization algorithm. Experimental results show that the SELT algorithm, when integrating our weak tokenizers, achieves state-of-the-art performance on the Vietnamese tokenization task.
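To make the idea of combining dictionary-based weak tokenizers concrete, the following is a minimal sketch and not the SELT algorithm itself. It assumes that each weak tokenizer is a greedy longest-match tokenizer over a dictionary of multi-syllable words, and that the tokenizers are combined by an unweighted majority vote on token boundaries. Both choices are illustrative assumptions: the supervised combination and the syllable-syllable frequency index described above are not reproduced here, and the names `longest_match_tokenize`, `boundaries`, and `vote_tokenize` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

Token = Tuple[str, ...]


def longest_match_tokenize(
    syllables: Sequence[str], dictionary: Set[Token], max_len: int = 4
) -> List[Token]:
    """Greedy longest-match weak tokenizer (an assumed design, not the paper's)."""
    tokens: List[Token] = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first; a single syllable is always accepted.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens


def boundaries(tokens: Sequence[Token]) -> Set[int]:
    """Return the syllable positions at which one token ends and the next begins."""
    cuts: Set[int] = set()
    pos = 0
    for tok in tokens[:-1]:
        pos += len(tok)
        cuts.add(pos)
    return cuts


def vote_tokenize(
    syllables: Sequence[str], dictionaries: Sequence[Set[Token]]
) -> List[Token]:
    """Combine weak tokenizers by majority vote on boundaries (illustrative only)."""
    if not syllables:
        return []
    votes: Counter = Counter()
    for dictionary in dictionaries:
        votes.update(boundaries(longest_match_tokenize(syllables, dictionary)))
    # Keep a boundary only if more than half of the weak tokenizers propose it.
    cuts = sorted(p for p, v in votes.items() if v > len(dictionaries) / 2)
    tokens: List[Token] = []
    start = 0
    for cut in cuts + [len(syllables)]:
        tokens.append(tuple(syllables[start:cut]))
        start = cut
    return tokens


# Toy example: three hypothetical dictionaries, one of which contains a misleading entry.
syllables = ["tôi", "là", "sinh", "viên"]
dict_a = {("sinh", "viên")}
dict_b = {("sinh", "viên"), ("là", "sinh")}
dict_c = {("sinh", "viên")}
result = vote_tokenize(syllables, [dict_a, dict_b, dict_c])
```

In this toy example the second dictionary leads its weak tokenizer to group "là sinh" incorrectly, while the vote across all three recovers the word "sinh viên". A supervised ensemble would presumably learn how much to trust each weak tokenizer instead of weighting them equally.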