New Words Identification Based on Ensemble Methods
2014, Vol 602-605, pp. 1626-1629
To identify new words in large Chinese corpora efficiently, this paper proposes an algorithm based on ensemble methods. First, we segment the Chinese text with a Trie and build a segment tree. Next, we select word-pattern extraction, frequency filtering, independent word probability, and a naive Bayes model as the sub-models of the ensemble and train them independently. Finally, we integrate the results of the sub-models with a multi-layer model. In experiments, the algorithm proves to be fast and produces precise, high-coverage results.
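As an illustration only, the Trie-based segmentation step and the frequency-filtering sub-model might be sketched as follows. This is not the paper's implementation: forward maximum matching, the rule that adjacent unmatched single characters form a candidate, and the names `Trie`, `segment`, `candidate_new_words`, and `min_freq` are all assumptions made for the example. The segment tree, the other sub-models, and the multi-layer integration are not shown.

```python
from collections import Counter


class Trie:
    """Minimal character trie holding the known dictionary words."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def longest_match(self, text, start):
        """Length of the longest dictionary word beginning at `start`, or 0."""
        node, best = self, 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best


def segment(text, trie):
    """Forward maximum matching (assumed strategy); unmatched characters
    are emitted as single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        n = trie.longest_match(text, i) or 1
        tokens.append(text[i:i + n])
        i += n
    return tokens


def candidate_new_words(sentences, trie, min_freq=2):
    """Frequency filtering (simplified): join runs of adjacent
    single-character tokens and keep those seen at least `min_freq` times."""
    counts = Counter()
    for sentence in sentences:
        run = ""
        # The trailing empty token is a sentinel that flushes the last run.
        for tok in segment(sentence, trie) + [""]:
            if len(tok) == 1:
                run += tok
            else:
                if len(run) >= 2:
                    counts[run] += 1
                run = ""
    return {w: c for w, c in counts.items() if c >= min_freq}


# Hypothetical toy dictionary and corpus, for illustration only.
trie = Trie()
for known in ["我们", "喜欢", "使用"]:
    trie.insert(known)

sentences = ["我们喜欢微博", "我们使用微博"]
print(segment(sentences[0], trie))
print(candidate_new_words(sentences, trie))
```

In this toy corpus, "微博" is absent from the dictionary, so it is split into single characters in both sentences and survives the frequency filter. A real system would combine such candidates with the scores of the other sub-models before accepting them.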