ADAPTIVE SPAM FILTERING USING DYNAMIC FEATURE SPACES
Unsolicited bulk e-mail, also known as spam, is a growing problem for e-mail users. This paper presents a new spam filtering strategy that 1) uses a practical entropy coding technique, Huffman coding, to dynamically encode the feature space of e-mail collected over time, and 2) applies an online algorithm to adaptively enhance the learned spam concept as new e-mail data becomes available. The contributions of this work include a highly efficient spam filtering algorithm in which the input space is radically reduced to a single-dimension input vector, and an adaptive learning technique that is robust to vocabulary change, concept drift, and skewed class distributions. We compare our technique with several existing off-line learning techniques, including support vector machines, logistic regression, naïve Bayes, k-nearest neighbor, C4.5 decision tree, RBFNetwork, boosted decision tree, and stacking. We demonstrate the effectiveness of our technique with experimental results on publicly available e-mail data, and we present and discuss a more in-depth statistical analysis of those results.
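For readers unfamiliar with Huffman coding, the following is a minimal, generic sketch of how prefix codes can be built from token frequencies, so that frequent tokens receive short codes. It is an illustration only and not the encoding scheme of this paper, whose details are not given in this abstract. The function name `huffman_codes` and the toy token list are assumptions made for the example.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(freqs):
    """Return a dict mapping each symbol to a Huffman bit string.

    `freqs` maps symbols to positive counts. Hypothetical helper for
    illustration; not taken from the paper.
    """
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A single symbol still needs a one-bit code.
        return {sym: "0" for sym in freqs}

    # The counter breaks ties so the heap never compares the dicts.
    tiebreak = count()
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))

    return heap[0][2]


# Toy token stream standing in for words extracted from e-mail (assumed data).
tokens = "free offer free money meeting free offer agenda".split()
token_counts = Counter(tokens)
codes = huffman_codes(token_counts)

for token, code in sorted(codes.items(), key=lambda kv: (len(kv[1]), kv[0])):
    print(f"{token:8s} count={token_counts[token]} code={code}")
```

Because the codes depend only on the current counts, they can be rebuilt as new e-mail arrives and the vocabulary shifts, which is the sense in which such an encoding is dynamic.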