The Impact of the Mode of Data Representation for the Result Quality of the Detection and Filtering of Spam
Spam has reached phenomenal proportions: it accounts for a high percentage of all email exchanged on the Internet. To help fight spam, this article develops a hybrid algorithm that first uses a probabilistic model, in this case Naïve Bayes, to weight the terms of the term-category matrix, and then uses an unsupervised learning algorithm (K-means) to separate messages into two classes, spam and ham (legitimate email). To determine the parameters to which the classification is sensitive, we study the content of the messages using language-independent representations based on word n-grams and character n-grams (because a message may be received in any language), in order to decide which representation yields a good classification. We use several evaluation metrics to validate our results.
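To make the two representations concrete, the following minimal sketch shows one way character n-grams and word n-grams could be extracted from a message. The function names `char_ngrams` and `word_ngrams`, the lowercasing, and the whitespace tokenization are illustrative assumptions and are not taken from the article; the Naïve Bayes weighting and K-means steps are not shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams.

    Assumption: the text is lowercased and runs of whitespace are
    collapsed to a single space before the n-grams are extracted.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams.

    Assumption: words are obtained by splitting on whitespace, which
    avoids any language-specific tokenizer.
    """
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    message = "Win a FREE prize now"  # hypothetical example message
    print(char_ngrams(message, 3).most_common(5))
    print(word_ngrams(message, 2))
```

Neither function relies on a dictionary, stemmer, or stop-word list, which is what makes this kind of representation usable regardless of the language of the message. The resulting counts could serve as the raw term frequencies from which a term-category matrix is built.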