Advanced Image-Based Spam Detection and Filtering Techniques - Advances in Information Security, Privacy, and Ethics
Latest Publications


Published By IGI Global

9781683180135, 9781683180142

Spam features are the distinctive characteristics of spam that are used to differentiate it from genuine messages. Each message m is processed by a feature extraction module that represents m as an n-dimensional feature vector x = (x1, x2, …, xn). In text-based spam filters, a feature can be a word, and a feature vector may be composed of various words extracted from the message. Each spam message is associated with one feature vector. Based on the characteristics discussed in the previous chapter, we extract features that capture those characteristics from image spam in order to build robust spam detection algorithms. These features are broadly classified into high-level metadata features and low-level image features such as color features, grayscale features, texture-related features and embedded-text-related features.
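As a rough illustration of mapping one image to a feature vector x = (x1, …, xn), the following Python sketch computes five simple features. The specific features, the function name extract_features, and the pixel representation (rows of RGB tuples) are assumptions chosen for illustration; they are not the feature set defined in the chapter.

```python
from statistics import mean, pstdev


def extract_features(pixels, file_size_bytes):
    """Map one image to a fixed-length feature vector (hypothetical features).

    pixels: list of rows, each row a list of (r, g, b) tuples in 0..255.
    file_size_bytes: size of the image file, used as a metadata feature.
    """
    height = len(pixels)
    width = len(pixels[0])
    flat = [p for row in pixels for p in row]
    # Grayscale intensity using the common BT.601 luma weights.
    gray = [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in flat]
    return [
        width / height,                      # metadata: aspect ratio
        file_size_bytes / (width * height),  # metadata: bytes per pixel
        len(set(flat)) / len(flat),          # color: fraction of distinct colors
        mean(gray) / 255,                    # grayscale: mean intensity
        pstdev(gray) / 255,                  # grayscale: spread of intensities
    ]


# Usage: a tiny 2x2 all-white image stored in a 40-byte file.
white = [[(255, 255, 255)] * 2 for _ in range(2)]
x = extract_features(white, file_size_bytes=40)
```

Every image yields a vector of the same length, so a classifier can be trained on the vectors regardless of the original image dimensions.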


To understand the never-ending fight between the developers of anti-spam techniques and spammers, it is important to have some insight into the history of spam mail. On May 3, 1978, Gary Thuerk, a marketing manager at Digital Equipment Corporation, sent his first mass email to more than 400 customers over the Arpanet in order to promote and sell Digital's new T-Series of VAX systems (Streitfeld, 2003). In this regard, he said, “It's too much work to send everyone an e-mail. So we'll send one e-mail to everyone”. He said with pride, “I was the pioneer. I saw a new way of doing things.” As every coin has two sides, any technology can be used with good or bad intentions. At that time, Gary Thuerk could never have imagined that this method of sending mail would emerge as an area of research. He ended up being crowned the father of spam mail instead of the father of e-marketing. At present, the internet receives 2.5 billion pieces of spam a day from the spiritual followers of Thuerk.


The evasion techniques used by image spam pose new challenges for e-mail spam filters. Effective image spam detection requires the selection of discriminative image features and a suitable classification scheme. Existing research on image spam detection uses only visual features such as color, appearance, shape and texture, and no effort has been made to employ statistical noise features. Further, most image spam classification schemes assume a clear-cut demarcation between the features extracted from genuine images and those extracted from image spam. In this chapter, we attempt to address these issues by proposing a novel server-side solution called F-ISDS (Fuzzy Inference System based Image Spam Detection Scheme). F-ISDS considers statistical noise features along with the standard image features and metadata features. F-ISDS employs dimensionality reduction using Principal Component Analysis (PCA) to map the selected set of n features into a set of m principal components. Based on the selected significant principal components, input/output membership functions and rules are designed for the Fuzzy Inference System (FIS) classifier. FIS provides a computationally simple and intuitive means of performing image spam detection. The email server can tag each email with this knowledge so that the client can take a decision as per the local policy. Further, Linear Regression Analysis is used to model the relationship between the selected principal components and the extracted features for the classification phase. Experimental results confirm the efficacy of the proposed solution.
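The fuzzy classification step can be sketched in a few lines of Python. This is a minimal illustration only: the abstract does not give the actual membership functions, rules or thresholds of F-ISDS, so the two inputs (assumed to be principal-component scores already scaled to [0, 1]), the linear membership functions, the four rules, the weighted-average defuzzification and the names fuzzify, spam_score and tag_email are all hypothetical.

```python
def fuzzify(x):
    """Return membership degrees of x in the fuzzy sets 'low' and 'high'."""
    x = min(1.0, max(0.0, x))  # clamp the input to [0, 1]
    return {"low": 1.0 - x, "high": x}


# Hypothetical rule base: (fuzzy set for pc1, fuzzy set for pc2, spam level).
RULES = [
    ("high", "high", 1.0),  # both components high: spam
    ("high", "low", 0.6),   # mixed evidence: suspicious
    ("low", "high", 0.6),
    ("low", "low", 0.0),    # both components low: genuine
]


def spam_score(pc1, pc2):
    """Combine the rules into one crisp score in [0, 1]."""
    m1, m2 = fuzzify(pc1), fuzzify(pc2)
    # Fuzzy AND is taken as the minimum of the two membership degrees.
    fired = [(min(m1[a], m2[b]), level) for a, b, level in RULES]
    total = sum(strength for strength, _ in fired)  # positive for these sets
    # Weighted average of the rule outputs (zero-order Sugeno style).
    return sum(strength * level for strength, level in fired) / total


def tag_email(pc1, pc2, threshold=0.5):
    """Server-side tag; the threshold stands in for a local policy."""
    return "spam" if spam_score(pc1, pc2) >= threshold else "ham"
```

Because the memberships overlap, an image whose features fall between the two classes receives an intermediate score instead of being forced across a hard boundary, which is the motivation the abstract gives for a fuzzy classifier.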


This chapter provides the details of visual feature based image spam filters, a literature review on these spam filters and their limitations. These methods are generally computationally efficient and exhibit more accuracy in the presence of various kinds of noise than OCR based detection schemes, as they do not include any text recognition stage (Lamia et al., 2012). The previously discussed near-duplicate spam detection methods are likely to perform well in abstracting base templates when given enough examples of the various spam templates in use (Mehta et al., 2008). However, the generalization ability of these methods is limited. Visual feature based spam detection methods are generally built using different high-level and/or low-level image features (refer to Chapter 3 of this book) related to the color, shape and texture characteristics of spam images; hence they have more generalization capability (Lamia et al., 2012). Mostly, these techniques exploit the text-intensive and noisy nature of spam images.


A picture is worth a thousand words. Spam images give us many hints; one of them is that they are near duplicates. Spam images are often generated from the same templates (which are designed by spammers), as they are sent to various recipients at the same time in batches. Various spam images are generated by randomizing the contents of these templates; as a result, the spam images generated from one template resemble one another. This similarity among visually similar spam images can be exploited by spam detectors to discriminate them from ham. The spam detectors can be further trained on new data if the spam images are generated from different templates, which is not a frequent phenomenon as it is resource intensive. The detection schemes that exploit the near-duplicate characteristics of image spam use different types of image characteristics to calculate the similarity among spam images. This chapter provides the details of near-duplicate detection based image spam filters, a literature review on these spam filters and their limitations.
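One simple way to compute such a similarity is to compare coarse color histograms. The Python sketch below is an illustration under assumptions: the histogram-intersection measure, the bin count, the 0.9 threshold and the function names are chosen for the example and are not taken from the schemes reviewed in the chapter.

```python
from collections import Counter


def color_histogram(pixels, bins=4):
    """Normalized histogram of (r, g, b) pixels quantized to bins per channel."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {color: n / total for color, n in counts.items()}


def similarity(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint."""
    return sum(min(v, h2.get(color, 0.0)) for color, v in h1.items())


def is_near_duplicate(pixels, template_histograms, threshold=0.9):
    """True if the image is close to any known spam template histogram."""
    h = color_histogram(pixels)
    return any(similarity(h, t) >= threshold for t in template_histograms)


# Usage: a known template and a slightly randomized variant of it.
template = [(255, 0, 0)] * 90 + [(255, 255, 255)] * 10
variant = [(250, 5, 3)] * 88 + [(255, 255, 255)] * 12
known = [color_histogram(template)]
print(is_near_duplicate(variant, known))
```

Coarse quantization makes the comparison tolerant of small color perturbations, while an image generated from an unseen template will not match any stored histogram, which reflects the limited generalization of this family of methods.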


In 2003, the first image with spam text inside was reported by Graham-Cumming. Later, spammers used this technique successfully by sending image spam as MIME attachments instead of as simple image tags. The earlier content filtering techniques, based on text analysis of the subject and body fields of an email, were ineffective against this new type of spam attack. The first attempts by researchers to detect such spam were based on Optical Character Recognition (OCR) methods. These methods tried to extract the spam text from image spam and compare it with an existing database of spam keywords. This chapter provides the details of OCR methods, a literature review on spam filters based on OCR methods and their limitations.
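The keyword comparison stage can be sketched as follows. The sketch assumes the OCR engine has already produced a text string, so no OCR library is called; the keyword set, the min_hits threshold and the function names are hypothetical.

```python
import re

# Hypothetical keyword database; a real filter would load a maintained list.
SPAM_KEYWORDS = {"viagra", "lottery", "winner", "free", "offer"}


def keyword_hits(ocr_text, keywords=SPAM_KEYWORDS):
    """Return the sorted spam keywords found in text recovered by OCR."""
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    return sorted(set(tokens) & keywords)


def is_spam_text(ocr_text, min_hits=2):
    """Flag the image when enough distinct keywords are present."""
    return len(keyword_hits(ocr_text)) >= min_hits


print(keyword_hits("FREE lottery ticket! You are a WINNER"))
print(is_spam_text("v1agra l0ttery"))  # obfuscated text yields no exact match
```

The second call hints at one limitation of this approach: when the recovered text is distorted or deliberately obfuscated, exact keyword matching finds nothing.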


Everything in this world has some unique characteristics associated with it, and these characteristics help us to differentiate among things. Similarly, spam mails exhibit special characteristics by which they can be distinguished from genuine mails. In order to understand the characteristics of image spam, it is important to have an overview of the characteristics of spam emails in general. A good understanding of these characteristics gives anti-spam developers sufficient knowledge for building anti-spam frameworks for both the server and the client end.

