WSF2: A Novel Framework for Filtering Web Spam

2016 ◽  
Vol 2016 ◽  
pp. 1-18 ◽  
Author(s):  
J. Fdez-Glez ◽  
D. Ruano-Ordás ◽  
R. Laza ◽  
J. R. Méndez ◽  
R. Pavón ◽  
...  

In recent years, research on web spam filtering has gained interest from both academia and industry. In this context, although a good number of successful antispam techniques are available (e.g., content-based, link-based, and hiding), an adequate combination of different algorithms supported by an advanced web spam filtering platform would offer more promising results. To this end, we propose the WSF2 framework, a new platform particularly suitable for filtering spam content on web pages. Currently, our framework allows the easy combination of different filtering techniques including, but not limited to, regular expressions and well-known classifiers (e.g., Naïve Bayes, Support Vector Machines, and C5.0). Applying our WSF2 framework over the publicly available WEBSPAM-UK2007 corpus, we have been able to demonstrate that a simple combination of different techniques improves the accuracy of single classifiers on web spam detection. As a result, we conclude that the proposed filtering platform is a powerful tool for boosting applied research in this area.
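
The central idea, combining hand-written rules with statistical classifiers, can be illustrated with a minimal sketch. The snippet below is not the WSF2 API; the regex, training data, and OR-style vote combination are illustrative assumptions only.

```python
# Minimal sketch (hypothetical, not the WSF2 API): combine a regex rule with a
# Naive Bayes vote, the simplest form of the filter combination described above.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: page texts and labels (1 = spam, 0 = legitimate).
train_texts = ["cheap pills buy now free offer", "conference on machine learning"]
train_labels = [1, 0]

vectorizer = CountVectorizer()
nb = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# A hand-written regex rule acting as one more filter in the ensemble.
SPAM_REGEX = re.compile(r"\b(viagra|casino|free offer)\b", re.IGNORECASE)

def classify(page_text: str) -> int:
    """Return 1 (spam) if either the regex rule or the NB classifier fires."""
    rule_vote = 1 if SPAM_REGEX.search(page_text) else 0
    nb_vote = int(nb.predict(vectorizer.transform([page_text]))[0])
    # Simple OR-combination; a real platform could weight or chain the filters.
    return 1 if (rule_vote or nb_vote) else 0

print(classify("Get your free offer today!"))  # -> 1
```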

2014 ◽  
Vol 23 (02) ◽  
pp. 1441001 ◽  
Author(s):  
De Wang ◽  
Danesh Irani ◽  
Calton Pu

Identifying and detecting web spam is an ongoing battle between spam researchers and spammers, one that has continued from the time search engines first allowed searching of web pages to the modern sharing of web links via social networks. A common challenge faced by spam researchers is that new techniques require a corpus of both legitimate and spam web pages. Although large corpora of legitimate web pages are available to researchers, the same cannot be said about spam web pages. In this paper, we introduce the Webb Spam Corpus 2011, a corpus of approximately 330,000 spam web pages, which we make available to researchers in the fight against spam. By having a standard corpus available, researchers can better collaborate on developing and reporting the results of spam filtering techniques. The corpus contains web pages crawled from links found in over 6.3 million spam emails. We analyze multiple aspects of this corpus, including redirection, HTTP headers, web page content, and classification evaluation. We also provide insights into changes in web spam since the previous Webb Spam Corpus was released in 2006: (1) spammers manipulate social media to spread spam; (2) HTTP headers and page content also change over time; (3) spammers have evolved and adopted new techniques to avoid detection based on HTTP header information.
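
One of the corpus facets analysed above, HTTP response headers, lends itself to simple feature extraction for a downstream classifier. The sketch below is purely illustrative; the chosen header fields and toy values are assumptions, not statistics drawn from the corpus.

```python
# Hypothetical sketch: turn HTTP response headers into coarse numeric features
# that a spam classifier could consume alongside content-based features.
from typing import Dict, List

def header_features(headers: Dict[str, str]) -> List[float]:
    """Extract a few coarse features from an HTTP header dictionary."""
    server = headers.get("Server", "").lower()
    return [
        float("Location" in headers),                    # page redirects elsewhere
        float(len(headers)),                             # number of header fields
        float("nginx" in server or "apache" in server),  # common server software flag
        float(headers.get("Content-Type", "").startswith("text/html")),
    ]

example = {"Server": "Apache/2.2", "Content-Type": "text/html",
           "Location": "http://example.com/landing"}
print(header_features(example))  # -> [1.0, 3.0, 1.0, 1.0]
```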


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-18
Author(s):  
Asim Shahzad ◽  
Nazri Mohd Nawi ◽  
Muhammad Zubair Rehman ◽  
Abdullah Khan

In this modern era, people utilise the web to share information and to deliver services and products. Information seekers use different search engines (SEs) such as Google, Bing, and Yahoo as tools to search for products, services, and information. However, web spamming is one of the most significant issues encountered by SEs because it dramatically affects the quality of SE results. Web spamming's economic impact is enormous because web spammers push massive amounts of free advertising content into SE indexes to increase the volume of web traffic on a targeted website. Spammers trick an SE into ranking irrelevant web pages higher than relevant web pages in the search engine results pages (SERPs) using different web-spamming techniques. Consequently, these high-ranked unrelated web pages contain insufficient or inappropriate information for the user. Several researchers from industry and academia are working to detect spam web pages, but no efficient technique capable of catching all spam web pages on the World Wide Web (WWW) has been presented yet. This research proposes an improved framework for content- and link-based web-spam identification. The framework uses stopwords, keyword frequency, part-of-speech (POS) ratio, a spam-keyword database, and copied-content algorithms for content-based web-spam detection. For link-based web-spam detection, we initially exposed the relationship network behind link-based web spamming and then used a paid-link database, neighbour pages, spam signals, and link-farm algorithms. Finally, we combined all the content- and link-based spam identification algorithms to identify both types of spam. To conduct experiments and to obtain threshold values, the WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets were used. A promising F-measure of 79.6% with 81.2% precision shows the applicability and effectiveness of the proposed approach.
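
Two of the content-based signals mentioned above, stopword usage and keyword repetition, can be sketched in a few lines. The stopword list, feature definitions, and threshold interpretation below are illustrative assumptions, not the paper's exact algorithms or values.

```python
# Minimal sketch (assumed feature definitions): stopword ratio and the relative
# frequency of the most repeated keyword, two coarse keyword-stuffing signals.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def content_signals(text: str) -> dict:
    words = text.lower().split()
    if not words:
        return {"stopword_ratio": 0.0, "top_keyword_freq": 0.0}
    counts = Counter(w for w in words if w not in STOPWORDS)
    top_freq = (counts.most_common(1)[0][1] / len(words)) if counts else 0.0
    stop_ratio = sum(1 for w in words if w in STOPWORDS) / len(words)
    return {"stopword_ratio": stop_ratio, "top_keyword_freq": top_freq}

# Keyword-stuffed spam typically shows a low stopword ratio and a high
# top-keyword frequency, which a framework could compare against thresholds
# tuned on labelled corpora such as WEBSPAM-UK2006/UK2007.
print(content_signals("cheap phones cheap phones cheap phones buy cheap phones"))
```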


2010 ◽  
pp. 1778-1787
Author(s):  
Dion Hoe-Lian Goh ◽  
Khasfariyati Razikin ◽  
Alton Y.K. Chua ◽  
Chei Sian Lee ◽  
Schubert Foo

Social tagging is the process by which users assign and share freely selected terms to describe resources. This approach enables users to annotate/describe resources, and also allows users to locate new resources through the collective intelligence of other users. Social tagging offers a new avenue for resource discovery compared to taxonomies and subject directories created by experts. This chapter investigates the effectiveness of tags as resource descriptors using text categorization via support vector machines (SVM). Two text categorization experiments were conducted for this research, using tags and Web pages from del.icio.us. The first study used only terms as its features, while the second used both terms and tags as part of its feature set. The experiments yielded macroaveraged precision, recall, and F-measure scores of 52.66%, 54.86%, and 52.05%, respectively. In terms of microaveraged values, the experiments obtained 64.76% for precision, 54.40% for recall, and 59.14% for F-measure. The results suggest that the tags were not always reliable indicators of the resource contents. At the same time, the results from the terms-only experiment were better than those of the experiment with both terms and tags. Implications of our work and opportunities for future work are also discussed.
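
The evaluation setup described here, an SVM text categorizer scored with macro- and micro-averaged precision, recall, and F-measure, can be reproduced in outline as below. The data is a toy stand-in, not the del.icio.us collection, and the specific vectorizer and SVM variant are assumptions.

```python
# Minimal sketch (toy data, assumed model choices): SVM text categorization
# evaluated with macro- and micro-averaged precision, recall, and F-measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support

train_docs = ["python tutorial code", "holiday photos beach",
              "java programming guide", "travel tips europe"]
train_cats = ["programming", "travel", "programming", "travel"]
test_docs = ["learn python programming", "beach holiday in europe"]
test_cats = ["programming", "travel"]

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(train_docs), train_cats)
pred = clf.predict(vec.transform(test_docs))

for avg in ("macro", "micro"):
    p, r, f, _ = precision_recall_fscore_support(
        test_cats, pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```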


