Web Mining | ScienceGate

A Java Technology Based Distributed Software Architecture for Web Usage Mining

Web Mining ◽ 10.4018/978-1-59140-414-9.ch017 ◽ 2011 ◽ pp. 355-372

Author(s): Juan M. Hernansaez

Keyword(s): Inductive Learning ◽ Mining Area ◽ Web Usage Mining ◽ Sequential Patterns ◽ Distributed Software ◽ Web Usage ◽ Java Technology ◽ Commercial Company ◽ Meta Learning ◽ The Web

In this chapter we focus on the three approaches that seem to be the most successful ones in the Web usage mining area: clustering, association rules and sequential patterns. We will discuss some techniques from each one of these approaches, and then we will show the benefits of using METALA (a META-Learning Architecture) as an integrating tool not only for the discussed Web usage mining techniques, but also for inductive learning algorithms. As we will show, this architecture can also be used to generate new theories and models that can be useful to provide new generic applications for several supervised and non-supervised learning paradigms. As a particular example of a Web usage mining application, we will report our work for a medium-sized commercial company, and we will discuss some interesting properties and conclusions that we have obtained from our reporting.

Efficient Web Mining for Traversal Path Patterns

Web Mining ◽ 10.4018/978-1-59140-414-9.ch015 ◽ 2011 ◽ pp. 322-338 ◽ Cited By ~ 1

Author(s): Zhixiang Chen ◽ Richard H. Fowler ◽ Ada Wai-Chee Fu ◽ Chunyue Wang

Keyword(s): Web Mining ◽ Linear Time ◽ Fundamental Problem ◽ A Priori ◽ Web Pages ◽ Suffix Trees ◽ Web Logs ◽ Large Alphabet ◽ Optimal Linear ◽ Linear Time Algorithms

A maximal forward reference of a Web user is a longest consecutive sequence of Web pages visited by the user in a session without revisiting some previously visited page in the sequence. Efficient mining of frequent traversal path patterns, that is, large reference sequences of maximal forward references, from very large Web logs is a fundamental problem in Web mining. This chapter aims at designing algorithms for this problem with the best possible efficiency. First, two optimal linear time algorithms are designed for finding maximal forward references from Web logs. Second, two algorithms for mining frequent traversal path patterns are devised with the help of a fast construction of shallow generalized suffix trees over a very large alphabet. These two algorithms have provable linear and sublinear time complexity, respectively, and their performance is analyzed in comparison with Apriori-like algorithms and the Ukkonen algorithm. It is shown that these two new algorithms are substantially more efficient than the Apriori-like algorithms and the Ukkonen algorithm.
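The definition of a maximal forward reference given in this abstract can be illustrated with a short sketch. The code below is a simple stack-based approach, assuming a session is already available as an ordered list of page identifiers; it is not necessarily either of the two algorithms designed in the chapter, and the function name `maximal_forward_references` and the sample session are hypothetical.

```python
def maximal_forward_references(session):
    """Split one session (ordered page ids) into maximal forward references.

    Illustrative sketch only: a forward path is extended until a page
    already on the path is requested again (a backward reference); the
    path is then emitted once and truncated back to the revisited page.
    """
    path = []              # current forward path
    position = {}          # page -> index in path, for fast revisit checks
    results = []
    moving_forward = False

    for page in session:
        if page in position:
            # Backward reference: emit the path only if we were moving forward.
            if moving_forward:
                results.append(list(path))
            keep = position[page] + 1
            for dropped in path[keep:]:
                del position[dropped]
            del path[keep:]
            moving_forward = False
        else:
            position[page] = len(path)
            path.append(page)
            moving_forward = True

    # A session that ends while moving forward leaves one last path to emit.
    if moving_forward and path:
        results.append(list(path))
    return results


if __name__ == "__main__":
    # Hypothetical session: each letter stands for one page request.
    session = list("ABCDCBEGHGWAOUOV")
    for reference in maximal_forward_references(session):
        print("".join(reference))
    # prints ABCD, ABEGH, ABEGW, AOU, AOV (one per line)
```

Each page is pushed onto the path at most once per request and removed at most once per push, so apart from copying the emitted paths the work grows in proportion to the session length.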

Web Usage Mining in Search Engines

Web Mining ◽ 10.4018/978-1-59140-414-9.ch014 ◽ 2011 ◽ pp. 307-321 ◽ Cited By ~ 14

Author(s): Ricardo Baeza-Yates

Keyword(s): User Interface ◽ Search Engine ◽ Power Law ◽ Search Engines ◽ Real Data ◽ Web Usage Mining ◽ Power Law Distribution ◽ Main Ideas ◽ Answer Ranking ◽ Navigation Information

Search engine logs not only keep navigation information, but also the queries made by their users. In particular, queries to a search engine follow a power-law distribution, which is far from uniform. Queries and related clicks can be used to improve the search engine itself in different aspects: user interface, index performance, and answer ranking. In this chapter we present some of the main ideas proposed in query mining and we show a few examples based on real data from a search engine focused on the Chilean Web.
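As a rough illustration of the power-law claim in this abstract, the sketch below counts query frequencies and fits a straight line to the rank-frequency curve on log-log axes. It is an assumption-laden toy rather than the chapter's method: the function name `rank_frequency_slope` is hypothetical, the query log is synthetic, and an ordinary least-squares fit in log-log space is only a crude estimate of a power-law exponent.

```python
import math
from collections import Counter


def rank_frequency_slope(queries):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipf-like shape; values far from a
    straight line suggest the data are not well described by a power law.
    """
    counts = sorted(Counter(queries).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct queries")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic log (not real data): the i-th most popular query
    # appears about 1000 / i times.
    toy_log = [f"q{i}" for i in range(1, 51) for _ in range(round(1000 / i))]
    print(round(rank_frequency_slope(toy_log), 2))  # close to -1 for this toy log
```

On a real log the head and tail of the curve often deviate from the fitted line, so the slope is best read as a descriptive summary.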

Extracting and Customizing Information Using Multi-Agents

Web Mining ◽ 10.4018/978-1-59140-414-9.ch011 ◽ 2011 ◽ pp. 228-252 ◽ Cited By ~ 1

Author(s): Mohamed Salah Hamdi

Keyword(s): Main Idea ◽ Complex Environments ◽ Time Data ◽ Client Server ◽ Windows Nt ◽ Distributed Object ◽ Time Space ◽ Multi Agent ◽ Information Customization ◽ Pervasive Access

Rapidly evolving network and computer technology, coupled with the exponential growth of the services and information available on the Internet, has already brought us to the point where hundreds of millions of people should have fast, pervasive access to a phenomenal amount of information, through desktop machines at work, school and home, through televisions, phones, pagers, and car dashboards, from anywhere and everywhere. The challenge of complex environments is therefore obvious: software is expected to do more in more situations, there are a variety of users (Power/Naive, Techie/Financial/Clerical, ...), there are a variety of systems (Windows/NT/Mac/Unix, Client/Server, Portable, Distributed Object Manager, Web, ...), there are a variety of interactions (Real-time, Databases, Other Players, ...), and there are a variety of resources and goals (time, space, bandwidth, cost, security, quality, ...). To cope with such environments, the promise of information customization systems is becoming highly attractive. In this chapter we discuss important problems related to such systems and smooth the way for possible solutions. The main idea is to approach information customization using a multi-agent paradigm.

Ontology Learning from a Domain Web Corpus

Web Mining ◽ 10.4018/978-1-59140-414-9.ch004 ◽ 2011 ◽ pp. 69-98 ◽ Cited By ~ 1

Author(s): Roberto Navigli

Keyword(s): World Wide ◽ Shared Vision ◽ Domain Ontology ◽ Virtual Organizations ◽ Semantic Interpretation ◽ Application Domain ◽ Web Based ◽ Domain Experts ◽ Usable Knowledge ◽ The World

Domain ontologies are widely recognized as a key element for the so-called semantic Web, an improved, “semantic aware” version of the World Wide Web. Ontologies define concepts and interrelationships in order to provide a shared vision of a given application domain. Despite the significant amount of work in the field, ontologies are still scarcely used in Web-based applications. One of the main problems is the difficulty in identifying and defining relevant concepts within the domain. In this chapter, we provide an approach to the problem, defining a method and a tool, OntoLearn, aimed at the extraction of knowledge from Websites, and more generally from documents shared among the members of virtual organizations, to support the construction of a domain ontology. Exploiting the idea that a corpus of documents produced by a community is the most representative (although implicit) repository of concepts, the method extracts a terminology, provides a semantic interpretation of relevant terms and populates the domain ontology in an automatic manner. Finally, further manual corrections are required from domain experts in order to achieve a rich and usable knowledge resource.

Web Usage Mining

Web Mining ◽ 10.4018/978-1-59140-414-9.ch018 ◽ 2011 ◽ pp. 373-392 ◽ Cited By ~ 1

Author(s): Yew-Kwong Woon ◽ Wee-Keong Ng ◽ Ee-Peng Lim

Keyword(s): Data Mining ◽ World Wide ◽ Web Usage Mining ◽ Web Usage ◽ Online Business ◽ Web Access ◽ Business Competitiveness ◽ Access Logs ◽ Web Server Logs ◽ Web Access Logs

The rising popularity of electronic commerce makes data mining an indispensable technology for several applications, especially online business competitiveness. The World Wide Web provides abundant raw data in the form of Web access logs. However, without data mining techniques, it is difficult to make any sense out of such massive data. In this chapter, we focus on the mining of Web access logs, commonly known as Web usage mining. We analyze algorithms for preprocessing and extracting knowledge from such logs. We will also propose our own techniques to mine the logs in a more holistic manner. Experiments conducted on real Web server logs verify the practicality as well as the efficiency of the proposed techniques as compared to an existing technique. Finally, challenges in Web usage mining are discussed.

Exploiting Captions for Web Data Mining

Web Mining ◽ 10.4018/978-1-59140-414-9.ch006 ◽ 2011 ◽ pp. 119-144

Author(s): Neil C. Rowe

Keyword(s): Data Mining ◽ Survey Research ◽ Web Data ◽ Web Data Mining ◽ Other Information ◽ Media Objects ◽ The Media ◽ The Web ◽ Mapping Information

We survey research on using captions in data mining from the Web. Captions are text that describes some other information (typically, multimedia). Since text is considerably easier to analyze than non-text, a good way to support access to non-text is to index the words of its captions. However, captions vary considerably in form and content on the Web. We discuss the range of syntactic clues (such as HTML tags) and semantic clues (such as particular words). We discuss how to quantify clue strength and combine clues for a consensus. We then discuss the problem of mapping information in captions to information in media objects. While it is hard, classes of mapping schemes are distinguishable, and a segmentation of the media can be matched to a parse of the caption.

Metadata Management

Web Mining ◽ 10.4018/978-1-59140-414-9.ch001 ◽ 2011 ◽ pp. 1-26 ◽ Cited By ~ 1

Author(s): Gilbert W. Laware

Keyword(s): Data Mining ◽ Decision Making ◽ World Wide Web ◽ World Wide ◽ Metadata Management ◽ Web Content ◽ Current Information ◽ The World ◽ Increasing Demand

This chapter introduces the need for the World Wide Web to provide a standard mechanism so individuals can readily obtain data, reports, research and knowledge about any topic posted to it. Individuals have been frustrated by this process since they are not able to access relevant data and current information. Much of the reason for this lies with metadata, the data about the data that are used in support of Web content. These metadata are non-existent, ill-defined, erroneously labeled, or, if well-defined, continue to be marked by other disparate metadata. With the ever-increasing demand for Web-enabled data mining, warehousing and management of knowledge, an organization has to address the multiple facets of process, standards, technology, data mining, and warehousing management. This requires approaches to provide an integrated interchange of quality metadata that enables individuals to access Web content with the most relevant, contemporary data, information, and knowledge that are both content-rich and practical for decision-making situations.

Analysis of Document Viewing Patterns of Web Search Engine Users

Web Mining ◽ 10.4018/978-1-59140-414-9.ch016 ◽ 2011 ◽ pp. 339-354 ◽ Cited By ~ 6

Author(s): Bernard J. Jansen ◽ Amanda Spink

Keyword(s): Information Seeking ◽ Web Search ◽ Real Data ◽ Temporal Analysis ◽ Log Analysis ◽ Web Page ◽ Retrieval Systems ◽ Web Information ◽ Information Interaction ◽ Information Retrieval Systems

This chapter reviews the concepts of Web results page and Web page viewing patterns by users of Web search engines. It presents the advantages of using traditional transaction log analysis in identifying these patterns, serving as a basis for Web usage mining. The authors also present the results of a temporal analysis of Web page viewing, illustrating that the user-information interaction is extremely short. By using real data collected from real users interacting with real Web information retrieval systems, the authors aim to highlight one aspect of the complex environment of Web information seeking.

Data Cleansing and Validation for Multiple Site Link Structure Analysis

Web Mining ◽ 10.4018/978-1-59140-414-9.ch010 ◽ 2011 ◽ pp. 208-227 ◽ Cited By ~ 4

Author(s): Mike Thelwall

Keyword(s): Structure Analysis ◽ Application Area ◽ Multiple Site ◽ Web Data ◽ Data Cleansing ◽ Link Structure ◽ Main Application ◽ Web Structure ◽ Web Structure Mining ◽ Link Data

A range of techniques is described for cleansing and validating link data for use in different types of Web structure mining, and some applications are given. The main application area is Multiple Site Link Structure Analysis, which typically involves mining patterns from themed collections of Websites. The importance of data cleansing and validation stems from the fact that Web data are typically very messy. They involve extensive duplication of pages and page components, which may give meaningless results when raw Web data are analyzed.

Published By IGI Global
