The Evolution of the (Hidden) Web and Its Hidden Data

The Dark Web ◽  
2018 ◽  
pp. 84-113
Author(s):  
Manuel Álvarez Díaz ◽  
Víctor Manuel Prieto Álvarez ◽  
Fidel Cacheda Seijo
Keyword(s):  

This paper presents an analysis of the most important features of the Web, of its evolution, and of the implications for the tools that traverse it to index its content for later search. It is important to remark that some of these features cause a rather large subset of the Web to remain “hidden”. The analysis focuses on snapshots of the Global Web for six different years, 2009 to 2014. The results for each year are analyzed both independently and together, to facilitate the study of the features at any given time and of the changes between the analyzed years. The objective of the analysis is twofold: to characterize the Web and, more importantly, its evolution over time.
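
As a rough illustration of this kind of snapshot-based characterization (a sketch, not the authors' tooling), the following Python fragment computes the relative frequency of some hypothetical page features in each yearly snapshot and reports how each feature changes between the first and last year; the snapshot format and feature names are assumptions.

```python
# Minimal sketch (not the authors' tooling): summarising how often selected
# page features appear in yearly Web snapshots, to study their evolution.
# The snapshot format and feature names below are illustrative assumptions.
from collections import Counter

def feature_frequencies(snapshot):
    """snapshot: iterable of per-page feature sets, e.g. {"javascript", "forms"}."""
    counts = Counter()
    total = 0
    for features in snapshot:
        counts.update(features)
        total += 1
    return {f: counts[f] / total for f in counts}

def evolution(snapshots_by_year):
    """snapshots_by_year: {year: iterable of per-page feature sets}."""
    per_year = {year: feature_frequencies(s) for year, s in snapshots_by_year.items()}
    years = sorted(per_year)
    features = set().union(*(per_year[y].keys() for y in years))
    # Change of each feature's frequency between the first and last snapshot.
    return {f: per_year[years[-1]].get(f, 0.0) - per_year[years[0]].get(f, 0.0)
            for f in features}

if __name__ == "__main__":
    demo = {
        2009: [{"javascript"}, {"forms", "javascript"}, set()],
        2014: [{"javascript", "ajax"}, {"javascript"}, {"javascript", "forms"}],
    }
    print(evolution(demo))
```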


2019 ◽  
pp. 121-145
Author(s):  
Erdal Ozkaya ◽  
Rafiqul Islam
Keyword(s):  

Author(s):  
V Aruna, Et. al.

In recent years, with the advancement of technology, a lot of information has become available in different formats, and extracting knowledge from that data has become a very difficult task. Due to the vast amount of information available on the web, users find it difficult to extract relevant information or create new knowledge from it. To solve this problem, Web mining techniques are used to discover interesting patterns in hidden data. Web Usage Mining (WUM), a subset of Web Mining, helps in extracting the hidden knowledge present in Web log files, in recognizing the various interests of web users and in discovering customer behaviours. Web Usage Mining comprises three phases of data mining techniques: Data Pre-processing, Pattern Discovery and Pattern Analysis. This paper presents an updated, focused survey of the sequential pattern mining algorithms used in the Pattern Discovery phase of WUM: Apriori-based algorithms, Breadth-First-Search-based strategies, Depth-First-Search strategies, sequential closed-pattern algorithms and incremental pattern mining algorithms. Finally, a comparison is made based on the key features of these algorithms. This study gives a better understanding of the approaches to sequential pattern mining.
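
As a minimal illustration of the Apriori-based family surveyed here (not any specific WUM system), the sketch below counts frequent length-1 and length-2 sequential patterns in web-log sessions, using the Apriori property to prune the level-2 candidates; the session format is an assumption.

```python
# Minimal sketch, not any specific WUM system: an Apriori-style count of
# frequent length-1 and length-2 sequential patterns in web-log sessions.
# Sessions are assumed to be lists of page identifiers ordered by time.
from collections import Counter
from itertools import combinations

def frequent_sequences(sessions, min_support):
    # Level 1: pages that appear in at least min_support sessions.
    item_support = Counter()
    for session in sessions:
        item_support.update(set(session))
    frequent_items = {p for p, c in item_support.items() if c >= min_support}

    # Level 2 (Apriori pruning): only pairs of frequent pages are candidates.
    pair_support = Counter()
    for session in sessions:
        filtered = [p for p in session if p in frequent_items]
        seen = set()
        for a, b in combinations(filtered, 2):  # preserves the visit order a -> b
            if (a, b) not in seen:
                pair_support[(a, b)] += 1
                seen.add((a, b))
    frequent_pairs = {s: c for s, c in pair_support.items() if c >= min_support}
    return frequent_items, frequent_pairs

if __name__ == "__main__":
    sessions = [["home", "catalog", "cart"],
                ["home", "catalog", "help"],
                ["home", "cart"]]
    print(frequent_sequences(sessions, min_support=2))
```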


Author(s):  
Hiroaki Yamane ◽  
Masafumi Hagiwara

This paper proposes a tag line generating system using information extracted from the web. Tag lines sometimes attract attention even when they consist of word groups only indirectly related to the target. We use web information to extract hidden data and use several tag line corpora to collect a large number of tag lines. First, knowledge related to the input is obtained from the web. Then, the proposed system selects suitable words according to the theme. Model tag lines are also selected from the corpora using this knowledge. By inserting nouns, verbs and adjectives into the structure of the model tag lines, candidate sentences are generated. These tag line candidates are then filtered by their suitability as sentences using a text N-gram corpus. A subjective experiment measures the quality of the system-generated tag lines, and some of them are quite comparable to human-made ones.
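
A minimal sketch of the slot-filling step described above (not the proposed system): theme words are inserted into model tag-line templates and the candidates are ranked with a toy bigram score; the templates, word lists and bigram table are illustrative assumptions.

```python
# Minimal sketch of the slot-filling idea, not the proposed system: candidate
# tag lines are built by inserting theme words into model tag-line templates
# and then ranked with a toy bigram score. Templates, word lists and the
# bigram table are illustrative assumptions.
import itertools

TEMPLATES = ["{adj} {noun} for everyone", "{verb} your {noun}"]
THEME_WORDS = {"adj": ["fresh", "bold"],
               "noun": ["coffee", "morning"],
               "verb": ["awaken", "enjoy"]}
BIGRAM_COUNTS = {("fresh", "coffee"): 12, ("enjoy", "your"): 30,
                 ("your", "morning"): 25, ("bold", "coffee"): 4}

def bigram_score(sentence):
    words = sentence.split()
    return sum(BIGRAM_COUNTS.get((a, b), 0) for a, b in zip(words, words[1:]))

def generate(templates, theme_words, top_k=3):
    candidates = []
    for template in templates:
        slots = [s for s in ("adj", "noun", "verb") if "{" + s + "}" in template]
        for combo in itertools.product(*(theme_words[s] for s in slots)):
            candidates.append(template.format(**dict(zip(slots, combo))))
    # Keep the candidates that read best according to the n-gram statistics.
    return sorted(candidates, key=bigram_score, reverse=True)[:top_k]

if __name__ == "__main__":
    print(generate(TEMPLATES, THEME_WORDS))
```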


2012 ◽  
Vol 9 (2) ◽  
pp. 561-583 ◽  
Author(s):  
Víctor Prieto ◽  
Manuel Álvarez ◽  
Rafael López-García ◽  
Fidel Cacheda

The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the “client-side” Hidden Web. First, we perform a thorough analysis of the different client-side technologies and the main features of web pages in order to determine the basic steps of the aforementioned scale. Then, we define the scale by grouping basic scenarios in terms of several common features, and we propose some methods to evaluate the effectiveness of crawlers according to the levels of the scale. Finally, we present a testing web site and we show the results of applying the aforementioned methods to some open-source and commercial crawlers that tried to traverse its pages. Only a few crawlers achieve good results in treating client-side technologies. Regarding standalone crawlers, we highlight the open-source crawlers Heritrix and Nutch and the commercial crawler WebCopierPro, which is able to process very complex scenarios. With regard to the crawlers of the main search engines, only Google processes most of the scenarios we have proposed, while Yahoo! and Bing deal only with the basic ones. There are not many studies that assess the capacity of crawlers to deal with client-side technologies, and those studies consider fewer technologies, fewer crawlers and fewer combinations. Furthermore, to the best of our knowledge, our article provides the first scale for classifying crawlers from the point of view of the most important client-side technologies.
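
A minimal sketch of how such a scale can be evaluated (not the authors' actual test site): each scenario exposes a target URL through a different client-side technique, and a crawler's coverage is read off from which targets appear in the server's access log; the scenario names and URLs are assumptions.

```python
# Minimal sketch of the evaluation idea, not the authors' test site: each
# scenario publishes a target URL through a different client-side technique,
# and a crawler's position on the scale is the set of scenario targets that
# show up in the web server's access log. Names and URLs are assumptions.
SCENARIOS = {
    "plain_link":        "/scale/plain.html",    # static <a href> (baseline)
    "js_document_write": "/scale/jswrite.html",  # link emitted via document.write
    "ajax_loaded":       "/scale/ajax.html",     # link inserted after an XHR call
}

def crawler_coverage(access_log_paths):
    """access_log_paths: iterable of request paths made by one crawler."""
    requested = set(access_log_paths)
    return {name: target in requested for name, target in SCENARIOS.items()}

if __name__ == "__main__":
    example_log = ["/scale/index.html", "/scale/plain.html", "/scale/jswrite.html"]
    # True means the crawler reached the content hidden behind that technique.
    print(crawler_coverage(example_log))
```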


Author(s):  
Otto Hans-Martin Lutz ◽  
Jacob Leon Kröger ◽  
Manuel Schneiderbauer ◽  
Manfred Hauswirth

Web tracking is found on 90% of common websites. It allows online behavioral analysis, which can reveal insights into sensitive personal data of an individual. Most users are not aware of the amount of web tracking happening in the background. This paper contributes a sonification-based approach to raise user awareness by conveying information on web tracking through sound while the user is browsing the web. We present a framework for live web tracking analysis, conversion to Open Sound Control events and sonification. The amount of web tracking is disclosed by sound each time data is exchanged with a web tracking host. When a connection to one of the most prevalent tracking companies is established, this is additionally indicated by a voice whispering the company name. Compared to existing approaches to web tracking sonification, we add the capability to monitor any network connection, including all browsers, applications and devices. An initial user study with 12 participants showed empirical support for our main hypothesis: exposure to our sonification significantly raises web tracking awareness.
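
A minimal sketch of the mapping from network events to sound events (not the authors' framework): observed connection hostnames are matched against a tracker list and turned into abstract events, with the emit() callback standing in for Open Sound Control dispatch; the tracker domains are illustrative assumptions.

```python
# Minimal sketch of the mapping from network events to sonification events,
# not the authors' framework: observed connection hostnames are matched
# against a tracker list and converted into abstract sound events. The
# tracker domains and the emit() callback (standing in for Open Sound
# Control dispatch) are illustrative assumptions.
TRACKER_DOMAINS = {
    "doubleclick.net": "Google",
    "facebook.com": "Facebook",
    "scorecardresearch.com": "comScore",
}

def match_tracker(hostname):
    """Return the tracking company name if the hostname belongs to a known tracker."""
    for domain, company in TRACKER_DOMAINS.items():
        if hostname == domain or hostname.endswith("." + domain):
            return company
    return None

def sonify_connections(hostnames, emit):
    for host in hostnames:
        company = match_tracker(host)
        if company is not None:
            emit("/tracking/event", host)       # generic sound for any tracker hit
            emit("/tracking/company", company)  # whispered company name

if __name__ == "__main__":
    observed = ["example.org", "stats.doubleclick.net", "www.facebook.com"]
    sonify_connections(observed, emit=lambda address, value: print(address, value))
```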


2016 ◽  
Vol 12 (3) ◽  
pp. 134-163 ◽  
Author(s):  
Ali Hasnain ◽  
Qaiser Mehmood ◽  
Syeda Sana e Zainab ◽  
Aidan Hogan

Access to hundreds of knowledge bases has been made available on the Web through public SPARQL endpoints. Unfortunately, few endpoints publish descriptions of their content (e.g., using VoID). It is thus unclear how agents can learn about the content of a given SPARQL endpoint or, relatedly, find SPARQL endpoints with content relevant to their needs. In this paper, the authors investigate the feasibility of a system that gathers information about public SPARQL endpoints by querying them directly about their own content. With the advent of SPARQL 1.1 and features such as aggregates, it is now possible to specify queries whose results would form a detailed profile of the content of the endpoint, comparable with a large subset of VoID. In theory it would thus be feasible to build a rich centralised catalogue describing the content indexed by individual endpoints by issuing them SPARQL (1.1) queries; this catalogue could then be searched and queried by agents looking for endpoints with content they are interested in. In practice, however, the coverage of the catalogue is bounded by the limitations of public endpoints themselves: some may not support SPARQL 1.1, some may return partial responses, some may throw exceptions for expensive aggregate queries, etc. The authors' goal in this paper is thus twofold: (i) using VoID as a bar, to empirically investigate the extent to which public endpoints can describe their own content, and (ii) to build and analyse the capabilities of a best-effort online catalogue of current endpoints based on the (partial) results collected.
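
A minimal sketch of the self-description idea (not the authors' catalogue): a few SPARQL 1.1 aggregate queries approximating VoID-style statistics are issued to a public endpoint, tolerating the failures that partially supporting endpoints may produce; it assumes the SPARQLWrapper package, and the endpoint URL is only an example.

```python
# Minimal sketch, not the authors' catalogue: a few SPARQL 1.1 aggregate
# queries that approximate VoID-style statistics are sent to a public
# endpoint, tolerating failures from endpoints that do not support them.
# Assumes the SPARQLWrapper package; the endpoint URL is only an example.
from SPARQLWrapper import SPARQLWrapper, JSON

PROFILE_QUERIES = {
    "triples":    "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }",
    "classes":    "SELECT (COUNT(DISTINCT ?c) AS ?n) WHERE { ?s a ?c }",
    "predicates": "SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE { ?s ?p ?o }",
}

def profile_endpoint(endpoint_url):
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setReturnFormat(JSON)
    profile = {}
    for name, query in PROFILE_QUERIES.items():
        sparql.setQuery(query)
        try:
            bindings = sparql.query().convert()["results"]["bindings"]
            profile[name] = int(bindings[0]["n"]["value"])
        except Exception:
            profile[name] = None  # endpoint may time out or reject expensive aggregates
    return profile

if __name__ == "__main__":
    print(profile_endpoint("https://dbpedia.org/sparql"))
```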

