The Evolution of the (Hidden) Web and Its Hidden Data

The Dark Web ◽  
2018 ◽  
pp. 84-113
Author(s):  
Manuel Álvarez Díaz ◽  
Víctor Manuel Prieto Álvarez ◽  
Fidel Cacheda Seijo
Keyword(s):  

This paper presents an analysis of the most important features of the Web, of its evolution, and of the implications for the tools that traverse it to index its content for later search. It is important to remark that some of these features cause a rather large subset of the Web to remain “hidden”. The analysis focuses on snapshots of the Global Web for six different years, 2009 to 2014. The results for each year are analyzed both independently and together, to facilitate the study of the features at any given time and of the changes between the analyzed years. The objective of the analysis is twofold: to characterize the Web and, more importantly, its evolution over time.
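
As a rough illustration of this kind of snapshot-based characterization (a sketch, not the authors' tooling), the following Python fragment computes the relative frequency of some hypothetical page features in each yearly snapshot and reports how each feature changes between the first and last year; the snapshot format and feature names are assumptions.

```python
# Minimal sketch (not the authors' tooling): summarising how often selected
# page features appear in yearly Web snapshots, to study their evolution.
# The snapshot format and feature names below are illustrative assumptions.
from collections import Counter

def feature_frequencies(snapshot):
    """snapshot: iterable of per-page feature sets, e.g. {"javascript", "forms"}."""
    counts = Counter()
    total = 0
    for features in snapshot:
        counts.update(features)
        total += 1
    return {f: counts[f] / total for f in counts}

def evolution(snapshots_by_year):
    """snapshots_by_year: {year: iterable of per-page feature sets}."""
    per_year = {year: feature_frequencies(s) for year, s in snapshots_by_year.items()}
    years = sorted(per_year)
    features = set().union(*(per_year[y].keys() for y in years))
    # Change of each feature's frequency between the first and last snapshot.
    return {f: per_year[years[-1]].get(f, 0.0) - per_year[years[0]].get(f, 0.0)
            for f in features}

if __name__ == "__main__":
    demo = {
        2009: [{"javascript"}, {"forms", "javascript"}, set()],
        2014: [{"javascript", "ajax"}, {"javascript"}, {"javascript", "forms"}],
    }
    print(evolution(demo))
```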


2019 ◽  
pp. 121-145
Author(s):  
Erdal Ozkaya ◽  
Rafiqul Islam
Keyword(s):  

Author(s):  
V Aruna, Et. al.

In recent years, with the advancement of technology, a lot of information has become available in different formats, and extracting knowledge from that data has become a very difficult task. Due to the vast amount of information available on the web, users find it difficult to extract relevant information or create new knowledge from it. To solve this problem, Web mining techniques are used to discover interesting patterns in hidden data. Web Usage Mining (WUM), a subset of Web Mining, helps in extracting the hidden knowledge present in Web log files, in recognizing the various interests of web users and in discovering customer behaviours. Web Usage Mining comprises three phases of data mining techniques: Data Pre-processing, Pattern Discovery and Pattern Analysis. This paper presents an updated, focused survey of the sequential pattern mining algorithms used in the Pattern Discovery phase of WUM: Apriori-based algorithms, Breadth-First-Search-based strategies, Depth-First-Search strategies, sequential closed-pattern algorithms and incremental pattern mining algorithms. Finally, a comparison is made based on the key features of these algorithms. This study gives a better understanding of the approaches to sequential pattern mining.
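
As a minimal illustration of the Apriori-based family surveyed here (not any specific WUM system), the sketch below counts frequent length-1 and length-2 sequential patterns in web-log sessions, using the Apriori property to prune the level-2 candidates; the session format is an assumption.

```python
# Minimal sketch, not any specific WUM system: an Apriori-style count of
# frequent length-1 and length-2 sequential patterns in web-log sessions.
# Sessions are assumed to be lists of page identifiers ordered by time.
from collections import Counter
from itertools import combinations

def frequent_sequences(sessions, min_support):
    # Level 1: pages that appear in at least min_support sessions.
    item_support = Counter()
    for session in sessions:
        item_support.update(set(session))
    frequent_items = {p for p, c in item_support.items() if c >= min_support}

    # Level 2 (Apriori pruning): only pairs of frequent pages are candidates.
    pair_support = Counter()
    for session in sessions:
        filtered = [p for p in session if p in frequent_items]
        seen = set()
        for a, b in combinations(filtered, 2):  # preserves the visit order a -> b
            if (a, b) not in seen:
                pair_support[(a, b)] += 1
                seen.add((a, b))
    frequent_pairs = {s: c for s, c in pair_support.items() if c >= min_support}
    return frequent_items, frequent_pairs

if __name__ == "__main__":
    sessions = [["home", "catalog", "cart"],
                ["home", "catalog", "help"],
                ["home", "cart"]]
    print(frequent_sequences(sessions, min_support=2))
```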


Author(s):  
Hiroaki Yamane ◽  
Masafumi Hagiwara

This paper proposes a tag line generating system using information extracted from the web. Tag lines sometimes attract attention even when they consist of word groups only indirectly related to the target. We use web information to extract hidden data and use several tag line corpora to collect a large number of tag lines. First, knowledge related to the input is obtained from the web. Then, the proposed system selects suitable words according to the theme. Model tag lines are also selected from the corpora using this knowledge. By inserting nouns, verbs and adjectives into the structure of the model tag lines, candidate sentences are generated. These tag line candidates are then filtered by their suitability as sentences using a text N-gram corpus. A subjective experiment measures the quality of the system-generated tag lines, and some of them are quite comparable to human-made ones.
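
A minimal sketch of the slot-filling step described above (not the proposed system): theme words are inserted into model tag-line templates and the candidates are ranked with a toy bigram score; the templates, word lists and bigram table are illustrative assumptions.

```python
# Minimal sketch of the slot-filling idea, not the proposed system: candidate
# tag lines are built by inserting theme words into model tag-line templates
# and then ranked with a toy bigram score. Templates, word lists and the
# bigram table are illustrative assumptions.
import itertools

TEMPLATES = ["{adj} {noun} for everyone", "{verb} your {noun}"]
THEME_WORDS = {"adj": ["fresh", "bold"],
               "noun": ["coffee", "morning"],
               "verb": ["awaken", "enjoy"]}
BIGRAM_COUNTS = {("fresh", "coffee"): 12, ("enjoy", "your"): 30,
                 ("your", "morning"): 25, ("bold", "coffee"): 4}

def bigram_score(sentence):
    words = sentence.split()
    return sum(BIGRAM_COUNTS.get((a, b), 0) for a, b in zip(words, words[1:]))

def generate(templates, theme_words, top_k=3):
    candidates = []
    for template in templates:
        slots = [s for s in ("adj", "noun", "verb") if "{" + s + "}" in template]
        for combo in itertools.product(*(theme_words[s] for s in slots)):
            candidates.append(template.format(**dict(zip(slots, combo))))
    # Keep the candidates that read best according to the n-gram statistics.
    return sorted(candidates, key=bigram_score, reverse=True)[:top_k]

if __name__ == "__main__":
    print(generate(TEMPLATES, THEME_WORDS))
```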


2012 ◽  
Vol 9 (2) ◽  
pp. 561-583 ◽  
Author(s):  
Víctor Prieto ◽  
Manuel Álvarez ◽  
Rafael López-García ◽  
Fidel Cacheda

The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the “client-side” Hidden Web. First, we perform a thorough analysis of the different client-side technologies and the main features of web pages in order to determine the basic steps of the aforementioned scale. Then, we define the scale by grouping basic scenarios in terms of several common features, and we propose some methods to evaluate the effectiveness of crawlers according to the levels of the scale. Finally, we present a testing web site and we show the results of applying the aforementioned methods to some open-source and commercial crawlers that tried to traverse its pages. Only a few crawlers achieve good results in treating client-side technologies. Regarding standalone crawlers, we highlight the open-source crawlers Heritrix and Nutch and the commercial crawler WebCopierPro, which is able to process very complex scenarios. With regard to the crawlers of the main search engines, only Google processes most of the scenarios we have proposed, while Yahoo! and Bing deal only with the basic ones. There are not many studies that assess the capacity of crawlers to deal with client-side technologies, and those studies consider fewer technologies, fewer crawlers and fewer combinations. Furthermore, to the best of our knowledge, our article provides the first scale for classifying crawlers from the point of view of the most important client-side technologies.
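
A minimal sketch of how such a scale can be evaluated (not the authors' actual test site): each scenario exposes a target URL through a different client-side technique, and a crawler's coverage is read off from which targets appear in the server's access log; the scenario names and URLs are assumptions.

```python
# Minimal sketch of the evaluation idea, not the authors' test site: each
# scenario publishes a target URL through a different client-side technique,
# and a crawler's position on the scale is the set of scenario targets that
# show up in the web server's access log. Names and URLs are assumptions.
SCENARIOS = {
    "plain_link":        "/scale/plain.html",    # static <a href> (baseline)
    "js_document_write": "/scale/jswrite.html",  # link emitted via document.write
    "ajax_loaded":       "/scale/ajax.html",     # link inserted after an XHR call
}

def crawler_coverage(access_log_paths):
    """access_log_paths: iterable of request paths made by one crawler."""
    requested = set(access_log_paths)
    return {name: target in requested for name, target in SCENARIOS.items()}

if __name__ == "__main__":
    example_log = ["/scale/index.html", "/scale/plain.html", "/scale/jswrite.html"]
    # True means the crawler reached the content hidden behind that technique.
    print(crawler_coverage(example_log))
```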


Author(s):  
Otto Hans-Martin Lutz ◽  
Jacob Leon Kröger ◽  
Manuel Schneiderbauer ◽  
Manfred Hauswirth

Web tracking is found on 90% of common websites. It allows online behavioral analysis, which can reveal insights into sensitive personal data of an individual. Most users are not aware of the amount of web tracking happening in the background. This paper contributes a sonification-based approach to raise user awareness by conveying information on web tracking through sound while the user is browsing the web. We present a framework for live web tracking analysis, conversion to Open Sound Control events and sonification. The amount of web tracking is disclosed by sound each time data is exchanged with a web tracking host. When a connection to one of the most prevalent tracking companies is established, this is additionally indicated by a voice whispering the company name. Compared to existing approaches to web tracking sonification, we add the capability to monitor any network connection, including all browsers, applications and devices. An initial user study with 12 participants showed empirical support for our main hypothesis: exposure to our sonification significantly raises web tracking awareness.
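
A minimal sketch of the mapping from network events to sound events (not the authors' framework): observed connection hostnames are matched against a tracker list and turned into abstract events, with the emit() callback standing in for Open Sound Control dispatch; the tracker domains are illustrative assumptions.

```python
# Minimal sketch of the mapping from network events to sonification events,
# not the authors' framework: observed connection hostnames are matched
# against a tracker list and converted into abstract sound events. The
# tracker domains and the emit() callback (standing in for Open Sound
# Control dispatch) are illustrative assumptions.
TRACKER_DOMAINS = {
    "doubleclick.net": "Google",
    "facebook.com": "Facebook",
    "scorecardresearch.com": "comScore",
}

def match_tracker(hostname):
    """Return the tracking company name if the hostname belongs to a known tracker."""
    for domain, company in TRACKER_DOMAINS.items():
        if hostname == domain or hostname.endswith("." + domain):
            return company
    return None

def sonify_connections(hostnames, emit):
    for host in hostnames:
        company = match_tracker(host)
        if company is not None:
            emit("/tracking/event", host)       # generic sound for any tracker hit
            emit("/tracking/company", company)  # whispered company name

if __name__ == "__main__":
    observed = ["example.org", "stats.doubleclick.net", "www.facebook.com"]
    sonify_connections(observed, emit=lambda address, value: print(address, value))
```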


2016 ◽  
Vol 12 (3) ◽  
pp. 134-163 ◽  
Author(s):  
Ali Hasnain ◽  
Qaiser Mehmood ◽  
Syeda Sana e Zainab ◽  
Aidan Hogan

Access to hundreds of knowledge bases has been made available on the Web through public SPARQL endpoints. Unfortunately, few endpoints publish descriptions of their content (e.g., using VoID). It is thus unclear how agents can learn about the content of a given SPARQL endpoint or, relatedly, find SPARQL endpoints with content relevant to their needs. In this paper, the authors investigate the feasibility of a system that gathers information about public SPARQL endpoints by querying them directly about their own content. With the advent of SPARQL 1.1 and features such as aggregates, it is now possible to specify queries whose results would form a detailed profile of the content of the endpoint, comparable with a large subset of VoID. In theory it would thus be feasible to build a rich centralised catalogue describing the content indexed by individual endpoints by issuing them SPARQL (1.1) queries; this catalogue could then be searched and queried by agents looking for endpoints with content they are interested in. In practice, however, the coverage of the catalogue is bounded by the limitations of public endpoints themselves: some may not support SPARQL 1.1, some may return partial responses, some may throw exceptions for expensive aggregate queries, etc. The authors' goal in this paper is thus twofold: (i) using VoID as a bar, to empirically investigate the extent to which public endpoints can describe their own content, and (ii) to build and analyse the capabilities of a best-effort online catalogue of current endpoints based on the (partial) results collected.
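
A minimal sketch of the self-description idea (not the authors' catalogue): a few SPARQL 1.1 aggregate queries approximating VoID-style statistics are issued to a public endpoint, tolerating the failures that partially supporting endpoints may produce; it assumes the SPARQLWrapper package, and the endpoint URL is only an example.

```python
# Minimal sketch, not the authors' catalogue: a few SPARQL 1.1 aggregate
# queries that approximate VoID-style statistics are sent to a public
# endpoint, tolerating failures from endpoints that do not support them.
# Assumes the SPARQLWrapper package; the endpoint URL is only an example.
from SPARQLWrapper import SPARQLWrapper, JSON

PROFILE_QUERIES = {
    "triples":    "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }",
    "classes":    "SELECT (COUNT(DISTINCT ?c) AS ?n) WHERE { ?s a ?c }",
    "predicates": "SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE { ?s ?p ?o }",
}

def profile_endpoint(endpoint_url):
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setReturnFormat(JSON)
    profile = {}
    for name, query in PROFILE_QUERIES.items():
        sparql.setQuery(query)
        try:
            bindings = sparql.query().convert()["results"]["bindings"]
            profile[name] = int(bindings[0]["n"]["value"])
        except Exception:
            profile[name] = None  # endpoint may time out or reject expensive aggregates
    return profile

if __name__ == "__main__":
    print(profile_endpoint("https://dbpedia.org/sparql"))
```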

