Crawling the Deep Web Using Asynchronous Advantage Actor Critic Technique

Journal of Web Engineering ◽

10.13052/jwe1540-9589.20314 ◽

2021 ◽

Author(s):

Kapil Madan ◽

Rajesh Bhatia

Keyword(s):

Value Function ◽

General Purpose ◽

Deep Web ◽

Web Pages ◽

Coverage Ratio ◽

Digital World ◽

Learning Technique ◽

Policy Gradient ◽

Structured Information ◽

Surface Web

In the digital world, World Wide Web magnitude is expanding very promptly. Now a day, a rising number of data-centric websites require a mechanism to crawl the information. The information accessible through hyperlinks can easily be retrieved with general-purpose search engines. A massive chunk of the structured information is invisible behind the search forms. Such immense information is recognized as the deep web and has structured information as compared to the surface web. Crawling the content of deep web is very challenging and requires filling the search forms with suitable queries. This paper proposes an innovative technique using an Asynchronous Advantage Actor-Critic (A3C) to explore the unidentified deep web pages. It is based on the policy gradient deep reinforcement learning technique that parameterizes the policy and value function based on the reward system. A3C has one coordinator and various agents. These agents learn from different environments, update the local gradients to a coordinator, and produce a more stable system. The proposed technique has been validated with Open Directory Project (ODP). The experimental outcome shows that the proposed technique outperforms most of the prevailing techniques based on various metrics such as average precision-recall, average harvest rate, and coverage ratio.

Download Full-text

A FRAME WORK FOR WEB INFORMATION EXTRACTION AND ANALYSIS

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v7i2.3459 ◽

2013 ◽

Vol 7 (2) ◽

pp. 574-579 ◽

Cited By ~ 3

Author(s):

Dr Sunitha Abburu ◽

G. Suresh Babu

Keyword(s):

Information Extraction ◽

Data Extraction ◽

Research Work ◽

Web Pages ◽

Web Documents ◽

E Learning ◽

Structured Information ◽

Frame Work ◽

Effective Decision ◽

The Web

Day by day the volume of information availability in the web is growing significantly. There are several data structures for information available in the web such as structured, semi-structured and unstructured. Majority of information in the web is presented in web pages. The information presented in web pages is semi-structured.Â But the information required for a context are scattered in different web documents. It is difficult to analyze the large volumes of semi-structured information presented in the web pages and to make decisions based on the analysis. The current research work proposed a frame work for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies Â data extraction, data consolidation, data analysis and decision making based on the information presented in the web pages.The proposed frame work integrates web crawling, information extraction and data mining technologies for better information analysis that helps in effective decision making.Â Â It enables people and organizations to extract information from various sourses of web and to make an effective analysis on the extracted data for effective decision making.Â The proposed frame work is applicable for any application domain. Manufacturing,sales,tourisum,e-learning are various application to menction few.The frame work is implemetnted and tested for the effectiveness of the proposed system and the results are promising.

Download Full-text

Principled approach to the selection of the embedding dimension of networks

Nature Communications ◽

10.1038/s41467-021-23795-5 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Weiwei Gu ◽

Aditya Tandon ◽

Yong-Yeol Ahn ◽

Filippo Radicchi

Keyword(s):

Real World ◽

Structural Information ◽

General Purpose ◽

Embedding Dimension ◽

Network Embedding ◽

Machine Learning Technique ◽

Learning Technique ◽

Low Dimensional ◽

Large Corpus ◽

Selection Of

AbstractNetwork embedding is a general-purpose machine learning technique that encodes network structure in vector spaces with tunable dimension. Choosing an appropriate embedding dimension – small enough to be efficient and large enough to be effective – is challenging but necessary to generate embeddings applicable to a multitude of tasks. Existing strategies for the selection of the embedding dimension rely on performance maximization in downstream tasks. Here, we propose a principled method such that all structural information of a network is parsimoniously encoded. The method is validated on various embedding algorithms and a large corpus of real-world networks. The embedding dimension selected by our method in real-world networks suggest that efficient encoding in low-dimensional spaces is usually possible.

Download Full-text

The NPS Phenomenon and the Deep Web: Trends Analyses and Internet Snapshots

Global Journal of Health Science ◽

10.5539/gjhs.v9n11p71 ◽

2017 ◽

Vol 9 (11) ◽

pp. 71

Author(s):

Ahmed Al-Imam ◽

Ban A. AbdulMajeed

Keyword(s):

Middle Eastern ◽

Reliable Data ◽

Deep Web ◽

Google Trends ◽

Knowledge Discovery In Databases ◽

Novel Psychoactive Substances ◽

Psychoactive Substances ◽

Cross Sectional ◽

Surface Web ◽

Mining Tools

BACKGROUND: In relation to the phenomenon of novel psychoactive substances, activities on the surface web represent only the tip of the iceberg. The majority of the electronic commerce (e-commerce) activities exist on the deep web and the darknet. Observational analytic studies are failing to keep pace with these activities; these studies are either obsolete beyond the point in time of the taken internet snapshot or highly-consuming for resources including time, funding, and manpower.MATERIALS & METHODS: Cross-sectional and retrospective analyses via multiple Internet snapshots were carried out across Google Trends database and the e-markets on the darknet. Google Trends were scanned retrospectively (2012-2016) for keywords specific to the deep web in an aim to estimate and geo-map of the attentiveness (interest) of surface web users in the deep web and its illicit activities.RESULTS: The attentiveness of surface web users in the deep web was noticed to be incremented during 2013 and 2014; the top ten contributing countries were Norway, Germany, Denmark, Austria, Poland, Sweden, Slovenia, Switzerland, Finland, and Netherlands. Middle Eastern countries contributed minimally including; Syria, Iran, Israel, UAE, Morocco, Egypt, and Saudi Arabia. Power scoring of e-markets revealed that the top five markets were; AlphaBay, Agora, Nucleus, Abraxas, and Hansa. The most common categories of NPS on these markets were; cannabis and cannabimimetic (1st), stimulants (2nd), empathogens (3rd), and psychedelics (4th).CONCLUSION: The e-commerce activities on the deep web and the darknet e-marketplace represent an integral component of the NPS e-phenomenon. Unfortunately, recent attempts to examine and study those unlawful activities are outdated. Hence, to achieve real-time and reliable data, the inclusion of data mining tools and knowledge discovery in databases are critical to ensuring a future victory.

Download Full-text

Ekstraksi Informasi Halaman Web Menggunakan Pendekatan Bootstrapping pada Ontology-Based Information Extraction

IJCCS (Indonesian Journal of Computing and Cybernetics Systems) ◽

10.22146/ijccs.7540 ◽

2015 ◽

Vol 9 (2) ◽

pp. 111 ◽

Cited By ~ 1

Author(s):

Erma Susanti ◽

Khabib Mustofa

Keyword(s):

Information Extraction ◽

Language Processing ◽

Semantic Content ◽

Extraction Process ◽

Web Pages ◽

Structured Information ◽

Improved Performance ◽

Types Of Information ◽

Unstructured Information

AbstrakEkstraksi informasi merupakan suatu bidang ilmu untuk pengolahan bahasa alami, dengan cara mengubah teks tidak terstruktur menjadi informasi dalam bentuk terstruktur. Berbagai jenis informasi di Internet ditransmisikan secara tidak terstruktur melalui website, menyebabkan munculnya kebutuhan akan suatu teknologi untuk menganalisa teks dan menemukan pengetahuan yang relevan dalam bentuk informasi terstruktur. Contoh informasi tidak terstruktur adalah informasi utama yang ada pada konten halaman web. Bermacam pendekatan untuk ekstraksi informasi telah dikembangkan oleh berbagai peneliti, baik menggunakan metode manual atau otomatis, namun masih perlu ditingkatkan kinerjanya terkait akurasi dan kecepatan ekstraksi. Pada penelitian ini diusulkan suatu penerapan pendekatan ekstraksi informasi dengan mengkombinasikan pendekatan bootstrapping dengan Ontology-based Information Extraction (OBIE). Pendekatan bootstrapping dengan menggunakan sedikit contoh data berlabel, digunakan untuk memimalkan keterlibatan manusia dalam proses ekstraksi informasi, sedangkan penggunakan panduan ontologi untuk mengekstraksi classes (kelas), properties dan instance digunakan untuk menyediakan konten semantik untuk web semantik. Pengkombinasian kedua pendekatan tersebut diharapkan dapat meningkatan kecepatan proses ekstraksi dan akurasi hasil ekstraksi. Studi kasus untuk penerapan sistem ekstraksi informasi menggunakan dataset “LonelyPlanet”. Kata kunci—Ekstraksi informasi, ontologi, bootstrapping, Ontology-Based Information Extraction, OBIE, kinerja Abstract Information extraction is a field study of natural language processing by converting unstructured text into structured information. Several types of information on the Internet is transmitted through unstructured information via websites, led to emergence of the need a technology to analyze text and found relevant knowledge into structured information. For example of unstructured information is existing main information on the content of web pages. Various approaches for information extraction have been developed by many researchers, either using manual or automatic method, but still need to be improved performance related accuracy and speed of extraction. This research proposed an approach of information extraction that combines bootstrapping approach with Ontology-Based Information Extraction (OBIE). Bootstrapping approach using small seed of labelled data, is used to minimize human intervention on information extraction process, while the use of guide ontology for extracting classes, properties and instances, using for provide semantic content for semantic web. Combining both approaches expected to increase speed of extraction process and accuracy of extraction results. Case study to apply information extraction system using “LonelyPlanet” datasets. Keywords— Information extraction, ontology, bootstrapping, Ontology-Based Information Extraction, OBIE, performance

Download Full-text

Cross-Fertilizing Deep Web Analysis and Ontology Enrichment

10.31219/osf.io/b3fvz ◽

2017 ◽

Author(s):

Marilena Oita ◽

Antoine Amarilli ◽

Pierre Senellart

Keyword(s):

Domain Knowledge ◽

Deep Web ◽

Web Pages ◽

Complete Understanding ◽

Specific Knowledge ◽

Domain Specific ◽

Domain Specific Knowledge ◽

Web Crawlers ◽

New Perspective ◽

The Impact

Deep Web databases, whose content is presented as dynamically-generated Web pages hidden behind forms, have mostly been left unindexed by search engine crawlers. In order to automatically explore this mass of information, many current techniques assume the existence of domain knowledge, which is costly to create and maintain. In this article, we present a new perspective on form understanding and deep Web data acquisition that does not require any domain-specific knowledge. Unlike previous approaches, we do not perform the various steps in the process (e.g., form understanding, record identification, attribute labeling) independently but integrate them to achieve a more complete understanding of deep Web sources. Through information extraction techniques and using the form itself for validation, we reconcile input and output schemas in a labeled graph which is further aligned with a generic ontology. The impact of this alignment is threefold: first, the resulting semantic infrastructure associated with the form can assist Web crawlers when probing the form for content indexing; second, attributes of response pages are labeled by matching known ontology instances, and relations between attributes are uncovered; and third, we enrich the generic ontology with facts from the deep Web.

Download Full-text

Policy Direct Search for Effective Reinforcement Learning

10.26686/wgtn.17138678 ◽

2021 ◽

Author(s):

◽

Yiming Peng

Keyword(s):

Value Function ◽

Feature Learning ◽

Direct Search ◽

The State ◽

Cutting Edge ◽

Approximation Technique ◽

Control Problems ◽

Function Learning ◽

Policy Gradient ◽

Primal Dual

<p>Reinforcement Learning (RL) problems appear in diverse real-world applications and are gaining substantial attention in academia and industry. Policy Direct Search (PDS) is widely recognized as an effective approach to RL problems. However, existing PDS algorithms have some major limitations. First, many step-wise Policy Gradient Search (PGS) algorithms cannot effectively utilize informative historical gradients to accurately estimate policy gradients. Second, although evolutionary PDS algorithms do not rely on accurate policy gradient estimations and can explore learning environments effectively, they are not sample efficient at learning policies in the form of deep neural networks. Third, existing PGS algorithms often diverge easily due to the lack of reliable and flexible techniques for value function learning. Fourth, existing PGS algorithms have not provided suitable mechanisms to learn proper state features automatically. To address these limitations, the overall goal of this thesis is to develop effective policy direct search algorithms for tackling challenging RL problems through technical innovations in four key areas. First, the thesis aims to improve the accuracy of policy gradient estimation by utilizing historical gradients through a Primal-Dual Approximation technique. Second, the thesis targets on surpassing the state-of-the-art performance by properly balancing the exploration-exploitation trade-off via Covariance Matrix Adaption Evolutionary Strategy (CMA-ES) and Proximal Policy Optimization (PPO). Third, the thesis seeks to stabilize value function learning via a self-organized Sandpile Model (SM) meanwhile generalize the compatible condition to support flexible value function learning. Fourth, the thesis endeavors to develop innovative evolutionary feature learning techniques that are capable of automatically extracting useful state features so as to enhance various cutting-edge PGS algorithms. In the thesis, we explore the four key technical areas by studying policies with increasing complexity. First of all, we start the research from a simple linear policy representation, and then proceed to a complex neural network based policy representation. Next, we consider a more complicated situation where policy learning is coupled with a value function learning. Subsequently, we consider policies modeled as a concatenation of two interrelated networks, one for feature learning and one for action selection. To achieve the first goal, this thesis proposes a new policy gradient learning framework where a series of historical gradients are jointly exploited to obtain accurate policy gradient estimations via the Primal-Dual Approximation technique. Under the framework, three new PGS algorithms for step-wise policy training have been derived from three widely used PGS algorithms; meanwhile, the convergence properties of these new algorithms have been theoretically analyzed. The empirical results on several benchmark control problems further show that the newly proposed algorithms can significantly outperform their base algorithms. To achieve the second goal, this thesis develops a new sample efficient evolutionary deep policy optimization algorithm based on CMA-ES and PPO. The algorithm has a layer-wise learning mechanism to improve computational efficiency in comparison to CMA-ES. Additionally, it uses a performance lower bound based surrogate model for fitness evaluation to significantly reduce the sample cost to the state-of-the-art level. More importantly, the best policy found by CMA-ES at every generation is further improved by PPO to properly balance exploration and exploitation. The experimental results confirm that the proposed algorithm outperforms various cutting-edge algorithms on many benchmark continuous control problems. To achieve the third goal, this thesis develops new value function learning methods that are both reliable and flexible so as to further enhance the effectiveness of policy gradient search. Two Actor-Critic (AC) algorithms have been successfully developed from a commonly-used PGS algorithm, i.e., Regular Actor-Critic (RAC). The first algorithm adopts SM to stabilize value function learning, and the second algorithm generalizes the logarithm function used by the compatible condition to provide a flexible family of new compatible functions. The experimental results show that, with the help of reliable and flexible value function learning, the newly developed algorithms are more effective than RAC on several benchmark control problems. To achieve the fourth goal, this thesis develops innovative NeuroEvolution algorithms for automated feature learning to enhance various cutting-edge PGS algorithms. The newly developed algorithms not only can extract useful state features but also learn good policies. The experimental analysis demonstrates that the newly proposed algorithms can achieve better performance on large-scale RL problems in comparison to both well-known PGS algorithms and NeuroEvolution techniques. Our experiments also confirm that the state features learned by NeuroEvolution on one RL task can be easily transferred to boost learning performance on similar but different tasks.</p>

Download Full-text

Development of parser agents for bibliographic data retrieval

Bhartiya Krishi Anusandhan Patrika ◽

10.18805/bkap141 ◽

2018 ◽

Vol 33 (4) ◽

Author(s):

Murari Kumar ◽

Samir Farooqi ◽

K. K. Chaturvedi ◽

Chandan Kumar Deb ◽

Pankaj Das

Keyword(s):

Retrieval System ◽

Data Retrieval ◽

Web Pages ◽

Bibliographic Database ◽

Bibliographic Information ◽

Open Access Journals ◽

Bibliographic Records ◽

Bibliographic Data ◽

Structured Information ◽

The Web

Bibliographic data contains necessary information about literature to help users to recognize and retrieve that resource. These data are used quantitatively by a “Bibliometrician” for analysis and dissemination purpose but with the increasing rate of literature publication in open access journals such as Nucleic Acids Research (NAR), Springer, Oxford Journals etc., it has become difficult to retrieve structured bibliographic information in desired format. A digital bibliographic database contains necessary and structured information about published literature. Bibliographic records of different articles are scattered and resides on different web pages. This thesis presents the retrieval system for bibliographic data of NAR at a single place. For this purpose, parser agents have been developed which access the web pages of NAR and parse the scattered bibliographic data and finally store it into a local bibliographic database. Based on the bibliographic database, “three-tier architecture” has been utilized to display the bibliographic information in systematized format. Using this system, it would be possible to build the network between different authors and affiliations and also other analytical reports can be generated.

Download Full-text

Towards a Spatial Instance Learning Method for Deep Web Pages

Advances in Data Mining. Applications and Theoretical Aspects - Lecture Notes in Computer Science ◽

10.1007/978-3-642-23184-1_21 ◽

2011 ◽

pp. 270-285 ◽

Cited By ~ 1

Author(s):

Ermelinda Oro ◽

Massimo Ruffolo

Keyword(s):

Deep Web ◽

Web Pages ◽

Learning Method

Download Full-text

Search Engine

The Dark Web ◽

10.4018/978-1-5225-3163-0.ch016 ◽

2018 ◽

pp. 359-374

Author(s):

Dilip Kumar Sharma ◽

A. K. Sharma

Keyword(s):

Computer Networks ◽

Search Engines ◽

Web Search ◽

Relevant Information ◽

Vital Role ◽

Deep Web ◽

Telecommunication Networks ◽

Web Pages ◽

Web Crawler ◽

Main Components

ICT plays a vital role in human development through information extraction and includes computer networks and telecommunication networks. One of the important modules of ICT is computer networks, which are the backbone of the World Wide Web (WWW). Search engines are computer programs that browse and extract information from the WWW in a systematic and automatic manner. This paper examines the three main components of search engines: Extractor, a web crawler which starts with a URL; Analyzer, an indexer that processes words on the web page and stores the resulting index in a database; and Interface Generator, a query handler that understands the need and preferences of the user. This paper concentrates on the information available on the surface web through general web pages and the hidden information behind the query interface, called deep web. This paper emphasizes the Extraction of relevant information to generate the preferred content for the user as the first result of his or her search query. This paper discusses the aspect of deep web with analysis of a few existing deep web search engines.

Download Full-text

Deep Web Information Retrieval Process

The Dark Web ◽

10.4018/978-1-5225-3163-0.ch007 ◽

2018 ◽

pp. 114-137

Author(s):

Dilip Kumar Sharma ◽

A. K. Sharma

Keyword(s):

Information Retrieval ◽

General Framework ◽

Deep Web ◽

Web Searching ◽

Web Content ◽

Web Information Retrieval ◽

Technical Literature ◽

Web Information ◽

Web Crawlers ◽

Surface Web

Web crawlers specialize in downloading web content and analyzing and indexing from surface web, consisting of interlinked HTML pages. Web crawlers have limitations if the data is behind the query interface. Response depends on the querying party's context in order to engage in dialogue and negotiate for the information. In this paper, the authors discuss deep web searching techniques. A survey of technical literature on deep web searching contributes to the development of a general framework. Existing frameworks and mechanisms of present web crawlers are taxonomically classified into four steps and analyzed to find limitations in searching the deep web.

Download Full-text