IHWC: intelligent hidden web crawler for harvesting data in urban domains

Author(s):  
Sawroop Kaur ◽  
Aman Singh ◽  
G. Geetha ◽  
Xiaochun Cheng

Abstract: Due to the massive size of the hidden web, searching, retrieving and mining rich, high-quality data can be a daunting task. Moreover, because this data sits behind forms, it cannot be accessed easily. Forms are dynamic, heterogeneous and spread over trillions of web pages. Significant efforts have addressed the problem of tapping into the hidden web to integrate and mine rich data, but effective techniques, as well as applications to special cases, still need to be explored to achieve a good harvest rate. One such special area is atmospheric science, where hidden web crawling is rarely implemented and the crawler must traverse a huge portion of the web to narrow the search down to specific data. In this study, an intelligent hidden web crawler for harvesting data in urban domains (IHWC) is implemented to address the related problems of domain classification, prevention of exhaustive searching, and URL prioritization. The crawler also performs well in curating pollution-related data. It targets relevant web pages and discards irrelevant ones by applying rejection rules, and to achieve more accurate results for a focused crawl, IHWC visits the websites with the highest priority for a given topic first. The crawler fulfills the dual objective of developing an effective hidden web crawler that can focus on diverse domains and of checking its integration into searching pollution data in smart cities. Since one objective of smart cities is to reduce pollution, the crawled data can be used to find the causes of pollution, and the crawler can help users check the level of pollution in a specific area. The harvest rate of the crawler is compared with pioneering existing work. With an increase in dataset size, the presented crawler can add significant value to emission accuracy. Our results demonstrate the accuracy and harvest rate of the proposed framework, which efficiently collects hidden web interfaces from large-scale sites and achieves higher harvest rates than other crawlers.
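
The abstract names two concrete mechanisms, rejection rules and URL prioritization, without giving code. A minimal sketch of how a best-first focused crawler can combine the two is shown below; the rejection patterns, topic keywords, relevance threshold, and helper names (fetch, extract_links) are all illustrative assumptions, not the authors' implementation.

```python
import heapq
import re

# Illustrative rejection rules and topic keywords -- assumptions,
# not the authors' actual rule set.
REJECT_PATTERNS = [re.compile(p) for p in (r"\.(jpg|png|css|js)$", r"/login", r"/ads?/")]
TOPIC_KEYWORDS = {"pollution", "air quality", "emission", "smog"}

def is_rejected(url: str) -> bool:
    """Rejection rules: discard URLs that cannot lead to relevant pages or forms."""
    return any(p.search(url) for p in REJECT_PATTERNS)

def relevance(text: str) -> float:
    """Crude topic score: fraction of topic keywords present in the text."""
    text = text.lower()
    return sum(kw in text for kw in TOPIC_KEYWORDS) / len(TOPIC_KEYWORDS)

def crawl(seeds, fetch, extract_links, budget=1000):
    """Best-first crawl; heapq is a min-heap, so scores are negated."""
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen, harvested = set(seeds), []
    while frontier and budget:
        _score, url = heapq.heappop(frontier)
        if is_rejected(url):
            continue
        page = fetch(url)                 # caller-supplied HTTP fetch
        budget -= 1
        if relevance(page.text) > 0.25:   # threshold is an assumption
            harvested.append(url)
        for link, anchor in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(anchor), link))
    return harvested
```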

2018 ◽  
Vol 7 (3) ◽  
pp. 1119
Author(s):  
Jyoti Mor ◽  
Dr Dinesh Rai ◽  
Dr Naresh Kumar

In a large collection of web pages, it is difficult for search engines to keep their online repositories updated. Major search engines run hundreds of web crawlers that crawl the WWW day and night and send the downloaded web pages over a network to be stored in the search engine's database. This results in over-utilization of shared network resources such as bandwidth and CPU cycles. This paper proposes an architecture that reduces the utilization of shared network resources with the help of an advanced XML-based approach. This focused-crawling architecture is trained to download only high-quality data from the internet, leaving behind web pages that are not relevant to the desired domain. A detailed layout of the proposed system is described, which is capable of reducing the load on the network and mitigating the problems arising from the residency of a mobile agent at the remote server.
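
The abstract does not specify the XML schema behind the "advanced XML-based approach"; the sketch below assumes a hypothetical domain descriptor and shows how such a descriptor could configure a focused crawler to skip low-quality pages before download, which is where the bandwidth saving would come from.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain descriptor; the paper's actual XML schema is not given.
DOMAIN_XML = """
<domain name="atmospheric-science">
  <keyword weight="1.0">pollution</keyword>
  <keyword weight="0.6">aerosol</keyword>
  <exclude>sports</exclude>
</domain>
"""

def load_domain(xml_text):
    """Parse the descriptor into weighted keywords and exclusion terms."""
    root = ET.fromstring(xml_text)
    keywords = {k.text: float(k.get("weight", 1.0)) for k in root.iter("keyword")}
    excludes = {e.text for e in root.iter("exclude")}
    return keywords, excludes

def page_score(text, keywords, excludes):
    """Download only pages whose weighted keyword score clears a threshold."""
    text = text.lower()
    if any(x in text for x in excludes):
        return 0.0
    return sum(w for kw, w in keywords.items() if kw in text)

keywords, excludes = load_domain(DOMAIN_XML)
print(page_score("study of urban pollution and aerosol levels", keywords, excludes))
```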


2022 ◽  
Vol 12 (1) ◽  
pp. 0-0

The WWW contains a huge amount of information from different areas. This information may be present in the form of web pages, media, articles (research journals/magazines), blogs, and so on. A major portion of it resides in web databases that can only be retrieved by submitting queries through the interface offered by the specific database, and is therefore called the Hidden Web. An important issue is how to efficiently retrieve and provide access to this enormous amount of information through crawling. In this paper, we present the architecture of a parallel crawler for the Hidden Web that avoids download overlaps by following a domain-specific approach. The experimental results further show that the proposed parallel Hidden Web crawler (PSHWC) extracts and downloads the contents of Hidden Web databases both effectively and efficiently.
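
The paper's overlap-avoidance mechanism is described only at the architecture level. One common way to guarantee that parallel crawlers never download the same page twice is to partition URLs by host, so that each Hidden Web site (its search form and result pages) is owned by exactly one crawler; the sketch below illustrates that idea, with the crawler count and hashing choice being assumptions.

```python
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4  # illustrative; the paper's configuration is not specified

def assign_crawler(url: str) -> int:
    """Partition by host so every URL has exactly one owner -- no overlaps.

    Hashing the host (rather than the full URL) keeps each Hidden Web site,
    and hence each search form and its result pages, on a single crawler.
    """
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

# Example: both URLs from the same site land on the same crawler instance.
print(assign_crawler("http://books.example.com/search?q=a"))
print(assign_crawler("http://books.example.com/search?q=b"))
```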


Electronics ◽  
2021 ◽  
Vol 10 (2) ◽  
pp. 218
Author(s):  
Ala’ Khalifeh ◽  
Khalid A. Darabkh ◽  
Ahmad M. Khasawneh ◽  
Issa Alqaisieh ◽  
Mohammad Salameh ◽  
...  

The advent of various wireless technologies has paved the way for the realization of new infrastructures and applications for smart cities. Wireless Sensor Networks (WSNs) are among the most important of these technologies. WSNs are widely used in various applications in our daily lives. Due to their cost-effectiveness and rapid deployment, WSNs can be used for securing smart cities by providing remote monitoring and sensing for many critical scenarios, including hostile environments, battlefields, or areas subject to natural disasters such as earthquakes, volcano eruptions, and floods, or to large-scale accidents such as nuclear plant explosions or chemical plumes. The purpose of this paper is to propose a new framework in which WSNs are adopted for remote sensing and monitoring in smart city applications. We propose using Unmanned Aerial Vehicles as data mules to offload the sensor nodes and transfer the monitoring data securely to the remote control center for further analysis and decision making. Furthermore, the paper provides insight into the implementation challenges in realizing the proposed framework. In addition, the paper provides an experimental evaluation of the proposed design in outdoor environments, in the presence of different types of obstacles common to typical outdoor fields. The experimental evaluation revealed several inconsistencies between the performance metrics advertised in the hardware-specific data-sheets and those measured in the field; in particular, we found mismatches between the advertised coverage distance and signal strength and our experimental measurements. It is therefore crucial that network designers and developers conduct field tests and device performance assessments before designing and implementing a WSN for a real field setting.
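
A field test of the kind the authors recommend boils down to comparing data-sheet or model predictions against measured values. The sketch below uses the standard log-distance path-loss model to generate the "expected" side of such a comparison; the reference RSSI, path-loss exponent, and measurement values are hypothetical placeholders, not the paper's data.

```python
import math

def expected_rssi(d_m, rssi_1m=-40.0, n=2.7):
    """Log-distance path-loss model: RSSI(d) = RSSI(1 m) - 10 * n * log10(d).

    rssi_1m and the path-loss exponent n are environment-dependent
    assumptions (n ~ 2.7 suits an outdoor field with obstacles).
    """
    return rssi_1m - 10 * n * math.log10(d_m)

# Hypothetical field measurements (distance in metres, measured RSSI in dBm).
measurements = [(10, -68), (50, -88), (100, -97)]
for d, measured in measurements:
    print(f"{d:4d} m: model {expected_rssi(d):6.1f} dBm, measured {measured} dBm")
```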


Author(s):  
Bassel Al Homssi ◽  
Akram Al-Hourani ◽  
Kagiso Magowe ◽  
James Delaney ◽  
Neil Tom ◽  
...  

2020 ◽  
Vol 10 (1) ◽  
pp. 1-16
Author(s):  
Isaac Nyabisa Oteyo ◽  
Mary Esther Muyoka Toili

Abstract: Researchers in the bio-sciences are increasingly harnessing technology to improve processes that were traditionally pegged on pen and paper and highly manual. The pen-and-paper approach is used mainly to record and capture data from experiment sites. This method is typically slow and prone to errors. Moreover, bio-science research activities are often undertaken in remote and distributed locations, and the timeliness and quality of the data collected are essential. The manual method is too slow to collect quality data and relay it in a timely manner, and capturing data manually and relaying it in real time is a daunting task, since the collected data has to be associated with the respective specimens (objects or plants). In this paper, we seek to improve specimen labelling and data collection guided by the following questions: (1) How can data collection in bio-science research be improved? (2) How can specimen labelling be improved in bio-science research activities? We present WebLog, a prototype application that helps researchers generate specimen labels and collect data from experiment sites. The application converts the object (specimen) identifiers into quick response (QR) codes and uses them to label the specimens. Once a specimen label is successfully scanned, the application automatically invokes the data entry form, and the collected data is immediately sent to the server in electronic form for analysis.
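
The abstract describes converting specimen identifiers into QR codes for labels. A minimal sketch of that step using the third-party Python qrcode package is given below; the identifier format and file layout are assumptions, since WebLog's actual stack is not stated.

```python
# pip install qrcode[pil]  -- third-party library, chosen here for illustration.
import qrcode

def label_specimen(specimen_id: str, out_dir: str = ".") -> str:
    """Encode a specimen identifier as a QR code image for printing as a label."""
    img = qrcode.make(specimen_id)
    path = f"{out_dir}/{specimen_id}.png"
    img.save(path)
    return path

# Hypothetical identifier format; a real scheme would encode site/plot/plant.
print(label_specimen("SITE03-PLOT12-PLANT007"))
```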


Smart Cities ◽  
2021 ◽  
Vol 4 (2) ◽  
pp. 662-685
Author(s):  
Stephan Olariu

Under present-day practices, the vehicles on our roadways and city streets are mere spectators that witness traffic-related events without being able to participate in mitigating their effects. This paper lays the theoretical foundations of a framework for harnessing the on-board computational resources of vehicles stuck in urban congestion in order to assist transportation agencies with preventing or dissipating congestion through large-scale signal re-timing. Our framework is called VACCS: Vehicular Crowdsourcing for Congestion Support in Smart Cities. What makes this framework unique is the suggestion that, in such situations, vehicles have the potential to cooperate with various transportation authorities to solve problems that would otherwise either take an inordinate amount of time to solve or could not be solved at all for lack of adequate municipal resources. VACCS offers direct benefits to both the driving public and the Smart City. By developing timing plans that respond to current traffic conditions, overall traffic flow will improve, carbon emissions will be reduced, and the economic impacts of congestion on citizens and businesses will be lessened. It is expected that drivers will be willing to donate under-utilized on-board computing resources in their vehicles to develop improved signal timing plans in return for the direct benefits of time savings and reduced fuel consumption. VACCS allows the Smart City to respond dynamically to traffic conditions while simultaneously reducing the investments in computational resources that traditional adaptive traffic signal control systems would require.


2021 ◽  
Vol 13 (2) ◽  
pp. 176
Author(s):  
Peng Zheng ◽  
Zebin Wu ◽  
Jin Sun ◽  
Yi Zhang ◽  
Yaoqin Zhu ◽  
...  

As the volume of remotely sensed data grows significantly, content-based image retrieval (CBIR) becomes increasingly important, especially for cloud computing platforms that facilitate processing and storing big data in a parallel and distributed way. This paper proposes a novel parallel CBIR system for a hyperspectral image (HSI) repository on cloud computing platforms, guided by unmixed spectral information, i.e., endmembers and their associated fractional abundances, to retrieve hyperspectral scenes. However, existing unmixing methods suffer an extremely high computational burden when extracting meta-data from large-scale HSI data. To address this limitation, we implement a distributed and parallel unmixing method that operates on cloud computing platforms to accelerate the unmixing processing flow. In addition, we implement a global standard distributed HSI repository equipped with a large spectral library in a software-as-a-service mode, providing users with HSI storage, management, and retrieval services through web interfaces. Furthermore, the parallel implementation of unmixing processing is incorporated into the CBIR system to establish the parallel unmixing-based content retrieval system. The performance of our proposed parallel CBIR system was verified in terms of both unmixing efficiency and accuracy.
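
The paper retrieves scenes by their unmixed spectral information. One simple way to turn endmember abundances into retrieval meta-data is to summarize each scene as a vector of mean fractional abundances and compare scenes by vector distance; the sketch below illustrates that idea with random data and is not the paper's actual retrieval metric.

```python
import numpy as np

def abundance_signature(abundance_maps):
    """Summarize an HSI scene by its mean fractional abundance per endmember.

    abundance_maps: array of shape (num_endmembers, H, W) from any unmixing
    method; averaging is one simple choice of scene-level meta-data.
    """
    return abundance_maps.reshape(abundance_maps.shape[0], -1).mean(axis=1)

def scene_distance(sig_a, sig_b):
    """Retrieve scenes by Euclidean distance between abundance signatures."""
    return float(np.linalg.norm(sig_a - sig_b))

# Random per-pixel abundances (Dirichlet, so they sum to 1) stand in for
# real unmixing output: shape (H, W, endmembers) -> (endmembers, H, W).
rng = np.random.default_rng(0)
query = abundance_signature(rng.dirichlet(np.ones(4), size=(64, 64)).transpose(2, 0, 1))
other = abundance_signature(rng.dirichlet(np.ones(4), size=(64, 64)).transpose(2, 0, 1))
print(scene_distance(query, other))
```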


2004 ◽  
Vol 31 (3) ◽  
pp. 319 ◽  
Author(s):  
Jane Catherine Kitson

Sooty shearwaters (tītī, muttonbird, Puffinus griseus) are highly abundant migratory seabirds that return to breeding colonies in New Zealand. The Rakiura Māori annual chick harvest on islands adjacent to Rakiura (Stewart Island) is one of the last large-scale customary uses of native wildlife in New Zealand. This study aimed to establish whether the rate at which muttonbirders can extract chicks from their breeding burrows indicates population trends of sooty shearwaters. Harvest rates increased slightly with increasing chick densities on Putauhinu Island. Birders' harvest rates vary in their sensitivity to changing chick density; a monitoring panel therefore requires careful screening to ensure that the harvest rates of the birders selected are sensitive to chick density and represent a cross-section of different islands. Though harvest rates can provide only a general index of population change, they offer an inexpensive and feasible way to measure population trends. Detecting trends is the first step to assessing the long-term sustainability of the harvest.
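
The study's core quantity is how harvest rate responds to chick density, which for a single birder reduces to a regression slope. The sketch below computes that sensitivity with a least-squares fit over hypothetical numbers; the paper's actual data and analysis are more involved.

```python
import numpy as np

def harvest_sensitivity(chick_density, harvest_rate):
    """Least-squares slope of harvest rate against chick density.

    A birder whose slope is clearly positive is 'sensitive' to density and
    is a candidate for a monitoring panel; all values here are hypothetical.
    """
    slope, _intercept = np.polyfit(chick_density, harvest_rate, 1)
    return slope

density = np.array([0.4, 0.6, 0.8, 1.0, 1.2])    # chicks per burrow entrance
rate = np.array([14.0, 17.5, 19.0, 23.5, 26.0])  # chicks harvested per hour
print(f"slope: {harvest_sensitivity(density, rate):.1f} chicks/hr per unit density")
```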


Author(s):  
Alessandro Achille ◽  
Giovanni Paolini ◽  
Glen Mbeng ◽  
Stefano Soatto

Abstract: We introduce an asymmetric distance in the space of learning tasks and a framework to compute their complexity. These concepts are foundational for the practice of transfer learning, whereby a parametric model is pre-trained for one task and then fine-tuned for another. The framework we develop is non-asymptotic, captures the finite nature of the training dataset, and allows distinguishing learning from memorization. It encompasses, as special cases, classical notions from Kolmogorov complexity and Shannon and Fisher information. However, unlike some of those frameworks, it can be applied to large-scale models and real-world datasets. Our framework is the first to measure complexity in a way that accounts for the effect of the optimization scheme, which is critical in deep learning.

