Research and Implementation of LED Optical Design Focused Web Crawler

The structure and content of search engine research are presented; the core technology is web page analysis. The characteristics of analyzing the web pages of a single website are studied: the relationship between the pages a web crawler gathers on two separate crawls can be obtained, and the information that changed between them is easy to find. A new method of analyzing the web pages of a single website is introduced, which analyzes pages using this changed information. Results of applying the method show that it is effective for web page analysis.
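The idea of comparing the pages gathered on two crawls of one website can be illustrated with a minimal sketch. It is not the method from the paper, which the abstract does not specify; it assumes each crawl is available as a mapping from URL to page body and uses content hashes to detect change. The names `fingerprint` and `diff_snapshots` are hypothetical.

```python
import hashlib


def fingerprint(html: str) -> str:
    """Return a stable content hash for one page body."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two crawls of the same site (URL -> page body).

    Assumption: a page counts as "changed" when its hash differs;
    the paper may use a finer-grained notion of change.
    """
    old_fp = {url: fingerprint(body) for url, body in old.items()}
    new_fp = {url: fingerprint(body) for url, body in new.items()}
    shared = old_fp.keys() & new_fp.keys()
    return {
        "added": sorted(new_fp.keys() - old_fp.keys()),
        "removed": sorted(old_fp.keys() - new_fp.keys()),
        "changed": sorted(u for u in shared if old_fp[u] != new_fp[u]),
    }
```

Only the pages listed under "added" and "changed" would then need to be analyzed again, which is the saving the abstract points to.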

Web Crawler and Web Crawler Algorithms: A Perspective

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e9362.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 203-205

Keyword(s):

Search Engine ◽

Search Engines ◽

The Internet ◽

Web Pages ◽

Web Crawler ◽

Day By Day ◽

The Web

A web crawler, also called a spider, automatically traverses the WWW for the purpose of web indexing. As the Web grows day by day, the number of web pages worldwide has increased massively. Search engines are essential for making search practical for users, and they are used to find particular data on the WWW. Without search engines, it would be nearly impossible for a person to find anything on the web unless they already knew a specific URL. Every search engine maintains a central repository of HTML documents in indexed form. Each time a user submits a query, the search is carried out against this database of indexed web pages. The size of each search engine's database depends on the pages existing on the internet. To increase the efficiency of search engines, therefore, only the most relevant and significant pages are stored in the database.

Web Crawler: Design And Implementation For Extracting Article-Like Contents

Cybernetics and Physics ◽

10.35470/2226-4116-2020-9-3-144-151 ◽

2020 ◽

pp. 144-151

Author(s):

Ngo Le Huy Hien ◽

Thai Quang Tien ◽

Nguyen Van Hieu

Keyword(s):

Search Engine ◽

Search Engines ◽

Visual Cues ◽

Future Research ◽

Web Pages ◽

Web Crawler ◽

Accessible Information ◽

Machine Learning Approach ◽

Engine Systems ◽

The Web

The World Wide Web is a large, rich, and accessible information system whose user base is growing rapidly. To retrieve information from the web in response to users’ requests, search engines are built to access web pages. As search engine systems play a significant role in cybernetics, telecommunication, and physics, many efforts have been made to enhance their capacity. However, most of the data on the web is unmanaged, making it impossible for current search engine mechanisms to access the entire network at once. The web crawler is therefore a critical part of a search engine, used to navigate web pages and download their full text. Web crawlers may also be applied to detect missing links and for community detection in complex networks and cybernetic systems. However, template-based crawling techniques cannot handle the layout diversity of objects on web pages. In this paper, a web crawler module was designed and implemented to extract article-like contents from 495 websites. It uses a machine learning approach with visual cues, trivial HTML, and text-based features to filter out clutter. The outcomes are promising for extracting article-like contents from websites, contributing to the development of search engine systems and to future research geared towards higher-performance systems.

Search Engine

The Dark Web ◽

10.4018/978-1-5225-3163-0.ch016 ◽

2018 ◽

pp. 359-374

Author(s):

Dilip Kumar Sharma ◽

A. K. Sharma

Keyword(s):

Computer Networks ◽

Search Engines ◽

Web Search ◽

Relevant Information ◽

Vital Role ◽

Deep Web ◽

Telecommunication Networks ◽

Web Pages ◽

Web Crawler ◽

Main Components

ICT plays a vital role in human development through information extraction and includes computer networks and telecommunication networks. One of the important modules of ICT is computer networks, which are the backbone of the World Wide Web (WWW). Search engines are computer programs that browse and extract information from the WWW in a systematic and automatic manner. This paper examines the three main components of search engines: Extractor, a web crawler which starts with a URL; Analyzer, an indexer that processes words on the web page and stores the resulting index in a database; and Interface Generator, a query handler that understands the need and preferences of the user. This paper concentrates on the information available on the surface web through general web pages and the hidden information behind the query interface, called deep web. This paper emphasizes the Extraction of relevant information to generate the preferred content for the user as the first result of his or her search query. This paper discusses the aspect of deep web with analysis of a few existing deep web search engines.
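The three components named in the abstract can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the chapter's design: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`) rather than real HTTP, the Analyzer builds a plain inverted index, and the Interface Generator answers a query by intersecting posting sets. All function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("deep web search engines", ["b", "c"]),
    "b": ("surface web pages", ["a"]),
    "c": ("search query interface", []),
}


def extract(seed, web):
    """Extractor: breadth-first crawl that starts from one seed URL."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        pages[url] = text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def analyze(pages):
    """Analyzer: build an inverted index (word -> set of URLs)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def answer(query, index):
    """Interface Generator: return URLs that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

A real query handler would also model user preferences, which this sketch leaves out.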

Search Engine

International Journal of Information Communication Technologies and Human Development ◽

10.4018/ijicthd.2011040103 ◽

2011 ◽

Vol 3 (2) ◽

pp. 38-51 ◽

Cited By ~ 6

Author(s):

Dilip Kumar Sharma ◽

A. K. Sharma

Keyword(s):

Computer Networks ◽

Search Engines ◽

Web Search ◽

Relevant Information ◽

Vital Role ◽

Deep Web ◽

Telecommunication Networks ◽

Web Pages ◽

Web Crawler ◽

Main Components

ICT plays a vital role in human development through information extraction and includes computer networks and telecommunication networks. One of the important modules of ICT is computer networks, which are the backbone of the World Wide Web (WWW). Search engines are computer programs that browse and extract information from the WWW in a systematic and automatic manner. This paper examines the three main components of search engines: Extractor, a web crawler which starts with a URL; Analyzer, an indexer that processes words on the web page and stores the resulting index in a database; and Interface Generator, a query handler that understands the need and preferences of the user. This paper concentrates on the information available on the surface web through general web pages and the hidden information behind the query interface, called deep web. This paper emphasizes the Extraction of relevant information to generate the preferred content for the user as the first result of his or her search query. This paper discusses the aspect of deep web with analysis of a few existing deep web search engines.

Extracting Top-k Company Acquisition Relations From the Web

International Journal on Semantic Web and Information Systems ◽

10.4018/ijswis.2017100102 ◽

2017 ◽

Vol 13 (4) ◽

pp. 27-41 ◽

Cited By ~ 1

Author(s):

Jie Zhao ◽

Jianfei Wang ◽

Jia Yang ◽

Peiquan Jin

Keyword(s):

Rapid Development ◽

Relation Extraction ◽

Experimental Results ◽

Competitive Intelligence ◽

Web Pages ◽

Web Content ◽

Web Page ◽

Competitive Strategies ◽

The Web ◽

Novel Algorithm

Company acquisition relation reflects a company's development intent and competitive strategies, which is an important type of enterprise competitive intelligence. In the traditional environment, the acquisition of competitive intelligence mainly relies on newspapers, internal reports, and so on, but the rapid development of the Web introduces a new way to extract company acquisition relation. In this paper, the authors study the problem of extracting company acquisition relation from huge amounts of Web pages, and propose a novel algorithm for company acquisition relation extraction. The authors' algorithm considers the tense feature of Web content and classification technology of semantic strength when extracting company acquisition relation from Web pages. It first determines the tense of each sentence in a Web page, which is then applied in sentences classification so as to evaluate the semantic strength of the candidate sentences in describing company acquisition relation. After that, the authors rank the candidate acquisition relations and return the top-k company acquisition relation. They run experiments on 6144 pages crawled through Google, and measure the performance of their algorithm under different metrics. The experimental results show that the algorithm is effective in determining the tense of sentences as well as the company acquisition relation.

Distributed and collaborative Web Change Detection system

Computer Science and Information Systems ◽

10.2298/csis131120081p ◽

2015 ◽

Vol 12 (1) ◽

pp. 91-114 ◽

Cited By ~ 7

Author(s):

Víctor Prieto ◽

Manuel Álvarez ◽

Víctor Carneiro ◽

Fidel Cacheda

Keyword(s):

Change Detection ◽

Search Engines ◽

Web Site ◽

Detection System ◽

Computational Cost ◽

Web Pages ◽

Web Page ◽

Case Scenario ◽

Worst Case ◽

The Web

Search engines use crawlers to traverse the Web in order to download web pages and build their indexes. Keeping these indexes up to date is essential to ensure the quality of search results. However, changes in web pages are unpredictable. Identifying the moment a web page changes, as soon as possible and with minimal computational cost, is a major challenge. In this article we present the Web Change Detection system, which in a best-case scenario can detect a change to a web page almost in real time. In a worst-case scenario, it requires on average 12 minutes to detect a change on a web site with low PageRank and about one minute on a web site with high PageRank. By comparison, current search engines require more than a day, on average, to detect a modification to a web page (in both cases).

The Canon of Dutch Literature According to Google

10.31235/osf.io/ewy27 ◽

2019 ◽

Author(s):

Lucas van der Deijl ◽

Antal van den Bosch ◽

Roel Smeets

Keyword(s):

Knowledge Base ◽

Search Engines ◽

Literary History ◽

Information Sources ◽

Search Algorithms ◽

Web Pages ◽

Printed Media ◽

Literary Histories ◽

Dutch Literature ◽

The Web

Literary history is no longer written in books alone. As literary reception thrives in blogs, Wikipedia entries, Amazon reviews, and Goodreads profiles, the Web has become a key platform for the exchange of information on literature. Although conventional printed media in the field (academic monographs, literary supplements, and magazines) may still claim the highest authority, online media presumably provide the first (and possibly the only) source for many readers casually interested in literary history. Wikipedia offers quick and free answers to readers’ questions and the range of topics described in its entries dramatically exceeds the volume any printed encyclopedia could possibly cover. While an important share of this expanding knowledge base about literature is produced bottom-up (user based and crowd-sourced), search engines such as Google have become brokers in this online economy of knowledge, organizing information on the Web for its users. Similar to the printed literary histories, search engines prioritize certain information sources over others when ranking and sorting Web pages; as such, their search algorithms create hierarchies of books, authors, and periods.

WEB GRAPH BASED SEARCH BY USING DENSITY OF KEYWORD AND AGE FACTOR

International Journal of Computer Science and Informatics ◽

10.47893/ijcsi.2013.1124 ◽

2013 ◽

pp. 89-93

Author(s):

GAURAV AGARWAL ◽

SACHI GUPTA ◽

SAURABH MUKHERJEE

Keyword(s):

Search Engine ◽

Web Search ◽

Web Pages ◽

Main Role ◽

Ranking Algorithm ◽

Web Page ◽

Web Crawler ◽

User Requirement ◽

Priority Assignment ◽

The Web

Today, web servers are the key repositories of information, and the internet is the means of obtaining it. The amount of data on the Internet is enormous, which makes finding relevant data difficult. Search engines play a vital role in finding relevant data. A search engine follows three steps: web crawling by the crawler, indexing by the indexer, and searching by the searcher. The web crawler retrieves information from web pages by following every link on a site. This information is stored by the search engine, and the content of each web page is then indexed by the indexer. The main role of the indexer is to make data quickly retrievable according to user requirements. When the client submits a query, the search engine looks up the results corresponding to that query. The aim here is to develop a search engine algorithm that returns the most desirable results for the user's requirements. The search engine uses a ranking method to rank web pages. Various ranking approaches are discussed in the literature; in this paper, a ranking algorithm based on the parent-child relationship is proposed. The proposed ranking algorithm is based on the priority assignment phase of the Heterogeneous Earliest Finish Time (HEFT) algorithm, which was designed for multiprocessor task scheduling. The proposed algorithm works on three variables: the density of keywords, the number of successors of a node, and the age of the web page. Density is the occurrence of the keyword on a particular web page. The number of successors represents the outgoing links of a single web page. Age is the freshness value of the web page: the most recently modified page is the freshest and has the smallest age, or the largest freshness value. The proposed technique requires the priority of each page to be set using the downward rank values, and pages are arranged in ascending or descending order of their rank values. Experiments show that our algorithm is valuable. In a comparison with Google, we find that our algorithm performs better: for 70% of problems, it works better than Google.
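The three ranking variables (keyword density, number of successors, page age) can be made concrete with a small sketch. The abstract does not give the authors' formula or the HEFT-style rank computation, so the equal-weight sum below, the `1 / (1 + age)` freshness term, and every name in the code are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    out_links: int   # number of successors (outgoing links)
    age_days: float  # time since last modification


def keyword_density(text, keyword):
    """Fraction of the words on the page that equal the keyword."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


def score(page, keyword, max_links):
    """Hypothetical combined score; NOT the formula from the paper."""
    density = keyword_density(page.text, keyword)
    successors = page.out_links / max_links if max_links else 0.0
    freshness = 1.0 / (1.0 + page.age_days)  # smaller age -> fresher
    return density + successors + freshness


def rank(pages, keyword):
    """Order pages by descending score."""
    max_links = max((p.out_links for p in pages), default=0)
    return sorted(pages, key=lambda p: score(p, keyword, max_links), reverse=True)
```

With equal link counts, a recently modified page with higher keyword density sorts ahead of an older, sparser one, which matches the intent described in the abstract.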

An XML based Web Crawler with Page Revisit Policy and Updation in Local Repository of Search Engine

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i3.12924 ◽

2018 ◽

Vol 7 (3) ◽

pp. 1119

Author(s):

Jyoti Mor ◽

Dr Dinesh Rai ◽

Dr Naresh Kumar

Keyword(s):

Search Engines ◽

Quality Data ◽

Web Pages ◽

Large Collection ◽

Web Crawler ◽

Network Resources ◽

High Quality Data ◽

Remote Server ◽

Web Crawlers ◽

Shared Network

In a large collection of web pages, it is difficult for search engines to keep their online repository up to date. Major search engines have hundreds of web crawlers that crawl the WWW day and night and send the downloaded web pages over a network to be stored in the search engine’s database. This results in overutilization of network resources such as bandwidth and CPU cycles. This paper proposes an architecture that tries to reduce the utilization of shared network resources with the help of an advanced XML-based approach. This focused-crawling architecture is trained to download only high-quality data from the internet, leaving behind web pages that are not relevant to the desired domain. A detailed layout of the proposed system is described, which is capable of reducing the load on the network and the problems that arise from the residency of a mobile agent at the remote server.
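The focused-crawling idea of skipping pages outside the desired domain can be sketched as a relevance filter applied before download. This is a simplified illustration, not the proposed architecture: the XML-based approach and the mobile agent are not modelled, relevance is assumed to be the share of domain terms found in a short page summary, and the 0.5 threshold and all names are hypothetical.

```python
def relevance(text, domain_terms):
    """Share of the domain terms that appear in the text (0.0 to 1.0)."""
    if not domain_terms:
        return 0.0
    words = set(text.lower().split())
    return len(words & domain_terms) / len(domain_terms)


def focused_filter(candidates, domain_terms, threshold=0.5):
    """Keep only candidate URLs whose summary looks relevant enough.

    candidates maps URL -> short summary (for example anchor text);
    the threshold value is an assumption for this sketch.
    """
    return [
        url
        for url, summary in candidates.items()
        if relevance(summary, domain_terms) >= threshold
    ]
```

Only the URLs that pass the filter would be fetched, so irrelevant pages never consume bandwidth.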
