UCrawler: A learning-based web crawler using a URL knowledge base

Author(s):  
Wei Wang ◽  
Lihua Yu

Focused crawlers, as fundamental components of vertical search engines, crawl web pages related to a specific topic. Existing focused crawlers commonly suffer from two problems: low crawling efficiency and subject migration. In this paper, we propose a learning-based focused crawler that uses a URL knowledge base. To improve the accuracy of the similarity measure, topic similarity is computed from the parent page content, the anchor information, and the URL content, and the URL content is learned and updated iteratively and continuously. Within the crawler, we implement a crawling mechanism that combines content analysis with a simple link-analysis strategy, which decreases computational complexity and avoids the locality problem of crawling. Experimental results show that the proposed algorithm achieves better precision than traditional methods, including the shark-search and best-first search algorithms, and avoids the local-optimum problem of crawling.
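
As an illustration of the kind of combined relevance score the abstract describes, the sketch below weights cosine similarities between the topic and the parent page content, the anchor text, and the URL tokens. The tokenization, the `url_priority` helper, and the 0.4/0.4/0.2 weights are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a combined URL relevance score (not the authors' code).
# Assumes bag-of-words vectors; the weights w_parent, w_anchor, w_url are arbitrary.
import math
import re
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def url_priority(topic: str, parent_text: str, anchor_text: str, url: str,
                 w_parent: float = 0.4, w_anchor: float = 0.4, w_url: float = 0.2) -> float:
    """Combine parent-page, anchor, and URL similarities into one crawl priority."""
    t = tokens(topic)
    return (w_parent * cosine(t, tokens(parent_text))
            + w_anchor * cosine(t, tokens(anchor_text))
            + w_url * cosine(t, tokens(url)))

# Example: score one candidate link for the topic "machine learning"
print(url_priority("machine learning",
                   "Tutorials on machine learning and data mining",
                   "deep learning course",
                   "https://example.org/machine-learning/intro"))
```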

Author(s):  
B Sathiya ◽  
T.V. Geetha

The prime textual sources used for ontology learning are a domain corpus and large, dynamic text from web pages. The first source is limited and possibly outdated, while the second is uncertain. To overcome these shortcomings, a novel ontology learning methodology is proposed that utilizes different sources of text, namely a corpus, web pages, and the massive probabilistic knowledge base Probase, for effective automated construction of an ontology. Specifically, to discover taxonomical relations among the concepts of the ontology, a new web-page-based two-level semantic query formation methodology using lexical syntactic patterns (LSP), together with a novel scoring measure, Fitness, built on Probase, is proposed. In addition, a syntactic and statistical measure called COS (Co-occurrence Strength) scoring and the Domain- and Range-NTRD (Non-Taxonomical Relation Discovery) algorithms are proposed to accurately identify non-taxonomical relations (NTR) among concepts, using evidence from the corpus and web pages.
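
The paper defines its own COS (Co-occurrence Strength) measure over corpus and web evidence; the snippet below is only a generic stand-in, a Dice-style sentence-level co-occurrence score, to make the idea of scoring a concept pair concrete.

```python
# Hedged illustration of a sentence-level co-occurrence strength between two concepts.
# The paper defines its own COS measure; this Dice-style score is only a stand-in.
from typing import List

def co_occurrence_strength(concept_a: str, concept_b: str, sentences: List[str]) -> float:
    """Dice-style score: how often the two concepts appear in the same sentence."""
    a_hits = sum(1 for s in sentences if concept_a in s.lower())
    b_hits = sum(1 for s in sentences if concept_b in s.lower())
    both = sum(1 for s in sentences if concept_a in s.lower() and concept_b in s.lower())
    return 2.0 * both / (a_hits + b_hits) if (a_hits + b_hits) else 0.0

corpus = [
    "The engine drives the pump through a gearbox.",
    "A pump moves fluid through the pipeline.",
    "The engine requires regular maintenance.",
]
print(co_occurrence_strength("engine", "pump", corpus))
```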


The Dark Web ◽  
2018 ◽  
pp. 359-374
Author(s):  
Dilip Kumar Sharma ◽  
A. K. Sharma

ICT plays a vital role in human development through information extraction and includes computer networks and telecommunication networks. One of the important modules of ICT is computer networks, which are the backbone of the World Wide Web (WWW). Search engines are computer programs that browse and extract information from the WWW in a systematic and automatic manner. This paper examines the three main components of search engines: the Extractor, a web crawler that starts with a URL; the Analyzer, an indexer that processes the words on each web page and stores the resulting index in a database; and the Interface Generator, a query handler that understands the needs and preferences of the user. The paper concentrates on the information available on the surface web through general web pages and on the hidden information behind query interfaces, called the deep web. It emphasizes the extraction of relevant information so that the preferred content appears as the first result of the user's search query, and discusses aspects of the deep web along with an analysis of a few existing deep web search engines.
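
A toy end-to-end sketch of the three components named above, with page fetching stubbed by an in-memory store; the function names and the tiny corpus are illustrative and are not taken from the paper.

```python
# Toy sketch of the three components named in the paper: Extractor (crawler),
# Analyzer (indexer), and Interface Generator (query handler). Fetching is stubbed
# with an in-memory page store; a real crawler would issue HTTP requests.
import re
from collections import defaultdict

PAGES = {  # stand-in for the WWW
    "http://example.org/a": "deep web search engines index hidden databases",
    "http://example.org/b": "surface web pages are reachable by ordinary crawlers",
}

def extractor(seed_urls):
    """Crawler: start from seed URLs and yield (url, content) pairs."""
    for url in seed_urls:
        yield url, PAGES.get(url, "")

def analyzer(documents):
    """Indexer: build an inverted index from words to the URLs containing them."""
    index = defaultdict(set)
    for url, text in documents:
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
    return index

def interface_generator(index, query):
    """Query handler: return URLs matching every query term."""
    terms = re.findall(r"[a-z]+", query.lower())
    hits = [index.get(t, set()) for t in terms]
    return set.intersection(*hits) if hits else set()

index = analyzer(extractor(PAGES.keys()))
print(interface_generator(index, "deep web"))
```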


2012 ◽  
Vol 433-440 ◽  
pp. 5214-5217
Author(s):  
Hai Huang

Short-term traffic flow forecasting places high demands on the response time and accuracy of the forecasting method because the result is used directly for real-time traffic guidance. After introducing the fuzzy neural network model for short-term traffic flow forecasting together with its detailed procedures, this paper adopts the particle swarm optimization (PSO) algorithm to train the fuzzy neural network. Its global searching and optimization capability helps to overcome the shortcomings of traditional fuzzy neural network training, such as low efficiency and convergence to local optima. A case study is also given in which the PSO algorithm trains the fuzzy neural network for traffic flow forecasting. The result shows that the average squared error is 0.932 when the PSO algorithm is used for network training, compared with 3.926 when it is not; the result is therefore more accurate, and the training procedure requires less time. This demonstrates that the method is feasible and efficient.
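
To make the training loop concrete, the sketch below runs a generic particle swarm optimization over the parameters of a deliberately simple autoregressive predictor on synthetic flow counts. It is not the paper's fuzzy neural network, and the swarm settings and data are arbitrary assumptions.

```python
# Hedged sketch of PSO-style training: each particle encodes the parameters of a
# tiny forecasting model and the swarm minimizes mean squared error on synthetic
# traffic counts. The paper trains a fuzzy neural network; this stand-in model is
# deliberately simple (a linear two-step autoregressive predictor).
import random

history = [120, 135, 150, 160, 155, 170, 180, 175, 190, 200]  # synthetic flow counts

def mse(weights):
    """Forecast each point from the two previous ones and measure squared error."""
    w1, w2, bias = weights
    errors = [(w1 * history[t - 1] + w2 * history[t - 2] + bias - history[t]) ** 2
              for t in range(2, len(history))]
    return sum(errors) / len(errors)

def pso(dim=3, particles=20, iters=200, inertia=0.7, c1=1.5, c2=1.5):
    swarm = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in swarm]          # personal best positions
    gbest = min(pbest, key=mse)            # global best position
    for _ in range(iters):
        for i, p in enumerate(swarm):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - p[d])
                             + c2 * r2 * (gbest[d] - p[d]))
                p[d] += vel[i][d]
            if mse(p) < mse(pbest[i]):
                pbest[i] = p[:]
        gbest = min(pbest, key=mse)
    return gbest

best = pso()
print("best weights:", best, "MSE:", round(mse(best), 3))
```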


2016 ◽  
Vol 22 (4) ◽  
pp. 529-539 ◽  
Author(s):  
Limao ZHANG ◽  
Xianguo WU ◽  
Lieyun DING ◽  
Miroslaw J. SKIBNIEWSKI ◽  
Yujie LU

This paper presents an innovative approach that integrates Building Information Modeling (BIM) and expert systems to address deficiencies in the traditional safety risk identification process in tunnel construction. A BIM-based Risk Identification Expert System (B-RIES), composed of three main built-in subsystems (BIM extraction, knowledge base management, and risk identification), is proposed. The engineering parameter information related to risk factors is first extracted from the BIM of a specific project, where the Industry Foundation Classes (IFC) standard plays a bridging role between the BIM data and tunnel construction safety risks. An integrated knowledge base, consisting of a fact base, a rule base, and a case base, is then established to systematize the fragmented explicit and tacit knowledge. Finally, a hybrid inference approach, combining case-based reasoning and rule-based reasoning, is developed to improve the flexibility and comprehensiveness of the system's reasoning capacity. B-RIES is used to overcome the low efficiency of traditional information extraction, reduce the dependence on domain experts, and facilitate knowledge sharing and communication among dispersed clients and domain experts. The identification of a safety hazard regarding water gushing in a metro station in China is presented as a case study. The results demonstrate the feasibility of B-RIES and its application effectiveness.
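
The hybrid inference idea, rules first with a case-based fallback, might look roughly like the sketch below. The parameter names, rules, and stored cases are hypothetical and only illustrate how a fact base, rule base, and case base could interact; they are not drawn from B-RIES.

```python
# Hedged sketch of hybrid inference: rule-based reasoning first, with a case-based
# fallback. The parameters, rules, and cases are hypothetical placeholders.

rule_base = [
    # (condition over extracted BIM parameters, identified risk)
    (lambda f: f.get("groundwater_level_m", 0) > 5 and f.get("soil_type") == "sand",
     "water gushing risk"),
    (lambda f: f.get("cover_depth_m", 99) < 6,
     "ground settlement risk"),
]

case_base = [
    ({"groundwater_level_m": 6.2, "soil_type": "sand", "cover_depth_m": 9}, "water gushing risk"),
    ({"groundwater_level_m": 2.0, "soil_type": "clay", "cover_depth_m": 5}, "ground settlement risk"),
]

def similarity(a, b):
    """Crude case similarity: count of matching attribute values."""
    return sum(1 for k in a if k in b and a[k] == b[k])

def identify_risks(facts):
    """Apply the rule base first; fall back to the most similar stored case."""
    risks = [risk for cond, risk in rule_base if cond(facts)]
    if not risks:
        _, risk = max(case_base, key=lambda c: similarity(facts, c[0]))
        risks.append(risk)
    return risks

facts = {"groundwater_level_m": 7.1, "soil_type": "sand", "cover_depth_m": 12}
print(identify_risks(facts))
```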


2007 ◽  
Vol 16 (05) ◽  
pp. 793-828 ◽  
Author(s):  
JUAN D. VELÁSQUEZ ◽  
VASILE PALADE

Understanding web users' browsing behaviour in order to adapt a web site to the needs of a particular user is a key issue for many commercial companies that do their business over the Internet. This paper presents the implementation of a Knowledge Base (KB) for building web-based computerized recommender systems. The Knowledge Base consists of a Pattern Repository, which contains patterns extracted from web logs and web pages by applying various web mining tools, and a Rule Repository, which contains rules that describe how the discovered patterns are used to build navigation or web site modification recommendations. The paper also focuses on testing the effectiveness of the proposed online and offline recommendations. A comprehensive real-world experiment is carried out on the web site of a bank.
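
A minimal sketch of how a pattern repository and a rule repository might be combined to produce a navigation recommendation; the patterns, rules, and bank-site page names are invented for illustration and do not reproduce the paper's repositories.

```python
# Hedged sketch: match the current navigation session against mined patterns and
# apply the rule attached to each matching pattern. All entries are invented.

pattern_repository = {
    # pattern id -> frequently observed navigation sequence mined from web logs
    "P1": ["home", "loans", "loan-calculator"],
    "P2": ["home", "accounts", "online-banking"],
}

rule_repository = {
    # pattern id -> recommendation derived from that pattern
    "P1": "Recommend the mortgage pre-approval page",
    "P2": "Recommend the mobile banking app page",
}

def recommend(session):
    """Return recommendations for patterns whose prefix matches the current session."""
    hits = []
    for pid, pattern in pattern_repository.items():
        if pattern[:len(session)] == session and pid in rule_repository:
            hits.append(rule_repository[pid])
    return hits

print(recommend(["home", "loans"]))
```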


2021 ◽  
Vol 21 (2) ◽  
pp. 105-120
Author(s):  
K. S. Sakunthala Prabha ◽  
C. Mahesh ◽  
S. P. Raja

A topic-precise crawler is a special-purpose web crawler that downloads web pages relevant to a particular topic by measuring a cosine similarity or semantic similarity score. The cosine-based similarity measure produces an inaccurate relevance score if the topic term does not occur directly in the web page. The semantic-based similarity measure provides a precise relevance score even when only synonyms of the given topic occur in the web page, but if the topic is not available in the ontology, semantic focused crawlers also produce inaccurate relevance scores. This paper overcomes these shortcomings with a hybrid string-matching algorithm that combines the semantic similarity-based measure with a probabilistic similarity-based measure. The experimental results reveal that this algorithm increases the efficiency of focused web crawlers and achieves a better Harvest Rate (HR), Precision (P), and Irrelevance Ratio (IR) than existing focused web crawlers.
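
A minimal sketch of a hybrid relevance score in the spirit of the abstract: a cosine term score blended with a synonym-aware semantic score, so a page can still be judged relevant when the topic term itself is absent. The toy synonym table stands in for an ontology, the equal weighting is an assumption, and the paper's probabilistic component is not reproduced here.

```python
# Hedged sketch of a hybrid relevance score: cosine similarity blended with a
# synonym-aware score. The synonym table and the 0.5/0.5 weighting are assumptions.
import math
import re
from collections import Counter

SYNONYMS = {"car": {"automobile", "vehicle"}, "film": {"movie", "cinema"}}

def bow(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_score(topic, page):
    t, p = bow(topic), bow(page)
    dot = sum(t[w] * p[w] for w in set(t) & set(p))
    norm = math.sqrt(sum(v * v for v in t.values())) * math.sqrt(sum(v * v for v in p.values()))
    return dot / norm if norm else 0.0

def semantic_score(topic, page):
    """Fraction of topic terms that appear in the page directly or as a synonym."""
    page_words = set(bow(page))
    topic_words = list(bow(topic))
    hits = sum(1 for w in topic_words
               if w in page_words or SYNONYMS.get(w, set()) & page_words)
    return hits / len(topic_words) if topic_words else 0.0

def hybrid_relevance(topic, page, alpha=0.5):
    return alpha * cosine_score(topic, page) + (1 - alpha) * semantic_score(topic, page)

page = "This site reviews every new automobile released this year."
print(hybrid_relevance("car", page))   # nonzero even though 'car' never appears
```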


2021 ◽  
Vol 13 (6) ◽  
pp. 1-13
Author(s):  
Guangxuan Chen ◽  
Guangxiao Chen ◽  
Lei Zhang ◽  
Qiang Liu

In order to solve the problems of repeated acquisition, data redundancy, and low efficiency in website forensics, this paper proposes an incremental acquisition method oriented to dynamic websites. The method achieves incremental collection from dynamically updated websites through web page acquisition and parsing, URL deduplication, web page denoising, web page content extraction, and hashing. Experiments show that the algorithm has relatively high acquisition precision and recall, and can be combined with other data to perform effective digital forensics on dynamically updated real-time websites.
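
The incremental idea, collecting a page again only when its deduplicated URL maps to changed content, could be sketched as below. Fetching and content extraction are stubbed, and the URLs and page texts are placeholders rather than the paper's data.

```python
# Hedged sketch of incremental acquisition: URLs are de-duplicated and the extracted
# content is hashed, so a page is re-acquired only when its hash changes between runs.
import hashlib

seen_hashes = {}  # url -> SHA-256 of previously extracted content

def extract_main_content(html: str) -> str:
    """Stub for denoising / content extraction; a real system would strip ads, nav, etc."""
    return html.strip()

def incremental_collect(snapshot: dict) -> list:
    """Return the URLs whose extracted content is new or has changed."""
    changed = []
    for url, html in snapshot.items():
        digest = hashlib.sha256(extract_main_content(html).encode("utf-8")).hexdigest()
        if seen_hashes.get(url) != digest:
            seen_hashes[url] = digest
            changed.append(url)
    return changed

run1 = {"http://site/news/1": "<p>first article</p>", "http://site/news/2": "<p>second</p>"}
run2 = {"http://site/news/1": "<p>first article (updated)</p>", "http://site/news/2": "<p>second</p>"}
print(incremental_collect(run1))  # both URLs collected on the first run
print(incremental_collect(run2))  # only the updated page is collected again
```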

