xpath expression
Recently Published Documents



Author(s):  
Harshala Bhoir ◽  
K. Jayamalini

Nowadays the Internet is widely used to find information, yet searching the web for useful information has become more difficult. A web crawler downloads web pages programmatically and extracts links from them, both relevant and irrelevant. This paper implements a web crawler with the Python frameworks Scrapy and Beautiful Soup to crawl news from news websites. Scrapy is a web crawling framework that allows programmers to create spiders, which define how a certain site or group of sites will be scraped. It has built-in support for extracting data from HTML sources using XPath and CSS expressions. Beautiful Soup is a library that extracts data from web pages; it provides a few simple methods for navigating, searching, and modifying a parse tree, and it automatically converts incoming documents to Unicode and outgoing documents to UTF-8. The proposed system uses Beautiful Soup and Scrapy to crawl news websites. The paper also compares the Scrapy and Beautiful Soup 4 web crawler frameworks.
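To make the idea of extracting links with an XPath expression concrete, here is a minimal sketch. It deliberately uses only the Python standard library (`xml.etree.ElementTree`, which supports a subset of XPath) rather than Scrapy or Beautiful Soup, so it is not the paper's implementation. The page markup, the `news` class name, and the function name `extract_news_links` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed news listing. Real-world HTML is rarely valid XML
# and would need a tolerant HTML parser instead of ElementTree.
PAGE = """
<html><body>
  <div class="news">
    <a href="/story/1">First headline</a>
    <a href="/story/2">Second headline</a>
  </div>
  <div class="ads"><a href="/promo">Buy now</a></div>
</body></html>
"""


def extract_news_links(markup):
    """Return (href, text) pairs for anchors inside the news block."""
    root = ET.fromstring(markup)
    # Descendant search plus an attribute predicate, both within the
    # XPath subset that ElementTree understands.
    anchors = root.findall(".//div[@class='news']/a")
    return [(a.get("href"), a.text) for a in anchors]


if __name__ == "__main__":
    for href, text in extract_news_links(PAGE):
        print(href, text)
```

The predicate `[@class='news']` is what separates the relevant links from the advertising link in this toy page; a crawler framework would apply the same kind of expression to each downloaded response.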


2011 ◽  
Vol 211-212 ◽  
pp. 726-730
Author(s):  
Hai Wei Zhang ◽  
Xiang Yu Hu ◽  
Ying Zhang ◽  
Yan Long Wen ◽  
Xiao Jie Yuan

As XPath is the core of most XML query languages, the efficiency of processing XPath expressions accounts for a major part of the cost of XML queries. However, most existing XPath processing algorithms do not take index structures into account and therefore incur high space and time costs. This paper proposes an efficient XPath query processing mechanism that makes full use of an XML structural index to retrieve XML data quickly. Together with the mechanism, a Compressed XPath Query Tree based on the structural index is proposed, which significantly reduces the number of join operations. Query algorithms then use the mechanism to handle all the structural relationships. Finally, experiments run the algorithm on real XML datasets and query workloads to report the performance of the mechanism and compare its efficiency with that of other mechanisms.
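The abstract does not describe the index or the Compressed XPath Query Tree in detail, so the following is only a rough illustration of the general idea behind a structural index: precompute, for every root-to-node label path, the list of nodes reached by that path, so that a simple child-axis query becomes a dictionary lookup instead of a traversal. The names `build_path_index` and `query` and the sample document are assumptions, not the authors' design.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document.
DOC = (
    "<lib>"
    "<book><title>A</title></book>"
    "<book><title>B</title><author>X</author></book>"
    "</lib>"
)


def build_path_index(root):
    """Map each root-to-node label path (a tuple of tags) to its nodes."""
    index = defaultdict(list)
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        index[path].append(node)
        # Push children in reverse so they are visited in document order.
        for child in reversed(list(node)):
            stack.append((child, path + (child.tag,)))
    return index


def query(index, xpath):
    """Answer an absolute child-axis path such as '/lib/book/title' by lookup."""
    key = tuple(xpath.strip("/").split("/"))
    return index.get(key, [])


if __name__ == "__main__":
    idx = build_path_index(ET.fromstring(DOC))
    print([e.text for e in query(idx, "/lib/book/title")])
```

This sketch handles only plain child steps; predicates and descendant steps, which are where join reduction matters most, are outside its scope.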


2010 ◽  
Vol 20-23 ◽  
pp. 178-183
Author(s):  
Jun Hua Gu ◽  
Jie Song ◽  
Na Zhang ◽  
Yan Liu Liu

With the increasing speed of the Internet and the growing amount of data it contains, users find it more and more difficult to obtain useful information from the web. Extracting accurate information from the web efficiently has become an urgent problem, and web information extraction technology has emerged to solve it. The proposed method of XML-based automatic web information extraction standardizes the HTML document using a data translation algorithm, forms an extraction rule base by learning the XPath expressions of samples, and uses that rule base to automatically extract pages of the same kind. The results show that this approach should lead to higher recall and precision, and that the result is self-describing, which makes it convenient for building data extraction systems for each domain.
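One simple way to picture "learning the XPath expression of samples" is to derive a positional path from a labelled element in a sample page and reuse it on another page with the same layout. The sketch below does exactly that with the standard library; it assumes the pages are already well-formed (the standardization step is not shown), and `learn_rule`, `SAMPLE`, and `OTHER` are hypothetical names. It is not the learning procedure from the paper.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages of the same kind, already standardized to XML.
SAMPLE = ("<html><body><div>Menu</div>"
          "<div><h1>Title A</h1><span>12.50</span></div></body></html>")
OTHER = ("<html><body><div>Menu</div>"
         "<div><h1>Title B</h1><span>8.99</span></div></body></html>")


def learn_rule(root, target):
    """Derive a positional XPath, relative to root, that reaches target."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        # XPath positions are 1-based and count siblings with the same tag.
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


if __name__ == "__main__":
    sample_root = ET.fromstring(SAMPLE)
    labelled = sample_root.find(".//span")  # the element a user marked
    rule = learn_rule(sample_root, labelled)
    print(rule, ET.fromstring(OTHER).find(rule).text)
```

Purely positional rules are brittle when page layouts vary, which is presumably why a rule base built from several samples is used in practice.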


Author(s):  
Yangjun Chen

XML employs a tree-structured model for representing data. Queries in XML query languages, for example, XPath (World Wide Web Consortium, 1999), XQuery (World Wide Web Consortium, 2001), XML-QL (Deutsch, Fernandez, Florescu, Levy, & Suciu, 1999), and Quilt (Chamberlin, Clark, Florescu, & Stefanescu, 1999; Chamberlin, Robie, & Florescu, 2000), typically specify patterns of selection predicates on multiple elements that have some specified tree-structured relationships. For instance, the following XPath expression: a[b[c and //d]]/b[c and e//d] asks for any node of type b that is a child of some node of type a. In addition, the b-node is the parent of some c-node and some e-node, as well as an ancestor of some d-node. In general, such an expression can be represented by a tree structure as shown in Figure 1(a). In such a tree pattern, the nodes are types from Σ ∪ {*} (* is a wildcard, matching any node type), and edges are parent-child or ancestor-descendant relationships. Among all the nodes of a query Q, one is designated as the output node, denoted by output(Q), corresponding to the output of the query.
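The tree pattern described above can be sketched as a small data structure plus a naive matcher. The sketch follows the abstract's reading of the example, in which //d inside a predicate denotes a descendant of the enclosing node. The class `PatternNode`, the functions `embeds`, `bindings`, and `evaluate`, and the sample document are all assumptions for illustration; the matcher is a brute-force check, not an efficient evaluation algorithm.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class PatternNode:
    label: str                  # node type, or "*" for the wildcard
    axis: str = "child"         # edge to the parent: "child" or "descendant"
    children: list = field(default_factory=list)
    is_output: bool = False     # marks output(Q)


def candidates(element, sub):
    """Elements that may bind to sub, given its edge type."""
    if sub.axis == "child":
        return list(element)
    return list(element.iter())[1:]   # iter() yields the element itself first


def embeds(pattern, element):
    """True if the pattern subtree rooted at pattern matches at element."""
    if pattern.label not in ("*", element.tag):
        return False
    return all(any(embeds(sub, c) for c in candidates(element, sub))
               for sub in pattern.children)


def has_output(pattern):
    return pattern.is_output or any(has_output(s) for s in pattern.children)


def bindings(pattern, element):
    """Yield elements bound to the output node when pattern matches at element."""
    if not embeds(pattern, element):
        return
    if pattern.is_output:
        yield element
        return
    for sub in pattern.children:
        if has_output(sub):
            for c in candidates(element, sub):
                yield from bindings(sub, c)


def evaluate(pattern, root):
    """Collect output bindings over the whole document, without duplicates."""
    seen, result = set(), []
    for element in root.iter():
        for hit in bindings(pattern, element):
            if id(hit) not in seen:
                seen.add(id(hit))
                result.append(hit)
    return result


# Tree pattern for a[b[c and //d]]/b[c and e//d]; the second b is the output node.
QUERY = PatternNode("a", children=[
    PatternNode("b", children=[PatternNode("c"),
                               PatternNode("d", axis="descendant")]),
    PatternNode("b", is_output=True, children=[
        PatternNode("c"),
        PatternNode("e", children=[PatternNode("d", axis="descendant")]),
    ]),
])

# Hypothetical document: only the second b has both a c child and an e child
# with a d descendant.
DOC = ("<a><b id='1'><c/><x><d/></x></b>"
       "<b id='2'><c/><e><y><d/></y></e></b>"
       "<b id='3'><c/></b></a>")

if __name__ == "__main__":
    print([e.get("id") for e in evaluate(QUERY, ET.fromstring(DOC))])
```

Keeping the edge type on each pattern node mirrors the description of edges as either parent-child or ancestor-descendant relationships, and the single `is_output` flag corresponds to output(Q).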

