Web Page Data Collection Based on Multithread

2013 ◽  
Vol 347-350 ◽  
pp. 2575-2579
Author(s):  
Wen Tao Liu

Web data collection is the process by which a crawler gathers the semi-structured, large-scale and redundant data on the Web, including web content, web structure and web usage data; it is widely used for information extraction, information retrieval, search engines and web data mining. In this paper, the principles of web data collection are introduced and related topics are discussed, such as page download, character encoding, update strategies, and static versus dynamic pages. Multithreading technology is described and a multithreaded mode for web data collection is proposed. Multithreaded web data collection achieves better resource utilization, better average response time and better overall performance.
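To make the multithreaded mode concrete, the following is a minimal sketch of multithreaded page downloading with a shared work queue, using Python's threading and queue modules together with the requests library; the thread count, timeout and error handling are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of multithreaded page downloading with a shared work queue.
# Thread count, timeout and seed URLs are illustrative, not from the paper.
import queue
import threading
import requests

def worker(url_queue, results):
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        try:
            resp = requests.get(url, timeout=10)
            # requests guesses the character encoding here; the "coding problem"
            # mentioned in the paper corresponds to getting this step right.
            results[url] = resp.text
        except requests.RequestException:
            results[url] = None
        finally:
            url_queue.task_done()

def crawl(urls, num_threads=8):
    url_queue = queue.Queue()
    for u in urls:
        url_queue.put(u)
    results = {}
    threads = [threading.Thread(target=worker, args=(url_queue, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```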

Author(s):  
JYOTSNA BAGRET ◽  
PRASANNA MUNDADA ◽  
SABAH TAZEEN ◽  
TANUJA MULLA

This paper describes how web content visualization can be greatly improved using a modeling technique. The goal is an improved 3D visualization of web content, in contrast to the 2D visualization used at present. Web page navigation is depicted by a 2D graph, while the web content itself, as well as RSS feeds, is visualized as a 3D graph. In a normal browser the user types a URL in the address bar and that URL is downloaded; the 3D browser instead takes any URL as input and generates a 3D graph of the whole website. When the user enters a URL, a root node for that URL is created and the URL is passed to a parser. The parser parses the web page and outputs the set of hyperlinks it contains. A node is created for each link and attached to the root node, and in this way the 3D graph of the entire website is built. Different color schemes are used for nodes of different link types, e.g. text links, image links and video links. An advanced search facility is also provided, and because the graph is three-dimensional, the user can rotate it as required.
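As an illustration of the root-node and parser step described above, here is a hedged sketch in Python using requests and BeautifulSoup: it creates a root node for the entered URL, parses the page for hyperlinks, and attaches one child node per link with a color by link type. The color scheme and file-extension classification are assumptions for illustration; the actual 3D rendering is out of scope.

```python
# Hypothetical sketch: build one level of the website graph from a URL.
# Colour scheme and link classification are assumptions; 3D rendering is omitted.
import requests
from bs4 import BeautifulSoup

LINK_COLOURS = {"text": "blue", "image": "green", "video": "red"}

def classify(href):
    href = href.lower()
    if href.endswith((".png", ".jpg", ".jpeg", ".gif")):
        return "image"
    if href.endswith((".mp4", ".avi", ".webm")):
        return "video"
    return "text"

def build_site_graph(url):
    root = {"url": url, "children": []}          # root node for the entered URL
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")    # the "parser" step
    for a in soup.find_all("a", href=True):      # one node per hyperlink
        kind = classify(a["href"])
        root["children"].append(
            {"url": a["href"], "type": kind, "colour": LINK_COLOURS[kind]})
    return root
```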


Author(s):  
G. Sreedhar ◽  
A. Anandaraja Chari

Web data mining is the application of data mining techniques to extract useful knowledge from web data such as web content, the hyperlink structure of documents, and web usage logs. There is also a strong need for techniques that support business decisions in e-commerce. Web data mining can be broadly divided into three categories: web content mining, web structure mining and web usage mining. Web content data are the contents made available to users to satisfy their information needs. Web structure data represent the linkage and relationships of web pages to one another. Web usage data comprise the log data collected by web servers and application servers, which are the main source of such data. The growth of the WWW and its technologies has made business functions faster and easier to execute. As a large number of transactions are performed through e-commerce sites and huge amounts of data are stored, valuable knowledge can be obtained by applying web mining techniques.
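As a small illustration of web usage mining on the server-side log data mentioned above, the sketch below parses Common Log Format entries and counts successful page requests; the log path and format are assumptions made for illustration, not details from this text.

```python
# A minimal sketch of web usage mining: parse Common Log Format entries from a
# web server access log and count successful page requests.
import re
from collections import Counter

LOG_LINE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+')

def page_hits(log_path, top_n=10):
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = LOG_LINE.match(line)
            if m and m.group(2) == "200":   # count only successful requests
                hits[m.group(1)] += 1
    return hits.most_common(top_n)
```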


Author(s):  
Jie Zhao ◽  
Jianfei Wang ◽  
Jia Yang ◽  
Peiquan Jin

Company acquisition relations reflect a company's development intent and competitive strategy, and they are an important type of enterprise competitive intelligence. In the traditional environment, the acquisition of such competitive intelligence relies mainly on newspapers, internal reports, and similar sources, but the rapid development of the Web introduces a new way to extract company acquisition relations. In this paper, the authors study the problem of extracting company acquisition relations from huge numbers of web pages and propose a novel extraction algorithm. The algorithm considers the tense of the web content and uses classification based on semantic strength when extracting acquisition relations from web pages. It first determines the tense of each sentence in a web page, which is then used in sentence classification to evaluate the semantic strength with which candidate sentences describe a company acquisition relation. After that, the authors rank the candidate acquisition relations and return the top-k company acquisition relations. They run experiments on 6144 pages crawled through Google and measure the performance of their algorithm under different metrics. The experimental results show that the algorithm is effective in determining the tense of sentences as well as the company acquisition relations.
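A hedged sketch of the pipeline's shape follows: tag each sentence's tense, score its semantic strength, and keep the top-k candidates. The cue phrases and weights below are invented placeholders and do not reproduce the authors' classifier.

```python
# Placeholder sketch of the described pipeline: tag sentence tense, score the
# semantic strength of each candidate, and return the top-k.
PAST_CUES = ("acquired", "bought", "purchased")
FUTURE_CUES = ("will acquire", "plans to acquire", "intends to buy")

def tense_of(sentence):
    s = sentence.lower()
    if any(cue in s for cue in FUTURE_CUES):
        return "future"
    if any(cue in s for cue in PAST_CUES):
        return "past"
    return "present"

def semantic_strength(sentence):
    # Completed deals are treated as stronger evidence than announced intents.
    return {"past": 1.0, "present": 0.6, "future": 0.3}[tense_of(sentence)]

def top_k_candidates(sentences, k=5):
    return sorted(sentences, key=semantic_strength, reverse=True)[:k]
```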


Author(s):  
Shailesh Shivakumar ◽  
Venkata Suresh Pachigolla

Segregating web page content into logical chunks is a popular technique for the modular organization of a web page. While the chunk-based approach works well for public web scenarios, in mobile-first personalization cases the chunking strategy is less effective for performance optimization because of the dynamic nature of the web content and the granularity of that content. In this paper, the authors propose a novel Micro Chunk based Web Delivery Framework built around the concept of a "micro chunk". The framework aims to address the performance challenges posed by regular chunks in a personalized web scenario. The authors describe methods for creating micro chunks and discuss the advantages of micro chunks over regular chunks in a personalized mobile web scenario. They have built a prototype application implementing the framework and benchmarked it against a regular personalized web application to quantify the performance improvements achieved by the micro chunk design.
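The following is a speculative sketch of the micro chunk idea, assuming that a regular chunk is decomposed into a shared, cacheable fragment and a tiny per-user fragment that is re-rendered on each request; the function names and caching mechanism are assumptions, not the authors' framework.

```python
# Speculative sketch of a "micro chunk": the shared fragment of a chunk is
# rendered once and cached, while only the small personalized fragment is
# rebuilt per request.
from functools import lru_cache

@lru_cache(maxsize=None)
def static_micro_chunk(chunk_id):
    # Expensive shared render, done once and reused for every user.
    return f"<div id='{chunk_id}'>shared content</div>"

def personalized_micro_chunk(chunk_id, user_id):
    # Tiny per-user fragment rendered on every request.
    return f"<div id='{chunk_id}-user'>hello, user {user_id}</div>"

def render_chunk(chunk_id, user_id):
    return static_micro_chunk(chunk_id) + personalized_micro_chunk(chunk_id, user_id)
```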


Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

Web data extraction is the process of extracting the information a user requires from web pages. The information consists of semi-structured data rather than data in a structured format, and the extraction operates on web documents in HTML. Nowadays most people use web data extractors, because the volume of information involved makes manual extraction slow and complicated. In this paper we present WEIDJ, an approach for extracting images from the web whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction of Images using the DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the page structure and uses JSON as the programming environment. The extraction process takes both a web address and an extraction structure as input. WEIDJ then splits the DOM tree into small subtrees and applies a visual-block search algorithm to each web page to find images. Our approach targets three levels of extraction: a single web page, multiple web pages and the whole website. Extensive experiments on several biodiversity web pages compare the time performance of image extraction using DOM, JSON and WEIDJ on a single web page. The experimental results show that image extraction with WEIDJ is fast and effective.
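In the spirit of WEIDJ, the sketch below walks the DOM of a page, groups image elements by a simple notion of visual block, and emits the result as JSON; this is not the authors' implementation, and treating each div as a block is a simplifying assumption.

```python
# Illustrative sketch: walk the DOM of a template-based page, gather <img>
# elements per <div> "visual block", and emit JSON. The block definition here
# is a simplifying assumption, not WEIDJ's segmentation.
import json
import requests
from bs4 import BeautifulSoup

def extract_images(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    blocks = []
    for i, block in enumerate(soup.find_all("div")):
        imgs = [img.get("src") for img in block.find_all("img", recursive=False)]
        if imgs:
            blocks.append({"block": i, "images": imgs})
    return json.dumps({"url": url, "blocks": blocks}, indent=2)
```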


Author(s):  
Osvaldo Adilson De Carvalho Junior ◽  
Sarita Mazzini Bruschi ◽  
Regina Helena Carlucci Santana ◽  
Marcos José Santana

The aim of this paper is to propose and evaluate GreenMACC (Green Metascheduler Architecture to Provide QoS in Cloud Computing), an extension of the MACC architecture (Metascheduler Architecture to provide QoS in Cloud Computing) that uses green IT techniques to provide quality of service. The paper evaluates the performance of the policies in the four stages of scheduling, focusing on energy consumption and average response time. The results presented confirm the consistency of the proposal, as it controls both energy consumption and the quality of the services requested by different users of a large-scale private cloud.


2011 ◽  
pp. 2206-2249
Author(s):  
Aidan Hogan ◽  
Andreas Harth ◽  
Axel Polleres

In this article the authors discuss the challenges of performing reasoning over large-scale RDF datasets from the Web. Using ter Horst's pD* fragment of OWL as a base, the authors compose a rule-based framework for application to web data; they justify their design decisions using undesirable examples taken directly from the Web. The authors further temper their OWL fragment by considering "authoritative sources", which counteracts an observed behaviour they term "ontology hijacking": new ontologies published on the Web that redefine the semantics of existing entities resident in other ontologies. They then present their system for rule-based forward-chaining reasoning, called SAOR: Scalable Authoritative OWL Reasoner. Based on observed characteristics of web data and of reasoning in general, they design their system to scale: it is based on a separation of terminological data from assertional data and comprises a lightweight in-memory index, on-disk sorts and file scans. The authors evaluate their methods on a dataset on the order of a hundred million statements collected from real-world web sources and present scale-up experiments on a dataset on the order of a billion statements collected from the Web.
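A toy sketch of the terminological/assertional separation follows: the small schema is indexed in memory while the large assertional data is streamed, applying a single RDFS-style rule; SAOR's actual pD* rule set, authority checks and on-disk machinery are far richer than this.

```python
# Toy sketch of the terminological/assertional split: index the small schema
# in memory, stream the large assertional data, and apply one RDFS-style rule
# (type propagation along rdfs:subClassOf).
from collections import defaultdict

def build_subclass_index(schema_triples):
    index = defaultdict(set)
    for s, p, o in schema_triples:
        if p == "rdfs:subClassOf":
            index[s].add(o)
    return index

def infer_types(assertional_triples, subclass_index):
    for s, p, o in assertional_triples:
        yield (s, p, o)
        if p == "rdf:type":
            seen, stack = set(), [o]
            while stack:                      # follow subclass chains transitively
                for parent in subclass_index.get(stack.pop(), ()):
                    if parent not in seen:
                        seen.add(parent)
                        stack.append(parent)
                        yield (s, "rdf:type", parent)
```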


Author(s):  
Raghvendra Kumar ◽  
Priyanka Pandey ◽  
Prasant Kumar Pattnaik

The Web can be described as a repository of a wide range of information, present in the form of millions of websites dispersed around us. Users often find it difficult to locate the information that fulfils their needs among this abundance of websites. Hence, much research has been conducted in the field of web mining in order to present information matching the user's needs. Web mining is the application of data mining techniques to web usage, web content or web structure data in order to discover useful knowledge, such as users' navigation patterns and overall website usage statistics. The main motivation behind this work is to personalize the content of a website according to the user's preferences. New methods are developed to model a website using a link hierarchy and a conceptual link hierarchy, respectively, based on how users have used the website's link structure.


Author(s):  
Kai-Hsiang Yang

This chapter addresses Uniform Resource Locator (URL) correction techniques in proxy servers. Proxy servers are increasingly important in the World Wide Web (WWW): they provide web page caches so that pages can be browsed quickly, and they reduce unnecessary network traffic. Traditional proxy servers use the URL to identify cache entries, and a request results in a cache miss when its URL is not present in the cache. For most users, however, browsing follows some regularity and stays within a limited scope. It would be very convenient if users did not need to enter the whole long URL, or could still see the web content even after forgetting part of the URL, especially for personal favorite websites. We introduce a URL correction mechanism into a personal proxy server to achieve this goal.
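A minimal sketch of the URL-correction idea follows, assuming the proxy keeps a list of cached URLs and suggests the closest one on a cache miss; difflib's string similarity stands in for whatever matching the chapter's mechanism actually uses.

```python
# Minimal sketch of URL correction against a proxy cache: on a miss, suggest
# the closest cached URL by string similarity.
import difflib

def correct_url(requested, cached_urls, cutoff=0.8):
    matches = difflib.get_close_matches(requested, cached_urls, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Example: a user mistypes part of a favourite URL.
cache = ["http://example.org/news/today", "http://example.org/mail/inbox"]
print(correct_url("http://example.org/news/toady", cache))
```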


2010 ◽  
pp. 751-758
Author(s):  
P. Markellou

Over the last decade, we have witnessed an explosive growth in the information available on the Web. Today, web browsers provide easy access to myriad sources of text and multimedia data. Search engines index more than a billion pages, and finding the desired information is not an easy task. This profusion of resources has prompted the need for developing automatic mining techniques for the Web, giving rise to the term "Web mining" (Pal, Talwar, & Mitra, 2002). Web mining is the application of data mining techniques on the Web for discovering useful patterns and can be divided into three basic categories: Web content mining, Web structure mining, and Web usage mining. Web content mining includes techniques for assisting users in locating Web documents (i.e., pages) that meet certain criteria, while Web structure mining relates to discovering information based on the Web site structure data (the data depicting the Web site map). Web usage mining focuses on analyzing Web access logs and other sources of information regarding user interactions within the Web site in order to capture, understand and model users' behavioral patterns and profiles, and thereby improve their experience with the Web site.

As citizens' requirements and needs change continuously, traditional information searching and the fulfillment of various tasks result in the loss of valuable time spent identifying the responsible actor (public authority) and waiting in queues. At the same time, the percentage of users who are familiar with the Internet has increased remarkably (Internet World Stats, 2005). These two facts motivate many governmental organizations to provide e-services via their Web sites. The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth and popularity of e-government, e-commerce, and e-business applications. In this framework, the Web is emerging as the appropriate environment for business transactions and user-organization interactions. However, since it is a large collection of semi-structured and structured information sources, Web users often suffer from information overload.

Personalization is considered a popular solution for alleviating this problem and customizing the Web environment to users (Eirinaki & Vazirgiannis, 2003). Web personalization can be described as any action that makes the Web experience of a user personalized to his or her needs and wishes. Principal elements of Web personalization include the modeling of Web objects (pages) and subjects (users), the categorization of objects and subjects, matching between and across objects and/or subjects, and determining the set of actions to be recommended for personalization. In the remainder of this article, we present the way an e-government application can deploy Web mining techniques in order to support intelligent and personalized interactions with citizens. Specifically, we describe the tasks that typically comprise this process, illustrate the future trends, and discuss the open issues in the field.
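As a small, hypothetical illustration of the model/categorize/match/recommend loop described above, the sketch below represents a citizen profile and e-government pages as keyword-weight vectors and recommends the best-matching service; the vectors and service names are invented examples, not part of this article.

```python
# Invented example of the model/categorize/match/recommend loop: a citizen
# profile and candidate e-government pages are keyword-weight vectors, and the
# best-matching service is recommended.
def score(profile, page):
    return sum(profile.get(keyword, 0.0) * weight for keyword, weight in page.items())

pages = {
    "renew-passport": {"passport": 1.0, "travel": 0.5},
    "pay-property-tax": {"tax": 1.0, "property": 0.8},
}
profile = {"passport": 0.9, "travel": 0.4}   # e.g. derived from past access logs

best = max(pages, key=lambda name: score(profile, pages[name]))
print("recommended service:", best)
```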

