Standards opportunities around data-bearing Web pages

Author(s):  
David Karger

The evolving Web has seen ever-growing use of structured data, thanks to the way it enhances information authoring, querying, visualization and sharing. To date, however, most structured data authoring and management tools have been oriented towards programmers and Web developers, leaving end users unable to leverage structured data for information management and communication as effectively as professionals can. In this paper, I will argue that many of the benefits of structured data management can be provided to end users as well. I will describe an approach and tools that allow end users to define their own schemas (without knowing what a schema is), manage data, and author (not program) interactive Web visualizations of that data using the Web tools with which they are already familiar, such as plain Web pages, blogs, wikis and WYSIWYG document editors. I will describe our experience deploying these tools and some lessons relevant to their future evolution.

Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

Web data extraction is the process of extracting the information a user requires from a web page. This information is typically semi-structured rather than fully structured, and the extraction targets web documents in HTML format. Most people now rely on web data extractors, because the volume of information involved makes manual extraction slow and complicated. In this paper we present WEIDJ, an approach for extracting images from the web whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction of Images using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the page structure and uses JSON as its programming environment. The extraction process takes as input both a web address and an extraction structure. WEIDJ then splits the DOM tree into small subtrees and applies a visual-block search algorithm to each web page to find images. Our approach targets three levels of extraction: a single web page, multiple web pages, and a whole website. Extensive experiments on several biodiversity web pages compare the time performance of image extraction using DOM, JSON and WEIDJ on a single web page. The experimental results show that WEIDJ extracts images quickly and effectively.
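As a rough illustration of the DOM-plus-JSON idea, the sketch below walks a page's DOM, records each image together with the tag path at which it was found, and emits the result as JSON. It is a hypothetical reconstruction in Python, not the paper's implementation.

```python
# Minimal sketch of DOM-based image harvesting with JSON output
# (hypothetical; WEIDJ's actual code is not reproduced here).
import json
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}

class ImageHarvester(HTMLParser):
    """Walk the DOM of a template-based page, recording every image
    together with the tag path at which it was found."""
    def __init__(self):
        super().__init__()
        self.path = []      # stack of currently open tags
        self.images = []    # harvested image records

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append({"src": src,
                                    "tag_path": "/".join(self.path + [tag])})
        if tag not in VOID_TAGS:   # void elements never get a closing tag
            self.path.append(tag)

    def handle_endtag(self, tag):
        if self.path and self.path[-1] == tag:
            self.path.pop()

harvester = ImageHarvester()
harvester.feed("<html><body><div><img src='gecko.jpg'></div></body></html>")
print(json.dumps(harvester.images, indent=2))
```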


Author(s):  
Fagner Christian Paes ◽  
Willian Massami Watanabe

Cross-Browser Incompatibilities (XBIs) are inconsistencies that appear when a web application is rendered in different browsers. The growing number of browser implementations (Internet Explorer, Microsoft Edge, Mozilla Firefox, Google Chrome) and the constant evolution of Web technology specifications have produced differences in how browsers behave and render web pages. Because web applications must behave consistently across browsers, web developers should detect and avoid XBIs during the development process, overcoming the rendering differences that arise in different environments. Many web developers rely on manually inspecting web pages in several environments to detect XBIs, regardless of the cost and time that manual testing adds to development. Tools for automatic XBI detection speed up page inspection, but current tools have low precision and their evaluations report a large percentage of false positives. This research evaluates the use of Artificial Neural Networks to reduce the number of false positives in automatic XBI detection, using CSS (Cascading Style Sheets) properties and the relative comparison of elements in the web page.
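To make the classification step concrete, here is a hedged sketch using scikit-learn's MLPClassifier. The feature set (position and size deltas plus a count of differing CSS properties) and the toy training data are assumptions for illustration, not the paper's actual features or network.

```python
# Illustrative sketch: filtering candidate XBIs with a small neural
# network. Features and training rows are hypothetical.
import numpy as np
from sklearn.neural_network import MLPClassifier

# One row per DOM element matched across two browsers:
# [|dx|, |dy|, |dwidth|, |dheight|, num_css_properties_differing]
X_train = np.array([
    [0,  0,  0,  0, 0],   # identical rendering
    [1,  0,  2,  0, 1],   # negligible differences (false positive)
    [40, 12, 80, 30, 6],  # layout clearly diverges (genuine XBI)
    [25,  0, 55, 10, 4],
])
y_train = np.array([0, 0, 1, 1])  # 1 = genuine XBI, 0 = false positive

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

candidate = np.array([[30, 5, 60, 20, 5]])
print("XBI" if clf.predict(candidate)[0] else "consistent")
```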


Author(s):  
Kimihito Ito ◽  
Yuzuru Tanaka

Web applications, which are computer programs ported to the Web, allow end-users to use various remote services and tools through their Web browsers. There are an enormous number of Web applications on the Web, and they are becoming the basic infrastructure of everyday life. In spite of the remarkable development of Web-based infrastructure, it is still difficult for end-users to compose new integrated tools out of existing Web applications and legacy local applications, such as spreadsheets, chart tools, and databases. In this chapter, the authors propose a new framework in which end-users can wrap remote Web applications into visual components, called pads, and functionally combine them through drag-and-drop operations. The authors use as their basis the meme media architecture IntelligentPad, proposed by the second author. In the IntelligentPad architecture, each visual component, called a pad, has slots as data I/O ports. By pasting a pad onto another pad, users can integrate their functionalities. The framework presented in this chapter allows users to visually create a wrapper pad for any Web application by defining HTML nodes within the Web application to work as slots. Examples of such nodes include input forms and text strings on Web pages. Users can directly manipulate both wrapped Web applications and wrapped local legacy tools on their desktop screen to define application linkages among them. Since no programming expertise is required to wrap Web applications or to functionally combine them together, end-users can build new integrated tools out of both wrapped Web applications and local legacy applications.
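A toy sketch may help convey the slot-based composition model. The class names and the update protocol below are simplified assumptions, not the actual IntelligentPad architecture.

```python
# Toy sketch of slot-based pad composition (simplified assumption,
# not the authors' actual IntelligentPad implementation).
class Pad:
    def __init__(self, name):
        self.name = name
        self.slots = {}          # slot name -> current value
        self.connections = []    # (child pad, child slot, my slot)

    def paste(self, child, child_slot, my_slot):
        """Paste `child` onto this pad, linking one of its slots to ours."""
        self.connections.append((child, child_slot, my_slot))

    def set_slot(self, slot, value):
        self.slots[slot] = value
        # Propagate the new value to every pasted child pad.
        for child, child_slot, my_slot in self.connections:
            if my_slot == slot:
                child.set_slot(child_slot, value)

# A wrapper pad exposing a Web form result as a slot, linked to a chart pad.
search_form = Pad("wrapped-web-app")
chart = Pad("chart-tool")
search_form.paste(chart, child_slot="data", my_slot="result")
search_form.set_slot("result", [1, 2, 3])
print(chart.slots["data"])   # [1, 2, 3]
```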


10.29007/fvc9 ◽  
2019 ◽  
Author(s):  
Gautam Kishore Shahi ◽  
Durgesh Nandini ◽  
Sushma Kumari

Schema.org creates, supports and maintains schemas for structured data on web pages. For a non-technical author, it is difficult to publish content in a structured format. This work presents an automated way of inducing Schema.org markup from the natural-language context of web pages by applying knowledge-base creation techniques. Web Data Commons was used as the dataset, and the scope of the experimental part was limited to RDFa. The approach was implemented using the knowledge-graph building techniques Knowledge Vault and KnowMore.
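Assuming the entity-extraction step has already happened upstream (the part the paper addresses with Knowledge Vault and KnowMore techniques), rendering the result as RDFa is straightforward. The sketch below is a minimal illustration with a hypothetical helper name and example entity.

```python
# Hedged sketch: serialize an already-extracted entity as Schema.org
# RDFa markup. Entity extraction itself is assumed to happen upstream.
def to_rdfa(entity_type, properties):
    """Render one extracted entity as an RDFa-annotated HTML block."""
    lines = [f'<div vocab="https://schema.org/" typeof="{entity_type}">']
    for prop, value in properties.items():
        lines.append(f'  <span property="{prop}">{value}</span>')
    lines.append("</div>")
    return "\n".join(lines)

print(to_rdfa("Person", {"name": "Ada Lovelace", "jobTitle": "Mathematician"}))
```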


The Semantic Web is not just a matter of translating HTML into the RDF/OWL languages; it is a matter of understanding the content of the web through knowledge graphs, in which entities are connected by relationships. This content consists of resources (web pages) containing, for example, text, images and audio, so entities must be extracted from these resources. Currently, most web content is in the HTML5 format, a W3C recommendation that describes document structure only marginally, with the help of annotations. The main challenge is to transform unstructured data in plain HTML files into structured data (e.g., RDF or OWL). The current work provides first-hand information on dealing with unstructured, heterogeneous data residing on the web using Twinkle, a Java tool for executing SPARQL queries against FOAF (Friend Of A Friend) documents.
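Twinkle is an interactive Java tool, but the same kind of FOAF query can be sketched programmatically. The following uses Python's rdflib as a stand-in, with foaf.rdf as a placeholder file name.

```python
# Equivalent programmatic sketch of a SPARQL query over a FOAF document
# (rdflib stands in for the Twinkle GUI; foaf.rdf is a placeholder).
from rdflib import Graph

g = Graph()
g.parse("foaf.rdf")   # a FOAF document in RDF/XML

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE {
    ?person foaf:name ?name .
    OPTIONAL { ?person foaf:mbox ?mbox }
}
"""
for name, mbox in g.query(query):
    print(name, mbox)
```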


2013 ◽  
Vol 756-759 ◽  
pp. 1590-1594
Author(s):  
Gui Li ◽  
Cheng Chen ◽  
Zheng Yu Li ◽  
Zi Yang Han ◽  
Ping Sun

Fully automatic methods that extract structured data from the Web have been studied extensively. Existing methods suffice for simple extraction, but they often fail on more complicated Web pages. This paper introduces a method based on tag path clustering to extract structured data. The method obtains the complete collection of tag paths by parsing the DOM tree of the Web document. Tag paths are then clustered using an introduced similarity measure so that the data area can be located; next, exploiting features of tag position, records are separated and filtered, completing the data extraction. Experiments show this method achieves higher accuracy than previous methods.
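A minimal sketch of the first two steps, tag-path collection and similarity-based grouping, follows. The paper's similarity measure is not reproduced here, so plain sequence similarity from Python's difflib stands in for it.

```python
# Sketch of tag-path collection and greedy similarity clustering
# (difflib's ratio is a stand-in for the paper's similarity measure).
from difflib import SequenceMatcher
from html.parser import HTMLParser

class TagPathCollector(HTMLParser):
    """Record the root-to-element tag path of every opened element."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def cluster(paths, threshold=0.8):
    """Greedy clustering: a path joins the first cluster whose
    representative is similar enough, else it starts a new cluster."""
    clusters = []
    for p in paths:
        for c in clusters:
            if SequenceMatcher(None, p, c[0]).ratio() >= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

collector = TagPathCollector()
collector.feed("<html><body><ul><li>a</li><li>b</li></ul></body></html>")
print(cluster(collector.paths))
```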


2014 ◽  
Vol 11 (1) ◽  
pp. 111-131
Author(s):  
Tomas Grigalis ◽  
Antanas Cenys

Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems, which extract structured data using wrappers that must be matched only to pages of a particular template. Selecting a single template type from all crawled Web pages is a time-consuming task. Although there are methods to cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach to structurally cluster Web pages by employing the XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world websites in a few minutes, achieving more than 90% accuracy.
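The core idea can be sketched in a few lines: pages linked from the same XPath position in a site's navigation usually share a template, so target URLs are keyed by the XPath of the inbound anchor. The XPath construction below is simplified (no sibling indices) and purely illustrative.

```python
# Hedged sketch: group target URLs by the XPath of the anchor that
# links to them; same XPath suggests the same page template.
from collections import defaultdict
from html.parser import HTMLParser

class LinkXPathCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.links = []   # (xpath of <a>, target url)
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/" + "/".join(self.stack), href))
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

clusters = defaultdict(set)
collector = LinkXPathCollector()
collector.feed("<html><body><ul><li><a href='/item/1'>x</a></li>"
               "<li><a href='/item/2'>y</a></li></ul></body></html>")
for xpath, url in collector.links:
    clusters[xpath].add(url)   # same XPath -> presumably same template
print(dict(clusters))
```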


Author(s):  
Ji-Rong Wen

The Web is an open and free environment for people to publish and get information. Everyone on the Web can be either an author, a reader, or both. The language of the Web, HTML (Hypertext Markup Language), is mainly designed for information display, not for semantic representation. Therefore, current Web search engines usually treat Web pages as unstructured documents, and traditional information retrieval (IR) technologies are employed for Web page parsing, indexing, and searching. The unstructured essence of Web pages seriously blocks more accurate search and advanced applications on the Web. For example, many sites contain structured information about various products. Extracting and integrating product information from multiple Web sites could lead to powerful search functions, such as comparison shopping and business intelligence. However, these structured data are embedded in Web pages, and there are no proper traditional methods to extract and integrate them. Another example is the link structure of the Web. If used properly, information hidden in the links could be exploited to effectively improve search performance and take Web search beyond traditional information retrieval (Page, Brin, Motwani, & Winograd, 1998; Kleinberg, 1998). Although XML (Extensible Markup Language) is an effort to structuralize Web data by introducing semantics into tags, it is unlikely that common users are willing to compose Web pages using XML, due to its complexity and the lack of standard schema definitions. Even if XML is extensively adopted, a huge number of pages are still written in the HTML format and remain unstructured. Web structure mining is the class of methods to automatically discover structured data and information from the Web. Because the Web is dynamic, massive and heterogeneous, automated Web structure mining calls for novel technologies and tools that may take advantage of state-of-the-art techniques from various areas, including machine learning, data mining, information retrieval, databases, and natural language processing.
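As a concrete instance of the link-analysis idea cited above (Page et al., 1998), here is a minimal PageRank power-iteration sketch on a toy link graph; it is illustrative only and not part of the chapter.

```python
# Minimal PageRank power iteration on a toy link graph (illustrative).
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:                      # dangling node: spread evenly
                for q in pages:
                    new[q] += damping * rank[p] / len(pages)
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```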

