Standards opportunities around data-bearing Web pages

Author(s):  
David Karger

The evolving Web has seen ever-growing use of structured data, thanks to the way it enhances information authoring, querying, visualization and sharing. To date, however, most structured data authoring and management tools have been oriented towards programmers and Web developers, leaving end users unable to leverage structured data for information management and communication as effectively as professionals can. In this paper, I will argue that many of the benefits of structured data management can be provided to end users as well. I will describe an approach and tools that allow end users to define their own schemas (without knowing what a schema is), manage data, and author (not program) interactive Web visualizations of that data using the Web tools with which they are already familiar, such as plain Web pages, blogs, wikis and WYSIWYG document editors. I will describe our experience deploying these tools and some lessons relevant to their future evolution.

Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

Web data extraction is the process of extracting the information a user requires from a web page. This information is typically semi-structured rather than fully structured, and the extraction targets web documents in HTML format. Most people now rely on web data extractors, because the volume of information involved makes manual extraction slow and complicated. In this paper we present WEIDJ, an approach for extracting images from the web whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction of Images using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the page structure and uses JSON as its programming environment. The extraction process takes as input both a web address and an extraction structure. WEIDJ then splits the DOM tree into small subtrees and applies a visual-block search algorithm to each web page to find images. Our approach targets three levels of extraction: a single web page, multiple web pages, and a whole website. Extensive experiments on several biodiversity web pages compare the time performance of image extraction using DOM, JSON and WEIDJ on a single web page. The experimental results show that WEIDJ extracts images quickly and effectively.
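As a rough illustration of the DOM-plus-JSON idea, the sketch below walks a page's DOM, records each image together with the tag path at which it was found, and emits the result as JSON. It is a hypothetical reconstruction in Python, not the paper's implementation.

```python
# Minimal sketch of DOM-based image harvesting with JSON output
# (hypothetical; WEIDJ's actual code is not reproduced here).
import json
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}

class ImageHarvester(HTMLParser):
    """Walk the DOM of a template-based page, recording every image
    together with the tag path at which it was found."""
    def __init__(self):
        super().__init__()
        self.path = []      # stack of currently open tags
        self.images = []    # harvested image records

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append({"src": src,
                                    "tag_path": "/".join(self.path + [tag])})
        if tag not in VOID_TAGS:   # void elements never get a closing tag
            self.path.append(tag)

    def handle_endtag(self, tag):
        if self.path and self.path[-1] == tag:
            self.path.pop()

harvester = ImageHarvester()
harvester.feed("<html><body><div><img src='gecko.jpg'></div></body></html>")
print(json.dumps(harvester.images, indent=2))
```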


Author(s):  
Fagner Christian Paes ◽  
Willian Massami Watanabe

Cross-Browser Incompatibilities (XBIs) are inconsistencies that appear when a web application is rendered in different browsers. The growing number of browser implementations (Internet Explorer, Microsoft Edge, Mozilla Firefox, Google Chrome) and the constant evolution of Web technology specifications have produced differences in how browsers behave and render web pages. Because web applications must behave consistently across browsers, web developers should detect and avoid XBIs during the development process, overcoming the rendering differences that arise in different environments. Many web developers rely on manually inspecting web pages in several environments to detect XBIs, regardless of the cost and time that manual testing adds to development. Tools for automatic XBI detection speed up page inspection, but current tools have low precision and their evaluations report a large percentage of false positives. This research evaluates the use of Artificial Neural Networks to reduce the number of false positives in automatic XBI detection, using CSS (Cascading Style Sheets) properties and the relative comparison of elements in the web page.
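To make the classification step concrete, here is a hedged sketch using scikit-learn's MLPClassifier. The feature set (position and size deltas plus a count of differing CSS properties) and the toy training data are assumptions for illustration, not the paper's actual features or network.

```python
# Illustrative sketch: filtering candidate XBIs with a small neural
# network. Features and training rows are hypothetical.
import numpy as np
from sklearn.neural_network import MLPClassifier

# One row per DOM element matched across two browsers:
# [|dx|, |dy|, |dwidth|, |dheight|, num_css_properties_differing]
X_train = np.array([
    [0,  0,  0,  0, 0],   # identical rendering
    [1,  0,  2,  0, 1],   # negligible differences (false positive)
    [40, 12, 80, 30, 6],  # layout clearly diverges (genuine XBI)
    [25,  0, 55, 10, 4],
])
y_train = np.array([0, 0, 1, 1])  # 1 = genuine XBI, 0 = false positive

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

candidate = np.array([[30, 5, 60, 20, 5]])
print("XBI" if clf.predict(candidate)[0] else "consistent")
```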


Author(s):  
Kimihito Ito ◽  
Yuzuru Tanaka

Web applications, which are computer programs ported to the Web, allow end-users to use various remote services and tools through their Web browsers. There are an enormous number of Web applications on the Web, and they are becoming the basic infrastructure of everyday life. In spite of the remarkable development of Web-based infrastructure, it is still difficult for end-users to compose new integrated tools out of existing Web applications and legacy local applications, such as spreadsheets, chart tools, and databases. In this chapter, the authors propose a new framework in which end-users can wrap remote Web applications into visual components, called pads, and functionally combine them through drag-and-drop operations. The authors use as their basis the meme media architecture IntelligentPad, proposed by the second author. In the IntelligentPad architecture, each visual component, called a pad, has slots as data I/O ports. By pasting a pad onto another pad, users can integrate their functionalities. The framework presented in this chapter allows users to visually create a wrapper pad for any Web application by defining HTML nodes within the Web application to work as slots. Examples of such nodes include input forms and text strings on Web pages. Users can directly manipulate both wrapped Web applications and wrapped local legacy tools on their desktop screen to define application linkages among them. Since no programming expertise is required to wrap Web applications or to functionally combine them together, end-users can build new integrated tools out of both wrapped Web applications and local legacy applications.
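A toy sketch may help convey the slot-based composition model. The class names and the update protocol below are simplified assumptions, not the actual IntelligentPad architecture.

```python
# Toy sketch of slot-based pad composition (simplified assumption,
# not the authors' actual IntelligentPad implementation).
class Pad:
    def __init__(self, name):
        self.name = name
        self.slots = {}          # slot name -> current value
        self.connections = []    # (child pad, child slot, my slot)

    def paste(self, child, child_slot, my_slot):
        """Paste `child` onto this pad, linking one of its slots to ours."""
        self.connections.append((child, child_slot, my_slot))

    def set_slot(self, slot, value):
        self.slots[slot] = value
        # Propagate the new value to every pasted child pad.
        for child, child_slot, my_slot in self.connections:
            if my_slot == slot:
                child.set_slot(child_slot, value)

# A wrapper pad exposing a Web form result as a slot, linked to a chart pad.
search_form = Pad("wrapped-web-app")
chart = Pad("chart-tool")
search_form.paste(chart, child_slot="data", my_slot="result")
search_form.set_slot("result", [1, 2, 3])
print(chart.slots["data"])   # [1, 2, 3]
```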


10.29007/fvc9 ◽  
2019 ◽  
Author(s):  
Gautam Kishore Shahi ◽  
Durgesh Nandini ◽  
Sushma Kumari

Schema.org creates, supports and maintains schemas for structured data on web pages. For a non-technical author, it is difficult to publish content in a structured format. This work presents an automated way of inducing Schema.org markup from the natural-language context of web pages by applying knowledge-base creation techniques. Web Data Commons was used as the dataset, and the scope of the experimental part was limited to RDFa. The approach was implemented using the knowledge-graph building techniques Knowledge Vault and KnowMore.
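Assuming the entity-extraction step has already happened upstream (the part the paper addresses with Knowledge Vault and KnowMore techniques), rendering the result as RDFa is straightforward. The sketch below is a minimal illustration with a hypothetical helper name and example entity.

```python
# Hedged sketch: serialize an already-extracted entity as Schema.org
# RDFa markup. Entity extraction itself is assumed to happen upstream.
def to_rdfa(entity_type, properties):
    """Render one extracted entity as an RDFa-annotated HTML block."""
    lines = [f'<div vocab="https://schema.org/" typeof="{entity_type}">']
    for prop, value in properties.items():
        lines.append(f'  <span property="{prop}">{value}</span>')
    lines.append("</div>")
    return "\n".join(lines)

print(to_rdfa("Person", {"name": "Ada Lovelace", "jobTitle": "Mathematician"}))
```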


The Semantic Web is not just a matter of translating HTML into the RDF/OWL languages; it is a matter of understanding the content of the web through knowledge graphs, in which entities are connected by relationships. This content consists of resources (web pages) containing, for example, text, images and audio, so entities must be extracted from these resources. Currently, most web content is in the HTML5 format, a W3C recommendation that describes document structure only marginally, with the help of annotations. The main challenge is to transform unstructured data in plain HTML files into structured data (e.g., RDF or OWL). The current work provides first-hand information on dealing with unstructured, heterogeneous data residing on the web using Twinkle, a Java tool for executing SPARQL queries against FOAF (Friend Of A Friend) documents.
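Twinkle is an interactive Java tool, but the same kind of FOAF query can be sketched programmatically. The following uses Python's rdflib as a stand-in, with foaf.rdf as a placeholder file name.

```python
# Equivalent programmatic sketch of a SPARQL query over a FOAF document
# (rdflib stands in for the Twinkle GUI; foaf.rdf is a placeholder).
from rdflib import Graph

g = Graph()
g.parse("foaf.rdf")   # a FOAF document in RDF/XML

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE {
    ?person foaf:name ?name .
    OPTIONAL { ?person foaf:mbox ?mbox }
}
"""
for name, mbox in g.query(query):
    print(name, mbox)
```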


2013 ◽  
Vol 756-759 ◽  
pp. 1590-1594
Author(s):  
Gui Li ◽  
Cheng Chen ◽  
Zheng Yu Li ◽  
Zi Yang Han ◽  
Ping Sun

Fully automatic methods that extract structured data from the Web have been studied extensively. Existing methods suffice for simple extraction, but they often fail on more complicated Web pages. This paper introduces a method based on tag path clustering to extract structured data. The method obtains the complete collection of tag paths by parsing the DOM tree of the Web document. Tag paths are then clustered using an introduced similarity measure so that the data area can be located; next, exploiting features of tag position, records are separated and filtered, completing the data extraction. Experiments show this method achieves higher accuracy than previous methods.
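A minimal sketch of the first two steps, tag-path collection and similarity-based grouping, follows. The paper's similarity measure is not reproduced here, so plain sequence similarity from Python's difflib stands in for it.

```python
# Sketch of tag-path collection and greedy similarity clustering
# (difflib's ratio is a stand-in for the paper's similarity measure).
from difflib import SequenceMatcher
from html.parser import HTMLParser

class TagPathCollector(HTMLParser):
    """Record the root-to-element tag path of every opened element."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def cluster(paths, threshold=0.8):
    """Greedy clustering: a path joins the first cluster whose
    representative is similar enough, else it starts a new cluster."""
    clusters = []
    for p in paths:
        for c in clusters:
            if SequenceMatcher(None, p, c[0]).ratio() >= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

collector = TagPathCollector()
collector.feed("<html><body><ul><li>a</li><li>b</li></ul></body></html>")
print(cluster(collector.paths))
```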


2014 ◽  
Vol 11 (1) ◽  
pp. 111-131
Author(s):  
Tomas Grigalis ◽  
Antanas Cenys

Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems, which extract structured data using wrappers that must be matched only to pages of a particular template. Selecting a single template type from all crawled Web pages is a time-consuming task. Although there are methods to cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach to structurally cluster Web pages by employing the XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world websites in a few minutes, achieving more than 90% accuracy.
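The core idea can be sketched in a few lines: pages linked from the same XPath position in a site's navigation usually share a template, so target URLs are keyed by the XPath of the inbound anchor. The XPath construction below is simplified (no sibling indices) and purely illustrative.

```python
# Hedged sketch: group target URLs by the XPath of the anchor that
# links to them; same XPath suggests the same page template.
from collections import defaultdict
from html.parser import HTMLParser

class LinkXPathCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.links = []   # (xpath of <a>, target url)
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/" + "/".join(self.stack), href))
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

clusters = defaultdict(set)
collector = LinkXPathCollector()
collector.feed("<html><body><ul><li><a href='/item/1'>x</a></li>"
               "<li><a href='/item/2'>y</a></li></ul></body></html>")
for xpath, url in collector.links:
    clusters[xpath].add(url)   # same XPath -> presumably same template
print(dict(clusters))
```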


Author(s):  
Ji-Rong Wen

The Web is an open and free environment for people to publish and get information. Everyone on the Web can be either an author, a reader, or both. The language of the Web, HTML (Hypertext Markup Language), is mainly designed for information display, not for semantic representation. Therefore, current Web search engines usually treat Web pages as unstructured documents, and traditional information retrieval (IR) technologies are employed for Web page parsing, indexing, and searching. The unstructured essence of Web pages seriously blocks more accurate search and advanced applications on the Web. For example, many sites contain structured information about various products. Extracting and integrating product information from multiple Web sites could lead to powerful search functions, such as comparison shopping and business intelligence. However, these structured data are embedded in Web pages, and there are no proper traditional methods to extract and integrate them. Another example is the link structure of the Web. If used properly, information hidden in the links could be exploited to effectively improve search performance and take Web search beyond traditional information retrieval (Page, Brin, Motwani, & Winograd, 1998; Kleinberg, 1998). Although XML (Extensible Markup Language) is an effort to structuralize Web data by introducing semantics into tags, it is unlikely that common users are willing to compose Web pages using XML, due to its complexity and the lack of standard schema definitions. Even if XML is extensively adopted, a huge number of pages are still written in the HTML format and remain unstructured. Web structure mining is the class of methods to automatically discover structured data and information from the Web. Because the Web is dynamic, massive and heterogeneous, automated Web structure mining calls for novel technologies and tools that may take advantage of state-of-the-art techniques from various areas, including machine learning, data mining, information retrieval, databases, and natural language processing.
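As a concrete instance of the link-analysis idea cited above (Page et al., 1998), here is a minimal PageRank power-iteration sketch on a toy link graph; it is illustrative only and not part of the chapter.

```python
# Minimal PageRank power iteration on a toy link graph (illustrative).
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:                      # dangling node: spread evenly
                for q in pages:
                    new[q] += damping * rank[p] / len(pages)
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```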

