Clustering of Web Documents with Structure of Webpages based on the HTML Document Object Model

As the adoption of XML reaches more and more application domains, data sizes increase and efficient XML handling becomes ever more important. Many applications face scalability problems due to the overhead of XML parsing, the difficulty of finding particular XML nodes efficiently, or the sheer size of XML documents, which nowadays can easily exceed gigabytes of data. In particular, the latter issue can make certain tasks seemingly impossible to handle, as many applications depend on parsing XML documents completely into a Document Object Model (DOM) memory structure. Parsing XML into a DOM typically requires as much memory as the serialized XML itself, or even more, which makes it prohibitively expensive to handle XML documents in the gigabyte range. Recent research and development suggest that it is possible to modify these applications to run a wide range of tasks in a streaming fashion, thus limiting the memory consumption of individual applications. However, this requires changes not only in the underlying tools but often also in user code, such as XSLT stylesheets. These changes can be unintuitive and can complicate user code. A different approach is to run applications against an efficient, persistent, disk-backed DOM implementation that does not require entire documents to be in memory at once. This talk will discuss such a DOM implementation, EMC's xDB, showing how to use binary XML and efficient backend structures to provide a standards-compliant, non-memory-backed, transactional DOM implementation with little overhead compared to regular memory-based DOMs. It will also give performance comparisons and show how to run existing applications transparently against xDB's DOM implementation, using XSLT stylesheets as an example.
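The contrast between full DOM parsing and streaming that the abstract describes can be illustrated with a minimal sketch. This is not xDB and does not reflect its API, which the abstract does not show; it uses Python's standard `xml.etree.ElementTree` as a stand-in, and the function names and `SAMPLE` document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def count_tag_dom(xml_bytes: bytes, tag: str) -> int:
    """Parse the whole document into an in-memory tree, then count matches."""
    root = ET.fromstring(xml_bytes)
    return sum(1 for _ in root.iter(tag))


def count_tag_streaming(xml_bytes: bytes, tag: str) -> int:
    """Count matches incrementally, discarding each element once it is closed."""
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == tag:
            count += 1
        # Drop the element's children, text, and attributes so that finished
        # subtrees do not accumulate in memory.
        elem.clear()
    return count


# Hypothetical sample document, used only for illustration.
SAMPLE = b"<log><entry id='1'/><entry id='2'/><note/><entry id='3'/></log>"

print(count_tag_dom(SAMPLE, "entry"))
print(count_tag_streaming(SAMPLE, "entry"))
```

Both functions return the same count, but the streaming version has to be written around parse events rather than tree navigation, which is the kind of change to user code the abstract mentions as a drawback of streaming.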


Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction

Proceedings of the tenth international conference on World Wide Web - WWW '01

10.1145/371920.372054

2001

Cited By ~ 65

Author(s): Soumen Chakrabarti

Keyword(s): Information Extraction, Object Model, Document Object Model, Topic Distillation


My document object model can do more than yours

Proceedings of Balisage: The Markup Conference 2013

10.4242/balisagevol10.couthures01

2013

Cited By ~ 1

Author(s): Alain Couthures

Keyword(s): Object Model, Data Types, Document Object Model, Xml Documents, Object Models, Xml Technologies, Sequence Types

Document object models, specifically the browser DOM, were designed to represent HTML and XML documents. Languages such as XPath were designed to access and traverse the DOM of HTML and XML documents. But suppose we wanted to bring the power and convenience of XML technologies like XPath to new data types. Could we extend the DOM to support CSV files? JSON? ZIP files? Yes we can! This paper explores a number of ways in which the DOM can be made to do more. We can loosen restrictions, describe new sequence types, and even define new XPath axes to make the DOM better and more useful.
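The idea of querying non-XML data through a DOM can be sketched as follows. This is not the paper's implementation, which the abstract does not detail; it is an assumption-laden illustration that maps a CSV file onto an element tree and queries it with the limited XPath subset supported by Python's standard `xml.etree.ElementTree`. The `csv_to_tree` helper, the `csv`/`row` element names, and the sample data are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_tree(text: str) -> ET.Element:
    """Map CSV text onto an element tree: one <row> per record, one child per column.

    Assumes the header names are usable as element names.
    """
    reader = csv.DictReader(io.StringIO(text))
    root = ET.Element("csv")
    for record in reader:
        row = ET.SubElement(root, "row")
        for name, value in record.items():
            cell = ET.SubElement(row, name)
            cell.text = value
    return root


# Hypothetical sample data, used only for illustration.
SAMPLE = "name,lang\nAda,en\nBlaise,fr\nGrace,en\n"

tree = csv_to_tree(SAMPLE)

# XPath-style query: names from every row whose <lang> child has the text 'en'.
english = [row.findtext("name") for row in tree.findall("./row[lang='en']")]
print(english)
```

Once the data is exposed as a tree, the same path expressions used for XML documents apply unchanged, which is the convenience the abstract argues for.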
