Structure- and Content-Based Retrieval for XML Documents

Author(s):  
Jae-Woo Chang

XML was proposed in 1996 as a standard markup language for creating Web documents (Extensible Markup Language, 2000). It is as expressive as SGML and as easy to use as HTML. Recently, it has become common for users to acquire through the Web a variety of multimedia documents written in XML. Meanwhile, because the number of XML documents is increasing dramatically, it is difficult to locate the specific XML document a user requires. Moreover, an XML document not only has a logical and hierarchical structure, but also contains multimedia data, such as images and video. Thus, it is necessary to retrieve XML documents based on both document structure and image content. To support structure-based retrieval, it is necessary to design four efficient index structures, that is, keyword, structure, element, and attribute indexes, by indexing XML documents using the element as the basic unit. To support content-based retrieval, it is necessary to design a high-dimensional index structure that stores and retrieves both color and shape feature vectors efficiently.

2011 ◽  
pp. 153-166
Author(s):  
Jae Woo Chang ◽  
Du-Seok Jin

As the number of XML documents is increasing dramatically, it is necessary to develop an XML document retrieval system that supports both structure-based and content-based retrieval. To support structure-based retrieval, we design four efficient index structures, i.e., keyword, structure, element, and attribute indexes, by indexing XML documents using the element as the basic unit. To support content-based retrieval, we design a high-dimensional index structure based on the X-tree that stores and retrieves both color and shape feature vectors efficiently. Finally, we evaluate the performance of our XML document retrieval system in terms of system efficiency, such as retrieval time, insertion time, and storage overhead, as well as system effectiveness, such as recall and precision.
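
The four structure-based indexes named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_indexes` and `find_in_ancestor`, the integer element ids, the in-memory dictionaries, and the sample document are all assumptions made for illustration, and the X-tree used for content-based retrieval is not covered.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document, used only for illustration.
SAMPLE = (
    '<paper lang="en"><title>XML retrieval</title>'
    '<section id="s1"><para>index structure</para></section></paper>'
)

def build_indexes(xml_text):
    """Build four toy indexes, using each element as the indexing unit."""
    root = ET.fromstring(xml_text)
    keyword = defaultdict(set)    # word -> ids of elements whose own text contains it
    structure = {}                # element id -> parent element id (None for the root)
    element = defaultdict(set)    # tag name -> element ids
    attribute = defaultdict(set)  # (attribute name, value) -> element ids

    next_id = 0
    stack = [(root, None)]
    while stack:
        node, parent_id = stack.pop()
        node_id = next_id         # ids follow document (pre-order) order
        next_id += 1
        structure[node_id] = parent_id
        element[node.tag].add(node_id)
        for name, value in node.attrib.items():
            attribute[(name, value)].add(node_id)
        for word in (node.text or "").lower().split():
            keyword[word].add(node_id)
        # Push children reversed so they are popped in document order.
        for child in reversed(list(node)):
            stack.append((child, node_id))
    return keyword, structure, element, attribute

def find_in_ancestor(keyword, structure, element, word, tag, ancestor_tag):
    """Ids of <tag> elements containing `word` that lie under an <ancestor_tag>."""
    hits = keyword.get(word, set()) & element.get(tag, set())
    ancestors = element.get(ancestor_tag, set())
    result = []
    for node_id in sorted(hits):
        parent = structure[node_id]
        # Walk up the parent chain until an ancestor matches or the root is passed.
        while parent is not None and parent not in ancestors:
            parent = structure[parent]
        if parent is not None:
            result.append(node_id)
    return result
```

A structure-based query such as "paragraphs containing the word index inside a section" then becomes a set intersection on the keyword and element indexes, followed by a walk up the structure index.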


Author(s):  
Jana Polgar ◽  
Robert Mark Braum ◽  
Tony Polgar

XML stands for Extensible Markup Language (http://www.w3.org/XML/), and it has been adopted by industry for exchanging data in a platform-, language-, and protocol-independent fashion. While XML has many benefits during the development stage, it has some performance disadvantages. This chapter provides a quick look at the following topics: 1. Overview of the standard and basic concepts; 2. Basic XML document structure; 3. Information about the usage of Document Type Definition (DTD); 4. Structure and usage of XML Schema; and 5. Discussion of the design and performance issues when using XML documents with Web services.


Author(s):  
Ibrahim Dweib ◽  
Joan Lu

Extensible Markup Language (XML) is now one of the most important standard media for exchanging and representing data over the Internet. Storing, updating, and retrieving the huge amount of web services data, such as XML, is an attractive area of research for researchers and database vendors. In this chapter, the authors propose and develop a new mapping model, called MAXDOR, for storing, rebuilding, updating, and querying XML documents using a relational database without making use of any XML schemas in the mapping process. The model addresses the structural gap between ordered, hierarchical XML and the unordered, tabular relational database, making it possible to use relational database systems for storing, updating, and querying XML data. A multiple linked list is used to maintain XML document structure, manage the process of updating document contents, and retrieve document contents efficiently. Experiments are conducted to evaluate the MAXDOR model. MAXDOR is compared with other well-known models available in the literature (Tatarinov et al., 2002) and (Torsten et al., 2004) using the total expected value of the execution time for rebuilding an XML document and for inserting a token.


2011 ◽  
pp. 286-291
Author(s):  
Kalpdrum Passi ◽  
Louise Lane ◽  
Sanjay Madria ◽  
Mukesh Mohania

XML (eXtensible Markup Language) is used to describe semi-structured data, i.e., irregular or incomplete data whose structure may be subject to unpredictable changes. Unlike traditional semi-structured data, XML documents are self-describing; thus, XML provides a platform-independent means to describe data and, therefore, can transport data from one platform to another (Bray, Paoli, & Sperberg-McQueen, 1998). XML documents can be both created and used by applications. The valid content, allowed structure, and metadata properties of XML documents are described by their related schema(s) (Thompson, Beech, Maloney, & Mendelsohn, 2001). An XML document is said to be valid if it conforms to its related schema. A schema also gives additional semantic meaning to the data it is used to tag. The schema is provided independently of the data it describes. Any given data set may rely on multiple schemas for validation. Any given schema may itself refer to multiple schemas.


Author(s):  
Mohammed Ragheb Hakawati ◽  
Yasmin Yacob ◽  
Amiza Amir ◽  
Jabiry M. Mohammed ◽  
Khalid Jamal Jadaa

Extensible Markup Language (XML) is emerging as the primary standard for representing and exchanging data; accounting for more than 60% of the total, XML is considered the most dominant document type on the web. Nevertheless, the quality of XML documents is not as expected. XML integrity constraints, especially XFDs, play an important role in keeping an XML dataset as consistent as possible, but their ability to solve data quality issues is still intangible. The main reason is that old-fashioned data dependencies were basically introduced to maintain the consistency of the schema rather than that of the data. The purpose of this study is to introduce a method for discovering pattern tableaus for XML conditional dependencies, to be used for enhancing XML document consistency as part of the data quality improvement phases. Conditional dependencies are new rules designed mainly to improve data instances; they extend traditional XML dependencies by enforcing pattern tableaus of semantically related constants. Subsequently, a set of minimal approximate conditional dependencies (XCFD, XCIND) is discovered and learned from the XML tree using a set of mining algorithms. The discovered patterns can be used as master data to detect inconsistencies that don't respect the majority of the dataset.


Author(s):  
Joseph Fong ◽  
Herbert Shiu

Extensible Markup Language (XML) has become a standard for persistent storage and data interchange via the Internet due to its openness, self-descriptiveness, and flexibility. This chapter proposes a systematic approach to reverse engineer arbitrary XML documents to their conceptual schema, an Extended DTD Graph, which is a DTD Graph with data semantics. The proposed approach not only determines the structure of the XML document, but also derives candidate data semantics from the XML element instances by treating each XML element instance as a record in a table of a relational database. One application of the determined data semantics is to verify the linkages among elements. Implicit and explicit referential linkages among XML elements are modeled by the parent-children structure and ID/IDREF(S), respectively. As a result, an arbitrary XML document can be reverse engineered into its conceptual schema in an Extended DTD Graph format.
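
The idea of verifying explicit linkages among elements can be sketched in a few lines. This is only an illustration, not the chapter's algorithm: without a DTD the parser cannot know which attributes are of type ID or IDREF, so the attribute names `id` and `dept_ref`, the function name `dangling_refs`, and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: employees refer to departments through "dept_ref".
DOC = """<db>
  <dept id="d1"><name>Sales</name></dept>
  <emp id="e1" dept_ref="d1"/>
  <emp id="e2" dept_ref="d9"/>
</db>"""

def dangling_refs(xml_text, id_attr="id", ref_attr="dept_ref"):
    """Return (tag, reference) pairs whose reference matches no declared id."""
    root = ET.fromstring(xml_text)
    # Collect every value of the assumed ID attribute.
    ids = {e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None}
    # Keep the references that do not resolve to any collected id.
    return [
        (e.tag, e.get(ref_attr))
        for e in root.iter()
        if e.get(ref_attr) is not None and e.get(ref_attr) not in ids
    ]
```

Here the second `emp` element points at a department id that is never declared, so it is reported as a broken explicit linkage; implicit linkages are already carried by the parent-children nesting of the tree.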


2012 ◽  
Vol 10 (3) ◽  
pp. 13-26
Author(s):  
Xiaomin Zhu ◽  
Zhongxiang He ◽  
Shengbo Shi

Extensible Markup Language (XML) is a textual markup language that is becoming more and more important in Internet web services. However, XML has some distinct disadvantages, such as its inherent redundancy, which consumes a great deal of limited network bandwidth, especially in mobile computing. Given the characteristics of mobile commerce, handset memory capacity and data processing time are two obstacles to applying XML. This paper studies an enhancement of XML for mobile e-commerce applications, called SXML (Simple XML), which is intended to improve the XML used in mobile web services. It helps XML producers minimize the size effects of XML, e.g., the size overhead and slow implementation speed. Comprehensive simulations show that SXML can reduce the size of XML documents and the implementation time, and consequently use bandwidth more effectively.


2011 ◽  
pp. 879-899
Author(s):  
Laura Irina Rusu ◽  
Wenny Rahayu ◽  
David Taniar

This chapter presents some of the existing mining techniques for extracting association rules from XML documents in the context of rapid changes in the Web knowledge discovery area. This study was motivated by the fast emergence of XML (eXtensible Markup Language) as a standard language for representing semistructured data and as a new standard for exchanging information between different applications. The data exchanged as XML documents become richer every day, so the need not only to store these large volumes of XML data for later use, but also to mine them to discover interesting information, has become obvious. The hidden knowledge can be used in various ways, for example, to decide on a business issue or to make predictions about future e-customer behaviour in a Web application. One type of knowledge that can be discovered in a collection of XML documents relates to association rules between parts of the document, and this chapter presents some of the top techniques for extracting them.
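
As a rough illustration of what an association rule over XML data measures, the sketch below computes the standard support and confidence of a rule from a small XML collection. It is not one of the techniques surveyed in the chapter: the `orders`/`order`/`item` layout, the sample data, and the function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection: each <order> is treated as one transaction.
DATA = """<orders>
  <order><item>pen</item><item>ink</item></order>
  <order><item>pen</item><item>paper</item></order>
  <order><item>pen</item><item>ink</item><item>paper</item></order>
  <order><item>paper</item></order>
</orders>"""

def transactions(xml_text):
    """Turn every <order> element into a set of item names."""
    root = ET.fromstring(xml_text)
    return [
        frozenset(item.text for item in order.findall("item"))
        for order in root.findall("order")
    ]

def support(txns, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in txns if itemset <= t) / len(txns)

def confidence(txns, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(txns, both) / support(txns, antecedent)
```

With this data, every order that contains ink also contains a pen, so the rule "ink implies pen" has confidence 1.0 with support 0.5, while the reverse rule "pen implies ink" is weaker.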


2008 ◽  
Vol 8 (3) ◽  
pp. 323-361 ◽  
Author(s):  
J. M. ALMENDROS-JIMÉNEZ ◽  
A. BECERRA-TERÓN ◽  
F. J. ENCISO-BAÑOS

Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML. Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. The XPath language is the result of an effort to provide a way to address parts of an XML document. In support of this primary purpose, it also serves as a query language against an XML document. In this paper we present a proposal for the implementation of the XPath language in logic programming. With this aim we describe the representation of XML documents by means of a logic program. Rules and facts can be used for representing the document schema and the XML document itself. In particular, we present how to index XML documents in logic programs: rules are assumed to be stored in main memory, whereas facts are stored in secondary memory using two kinds of indexes: one for each XML tag, and another for each group of terminal items. In addition, we study how to query a logic program representing an XML document by means of the XPath language. This involves the specialization of the logic program with regard to the XPath expression. Finally, we also explain how to combine the indexing and the top-down evaluation of the logic program.
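
To make the role of XPath concrete, here is a small sketch of addressing parts of a document and of keeping one index per tag. It uses the limited XPath subset of Python's standard `xml.etree.ElementTree` rather than the logic-programming implementation the paper proposes; the sample document and the `tag_index` dictionary are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document.
DOC = """<books>
  <book year="2003"><title>ABC</title><author>Smith</author></book>
  <book year="2008"><title>DEF</title><author>Jones</author></book>
</books>"""

root = ET.fromstring(DOC)

# Path expression: the titles of all books directly under the root.
all_titles = [t.text for t in root.findall("./book/title")]

# Path expression with an attribute predicate: titles of books from 2008.
titles_2008 = [t.text for t in root.findall("./book[@year='2008']/title")]

# One in-memory index per tag, loosely echoing the per-tag index in the abstract.
tag_index = defaultdict(list)
for elem in root.iter():
    tag_index[elem.tag].append(elem)
```

A per-tag index lets a query that names a tag start from the matching elements directly instead of scanning the whole tree.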

