Web Retrieval of XML Documents

Author(s):  
Barbara Catania ◽  
Elena Ferrari

The Web is characterized by a huge number of very heterogeneous data sources that differ both in media support and in format representation. In this scenario, there is a need for an integrated approach to querying heterogeneous Web documents. XML can play an important role here, since it is becoming a standard for data representation and exchange over the Web. Owing to its flexibility, XML is currently being used as an interface language over the Web, through which (parts of) document sources are represented and exported. Under this assumption, the problem of querying heterogeneous sources can be reduced to the problem of querying XML data sources. In this chapter, we first survey the most relevant query languages for XML data proposed both by the scientific community and by standardization committees such as the W3C, focusing mainly on their expressive power. We then investigate how typical Information Retrieval concepts, such as ranking, similarity-based search, and profile-based search, can be applied to XML query languages. Commercial products based on the considered approaches are then briefly surveyed. Finally, we conclude the chapter with an overview of the most promising research trends in the field.

2009 ◽  
pp. 2472-2488
Author(s):  
Angelo Brayner ◽  
Marcelo Meirelles ◽  
José de Aguiar Moraes Filho

Integrating data sources published on the Web requires an integration strategy that guarantees the local data sources’ autonomy. A multidatabase system (MDBS) has been consolidated as an approach to integrate multiple heterogeneous and distributed data sources in flexible and dynamic environments such as the Web. A key property of MDBSs is to guarantee a higher degree of local autonomy. In order to adopt the MDBS strategy, it is necessary to use a query language, called the MultiDatabase Language (MDL), which provides the necessary constructs for jointly manipulating and accessing data in heterogeneous data sources. In other words, the MDL is responsible for solving integration conflicts. This chapter describes an extension to the XQuery Language, called MXQuery, which supports queries over several data sources and solves such integration problems as semantic heterogeneity and incomplete information.


2009 ◽  
Vol 35 (5) ◽  
pp. 571-601 ◽  
Author(s):  
Timo Niemi ◽  
Turkka Näppilä ◽  
Kalervo Järvelin

There are numerous approaches for integrating data from heterogeneous data sources. A common background assumption is that the data sources remain quite stable and are known in advance, so that an integration system can be built to manipulate them. In practice, however, there is often a demand for supporting ad hoc information needs concerning unexpected autonomous data sources containing volatile data. A different approach is therefore needed. We propose that semantically similar data are harmonized when extracting data from XML-based data sources. We introduce a constructor algebra, which is a powerful tool in the harmonization of XML data. This algebra can form a unique relational representation, called an XML relation, for any XML data source. We demonstrate that the XML relation representation supports the grouping and aggregation of data needed, for example, in OLAP-style (online analytical processing) applications.
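
As a rough illustration of the idea of giving XML data a relational shape that supports grouping and aggregation, the following minimal Python sketch flattens XML records into rows and sums a value per group. It is not the constructor algebra described in the abstract; the sample data and the helper names `xml_to_rows` and `sum_by` are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data; not taken from the cited article.
SALES_XML = """
<sales>
  <sale><region>North</region><product>A</product><amount>10</amount></sale>
  <sale><region>North</region><product>B</product><amount>5</amount></sale>
  <sale><region>South</region><product>A</product><amount>7</amount></sale>
</sales>
"""


def xml_to_rows(xml_text, record_tag):
    """Flatten each <record_tag> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(record_tag):
        rows.append({child.tag: (child.text or "").strip() for child in record})
    return rows


def sum_by(rows, key, value):
    """Group rows by `key` and sum the numeric column `value`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row[value])
    return dict(totals)


rows = xml_to_rows(SALES_XML, "sale")
totals = sum_by(rows, "region", "amount")
print(totals)  # {'North': 15.0, 'South': 7.0}
```

This sketch assumes flat records with one level of child elements; nested or irregular XML would need a richer mapping than this one.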


2007 ◽  
pp. 199-219
Author(s):  
Angelo Brayner ◽  
Marcelo Meireles ◽  
José de Aguiar Moraes Filho

Integrating data sources published on the Web requires an integration strategy that guarantees the autonomy of local data sources. The multidatabase system (MDBS) has been consolidated as an approach to integrating multiple heterogeneous and distributed data sources in flexible and dynamic environments such as the Web. A key property of MDBSs is to guarantee a higher degree of local autonomy. In order to adopt the MDBS strategy, it is necessary to use a query language, called a multidatabase language (MDL), which provides the necessary constructs for jointly manipulating and accessing data in heterogeneous data sources. In other words, the MDL is responsible for solving integration conflicts. This chapter describes an extension to the XQuery language, called MXQuery, which supports queries over several data sources and solves such integration problems as semantic heterogeneity and incomplete information.


2015 ◽  
Vol 54 (01) ◽  
pp. 41-44 ◽  
Author(s):  
A. Taweel ◽  
S. Miles ◽  
B. C. Delaney ◽  
R. Bache

Summary

Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on “Managing Interoperability and Complexity in Health Systems”.

Objectives: The increasing availability of electronic clinical data provides great potential for finding eligible patients for clinical research. However, data heterogeneity makes it difficult for clinical researchers to interrogate sources consistently. Existing standard query languages are often not sufficient to query across diverse representations. Thus, a higher-level domain language is needed so that queries become data-representation agnostic. To this end, we define a clinician-readable computational language for querying whether patients meet eligibility criteria (ECs) from clinical trials. This language is capable of implementing the temporal semantics required by many ECs, and can be automatically evaluated on heterogeneous data sources.

Methods: By reference to standards and examples of existing ECs, a clinician-readable query language was developed. Using a model-based approach, it was implemented to transform captured ECs into queries that interrogate heterogeneous data warehouses. The query language was evaluated on two types of data sources, each different in structure and content.

Results: The query language abstracts the level of expressivity so that researchers construct their ECs with no prior knowledge of the data sources. It was evaluated on two types of semantically and structurally diverse data warehouses. This query language is now used to express ECs in the EHR4CR project. A survey shows that it was perceived by the majority of users to be useful, easy to understand and unambiguous.

Discussion: An EC-specific language enables clinical researchers to express their ECs as a query such that the user is isolated from complexities of different heterogeneous clinical data sets. More generally, the approach demonstrates that a domain query language has potential for overcoming the problems of semantic interoperability and is applicable where the nature of the queries is well understood and the data is conceptually similar but in different representations.

Conclusions: Our language provides a strong basis for use across different clinical domains for expressing ECs by overcoming the heterogeneous nature of electronic clinical data whilst maintaining semantic consistency. It is readily comprehensible by target users. This demonstrates that a domain query language can be both usable and interoperable.
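
To make the notion of a data-representation-agnostic eligibility criterion with temporal semantics more concrete, here is a minimal Python sketch. It is not the query language described in the abstract; the two source layouts, the adapter functions, the `eligible` helper, and the code string "E11" are all hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta

# Two hypothetical sources that store the same kind of fact in different shapes.
SOURCE_A = [{"patient": "p1", "code": "E11", "recorded": "2014-03-01"}]
SOURCE_B = [{"pid": "p2", "diagnoses": [("E11", date(2010, 1, 15))]}]


def events_from_a(rows):
    """Adapter: yield (patient_id, code, date) from the flat layout."""
    for r in rows:
        yield r["patient"], r["code"], date.fromisoformat(r["recorded"])


def events_from_b(rows):
    """Adapter: yield (patient_id, code, date) from the nested layout."""
    for r in rows:
        for code, when in r["diagnoses"]:
            yield r["pid"], code, when


def eligible(events, code, within_days, as_of):
    """Patients with `code` recorded in the `within_days` before `as_of`."""
    earliest = as_of - timedelta(days=within_days)
    return sorted({pid for pid, c, when in events
                   if c == code and earliest <= when <= as_of})


as_of = date(2014, 6, 1)
all_events = list(events_from_a(SOURCE_A)) + list(events_from_b(SOURCE_B))
matches = eligible(all_events, "E11", 365, as_of)
print(matches)  # ['p1']
```

The criterion is written once against a common event shape, and each source contributes only a small adapter, which is one simple way to keep the criterion itself independent of how the data is stored.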


Author(s):  
Silvana Castano ◽  
Valeria De Antonellis ◽  
Sabrina De Capitani di Vimercati ◽  
Michele Melchiori

In recent years, most enterprises have begun to use the Web for cooperative work in order to improve efficiency and information interchange. As a consequence, enterprise information systems are being migrated onto the Web, and methods and tools are required to effectively access data provided on the Web in different formats by autonomous heterogeneous data sources. In particular, integration tools are required to obtain a uniform data representation by abstracting from the formats of the original data sources, and thus to build a global information space suitable for query and access interfaces. This chapter discusses the characteristics of data schema integration in web-enabled systems and describes a comprehensive integration scheme for organizing heterogeneous information sources over the Web, in order to enhance the capability of information interchange and interoperation among web-enabled systems.


Author(s):  
William Gardner ◽  
R. Rajugan

As many enterprise and industrial content management techniques move towards a distributed model, the need to exchange data between heterogeneous data sources in a seamless fashion is constantly increasing. These heterogeneous data sources could arise from server groups from different manufacturers or from databases at different sites with their own schemas. Since its introduction in 1996, eXtensible Markup Language (XML) (W3C-XML, 2004) has established itself as the open, presentation-independent data representation and exchange medium. XML provides a mechanism for seamless data exchange in many industrial informatics settings. In addition, XML is emerging as the dominant standard for storing, describing, representing, and interchanging data among various enterprise systems and databases in the context of complex Web enterprise information systems (EIS).


Author(s):  
J. F. Aldana Montes ◽  
A. C. Gómez Lora ◽  
N. Moreno Vergara ◽  
I. Navas Delgado ◽  
M. M. Roldán Garcia

The database community has been seriously disrupted by the expansion of Web technologies. In particular, two reports have caused a special commotion in the database field. The first, the Asilomar report (Bernstein et al., 1998), sets out new directions for database research, anticipating the impact of the Web on the field. The second, Breaking Out of the Box (Silberschatz & Zdonik, 1996), proposes how the database community must transfer its technology into Web technology. In this sense, the database box must be broken up into its autonomous functional components, which must then be used to reach a solution to the problem of integrating heterogeneous data sources.

