Towards Semistructured Data Integration

Author(s):  
Mengchi Liu ◽  
Tok Wang Ling

With the recent popularity of the World Wide Web, an enormous amount of heterogeneous information is now available online. As a result, information about the same real-world object is often spread over different data sources and may be partial and inconsistent. Obtaining information that is as complete as possible, and detecting inconsistencies among these sources, is thus a challenge. Previous work that uses a simple graph-based or tree-based data model to represent heterogeneous data from various sites fails to provide a proper foundation for integrating data with partial and inconsistent information. To integrate such data, we need a data model more expressive than the existing graph-based and tree-based ones, one that accounts for the existence of partial and inconsistent information from different data sources. In this chapter, we propose a novel data model for such data and study how to integrate data spread across various sources while checking consistency at the same time. We propose a new operator, called integration, for this purpose and discuss its semantic properties.
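The idea of an integration operator that merges partial records while detecting conflicts can be illustrated with a minimal sketch. This is an assumed, simplified rendering, not the authors' actual data model: records are flat attribute dictionaries, and any attribute reported with different values by different sources is flagged as inconsistent.

```python
# Illustrative sketch of integrating partial records about the same
# real-world object. Attribute names and values are hypothetical.

def integrate(records):
    """Merge partial records; return (merged, conflicts).

    merged keeps the first value seen per attribute; conflicts maps
    each disputed attribute to the set of values observed for it.
    """
    merged, conflicts = {}, {}
    for rec in records:
        for attr, value in rec.items():
            if attr not in merged:
                merged[attr] = value
            elif merged[attr] != value:
                conflicts.setdefault(attr, {merged[attr]}).add(value)
    return merged, conflicts

# Two sources describe the same person: partial and inconsistent.
src1 = {"name": "J. Smith", "city": "Ottawa"}
src2 = {"name": "J. Smith", "city": "Toronto", "age": 42}
merged, conflicts = integrate([src1, src2])
# The merged record is more complete than either source, and the
# disagreement on "city" is surfaced rather than silently resolved.
```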

Author(s):  
Samir Mohammad ◽  
Patrick Martin

Extensible Markup Language (XML), which provides a flexible way to define semistructured data, is the de facto standard for information exchange on the World Wide Web. The trend towards storing data in XML format has led to rapid growth in XML databases and in the need to query them. Indexing plays a key role in improving query execution. In this chapter the authors give a brief history of the creation and development of the XML data model. They discuss the three main categories of indexes proposed in the literature for the XML semistructured data model and evaluate the indexing schemes within these categories. Finally, they discuss limitations and open problems related to the major existing indexing schemes.
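One classic XML indexing idea, a path index that maps each root-to-element label path to the matching elements (in the spirit of structural summaries such as the DataGuide), can be sketched briefly. The document and paths below are assumptions for illustration, not taken from the chapter:

```python
# Minimal sketch of a path index over an XML tree: every label path
# from the root is mapped to the list of elements reachable by it.

import xml.etree.ElementTree as ET

def build_path_index(root):
    index = {}
    def walk(elem, path):
        path = path + "/" + elem.tag
        index.setdefault(path, []).append(elem)
        for child in elem:
            walk(child, path)
    walk(root, "")
    return index

doc = ET.fromstring("<lib><book><title>XML</title></book><book/></lib>")
index = build_path_index(doc)
# A path query such as /lib/book becomes a single dictionary lookup
# instead of a traversal of the whole tree.
```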


2011 ◽  
pp. 277-297 ◽  
Author(s):  
Carlo Combi ◽  
Barbara Oliboni

This chapter describes a graph-based approach to representing information stored in a data warehouse by means of a temporal semistructured data model. We consider issues related to the representation of semistructured data warehouses and discuss the set of constraints needed to correctly manage warehouse time, i.e., the time dimension considered when storing data in the data warehouse itself. We use a temporal semistructured data model because a data warehouse can contain data coming from different and heterogeneous data sources. This means that the data stored in a data warehouse are semistructured in nature: in different documents the same information can be represented in different ways, and the document schemata may or may not be available. Moreover, the information stored in a data warehouse is often time-varying; thus, as for semistructured data in general, it is useful to consider time in the data warehouse context as well.
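A temporal constraint of the kind such a model must enforce can be sketched as follows. This is a hypothetical simplification, not the chapter's actual constraint set: edges of the warehouse graph carry validity intervals, and a nested edge's interval must be contained in its parent edge's interval.

```python
# Illustrative sketch of a temporal containment constraint on a
# graph-based warehouse model. Edge names and years are assumptions.

def contained(child, parent):
    """True if the child interval lies within the parent interval."""
    return parent[0] <= child[0] and child[1] <= parent[1]

# Edges of a small warehouse graph, each with a validity interval.
edges = {
    ("warehouse", "sales2010"): (2010, 2012),
    ("sales2010", "q1"): (2010, 2010),
}
ok = contained(edges[("sales2010", "q1")],
               edges[("warehouse", "sales2010")])
# A violation (e.g. a child valid before its parent) would indicate
# an inconsistency in the warehouse time dimension.
```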


Data Mining ◽  
2011 ◽  
pp. 437-452 ◽  
Author(s):  
Jeffrey Hsu

Every day, enormous amounts of information are generated from all sectors, whether business, education, the scientific community, the World Wide Web (WWW), or one of many readily available offline and online data sources. From all of this, which represents a sizable repository of data and information, it is possible to generate worthwhile and usable knowledge. As a result, the field of Data Mining (DM) and knowledge discovery in databases (KDD) has grown by leaps and bounds and has shown great potential for the future (Han & Kamber, 2001). The purpose of this chapter is to survey many of the critical and future trends in the field of DM, with a focus on those thought to have the most promise and applicability to future DM applications.


Author(s):  
Bethany Aram ◽  
Aurelio López Fernández ◽  
Daniel Muñiz Amian

Abstract This article presents a relational database capable of integrating data from a variety of types of written sources as well as material remains. In response to historical research questions, information from such diverse sources as documentary, bioanthropological, isotopic, and DNA analyses has been assessed, homogenized, and situated in time and space. Multidisciplinary ontologies offer complementary and integrated perspectives regarding persons and goods. While responding to specific research questions about the impact of globalization on the isthmus of Panama during the sixteenth and seventeenth centuries, the data model and user interface promote the ongoing interrogation of diverse information about complex, changing societies. To this end, the designed application makes it possible to search, consult, and download data that researchers have contributed from anywhere in the world.


2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Shihan Yang ◽  
Hongyan Tan ◽  
Jinzhao Wu

Semantic collisions are inevitable when building a domain ontology from heterogeneous data sources (semi-)automatically. Semantic consistency is therefore an indispensable precondition for building a correct ontology. In this paper, a model-checking-based method is proposed to handle the semantic consistency problem within a middle-model methodology that can extract a domain ontology from structured and semistructured data sources semiautomatically. The method translates the middle model into a Kripke structure and the consistency assertions into CTL formulae, so that the consistency checking problem is reduced to global model checking. Moreover, the feasibility and correctness of the transformation are proved, and case studies are provided.
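The global model-checking step can be illustrated with a minimal sketch, assumed rather than taken from the paper: a Kripke structure as a labelled transition graph, and a least-fixpoint computation of the states satisfying the CTL formula EF p (some path eventually reaches a state labelled p).

```python
# Minimal sketch of global CTL model checking on a Kripke structure.
# State names, transitions, and the "inconsistent" label are hypothetical.

def ef(states, trans, labels, prop):
    """Compute the set of states satisfying EF prop by fixpoint:
    start from states labelled prop, then repeatedly add any state
    with a successor already in the set, until nothing changes."""
    sat = {s for s in states if prop in labels.get(s, set())}
    changed = True
    while changed:
        changed = False
        for s in states:
            if s not in sat and any(t in sat for t in trans.get(s, [])):
                sat.add(s)
                changed = True
    return sat

states = {"s0", "s1", "s2"}
trans = {"s0": ["s1"], "s1": ["s2"], "s2": []}
labels = {"s2": {"inconsistent"}}
# If the initial state satisfies EF inconsistent, a semantic collision
# is reachable and the extracted ontology fails the consistency check.
```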


Author(s):  
Lihua Lu ◽  
Hengzhen Zhang ◽  
Xiao-Zhi Gao

Purpose – Data integration combines data residing at different sources and provides users with a unified interface to these data. An important issue in data integration is the existence of conflicts among the different data sources. Data sources may conflict with each other at the data level, which is defined as data inconsistency. The purpose of this paper is to address this problem and propose a solution for data inconsistency in data integration. Design/methodology/approach – A relational data model extended with data source quality criteria is first defined. Based on the proposed data model, a data inconsistency solution strategy is then provided. To implement the strategy, a fuzzy multi-attribute decision-making (MADM) approach based on data source quality criteria is applied to obtain the results. Finally, user feedback strategies are proposed to optimize the result of the fuzzy MADM approach into the final resolution of the data inconsistency. Findings – To evaluate the proposed method, data obtained from sensors are extracted. Experiments are designed and performed to demonstrate the effectiveness of the proposed strategy. The results substantiate that the solution performs better than other methods on correctness, time cost, and stability indicators. Practical implications – Since inconsistent data collected from sensors are pervasive, the proposed method can solve this problem and correct wrong choices to some extent. Originality/value – In this paper, the authors study for the first time the effect of user feedback on integration results for inconsistent data.
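The core of a quality-criteria-based resolution can be sketched with a simple weighted multi-attribute score: each conflicting candidate value is rated against source quality criteria, and the highest-scoring value wins. The paper's fuzzy MADM approach is considerably more elaborate; the criteria names, weights, and sensor readings below are assumptions for illustration only.

```python
# Hypothetical sketch of resolving an inconsistent attribute by scoring
# candidate values against data source quality criteria.

def pick_value(candidates, weights):
    """candidates: value -> {criterion: score in [0, 1]}.
    Return the candidate with the highest weighted score."""
    def score(crits):
        return sum(weights[c] * crits.get(c, 0.0) for c in weights)
    return max(candidates, key=lambda v: score(candidates[v]))

# Two sensors report conflicting temperature readings; their sources
# are rated on (assumed) quality criteria.
weights = {"accuracy": 0.5, "freshness": 0.3, "completeness": 0.2}
candidates = {
    "23.5": {"accuracy": 0.9, "freshness": 0.6, "completeness": 1.0},
    "25.1": {"accuracy": 0.4, "freshness": 0.9, "completeness": 1.0},
}
```

In the paper's setting, user feedback would then adjust such weights or override the chosen value when the automatic resolution turns out to be wrong.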


Author(s):  
Ivan Bojicic ◽  
Zoran Marjanovic ◽  
Nina Turajlic ◽  
Marko Petrovic ◽  
Milica Vuckovic ◽  
...  

In order for a data warehouse to adequately fulfill its integrative and historical purpose, its data model must enable the appropriate and consistent representation of the different states of a system. In effect, a DW data model, representing the physical structure of the DW, must be general enough to consume data from heterogeneous data sources and reconcile the semantic differences of the data source models, and, at the same time, be resilient to constant changes in the structure of the data sources. One of the main problems in DW development is the absence of a standardized DW data model. In this paper a comparative analysis of the four most prominent DW data models (namely the relational/normalized model, the data vault model, the anchor model, and the dimensional model) is given. On the basis of the results of [1], a new DW data model (the Domain/Mapping model, DMM), which more adequately fulfills the posed requirements, is presented.


Author(s):  
José A. Alonso-Jiménez ◽  
Joaquín Borrego-Díaz ◽  
Antonia M. Chávez-González

Nowadays, data management on the World Wide Web must deal with very large knowledge databases (KDBs). The larger a KDB is, the smaller the possibility that it is consistent. Consistency-checking algorithms and systems fail to analyse very large KDBs, and so many users have to work every day with inconsistent information.

