Duplicate Record Detection for Data Integration

In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records drawn from data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of that matching. Building on this, the basic idea in this chapter is to estimate the range of the similarity between two records and to decide whether they are duplicates according to that estimate. When integration is performed on XML data, the flexibility of XML introduces many additional problems; one current approach is to use data exchange to carry out the operations above. Beyond data integrity and reliability, this chapter also proposes the concept of quality assurance mechanisms.
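
As a rough illustration of the attribute-level matching step, the sketch below scores two records by pairing their attributes with an optimal bipartite matching (Hungarian algorithm via SciPy) and summing the pair similarities. The attribute similarity function, field names, and normalisation are assumptions made for this example, not the chapter's actual measure.

```python
# Minimal sketch: record similarity as the weight of an optimal bipartite
# matching between the attributes of two records. The attribute similarity
# (character-level Jaccard) and the normalisation are illustrative choices.
from itertools import product

import numpy as np
from scipy.optimize import linear_sum_assignment


def attr_sim(a: str, b: str) -> float:
    """Character-level Jaccard similarity between two attribute values."""
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


def record_similarity(rec1: dict, rec2: dict) -> float:
    """Weight of an optimal matching between the two records' attribute values."""
    vals1, vals2 = list(rec1.values()), list(rec2.values())
    cost = np.zeros((len(vals1), len(vals2)))
    for (i, a), (j, b) in product(enumerate(vals1), enumerate(vals2)):
        cost[i, j] = -attr_sim(a, b)          # negate: the solver minimises cost
    rows, cols = linear_sum_assignment(cost)
    return -cost[rows, cols].sum() / max(len(vals1), len(vals2))


r1 = {"name": "J. Smith", "city": "Boston", "phone": "555-0100"}
r2 = {"full_name": "John Smith", "town": "Boston", "tel": "555-0100"}
print(record_similarity(r1, r2))              # close to 1.0 for likely duplicates
```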

2015
Vol 11 (3)
pp. 370-396
Author(s):
Tuan-Dat Trinh
Peter Wetz
Ba-Lam Do
Elmar Kiesling
A Min Tjoa

Purpose – This paper aims to present a collaborative mashup platform for the dynamic integration of heterogeneous data sources. The platform encourages sharing and connects data publishers, integrators, developers and end users. Design/methodology/approach – The approach is based on a visual programming paradigm and follows three fundamental principles: openness, connectedness and reusability. The platform builds on semantic Web technologies and the concept of linked widgets, i.e. semantic modules that allow users to access, integrate and visualize data in a creative and collaborative manner. Findings – The platform can effectively tackle data integration challenges by allowing users to explore relevant data sources for different contexts, by addressing the data heterogeneity problem and facilitating automatic data integration, by easing integration through simple operations, and by fostering the reusability of data processing tasks. Research limitations/implications – This research has so far focused exclusively on conceptual and technical aspects; a comprehensive user study and extensive performance and scalability testing are left for future work. Originality/value – A key contribution of this paper is the concept of distributed mashups. These ad hoc data integration applications allow users to perform data processing tasks collaboratively and in a distributed manner, simultaneously on multiple devices. The approach requires no server infrastructure to upload data to; instead, each user keeps control over their data and exposes only relevant subsets. Distributed mashups can run persistently in the background and are hence ideal for real-time data monitoring or data streaming use cases. Furthermore, we introduce automatic mashup composition as an innovative approach based on an explicit semantic widget model.
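
For illustration only, a toy sketch of the idea behind automatic mashup composition: widgets declare semantic input and output types, and a composer wires an output to any input it can satisfy. The class names, composition strategy, and semantic types are hypothetical and are not the platform's actual linked-widget model.

```python
# Toy sketch of automatic composition over an explicit semantic widget model.
# Widgets expose semantic input/output types; compatible ports are wired up.
from dataclasses import dataclass, field


@dataclass
class Widget:
    name: str
    inputs: set = field(default_factory=set)    # semantic types consumed
    outputs: set = field(default_factory=set)   # semantic types produced


def compose(widgets):
    """Wire every output to every input it can satisfy (hypothetical strategy)."""
    wires = []
    for producer in widgets:
        for consumer in widgets:
            shared = producer.outputs & consumer.inputs
            if producer is not consumer and shared:
                wires.append((producer.name, consumer.name, sorted(shared)))
    return wires


widgets = [
    Widget("AirQualitySource", outputs={"ex:AirQualityObservation"}),
    Widget("MapVisualizer", inputs={"ex:AirQualityObservation"}),
]
print(compose(widgets))   # [('AirQualitySource', 'MapVisualizer', ['ex:AirQualityObservation'])]
```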


2018
Vol 3 (2)
pp. 162
Author(s):
Slamet Sudaryanto Nurhendratno
Sudaryanto Sudaryanto

Data integration is an essential step in combining information from multiple sources. The core problem is how to find and optimally combine data from scattered, heterogeneous sources whose contents are semantically interrelated. The heterogeneity of data sources results from several factors, including databases stored in different formats, different software and hardware for the storage systems, and different semantic data models (Katsis & Papakonstantinou, 2009; Ziegler & Dittrich, 2004). Two approaches to data integration are currently used, Global as View (GAV) and Local as View (LAV); each has its own advantages and limitations, so careful analysis is needed before applying either one. A major factor in integrating heterogeneous sources efficiently and effectively is understanding the type and structure of the source data (the source schema); another is the kind of view the integration should produce (the target schema). The result of the integration can be presented as a single global view or as a variety of other views, and the approach for structured sources differs from that for unstructured or semi-structured sources. A schema mapping is a declarative specification that describes the relationship between a source schema and a target schema; it is expressed as logical formulas that support data interoperability, data exchange, and data integration. In this paper, establishing a data center for patient referrals requires integrating data originating from a number of different health facilities, so a schema mapping system must be designed (to support optimization). The data center acts as the target schema, while the various referral service units act as source schemas whose data are structured and independent. The structured data sources can therefore be integrated into a single view (the data center) through equivalent query rewriting. The data center, as the global schema, requires a "mediator" that maintains the global schema and the mappings between the global and local schemas. Because the GAV-style data center tends to be a single, unified view, an integration facility is needed to make the integration process with the various source schemas effective. This "Pemadu" (integrator) facility is a declarative mapping language that links each of the source schemas to the data center. Equivalent query rewriting is therefore well suited to query optimization and to maintaining physical data independence.

Keywords: Global as View (GAV), Local as View (LAV), source schema, mapping schema
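
To make the GAV idea concrete, the sketch below defines a data-center relation as a view over two hypothetical source schemas and lets the database engine unfold a global-schema query into an equivalent query over the sources. The table names, columns, and rows are invented for illustration and are not the paper's actual schemas.

```python
# Illustrative GAV sketch: the global (data-center) relation is defined as a
# view over two local source schemas; a query on the global schema is answered
# by unfolding the view, i.e. rewriting it into an equivalent source query.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- local source schemas (two hypothetical referring health facilities)
    CREATE TABLE clinic_a_referrals (patient_id TEXT, diagnosis TEXT, ref_date TEXT);
    CREATE TABLE clinic_b_rujukan   (id_pasien  TEXT, icd_code  TEXT, tanggal  TEXT);

    -- GAV mapping: the global relation is a view over the sources
    CREATE VIEW dc_referrals(patient_id, diagnosis, ref_date) AS
        SELECT patient_id, diagnosis, ref_date FROM clinic_a_referrals
        UNION ALL
        SELECT id_pasien,  icd_code,  tanggal  FROM clinic_b_rujukan;

    INSERT INTO clinic_a_referrals VALUES ('P-001', 'J45', '2018-02-01');
    INSERT INTO clinic_b_rujukan   VALUES ('P-002', 'E11', '2018-02-03');
""")

# A query over the global schema; the engine unfolds the view definition.
for row in con.execute("SELECT * FROM dc_referrals ORDER BY ref_date"):
    print(row)
```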


Author(s):  
Juan M. Gómez
Ricardo Colomo
Marcos Ruano
Ángel García

Technological advances in high-throughput techniques and efficient data gathering methods, coupled with computational biology efforts, have resulted in a vast amount of life science data, often held in distributed and heterogeneous repositories. These repositories contain information such as sequence and structure data, annotations for biological data, results of complex computations, genetic sequences and multiple bio-datasets. The heterogeneity of these data has created a need for research into resource integration and platform-independent processing of investigative queries involving heterogeneous data sources. When processing huge amounts of data, information integration is one of the most critical issues, because it is crucial to preserve the intrinsic semantics of all the merged data sources. Such integration would allow data to be organized properly, fostering analysis and access to the information needed to accomplish critical tasks, such as processing micro-array data to study protein function and carrying out detailed studies of protein structures in medical research to facilitate drug design (Ignacimuthu, 2005). Furthermore, the DNA micro-array research community urgently requires technology that allows up-to-date micro-array data to be found, accessed and delivered in a secure framework (Sinnot, 2007). Several research disciplines in which information integration is critical, such as Bioinformatics, could benefit from harnessing the potential of a new approach: the Semantic Web (SW). The SW term was coined by Berners-Lee, Hendler and Lassila (2001) to describe the evolution of a Web that consisted largely of documents for humans to read towards a new paradigm that includes data and information for computers to manipulate. The SW is about adding machine-understandable and machine-processable metadata to Web resources through its key enabling technology: ontologies (Fensel, 2002). Ontologies are formal, explicit and shared specifications of a conceptualization. The SW was conceived as a way to address the need for data integration on the Web. This article expounds SAMIDI, a Semantics-based Architecture for Micro-array Information and Data Integration. The most remarkable innovation offered by SAMIDI is the use of semantics as a tool for reconciling different vocabularies and terminologies and thereby fostering integration. SAMIDI comprises a methodology for unifying heterogeneous data sources, starting from an analysis of the requirements of the unified data set, and a software architecture.
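
As a minimal sketch of the kind of machine-processable metadata the SW approach relies on, the snippet below annotates a micro-array experiment with RDF using rdflib; the namespace and terms are hypothetical placeholders rather than SAMIDI's actual vocabulary.

```python
# Minimal sketch: ontology-style annotation of a micro-array experiment in RDF.
# The ex: namespace and its terms are hypothetical, not SAMIDI's vocabulary.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/microarray#")

g = Graph()
g.bind("ex", EX)

# Machine-processable metadata describing one experiment.
g.add((EX.exp001, RDF.type, EX.MicroarrayExperiment))
g.add((EX.exp001, RDFS.label, Literal("TP53 expression under hypoxia")))
g.add((EX.exp001, EX.studiesGene, EX.TP53))
g.add((EX.exp001, EX.usesPlatform, Literal("Affymetrix HG-U133A")))

print(g.serialize(format="turtle"))
```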


Author(s):  
Seán O’Riain
Andreas Harth
Edward Curry

With increased dependence on the efficient use and inclusion of diverse corporate and Web-based data sources for business information analysis, financial information providers will increasingly need agile information integration capabilities. Linked Data is a set of technologies and best practices that provides this level of agility for information integration, access, and use. Current approaches struggle to include multiple data sources in near real time and have looked to Semantic Web technologies for assistance with infrastructure access and with handling multiple data formats and their vocabularies. This chapter discusses the challenges of financial data integration, provides the component architecture of Web-enabled financial data integration, and outlines the emergence of a financial ecosystem based on the use of existing Web standards. An introduction to Semantic Web technologies is given, supported by insight and discussion gathered from multiple financial services use case implementations. Finally, best practices for integrating Web data based on the Linked Data principles, together with emergent areas, are described.
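
A hedged sketch of the Linked Data integration pattern the chapter describes: RDF published by two providers is fetched into one graph and queried with SPARQL across both. The URLs and the ex: vocabulary are placeholders, not any real provider's endpoints.

```python
# Sketch of Linked Data style integration: merge RDF published by two
# (hypothetical) financial data providers and query across both sources.
from rdflib import Graph

g = Graph()
for source in (
    "https://provider-a.example.com/companies.ttl",
    "https://provider-b.example.com/filings.ttl",
):
    g.parse(source, format="turtle")   # fetch and merge each dataset

# One SPARQL query now spans both sources, joined on shared URIs.
query = """
PREFIX ex: <http://example.org/fin#>
SELECT ?company ?revenue WHERE {
    ?company ex:ticker  ?ticker .
    ?company ex:revenue ?revenue .
}
"""
for company, revenue in g.query(query):
    print(company, revenue)
```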


2011
Vol 268-270
pp. 2127-2132
Author(s):  
Wu Bin Ma
Zhi Yong Tang
Su Deng
Hong Bin Huang

Data exchange is one of the key problems of information integration. The most important issue in the data exchange problem is how to solve it in the presence of cyclic dependencies. In this paper, we propose an improved chase approach for a specific application setting of data exchange. By amending the dependency conditions appropriately, the approach finds an approximate solution to the data exchange problem in polynomial time.
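
For orientation, here is a simplified sketch of the standard chase that data exchange builds on (not the paper's improved variant): source-to-target dependencies are fired until all are satisfied, with a step cap standing in for the cycle handling the paper addresses. The relation names and rule encoding are invented for the example.

```python
# Simplified chase sketch: apply tuple-generating dependencies (tgds) until
# none is violated; fresh labelled nulls fill existential positions. A step
# limit guards against non-termination under cyclic dependencies.
from itertools import count

null_ids = count(1)


def fresh_null() -> str:
    return f"N{next(null_ids)}"


def chase(instance, tgds, max_steps=1000):
    """instance: dict relation -> set of tuples.
    Each tgd: (premise_rel, conclusion_rel, export, pad) where export(t) gives
    the carried-over constants and pad is the number of existential positions.
    """
    for _ in range(max_steps):
        changed = False
        for prem, concl, export, pad in tgds:
            target = instance.setdefault(concl, set())
            for t in list(instance.get(prem, set())):
                head = export(t)
                # fire only if no existing tuple already matches the exported values
                if not any(u[:len(head)] == head for u in target):
                    target.add(head + tuple(fresh_null() for _ in range(pad)))
                    changed = True
        if not changed:
            return instance              # chase terminated: all tgds satisfied
    raise RuntimeError("step limit reached; dependencies may be cyclic")


# tgd: Employee(name, dept) -> exists m. Dept(dept, m)
tgds = [("Employee", "Dept", lambda t: (t[1],), 1)]
src = {"Employee": {("alice", "sales"), ("bob", "sales")}}
print(chase(src, tgds))   # Dept gains a single tuple ('sales', 'N1')
```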


Author(s):  
И.В. Бычков
Г.М. Ружников
В.В. Парамонов
А.С. Шумилов
Р.К. Фёдоров

An infrastructural approach to spatial data processing for territorial development management tasks is considered; it is based on the service-oriented paradigm, OGC standards, web technologies, WPS services and a geoportal. The development of territories is a multi-dimensional and multi-aspect process characterized by large volumes of financial, natural-resource, social, ecological and economic data. The data are highly localized and uncoordinated, which limits their comprehensive analysis and use. One method of processing large volumes of data is an information-analytical environment. The architecture and implementation of an information-analytical environment for territorial development in the form of a Geoportal are presented. The Geoportal provides its users with software instruments for spatial and thematic data exchange, as well as OGC-based distributed services for data processing. Implementing data processing and storage as services located on distributed servers simplifies their updating and maintenance; it also enables publishing and makes processing a more open and controlled process. The Geoportal consists of the following modules: the content management system Calipso (user interface presentation, user management, data visualization), the RDBMS PostgreSQL with a spatial data processing extension, services for entering and editing relational data, a subsystem for launching and executing WPS services, and spatial data processing services deployed in a local cloud environment. The article argues for the necessity of the infrastructural approach when creating an information-analytical environment for territory management, which is characterized by large volumes of spatial and thematic data, stored in various formats, that must be processed, and by the application of the service-oriented paradigm, OGC standards, web technologies, the Geoportal and distributed WPS services. The developed software system was tested on a number of tasks that arise during territorial development.
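
A hedged sketch of invoking a distributed WPS processing service over HTTP using OGC WPS 1.0.0 KVP encoding. The endpoint URL, process identifier, and input names are hypothetical placeholders, not the Geoportal's actual services.

```python
# Sketch: call a WPS Execute operation via KVP-encoded HTTP GET.
import requests

WPS_ENDPOINT = "https://geoportal.example.org/wps"   # hypothetical endpoint

params = {
    "service": "WPS",
    "version": "1.0.0",
    "request": "Execute",
    "identifier": "buffer_zones",                     # hypothetical process
    "datainputs": "layer=land_parcels;distance=500",  # hypothetical inputs
}

response = requests.get(WPS_ENDPOINT, params=params, timeout=60)
response.raise_for_status()
print(response.text[:500])   # the service returns an XML ExecuteResponse document
```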


2021
Vol 2 (3)
pp. 59
Author(s):  
Susanti Krismon
Syukri Iska

This article discusses the implementation of wages in agriculture in Nagari Bukit Kandung, X Koto Diatas Subdistrict, Solok Regency, reviewed from the perspective of muamalah fiqh. The research is field research. The data sources consist of primary data, obtained from 8 farmers and 4 farm labourers, and secondary data, obtained from documents related to this research in the form of the Nagari Bukit Kandung Profile, which provide additional information to strengthen the primary data. The data collection techniques used are observation, interviews and documentation, and the data are processed qualitatively. Based on the results of this study, the practice of agricultural wages in Nagari Bukit Kandung, X Koto Diatas Subdistrict, Solok Regency is that farm labourers ask for their wages to be paid in advance, before they carry out their work, without any prior agreement that wages would be paid up front. Because the wages are paid at the outset, many labourers do not work as the farmers expect, and some do not complete the work on time. According to the review of muamalah fiqh, this implementation of agricultural wages in Nagari Bukit Kandung is not permitted, because the contract contains an element of gharar and one party, the owner of the fields, is disadvantaged.


2014
Vol 23 (01)
pp. 27-35
Author(s):
S. de Lusignan
S-T. Liaw
C. Kuziemsky
F. Mold
P. Krause
...  

Summary Background: Generally, the benefits and risks of vaccines can be determined from studies carried out as part of regulatory compliance, followed by surveillance of routine data; however, some rarer and longer-term events require new methods. Big data generated by increasingly affordable personalised computing and by pervasive computing devices is growing rapidly, and low-cost, high-volume cloud computing makes the processing of these data inexpensive. Objective: To describe how big data and related analytical methods might be applied to assess the benefits and risks of vaccines. Method: We reviewed the literature on the use of big data to improve health, applied to generic vaccine use cases that illustrate the benefits and risks of vaccination. We defined a use case as the interaction between a user and an information system to achieve a goal. We used flu vaccination and pre-school childhood immunisation as exemplars. Results: We reviewed three big data use cases relevant to assessing vaccine benefits and risks: (i) big data processing using crowd-sourcing, distributed big data processing, and predictive analytics; (ii) data integration from heterogeneous big data sources, e.g. the increasing range of devices in the “internet of things”; and (iii) real-time monitoring, for the direct monitoring of epidemics as well as vaccine effects via social media and other data sources. Conclusions: Big data raises new ethical dilemmas, though its analysis methods can bring complementary real-time capabilities for monitoring epidemics and assessing the vaccine benefit-risk balance.

