An ontology-based documentation of data discovery and integration process in cancer outcomes research

Abstract Background To reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes, has primarily focused on individual level RFs due to the lack of integrated datasets that contain multi-level, multi-domain RFs. Further, the lack of a consensus and proper guidance on systematically identify RFs also increase the difficulty of RF selection from heterogenous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, as mIDA studies require integrating heterogenous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, and thus threatening transparency and reproducibility. Methods Informed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide the RF variable selection, data source selection, and data integration process. Then, we developed an ontology to standardize the documentation of the RF selection and data integration process in mIDA studies. Results We summarized the review results and created a reporting guideline—ATTEST—for reporting the variable selection and data source selection and integration process. We provided an ATTEST check list to help researchers to annotate and clearly document each step of their mIDA studies to ensure the transparency and reproducibility. We used the ATTEST to report two mIDA case studies and further transformed annotation results into sematic triples, so that the relationships among variables, data sources and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST. Conclusion Our ontology-based reporting guideline solves some key challenges in current mIDA studies for cancer outcomes research, through providing (1) a theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) a standardized documentation of the data selection and integration processes powered by an ontology, thus a way to enable sharing of mIDA study reports among researchers.

Download Full-text

An Ontology-based Approach to Guide and Document Variable and Data Source Selection and Data Integration Process to Support Integrative Data Analysis in Cancer Outcomes Research

10.1101/2020.05.28.20115907 ◽

2020 ◽

Cited By ~ 1

Author(s):

Hansi Zhang ◽

Yi Guo ◽

Jiang Bian

Keyword(s):

Risk Factors ◽

Data Analysis ◽

Data Integration ◽

Reporting Guideline ◽

Data Sources ◽

Integration Process ◽

Cancer Outcomes ◽

Source Selection ◽

Data Source ◽

Multi Level

AbstractBackgroundTo reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes, has primarily focused on individual level RFs due to the lack of integrated datasets that contain multi-level, multi-domain RFs. Further, the lack of a consensus and proper guidance on systematically identify RFs also increase the difficulty of RF selection from heterogenous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, as mIDA studies require integrating heterogenous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, and thus threatening transparency and reproducibility.MethodsInformed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide the RF variable selection, data source selection, and data integration process. Then, we developed an ontology to standardize the documentation of the RF selection and data integration process in mIDA studies.ResultsWe summarized the review results and created a reporting guideline—ATTEST—for reporting the variable selection and data source selection and integration process. We provided an ATTEST check list to help researchers to annotate and clearly document each step of their mIDA studies to ensure the transparency and reproducibility. We used the ATTEST to report two mIDA case studies and further transformed annotation results into sematic triples, so that the relationships among variables, data sources and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST.ConclusionOur ontology-based reporting guideline solves some key challenges in current mIDA studies for cancer outcomes research, through providing (1) a theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) a standardized documentation of the data selection and integration processes powered by an ontology, thus a way to enable sharing of mIDA study reports among researchers.

Download Full-text

Data Source Selection Support in the Big Data Integration Process – Towards a Taxonomy

10.1007/978-3-030-86800-0_1 ◽

2021 ◽

pp. 5-21

Author(s):

Felix Kruse ◽

Christoph Schröer ◽

Jorge Marx Gómez

Keyword(s):

Big Data ◽

Data Integration ◽

Integration Process ◽

Source Selection ◽

Data Source

Download Full-text

Semantic-Based Geospatial Data Integration With Unique Features

Geospatial Intelligence ◽

10.4018/978-1-5225-8054-6.ch012 ◽

2019 ◽

pp. 254-277 ◽

Cited By ~ 1

Author(s):

Ying Zhang ◽

Chaopeng Li ◽

Na Chen ◽

Shaowen Liu ◽

Liming Du ◽

...

Keyword(s):

Data Integration ◽

High Performance ◽

Data Access ◽

Heterogeneous Data ◽

Geospatial Data ◽

Experimental Results ◽

Data Sources ◽

Data Format ◽

Access Protocols ◽

Data Source

Since large amount of geospatial data are produced by various sources, geospatial data integration is difficult because of the shortage of semantics. Despite standardised data format and data access protocols, such as Web Feature Service (WFS), can enable end-users with access to heterogeneous data stored in different formats from various sources, it is still time-consuming and ineffective due to the lack of semantics. To solve this problem, a prototype to implement the geospatial data integration is proposed by addressing the following four problems, i.e., geospatial data retrieving, modeling, linking and integrating. We mainly adopt four kinds of geospatial data sources to evaluate the performance of the proposed approach. The experimental results illustrate that the proposed linking method can get high performance in generating the matched candidate record pairs in terms of Reduction Ratio(RR), Pairs Completeness(PC), Pairs Quality(PQ) and F-score. The integrating results denote that each data source can get much Complementary Completeness(CC) and Increased Completeness(IC).

Download Full-text

Semantic Web and Geospatial Unique Features Based Geospatial Data Integration

Geospatial Intelligence ◽

10.4018/978-1-5225-8054-6.ch011 ◽

2019 ◽

pp. 230-253

Author(s):

Ying Zhang ◽

Chaopeng Li ◽

Na Chen ◽

Shaowen Liu ◽

Liming Du ◽

...

Keyword(s):

Semantic Web ◽

Data Integration ◽

High Performance ◽

Data Access ◽

Heterogeneous Data ◽

Geospatial Data ◽

Data Sources ◽

Modeling Process ◽

Translation Function ◽

Data Source

Since large amount of geospatial data are produced by various sources and stored in incompatible formats, geospatial data integration is difficult because of the shortage of semantics. Despite standardised data format and data access protocols, such as Web Feature Service (WFS), can enable end-users with access to heterogeneous data stored in different formats from various sources, it is still time-consuming and ineffective due to the lack of semantics. To solve this problem, a prototype to implement the geospatial data integration is proposed by addressing the following four problems, i.e., geospatial data retrieving, modeling, linking and integrating. First, we provide a uniform integration paradigm for users to retrieve geospatial data. Then, we align the retrieved geospatial data in the modeling process to eliminate heterogeneity with the help of Karma. Our main contribution focuses on addressing the third problem. Previous work has been done by defining a set of semantic rules for performing the linking process. However, the geospatial data has some specific geospatial relationships, which is significant for linking but cannot be solved by the Semantic Web techniques directly. We take advantage of such unique features about geospatial data to implement the linking process. In addition, the previous work will meet a complicated problem when the geospatial data sources are in different languages. In contrast, our proposed linking algorithms are endowed with translation function, which can save the translating cost among all the geospatial sources with different languages. Finally, the geospatial data is integrated by eliminating data redundancy and combining the complementary properties from the linked records. We mainly adopt four kinds of geospatial data sources, namely, OpenStreetMap(OSM), Wikmapia, USGS and EPA, to evaluate the performance of the proposed approach. The experimental results illustrate that the proposed linking method can get high performance in generating the matched candidate record pairs in terms of Reduction Ratio(RR), Pairs Completeness(PC), Pairs Quality(PQ) and F-score. The integrating results denote that each data source can get much Complementary Completeness(CC) and Increased Completeness(IC).

Download Full-text

Semantic-Based Geospatial Data Integration With Unique Features

Innovations, Developments, and Applications of Semantic Web and Information Systems - Advances in Web Technologies and Engineering ◽

10.4018/978-1-5225-5042-6.ch015 ◽

2018 ◽

pp. 393-416

Author(s):

Ying Zhang ◽

Chaopeng Li ◽

Na Chen ◽

Shaowen Liu ◽

Liming Du ◽

...

Keyword(s):

Data Integration ◽

High Performance ◽

Data Access ◽

Heterogeneous Data ◽

Geospatial Data ◽

Experimental Results ◽

Data Sources ◽

Data Format ◽

Access Protocols ◽

Data Source

Download Full-text

Integrate inconsistent and heterogeneous data based on user feedback

International Journal of Intelligent Computing and Cybernetics ◽

10.1108/ijicc-04-2014-0013 ◽

2015 ◽

Vol 8 (2) ◽

pp. 187-203 ◽

Cited By ~ 2

Author(s):

Lihua Lu ◽

Hengzhen Zhang ◽

Xiao-Zhi Gao

Keyword(s):

Data Integration ◽

Data Model ◽

Quality Criteria ◽

Heterogeneous Data ◽

Data Sources ◽

Content Type ◽

Inconsistent Data ◽

Data Inconsistency ◽

Multi Attribute Decision Making ◽

Data Source

Purpose – Data integration is to combine data residing at different sources and to provide the users with a unified interface of these data. An important issue on data integration is the existence of conflicts among the different data sources. Data sources may conflict with each other at data level, which is defined as data inconsistency. The purpose of this paper is to aim at this problem and propose a solution for data inconsistency in data integration. Design/methodology/approach – A relational data model extended with data source quality criteria is first defined. Then based on the proposed data model, a data inconsistency solution strategy is provided. To accomplish the strategy, fuzzy multi-attribute decision-making (MADM) approach based on data source quality criteria is applied to obtain the results. Finally, users feedbacks strategies are proposed to optimize the result of fuzzy MADM approach as the final data inconsistent solution. Findings – To evaluate the proposed method, the data obtained from the sensors are extracted. Some experiments are designed and performed to explain the effectiveness of the proposed strategy. The results substantiate that the solution has a better performance than the other methods on correctness, time cost and stability indicators. Practical implications – Since the inconsistent data collected from the sensors are pervasive, the proposed method can solve this problem and correct the wrong choice to some extent. Originality/value – In this paper, for the first time the authors study the effect of users feedbacks on integration results aiming at the inconsistent data.

Download Full-text

Top-K data source selection for keyword queries over multiple XML data sources

Journal of Information Science ◽

10.1177/0165551511435875 ◽

2012 ◽

Vol 38 (2) ◽

pp. 156-175 ◽

Cited By ~ 3

Author(s):

Khanh Nguyen ◽

Jinli Cao

Keyword(s):

Data Sources ◽

Xml Data ◽

Source Selection ◽

Data Source ◽

Selection For

Download Full-text

DATA INTEGRATION MODEL DESIGN FOR SUPPORTING DATA CENTER PATIENT SERVICES DISTRIBUTED INSURANCE PURCHASE WITH VIEW BASED DATA INTEGRATION

Computer Engineering Science and System Journal ◽

10.24114/cess.v3i2.8895 ◽

2018 ◽

Vol 3 (2) ◽

pp. 162

Author(s):

Slamet Sudaryanto Nurhendratno ◽

Sudaryanto Sudaryanto

Keyword(s):

Data Integration ◽

Data Center ◽

Data Exchange ◽

Heterogeneous Data ◽

Data Sources ◽

Query Rewriting ◽

Schema Mapping ◽

Data Interoperability ◽

Integration Model ◽

Data Source

Data integration is an important step in integrating information from multiple sources. The problem is how to find and combine data from scattered data sources that are heterogeneous and have semantically informant interconnections optimally. The heterogeneity of data sources is the result of a number of factors, including storing databases in different formats, using different software and hardware for database storage systems, designing in different data semantic models (Katsis & Papakonstantiou, 2009, Ziegler & Dittrich , 2004). Nowadays there are two approaches in doing data integration that is Global as View (GAV) and Local as View (LAV), but both have different advantages and limitations so that proper analysis is needed in its application. Some of the major factors to be considered in making efficient and effective data integration of heterogeneous data sources are the understanding of the type and structure of the source data (source schema). Another factor to consider is also the view type of integration result (target schema). The results of the integration can be displayed into one type of global view or a variety of other views. So in integrating data whose source is structured the approach will be different from the integration of the data if the data source is not structured or semi-structured. Scheme mapping is a specific declaration that describes the relationship between the source scheme and the target scheme. In the scheme mapping is expressed in in some logical formulas that can help applications in data interoperability, data exchange and data integration. In this paper, in the case of establishing a patient referral center data center, it requires integration of data whose source is derived from a number of different health facilities, it is necessary to design a schema mapping system (to support optimization). Data Center as the target orientation schema (target schema) from various reference service units as a source schema (source schema) has the characterization and nature of data that is structured and independence. So that the source of data can be integrated tersetruktur of the data source into an integrated view (as a data center) with an equivalent query rewriting (equivalent). The data center as a global schema serves as a schema target requires a "mediator" that serves "guides" to maintain global schemes and map (mapping) between global and local schemes. Data center as from Global As View (GAV) here tends to be single and unified view so to be effective in its integration process with various sources of schema which is needed integration facilities "integration". The "Pemadu" facility is a declarative mapping language that allows to specifically link each of the various schema sources to the data center. So that type of query rewriting equivalent is suitable to be applied in the context of query optimization and maintenance of physical data independence.Keywords: Global as View (GAV), Local as View (LAV), source schema ,mapping schema

Download Full-text

Integration of Data Sources through Data Mining

Encyclopedia of Data Warehousing and Mining ◽

10.4018/978-1-59140-557-3.ch118 ◽

2011 ◽

pp. 625-629

Author(s):

Andreas Koeller

Keyword(s):

Data Mining ◽

Data Integration ◽

Data Transformation ◽

Data Sources ◽

Integration Process

Integration of data sources refers to the task of developing a common schema as well as data transformation solutions for a number of data sources with related content. The large number and size of modern data sources make manual approaches at integration increasingly impractical. Data mining can help to partially or fully automate the data integration process.

Download Full-text

A Framework for Enhancing Big Data Integration in Biological Domain Using Distributed Processing

Applied Sciences ◽

10.3390/app10207092 ◽

2020 ◽

Vol 10 (20) ◽

pp. 7092

Author(s):

Ameera Almasoud ◽

Hend Al-Khalifa ◽

AbdulMalik Al-salman ◽

Miltiadis Lytras

Keyword(s):

Big Data ◽

Data Integration ◽

Distributed Processing ◽

Integration Process ◽

Integration Framework ◽

Logical Consistency ◽

Biological Domain ◽

Unified View ◽

Data Source ◽

Local Integration

Massive heterogeneous big data residing at different sites with various types and formats need to be integrated into a single unified view before starting data mining processes. Furthermore, in most of applications and research, a single big data source is not enough to complete the analysis and achieve goals. Unfortunately, there is no general or standardized integration process; the nature of an integration process depends on the data type, domain, and integration purpose. Based on these parameters, we proposed, implemented, and tested a big data integration framework that integrates big data in the biology domain, based on the domain ontology and using distributed processing. The integration resulted in the same result as that obtained from the local integration. The results are equivalent in terms of the ontology size before the integration; in the number of added items, skipped items, and overlapped items; in the ontology size after the integration; and in the number of edges, vertices, and roots. The results also do not violate any logical consistency rules, passing all the logical consistency tests, such as Jena Ontology API, HermiT, and Pellet reasoners. The integration result is a new big data source that combines big data from several critical sources in the biology domain and transforms it into one unified format to help researchers and specialists use it for further research and analysis.

Download Full-text