Top-K data source selection for keyword queries over multiple XML data sources

There are numerous approaches for integrating data from heterogeneous data sources. A common background assumption is that the data sources remain quite stable and are known in advance, so an integration system can be built to manipulate them. In practice, however, there is often a demand for supporting ad hoc information needs concerning unexpected autonomous data sources containing volatile data. A different approach is therefore needed. We propose that semantically similar data be harmonized when data are extracted from XML-based data sources. We introduce a constructor algebra, which is a powerful tool for harmonizing XML data. For any XML data source, this algebra can form a unique relational representation, called an XML relation. We demonstrate that the XML relation representation supports the grouping and aggregation of data needed, for example, in OLAP-style (online analytical processing) applications.
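The general idea of harmonizing XML data into relational rows and then grouping and aggregating them can be illustrated with a minimal Python sketch. This is not the constructor algebra or the XML relation defined by the authors; the sample documents, the field mappings, and the helper names `extract_rows` and `total_by` are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two hypothetical sources that describe similar data with different tag names.
SOURCE_A = (
    "<shop>"
    "<item><name>pen</name><cost>2.5</cost></item>"
    "<item><name>ink</name><cost>4</cost></item>"
    "</shop>"
)
SOURCE_B = "<catalog><product><title>pen</title><price>3</price></product></catalog>"


def extract_rows(xml_text, record_tag, field_map):
    """Turn each <record_tag> element into a dict with harmonized column names.

    field_map maps source-specific child tags to shared column names
    (an assumed, hand-written harmonization step).
    """
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(record_tag):
        row = {}
        for child in record:
            column = field_map.get(child.tag)
            if column is not None:
                row[column] = (child.text or "").strip()
        rows.append(row)
    return rows


def total_by(rows, key, value):
    """Group rows by `key` and sum the numeric column `value`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row[value])
    return dict(totals)


rows = extract_rows(SOURCE_A, "item", {"name": "product", "cost": "price"})
rows += extract_rows(SOURCE_B, "product", {"title": "product", "price": "price"})
totals = total_by(rows, "product", "price")
print(totals)
```

Once both sources share the same column names, the grouping step no longer needs to know which source a row came from, which is the property that makes OLAP-style aggregation over ad hoc sources feasible.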

Data source selection for approximate query

Journal of Combinatorial Optimization ◽

10.1007/s10878-021-00760-y ◽

2021 ◽

Author(s):

Hongjie Guo ◽

Jianzhong Li ◽

Hong Gao

Keyword(s):

Source Selection ◽

Data Source ◽

Approximate Query ◽

Selection For

Significance of wave data source selection for vessel response prediction and fatigue damage estimation

Ocean Engineering ◽

10.1016/j.oceaneng.2020.107610 ◽

2020 ◽

Vol 216 ◽

pp. 107610

Author(s):

Matthew L. Schirmann ◽

Matthew D. Collette ◽

James W. Gose

Keyword(s):

Fatigue Damage ◽

Response Prediction ◽

Damage Estimation ◽

Wave Data ◽

Source Selection ◽

Vessel Response ◽

Data Source ◽

Selection For

An ontology-based documentation of data discovery and integration process in cancer outcomes research

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-020-01270-3 ◽

2020 ◽

Vol 20 (S4) ◽

Cited By ~ 1

Author(s):

Hansi Zhang ◽

Yi Guo ◽

Mattia Prosperi ◽

Jiang Bian

Keyword(s):

Risk Factors ◽

Data Integration ◽

Outcomes Research ◽

Reporting Guideline ◽

Data Sources ◽

Integration Process ◽

Cancer Outcomes ◽

Source Selection ◽

Data Source ◽

Multi Level

Abstract. Background: To reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes has primarily focused on individual-level RFs due to the lack of integrated datasets that contain multi-level, multi-domain RFs. Further, the lack of consensus and proper guidance on systematically identifying RFs also increases the difficulty of RF selection from heterogeneous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, as mIDA studies require integrating heterogeneous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, thus threatening transparency and reproducibility. Methods: Informed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide the RF variable selection, data source selection, and data integration process. Then, we developed an ontology to standardize the documentation of the RF selection and data integration process in mIDA studies. Results: We summarized the review results and created a reporting guideline, ATTEST, for reporting the variable selection and the data source selection and integration process. We provided an ATTEST checklist to help researchers annotate and clearly document each step of their mIDA studies to ensure transparency and reproducibility. We used ATTEST to report two mIDA case studies and further transformed the annotation results into semantic triples, so that the relationships among variables, data sources, and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST. Conclusion: Our ontology-based reporting guideline solves some key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) standardized documentation of the data selection and integration processes powered by an ontology, thereby enabling the sharing of mIDA study reports among researchers.

Data sources selection for XML data sources

International Journal of Intelligent Information and Database Systems ◽

10.1504/ijiids.2008.021446 ◽

2008 ◽

Vol 2 (4) ◽

pp. 422 ◽

Cited By ~ 1

Author(s):

Hongzhi Wang ◽

Jianzhong Li ◽

Jizhou Luo

Keyword(s):

Data Sources ◽

Xml Data ◽

Selection For

An Ontology-based Approach to Guide and Document Variable and Data Source Selection and Data Integration Process to Support Integrative Data Analysis in Cancer Outcomes Research

10.1101/2020.05.28.20115907 ◽

2020 ◽

Cited By ~ 1

Author(s):

Hansi Zhang ◽

Yi Guo ◽

Jiang Bian

Keyword(s):

Risk Factors ◽

Data Analysis ◽

Data Integration ◽

Reporting Guideline ◽

Data Sources ◽

Integration Process ◽

Cancer Outcomes ◽

Source Selection ◽

Data Source ◽

Multi Level

Abstract. Background: To reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes has primarily focused on individual-level RFs due to the lack of integrated datasets that contain multi-level, multi-domain RFs. Further, the lack of consensus and proper guidance on systematically identifying RFs also increases the difficulty of RF selection from heterogeneous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, as mIDA studies require integrating heterogeneous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, thus threatening transparency and reproducibility. Methods: Informed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide the RF variable selection, data source selection, and data integration process. Then, we developed an ontology to standardize the documentation of the RF selection and data integration process in mIDA studies. Results: We summarized the review results and created a reporting guideline, ATTEST, for reporting the variable selection and the data source selection and integration process. We provided an ATTEST checklist to help researchers annotate and clearly document each step of their mIDA studies to ensure transparency and reproducibility. We used ATTEST to report two mIDA case studies and further transformed the annotation results into semantic triples, so that the relationships among variables, data sources, and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST. Conclusion: Our ontology-based reporting guideline solves some key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) standardized documentation of the data selection and integration processes powered by an ontology, thereby enabling the sharing of mIDA study reports among researchers.
