The road towards data integration in human genomics: players, steps and interactions

Abstract Thousands of new experimental datasets are becoming available every day; in many cases, they are produced within the scope of large cooperative efforts, involving a variety of laboratories spread all over the world, and typically open for public use. Although the potential collective amount of available information is huge, the effective combination of such public sources is hindered by data heterogeneity, as the datasets exhibit a wide variety of notations and formats, concerning both experimental values and metadata. Thus, data integration is becoming a fundamental activity, to be performed prior to data analysis and biological knowledge discovery, consisting of subsequent steps of data extraction, normalization, matching and enrichment; once applied to heterogeneous data sources, it builds multiple perspectives over the genome, leading to the identification of meaningful relationships that could not be perceived by using incompatible data formats. In this paper, we first describe a technological pipeline from data production to data integration; we then propose a taxonomy of genomic data players (based on the distinction between contributors, repository hosts, consortia, integrators and consumers) and apply the taxonomy to describe about 30 important players in genomic data management. We specifically focus on the integrator players and analyse the issues in solving the genomic data integration challenges, as well as evaluate the computational environments that they provide to follow up data integration by means of visualization and analysis tools.
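The four integration steps named in this abstract (extraction, normalization, matching and enrichment) can be illustrated with a minimal sketch on toy records. Everything in it is an assumption made for illustration: the `Record` type, the `TISSUE_SYNONYMS` table, the function names and the sample data are hypothetical and are not taken from the paper or from any genomic repository.

```python
from dataclasses import dataclass

# Hypothetical synonym table: maps source-specific tissue names to one term.
TISSUE_SYNONYMS = {"hepatic tissue": "liver", "lung tissue": "lung"}


@dataclass
class Record:
    source: str
    gene: str
    tissue: str


def extract(raw_rows, source):
    """Extraction: keep only the fields of interest from source-specific rows."""
    return [Record(source, row["gene"], row["tissue"]) for row in raw_rows]


def normalize(record):
    """Normalization: bring gene symbols and tissue names to a shared notation."""
    gene = record.gene.strip().upper()
    tissue = record.tissue.strip().lower()
    return Record(record.source, gene, TISSUE_SYNONYMS.get(tissue, tissue))


def match(records):
    """Matching: group records that describe the same (gene, tissue) pair."""
    groups = {}
    for record in records:
        groups.setdefault((record.gene, record.tissue), []).append(record)
    return groups


def enrich(groups, gene_info):
    """Enrichment: attach external knowledge (here, a gene description)."""
    return {
        key: {
            "sources": sorted({record.source for record in records}),
            "description": gene_info.get(key[0], "unknown"),
        }
        for key, records in groups.items()
    }


def integrate(sources, gene_info):
    """Run the four steps in order over a dict of {source name: raw rows}."""
    records = []
    for source, rows in sources.items():
        records.extend(normalize(r) for r in extract(rows, source))
    return enrich(match(records), gene_info)
```

With two toy sources that spell the same gene and tissue differently, `integrate` merges them into a single entry that lists both sources, which is the kind of relationship that stays hidden while the notations remain incompatible.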

Enabling semantic queries across federated bioinformatics databases

Database ◽

10.1093/database/baz106 ◽

2019 ◽

Vol 2019 ◽

Cited By ~ 9

Author(s):

Ana Claudia Sima ◽

Tarcisio Mendes de Farias ◽

Erich Zbinden ◽

Maria Anisimova ◽

Manuel Gil ◽

...

Keyword(s):

Gene Expression ◽

Data Integration ◽

Heterogeneous Data ◽

Biological Data ◽

Data Sources ◽

Biological Knowledge ◽

Biological Databases ◽

Semantic Level ◽

Sparql Endpoint ◽

Description Framework

Abstract Motivation: Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data available publicly. However, the heterogeneity of the different data sources, both at the syntactic and the semantic level, still poses significant challenges for achieving interoperability among biological databases. Results: We introduce an ontology-based federated approach for data integration. We applied this approach to three heterogeneous data stores that span different areas of biological knowledge: (i) Bgee, a gene expression relational database; (ii) Orthologous Matrix (OMA), a Hierarchical Data Format 5 (HDF5) orthology data store; and (iii) UniProtKB, a Resource Description Framework (RDF) store containing protein sequence and functional information. To enable federated queries across these sources, we first defined a new semantic model for gene expression called GenEx. We then show how the relational data in Bgee can be expressed as a virtual RDF graph, instantiating GenEx, through dedicated relational-to-RDF mappings. By applying these mappings, Bgee data are now accessible through a public SPARQL endpoint. Similarly, the materialized RDF data of OMA, expressed in terms of the Orthology ontology, are made available in a public SPARQL endpoint. We identified and formally described intersection points (i.e. virtual links) among the three data sources. These allow performing joint queries across the data stores. Finally, we lay the groundwork to enable nontechnical users to benefit from the integrated data, by providing a natural language template-based search interface.
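As a rough illustration of what a joint query across two SPARQL endpoints can look like, the sketch below assembles a query that uses the standard SPARQL 1.1 `SERVICE` keyword. It is an assumption-laden example, not the authors' queries: the endpoint URLs are passed in by the caller, and the `ex:` prefix with the predicates `ex:label`, `ex:orthologOf` and `ex:expressedIn` is a hypothetical placeholder vocabulary that stands in for the GenEx and Orthology ontology terms, which the abstract does not list.

```python
def escape_literal(text):
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_federated_query(expression_endpoint, orthology_endpoint, gene_label):
    """Compose a query whose two SERVICE blocks are joined on ?ortholog.

    The shared variable plays the role of an intersection point between the
    two sources. All ex: terms are hypothetical placeholders.
    """
    label = escape_literal(gene_label)
    return f"""PREFIX ex: <http://example.org/vocab#>
SELECT ?gene ?ortholog ?anatomy WHERE {{
  SERVICE <{orthology_endpoint}> {{
    ?gene ex:label "{label}" .
    ?gene ex:orthologOf ?ortholog .
  }}
  SERVICE <{expression_endpoint}> {{
    ?ortholog ex:expressedIn ?anatomy .
  }}
}}"""
```

The function only builds the query text; sending it to a real endpoint, and replacing the placeholder terms with the actual ontology vocabulary, is left to the reader.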

Enabling Semantic Queries Across Federated Bioinformatics Databases

10.1101/686600 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ana Claudia Sima ◽

Tarcisio Mendes de Farias ◽

Erich Zbinden ◽

Maria Anisimova ◽

Manuel Gil ◽

...

Keyword(s):

Gene Expression ◽

Data Integration ◽

Heterogeneous Data ◽

Biological Data ◽

Data Sources ◽

Biological Knowledge ◽

Biological Databases ◽

Semantic Level ◽

Sparql Endpoint ◽

Link Type

Motivation: Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data available publicly. However, the heterogeneity of the different data sources, both at the syntactic and the semantic level, still poses significant challenges for achieving interoperability among biological databases. Results: We introduce an ontology-based federated approach for data integration. We applied this approach to three heterogeneous data stores that span different areas of biological knowledge: 1) Bgee, a gene expression relational database; 2) OMA, a Hierarchical Data Format 5 (HDF5) orthology data store; and 3) UniProtKB, a Resource Description Framework (RDF) store containing protein sequence and functional information. To enable federated queries across these sources, we first defined a new semantic model for gene expression called GenEx. We then show how the relational data in Bgee can be expressed as a virtual RDF graph, instantiating GenEx, through dedicated relational-to-RDF mappings. By applying these mappings, Bgee data are now accessible through a public SPARQL endpoint. Similarly, the materialised RDF data of OMA, expressed in terms of the Orthology ontology, is made available in a public SPARQL endpoint. We identified and formally described intersection points (i.e. virtual links) among the three data sources. These allow performing joint queries across the data stores. Finally, we lay the groundwork to enable nontechnical users to benefit from the integrated data, by providing a natural language template-based search interface. Project URL: http://biosoda.expasy.org, https://github.com/biosoda/bioquery

VGEs-Oriented Multi-sourced Heterogeneous Data Integration

Geo-information Science ◽

10.3724/sp.j.1047.2009.00292 ◽

2010 ◽

Vol 11 (3) ◽

pp. 292-298

Author(s):

Hongjun SU ◽

Yehua SHENG ◽

Yongning WEN ◽

Min CHEN

Keyword(s):

Data Integration ◽

Heterogeneous Data ◽

Heterogeneous Data Integration

Methodology of Big Data Integration from A Priori Unknown Heterogeneous Data Sources

Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence - CSAI '18 ◽

10.1145/3297156.3297249 ◽

2018 ◽

Author(s):

Alexey Samoylov ◽

Nikolay Sergeev ◽

Margarita Kucherova ◽

Boris Denisov

Keyword(s):

Big Data ◽

Data Integration ◽

A Priori ◽

Heterogeneous Data ◽

Data Sources ◽

Heterogeneous Data Sources

MuSA: a graphical user interface for multi-OMICs data integration in radiogenomic studies

Scientific Reports ◽

10.1038/s41598-021-81200-z ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Mario Zanfardino ◽

Rossana Castaldo ◽

Katia Pane ◽

Ornella Affinito ◽

Marco Aiello ◽

...

Keyword(s):

User Interface ◽

Data Integration ◽

Graphical User Interface ◽

Data Science ◽

Heterogeneous Data ◽

Biological Information ◽

Omics Data ◽

Correlation Clustering ◽

Downstream Analysis ◽

Omics Data Integration

Abstract Analysis of large-scale omics data along with biomedical images has gained great interest for predicting phenotypic conditions in personalized medicine. Multiple layers of investigation, such as genomics, transcriptomics and proteomics, have led to high dimensionality and heterogeneity of data. Multi-omics data integration can make a meaningful contribution to early diagnosis and to an accurate estimate of prognosis and treatment in cancer. Some multi-layer data structures have been developed to integrate multi-omics biological information, but none of these has been developed and evaluated to include radiomic data. We proposed to use MultiAssayExperiment (MAE) as an integrated data structure to combine multi-omics data, facilitating the exploration of heterogeneous data. We improved the usability of the MAE by developing a Multi-omics Statistical Approaches (MuSA) tool that uses a Shiny graphical user interface and simplifies the management and analysis of radiogenomic datasets. The capabilities of MuSA were shown using public breast cancer datasets from the TCGA-TCIA databases. The MuSA architecture is modular and is divided into pre-processing and downstream analysis. The pre-processing section allows data filtering and normalization. The downstream analysis section contains modules for data science such as correlation, clustering (i.e., heatmap) and feature selection methods. The results are dynamically shown in MuSA. The MuSA tool provides an easy-to-use way to create, manage and analyze radiogenomic data. The application is specifically designed to guide non-programmer researchers through different computational steps. Integration analysis is implemented in a modular structure, making MuSA easily extensible open-source software.

Reconsideration of in silico siRNA design from a perspective of heterogeneous data integration: problems and solutions

Briefings in Bioinformatics ◽

10.1093/bib/bbs073 ◽

2012 ◽

Vol 15 (2) ◽

pp. 292-305 ◽

Cited By ~ 5

Author(s):

Q. Liu ◽

H. Zhou ◽

R. Zhu ◽

Y. Xu ◽

Z. Cao

Keyword(s):

Data Integration ◽

In Silico ◽

Heterogeneous Data ◽

Heterogeneous Data Integration ◽

Problems And Solutions ◽

Sirna Design ◽

Integration Problems

Data quality-aware genomic data integration

Computer Methods and Programs in Biomedicine Update ◽

10.1016/j.cmpbup.2021.100009 ◽

2021 ◽

pp. 100009

Author(s):

Anna Bernasconi

Keyword(s):

Data Integration ◽

Data Quality ◽

Genomic Data ◽

Genomic Data Integration

Hepatitis C and the absence of genomic data in low-income countries: a barrier on the road to elimination?

The Lancet Gastroenterology & Hepatology ◽

10.1016/s2468-1253(17)30257-1 ◽

2017 ◽

Vol 2 (10) ◽

pp. 700-701 ◽

Cited By ~ 13

Author(s):

Marc Niebel ◽

Joshua B Singer ◽

Sema Nickbakhsh ◽

Robert J Gifford ◽

Emma C Thomson

Keyword(s):

Hepatitis C ◽

Low Income ◽

Genomic Data ◽

Low Income Countries ◽

The Road ◽

On The Road

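Data Integration, Relationship, and Representation in Smart Cities: A Literature Review (original title: Integração, Relacionamento e Representação de Dados em Cidades Inteligentes: Uma Revisão de Literatura)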

10.5753/wbci.2018.3231 ◽

2018 ◽

Author(s):

Larysse Silva ◽

José Alex Lima ◽

Nélio Cacho ◽

Eiji Adachi ◽

Frederico Lopes ◽

...

Keyword(s):

Decision Making ◽

Literature Review ◽

Data Integration ◽

Smart Cities ◽

Heterogeneous Data ◽

Data Sources ◽

Application Development ◽

Continuous Integration ◽

Heterogeneous Data Sources ◽

Computational Systems

A notable characteristic of smart cities is the increase in the amount of available data generated by numerous devices and computational systems, which heightens the challenges of developing software that integrates large volumes of data. In this context, this paper presents a literature review aimed at identifying the main strategies used in the development of solutions for data integration, relationship, and representation in smart cities. The study systematically selected and analyzed eleven studies published from 2015 to 2017. The results reveal gaps in solutions for the continuous integration of heterogeneous data sources to support application development and decision-making.

A Data Model for Heterogeneous Data Integration Architecture

Communications in Computer and Information Science - Beyond Databases, Architectures, and Structures ◽

10.1007/978-3-319-06932-6_53 ◽

2014 ◽

pp. 547-556 ◽

Cited By ~ 7

Author(s):

Michał Chromiak ◽

Krzysztof Stencel

Keyword(s):

Data Integration ◽

Data Model ◽

Heterogeneous Data ◽

Heterogeneous Data Integration ◽

Integration Architecture
