Evaluating Graph Database Systems for Biological Data

Author(s):  
Minghe Yu ◽  
Yaxuan Zang ◽  
Shaopeng Dai ◽  
Daoyi Zheng ◽  
Jinheng Li
Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Claire M Simpson ◽  
Florian Gnad

Abstract Graph representations provide an elegant solution to capture and analyze complex molecular mechanisms in the cell. Co-expression networks are undirected graph representations of transcriptional co-behavior indicating (co-)regulations, functional modules or even physical interactions between the corresponding gene products. The growing avalanche of available RNA sequencing (RNAseq) data fuels the construction of such networks, which are usually stored in relational databases like most other biological data. Inferring linkage by recursive multiple-join statements, however, is computationally expensive and complex to design in relational databases. In contrast, graph databases store and represent complex interconnected data as nodes, edges and properties, making it fast and intuitive to query and analyze relationships. While graph-based database technologies are on their way from a fringe domain to going mainstream, there are only a few studies reporting their application to biological data. We used the graph database management system Neo4j to store and analyze co-expression networks derived from RNAseq data from The Cancer Genome Atlas. Comparing co-expression in tumors versus healthy tissues in six cancer types revealed significant perturbation tracing back to erroneous or rewired gene regulation. Applying centrality, community detection and pathfinding graph algorithms uncovered the destruction or creation of central nodes, modules and relationships in co-expression networks of tumors. Given the speed, accuracy and straightforwardness of managing these densely connected networks, we conclude that graph databases are ready for entering the arena of biological data.
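The contrast drawn in this abstract between recursive joins and direct graph traversal can be illustrated with a small sketch. It is not the authors' Neo4j pipeline: the gene names and expression values are hypothetical toy data, the 0.9 correlation threshold is an assumed cutoff, and `pearson`, `build_network`, `degree_centrality` and `neighbors_within` are illustrative helper names. The sketch links genes whose absolute Pearson correlation meets the threshold, then computes degree centrality and a multi-hop neighborhood by following edges rather than joining tables.

```python
from collections import deque
from itertools import combinations
from math import sqrt

# Hypothetical toy data: gene -> expression values across five samples.
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "G3": [5.0, 4.0, 3.0, 2.0, 1.0],
    "G4": [1.0, 3.0, 2.0, 5.0, 4.0],
}


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def build_network(expr, threshold):
    """Undirected adjacency sets; edge when |r| >= threshold (assumed rule)."""
    adj = {gene: set() for gene in expr}
    for a, b in combinations(expr, 2):
        if abs(pearson(expr[a], expr[b])) >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {g: (len(nbrs) / (n - 1) if n > 1 else 0.0) for g, nbrs in adj.items()}


def neighbors_within(adj, start, max_hops):
    """Breadth-first search: all nodes reachable in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    seen.discard(start)
    return seen


network = build_network(expression, threshold=0.9)
centrality = degree_centrality(network)
```

In a relational schema, each additional hop in `neighbors_within` would typically correspond to another self-join on an edge table, which is the cost the abstract refers to; an adjacency-based representation follows the edges directly.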


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3509 ◽  
Author(s):  
Raquel L. Costa ◽  
Luiz Gadelha ◽  
Marcelo Ribeiro-Alves ◽  
Fábio Porto

There are many steps in analyzing transcriptome data, from the acquisition of raw data to the selection of a subset of representative genes that explain a scientific hypothesis. The data produced can be represented as networks of interactions among genes and these may additionally be integrated with other biological databases, such as Protein-Protein Interactions, transcription factors and gene annotation. However, the results of these analyses remain fragmented, imposing difficulties, either for posterior inspection of results, or for meta-analysis by the incorporation of new related data. Integrating databases and tools into scientific workflows, orchestrating their execution, and managing the resulting data and its respective metadata are challenging tasks. Additionally, a great amount of effort is equally required to run in-silico experiments to structure and compose the information as needed for analysis. Different programs may need to be applied and different files are produced during the experiment cycle. In this context, the availability of a platform supporting experiment execution is paramount. We present GeNNet, an integrated transcriptome analysis platform that unifies scientific workflows with graph databases for selecting relevant genes according to the evaluated biological systems. It includes GeNNet-Wf, a scientific workflow that pre-loads biological data, pre-processes raw microarray data and conducts a series of analyses including normalization, differential expression inference, clusterization and gene set enrichment analysis. A user-friendly web interface, GeNNet-Web, allows for setting parameters, executing, and visualizing the results of GeNNet-Wf executions. To demonstrate the features of GeNNet, we performed case studies with data retrieved from GEO, particularly using a single-factor experiment in different analysis scenarios. As a result, we obtained differentially expressed genes for which biological functions were analyzed. 
The results are integrated into GeNNet-DB, a database about genes, clusters, experiments and their properties and relationships. The resulting graph database is explored with queries that demonstrate the expressiveness of this data model for reasoning about gene interaction networks. GeNNet is the first platform to integrate the analytical process of transcriptome data with graph databases. It provides a comprehensive set of tools that would otherwise be challenging for non-expert users to install and use. Developers can add new functionality to components of GeNNet. The derived data allows for testing previous hypotheses about an experiment and exploring new ones through the interactive graph database environment. It enables the analysis of different data on humans, rhesus macaques, mice and rats coming from Affymetrix platforms. GeNNet is available as an open source platform at https://github.com/raquele/GeNNet and can be retrieved as a software container with the command docker pull quelopes/gennet.


2014 ◽  
Author(s):  
Egon Willighagen

Background. Semantic Web technologies are increasingly used in biological database systems. Their improved expressiveness shows advantages in tracking provenance and allowing knowledge to be more explicitly annotated. The list of Semantic Web standards needs a complementary set of tools that handle data in those formats so that they can be used in bioinformatics workflows. Methods. The approach proposed in this paper uses the Apache Jena library to create an environment where Semantic Web technologies can be used in the statistical environment R. The code is exposed as two R packages available from the Comprehensive R Archive Network (CRAN). The rJava library and a custom convenience class are used to bridge between R and the Jena library. Results. We here present three examples showing how the Resource Description Framework (RDF) and SPARQL query standards can be employed in R. The first example takes input on BRCA1 SNPs from a BioMart and converts this into an RDF data set. The second example runs a query on an experimental remote SPARQL endpoint provided by UniProt and searches textual annotations of proteins encoded by the BRCA1 gene. The third example shows how the package can be used to handle RDF returned by OpenTox web services. Discussion. The two provided libraries bring basic Semantic Web technologies to R. While only a subset of Apache Jena is currently exposed, it provides key methods to deal with RDF data and resources. The libraries are freely available from CRAN under the Affero GNU Public License version 3: http://cran.r-project.org/web/packages/rrdf/.
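For readers unfamiliar with the RDF data model mentioned in this abstract, the following is a minimal sketch of triples and of pattern matching in the spirit of a SPARQL basic graph pattern. It is written in plain Python and does not use or reproduce the API of the R packages or of Apache Jena. The `ex:` identifiers, the predicate names and the position values are hypothetical placeholders, and `match` and `query` are illustrative helper names.

```python
# Hypothetical RDF-like data: each fact is a (subject, predicate, object) triple.
triples = [
    ("ex:BRCA1", "ex:hasVariant", "ex:snp1"),
    ("ex:BRCA1", "ex:hasVariant", "ex:snp2"),
    ("ex:snp1", "ex:position", "100"),
    ("ex:snp2", "ex:position", "200"),
    ("ex:TP53", "ex:hasVariant", "ex:snp3"),
]


def match(pattern, triple, bindings):
    """Try to unify one pattern with one triple.

    Pattern terms starting with "?" are variables, as in SPARQL.
    Returns the extended bindings, or None if the triple does not fit.
    """
    extended = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check consistency with an earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(data, patterns):
    """Evaluate a list of patterns as a conjunction (a basic graph pattern)."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in data:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# "Which variants does ex:BRCA1 have, and at which position?"
results = query(
    triples,
    [("ex:BRCA1", "ex:hasVariant", "?snp"), ("?snp", "ex:position", "?pos")],
)
```

The shared variable `?snp` joins the two patterns, so only variants attached to `ex:BRCA1` that also carry a position are returned. A real SPARQL engine adds namespaces, typed literals, filters and query optimization on top of this idea.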


Author(s):  
Elisa Pappalardo ◽  
Domenico Cantone

The successful sequencing of the genomes of various species has produced a great amount of data that needs to be managed and analyzed. With the increasing popularity of high-throughput sequencing technologies, such data require flexible, scalable, efficient algorithms and enterprise data structures that can be manipulated by both biologists and computational scientists. This chapter focuses on the design of large-scale database-driven applications for genomic and proteomic data. It is widely believed that biological databases are similar to any standard database-driven application; however, a number of different and increasingly complex challenges arise. In particular, while standard databases are used just to manage information, in biology they represent a main source for further computational analysis, which frequently focuses on the identification of relations and properties of a network of entities. The analysis starts from the first text-based storage approaches and ends with new insights on object-relational mapping for biological data.


2019 ◽  
Vol 30 (1) ◽  
pp. 41-60 ◽  
Author(s):  
Gustavo Cordeiro Galvão Van Erven ◽  
Rommel Novaes Carvalho ◽  
Waldeyr Mendes Cordeiro da Silva ◽  
Sergio Lifschitz ◽  
Harley Vera-Olivera ◽  
...  

In recent years, graph database systems have become very popular and have been deployed mainly in situations where the relationships between data are significant, such as in social networks. Although they do not require a particular schema design, a data model contributes to their consistency. Designing diagrams is one approach to satisfying this demand for a conceptual data model. While researchers and companies have been developing concepts and notations for graph database modeling, their notations focus on their specific implementations. In this article, the authors propose a diagram named GRAPHED (Graph Description Diagram for Graph Databases) to address this lack of a generic and comprehensive notation for graph database modeling. The authors verified the effectiveness and compatibility of GRAPHED in two case studies: fraud identification and a biological network model.


Author(s):  
Maurizio Nolé ◽  
Carlo Sartiani

In recent years, many real-world applications have been modeled by graph structures (e.g., social networks, mobile phone networks, web graphs, etc.), and many systems have been developed to manage, query, and analyze these datasets. These systems can be divided into specialized graph database systems and large-scale graph analytics systems. The former consider end-to-end data management issues, including storage representations, transactions, and query languages, whereas the latter focus on processing specific tasks over large data graphs. In this paper we provide an overview of several graph database systems and graph processing systems, with the aim of assisting the reader in identifying the best-suited solution for her application scenario.


2020 ◽  
Author(s):  
Divyansh Sehgal

Bioinformatics is an ever-growing field due to the availability of vast database systems and increasing volumes of biological data. This rapid development involves research and development activities and requires adequate protection in the form of Intellectual Property Rights (IPR), as such protection adds value to discoveries and provides incentives to investors. The study covers the role of IPR in bioinformatics, with a major focus on patents and related laws. The paper also analyzes what types of bioinformatics innovations are patentable and how patents protect them, specifically software that analyzes DNA sequences. The paper is presented in four parts: part one consists of the introduction, part two focuses on what bioinformatics is and how it relates to IPR, part three focuses on the patent eligibility criteria for bioinformatics and, lastly, part four presents a conclusion.


Author(s):  
Robinson Vespucio Vaz ◽  
Jones Dhyemison Quito de Oliveira ◽  
Leonardo Andrade Ribeiro

2016 ◽  
Author(s):  
Raquel L. Costa ◽  
Luiz M. R. Gadelha ◽  
Marcelo Ribeiro-Alves ◽  
Fabio Porto

Abstract Background. There are many steps in analyzing transcriptome data, from the acquisition of raw data to the selection of a subset of representative genes that explain a scientific hypothesis. The data produced may additionally be integrated with other biological databases, such as Protein-Protein Interactions and annotations. However, the results of these analyses remain fragmented, imposing difficulties, either for posterior inspection of results, or for meta-analysis by the incorporation of new related data. Integrating databases and tools into scientific workflows, orchestrating their execution, and managing the resulting data and its respective metadata are challenging tasks. Running in-silico experiments to structure and compose the information as needed for analysis is a daunting task. Different programs may need to be applied and different files are produced during the experiment cycle. In this context, the availability of a platform supporting experiment execution is paramount. Results. We present GeNNet, an integrated transcriptome analysis platform that unifies scientific workflows with graph databases for selecting relevant genes according to the evaluated biological systems. GeNNet includes pre-loaded biological data, pre-processes raw microarray data and conducts a series of analyses including normalization, differential expression inference, clusterization and gene set enrichment analysis. To demonstrate the features of GeNNet, we performed case studies with data retrieved from GEO, particularly using a single-factor experiment. As a result, we obtained differentially expressed genes for which biological functions were analyzed.
The results are integrated into GeNNet-DB, a database about genes, clusters, experiments and their properties and relationships. The resulting graph database is explored with queries that demonstrate the expressiveness of this data model for reasoning about gene regulatory networks. Conclusions. GeNNet is the first platform to integrate the analytical process of transcriptome data with graph databases. It provides a comprehensive set of tools that would otherwise be challenging for non-expert users to install and use. Developers can also add new functionality to each component of GeNNet. The resulting data allows for testing previous hypotheses about an experiment as well as exploring new ones through the interactive graph database environment. It enables the analysis of different data on humans, rhesus macaques, mice and rats coming from Affymetrix platforms.

