FAIR.ReD: Semantic knowledge graph infrastructure for the life sciences

Author(s):  
Lars Vogt ◽  
Sören Auer ◽  
Thomas Bartolomaeus ◽  
Pier Luigi Buttigieg ◽  
Peter Grobe ◽  
...  

We would like to present FAIR Research Data: Semantic Knowledge Graph Infrastructure for the Life Sciences (in short, FAIR.ReD), a project initiative that is currently being evaluated for funding. FAIR.ReD is a software environment for developing data management solutions according to the FAIR (Findable, Accessible, Interoperable, Reusable; Wilkinson et al. 2016) data principles. It utilizes what we call a Data Sea Storage, which employs the idea of Data Lakes to decouple data storage from data access, but modifies it by storing data in a semantically structured format as either semantic graphs or semantic tables, instead of storing them in their native form. Storage follows a top-down approach, resulting in a standardized storage model that allows sharing data across all FAIR.ReD Knowledge Graph Applications (KGAs) connected to the same Sea, with newly developed KGAs automatically having access to all contents in the Sea. In contrast, access and export of data follow a bottom-up approach that allows the specification of additional data models to meet the varying domain-specific and programmatic needs for accessing structured data. The FAIR.ReD engine enables bidirectional data conversion between the two storage models and any additional data model, which will substantially reduce the conversion workload for data-rich institutes (Fig. 1). Moreover, with the possibility to store data in semantic tables, FAIR.ReD provides high-performance storage for incoming data streams such as sensor data. FAIR.ReD KGAs are modularly organized. Modules can be edited using the FAIR.ReD editor and combined to form coherent KGAs. The editor allows domain experts to develop their own modules and KGAs without requiring any programming experience, thus also allowing smaller projects and individual researchers to build their own FAIR data management solutions. Contents from FAIR.ReD KGAs can be published under a Creative Commons license as documents, micropublications, or nanopublications, each receiving their own DOI. A publication life cycle is implemented in FAIR.ReD and allows updating published contents for corrections or additions without overwriting the originally published version. Together with the fact that data and metadata are semantically structured and machine-readable, all contents from FAIR.ReD KGAs will comply with the FAIR Guiding Principles. Because all FAIR.ReD KGAs provide access to semantic knowledge graphs in both a human-readable and a machine-readable version, FAIR.ReD seamlessly integrates the complex RDF (Resource Description Framework) world with a more intuitively comprehensible presentation of data in the form of data entry forms, charts, and tables. Guided by use cases, the FAIR.ReD environment will be developed using semantic programming, where the source code of an application is stored in its own ontology. The set of source code ontologies of a KGA and its modules provides the steering logic for running the KGA. With this clear separation of steering logic from interpretation logic, semantic programming follows the idea of separating the main layers of an application, analogous to the separation of interpretation logic and presentation logic. Each KGA and module is specified in exactly this way, and their source code ontologies are stored in the Data Sea. Thus, all data and metadata are semantically transparent, and so is the data management application itself, which substantially improves sustainability at all levels of data processing and storage.
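As a rough illustration of the bidirectional conversion between the two storage models, the following minimal Python sketch round-trips a small semantic graph through a tabular representation. All URIs and field names are hypothetical placeholders, not part of the actual FAIR.ReD data model.

```python
# Minimal sketch of bidirectional conversion between a semantic graph and a
# tabular model, in the spirit of the FAIR.ReD engine described above.
# All URIs and field names are hypothetical. Requires: pip install rdflib pandas
import pandas as pd
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("https://example.org/fairred/")

# Semantic graph: the standardized, top-down storage model.
g = Graph()
specimen = EX["specimen/001"]
g.add((specimen, RDF.type, EX.Specimen))
g.add((specimen, EX.collectedAt, Literal("Helgoland")))
g.add((specimen, EX.bodyLength_mm, Literal(12.4)))

# Graph -> table: flatten triples into a "semantic table", one row per statement.
rows = [(str(s), str(p), str(o)) for s, p, o in g]
table = pd.DataFrame(rows, columns=["subject", "property", "object"])
print(table)

# Table -> graph: the inverse direction of the same mapping, so a single
# declarative mapping serves both conversions. (Literal datatype information
# is simplified away in this toy mapping.)
g2 = Graph()
for _, r in table.iterrows():
    obj = URIRef(r["object"]) if r["object"].startswith("http") else Literal(r["object"])
    g2.add((URIRef(r["subject"]), URIRef(r["property"]), obj))
assert len(g2) == len(g)
```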

Author(s):  
Peter Grobe ◽  
Roman Baum ◽  
Philipp Bhatty ◽  
Christian Köhler ◽  
Sandra Meid ◽  
...  

The landscape of currently existing repositories of specimen data consists of isolated islands, each applying its own underlying data model. Using standardized protocols such as Darwin Core or ABCD, specimen data and metadata are exchanged and published on web portals such as GBIF. However, data models differ across repositories, which can lead to problems when comparing and integrating content from different systems. For example, in one system there is a field labelled 'determination', while in another there is a field labelled 'taxonomic identification'. Both might refer to the same concept of an organism identification process (e.g., 'obi:organism identification assay'; http://purl.obolibrary.org/obo/OBI_0001624), but the intended meaning of the content is not explicit, and the providers' understanding of the information might differ from that of its users. Without additional information, data integration across isolated repositories is thus difficult and error-prone, and the interoperability and retrievability of data suffer. Linked Open Data (LOD) promises an improvement: URIs can be used for concepts that are ideally created and accepted by a community and that provide machine-readable meanings. LOD thereby supports the transfer of data into information and then into knowledge, thus making the data FAIR (Findable, Accessible, Interoperable, Reusable; Wilkinson et al. 2016). Annotating specimen-associated data with LOD therefore seems to be a promising approach to guarantee interoperability across different repositories. However, all currently used specimen collection management systems are based on relational database systems, which lack semantic transparency and thus do not provide easily accessible, machine-readable meanings for the terms used in their data models. As a consequence, transferring their data contents into an LOD framework may lead to loss or misinterpretation of information. Storing specimen collection data as semantic Knowledge Graphs instead provides semantic transparency and machine-readability of the data. Semantic Knowledge Graphs are graphs based on the 'Subject – Property – Object' syntax of the Resource Description Framework (RDF): the 'Subject' and 'Property' positions are taken by URIs, and the 'Object' position can be taken either by a URI or by a label or value. Since a given URI can take the 'Subject' position in one RDF statement and the 'Object' position in another, several RDF statements can be connected to form a directed labeled graph, i.e. a semantic graph. In such semantic Knowledge Graphs, each described specimen and its parts and properties possess their own URIs and thus can be individually referenced. These URIs are used to describe the respective specimen and its properties using the RDF syntax, and additional RDF statements specify the ontology class that each part and property instantiates. The reference to the URIs of the instantiated ontology classes guarantees the Findability, Interoperability, and Reusability of information contained in semantic Knowledge Graphs. Specimen collection data contained in semantic Knowledge Graphs can be made Accessible in a human-readable form through an interface and in a machine-readable form through a SPARQL endpoint (https://en.wikipedia.org/wiki/SPARQL). As a consequence, semantic Knowledge Graphs comply with the FAIR guiding principles. Because the semantic Knowledge Graph of each specimen in the collection uses URIs, it is also available as LOD. With semantic Morph·D·Base, we have implemented a prototype of this approach, based on Semantic Programming. We present the prototype and discuss different aspects of how specimen collection data are handled. By using community-created terminologies and standardized methods for the contents created (e.g. species identification), as well as URIs for each expression, we make the data and metadata semantically transparent and communicable. The source code for Semantic Programming and for semantic Morph·D·Base is available from https://github.com/SemanticProgramming. The prototype of semantic Morph·D·Base can be accessed here: https://proto.morphdbase.de.
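To make the 'Subject – Property – Object' pattern concrete, here is a minimal sketch of a specimen description as RDF statements using Python's rdflib. The OBI class URI is the one cited above; all other URIs are hypothetical placeholders rather than the actual Morph·D·Base data model.

```python
# Sketch: a specimen and its identification event, each with its own URI.
# The OBI class URI is from the text; everything else is hypothetical.
# Requires: pip install rdflib
from rdflib import Graph, Literal, Namespace, RDF, RDFS, URIRef

EX = Namespace("https://example.org/specimens/")
OBI_ID_ASSAY = URIRef("http://purl.obolibrary.org/obo/OBI_0001624")  # organism identification assay

g = Graph()
specimen = EX["S-2019-0042"]              # the specimen's own URI
identification = EX["S-2019-0042/id1"]    # the identification event's own URI

g.add((specimen, RDFS.label, Literal("voucher specimen S-2019-0042")))
g.add((identification, RDF.type, OBI_ID_ASSAY))   # instantiates the ontology class
g.add((identification, EX.identifies, specimen))
g.add((identification, EX.taxon, Literal("Lumbricus terrestris")))

# Machine-readable access: the same query could be posed to a SPARQL endpoint.
q = """
SELECT ?specimen ?taxon WHERE {
  ?id a <http://purl.obolibrary.org/obo/OBI_0001624> ;
      <https://example.org/specimens/identifies> ?specimen ;
      <https://example.org/specimens/taxon> ?taxon .
}
"""
for row in g.query(q):
    print(row.specimen, row.taxon)
```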


2020 ◽  
Vol 17 (12) ◽  
pp. 5229-5237
Author(s):  
P. Selvaraj ◽  
Venkatesh Kannan ◽  
Bruno Voisin

Real-time applications demand high-speed, reliable data access from remote databases, so an effective logical data management strategy that handles simultaneous connections with well-negotiated performance is indispensable. This work considers an e-healthcare application and proposes MongoDB-based modified indexing and performance-tuning methods. To cope with high-frequency use cases and their performance mandates, a flexible and efficient logical data management scheme may be preferred. By analysing the data dependencies, data decomposition concerns, and performance requirements of the specific use case of the medical application, a logical schema may be customized on an à la carte basis. This work focuses on flexible logical data modeling schemes and their performance factors in a NoSQL database. The efficiency of unstructured database management in storing and retrieving e-healthcare data was analysed with a web-based tool. To enable faster data retrieval and query processing over distributed nodes, a Spark-based storage engine was built on top of the MongoDB-based data storage management. With Spark, the database was distributed in a master–slave structure with suitable data replication mechanisms, and failover was likewise implemented through the replication mechanism. The proposed approach thus combines flexible MongoDB schema modeling with on-demand Spark-based distributed computation over multiple chunks of data. To support the eventual-consistency and scalability requirements of e-healthcare applications, use-case-based indexing was proposed. With effective data management and faster query processing, horizontal scalability was increased. The overall efficiency and scalability of the proposed logical data management approach was analysed; simulation studies indicate that the approach boosts the performance of big-data-based applications to a considerable extent.
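The following hedged sketch illustrates the two building blocks the abstract combines: a use-case-driven compound index in MongoDB (via pymongo) and a distributed Spark read over the same collection. Database, collection, and field names are invented for illustration, and the Spark read assumes the MongoDB Spark connector (10.x) is on the Spark classpath; this is not the authors' exact configuration.

```python
# Requires: pip install pymongo pyspark (plus the MongoDB Spark connector jar)
from pymongo import ASCENDING, DESCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
records = client["ehealth"]["patient_vitals"]

# Compound index tuned to a high-frequency query pattern:
# "latest vitals for one patient", filtered by patient and sorted by time.
records.create_index([("patient_id", ASCENDING), ("recorded_at", DESCENDING)])

# Fast point query served by the index above.
latest = records.find({"patient_id": "P-1001"}).sort("recorded_at", -1).limit(10)

# On-demand distributed computation: Spark reads the collection in partitions
# and aggregates across worker nodes.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ehealth-analytics")
         .config("spark.mongodb.read.connection.uri",
                 "mongodb://localhost:27017/ehealth.patient_vitals")
         .getOrCreate())
df = spark.read.format("mongodb").load()
df.groupBy("patient_id").avg("heart_rate").show()
```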


2012 ◽  
Vol 39 (11) ◽  
pp. 948 ◽  
Author(s):  
Kenny Billiau ◽  
Heike Sprenger ◽  
Christian Schudoma ◽  
Dirk Walther ◽  
Karin I. Köhl

In plant breeding, plants have to be characterised precisely, consistently and rapidly by different people at several field sites within defined time spans. For a meaningful data evaluation and statistical analysis, standardised data storage is required. Data access must be provided on a long-term basis and be independent of organisational barriers without endangering data integrity or intellectual property rights. We discuss the associated technical challenges and demonstrate adequate solutions exemplified in a data management pipeline for a project to identify markers for drought tolerance in potato. This project involves 11 groups from academia and breeding companies, 11 sites and four analytical platforms. Our data warehouse concept combines central data storage in databases and a file server and integrates existing and specialised database solutions for particular data types with new, project-specific databases. The strict use of controlled vocabularies and the application of web-access technologies proved vital to successful data exchange between institutes with diverse data management concepts and infrastructures. By presenting our data management system and making the software available, we aim to support related phenotyping projects.
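As a small illustration of the controlled-vocabulary checks such a pipeline relies on, the following Python sketch validates incoming phenotype records before central storage. The trait names, units, and record fields are hypothetical, not those of the actual project.

```python
# Hypothetical controlled vocabulary: accepted trait names and their units.
CONTROLLED_TRAITS = {"leaf_area", "tuber_yield", "relative_water_content"}
CONTROLLED_UNITS = {"leaf_area": "cm2", "tuber_yield": "g", "relative_water_content": "%"}

def validate_record(record: dict) -> list[str]:
    """Return a list of violations; an empty list means the record may be stored."""
    errors = []
    if record.get("trait") not in CONTROLLED_TRAITS:
        errors.append(f"unknown trait {record.get('trait')!r}")
    expected = CONTROLLED_UNITS.get(record.get("trait"))
    if expected and record.get("unit") != expected:
        errors.append(f"unit {record.get('unit')!r}, expected {expected!r}")
    return errors

# A record from one field site with a unit mismatch is rejected at entry time.
print(validate_record({"trait": "tuber_yield", "unit": "kg"}))
```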


2006 ◽  
Vol 2 (4) ◽  
pp. 193-209 ◽  
Author(s):  
Mieso K. Denko ◽  
Hua Lu

A mobile ad hoc network (MANET) is a collection of wireless mobile nodes that forms a temporary network without the aid of a fixed communication infrastructure. Since every node can be mobile and network topology changes can occur frequently, node disconnection is a common mode of operation in MANETs. Providing reliable data access and message delivery is a challenge in this dynamic network environment. Caching and replica allocation within the network can improve data accessibility by storing the data and accessing them locally. However, maintaining data consistency among replicas becomes a challenging problem. Hence, balancing data accessibility and consistency is an important step toward data management in MANETs. In this paper, we propose a replica-based data-storage mechanism and undelivered-message queue schemes to provide reliable data storage and dissemination. We also propose replica update strategies to maintain data consistency while improving data accessibility. These solutions are based on a clustered MANET where nodes in the network are divided into small groups that are suitable for localized data management. The goal is to reduce communication overhead, support localized computation, and enhance scalability. A simulation environment was built using the NS-2 network simulator to evaluate the performance of the proposed schemes. The results show that our schemes distribute replicas effectively, provide high data accessibility rates and maintain consistency.
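As one way to picture a replica update strategy of this kind, the following Python sketch implements a generic version-based (last-writer-wins) update within a single cluster; it is an illustrative stand-in under simplified assumptions, not the authors' exact scheme.

```python
# Version-based replica updating inside one MANET cluster: a cluster head
# pushes updates to reachable replicas, and stale updates are rejected.
from dataclasses import dataclass, field

@dataclass
class Replica:
    data: dict = field(default_factory=dict)      # item_id -> value
    versions: dict = field(default_factory=dict)  # item_id -> version number

    def write(self, item_id: str, value, version: int) -> bool:
        """Apply an update only if it is newer than the local copy."""
        if version > self.versions.get(item_id, -1):
            self.data[item_id] = value
            self.versions[item_id] = version
            return True
        return False

def propagate(cluster: list[Replica], item_id: str, value, version: int) -> int:
    """Cluster head pushes an update to all reachable replicas; returns the hit count."""
    return sum(r.write(item_id, value, version) for r in cluster)

cluster = [Replica() for _ in range(3)]
propagate(cluster, "sensor-7", 21.5, version=1)
propagate(cluster, "sensor-7", 19.0, version=0)  # stale update is rejected everywhere
```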


2021 ◽  
Vol 13 (4) ◽  
pp. 2005
Author(s):  
Junnan Liu ◽  
Haiyan Liu ◽  
Xiaohui Chen ◽  
Xuan Guo ◽  
Qingbo Zhao ◽  
...  

Information resources have increased rapidly in the big data era. Geospatial data plays an indispensable role in spatially informed analyses, yet data in different areas remain relatively isolated, and relational databases are inadequate for handling the many semantic intricacies involved in retrieving geospatial data. In light of this, a heterogeneous retrieval method based on a knowledge graph is proposed in this paper. The method has three advantages: (1) the semantic knowledge of geospatial data is considered; (2) more of the information required by users can be obtained; (3) data retrieval speed can be improved. Firstly, implicit semantic knowledge is studied and applied to construct a knowledge graph, integrating semantics from multi-source heterogeneous geospatial data. Then, query expansion rules and mappings between the knowledge graph and the database are designed to construct retrieval statements and obtain related spatial entities. Finally, the effectiveness and efficiency are verified through comparative analysis and practical application. The experiments indicate that the method can automatically construct database retrieval statements and retrieve more relevant data. Additionally, users can reduce their dependence on the data storage mode and on database Structured Query Language (SQL) syntax. This work can facilitate the sharing and dissemination of geospatial knowledge for various spatial studies.
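The following minimal Python sketch illustrates the two steps described above: expanding a query term through a toy knowledge graph, then mapping the expanded concepts onto a database retrieval statement. The concept graph, table, and column names are hypothetical illustrations, not the paper's actual rules.

```python
# Toy knowledge graph: concept -> semantically related concepts.
KG = {
    "river": {"stream", "waterway"},
    "stream": {"creek"},
}

def expand(term: str) -> set[str]:
    """Collect the query term plus everything reachable through the graph."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(KG.get(t, ()))
    return seen

def build_sql(term: str) -> str:
    """Map the expanded concepts onto a database retrieval statement."""
    labels = ", ".join(f"'{t}'" for t in sorted(expand(term)))
    return f"SELECT * FROM geo_features WHERE feature_type IN ({labels});"

print(build_sql("river"))
# SELECT * FROM geo_features WHERE feature_type IN ('creek', 'river', 'stream', 'waterway');
```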


IEEE Software ◽  
2020 ◽  
Vol 37 (2) ◽  
pp. 89-94
Author(s):  
Bob van Luijt ◽  
Micha Verhagen
