Options to streamline and enrich biodiversity data aggregation

2018 ◽  
Vol 2 ◽  
pp. e26808
Author(s):  
Donald Hobern ◽  
Andrea Hahn ◽  
Tim Robertson

The success of Darwin Core and ABCD Schema as flexible standards for sharing specimen data and species occurrence records has enabled GBIF to aggregate around one billion data records. At the same time, other thematic, national or regional aggregators have developed a wide range of other data indexes and portals, many of which enrich the data by interpreting and normalising elements not currently handled by GBIF or by linking other data from geospatial layers, trait databases, etc. Unfortunately, although each of these aggregators has specific strengths and supports particular audiences, this diversification produces many weaknesses and deficiencies for data publishers and for data users, including: incomplete and inconsistent inclusion of relevant datasets; proliferation of record identifiers; inconsistent and bespoke workflows to interpret and standardise data; absence of any shared basis for linked open data and annotations; divergent data formats and APIs; lack of clarity around provenance and impact; etc. The time is ripe for the global community to review these processes. From a technical standpoint, it would be feasible to develop a shared, integrated pipeline which harvested, validated and normalised all relevant biodiversity data records on behalf of all stakeholders. Such a system could build on TDWG expertise to standardise data checks and all stages in data transformation. It could incorporate a modular structure that allowed thematic, national or regional networks to generate additional data elements appropriate to the needs of their users, but for all of these elements to remain part of a single record with a single identifier, facilitating a much more rigorous approach to linked open data. Most of the other issues we currently face around fitness-for-use, predictability and repeatability, transparency and provenance could be supported much more readily under such a model. The key challenges that would need to be overcome would be around social factors, particularly to deliver a flexible and appropriate governance model and to allow research networks, national agencies, etc. to embed modular components within a shared workflow. Given the urgent need to improve data management to support Essential Biodiversity Variables and to deliver an effective global virtual natural history collection, we should review these challenges and seek to establish a data management and aggregation architecture that will support us for the coming decades.
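The abstract above argues for a shared, modular pipeline that harvests, validates and normalises occurrence records on behalf of all networks while keeping a single record with a single identifier. A minimal sketch of that idea follows, assuming Darwin Core term names (which are real) but with all function names, validation rules and the notion of pluggable "modules" purely illustrative of the proposed architecture rather than any existing GBIF implementation.

```python
# Sketch: a shared validation/normalisation core plus optional thematic modules,
# all operating on the same record under one identifier. Assumptions: Darwin Core
# field names are real; the rules and module mechanism are illustrative only.
from typing import Callable, Dict, List

Record = Dict[str, object]

def validate_coordinates(rec: Record) -> List[str]:
    """Return data-quality flags for the record's coordinates."""
    flags = []
    try:
        lat = float(rec.get("decimalLatitude", ""))
        lon = float(rec.get("decimalLongitude", ""))
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            flags.append("COORDINATES_OUT_OF_RANGE")
    except (TypeError, ValueError):
        flags.append("COORDINATES_UNPARSABLE")
    return flags

def normalise_basis_of_record(rec: Record) -> Record:
    """Map free-text basisOfRecord values onto a controlled vocabulary."""
    vocab = {"preserved specimen": "PreservedSpecimen", "observation": "HumanObservation"}
    raw = str(rec.get("basisOfRecord", "")).strip().lower()
    rec["basisOfRecord"] = vocab.get(raw, rec.get("basisOfRecord"))
    return rec

def run_pipeline(rec: Record, modules: List[Callable[[Record], Record]]) -> Record:
    """Shared core checks, then thematic/regional enrichment modules, same record."""
    rec["issues"] = validate_coordinates(rec)
    rec = normalise_basis_of_record(rec)
    for module in modules:  # e.g. a regional network adding extra data elements
        rec = module(rec)
    return rec

if __name__ == "__main__":
    occurrence = {"occurrenceID": "urn:example:1", "scientificName": "Puma concolor",
                  "decimalLatitude": "12.5", "decimalLongitude": "-71.9",
                  "basisOfRecord": "Preserved Specimen"}
    print(run_pipeline(occurrence, modules=[]))
```

Because every module writes into the same record rather than creating a new copy with a new identifier, downstream linked-open-data annotation can target one stable identifier, which is the crux of the proposal.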

Author(s):  
Lyubomir Penev ◽  
Teodor Georgiev ◽  
Viktor Senderov ◽  
Mariya Dimitrova ◽  
Pavel Stoev

As one of the first advocates of open access and open data in the field of biodiversity publishing, Pensoft has adopted a multiple data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere: (1) data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details; (2) data are deposited in trusted repositories and/or supplementary files and described in data papers, which may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata; (3) integrated narrative and data publishing is realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article; (4) data are published in structured, semantically enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines; (5) Linked Open Data (LOD) are extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.
The above-mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as the Global Biodiversity Information Facility (GBIF), Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1, 2); (3) semantic markup of the article texts in the TaxPub format, facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3, 4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoF into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archives and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under their own DOIs to the Biodiversity Literature Repository, BLR (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5). These approaches represent different aspects of the prospective scholarly publishing of biodiversity data which, in combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the groundwork for an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank, OpenBiodiv and various end users.
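Points (2) and (4) above rely on public web services of the aggregators. A minimal sketch of that kind of integration is shown below, using GBIF's public occurrence search API (the endpoint URL and the scientificName/limit parameters are real); how the returned fields are assembled into a citation-like line is purely illustrative and not the actual ARPHA import format.

```python
# Sketch: fetch specimen/occurrence records from GBIF's occurrence search API
# and format a few Darwin Core fields into a human-readable line.
import requests

def fetch_gbif_occurrences(name: str, limit: int = 5):
    resp = requests.get(
        "https://api.gbif.org/v1/occurrence/search",
        params={"scientificName": name, "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

def as_material_line(occ: dict) -> str:
    # Illustrative formatting only; real manuscript import uses richer templates.
    parts = [occ.get("country"), occ.get("locality"), occ.get("eventDate"),
             occ.get("institutionCode"), occ.get("catalogNumber")]
    return "; ".join(str(p) for p in parts if p)

if __name__ == "__main__":
    for occ in fetch_gbif_occurrences("Harmonia axyridis"):
        print(as_material_line(occ))
```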


2020 ◽  
Vol 6 ◽  
Author(s):  
Christoph Steinbeck ◽  
Oliver Koepler ◽  
Felix Bach ◽  
Sonja Herres-Pawlis ◽  
Nicole Jung ◽  
...  

The vision of NFDI4Chem is the digitalisation of all key steps in chemical research to support scientists in their efforts to collect, store, process, analyse, disclose and re-use research data. Measures to promote Open Science and Research Data Management (RDM) in agreement with the FAIR data principles are fundamental aims of NFDI4Chem, serving the chemistry community with a holistic concept for access to research data. To this end, the overarching objective is the development and maintenance of a national research data infrastructure for the research domain of chemistry in Germany, and to enable innovative and easy-to-use services and novel scientific approaches based on the re-use of research data. NFDI4Chem intends to represent all disciplines of chemistry in academia and aims to collaborate closely with thematically related consortia. In the initial phase, NFDI4Chem focuses on data related to molecules and reactions, including data for their experimental and theoretical characterisation. This overarching goal is achieved by working towards a number of key objectives:
Key Objective 1: Establish a virtual environment of federated repositories for storing, disclosing, searching and re-using research data across distributed data sources. Connect existing data repositories and, based on a requirements analysis, establish domain-specific research data repositories for the national research community, and link them to international repositories.
Key Objective 2: Initiate international community processes to establish minimum information (MI) standards for data and machine-readable metadata, as well as open data standards, in key areas of chemistry. Identify and recommend open data standards in key areas of chemistry in order to support the FAIR principles for research data, and develop new standards where they are lacking (a minimal metadata sketch follows after this list).
Key Objective 3: Foster cultural and digital change towards Smart Laboratory Environments by promoting the use of digital tools in all stages of research, and promote subsequent Research Data Management (RDM) at all levels of academia, beginning in undergraduate curricula.
Key Objective 4: Engage with the chemistry community in Germany through a wide range of measures to create awareness of and foster the adoption of FAIR data management. Initiate processes to integrate RDM and data science into curricula. Offer a wide range of training opportunities for researchers.
Key Objective 5: Explore synergies with other consortia and promote cross-cutting development within the NFDI.
Key Objective 6: Provide a legally reliable framework of policies and guidelines for FAIR and open RDM.
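As a rough illustration of the machine-readable metadata called for in Key Objective 2, the sketch below serialises a dataset description as JSON-LD using schema.org terms. This is only an assumption about what such a record could look like; the actual minimum information standards and metadata profiles NFDI4Chem adopts are not specified here, and the dataset name and DOI are placeholders.

```python
# Sketch: a machine-readable dataset description (JSON-LD, schema.org terms).
# The field choices are illustrative, not an NFDI4Chem-endorsed profile.
import json

dataset_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "NMR spectra of a model copper(II) complex",   # hypothetical dataset
    "identifier": "https://doi.org/10.xxxx/example",        # placeholder DOI
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["NMR", "coordination chemistry", "FAIR"],
    "measurementTechnique": "1H NMR spectroscopy",
    "variableMeasured": [
        {"@type": "PropertyValue", "name": "chemical shift", "unitText": "ppm"}
    ],
}

print(json.dumps(dataset_metadata, indent=2))
```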


2019 ◽  
Vol 2 ◽  
Author(s):  
Lyubomir Penev

"Data ownership" is actually an oxymoron, because there could not be a copyright (ownership) on facts or ideas, hence no data onwership rights and law exist. The term refers to various kinds of data protection instruments: Intellectual Property Rights (IPR) (mostly copyright) asserted to indicate some kind of data ownership, confidentiality clauses/rules, database right protection (in the European Union only), or personal data protection (GDPR) (Scassa 2018). Data protection is often realised via different mechanisms of "data hoarding", that is witholding access to data for various reasons (Sieber 1989). Data hoarding, however, does not put the data into someone's ownership. Nonetheless, the access to and the re-use of data, and biodiversuty data in particular, is hampered by technical, economic, sociological, legal and other factors, although there should be no formal legal provisions related to copyright that may prevent anyone who needs to use them (Egloff et al. 2014, Egloff et al. 2017, see also the Bouchout Declaration). One of the best ways to provide access to data is to publish these so that the data creators and holders are credited for their efforts. As one of the pioneers in biodiversity data publishing, Pensoft has adopted a multiple-approach data publishing model, resulting in the ARPHA-BioDiv toolbox and in extensive Strategies and Guidelines for Publishing of Biodiversity Data (Penev et al. 2017a, Penev et al. 2017b). ARPHA-BioDiv consists of several data publishing workflows: Deposition of underlying data in an external repository and/or its publication as supplementary file(s) to the related article which are then linked and/or cited in-tex. Supplementary files are published under their own DOIs to increase citability). Description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the systme allows for data papers to be submitted both as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata. Import of structured data into the article text from tables or via web services and their susequent download/distribution from the published article as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal. Publication of data in structured, semanticaly enriched, full-text XMLs where data elements are machine-readable and easy-to-harvest. Extraction of Linked Open Data (LOD) from literature, which is then converted into interoperable RDF triples (in accordance with the OpenBiodiv-O ontology) (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph Deposition of underlying data in an external repository and/or its publication as supplementary file(s) to the related article which are then linked and/or cited in-tex. Supplementary files are published under their own DOIs to increase citability). Description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the systme allows for data papers to be submitted both as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata. Import of structured data into the article text from tables or via web services and their susequent download/distribution from the published article as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal. 
Publication of data in structured, semanticaly enriched, full-text XMLs where data elements are machine-readable and easy-to-harvest. Extraction of Linked Open Data (LOD) from literature, which is then converted into interoperable RDF triples (in accordance with the OpenBiodiv-O ontology) (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph In combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, these approaches show different angles to the future of biodiversity data publishing and, lay the foundations of an entire data publishing ecosystem in the field, while also supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as Global Biodiversity Information Facility (GBIF), Biodiversity Literature Repository (BLR), Plazi TreatmentBank, OpenBiodiv, as well as to various end users.
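Workflow (5) above turns statements extracted from articles into RDF triples. A minimal sketch with the rdflib library is given below; the namespace, class and property names are invented for illustration and do not reproduce the actual OpenBiodiv-O ontology terms.

```python
# Sketch: expressing a taxon name usage in an article as RDF triples with rdflib.
# Namespace and property names are illustrative, not OpenBiodiv-O itself.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/openbiodiv-sketch/")

g = Graph()
article = URIRef(EX["article/1"])
mention = URIRef(EX["taxon-mention/1"])

g.add((article, RDF.type, EX.Article))
g.add((article, DCTERMS.title, Literal("A new species of an example family")))
g.add((mention, RDF.type, EX.TaxonNameUsage))
g.add((mention, EX.mentionedIn, article))
g.add((mention, EX.verbatimName, Literal("Examplella nova sp. n.")))

print(g.serialize(format="turtle"))
```

Once serialised, such triples can be loaded into a triple store and queried alongside triples derived from other articles, which is what turns individually published papers into a knowledge graph.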


Author(s):  
Cecilia Avila-Garzon

Advances in semantic web technologies have led to rapid growth in the volume of linked data published on the web. In this regard, linked open data (LOD) has long been a topic of great interest in a wide range of fields (e.g. open government, business, culture, education, etc.). This article reports the results of a systematic literature review on LOD. 250 articles were reviewed to provide a general overview of the current applications, technologies, and methodologies for LOD. The main findings include: i) most of the studies conducted so far focus on the use of semantic web technologies and tools applied to contexts such as biology, social sciences, libraries, research, and education; ii) there is a lack of research with regard to a standardized methodology for managing LOD; and iii) plenty of tools can be used for managing LOD, but most of them lack user-friendly interfaces for querying datasets.
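In practice, querying a LOD dataset without a friendly interface usually means writing SPARQL against a public endpoint, which is what finding iii) alludes to. A minimal sketch using the SPARQLWrapper library and the public DBpedia endpoint follows; the endpoint and library are real, while the specific query is only an example.

```python
# Sketch: querying a Linked Open Data endpoint (DBpedia) with SPARQL.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?language ?label WHERE {
        ?language a dbo:ProgrammingLanguage ;
                  rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)

# Print the English labels of a few programming-language resources.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["label"]["value"])
```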


2015 ◽  
Vol 6 (2) ◽  
pp. 224-231 ◽  
Author(s):  
Sarah Whitcher Kansa

The inability of journals and books to accommodate data and to make it reusable has led to the gradual loss of vast amounts of information. The practice of disseminating selected sub-sets of data (usually in summary tables) permits only very limited types of reuse, and thus hampers scholarship. In recent years, largely in response to increasing government and institutional requirements for full data access, the scholarly community is giving data more attention, and solutions for data management are emerging. However, seeing data management primarily as a matter of compliance means that the research community faces continued data loss, as many datasets enter repositories without adequate description to enable their reuse. Furthermore, because many archaeologists do not yet have experience in data reuse, they lack understanding of what “good” data management means in terms of their own research practices. This paper discusses Linked Open Data (LOD) as an approach to improving data description, intelligibility and discoverability to facilitate reuse. I present examples of how annotating zooarchaeology datasets with LOD can facilitate data integration without forcing standardization. I conclude by recognizing that data sharing is not without its challenges. However, the research community’s careful attention and recognition of datasets as valuable scholarly outputs will go a long way toward ensuring that the products of our work are more widely useful.
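The paper's central technical point is that datasets can keep their local, verbatim terminology and still be integrated, simply by annotating local terms with shared LOD concept identifiers. A minimal sketch of that pattern is below; the concept URIs are placeholders, not the actual vocabularies used in the paper.

```python
# Sketch: annotate a zooarchaeology table with shared concept URIs while keeping
# the original local labels untouched. URIs are placeholders for illustration.
import csv, io

local_to_lod = {
    "sheep/goat": "http://example.org/vocab/ovis-capra",   # placeholder concept URI
    "Bos taurus": "http://example.org/vocab/bos-taurus",
    "cattle":     "http://example.org/vocab/bos-taurus",   # two local labels, one concept
}

raw = "specimen_id,taxon\n1,sheep/goat\n2,cattle\n3,Bos taurus\n"

for row in csv.DictReader(io.StringIO(raw)):
    row["taxon_uri"] = local_to_lod.get(row["taxon"], "")
    print(row)
```

Because integration happens through the added URI column, two datasets that record "cattle" and "Bos taurus" can be queried together without either contributor being forced to change their recording conventions.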


Semantic Web ◽  
2020 ◽  
pp. 1-25
Author(s):  
Andre Gomes Regino ◽  
Julio Cesar dos Reis ◽  
Rodrigo Bonacin ◽  
Ahsan Morshed ◽  
Timos Sellis

RDF data has been extensively deployed to describe various types of resources in a structured way. Links between data elements described by RDF models stand at the core of the Semantic Web. The rising amount of structured data published in public RDF repositories, also known as Linked Open Data, illustrates the success of the global and unified dataset envisioned by the Semantic Web. Nowadays, semi-automatic algorithms build connections among these datasets by exploring a variety of methods. Interconnected open data demand automatic methods and tools to maintain their consistency over time. The update of linked data is considered a key process due to the evolutionary character of such structured datasets. However, data-changing operations might affect well-formed links, which makes it difficult to maintain the consistency of connections over time. In this article, we propose a thorough survey that provides a systematic review of the state of the art in link maintenance in the linked open data evolution scenario. We conduct a detailed analysis of the literature to characterise and understand methods and algorithms responsible for detecting, fixing and updating links between RDF data. Our investigation provides a categorisation of existing approaches as well as describes and discusses existing studies. The results reveal an absence of comprehensive solutions suited to fully detect, warn of and automatically maintain the consistency of linked data over time.
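To make the link-maintenance problem concrete, the sketch below shows one naive detection step: collecting the targets of owl:sameAs links in a local RDF graph and checking whether they still dereference. This is only an illustration of the kind of check surveyed in the article, under the assumption of a local Turtle file named local_dataset.ttl; real link-maintenance systems combine many such signals and repair strategies.

```python
# Sketch: flag owl:sameAs targets that no longer dereference (HTTP errors).
import requests
from rdflib import Graph
from rdflib.namespace import OWL

g = Graph()
g.parse("local_dataset.ttl", format="turtle")   # hypothetical input file

for subject, _, target in g.triples((None, OWL.sameAs, None)):
    try:
        status = requests.head(str(target), allow_redirects=True, timeout=10).status_code
    except requests.RequestException:
        status = None
    if status is None or status >= 400:
        print(f"possibly broken link: {subject} owl:sameAs {target} (status={status})")
```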


2020 ◽  
Author(s):  
Stephan van Gasselt ◽  
Andrea Nass

<p>Since the mid 1990s, off-the-shelf Geographic Information Systems (GIS) have been increasingly accepted as essential tools for data management, data analysis and visualization in the planetary sciences, in particular in planetary surface studies.</p><p>With that advance, small homebrew and niche solutions have been slowly abandoned in favor of commercial off-the-shelf (COTS) and established free and open-source software (FOSS) which are capable of providing a wide range of generic analyses tools.</p><p>This transition has likely been facilitated by three contemporaneous developments:</p><ol><li>the integrability and provision of planetary spheroid specifications with arbitrary radii definitions,</li> <li>the possibility to ingest planetary data in their native formats or to be able to use tools exporting data into common formats,</li> <li>the need to be able to ingest and co-register data at medium low (>200 m) as well as highest resolution (<5 m) at the same time as well as the need to make extensive use of digital terrain model analyses. These needs resulted from the release of data with varying spatial and temporal resolution initiated in the course of the Mars Global Surveyor mission.</li> </ol><p>To no surprise, user demands have been increasing over the last two decades due to high data-volume returns from Mars, the Moon and from Saturn’s satellites.</p><p>This particular development as well as an education which has been increasingly centered on spatial awareness helped shaping the landscape of spatial data management, data analysis and visualization supported by GIS technology. New challenges in these fields currently arise while other challenges just became more apparent and have been ghosting around for over 30 years without being solved thus far. Some of the new challenges evolve around the obvious need to be able to integrate large amounts of variable data, not only in terms of storing and managing, but also with respect to extracting meaningful information with purposeful tools as well as with respect to visualization. While the exponential data growth and the need for more sophisticated tools did certainly not come as a surprise, innovation and solutions to cope with such a demand lag far behind.</p><p>Open standards and stable interfaces allowing to extend functionalities have been demanded and discussed as essential challenge in GIS development for more than 30 years, and yet, “open data” has seemingly only recently become a market “vision”, and the future will show if interoperability will become bidirectional at the end. The relatively small planetary sciences community will need to come up (and has come up) with their own tools to extend GIS functionalities although that experience might be hampered by ever-changing interface specifications with new GIS releases rendering updates unsustainable on the long run. Other challenges, e.g., cartography of irregular bodies, cannot be addressed using additional tools as they target the very core of contemporary GIS tools.</p><p>In this presentation we will summarize and discuss recent challenges in Planetary GIS and focus on perspectives within a currently changing GIS landscape and try to address potential solutions and bypasses.</p>

