Data ownership and data publishing

2019 ◽  
Vol 2 ◽  
Author(s):  
Lyubomir Penev

"Data ownership" is actually an oxymoron, because there could not be a copyright (ownership) on facts or ideas, hence no data onwership rights and law exist. The term refers to various kinds of data protection instruments: Intellectual Property Rights (IPR) (mostly copyright) asserted to indicate some kind of data ownership, confidentiality clauses/rules, database right protection (in the European Union only), or personal data protection (GDPR) (Scassa 2018). Data protection is often realised via different mechanisms of "data hoarding", that is witholding access to data for various reasons (Sieber 1989). Data hoarding, however, does not put the data into someone's ownership. Nonetheless, the access to and the re-use of data, and biodiversuty data in particular, is hampered by technical, economic, sociological, legal and other factors, although there should be no formal legal provisions related to copyright that may prevent anyone who needs to use them (Egloff et al. 2014, Egloff et al. 2017, see also the Bouchout Declaration). One of the best ways to provide access to data is to publish these so that the data creators and holders are credited for their efforts. As one of the pioneers in biodiversity data publishing, Pensoft has adopted a multiple-approach data publishing model, resulting in the ARPHA-BioDiv toolbox and in extensive Strategies and Guidelines for Publishing of Biodiversity Data (Penev et al. 2017a, Penev et al. 2017b). ARPHA-BioDiv consists of several data publishing workflows: Deposition of underlying data in an external repository and/or its publication as supplementary file(s) to the related article which are then linked and/or cited in-tex. Supplementary files are published under their own DOIs to increase citability). Description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the systme allows for data papers to be submitted both as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata. Import of structured data into the article text from tables or via web services and their susequent download/distribution from the published article as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal. Publication of data in structured, semanticaly enriched, full-text XMLs where data elements are machine-readable and easy-to-harvest. Extraction of Linked Open Data (LOD) from literature, which is then converted into interoperable RDF triples (in accordance with the OpenBiodiv-O ontology) (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph Deposition of underlying data in an external repository and/or its publication as supplementary file(s) to the related article which are then linked and/or cited in-tex. Supplementary files are published under their own DOIs to increase citability). Description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the systme allows for data papers to be submitted both as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata. Import of structured data into the article text from tables or via web services and their susequent download/distribution from the published article as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal. 
Publication of data in structured, semanticaly enriched, full-text XMLs where data elements are machine-readable and easy-to-harvest. Extraction of Linked Open Data (LOD) from literature, which is then converted into interoperable RDF triples (in accordance with the OpenBiodiv-O ontology) (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph In combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, these approaches show different angles to the future of biodiversity data publishing and, lay the foundations of an entire data publishing ecosystem in the field, while also supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as Global Biodiversity Information Facility (GBIF), Biodiversity Literature Repository (BLR), Plazi TreatmentBank, OpenBiodiv, as well as to various end users.
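Workflow (5) hinges on expressing statements extracted from the literature as RDF triples. The Python sketch below, using the rdflib library, illustrates the general pattern; the namespace, class and property names are simplified placeholders for illustration only, not the actual OpenBiodiv-O vocabulary.

```python
# Minimal sketch: expressing an extracted taxonomic statement as RDF triples.
# The namespace and the class/property names are illustrative placeholders,
# NOT the real OpenBiodiv-O terms.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/openbiodiv-sketch/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)
g.bind("dcterms", DCTERMS)

treatment = EX["treatment-0001"]            # a taxonomic treatment extracted from an article
taxon = EX["taxon-Harmonia-axyridis"]       # the taxon the treatment is about

g.add((treatment, RDF.type, EX.TaxonomicTreatment))
g.add((treatment, DCTERMS.isPartOf, EX["article-0001"]))
g.add((treatment, EX.mentionsTaxon, taxon))
g.add((taxon, RDF.type, EX.TaxonName))
g.add((taxon, EX.scientificName, Literal("Harmonia axyridis (Pallas, 1773)")))

print(g.serialize(format="turtle"))
```

Once such triples are generated for many articles, they can be loaded into a triple store and queried together, which is the role the OpenBiodiv Biodiversity Knowledge Graph plays in the workflow described above.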

Author(s):  
Lyubomir Penev ◽  
Teodor Georgiev ◽  
Viktor Senderov ◽  
Mariya Dimitrova ◽  
Pavel Stoev

As one of the first advocates of open access and open data in the field of biodiversity publishing, Pensoft has adopted a multiple-approach data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere: (1) data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details; (2) data are deposited in trusted repositories and/or as supplementary files and described in data papers; data papers may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata; (3) integrated narrative and data publishing is realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article; (4) data are published in structured, semantically enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines; (5) Linked Open Data (LOD) are extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.
The above-mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as the Global Biodiversity Information Facility (GBIF), the Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), the Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1, 2); (3) semantic markup of the article texts in the TaxPub format, facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3, 4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoF into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archives and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under their own DOIs to the Biodiversity Literature Repository, BLR (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5). These approaches represent different aspects of the prospective scholarly publishing of biodiversity data, which, in combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the foundations of an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank, OpenBiodiv, and various end users.
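The server-to-server import in point (4) relies on the public web services of the respective aggregators. As a minimal sketch (not Pensoft's actual integration), the Python snippet below queries the public GBIF occurrence search API for a taxon and keeps a few Darwin Core style fields that a manuscript table might need; field selection and error handling are simplified for illustration.

```python
# Minimal sketch: fetch occurrence records from the public GBIF occurrence
# search API (https://api.gbif.org/v1/occurrence/search). Field selection and
# error handling are simplified; this is not the actual Pensoft integration.
import requests

def fetch_gbif_occurrences(scientific_name: str, limit: int = 5) -> list[dict]:
    resp = requests.get(
        "https://api.gbif.org/v1/occurrence/search",
        params={"scientificName": scientific_name, "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    records = resp.json().get("results", [])
    # Keep only a handful of Darwin Core style fields for a manuscript table.
    keep = ("key", "scientificName", "country", "decimalLatitude",
            "decimalLongitude", "eventDate", "basisOfRecord")
    return [{k: rec.get(k) for k in keep} for rec in records]

if __name__ == "__main__":
    for row in fetch_gbif_occurrences("Harmonia axyridis"):
        print(row)
```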


Author(s):  
Lyubomir Penev ◽  
Dimitrios Koureas ◽  
Quentin Groom ◽  
Jerry Lanfear ◽  
Donat Agosti ◽  
...  

The Horizon 2020 project Biodiversity Community Integrated Knowledge Library (BiCIKL) (started 1 May 2021, duration 3 years) will build a new European community of key research infrastructures, researchers, citizen scientists and other stakeholders in biodiversity and the life sciences. Together, the 14 BiCIKL partners will solidify open science practices by providing access to data, tools and services at each stage of, and along the entire, biodiversity research and data life cycle (specimens, sequences, taxon names, analytics, publications, biodiversity knowledge graph) (Fig. 1, see also the BiCIKL kick-off presentation in Suppl. material 1), in compliance with the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. The existing services provided by the participating infrastructures will expand through the development and adoption of shared, common or interoperable domain standards, resulting in liberated and enhanced flows of data and knowledge across these domains. BiCIKL puts a special focus on the biodiversity literature. Over the span of the project, BiCIKL will develop new methods and workflows for semantic publishing and for integrated access to harvesting, liberating, linking and re-using sub-article-level data extracted from literature (i.e., specimens, material citations, sequences, taxonomic names, taxonomic treatments, figures, tables). Data linkages may be realised with different technologies (e.g., data warehousing, linking between FAIR Data Objects, Linked Open Data) and can be bilateral (between two data infrastructures) or multilateral (among multiple data infrastructures). The main challenge of BiCIKL is to design, develop and implement a FAIR Data Place (FDP), a central tool for the search, discovery and management of interlinked FAIR data across different domains. The key final output of BiCIKL will be the future Biodiversity Knowledge Hub (BKH), a one-stop portal providing access to the BiCIKL services, tools and workflows beyond the lifetime of the project.


Author(s):  
Dag Endresen ◽  
Armine Abrahamyan ◽  
Akobir Mirzorakhimov ◽  
Andreas Melikyan ◽  
Brecht Verstraete ◽  
...  

BioDATA (Biodiversity Data for Internationalisation in Higher Education) is an international project to develop and deliver biodiversity data training for undergraduate and postgraduate students from Armenia, Belarus, Tajikistan, and Ukraine. By training early career (student) biodiversity scholars, we aim to turn the current academic and educational biodiversity landscape into a more open-data-friendly one. Professional practitioners (researchers, museum curators, and collection managers involved in data publishing) from each country were also invited to join the project as assistant teachers (mentors). The project is developed by the Research School in Biosystematics - ForBio and the Norwegian GBIF node, both at the Natural History Museum of the University of Oslo, in collaboration with the Secretariat of the Global Biodiversity Information Facility (GBIF) and partners from each of the target countries. The teaching material is based on the GBIF curriculum for data mobilization, and all students will have the opportunity to gain the respective GBIF certification. All materials are made freely available for reuse, and even in this very early phase of the project we have already seen the first successful reuse of teaching materials among the project partners. The first BioDATA training event was organized in Minsk (Belarus) in February 2019 with the objective of training a minimum of four mentors from each target country. The mentor-trainees from this event will help us teach the course to students in their home countries, together with teachers from the project team. BioDATA mentors will have the opportunity to gain GBIF certification as expert mentors, which will open opportunities to contribute to future training events in the larger GBIF network. The BioDATA training events for the students will take place in Dushanbe (Tajikistan) in June 2019, in Minsk (Belarus) in November 2019, in Yerevan (Armenia) in April 2020, and in Kiev (Ukraine) in October 2020. Students from each country are invited to express their interest in participating by contacting their national project partner. We will close the project with a final symposium at the University of Oslo in March 2021. The project is funded by the Norwegian Agency for International Cooperation and Quality Enhancement in Higher Education (DIKU).


Author(s):  
Jennifer Hammock ◽  
Katja Schulz

The Encyclopedia of Life currently hosts ~8M attribute records for ~400k taxa (March 2019, not including geographic categories, Fig. 1). Our aggregation priorities include Essential Biodiversity Variables (Kissling et al. 2018) and other global scale research data priorities. Our primary strategy remains partnership with specialist open data aggregators; we are also developing tools for the deployment of evolutionarily conserved attribute values that scale quickly for global taxonomic coverage, for instance: tissue mineralization type (aragonite, calcite, silica...); trophic guild in certain clades; sensory modalities. To support the aggregation and integration of trait information, data sets should be well structured, properly annotated and free of licensing or contractual restrictions so that they are ‘findable, accessible, interoperable, and reusable’ for both humans and machines (FAIR principles; Wilkinson et al. 2016). To this end, we are improving the documentation of protocols for the transformation, curation, and analysis of EOL data, and associated scripts and software are made available to ensure reproducibility. Proper acknowledgement of contributors and tracking of credit through derived data products promote both open data sharing and the use of aggregated resources. By exposing unique identifiers for data products, people, and institutions, data providers and aggregators can stimulate the development of automated solutions for the creation of contribution metrics. Since different aspects of provenance will be significant depending on the intended data use, better standardization of contributor roles (e.g., author, compiler, publisher, funder) is needed, as well as more detailed attribution guidance for data users. Global scale biodiversity data resources should resolve into a graph, linking taxa, specimens, occurrences, attributes, localities, and ecological interactions, as well as human agents, publications and institutions. Two key data categories for ensuring rich connectivity in the graph will be taxonomic and trait data. This graph can be supported by existing data hubs, if they share identifiers and/or create mappings between them, using standards and sharing practices developed by the biodiversity data community. Versioned archives of the combined graph could be published at intervals to appropriate open data repositories, and open source tools and training provided for researchers to access the combined graph of biodiversity knowledge from all sources. To achieve this, good communication among data hubs will be needed. We will need to share information about preferred vocabularies and identifier management practices, and collaborate on identifier mappings.
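As a rough illustration of such a linked graph (a toy sketch, not an actual EOL or GBIF data model), the Python snippet below uses the networkx library to connect a taxon, a trait value, a specimen, a publication and a contributor through shared identifiers; all identifiers and relation names are invented for the example.

```python
# Toy illustration of a biodiversity knowledge graph linking taxa, traits,
# specimens, publications and agents through shared identifiers. All node
# identifiers and relation names are invented placeholders.
import networkx as nx

g = nx.MultiDiGraph()

# Nodes: each carries a stable identifier and a type.
g.add_node("taxon:0001", type="Taxon", name="Harmonia axyridis")
g.add_node("trait:0001", type="TraitValue", attribute="trophic guild", value="predator")
g.add_node("specimen:0001", type="Specimen")
g.add_node("doi:10.9999/example", type="Publication")
g.add_node("orcid:0000-0000-0000-0000", type="Agent")

# Edges: typed relations that let different data hubs cross-link their records.
g.add_edge("taxon:0001", "trait:0001", relation="hasTraitValue")
g.add_edge("specimen:0001", "taxon:0001", relation="identifiedAs")
g.add_edge("trait:0001", "doi:10.9999/example", relation="supportedBy")
g.add_edge("doi:10.9999/example", "orcid:0000-0000-0000-0000", relation="createdBy")

# Simple provenance query: which publications support trait values of this taxon?
for _, trait, d in g.out_edges("taxon:0001", data=True):
    if d["relation"] == "hasTraitValue":
        sources = [v for _, v, e in g.out_edges(trait, data=True)
                   if e["relation"] == "supportedBy"]
        print(trait, "->", sources)
```

The point of the sketch is that provenance and credit queries become straightforward once identifiers are shared or mapped across hubs, which is exactly the communication problem the abstract identifies.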


2018 ◽  
Vol 2 ◽  
pp. e26808
Author(s):  
Donald Hobern ◽  
Andrea Hahn ◽  
Tim Robertson

The success of Darwin Core and ABCD Schema as flexible standards for sharing specimen data and species occurrence records has enabled GBIF to aggregate around one billion data records. At the same time, other thematic, national or regional aggregators have developed a wide range of other data indexes and portals, many of which enrich the data by interpreting and normalising elements not currently handled by GBIF or by linking other data from geospatial layers, trait databases, etc. Unfortunately, although each of these aggregators has specific strengths and supports particular audiences, this diversification produces many weaknesses and deficiencies for data publishers and for data users, including: incomplete and inconsistent inclusion of relevant datasets; proliferation of record identifiers; inconsistent and bespoke workflows to interpret and standardise data; absence of any shared basis for linked open data and annotations; divergent data formats and APIs; lack of clarity around provenance and impact; etc. The time is ripe for the global community to review these processes. From a technical standpoint, it would be feasible to develop a shared, integrated pipeline which harvested, validated and normalised all relevant biodiversity data records on behalf of all stakeholders. Such a system could build on TDWG expertise to standardise data checks and all stages in data transformation. It could incorporate a modular structure that allowed thematic, national or regional networks to generate additional data elements appropriate to the needs of their users, but for all of these elements to remain part of a single record with a single identifier, facilitating a much more rigorous approach to linked open data. Most of the other issues we currently face around fitness-for-use, predictability and repeatability, transparency and provenance could be supported much more readily under such a model. The key challenges that would need to be overcome would be around social factors, particularly to deliver a flexible and appropriate governance model and to allow research networks, national agencies, etc. to embed modular components within a shared workflow. Given the urgent need to improve data management to support Essential Biodiversity Variables and to deliver an effective global virtual natural history collection, we should review these challenges and seek to establish a data management and aggregation architecture that will support us for the coming decades.
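To give a flavour of what standardised data checks in such a shared pipeline might look like (a simplified sketch under assumed rules, not the actual GBIF interpretation pipeline or a TDWG standard), the Python snippet below validates and normalises a single Darwin Core style occurrence record; the chosen terms and rules are illustrative only.

```python
# Simplified sketch of a standardised check/normalisation step for one
# Darwin Core style occurrence record. The required terms and rules below
# are illustrative, not the actual GBIF interpretation pipeline.
from datetime import date

REQUIRED_TERMS = ("occurrenceID", "scientificName", "basisOfRecord", "eventDate")

def validate_and_normalise(record: dict) -> tuple[dict, list[str]]:
    issues = []
    rec = dict(record)

    for term in REQUIRED_TERMS:
        if not rec.get(term):
            issues.append(f"missing required term: {term}")

    # Normalise coordinates: must be decimal degrees within valid ranges.
    try:
        lat = float(rec.get("decimalLatitude", ""))
        lon = float(rec.get("decimalLongitude", ""))
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            issues.append("coordinates out of range")
        else:
            rec["decimalLatitude"], rec["decimalLongitude"] = round(lat, 6), round(lon, 6)
    except ValueError:
        issues.append("coordinates not interpretable as decimal degrees")

    # Flag obviously impossible event dates (e.g. in the future).
    event = str(rec.get("eventDate", ""))
    if event[:4].isdigit() and int(event[:4]) > date.today().year:
        issues.append("eventDate lies in the future")

    return rec, issues

record = {"occurrenceID": "urn:example:occ:1", "scientificName": "Harmonia axyridis",
          "basisOfRecord": "HumanObservation", "eventDate": "2018-06-14",
          "decimalLatitude": "50.85", "decimalLongitude": "4.35"}
print(validate_and_normalise(record))
```

In a shared pipeline, a common set of such checks would run once, on behalf of all aggregators, with each record keeping a single identifier and an explicit list of interpretation issues.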


Author(s):  
Gautam Talukdar ◽  
Andrew Townsend Peterson ◽  
Vinod Mathur

In India, biodiversity data and information are gaining significance for sustainable development and for preparing National Biodiversity Strategies and Action Plans (NBSAPs). Civil societies and individuals are seeking open access to data and information generated with public funds, whereas sensitivity requirements often demand restrictions on the availability of sensitive data. In India, the traditional classification of data for sharing was based on the "Open Series Data" model, i.e., data not specifically included remain inaccessible. The National Data Sharing and Accessibility Policy (NDSAP; Anonymous 2012, Suppl. material 1), published in 2012, produced a new data sharing framework more focused on the declaration of data as closed. NDSAP is a clear statement that data produced by the Government of India should be shared openly. Although much of the verbiage is focused on sharing within the Government to meet national goals, the document does include clear statements about sharing with the public. The policy is intended to apply "to all data and information created, generated, collected and archived using public funds provided by the Government of India". It is quite clear that it should apply to all such data, and that such data should be categorized into three access categories: open-access, registered-access, or restricted-access. Although NDSAP does not offer much guidance about what sorts of data should fall into each of these categories, it clearly focuses on data sensitive in terms of national security (i.e., data that must be restricted), such as high-resolution satellite imagery of disputed border regions. Institutions collecting biodiversity data usually include primary, research-grade data in the restricted-access category and secondary/derived data (e.g., vegetation maps, species distribution maps) in the open- or registered-access category. This conservative approach of not making biodiversity data easily accessible is not in accordance with the NDSAP policy, which emphasizes the openness of data. It also counters the main currents in science, which are shifting massively in the direction of opening access to data. Though NDSAP was intended for full implementation by 2014, its uptake by the institutions engaged in primary biodiversity data collection has been slow, mainly because providing primary data in some cases can endanger elements of the natural world, and because many researchers wish to keep the data that result from their research activities shielded from full, open access out of a desire to retain control of those data for future analysis or publication. Biodiversity data collected as part of institutional activities belong, in some sense, to the institution, and the institution should value such data over the long term. If institutions curate their biodiversity data for posterity, they can reap the benefits. Imagine the returns if biodiversity data from current ongoing projects were to be compared to data collected 50-100 years later.
Thus, organizations should emphasize the long-term view of institutionalizing data resources through fair data restrictions and an emphasis on public access, rather than on individual rights and control. This approach may be debatable, but we reckon that it will translate into massive science pay-offs.


2021 ◽  
Author(s):  
Björn Reetz ◽  
Hella Riede ◽  
Dirk Fuchs ◽  
Renate Hagedorn

Since 2017, Open Data has been a part of the DWD data distribution strategy. Starting with a small selection of meteorological products, the number of available datasets has grown continuously over the last years. Since the start, users have been able to access datasets anonymously via the website https://opendata.dwd.de to download file-based meteorological products. Free access and the variety of products have been welcomed by the general public as well as by private met service providers. The more datasets are provided in a directory structure, however, the more tedious it is to find and select among all available data. Also, metadata and documentation were available, but on separate public websites. This turned out to be an issue, especially for new users of DWD's open data.
To help users explore the available datasets, as well as to quickly decide on their suitability for a certain use case, the Open Data team at DWD is developing a geoportal. It enables free-text search, combined access to data, metadata and descriptions, and interactive previews via OGC WMS.
Cloud technology is a suitable way forward for hosting the geoportal along with the data in its operational state. Benefits are expected from the easy integration of rich APIs with the geoportal and from the flexible and fast deployment and scaling of optional or prototypical services such as WMS-based previews. Flexibility is also mandatory to respond to fluctuating user demands, depending on the time of day and on critical weather situations, which is supported by containerization. The growing overall volume of meteorological data at DWD may make it necessary to allow customers to bring their code to the data (for on-demand processing, including slicing and interpolation) instead of transferring files to every customer. Shared cloud instances are the ideal interface for this purpose.
The contribution will outline a prototype version of the new geoportal and discuss further steps for launching it to the public.
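As a simple illustration of the anonymous, file-based access described above, the Python sketch below downloads one product file from https://opendata.dwd.de; the directory path and file name are hypothetical placeholders and would have to be taken from the actual directory listing on the site.

```python
# Minimal sketch: anonymous download of a file-based product from DWD Open Data.
# The directory path and file name below are hypothetical placeholders; real
# paths must be looked up in the directory tree at https://opendata.dwd.de.
import requests

BASE_URL = "https://opendata.dwd.de"
PRODUCT_PATH = "/weather/example_product/example_file.grib2"  # hypothetical path

def download_product(path: str, target: str) -> None:
    with requests.get(BASE_URL + path, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(target, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)

if __name__ == "__main__":
    download_product(PRODUCT_PATH, "example_file.grib2")
```

The planned geoportal would sit in front of this kind of directory-based access, adding free-text search, metadata and WMS previews so that users can assess a dataset before downloading it.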


Author(s):  
Julián Rojas ◽  
Bert Marcelis ◽  
Eveline Vlassenroot ◽  
Mathias Van Compernolle ◽  
Pieter Colpaert ◽  
...  

Chapter 8 in the edited volume Situating Open Data.


2018 ◽  
Vol 12 (2) ◽  
pp. 179-220
Author(s):  
Jozef Andraško ◽  
Matúš Mesarčík

New technologies have irreversibly changed the nature of the traditional way of exercising the right to free access to information. In the current information society, the information available to public authorities is not just a tool for controlling the public administration and increasing its transparency. Information has become an asset that individuals and legal entities also seek to use for business purposes. Public sector information (PSI), particularly in the form of open data, creates new opportunities for developing and improving the performance of public administration. In that regard, the authors analyze the term open data and its legal framework from the perspective of European Union law, the Slovak legal order and the Czech legal order. Furthermore, the authors focus on the relation between the open data regime, the public sector information re-use regime and the free access to information regime. The new data protection regime, represented by the General Data Protection Regulation, poses several challenges when it comes to the processing of public sector information in the form of open data. The article highlights the most important challenges of the new regime, namely compliance with purpose specification, selection of a legal ground for processing and other important issues.

