Options to Apply the IGSN Model to Biodiversity Data

2018, Vol. 2, pp. e27087
Author(s):  
Donald Hobern ◽  
Andrea Hahn ◽  
Tim Robertson

For more than a decade, the biodiversity informatics community has recognised the importance of stable, resolvable identifiers to enable unambiguous references to data objects and the associated concepts and entities, including museum/herbarium specimens and, more broadly, all records serving as evidence of species occurrence in time and space. Early efforts built on the Darwin Core institutionCode, collectionCode and catalogueNumber terms, treated as a triple and expected to identify a specimen uniquely. Following a review of available technologies for globally unique identifiers, TDWG adopted Life Science Identifiers (LSIDs) (Pereira et al. 2009). Unfortunately, the key stakeholders in the LSID consortium soon withdrew support for the technology, leaving TDWG committed to a moribund technology. Subsequently, publishers of biodiversity data have adopted a range of technologies to provide unique identifiers, including (among others) HTTP Universal Resource Identifiers (URIs), Universal Unique Identifiers (UUIDs), Archival Resource Keys (ARKs), and Handles. Each of these technologies has merit, but they do not provide consistent guarantees of persistence or resolvability. More importantly, the heterogeneity of these solutions hampers delivery of services that can treat all of these data objects as part of a consistent linked-open-data domain. The geoscience community has established the System for Earth Sample Registration (SESAR), which enables collections to publish standard metadata records for their samples and to associate each of these with an International Geo Sample Number (IGSN, http://www.geosamples.org/igsnabout). IGSNs follow a standard format, distribute responsibility for uniqueness between SESAR and the publishing collections, and support resolution via HTTP URIs or Handles. Each IGSN resolves to a standard metadata page, roughly equivalent in detail to a Darwin Core specimen record. The standardisation of identifiers has allowed the community to secure support from some journal publishers for promotion and use of IGSNs within articles. The biodiversity informatics community encompasses a much larger number of publishers and greater pre-existing variation in identifier formats. Nevertheless, it would be possible to deliver a shared global identifier scheme with the same features as IGSNs by building on the aggregation services offered by the Global Biodiversity Information Facility (GBIF). The GBIF data index includes normalised Darwin Core metadata for all data records from registered data sources and could serve as a platform for resolution of HTTP URIs and/or Handles for all specimens and for all occurrence records. The most significant trade-off requiring consideration would be between the autonomy of collections and other publishers in how they format identifiers within their own data, and the benefits that may arise from greater consistency and predictability in the form of resolvable identifiers.
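
As a rough sketch of the resolution service envisaged here (an illustration, not an existing GBIF resolver), the snippet below fetches the normalised Darwin Core record that the GBIF index already exposes for a single occurrence through its public occurrence API; the occurrence key used is a placeholder.

```python
# Minimal sketch: resolving a specimen/occurrence identifier against the GBIF
# index, as one possible backend for a shared resolvable-identifier scheme.
import requests

GBIF_OCCURRENCE_API = "https://api.gbif.org/v1/occurrence/{key}"

def resolve_occurrence(key: int) -> dict:
    """Fetch the normalised Darwin Core record held by GBIF for one occurrence."""
    response = requests.get(GBIF_OCCURRENCE_API.format(key=key), timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    record = resolve_occurrence(1234567890)  # placeholder occurrence key
    # A resolver could expose these fields behind a stable HTTP URI or Handle.
    for term in ("institutionCode", "collectionCode", "catalogNumber", "scientificName"):
        print(term, "=", record.get(term))
```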

Author(s):  
Nora Escribano ◽  
David Galicia ◽  
Arturo H. Ariño

Building on the development of Biodiversity Informatics, the Global Biodiversity Information Facility (GBIF) undertook the task of enabling access to the world's wealth of biodiversity data via the Internet. To date, GBIF has become, in many respects, the most extensive biodiversity information exchange infrastructure in the world, opening up a full range of possibilities for science. Science has benefited from such access to biodiversity data in research areas ranging from the effects of environmental change on biodiversity to the spread of invasive species, among many others. As of this writing, more than 7,000 published items (scientific papers, reviews, conference proceedings) have been indexed in the GBIF Secretariat's literature tracking programme. On the basis of this database, we will present trends in GBIF users' behaviour over time regarding openness, social structure, and other features associated with such scientific production: what is the measurable impact of research using GBIF data? How is the GBIF community of users growing? Is the science made with, and enabled by, open data actually open? Mapping GBIF users' choices will show how biodiversity research is evolving through time, synthesising the past and current priorities of this community in an attempt to forecast whether summer, or winter, is coming.
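
For illustration, yearly counts of tracked papers could be retrieved programmatically along the following lines; the endpoint path and parameters are assumptions about GBIF's public literature search service and may differ from the production API.

```python
# Sketch of one way to pull yearly counts of GBIF-enabled papers for trend
# analysis. The endpoint and the "year"/"limit" parameters are assumptions.
import requests

LITERATURE_API = "https://api.gbif.org/v1/literature/search"  # assumed endpoint

def papers_per_year(start: int, end: int) -> dict:
    counts = {}
    for year in range(start, end + 1):
        resp = requests.get(LITERATURE_API, params={"year": year, "limit": 0}, timeout=30)
        resp.raise_for_status()
        counts[year] = resp.json().get("count", 0)  # total items indexed for that year
    return counts

if __name__ == "__main__":
    print(papers_per_year(2010, 2018))
```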


2016, Vol. 11
Author(s):  
Alex Asase ◽  
A. Townsend Peterson

Providing comprehensive, informative, primary, research-grade biodiversity information represents an important focus of biodiversity informatics initiatives. Recent efforts within Ghana have digitized >90% of primary biodiversity data records associated with specimen sheets in Ghanaian herbaria; additional herbarium data are available from other institutions via biodiversity informatics initiatives such as the Global Biodiversity Information Facility. However, data on the plants of Ghana have not yet been integrated and assessed to establish how complete site inventories are, so that appropriate levels of confidence can be applied to them. In this study, we assessed inventory completeness and identified gaps in current Digital Accessible Knowledge (DAK) of the plants of Ghana, to prioritize areas for future surveys and inventories. We evaluated inventory completeness at ½° spatial resolution using summary statistics, and characterized gaps in coverage in terms of geographic distance and climatic difference from well-documented sites across the country. The southwestern and southeastern parts of the country held many well-known grid cells; the largest spatial gaps were found in central and northern parts of the country. Climatic difference showed contrasting patterns, with a dramatic gap in coverage in central-northern Ghana. This study provides a detailed case study of how to prioritize new botanical surveys and inventories based on existing DAK.
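
The exact completeness statistic is not specified in this abstract; as a purely illustrative sketch, a simple Good/Turing-style sample-coverage proxy per half-degree cell could be computed as follows (the demo records are placeholders).

```python
# Illustrative sketch (not the authors' exact method): bin occurrence records
# into half-degree cells and compute a simple completeness proxy per cell,
# C = 1 - (species recorded exactly once) / (total records in the cell).
import math
from collections import Counter, defaultdict

def completeness_by_cell(records, resolution=0.5):
    """records: iterable of (latitude, longitude, species_name) tuples."""
    cells = defaultdict(list)
    for lat, lon, species in records:
        cell = (math.floor(lat / resolution), math.floor(lon / resolution))
        cells[cell].append(species)
    scores = {}
    for cell, species_list in cells.items():
        n = len(species_list)
        singletons = sum(1 for c in Counter(species_list).values() if c == 1)
        scores[cell] = 1.0 - singletons / n
    return scores

if __name__ == "__main__":
    demo = [(6.7, -1.6, "Ceiba pentandra"), (6.7, -1.6, "Ceiba pentandra"),
            (6.7, -1.6, "Milicia excelsa"), (9.4, -0.8, "Vitellaria paradoxa")]
    print(completeness_by_cell(demo))
```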


Author(s):  
José Augusto Salim ◽  
Antonio Saraiva

For biologists and biodiversity data managers who are unfamiliar with the data standardization practices of information science, the use of complex software to assist in the creation of standardized datasets can be a barrier to sharing data. Since the ratification of the Darwin Core Standard (DwC) (Darwin Core Task Group 2009) by Biodiversity Information Standards (TDWG) in 2009, many datasets have been published and shared through a variety of data portals. In the early stages of biodiversity data sharing, the protocol Distributed Generic Information Retrieval (DiGIR), progenitor of DwC, and later the protocols BioCASe and the TDWG Access Protocol for Information Retrieval (TAPIR) (De Giovanni et al. 2010) were introduced for discovery, search and retrieval of distributed data, simplifying data exchange between information systems. Although these protocols are still in use, they are known to be inefficient for transferring large amounts of data (GBIF 2017). Because of that, in 2011 the Global Biodiversity Information Facility (GBIF) introduced the Darwin Core Archive (DwC-A), which allows more efficient data transfer and has become the preferred format for publishing data in the GBIF network. DwC-A is a structured collection of text files, which makes use of DwC terms to produce a single, self-contained dataset. Many tools for assisting data sharing using DwC-A have been introduced, such as the Integrated Publishing Toolkit (IPT) (Robertson et al. 2014), the Darwin Core Archive Assistant (GBIF 2010) and the Darwin Core Archive Validator. Despite promoting and facilitating data sharing, many users have difficulties using such tools, mainly because of the lack of training in information science in the biodiversity curriculum (Convention on Biological Diversity 2012, Enke et al. 2012). Most users, however, are very familiar with spreadsheets for storing and organizing their data, whereas adoption of the available solutions requires data transformation and training in information science and, more specifically, biodiversity informatics. For an example of how spreadsheets can simplify data sharing, see Stoev et al. (2016). In order to provide a more "familiar" approach to data sharing using DwC-A, we introduce a new tool as a Google Sheets Add-on. The Add-on, called the Darwin Core Archive Assistant Add-on, can be installed in the user's Google Account from the G Suite Marketplace and used in conjunction with the Google Sheets application. The Add-on assists in the mapping of spreadsheet columns/fields to DwC terms (Fig. 1), similar to IPT, but with the advantage that it does not require the user to export the spreadsheet and import it into other software. Additionally, the Add-on facilitates the creation of a star schema in accordance with DwC-A, through the definition of a "CORE_ID" (e.g. occurrenceID, eventID, taxonID) field between sheets of a document (Fig. 2). The Add-on also provides an Ecological Metadata Language (EML) (Jones et al. 2019) editor (Fig. 3) with minimal fields to be filled in (i.e., the mandatory fields required by IPT), and helps users to generate and share DwC-Archives stored in the user's Google Drive, which can be downloaded as a DwC-A or automatically uploaded to another public storage resource such as the user's Zenodo account (Fig. 4).
We expect that the Google Sheets Add-on introduced here, in conjunction with IPT, will promote biodiversity data sharing in a standardized format, as it requires minimal training and simplifies the process of data sharing from the user's perspective, mainly for those users who are not familiar with IPT but have historically worked with spreadsheets. Although the DwC-A generated by the Add-on still needs to be published using IPT, it provides a simpler interface (i.e., a spreadsheet) for mapping datasets to DwC than IPT does. Even though the IPT includes many more features than the Darwin Core Assistant Add-on, we expect that the Add-on can be a "starting point" for users unfamiliar with biodiversity informatics before they move on to more advanced data publishing tools. On the other hand, Zenodo integration allows users to share and cite their standardized datasets without publishing them via IPT, which can be useful for users without access to an IPT installation. Additionally, we are working on new features: future releases will include the automatic generation of globally unique identifiers for shared records, the possibility of adding further data standards and DwC extensions, and integration with the GBIF REST API and the IPT REST API.
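
To make the archive structure concrete, the following sketch (illustrative values only, not the Add-on's code) assembles the three components of a minimal DwC-A, namely a core occurrence table, the meta.xml mapping columns to DwC term URIs, and a skeletal EML document, into a single zip file.

```python
# Minimal sketch of what a DwC-A bundles: a core data file, a meta.xml that
# maps columns to Darwin Core term URIs, and an EML metadata document,
# zipped into one self-contained archive. All values are placeholders.
import zipfile

OCCURRENCE_TSV = (
    "occurrenceID\tscientificName\teventDate\n"
    "urn:example:occ:1\tPuma concolor\t2020-05-01\n"
)

META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
  </core>
</archive>"""

EML_XML = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
  packageId="example-package" system="example">
  <dataset><title>Example dataset</title></dataset>
</eml:eml>"""

with zipfile.ZipFile("example-dwca.zip", "w") as archive:
    archive.writestr("occurrence.txt", OCCURRENCE_TSV)
    archive.writestr("meta.xml", META_XML)
    archive.writestr("eml.xml", EML_XML)
```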


Author(s):  
Filipi Soares ◽  
Benildes Maculan ◽  
Debora Drucker

Agricultural Biodiversity has been defined by the Convention on Biological Diversity as the set of elements of biodiversity that are relevant to agriculture and food production. These elements are arranged into an agro-ecosystem that encompasses "the variability among living organisms from all sources including terrestrial, marine and other aquatic ecosystems and the ecological complexes of which they are part: this includes diversity within species, between species and of ecosystems" (UNEP 1992). As with any other field in Biology, work on Agricultural Biodiversity produces data. In order to publish data in a way that it can be efficiently retrieved on the web, one must describe it with proper metadata. A metadata record is a set of statements made about something. Each statement has three elements: a subject (the thing being described), a predicate (the property being stated about it) and an object (the value, i.e. the data itself). This representation is called a triple. For example, title is a metadata element: a book is the subject, title is the predicate, and The Chronicles of Narnia is the object. Some metadata standards have been developed to describe biodiversity data, such as the ABCD Data Schema, Darwin Core (DwC) and the Ecological Metadata Language (EML). DwC is reported to be the most widely used metadata standard for publishing data about species occurrences worldwide (Global Biodiversity Information Facility 2019). "Darwin Core is a standard maintained by the Darwin Core maintenance group. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions. Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information" (Biodiversity Information Standards (TDWG) 2014). Within this thematic context, a master's research project is in progress at the Federal University of Minas Gerais in partnership with the Brazilian Agricultural Research Corporation (EMBRAPA). It aims to apply DwC to Brazil's Agricultural Biodiversity data. A pragmatic analysis of DwC and the DwC Extensions demonstrated that important concepts and relations from Agricultural Biodiversity are not represented by DwC elements. For example, DwC does not have suitable metadata to describe biological interactions, that is, to convey important information about relations between organisms from an ecological perspective. Pollination is one of the biological interactions relevant to Agricultural Biodiversity for which enhanced metadata are needed. Given these gaps, the metadata construction principles of DwC will be followed in order to develop a metadata extension able to represent data about Agricultural Biodiversity. These principles derive from the Dublin Core Abstract Model, which sets out how the triples (subject-predicate-object) are constructed. The standard format of the DwC Extensions (see the Darwin Core Archive Validator) will be followed to shape the metadata extension. At the end of the research, we expect to present a model of a DwC metadata record for publishing data about Agricultural Biodiversity in Brazil, including metadata already existing in Simple DwC and the new metadata of Brazil's Agricultural Biodiversity Metadata Extension. The resulting extension will also be useful for representing Agricultural Biodiversity worldwide.
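
As a concrete rendering of the triple model described above (illustrative only, not part of the proposed extension), the book example can be expressed as a single RDF statement, here sketched with rdflib; the book URI is a placeholder.

```python
# Illustration of the subject-predicate-object (triple) model, using rdflib
# and the Dublin Core "title" term; the book URI is a placeholder.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

graph = Graph()
book = URIRef("http://example.org/books/narnia")       # subject: the thing described
graph.add((book, DCTERMS.title,                        # predicate: the property
           Literal("The Chronicles of Narnia")))       # object: the value (the data)

print(graph.serialize(format="turtle"))
```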


2020, Vol. 15 (4), pp. 411-437
Author(s):  
Marcos Zárate ◽  
Germán Braun ◽  
Pablo Fillottrani ◽  
Claudio Delrieux ◽  
Mirtha Lewis

Great progress has been made recently in digitising the world's available Biodiversity and Biogeography data, but managing data from many different providers and research domains still remains a challenge. A review of the current landscape of metadata standards and ontologies in the Biodiversity sciences suggests that existing standards, such as the Darwin Core terminology, are inadequate for describing Biodiversity data in a semantically meaningful and computationally useful way. As a contribution to filling this gap, we present an ontology-based system, called BiGe-Onto, designed to manage data from Biodiversity and Biogeography together. As data sources, we use two internationally recognized repositories: the Global Biodiversity Information Facility (GBIF) and the Ocean Biogeographic Information System (OBIS). The BiGe-Onto system is composed of (i) the BiGe-Onto Architecture, (ii) a conceptual model called BiGe-Onto specified in OntoUML, (iii) an operational version of BiGe-Onto encoded in OWL 2, and (iv) an integrated dataset for its exploitation through a SPARQL endpoint. We will show use cases that allow researchers to answer questions drawing on information from both domains.
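
For illustration, the integrated dataset could be queried through its SPARQL endpoint along the following lines; the endpoint URL and the property used are placeholders rather than BiGe-Onto's actual vocabulary.

```python
# Sketch of querying an integrated RDF dataset through a SPARQL endpoint.
# Endpoint URL, prefix and property are illustrative placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/bigeonto/sparql")  # placeholder endpoint
endpoint.setQuery("""
    PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
    SELECT ?occurrence ?name
    WHERE { ?occurrence dwc:scientificName ?name . }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["occurrence"]["value"], row["name"]["value"])
```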


Author(s):  
Dag Endresen ◽  
Armine Abrahamyan ◽  
Akobir Mirzorakhimov ◽  
Andreas Melikyan ◽  
Brecht Verstraete ◽  
...  

BioDATA (Biodiversity Data for Internationalisation in Higher Education) is an international project to develop and deliver biodiversity data training for undergraduate and postgraduate students from Armenia, Belarus, Tajikistan, and Ukraine. By training early career (student) biodiversity scholars, we aim to turn the current academic and educational biodiversity landscape into a more open-data-friendly one. Professional practitioners (researchers, museum curators, and collection managers involved in data publishing) from each country were also invited to join the project as assistant teachers (mentors). The project is developed by the Research School in Biosystematics - ForBio and the Norwegian GBIF-node, both at the Natural History Museum of the University of Oslo, in collaboration with the Secretariat of the Global Biodiversity Information Facility (GBIF) and partners from each of the target countries. The teaching material is based on the GBIF curriculum for data mobilization, and all students will have the opportunity to gain the respective GBIF certification. All materials are made freely available for reuse and, even in this very early phase of the project, we have already seen the first successful reuse of teaching materials among the project partners. The first BioDATA training event was organized in Minsk (Belarus) in February 2019 with the objective of training a minimum of four mentors from each target country. The mentor-trainees from this event will help us to teach the course to students in their home countries together with teachers from the project team. BioDATA mentors will have the opportunity to gain GBIF certification as expert mentors, which will open opportunities to contribute to future training events in the larger GBIF network. The BioDATA training events for the students will take place in Dushanbe (Tajikistan) in June 2019, in Minsk (Belarus) in November 2019, in Yerevan (Armenia) in April 2020, and in Kiev (Ukraine) in October 2020. Students from each country are invited to express their interest in participating by contacting their national project partner. We will close the project with a final symposium at the University of Oslo in March 2021. The project is funded by the Norwegian Agency for International Cooperation and Quality Enhancement in Higher Education (DIKU).


Author(s):  
Franck Michel ◽  
Antonia Ettorre ◽  
Catherine Faron ◽  
Julien Kaplan ◽  
Olivier Gargominy

Harnessing worldwide biodiversity data requires integrating myriad pieces of information, often sparse and incomplete, into a global, coherent data space. To do so, projects like the Global Biodiversity Information Facility, Catalogue of Life and Encyclopedia of Life have set up platforms that gather, consolidate, and centralize billions of records from multiple data sources. This approach lowers the entry barrier for scientists willing to consume aggregated biodiversity data, but tends to build silos that hamper cross-platform interoperability. The Web of Data embodies a different approach, underpinned by the Linked Open Data (LOD) principles (Heath and Bizer 2011). These principles bring about the building of a large, distributed, cross-domain knowledge graph (KG), wherein data description relies on vocabularies with shared, formal, machine-processable semantics. So far, however, little biodiversity data has been published this way. Early efforts focused primarily on taxonomic registers, such as NCBI, VTO and AGROVOC. More recent efforts have started paving the way for the publication of more diverse biodiversity KGs (Page 2019, Penev et al. 2019, Michel et al. 2017). Today, we believe that it is time for more biodiversity data producers to join in and start publishing connected KGs spanning a much broader set of domains, far beyond taxonomic registers alone. In this talk, we wish to present an on-going endeavor in line with this vision. In previous work, we published TAXREF-LD (Michel et al. 2017), a LOD representation of the French taxonomic register developed and maintained by the French National Museum of Natural History. We modeled nomenclatural information as a thesaurus of scientific names, taxonomic information as an ontology of classes denoting taxa, and additional information such as ranks and vernacular names. Recently, we have extended the scope of TAXREF-LD to represent and interlink data as various as geographic locations, species interactions, development stages, trophic levels, as well as conservation, biogeographic and legal status (regulations, protections, etc.). We put a specific effort into working out a model that accurately accounts for the semantics of the data while respecting knowledge engineering practices. For instance, a common design shortcoming is to attach all information as properties of a taxon. This is a rightful choice for some properties, such as a scientific name or conservation status, but properties that actually pertain to the biological individuals themselves, e.g. habitat and trophic level, are better attached to class members. With the presentation of this work, we wish to advance the discussion with the different biodiversity data stakeholders about integration scenarios based on knowledge graphs.
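
The modelling distinction mentioned above can be sketched as follows (URIs and property names are illustrative placeholders, not TAXREF-LD's actual vocabulary): the scientific name is stated about the class denoting the taxon, while habitat is stated about an individual member of that class.

```python
# Sketch of the class-versus-individual modelling choice: a taxon-level
# property sits on the class, an individual-level property sits on a member.
from rdflib import Graph, Literal, Namespace, RDF, RDFS, URIRef

EX = Namespace("http://example.org/vocab#")
g = Graph()

taxon = URIRef("http://example.org/taxon/LutraLutra")          # class denoting the taxon
individual = URIRef("http://example.org/individual/otter42")   # one observed organism

g.add((taxon, RDF.type, RDFS.Class))
g.add((taxon, EX.scientificName, Literal("Lutra lutra")))      # taxon-level statement

g.add((individual, RDF.type, taxon))                           # the organism is a member
g.add((individual, EX.habitat, Literal("freshwater river")))   # individual-level statement

print(g.serialize(format="turtle"))
```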


Author(s):  
Daniel Noesgaard

The work required to collect, clean and publish biodiversity datasets is significant, and those who do it deserve recognition for their efforts. Researchers publish studies using open biodiversity data available from GBIF (the Global Biodiversity Information Facility) at a rate of about two papers a day. These studies cover areas such as macroecology, evolution, climate change, and invasive alien species, relying on data sharing by hundreds of publishing institutions and the curatorial work of thousands of individual contributors. With more than 90 per cent of these datasets licensed under Creative Commons Attribution licenses (CC BY and CC BY-NC), data users are required to credit the dataset providers. For GBIF, it is crucial to link these scientific uses to the underlying data as one means of demonstrating the value and impact of open science, while seeking to ensure attribution of individual, organizational and national contributions to the global pool of open data about biodiversity. Every authenticated download of occurrence records from GBIF.org is issued a unique Digital Object Identifier (DOI). Each of these DOIs resolves to a landing page that contains the search parameters used to generate the download, a quantitative map of the underlying datasets that contributed to it, and a simple citation to be included in works that rely on the data. When used properly by authors and deposited correctly by journals in the article metadata, the DOI citation establishes a direct link between a scientific paper and the underlying data. Crossref, the main DOI Registration Agency for academic literature, exposes such links in Event Data, which can be consumed programmatically to report direct use of individual datasets. GBIF also records these links, permanently preserving the download archives while exposing a citation count on download landing pages, which is also summarized on the landing pages of each contributing dataset and publisher. The citation counts can be expanded to produce lists of all papers unambiguously linked to the use of specific datasets. In 2018, just 15 per cent of papers based on GBIF-mediated data used DOIs to cite or acknowledge the datasets used in the studies. To promote crediting of data publishers and digital recognition of data sharing, the GBIF Secretariat has been reaching out systematically to authors and publishers since April 2018 whenever a paper fails to include a proper data citation. While publishing lags may hinder immediate effects, preliminary findings suggest that uptake is improving, as the number of papers with DOI data citations during the first part of 2019 is up more than 60 per cent compared to 2018. Focusing on the value of linking scientific publications and data, this presentation will explore the potential for establishing automatic linkage through DOI metadata while demonstrating efforts to improve metrics of data use and attribution of data providers through outreach campaigns to authors and journal publishers.
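
For illustration, a download DOI can be resolved programmatically through standard DOI content negotiation to retrieve machine-readable citation metadata; the DOI below is a placeholder, not a real GBIF download.

```python
# Sketch: resolving a DOI via standard content negotiation to obtain
# machine-readable (CSL JSON) citation metadata. The DOI is a placeholder.
import requests

def fetch_citation_metadata(doi: str) -> dict:
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    meta = fetch_citation_metadata("10.15468/dl.example")  # placeholder DOI
    print(meta.get("title"), meta.get("publisher"))
```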


Author(s):  
Arturo H. Ariño ◽  
Mónica González-Alonso ◽  
Anabel Pérez de Zabalza

With more than one billion primary biodiversity data records (PBR), the Global Biodiversity Information Facility (GBIF) is the largest and, arguably, most comprehensive and accurate resource about biodiversity data on the planet. Yet its gaps (taxonomical, geographical or chronological, among others) have often been brought to attention (Gaiji et al. 2013), and efforts are continuously made to ensure more uniform coverage. Especially as data obtained through this resource are increasingly being used for science, policy, and conservation (Ariño et al. 2018), drawing on every possible source of information to complement already existing data opens new opportunities for supplying the integrative knowledge required for global endeavors, such as understanding the global patterns of ecosystem and environmental change. One such potential source that exists, but has so far experienced little integration, is the vast body of data acquired through airborne particle monitoring systems (for example, the European Aeroallergen Network, EAN). A large portion of pollen data consists of quantitative samples of airborne pollen collected through semi-automated spore traps throughout the world. Its main use is clinical, as it forms the basis of the widespread allergen forecast bulletins. While geolocating the source of airborne pollen is fraught with large uncertainty radii, the time and taxon components of the PBR remain highly precise and are therefore fit for many other uses (Hill et al. 2010). Presence data and, under certain circumstances, frequency data inferred from pollen counts have often been proposed as an excellent proxy for past climate change assessments as far back as the start of the Holocene (Mauri et al. 2015), and might therefore also be used for current climate change detection. We call for a concerted effort throughout the palynological community to first harmonize, and eventually standardize, pollen data acquisition through the adoption of Darwin Core (DwC) and, in due course, DwC extensions, in order to mine current data and pipeline future airborne pollen data as PBR. Success in this endeavor may contribute to a better understanding of global change.
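
As an illustration (an assumption for discussion, not an adopted mapping), a daily spore-trap count could be expressed with existing DwC terms roughly as sketched below; the input field names, the uncertainty radius and the protocol label are placeholders.

```python
# Illustrative mapping from a daily spore-trap pollen count to Darwin Core
# occurrence terms, keeping the precise time and taxon components while
# flagging the spatial uncertainty of the airborne source. Values assumed.
def pollen_count_to_dwc(station_lat, station_lon, date_iso, taxon, grains_per_m3):
    return {
        "basisOfRecord": "MachineObservation",
        "scientificName": taxon,                    # taxon component stays precise
        "eventDate": date_iso,                      # time component stays precise
        "decimalLatitude": station_lat,             # trap location, not pollen source
        "decimalLongitude": station_lon,
        "coordinateUncertaintyInMeters": 30000,     # assumed source radius, illustrative
        "samplingProtocol": "Hirst-type volumetric spore trap",
        "organismQuantity": grains_per_m3,
        "organismQuantityType": "pollen grains per cubic metre of air",
    }

if __name__ == "__main__":
    print(pollen_count_to_dwc(42.81, -1.64, "2019-03-15", "Cupressus", 187.0))
```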


2019, Vol. 7
Author(s):  
Donald Hobern ◽  
Brigitte Baptiste ◽  
Kyle Copas ◽  
Robert Guralnick ◽  
Andrea Hahn ◽  
...  

There has been major progress over the last two decades in digitising historical knowledge of biodiversity and in making biodiversity data freely and openly accessible. Interlocking efforts bring together international partnerships and networks, national, regional and institutional projects and investments and countless individual contributors, spanning diverse biological and environmental research domains, government agencies and non-governmental organisations, citizen science and commercial enterprise. However, current efforts remain inefficient and inadequate to address the global need for accurate data on the world's species and on changing patterns and trends in biodiversity. Significant challenges include imbalances in regional engagement in biodiversity informatics activity, uneven progress in data mobilisation and sharing, the lack of stable persistent identifiers for data records, redundant and incompatible processes for cleaning and interpreting data and the absence of functional mechanisms for knowledgeable experts to curate and improve data. Recognising the need for greater alignment between efforts at all scales, the Global Biodiversity Information Facility (GBIF) convened the second Global Biodiversity Informatics Conference (GBIC2) in July 2018 to propose a coordination mechanism for developing shared roadmaps for biodiversity informatics. GBIC2 attendees reached consensus on the need for a global alliance for biodiversity knowledge, learning from examples such as the Global Alliance for Genomics and Health (GA4GH) and the open software communities under the Apache Software Foundation. These initiatives provide models for multiple stakeholders with decentralised funding and independent governance to combine resources and develop sustainable solutions that address common needs. This paper summarises the GBIC2 discussions and presents a set of 23 complementary ambitions to be addressed by the global community in the context of the proposed alliance. The authors call on all who are responsible for describing and monitoring natural systems, all who depend on biodiversity data for research, policy or sustainable environmental management and all who are involved in developing biodiversity informatics solutions to register interest at https://biodiversityinformatics.org/ and to participate in the next steps to establishing a collaborative alliance. The supplementary materials include brochures in a number of languages (English, Arabic, Spanish, Basque, French, Japanese, Dutch, Portuguese, Russian, Traditional Chinese and Simplified Chinese). These summarise the need for an alliance for biodiversity knowledge and call for collaboration in its establishment.

