Mapping and Publishing Sequence-Derived Data through Biodiversity Data Platforms

Author(s):  
Dmitry Schigel ◽  
Anders Andersson ◽  
Andrew Bissett ◽  
Anders Finstad ◽  
Frode Fossøy ◽  
...  

Most users foresee the use of genetic sequences in the context of molecular ecology or phylogenetic research. However, a sequence with coordinates and a timestamp is a valuable biodiversity occurrence that is useful in a much broader context than its original purpose. To uncover this potential, sequence-derived data need to become findable, accessible, interoperable, and reusable through generalist biodiversity data platforms. Stimulated by the Biodiversity_Next discussions in 2019, we worked for about 10 months to compile practical data mapping and data publishing experiences from Norway, Australia, Sweden, and Denmark, as well as from the UNITE and GBIF (Global Biodiversity Information Facility) networks. The resulting guide provides practical instruction for mapping sequence-derived data. Biodiversity data communities remain dominated by macroscopic, easily detectable, morphologically identifiable species. This is true not only for citizen science and other forms of biodiversity popularization, but is also visible in university and museum department structures, financial resource allocations, biodiversity legislation, and policy design. Recent decades of molecular advances have increased the power of genetic methods for detecting, describing, and documenting global biodiversity. We have yet to see a wide shift of data-generating efforts from the traditional taxonomic foci of biodiversity assessments to more balanced and inclusive systems focusing on all functionally important taxa and environments. These include soil, limnic and marine environments, decomposing plants and deadwood, and all life therein. Environmental DNA data enable recording of the present and past presence of micro- and macroscopic organisms with minimal effort and by non-invasive methods. The apparent ease of these methods requires a cautious approach to the resulting data and their interpretation.
It remains important to define and agree on the organism recording and reporting routines for genetic data. DNA data represent a major addition to the many ways in which GBIF and other biodiversity data platforms index the living world. Our guide rests on the shoulders of those who have been developing and improving MIxS (Minimum Information about any (x) Sequence), GGBN (Global Genome Biodiversity Network) and other data standards. The added value of publishing sequence-derived data through non-genetic biodiversity discovery platforms relates to spatio-temporal occurrences and sequence-based names. Reporting sequence-derived occurrences in an open and reproducible way has a wide range of benefits: notably, it increases citability, highlights the taxa concerned in the context of biological conservation, and contributes to taxonomic and ecological knowledge.
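The idea that a sequence with coordinates and a timestamp can be treated as an occurrence can be illustrated with a minimal sketch. The abstract does not give the guide's actual mapping, so everything here is an assumption: the input field names (`sample_id`, `latitude`, `longitude`, `collected_on`, `taxon_name`) and the function `to_dwc_occurrence` are hypothetical, and only a handful of core Darwin Core terms are used as output keys.

```python
from datetime import date


def to_dwc_occurrence(record):
    """Map a hypothetical sequence-sample record to a few Darwin Core terms.

    The input keys are illustrative assumptions, not a published schema.
    """
    required = ("sample_id", "latitude", "longitude", "collected_on", "taxon_name")
    missing = [key for key in required if record.get(key) in (None, "")]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")

    lat = float(record["latitude"])
    lon = float(record["longitude"])
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range")

    return {
        "occurrenceID": record["sample_id"],
        # Assumption: the record is backed by a physical sample.
        "basisOfRecord": "MaterialSample",
        # Parsing and re-serialising rejects dates that are not ISO 8601.
        "eventDate": date.fromisoformat(record["collected_on"]).isoformat(),
        "decimalLatitude": lat,
        "decimalLongitude": lon,
        "scientificName": record["taxon_name"],
    }


example = to_dwc_occurrence({
    "sample_id": "soil-sample-001",
    "latitude": "59.94",
    "longitude": "10.72",
    "collected_on": "2019-06-15",
    "taxon_name": "Fungi",
})
```

Validating coordinates and dates before publishing reflects the cautious approach to sequence-derived data that the abstract calls for; a real mapping would carry considerably more context about the sequence and the method used.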

Author(s):  
Maxim Shashkov ◽  
Natalya Ivanova

Russia is a huge gap on the open access global biodiversity map of the Global Biodiversity Information Facility (GBIF). National biodiversity data are stored in various sources, including museums, herbaria, scientific literature and reports, as well as in private collections and local databases. The best known and largest of the Russian herbarium collections are those of the Komarov Botanical Institute of the Russian Academy of Sciences (>6 M sheets) and Moscow University (>1 M sheets). The largest zoological collection is held by the Zoological Institute of the Russian Academy of Sciences, with >60 M specimens. However, most of the national biodiversity data are not yet digitized, and neither a national biodiversity portal nor a list of Russian biodiversity data sources exists. Despite this, projects and other activities are under way to mobilize national data using international biodiversity data standards. Russia is currently not a GBIF member, but in the last 5 years more than 1.6 M occurrences were published by Russian publishers through GBIF.org (69 datasets at the end of March 2019). The largest GBIF data provider in Russia is the Lomonosov Moscow State University. The Digital Moscow University Herbarium includes 971,732 specimens collected from Russia and many other countries. The Russian GBIF community is steadily expanding (Fig. 1); this is reflected in an increase in the number of publishers and published datasets. The current GBIF network infrastructure in Russia includes 5 IPT (Integrated Publishing Toolkit) installations in Saint Petersburg (two), Pushchino (Moscow region), Moscow, and Syktyvkar (Komi Republic). Russian-language biodiversity informatics materials are collected and presented on an informal website, http://gbif.ru/, with three main sections: data publishing through GBIF, Russian GBIF activities, and Russian biodiversity data sources.
Additional sections are dedicated to the iNaturalist citizen science system and the Russian Specify Software Project community. We provide technical helpdesk support not only for Russian publishers, but also for Russian speakers from the former USSR. The national mailing list (via Google Groups) provides a platform for news sharing and now has >240 subscribers. Since the end of 2014, regular biodiversity informatics events have been held in Russia. Last year, two data training courses, funded by GBIF (project ID Russia-02 - "GBIF.ru data mobilization activities") and ForBIO (Research school in biosystematics), were organized in Moscow and the Irkutsk region with the participation of 29 Russian researchers. National biodiversity informatics conferences were held in Apatity (2017) and Irkutsk (2018). We believe Russia already has a well-established community that can become the basis for further development when Russia becomes a GBIF member.


Author(s):  
Edward Gilbert ◽  
Corinna Gries ◽  
Nico Franz ◽  
Landrum Leslie R. ◽  
Thomas H. Nash III

The SEINet Portal Network has a complex social and development history spanning nearly two decades. Initially established as a basic online search engine for a select handful of biological collections curated within the southwestern United States, SEINet has since matured into a biodiversity data network incorporating more than 330 institutions and 1,900 individual data contributors. Participating institutions manage and publish over 14 million specimen records, 215,000 observations, and 8 million images. Approximately 70% of the collections make use of the data portal as their primary "live" specimen management platform. The SEINet interface now supports 13 regional data portals distributed across the United States and northern Mexico (http://symbiota.org/docs/seinet/). Through many collaborative efforts, it has matured into a tool for biodiversity data exploration, which includes species inventories, interactive identification keys, specimen and field images, taxonomic information, species distribution maps, and taxonomic descriptions. SEINet’s initial developmental goals were to construct a read-only interface that integrated specimen records harvested from a handful of distributed natural history databases. Intermittent network connectivity and inconsistent data exchange protocols frequently restricted data persistence. National funding opportunities supported a complete redesign towards a centralized data cache model with periodic "snapshot" updates from original data sources. A service-based management infrastructure was integrated into the interface to mobilize small- to medium-sized collections (<1 million specimen records) that commonly lack the consistent infrastructure and technical expertise needed to maintain a standards-compliant specimen database. These developments were the precursors to the Symbiota software project (Gries et al. 2014).
Through further development of Symbiota, SEINet transformed into a robust specimen management system specifically geared toward specimen digitization, with features including data entry from label images, harvesting data from specimen duplicates, batch georeferencing, data validation and cleaning, generating progress reports, and additional tools to improve the efficiency of the digitization process. The central developmental paradigm focused on data mobilization through the production of: a versatile import module capable of ingesting a diverse range of data structures; a robust toolkit to assist in digitizing and managing specimen data and images; and a Darwin Core Archive (DwC-A) compliant data publishing and export toolkit to facilitate data distribution to global aggregators such as the Global Biodiversity Information Facility (GBIF) and iDigBio. User interfaces consist of a decentralized network of regional data portals, all connecting to a centralized shared data source. Each of the 13 data portals is configured to present a regional perspective specifically tailored to the needs of the local research community. This infrastructure has supported the formation of regional consortia, which provide network support to aid local institutions in digitizing and publishing their collections within the network. The community-based infrastructure creates a sense of ownership (perhaps even good-natured competition) among the data providers and provides extra incentive to improve data quality and expand the network. Certain areas of development remain challenging in spite of the project's overall success.
For instance, data managers continuously struggle to maintain a current local taxonomic thesaurus used for validating names, cleaning data, and resolving the taxonomic discrepancies commonly encountered when integrating collection datasets. We will discuss the successes and challenges associated with the long-term sustainability model and explore potential future paths for SEINet that support the long-term goal of maintaining a data provider that is in full compliance with the FAIR principles of making datasets findable, accessible, interoperable, and reusable (Wilkinson et al. 2016).


2019 ◽  
Vol 5 ◽  
Author(s):  
Oleh Prylutskyi ◽  
Armine Abrahamyan ◽  
Nina Voronova ◽  
Tatevik Aloyan ◽  
Oleg Borodin ◽  
...  

BioDATA is an international project on developing skills in biodiversity data management and data publishing. Between 2018 and 2021, undergraduate and postgraduate students from Armenia, Belarus, Tajikistan, and Ukraine have the opportunity to take part in intensive courses to become certified professionals in biodiversity data management. They will gain practical skills and knowledge of international data standards (Darwin Core), data cleaning software, data publishing software such as the Integrated Publishing Toolkit (IPT), and the preparation of data papers. Working with databases, creating datasets, managing data for statistical analyses and publishing research papers are essential everyday tasks of a modern biologist. At the same time, these skills are rarely taught in higher education. Most contemporary biodiversity professionals have to gain these skills independently, through colleagues, or through supervision. In addition, all participants familiarize themselves with the Global Biodiversity Information Facility (GBIF), one of the important international research data infrastructures. The project is coordinated by the University of Oslo (Norway) and supported by GBIF. The project is funded by the Norwegian Agency for International Cooperation and Quality Enhancement in Higher Education (DIKU).


2021 ◽  
Vol 9 ◽  
Author(s):  
Domingos Sandramo ◽  
Enrico Nicosia ◽  
Silvio Cianciullo ◽  
Bernardo Muatinte ◽  
Almeida Guissamulo

The collections of the Natural History Museum of Maputo have a crucial role in the safeguarding of Mozambique's biodiversity, representing an important repository of data and materials regarding the natural heritage of the country. In this paper, we describe a dataset based on the Museum’s Entomological Collection, recording 409 species belonging to seven orders and 48 families. Each specimen’s available data, such as geographical coordinates and taxonomic information, have been digitised to build the dataset. The specimens included in the dataset were collected between 1914 and 2018 by collectors and researchers from the Natural History Museum of Maputo (once known as “Museu Alváro de Castro”) in all the country’s provinces, with the exception of Cabo Delgado Province. This paper adds data to the Biodiversity Network of Mozambique and the Global Biodiversity Information Facility, within the objectives of the SECOSUD II Project and the Biodiversity Information for Development Programme. The insect dataset is available on the GBIF Engine data portal (https://doi.org/10.15468/j8ikhb). Data were also shared on the Mozambican national portal of biodiversity data, BioNoMo (https://bionomo.openscidata.org), developed by the SECOSUD II Project.


Author(s):  
Gaurav Vaidya ◽  
Hilmar Lapp ◽  
Nico Cellinese

Most biological data and knowledge are directly or indirectly linked to biological taxa via taxon names. Using taxon names is one of the most fundamental and ubiquitous ways in which a wide range of biological data are integrated, aggregated, and indexed, from genomic and microbial diversity to macro-ecological data. To this day, the names used, as well as most methods and resources developed for this purpose, are drawn from Linnaean nomenclature. This leads to numerous problems when applied to data-intensive science that depends on computation to take full advantage of the vast, and rapidly increasing, amount of available digital biodiversity data. The theoretical and practical complexities of reconciling taxon names and concepts have plagued the systematics community for decades. Now more than ever, Linnaean names based in Linnaean taxonomy, by far the most prevalent means of linking data to taxa, are unfit for the age of computation-driven data science, due to fundamental theoretical and practical shortfalls that cannot be cured. We propose an alternate approach based on the use of phylogenetic clade definitions, a well-developed method for unambiguously defining the semantics of a clade concept in terms of shared evolutionary ancestry (de Queiroz and Gauthier 1990, de Queiroz and Gauthier 1994). These semantics allow locating the defined clade on any phylogeny, or showing that a clade is inconsistent with the topology of a given phylogeny and hence cannot be present on it at all. We have built a workflow for defining phylogenetic clade definitions in terms of shared ancestor and excluded lineage properties, and for locating these definitions on any input phylogeny. Once these definitions have been located, we can use the list of species found within that clade on that phylogeny to aggregate occurrence data from the Global Biodiversity Information Facility (GBIF).
Thus, our approach uses clade definitions with machine-understandable semantics to programmatically and reproducibly aggregate biodiversity data by higher-level taxonomic concepts. This approach has several advantages over the use of taxonomic hierarchies. First, unlike taxa, the semantics of clade definitions can be expressed in unambiguous, machine-understandable and reproducible terms and language. Second, the resolution of a given clade definition depends on the phylogeny being used. Thus, if the phylogeny of groups of interest is updated in light of new evolutionary knowledge, the clade definition can be applied to the new phylogeny to obtain an updated list of clade members consistent with the updated evolutionary knowledge. Third, machine reproducibility of analyses is possible simply by archiving the machine-readable representations of the clade definition and the phylogeny being used. Clade definitions can be created by biologists as needed or can be reused from those published in peer-reviewed journals. In addition, nearly 300 peer-reviewed clade definitions were recently published as part of the Phylonym volume of the PhyloCode (de Queiroz et al. 2020) and are now available on the RegNum website.
As part of the Phyloreferencing Project, we are digitizing this collection as a machine-readable ontology, where each clade is represented as a class defined by logical conjunctions for class membership, corresponding to a set of necessary and sufficient conditions of shared or divergent evolutionary ancestry. We call these classes phyloreferences, and have created a fully automated workflow for digitizing the RegNum database content into an OWL ontology (W3C OWL Working Group 2012) that we call the Clade Ontology. This ontology includes reference phylogenies and additional metadata about the verbatim clade definitions. Once complete, the Clade Ontology will include all clade definitions from RegNum, both those included in Phylonym after passing peer review and those contributed by the community, whether or not under the PhyloCode nomenclature. As an openly available community resource, this will allow researchers to use the definitions to aggregate biodiversity data for comparative biology with grouping semantics that are transparent, machine-processable, and reproducible. In our presentation, we will demonstrate the use of phyloreferences to locate clades on the Open Tree of Life synthetic tree (Hinchliff et al. 2015), to retrieve lists of species in each clade, and to use them to find and aggregate occurrence records in GBIF. We will also describe the workflow we are currently using to build and test the Clade Ontology, and describe our plans for publishing this resource. Finally, we will discuss the advantages and disadvantages of this approach as compared to taxonomic checklists.
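The workflow described in this abstract (locate a clade definition on a phylogeny, list its species, then query GBIF) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the Phyloreferencing Project's implementation, which uses OWL reasoning: the tree is a nested tuple with made-up topology, `resolve` and `gbif_query_urls` are hypothetical helper names, and the sketch assumes that the GBIF occurrence search endpoint accepts a `scientificName` parameter. No network request is made; the code only builds query URLs.

```python
from urllib.parse import urlencode


def leaves(node):
    """Return the set of tip labels under a node (a str tip or a tuple of children)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def all_clades(node):
    """Yield the tip set of every clade in the tree, including single tips."""
    yield leaves(node)
    if not isinstance(node, str):
        for child in node:
            yield from all_clades(child)


def resolve(tree, internal, external=()):
    """Locate a clade definition on a tree, or return None if it cannot be placed.

    With no excluded (external) specifiers, the smallest clade containing all
    included (internal) specifiers is returned. With external specifiers, the
    largest clade containing the internal and none of the external ones is returned.
    """
    internal, external = set(internal), set(external)
    candidates = [c for c in all_clades(tree)
                  if internal <= c and not (external & c)]
    if not candidates:
        return None  # definition is inconsistent with this topology
    pick = max if external else min
    return pick(candidates, key=len)


def gbif_query_urls(species):
    """Build one occurrence-search URL per species (assumed endpoint and parameter)."""
    base = "https://api.gbif.org/v1/occurrence/search"
    return [f"{base}?{urlencode({'scientificName': name, 'limit': 0})}"
            for name in sorted(species)]


# Toy phylogeny, for illustration only.
tree = (((("Homo sapiens", "Pan troglodytes"), "Gorilla gorilla"),
         "Pongo pygmaeus"), "Hylobates lar")

clade = resolve(tree, internal={"Homo sapiens"}, external={"Pongo pygmaeus"})
urls = gbif_query_urls(clade) if clade else []
```

Re-running `resolve` with the same definition on an updated tree yields an updated species list, which is the reproducibility property the abstract emphasizes; a definition that conflicts with the topology returns `None` rather than a misleading result.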


2019 ◽  
Vol 7 ◽  
Author(s):  
Valéria da Silva ◽  
Manoel Aguiar-Neto ◽  
Dan Teixeira ◽  
Cleverson Santos ◽  
Marcos de Sousa ◽  
...  

We present a dataset with information from the Opiliones collection of the Museu Paraense Emílio Goeldi, Northern Brazil. This collection currently has 6,400 specimens distributed in 13 families, 30 genera and 32 species, and includes holotypes of four species: Imeri ajuba Coronato-Ribeiro, Pinto-da-Rocha & Rheims, 2013, Phareicranaus patauateua Pinto-da-Rocha & Bonaldo, 2011, Protimesius trocaraincola Pinto-da-Rocha, 1997 and Sickesia tremembe Pinto-da-Rocha & Carvalho, 2009. The material of the collection is exclusively from Brazil, mostly from the Amazon Region. The dataset is now available for public consultation on the Sistema de Informação sobre a Biodiversidade Brasileira (SiBBr) (https://ipt.sibbr.gov.br/goeldi/resource?r=museuparaenseemiliogoeldi-collection-aracnologiaopiliones). SiBBr is the Brazilian Biodiversity Information System, a government initiative and the Brazilian node of the Global Biodiversity Information Facility (GBIF), which aims to consolidate primary biodiversity data and make them available on a single platform (Dias et al. 2017). Harvestmen, or Opiliones, constitute the third largest arachnid order, with approximately 6,500 described species. Brazil holds the greatest diversity in the world, with more than 1,000 described species, 95% (960 species) of which are endemic to the country. Of these, 32 species were identified and deposited in the collection of the Museu Paraense Emílio Goeldi.


2018 ◽  
Vol 2 ◽  
pp. e25488
Author(s):  
Anne-Sophie Archambeau ◽  
Fabien Cavière ◽  
Kourouma Koura ◽  
Marie-Elise Lecoq ◽  
Sophie Pamerlon ◽  
...  

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. It has developed an open and free platform for sharing and exploring biodiversity data. All the modules are publicly available for reuse and customization on its GitHub account (https://github.com/AtlasOfLivingAustralia). GBIF Benin, hosted at the University of Abomey-Calavi, has published more than 338,000 occurrence records from 87 datasets and 2 checklists. Through the GBIF Capacity Enhancement Support Programme (https://www.gbif.org/programme/82219/capacity-enhancement-support-programme), GBIF Benin, with the help of GBIF France, is in the process of deploying the Beninese data portal using the GBIF France back-end architecture. Benin is the first African country to implement this module of the ALA infrastructure. In this presentation, we will give an overview of the registry and the occurrence search engine using the Beninese data portal. We will begin with the administration interface and how to manage metadata, then continue with the user interface of the registry and how to find Beninese occurrences through the hub.


2018 ◽  
Vol 2 ◽  
pp. e25486
Author(s):  
Nick dos Remedios ◽  
Marie-Elise Lecoq ◽  
David Martin ◽  
Sophia Ratcliffe

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. Since 2010, it has developed and improved a platform for sharing and exploring biodiversity information. All the modules are publicly available for reuse and customization on its GitHub account (https://github.com/AtlasOfLivingAustralia). The National Biodiversity Network, a registered charity, is the UK GBIF node and has been sharing biodiversity data since 2000. It has published more than 79 million occurrences from 818 datasets. In 2016, it launched the NBN Atlas Scotland (https://scotland.nbnatlas.org/), based on the Atlas of Living Australia infrastructure. Since then, it has released the NBN Atlas (https://nbnatlas.org/) and the NBN Atlas Wales (https://wales.nbnatlas.org/), with the NBN Atlas Isle of Man to follow soon. In addition to the occurrence/species search engine and the metadata registry, it has put in place several tools that help users work with data published in the network: the spatial portal and the "explore your region" module. Both are based on Atlas of Living Australia developments. Because the Atlas of Living Australia platform is powerful and reusable, we will show these two applications, which are used for geographical analyses, and present the specific features of each component with examples of its functionality.


2021 ◽  
Vol 118 (6) ◽  
pp. e2018093118
Author(s):  
J. Mason Heberling ◽  
Joseph T. Miller ◽  
Daniel Noesgaard ◽  
Scott B. Weingart ◽  
Dmitry Schigel

The accessibility of global biodiversity information has surged in the past two decades, notably through widespread funding initiatives for museum specimen digitization and emergence of large-scale public participation in community science. Effective use of these data requires the integration of disconnected datasets, but the scientific impacts of consolidated biodiversity data networks have not yet been quantified. To determine whether data integration enables novel research, we carried out a quantitative text analysis and bibliographic synthesis of >4,000 studies published from 2003 to 2019 that use data mediated by the world’s largest biodiversity data network, the Global Biodiversity Information Facility (GBIF). Data available through GBIF increased 12-fold since 2007, a trend matched by global data use with roughly two publications using GBIF-mediated data per day in 2019. Data-use patterns were diverse by authorship, geographic extent, taxonomic group, and dataset type. Despite facilitating global authorship, legacies of colonial science remain. Studies involving species distribution modeling were most prevalent (31% of literature surveyed) but recently shifted in focus from theory to application. Topic prevalence was stable across the 17-y period for some research areas (e.g., macroecology), yet other topics proportionately declined (e.g., taxonomy) or increased (e.g., species interactions, disease). Although centered on biological subfields, GBIF-enabled research extends surprisingly across all major scientific disciplines. Biodiversity data mobilization through global data aggregation has enabled basic and applied research use at temporal, spatial, and taxonomic scales otherwise not possible, launching biodiversity sciences into a new era.

