Contribution to a reference library of DNA barcodes of Colombian freshwater fishes

Biodiversity Data Journal ◽

10.3897/bdj.10.e65981 ◽

2022 ◽

Vol 10 ◽

Author(s):

Manuela Mejía-Estrada ◽

Luz Fernanda Jiménez-Segura ◽

Marcela Hernández-Zapata ◽

Iván Soto Calderón

Keyword(s):

Dna Sequences ◽

Freshwater Fishes ◽

Dna Barcodes ◽

Global Biodiversity Information Facility ◽

Reference Collection ◽

Poecilia Sphenops ◽

The University ◽

Biodiversity Information ◽

Made In ◽

Occurrence Records

The Barcode of Life initiative was originally motivated by the large number of species, taxonomic difficulties and the limited number of expert taxonomists. Colombia has 1,610 freshwater fish species, the second largest diversity of this group in the world. As genetic information continues to be limited, we constructed a reference collection of DNA sequences of Colombian freshwater fishes deposited in the Ichthyology Collection of the University of Antioquia (CIUA), thus joining the multiple efforts that have been made in the country to contribute to the knowledge of genetic diversity in order to strengthen the inventories of biological collections and facilitate the solution of taxonomic issues in the future. This study contributes to the knowledge on the DNA barcodes and occurrence records of 96 species of Colombian freshwater fishes. Fifty-seven of the species represented in this dataset were already available in the Barcode of Life Data System (BOLD System), while 39 are new to the BOLD System. Forty-nine specimens were collected in the Atrato River Basin and 708 in the Magdalena-Cauca Basin during the period 2010-2020. Two species (Loricariichthys brunneus (Hancock, 1828) and Poecilia sphenops Valenciennes, 1846) are considered exotic to the Atrato, Cauca and Magdalena Basins and four species (Oncorhynchus mykiss (Walbaum, 1792), Oreochromis niloticus (Linnaeus, 1758), Parachromis friedrichsthalii (Heckel, 1840) and Xiphophorus helleri Heckel, 1848) are exotic to the Colombian hydrogeographic regions. All specimens are deposited in CIUA and have their DNA barcodes made publicly available in the BOLD online database. The geographical distribution dataset can be freely accessed through the Global Biodiversity Information Facility (GBIF).


Contribution to a reference library of DNA barcodes for Colombian freshwater fishes

10.3897/arphapreprints.e68554 ◽

2021 ◽

Author(s):

Manuela Mejía Estrada ◽

Luz Fernanda Jiménez-Segura ◽

Iván Soto Calderón

Keyword(s):

Dna Sequences ◽

Freshwater Fishes ◽

Dna Barcodes ◽

Reference Library ◽

Global Biodiversity Information Facility ◽

The World ◽

Poecilia Sphenops ◽

Key Indicators ◽

Biodiversity Information ◽

Occurrence Records

DNA barcoding was proposed in response to the mismatch between the low number of taxonomists and the large number of species. The method requires the construction of reference collections of DNA sequences that represent existing biodiversity. Freshwater fishes are key indicators for understanding biogeography around the world. Colombia, with 1610 species of freshwater fishes, is the second richest country in the world in this group. However, genetic information on these species continues to be limited. This contribution to a reference library of DNA barcodes for Colombian freshwater fishes highlights the importance of biological collections and seeks to strengthen the inventories and taxonomy of such collections in future studies. This dataset contributes to the knowledge on the DNA barcodes and occurrence records of 96 species of freshwater fishes from Colombia. The species represented in this dataset add 39 species to the BOLD public databases. Forty-nine specimens were collected in the Atrato basin and 708 in the Magdalena-Cauca basin during the period 2010 to 2020. Two species (Loricariichthys brunneus and Poecilia sphenops) are considered exotic to the Atrato, Cauca and Magdalena basins and four species (Oncorhynchus mykiss, Oreochromis niloticus, Parachromis friedrichsthalii and Xiphophorus helleri) are exotic to Colombian hydrogeographic regions. All specimens are deposited in the CIUA collection at the University of Antioquia and have their DNA barcodes made publicly available in the Barcode of Life Data System (BOLD) online database, and the distribution dataset can be freely accessed through the Global Biodiversity Information Facility (GBIF).


The Living Atlases community in action: the GBIF Benin data portal

Biodiversity Information Science and Standards ◽

10.3897/biss.2.25488 ◽

2018 ◽

Vol 2 ◽

pp. e25488

Author(s):

Anne-Sophie Archambeau ◽

Fabien Cavière ◽

Kourouma Koura ◽

Marie-Elise Lecoq ◽

Sophie Pamerlon ◽

...

Keyword(s):

African Country ◽

Biodiversity Data ◽

Global Biodiversity Information Facility ◽

Capacity Enhancement ◽

Support Programme ◽

Data Portal ◽

Global Biodiversity ◽

The University ◽

Biodiversity Information ◽

Occurrence Records

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. It has developed an open and free platform for sharing and exploring biodiversity data. All the modules are publicly available for reuse and customization on its GitHub account (https://github.com/AtlasOfLivingAustralia). GBIF Benin, hosted at the University of Abomey-Calavi, has published more than 338 000 occurrence records from 87 datasets and 2 checklists. Through the GBIF Capacity Enhancement Support Programme (https://www.gbif.org/programme/82219/capacity-enhancement-support-programme), GBIF Benin, with the help of GBIF France, is in the process of deploying the Beninese data portal using the GBIF France back-end architecture. Benin is the first African country to implement this module of the ALA infrastructure. In this presentation, we will show you an overview of the registry and the occurrence search engine using the Beninese data portal. We will begin with the administration interface and how to manage metadata, then we will continue with the user interface of the registry and how you can find Beninese occurrences through the hub.


The InBIO Barcoding Initiative Database: contribution to the knowledge on DNA barcodes of Iberian Plecoptera

Biodiversity Data Journal ◽

10.3897/bdj.8.e55137 ◽

2020 ◽

Vol 8 ◽

Cited By ~ 1

Author(s):

Sonia Ferreira ◽

José Manuel Tierno de Figueroa ◽

Filipa Martins ◽

Joana Verissimo ◽

Lorenzo Quaglietta ◽

...

Keyword(s):

Dna Sequences ◽

Freshwater Ecosystems ◽

Distribution Data ◽

Dna Barcodes ◽

Limited Information ◽

Online Database ◽

Global Biodiversity Information Facility ◽

Life Data ◽

Biodiversity Information ◽

Comprehensive Reference

The use of DNA barcoding allows unprecedented advances in biodiversity assessments and monitoring schemes of freshwater ecosystems; nevertheless, it requires the construction of comprehensive reference collections of DNA sequences that represent the existing biodiversity. Plecoptera are considered particularly good ecological indicators and one of the most endangered groups of insects, but very limited information on their DNA barcodes is available in public databases. Currently, less than 50% of the Iberian species are represented in BOLD. The InBIO Barcoding Initiative Database: contribution to the knowledge on DNA barcodes of Iberian Plecoptera dataset contains records of 71 specimens of Plecoptera. All specimens have been morphologically identified to species level and belong to 29 species in total. This dataset contributes to the knowledge on the DNA barcodes and distribution of Plecoptera from the Iberian Peninsula and it is one of the IBI database public releases that makes available genetic and distribution data for a series of taxa. The species represented in this dataset correspond to an addition to public databases of 17 species and 21 BINs. Fifty-eight specimens were collected in Portugal and 18 in Spain during the period of 2004 to 2018. All specimens are deposited in the IBI collection at CIBIO, Research Center in Biodiversity and Genetic Resources and their DNA barcodes are publicly available in the Barcode of Life Data System (BOLD) online database. The distribution dataset can be freely accessed through the Global Biodiversity Information Facility (GBIF).


The InBIO Barcoding Initiative Database: DNA barcodes of Portuguese Diptera 01

Biodiversity Data Journal ◽

10.3897/bdj.8.e49985 ◽

2020 ◽

Vol 8 ◽

Cited By ~ 2

Author(s):

Sonia Ferreira ◽

Rui Andrade ◽

Ana Gonçalves ◽

Pedro Sousa ◽

Joana Paupério ◽

...

Keyword(s):

Species Level ◽

Distribution Data ◽

Dna Barcodes ◽

Online Database ◽

Global Biodiversity Information Facility ◽

Life Data ◽

Tagus River ◽

Global Biodiversity ◽

Biodiversity Information ◽

Dipteran Species

The InBIO Barcoding Initiative (IBI) Diptera 01 dataset contains records of 203 specimens of Diptera. All specimens have been morphologically identified to species level, and belong to 154 species in total. The species represented in this dataset correspond to about 10% of continental Portugal dipteran species diversity. All specimens were collected north of the Tagus river in Portugal. Sampling took place from 2014 to 2018, and specimens are deposited in the IBI collection at CIBIO, Research Center in Biodiversity and Genetic Resources. This dataset contributes to the knowledge on the DNA barcodes and distribution of 154 species of Diptera from Portugal and is the first of the planned IBI database public releases, which will make available genetic and distribution data for a series of taxa. All specimens have their DNA barcodes made publicly available in the Barcode of Life Data System (BOLD) online database and the distribution dataset can be freely accessed through the Global Biodiversity Information Facility (GBIF).


Mobilizing Data from Taxonomic Literature for an Iconic Species (Dinosauria, Theropoda, Tyrannosaurus rex)

Biodiversity Information Science and Standards ◽

10.3897/biss.3.37078 ◽

2019 ◽

Vol 3 ◽

Author(s):

Jeremy Miller ◽

Yanell Braumuller ◽

Puneet Kishor ◽

David Shorthouse ◽

Mariya Dimitrova ◽

...

Keyword(s):

Global Biodiversity Information Facility ◽

Museum Exhibits ◽

The Public ◽

The Past ◽

Biodiversity Knowledge ◽

Informatics Infrastructure ◽

The Subject ◽

The Impact ◽

Biodiversity Information ◽

Occurrence Records

A vast amount of biodiversity data is reported in the primary taxonomic literature. In the past, we have demonstrated the use of semantic enhancement to extract data from taxonomic literature and make it available to a network of databases (Miller et al. 2015). For technical reasons, semantic enhancement of taxonomic literature is most efficient when customized according to the format of a particular journal. This journal-based approach captures and disseminates data on whatever taxa happen to be published therein. But if we want to extract all treatments on a particular taxon of interest, these are likely to be spread across multiple journals. Fortunately, the GoldenGATE Imagine document editor (Sautter 2019) is flexible enough to parse most taxonomic literature. Tyrannosaurus rex is an iconic dinosaur with broad public appeal, as well as the subject of more than a century of scholarship. The Naturalis Biodiversity Center recently acquired a specimen that has become a major attraction in the public exhibit space. For most species on earth, the primary taxonomic literature contains nearly everything that is known about them. Every described species on earth is the subject of one or more taxonomic treatments. A taxon-based approach to semantic enhancement can mobilize all this knowledge using the network of databases and resources that comprise the modern biodiversity informatics infrastructure. When a particular species is of special interest, a taxon-based approach to semantic enhancement can be a powerful tool for scholarship and communication. In light of this, we resolved to semantically enhance all taxonomic treatments on T. rex. Our objective was to make these treatments and associated data available for the broad range of stakeholders who might have an interest in this animal, including professional paleontologists, the curious public, and museum exhibits and public communications personnel.
Among the routine parsing and data sharing activities in the Plazi workflow (Agosti and Egloff 2009), taxonomic treatments, as well as cited figures, are deposited in the Biodiversity Literature Repository (BLR), and occurrence records are shared with the Global Biodiversity Information Facility (GBIF). Treatment citations were enhanced with hyperlinks to the cited treatment on TreatmentBank, and specimen citations were linked to their entries on public facing collections databases. We used the OpenBiodiv biodiversity knowledge graph (Senderov et al. 2017) to discover other taxa mentioned together with T. rex, and to create a timeline of T. rex research to evaluate the impact of individual researchers and specimen repositories to T. rex research. We contributed treatment links to WikiData, and queried WikiData to discover identifiers to different platforms holding data about T. rex. We used bloodhound-tracker.net to disambiguate human agents, like collectors, identifiers, and authors. We evaluate the adequacy of the fields currently available to extract data from taxonomic treatments, and make recommendations for future standards.


An audit of some processing effects in aggregated occurrence records

ZooKeys ◽

10.3897/zookeys.751.24791 ◽

2018 ◽

Vol 751 ◽

pp. 129-146 ◽

Cited By ~ 7

Author(s):

Robert Mesibov

Keyword(s):

Data Loss ◽

Global Biodiversity Information Facility ◽

Australian Museum ◽

Darwin Core ◽

Species Groups ◽

Processing Effects ◽

Global Biodiversity ◽

Name Changes ◽

Biodiversity Information ◽

Occurrence Records

A total of ca 800,000 occurrence records from the Australian Museum (AM), Museums Victoria (MV) and the New Zealand Arthropod Collection (NZAC) were audited for changes in selected Darwin Core fields after processing by the Atlas of Living Australia (ALA; for AM and MV records) and the Global Biodiversity Information Facility (GBIF; for AM, MV and NZAC records). Formal taxon names in the genus- and species-groups were changed in 13–21% of AM and MV records, depending on dataset and aggregator. There was little agreement between the two aggregators on processed names, with names changed in two to three times as many records by one aggregator alone compared to records with names changed by both aggregators. The type status of specimen records did not change with name changes, resulting in confusion as to the name with which a type was associated. Data losses of up to 100% were found after processing in some fields, apparently due to programming errors. The taxonomic usefulness of occurrence records could be improved if aggregators included both original and the processed taxonomic data items for each record. It is recommended that end-users check original and processed records for data loss and name replacements after processing by aggregators.
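
The end-user check recommended in this abstract can be illustrated with a minimal Python sketch. It assumes that original and processed records are available as dictionaries keyed by a shared record identifier, with Darwin Core term names as field keys. The function name `audit_records`, the record layout and the sample values are hypothetical and are not taken from the audit itself.

```python
from collections import Counter


def audit_records(original, processed, fields):
    """Compare original and aggregator-processed records field by field.

    original, processed: dicts mapping a record ID to a dict of
    Darwin Core field values (hypothetical layout, for illustration).
    Returns per-field counts of lost values and changed values.
    """
    lost = Counter()
    changed = Counter()
    for rec_id, orig in original.items():
        proc = processed.get(rec_id, {})
        for field in fields:
            before = (orig.get(field) or "").strip()
            after = (proc.get(field) or "").strip()
            if before and not after:
                # A value present before processing is now empty: data loss.
                lost[field] += 1
            elif before and before != after:
                # A value present before processing was replaced.
                changed[field] += 1
    return {"lost": dict(lost), "changed": dict(changed)}


# Placeholder records; the names and localities are invented.
original = {
    "r1": {"scientificName": "Aus bus", "typeStatus": "holotype", "locality": "Site A"},
    "r2": {"scientificName": "Cus dus", "locality": "Site B"},
}
processed = {
    "r1": {"scientificName": "Aus cus", "typeStatus": "holotype", "locality": ""},
    "r2": {"scientificName": "Cus dus", "locality": "Site B"},
}

report = audit_records(original, processed, ["scientificName", "typeStatus", "locality"])
print(report)
```

In this toy case the type status of record `r1` is unchanged while its name is replaced, which is the situation the abstract describes as leaving it unclear which name a type is associated with.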


Implementing GBIF Pipelines in the Atlas of Living Australia: The first step towards alignment and further collaboration

Biodiversity Information Science and Standards ◽

10.3897/biss.5.74335 ◽

2021 ◽

Vol 5 ◽

Author(s):

Javier Molina ◽

Peggy Newman ◽

David Martin ◽

Vicente Ruiz Jurado

Keyword(s):

Management Systems ◽

Records Management ◽

Global Biodiversity Information Facility ◽

Operation Costs ◽

Operational Costs ◽

Development Teams ◽

Infrastructure Project ◽

Data Ingestion ◽

Biodiversity Information ◽

Occurrence Records

The Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA) are two leading infrastructures serving the biodiversity community. In 2020, the ALA’s occurrence records management systems reached end of life after more than 10 years of operation, and the ALA embarked on a project to replace them. Significant overlap exists in the function of the ALA and GBIF data ingestion pipeline systems. Instead of the ALA developing new systems from scratch, we initiated a project to better align the two infrastructures. The collaboration brings benefits such as the improved reuse of modules and an overall reduction in development and operation costs. The ALA recently replaced its occurrence ingestion system with GBIF pipelines infrastructure and shared code. This is the first milestone of the ALA’s broader Core Infrastructure Project, and its benefits include a more reliable, performant and scalable system, proven by the ability to ingest more and larger datasets while at the same time reducing infrastructure operational costs by more than 40% compared to the previous system. The new system is a key building block for an improved ingestion framework that is being developed within the ALA. The collaboration between the ALA and GBIF development teams will result in more consistent outputs from their respective processing pipelines. It will also allow the broader collective expertise of both infrastructure communities to inform future development and direction. The ALA’s adoption of GBIF pipelines will pave the way for the Living Atlases community to adopt GBIF systems and also contribute to them. In this talk we will introduce the project, share insights on how the teams from GBIF and the ALA worked together, and finally delve into details about the technical implementation and benefits.


Survey of Species Covered by DNA Barcoding Data in BOLD and GenBank for Integration of Data for Museomics

Biodiversity Information Science and Standards ◽

10.3897/biss.4.59065 ◽

2020 ◽

Vol 4 ◽

Author(s):

Takeru Nakazato

Keyword(s):

Dna Barcoding ◽

Fish Species ◽

Dna Sequences ◽

Dna Barcode ◽

Database Systems ◽

Species Level ◽

Ncbi Taxonomy ◽

Dna Barcodes ◽

Voucher Specimen ◽

Biodiversity Information

DNA barcoding technology has become widely employed by biodiversity and molecular biology researchers to identify species and analyze their phylogeny. Recently, DNA metabarcoding and environmental DNA (eDNA) technology have been developed by expanding the concept of DNA barcoding. These techniques analyze the diversity and quantity of organisms within an environment by detecting biogenic DNA in water and soil. They are particularly popular for monitoring fish species living in rivers and lakes (Takahara et al. 2012). BOLD Systems (Barcode of Life Database systems, Ratnasingham and Hebert 2007) is a database for DNA barcoding, archiving 8.5 million barcodes (as of August 2020) along with the voucher specimen from which each DNA barcode sequence is derived, including taxonomy, country of collection, and the museum holding the voucher as metadata (e.g. https://www.boldsystems.org/index.php/Public_RecordView?processid=TRIBS054-16). Also, many barcoding data are submitted to GenBank (Sayers et al. 2020), which is a database for DNA sequences managed by NCBI (National Center for Biotechnology Information, US). The number of records of DNA barcodes, i.e. the COI (cytochrome c oxidase I) gene for animals, has grown significantly (Porter and Hajibabaei 2018). BOLD imports DNA barcoding data from GenBank, and many DNA barcoding records in GenBank are also assigned BOLD IDs. However, we have to refer to both BOLD and GenBank data when performing DNA barcoding. I have previously investigated the registration of DNA barcoding data in GenBank, especially the association with BOLD, using insects and flowering plants as examples (Nakazato 2019). Here, I surveyed the number of species covered by BOLD and GenBank. I used fish data as an example because eDNA research is particularly focused on fish. I downloaded all GenBank files for vertebrates from NCBI FTP (File Transfer Protocol) sites (as of November 2019). Of the GenBank fish entries, 86,958 (7.3%) were assigned BOLD identifiers (IDs).
The NCBI taxonomy database has registrations for 39,127 species of fish, and 20,987 scientific names at the species level (i.e., excluding names that included sp., cf. or aff.). GenBank entries with BOLD IDs covered 11,784 species (30.1%) and 8,665 species-level names (41.3%). I also obtained whole "specimens and sequences combined data" for fish from BOLD systems (as of November 2019). In the BOLD, there are 273,426 entries that are registered as fish. Of these entries, 211,589 BOLD entries were assigned GenBank IDs, i.e. with values in “genbank_accession” column, and 121,748 entries were imported from GenBank, i.e. with "Mined from GenBank, NCBI" description in "institution_storing" column. The BOLD data covered 18,952 fish species and 15,063 species-level names, but 35,500 entries were assigned no species-level names and 22,123 entries were not even filled with family-level names. At the species level, 8,067 names co-occurred in GenBank and BOLD, with 6,997 BOLD-specific names and 599 GenBank-specific names. GenBank has 425,732 fish entries with voucher IDs, of which 340,386 were not assigned a BOLD ID. Of these 340,386 entries, 43,872 entries are registrations for COI genes, which could be candidates for DNA barcodes. These candidates include 4,201 species that are not included in BOLD, thus adding these data will enable us to identify 19,863 fish to the species level. For researchers, it would be very useful if both BOLD and GenBank DNA barcoding data could be searched in one place. For this purpose, it is necessary to integrate data from the two databases. A lot of biodiversity data are recorded based on the Darwin Core standard while DNA sequencing data are sometimes integrated or cross-linked by RDF (Resource Description Framework). 
It may not be technically difficult to integrate these data, but the two databases reference different species data, the EoL (The Encyclopedia of Life) for BOLD and the NCBI taxonomy for GenBank, and the differences in taxonomic systems make it difficult to match records by scientific name. GenBank has fields for the latitude and longitude of the specimens sampled, and Porter and Hajibabaei (2018) argue that this information should be enhanced. However, this information may be better described in the specimen and occurrence databases. The integration of barcoding data with the specimen and occurrence data will solve these problems. Most importantly, it will save the researcher from having to register the same information in multiple databases. In the field of biodiversity, DNA barcode sequences may have been the only gene sequences to receive attention and use. The museomics community regards museum-preserved specimens as rich resources for DNA studies because their biodiversity information can accompany the extraction and analysis of their DNA (Nakazato 2018). GenBank is useful for biodiversity studies due to its low rate of mislabelling (Leray et al. 2019). In the future, we will be working with a variety of DNA, including genomes from museum specimens as well as DNA barcoding. This will require more integrated use of biodiversity information and DNA sequence data. This integration is also of interest to molecular biologists and bioinformaticians.
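
The coverage figures in this abstract reduce to percentages and set arithmetic over species-level names. The following is a minimal Python sketch of that bookkeeping, assuming each database's names are already available as a plain set of strings. The function names and the toy name sets are hypothetical, and no real BOLD or GenBank data are used; only the two percentage inputs are taken from the abstract.

```python
def coverage_percent(covered, total):
    """Share of `total` that is `covered`, as a percentage with one decimal."""
    return round(100 * covered / total, 1)


def name_overlap(bold_names, genbank_names):
    """Count shared and database-specific species-level names."""
    return {
        "shared": len(bold_names & genbank_names),
        "bold_only": len(bold_names - genbank_names),
        "genbank_only": len(genbank_names - bold_names),
        "union": len(bold_names | genbank_names),
    }


# Counts quoted in the abstract: GenBank entries with BOLD IDs versus
# NCBI taxonomy registrations for fish.
species_share = coverage_percent(11784, 39127)
name_share = coverage_percent(8665, 20987)

# Placeholder name sets, invented for illustration.
bold = {"Aus bus", "Aus cus", "Cus dus"}
genbank = {"Aus cus", "Cus dus", "Eus fus"}
overlap = name_overlap(bold, genbank)

print(species_share, name_share, overlap)
```

A comparison of this kind only works on exact string matches, so, as the abstract notes, names drawn from different taxonomic systems would need to be reconciled first.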


The InBIO Barcoding Initiative Database: DNA barcodes of Portuguese Diptera 02 - Limoniidae, Pediciidae and Tipulidae

Biodiversity Data Journal ◽

10.3897/bdj.9.e69841 ◽

2021 ◽

Vol 9 ◽

Author(s):

Sónia Ferreira ◽

Pjotr Oosterbroek ◽

Jaroslav Starý ◽

Pedro Sousa ◽

Vanessa Mata ◽

...

Keyword(s):

Species Level ◽

Dna Barcodes ◽

Online Database ◽

Global Biodiversity Information Facility ◽

Life Data ◽

Crane Flies ◽

Collection Data ◽

Global Biodiversity ◽

First Time ◽

Biodiversity Information

The InBIO Barcoding Initiative (IBI) Diptera 02 dataset contains records of 412 crane fly specimens belonging to the Diptera families: Limoniidae, Pediciidae and Tipulidae. This dataset is the second release by IBI on Diptera and it greatly increases the knowledge on the DNA barcodes and distribution of crane flies from Portugal. All specimens were collected in Portugal, including six specimens from the Azores and Madeira archipelagos. Sampling took place from 2003 to 2019. Specimens have been morphologically identified to species level by taxonomists and belong to 83 species in total. The species represented in this dataset correspond to about 55% of all the crane fly species known from Portugal and 22% of crane fly species known from the Iberian Peninsula. All DNA extractions and most specimens are deposited in the IBI collection at CIBIO, Research Center in Biodiversity and Genetic Resources. Fifty-three species were new additions to the Barcode of Life Data System (BOLD), with another 18 species' barcodes added from under-represented species in BOLD. Furthermore, the submitted sequences were found to cluster in 88 BINs, 54 of which were new to BOLD. All specimens have their DNA barcodes publicly accessible through the BOLD online database and their collection data can be accessed through the Global Biodiversity Information Facility (GBIF). One species, Gonomyia tenella (Limoniidae), is recorded for the first time from Portugal, raising the number of crane flies recorded in the country to 145 species.


Going Molecular: Sequence-based spatiotemporal biodiversity evidence in GBIF

Biodiversity Information Science and Standards ◽

10.3897/biss.3.37036 ◽

2019 ◽

Vol 3 ◽

Author(s):

Dmitry Schigel ◽

Thomas Jeppesen ◽

Robert Finn ◽

Guy Cochrane ◽

Urmas Kõljalg ◽

...

Keyword(s):

Dna Sequences ◽

Data Streams ◽

Large Scale ◽

Sequence Data ◽

Genetic Material ◽

Molecular Data ◽

Molecular Sequence ◽

Global Biodiversity ◽

Biodiversity Information ◽

Occurrence Records

The Global Biodiversity Information Facility (GBIF) was established by governments in 2001, largely through the initiative and leadership of the natural history collections community, following the 1999 recommendation by a working group under the Megascience Forum (predecessor of the Global Science Forum) of the Organization for Economic Cooperation and Development (OECD). Over 20 years, GBIF has helped develop standards and convened a global community of data-publishing institutions, aggregating over one billion specimen occurrence records freely and openly available for use in research and policy making. These GBIF-mediated data range from vouchered museum specimens to observation records generated by humans and machines. New data are being generated from integrated remote sensing, ecological sampling, and molecular sequencing that have strong geospatial components but lack traditional vouchers. GBIF is working with partners to develop best practices for bringing these data into the GBIF architecture. Following discussions during the second Global Biodiversity Information Conference in 2018, GBIF and the European Bioinformatics Institute (EMBL-EBI), supported by ELIXIR, have extended collaboration to share species occurrence records known only from their genetic material. When these data providers contribute data coordinates along with the sequences to the European Nucleotide Archive (ENA), the records will appear on GBIF maps and in spatial searches. This collaboration enables significant new molecular data streams to become discoverable through GBIF.org: by mid-March 2019, over 7.8m individual occurrence records via the ENA, and over 13.2m records as standardized Darwin Core sampling-event datasets via MGnify, a resource that provides taxonomic and functional annotations on sequences derived from environmental sequencing projects.
Sequence-based occurrence records published by ENA and MGnify boost representation of microbial diversity which was underrepresented at GBIF. The ELIXIR-ENA-MGnify-GBIF partnership is working on further refinement of the dynamic data linkages, frequency of updates and other improvements. The API-based tool that connects GBIF data infrastructures is open to new data contributors and for indexes of molecular occurrences. Indexing of these data streams is dependent on the presence of a name (any rank) with the sequence. Under the current Codes of nomenclature, animals, fungi, plants, and algae cannot be described based on exclusively sequence data. Yet, a significant volume of biodiversity data has only been represented by DNA sequences. Barcoding and sequence clustering procedures vary among taxa and research communities, but clusters can be related to a taxon with a Latin name. Many DNA similarity clusters do not contain a sequence from a formally described taxon; however these sequence clusters provide provisional molecular names for nomenclatural communication. In the best cases, curated libraries of reference sequences, their metadata, clusters, alignments, and links to individuals and physical material become de facto naming conventions for certain taxonomic groups, and co-exist with Latin names. Integration of molecular names into the taxonomic backbone of GBIF started with Fungi and UNITE, a data management and identification environment for fungal ITS barcodes with 87,000+ fungal species hypotheses demarcating 800,000+ sequence specimens as of March 2019. Checklist publication of all names in UNITE through GBIF.org including Linnaean names and stable, DOI-trackable molecular sequence based ‘species hypotheses’, enables indexing of fungal metabarcoding data worldwide, such as BIOWIDE. 
As names are currently essential to indexing the world’s occurrence data, GBIF will develop similar linkages with names in the Barcode of Life data system (BOLD) and in SILVA - a resource for high-quality ribosomal RNA sequence data and taxonomy, and welcomes other reference systems to this development. Expanding the molecular data streams (Fig. 1) allows GBIF to address spatial, temporal and taxonomic gaps and biases, and to support large-scale data-intensive research openly and worldwide.
