Biodiversity Information Science and Standards
Latest Publications

Total documents: 1027 (575 in the last five years)
H-index: 4 (4 in the last five years)
Published by Pensoft Publishers
ISSN: 2535-0897

Author(s):  
Inna Kouper ◽  
Kimberly Cook

Panelists: James Macklin, Agriculture and Agri-Food Canada; Anne Thessen, University of Colorado Anschutz Medical Campus; Robbie Burger, University of Kentucky; Ben Norton, North Carolina Museum of Natural Sciences. Organizers: Kimberly Cook, University of Kentucky; Inna Kouper, Indiana University.

As research incentives become increasingly focused on collaborative work, addressing the challenges of curating interdisciplinary data becomes a priority. A panel convened at the TDWG 2021 virtual conference on October 19 discussed these issues and provided a space where people with a variety of experience curating interdisciplinary biodiversity data shared their knowledge and expertise. The panel started with a brief introduction to the challenges of interdisciplinary and highly collaborative research (IHCR), which the panel organizers have previously observed (Kouper et al. 2021). In addition to varying definitions that focus on crossing disciplinary boundaries or synthesizing knowledge, IHCR is characterized by an increasing emphasis on computation, integration of heterogeneous data sources, and work with multiple stakeholders. As such, IHCR data does not fit traditional lifecycle models, as it requires more iterations, coordination, and shared language.

Narrowing the scope to biodiversity data, the panelists acknowledged that biodiversity is a truly interdisciplinary domain where researchers and practitioners bring their diverse expertise to take care of data. The domain has a variety of contributors, including data producers, users, and curators. While they share common goals, these contributors are often fragmented into separate projects that prioritize academic disciplines or public engagement. Lack of knowledge and awareness about contributors, their projects and expertise, as well as a certain vulnerability in branching out into new areas, are among the factors that make it difficult to tear down silos. As James Macklin put it, “... you're crossing a boundary into a place you don't maybe know a lot about, and for some people, that's hard to do. Right? It takes a lot of listening and thinking.”

Due to their complex and interactive nature, IHCR projects almost always have a higher overhead in terms of communication, coordination, and management. Panelists agreed that such projects need a collaboration handbook that assigns roles and responsibilities and establishes rules for various aspects of collaboration, including authorship and handling disagreements. Successful IHCR projects create such handbooks at the beginning and revisit them regularly. Another useful strategy mentioned was to hold debriefing sessions that evaluate what went well and what didn’t. Strong leadership that takes IHCR complexities into account and builds a network of capable facilitators and “bridge-builders” or “translators” is a big factor in project success. Recognizing and encouraging the role of facilitators from the onset of the project helps to develop productive relationships across disciplines and areas of expertise. It also enables everyone to focus on their strengths and build trust.

Data and metadata integration is one of the big challenges in biodiversity, although it is not unique to the field. Biodiversity brings together many disciplines, and each of them identifies its own problems and collects data to address them. Data silos stem from disciplinary silos, and it will take a different, more integrated kind of cyberinfrastructure and modeling to bring these pieces together. Creating such infrastructures and standards around interdisciplinary data and metadata is a serious need, although this work is not valued and rewarded as much as, say, publishing academic papers. Lack of standardization and infrastructure also stands in the way of improving the quality of data in biodiversity. To evaluate the quality of data and to trust its creators, data users need to know who gathered and processed the data and how. When data are re-used within a collaborative project, there is an opportunity to ask questions and find out why, for example, someone used certain naming conventions or processing and analytical approaches. Long-term data such as species’ life history traits, however, can be collected over long periods of time, so the original collectors may no longer be available to ask. Improving the quality of biodiversity data therefore requires going beyond interpersonal communication and addressing the issues of metadata and standards more systematically.

Panelists also discussed the issue of openness in connection with biodiversity data. Openness contributes to improved data quality and an increased return on public investment in science and research. Panelists’ positions diverged on the degree to which biodiversity data should be open and on approaches to addressing competitiveness and sensitivity in research. On one hand, they acknowledged the need for some form of embargo on data sharing to allow data originators to benefit from their effort; on the other, they argued that lack of openness promotes silos and diminishes the quality of research and its reproducibility. Panelists briefly discussed COVID pandemic data as an example of how lack of openness and silos can be detrimental to finding solutions: “COVID has given us the best example we have of how silos do damage to things that could have gone better. ... the data wasn't available, if it had been open or not even necessarily open but had anybody had any idea that it existed somewhere, that would have helped a lot. … We are learning those lessons, governments are changing the way they do business because of it. And so for us, I mean, our community, I think this has been one of the best things that could have happened to us in some ways, simply because it forced a change of mindset. And it has forced citizens to get engaged.” [James Macklin]

The panelists, who brought a wide range of expertise to the discussion, including semantic and digitization technologies, agricultural data, evolutionary biology, and mineralogy among others, discussed the projects they work on, which engaged the audience and stimulated a discussion among all participants about the role of end users in biodiversity data curation, non-traditional careers in biodiversity, and approaches to reviewing data similar to those used for traditional research publications. Panelists and the audience also discussed the differences between “cleaning” and “annotating” data, making annotations part of the biodiversity record, and data reviews. These productive discussions provide a foundation for further developments in the research and practice of curating biodiversity data and building strong interdisciplinary communities.


Author(s):  
Steven J Baskauf ◽  
Paula Zermoglio

Users may be more likely to understand and utilize standards if they are able to read labels and definitions of terms in their own languages. Increasing standards usage in non-English speaking parts of the world will be important for making biodiversity data from across the globe more uniformly available. For these reasons, it is important for Biodiversity Information Standards (TDWG) to make its standards widely available in as many languages as possible. Currently, TDWG has six ratified controlled vocabularies*1, 2, 3, 4, 5, 6 that were originally available only in English. As an outcome of this workshop, we have made term labels and definitions in those vocabularies available in the languages of translators who participated in its sessions. In the introduction, we reviewed the concept of vocabularies, explained the distinction between term labels and controlled value strings, and described how multilingual labels and definitions fit into the standards development process. The introduction was followed by working sessions in which individual translators or small groups working in a single language filled out Google Sheets with their translations. The resulting translations were compiled along with attribution information for the translators and made freely available in JavaScript Object Notation (JSON) and comma separated values (CSV) formats.*7
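For readers who want to consume these outputs programmatically, the sketch below shows one way the CSV form could be loaded into a multilingual lookup. The column names ("term_iri", "language", "label", "definition") are assumptions for illustration, not the actual file layout.

```python
# Minimal sketch (not the workshop's actual script): build a multilingual
# label lookup from a CSV of translations. Column names are assumed.
import csv
from collections import defaultdict

def load_translations(csv_path):
    """Return {term_iri: {language: {"label": ..., "definition": ...}}}."""
    translations = defaultdict(dict)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            translations[row["term_iri"]][row["language"]] = {
                "label": row["label"],
                "definition": row["definition"],
            }
    return translations

# Example usage with placeholder values:
# labels = load_translations("translations.csv")
# print(labels["<term IRI>"]["es"]["label"])
```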


Author(s):  
Papy Nsevolo

Insects play a vital role for humans. Apart from well-known ecosystem services (e.g., pollination, biological control, decomposition), they also serve as food for humans. An increasing number of research reports (Mitsuhashi 2017, Jongema 2018) indicate that entomophagy (the practice of eating insects by humans) is a long-standing practice in many countries around the globe. In Africa notably, more than 524 insects have been reported to be consumed by different ethnic groups, serving as a cheap, ecofriendly and renewable source of nutrients on the continent. Given the global recession due to the COVID-19 pandemic and the threat it poses to food security and food production systems, edible insects are of special interest in African countries, particularly the Democratic Republic of the Congo (DRC), where they have been reported as vital to sustaining food security. Indeed, to date, the broadest lists of edible insects of the DRC reported at most 98 insects identified at species level (Monzambe 2002, Mitsuhashi 2017, Jongema 2018). But these lists are hampered by spelling mistakes and redundancy. An additional problem is raised by insects known only by their vernacular names (ethnospecies), as local languages (more than 240 living ones) do not necessarily give rigorous information due to polysemy concerns. Based on the aforementioned challenges, entomophagy practices and edible insect species reported for the DRC (from the year of independence, 1960, to date) have been reviewed using four authoritative taxonomic databases: the Catalogue of Life (CoL), the Integrated Taxonomic Information System, the Global Biodiversity Information Facility (GBIF) taxonomic backbone, and the Global Lepidoptera Names Index. Results confirm the top position of edible caterpillars (Lepidoptera, 50.8%), followed by Orthoptera (12.5%), Coleoptera and Hymenoptera (10.0% each). A total of 120 edible species (belonging to eighty genera, twenty-nine families and nine orders of insects) have been listed and mapped on a national scale. Likewise, host plants of edible insects have been inventoried after checking against CoL, Plant Resources of Tropical Africa, and the International Union for Conservation of Nature's Red List of Threatened Species. The host plant diversity is dominated by multi-use trees belonging to Fabaceae (34.4%), followed by Phyllanthaceae (10.6%) and Meliaceae (4.9%). However, the data indicated endangered (namely Millettia laurentii, Prioria balsamifera) or critically endangered (Autranella congolensis) host plant species that call for conservation strategies. To the best of our knowledge, the aforementioned results are the first reports of such findings in Africa. Moreover, given the issues encountered during data compilation and the cross-checking of scientific names, a call was made for greater collaboration between local people and expert taxonomists (through citizen science) in order to unravel unidentified ethnospecies. Given the challenge of information technology infrastructure in Africa, such a target could be achieved with mobile apps. Likewise, a further call should be made for better synchronization of taxonomic databases, for high-quality scientific photographs in taxonomic databases, and for additional data (e.g., conservation status, protein or DNA sequence data), as edible insects need to be rigorously identified and sustainably managed. Indeed, these complementary data are crucial, given the limitations of conventional identification methods based on morphometric or dichotomous keys and the lack of voucher specimens in many African museums and collections. This could be achieved by QR (Quick Response) coding insect species and centralizing data about edible insects in a main authoritative taxonomic database whose role is undebatable, as edible insects are today earmarked as a nutrient-rich source of proteins, fat, vitamins and fiber to mitigate food insecurity and poor diets, which are an aggravating factor for the impact of COVID-19.
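As an illustration of the kind of cross-checking described above, the following sketch queries the GBIF backbone taxonomy (one of the four databases used in this review) for a single name. It is not the authors' actual workflow; it simply shows how one reported name can be matched programmatically.

```python
# Minimal sketch of cross-checking a reported edible-insect name against the
# GBIF backbone taxonomy via the public species-match API.
# Requires the `requests` package.
import requests

def check_name(scientific_name: str) -> dict:
    """Match a name against the GBIF backbone and return the key fields."""
    resp = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": scientific_name},
        timeout=30,
    )
    resp.raise_for_status()
    match = resp.json()
    return {
        "query": scientific_name,
        "matchType": match.get("matchType"),      # EXACT, FUZZY, HIGHERRANK or NONE
        "acceptedName": match.get("scientificName"),
        "rank": match.get("rank"),
        "confidence": match.get("confidence"),
    }

# Example: a caterpillar widely reported as edible in the DRC.
# print(check_name("Cirina forda"))
```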


Author(s):  
Yasin Bakış ◽  
Xiaojun Wang ◽  
Hank Bart

Over 1 billion biodiversity collection specimens ranging from fungi to fish to fossils are housed in more than 1,600 natural history collections across the United States. The digitization of these specimens has risen significantly within the last few decades and is only likely to increase, as the use of digitized data gains more importance every day. Numerous experiments with automated image analysis have proven the practicality and usefulness of digitized biodiversity images for computational techniques such as neural networks and image processing. However, most of the computational techniques used to analyze images of biodiversity collection specimens require good curation of the data. One of the challenges in curating multimedia data of biodiversity collection specimens is the quality of the multimedia objects—in our case, two-dimensional images. To tackle the image quality problem, multimedia needs to be captured in a specific format and presented with appropriate descriptors. In this study we present an analysis of two image repositories, each consisting of 2D images of fish specimens from several institutions—the Integrated Digitized Biocollections (iDigBio) and the Great Lakes Invasives Network (GLIN). Approximately 70 thousand images from the GLIN repository and 450 thousand images from the iDigBio repository were processed and their suitability assessed for use in neural network-based species identification and trait extraction applications. Our findings showed that images from the GLIN dataset were more suitable for image processing and machine learning purposes. Almost 40% of the species were represented by fewer than 10 images, while only 20% had more than 100 images per species. We identified and captured 20 metadata descriptors that define the quality and usability of an image. According to the captured metadata information, 70% of the GLIN dataset images were found to be useful for further analysis based on the overall image quality score. Quality issues with the remaining images included curved specimens; non-fish objects in the images, such as tags, labels and rocks, that obstructed the view of the specimen; color, focus and brightness issues; and folded, overlapping or missing parts. We used both the web interface and the API (Application Programming Interface) for downloading images from iDigBio. We searched for all fish genera, families and classes in three different searches with the images-only option selected, then combined all of the search results and removed duplicates. Our search of the iDigBio database for fish taxa returned approximately 450 thousand records with images. We narrowed this down to 90 thousand fish images, aided by the multimedia metadata included with the downloaded search results, excluding non-fish images, fossil samples, X-ray and CT (computed tomography) scans and several others. Only 44% of these 90 thousand images were found to be suitable for further analysis. In this study, we identified some of the limitations of biodiversity image datasets and built an infrastructure for assessing the quality of biodiversity images for neural network analysis. Our experience with the fish images gathered from two different image repositories enabled us to describe image quality metadata features. With the help of these metadata descriptors, one can readily create a dataset of a desired image quality for the purpose of analysis. Likewise, the availability of the metadata descriptors will help advance our understanding of quality issues, while helping data technicians, curators and other digitization staff become more aware of multimedia quality.
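As a rough illustration of how such quality metadata could be put to work, the sketch below filters an image list by a handful of descriptors. The descriptor names and the score threshold are hypothetical assumptions, not the study's actual schema.

```python
# Hypothetical sketch: select images suitable for neural-network analysis
# from a CSV of per-image quality metadata. Descriptor names are illustrative.
import csv

REQUIRED_FLAGS = ("specimen_straight", "no_obstructing_objects", "in_focus")

def select_usable_images(metadata_csv, min_score=0.7):
    """Yield image identifiers that pass the (assumed) quality criteria."""
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if float(row["overall_quality_score"]) < min_score:
                continue
            if all(row[flag] == "true" for flag in REQUIRED_FLAGS):
                yield row["image_id"]

# usable = list(select_usable_images("glin_image_metadata.csv"))
```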


Author(s):  
Aurore Gourraud ◽  
Régine Vignes Lebbe ◽  
Adeline Kerner ◽  
Marc Pignal

The joint use of two tools applied to plant description, XPER3 and Recolnat Annotate, made it possible to study the vegetative architectural patterns (Fig. 1) of Dendrobium (Orchidaceae) in New Caledonia defined by N. Hallé (1977). This approach is not directly related to taxonomy, but to the definition of sets of species grouped according to a growth pattern. In the course of this work, the characters stated by N. Hallé were analysed and eventually amended to produce a data matrix and generate an identification key.

Study materials: Dendrobium Sw. in New Caledonia
New Caledonia is an archipelago in the Pacific Ocean, a French overseas territory located east of Australia. It is one of the 36 biodiversity hotspots in the world. The genus Dendrobium Sw. sensu lato is one of the largest in the family Orchidaceae and contains over 1220 species. In New Caledonia, it includes 46 species. In his revision of the family, N. Hallé (1977) defined 14 architectural groups, into which he divided the 31 species known at that time. These models are based on those defined by F. Hallé and Oldeman (1970), but they are clearly intended to group species together for identification purposes.

Architectural pattern
A pattern is a set of vegetative or reproductive characters that define the general shape of an individual. Shaped by mechanisms linked to the dominance of the terminal buds, the architectural groups are differentiated by the arrangement of the leaves, the position of the inflorescences or the shape of the stem (Fig. 1). Plants obeying a given pattern do not necessarily have phylogenetic relationships. These models have a useful application in the field for identifying groups of plants. Monocotyledonous plants, and in particular the Orchidaceae, lend themselves well to this approach, which produces stable architectural patterns.

Recolnat Annotate
Recolnat Annotate is a free tool for observing qualitative features and making physical measurements (angle, length, area) on images. It can be used offline and downloaded from https://www.recolnat.org/en/annotate. The software is based on setting up observation projects that group together a batch of herbarium images to be studied and associate them with a descriptive model. A file of measurements can be exported in comma-separated values (CSV) format for further analysis (Fig. 2).

XPER3
Usually used in the context of systematics, in which the items studied are taxa, XPER3 can also be used to distinguish architectural groups that are not phylogenetically related. Developed by the Laboratoire d'Informatique et Systématique (LIS) of the Institut de Systématique, Evolution, Biodiversité in Paris, XPER3 is an online collaborative platform that allows the editing of descriptive data (https://www.xper3.fr/?language=en). This tool allows the cross-referencing of items (in this case architectural groups) and descriptors (or characters). It allows the development of free-access identification keys (i.e., keys without a fixed sequence of identification steps), which can be used directly online. It can also produce single-access keys, with or without character weighting and dependencies between characters.

Links between XPER3 and Recolnat Annotate
The descriptive model used by Recolnat Annotate can be developed within the framework of XPER3, which provides the characters and character states. Thus the observations made with the Recolnat Annotate measurement tool can be integrated into the XPER3 platform. Specimens can then be compared, or several descriptions can be merged to express the description of a species (Fig. 3).

Results
The joint use of XPER3 and Recolnat Annotate to manage both herbarium specimens and architectural patterns has proven to be relevant. Moreover, the measurements on the virtual specimens are fast and reliable. N. Hallé (1977) had produced a dichotomous single-access key that allowed the identification and attribution of a pattern to a plant observed in the field or in a herbarium. The project to build a polytomous and interactive key with XPER3 required completing the observations to give a status for each character of each vegetative architectural model. Recolnat Annotate was used to produce observations from the herbarium network in France. The use of XPER3 has allowed us to redefine these models in the light of new data from the herbaria and to publish the interactive key, available at dendrobium-nc.identificationkey.org.
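To give a sense of how the two tools can be chained, the sketch below aggregates a Recolnat Annotate CSV export before the observations are entered into XPER3. The column names ("specimen", "character", "value") are assumptions; the real export layout may differ.

```python
# Hypothetical sketch: average each quantitative character per specimen from a
# Recolnat Annotate measurement export, prior to scoring it in XPER3.
import csv
from collections import defaultdict
from statistics import mean

def summarize_measurements(csv_path):
    """Return {(specimen, character): mean value} from the exported CSV."""
    values = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            values[(row["specimen"], row["character"])].append(float(row["value"]))
    return {key: mean(v) for key, v in values.items()}

# summary = summarize_measurements("dendrobium_measurements.csv")
# e.g., {("P00123456", "leaf_length_mm"): 48.2, ...}
```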


Author(s):  
Dave Vieglais ◽  
Stephen Richard ◽  
Hong Cui ◽  
Neil Davies ◽  
John Deck ◽  
...  

Material samples form an important portion of the data infrastructure for many disciplines. Here, a material sample is a physical object, representative of some physical thing, on which observations can be made. Material samples may be collected for one project initially, but can also be valuable resources for other studies in other disciplines. Collecting and curating material samples can be a costly process. Integrating institutionally managed sample collections, along with those sitting in individual offices or labs, is necessary to facilitate large-scale, evidence-based scientific research. Many have recognized these problems and are working to make data related to material samples FAIR: findable, accessible, interoperable, and reusable. The Internet of Samples (iSamples) is one of these projects. iSamples was funded by the United States National Science Foundation in 2020 with the following aims: enable previously impossible connections between diverse and disparate sample-based observations; support existing research programs and facilities that collect and manage diverse sample types; facilitate new interdisciplinary collaborations; and provide an efficient solution for FAIR samples, avoiding duplicate efforts in different domains (Davies et al. 2021). The initial sample collections that will make up the Internet of Samples include those from the System for Earth Sample Registration (SESAR), Open Context, the Genomic Observatories Meta-Database (GEOME), and the Smithsonian Institution National Museum of Natural History (NMNH), representing the disciplines of geoscience, archaeology/anthropology, and biology. To achieve these aims, the proposed iSamples infrastructure (Fig. 1) has two key components: iSamples in a Box (iSB) and iSamples Central (iSC). The iSC component will be a permanent Internet service that preserves, indexes, and provides access to sample metadata aggregated from iSBs. It will also ensure that persistent identifiers and sample descriptions assigned and used by individual iSBs are synchronized with the records in iSC and with identifier authorities like the International Geo Sample Number (IGSN) or Archival Resource Key (ARK). The iSBs create and maintain identifiers and metadata for their respective collections of samples. While providing access to the samples held locally, an iSB also allows iSC to harvest its metadata records. The metadata modeling strategy adopted by the iSamples project is a profile-based approach, in which core metadata fields that are applicable to all samples form the core metadata schema for iSamples. Each individual participating collection is free to include additional metadata in their records, which will also be harvested by iSC and are discoverable through the iSC user interface or APIs (Application Programming Interfaces), just like the core. In-depth analysis of metadata profiles used by participating collections, including Darwin Core, has resulted in an iSamples core schema currently being tested and refined through use. See the current version of the iSamples core schema. A number of properties require a controlled vocabulary.
Controlled vocabularies used by existing records are kept, while new vocabularies are also being developed to support high-level grouping with consistent semantics across collection types. Examples include vocabularies for Context Category, Material Category, and Specimen Type (Table 1). These vocabularies were developed in a bottom-up manner, based on the terms used in the existing collections. For each vocabulary, a decision tree graph was created to illustrate relations among the terms, and a card sorting exercise was conducted within the project team to collect feedback. Domain experts are invited to take part in this exercise here, here, and here. These terms will be used as upper-level terms for the existing category terms used in the participating collections and hence create connections among the individual participating collections. iSamples project members are also active in the TDWG Material Sample Task Group and the global consultation on Digital Extended Specimens. Many members of the iSamples project also lead or participate in a sister research coordination network (RCN), Sampling Nature. The goal of this RCN is to develop and refine metadata standards and controlled vocabularies for iSamples and other projects focusing on material samples. We cordially invite you to participate in the Sampling Nature RCN and help shape the future standards for material samples. Contact Sarah Ramdeen ([email protected]) to engage with the RCN.
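The profile-based strategy can be pictured as a record with a fixed core plus an open extension block, as in the hypothetical sketch below. The field names are illustrative assumptions, not the published iSamples core schema.

```python
# Purely illustrative sketch of a profile-based sample metadata record:
# a small set of core fields shared by all collections, plus an open-ended
# block of collection-specific properties harvested as-is by iSC.
core_record = {
    "sample_identifier": "IGSN:XXXXXXXXX",            # persistent identifier (placeholder)
    "label": "Example hand sample",
    "context_category": "<term from Context Category vocabulary>",
    "material_category": "<term from Material Category vocabulary>",
    "specimen_type": "<term from Specimen Type vocabulary>",
    "registrant": "SESAR",
    "extensions": {
        # collection-specific metadata, e.g., Darwin Core terms for a
        # biological sample, carried alongside the core fields
        "dwc:scientificName": "<example value>",
    },
}
```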


Author(s):  
Beata Bramorska

Poland is characterised by a relatively high variety of living organisms attributed to terrestrial and water environments. Close to 57,000 species of living organisms occurring in Poland have been described to date (Symonides 2008), including lowland and mountain species, those attributed to oceanic and continental areas, as well as species from forested and open habitats. Poland comprehensively represents the biodiversity of living organisms on a continental scale and thus is considered to play an important role in biodiversity maintenance. The Mammal Research Institute of the Polish Academy of Sciences (MRI PAS), located in Białowieża Forest, a UNESCO Heritage Site, has been collecting biodiversity data for 90 years. However, a great amount of data gathered over the years, especially old data, is gradually being forgotten and becoming hard to access. Old catalogues and databases have never been digitized or publicly shared, and not many Polish scientists are aware of the existence of such resources, not to mention the rest of the scientific world. Recognizing the need for an online, interoperable platform following the FAIR data principles (findable, accessible, interoperable, reusable), where biodiversity and scientific data can be shared, MRI PAS took the lead in creating the Open Forest Data (OFD) repository. OpenForestData.pl (Fig. 1) is a newly created (2020) digital repository, designed to provide access to natural sciences data and to provide scientists with an infrastructure for storing, sharing and archiving their research outcomes. Creating such a platform is part of the ongoing development of the life sciences in Poland, aiming for an open, modern science where data are published as open access. OFD also allows for the consolidation of natural science data, enabling the use and processing of shared data, including through API (Application Programming Interface) tools. OFD is indexed by the Directory of Open Access Repositories (OpenDOAR) and the Registry of Research Data Repositories (re3data). The OFD platform is based entirely on reliable, globally recognized open source software: Dataverse, an interactive database application which supports sharing, storing, exploration, citation and analysis of scientific data; GeoNode, a geospatial content management system used for storing, publicly sharing and visualising vector and raster layers; Grafana, a system for storing and analysing metrics and large-scale measurement data, as well as visualising historical graphs over any time range and analysing trends; and external tools for database storage (Orthanc) and data visualisation (the Orthanc plugin Osimis Web Viewer and Online 3D Viewer, https://3dviewer.net/), which were integrated with the Dataverse system. Furthermore, to meet the need for specimen description, the Darwin Core metadata schema (Wieczorek et al. 2012) was judged the most suitable for describing specimens and collections and was mapped into an additional Dataverse metadata block. The use of Darwin Core builds on a common file format, the Darwin Core Archive (DwC-A), which allows data to be shared using common terminology and provides the possibility for easy evaluation and comparison of biodiversity datasets. Contributors to OFD can optionally choose Darwin Core for object descriptions, making it possible to share biodiversity datasets in a standardized way for users to download, analyse and compare.
Currently, OFD stores more than 10,000 datasets and objects from the collections of the Mammal Research Institute of the Polish Academy of Sciences and the Forest Science Institute of Białystok University of Technology. The objects from the natural collections were digitized, described, catalogued and made publicly available as open access. OFD manages seven types of collection materials: 3D and 2D scans of specimens in the Herbarium, Fungarium, Insect and Mammal Collections; images from microscopes (including stereoscopic and scanning electron microscopes); morphometric measurements; computed tomography and microtomography scans in the Mammal Collection; mammal telemetry data; satellite imagery, geospatial climatic and environmental data; and georeferenced historical maps. In the OFD repository, researchers have the possibility to share data in a standardized way, which nowadays is often a requirement during the publishing process of a scientific article. Besides scientists, OFD is designed to be open and free for students and specialists in nature protection, but also for officials, foresters and nature enthusiasts. Creation of the OFD repository supports the development of citizen science in Poland, increases the visibility of and access to published data, and improves scientific collaboration, exchange and reuse of data within and across borders.
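Because OFD is built on Dataverse, its datasets should in principle be reachable through the standard Dataverse search API. The sketch below assumes such an endpoint is exposed; the host name used here is hypothetical.

```python
# Minimal sketch of programmatic access to OFD via the Dataverse native search
# API. The base URL is an assumption; consult openforestdata.pl for the real one.
import requests

def search_ofd(query: str, n: int = 10):
    """Search OFD datasets and return their titles (assumed endpoint)."""
    resp = requests.get(
        "https://dataverse.openforestdata.pl/api/search",   # hypothetical host
        params={"q": query, "type": "dataset", "per_page": n},
        timeout=30,
    )
    resp.raise_for_status()
    return [item["name"] for item in resp.json()["data"]["items"]]

# print(search_ofd("Bison bonasus"))
```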


Author(s):  
Michael Webster ◽  
Jutta Buschbom ◽  
Alex Hardisty ◽  
Andrew Bentley

Specimens have long been viewed as critical to research in the natural sciences because each specimen captures the phenotype (and often the genotype) of a particular individual at a particular point in space and time. In recent years there has been considerable focus on digitizing the many physical specimens currently in the world’s natural history research collections. As a result, a growing number of specimens are each now represented by their own “digital specimen”, that is, a findable, accessible, interoperable and re-usable (FAIR) digital representation of the physical specimen, which contains data about it. At the same time, there has been growing recognition that each digital specimen can be extended, and made more valuable for research, by linking it to data/samples derived from the curated physical specimen itself (e.g., computed tomography (CT) scan imagery, DNA sequences or tissue samples), directly related specimens or data about the organism's life (e.g., specimens of parasites collected from it, photos or recordings of the organism in life, immediate surrounding ecological community), and the wide range of associated specimen-independent data sets and model-based contextualisations (e.g., taxonomic information, conservation status, bioclimatological region, remote sensing images, environmental-climatological data, traditional knowledge, genome annotations). The resulting connected network of extended digital specimens will enable new research on a number of fronts, and indeed this has already begun. The new types of research enabled fall into four distinct but overlapping categories. First, because the digital specimen is a surrogate—acting on the Internet for a physical specimen in a natural science collection—it is amenable to analytical approaches that are simply not possible with physical specimens. For example, digital specimens can serve as training, validation and test sets for predictive process-based or machine learning algorithms, which are opening new doors of discovery and forecasting. Such sophisticated and powerful analytical approaches depend on FAIR, and on extended digital specimen data being as open as possible. These analytical approaches are derived from biodiversity monitoring outputs that are critically needed by the biodiversity community because they are central to conservation efforts at all levels of analysis, from genetics to species to ecosystem diversity. Second, linking specimens to closely associated specimens (potentially across multiple disparate collections) allows for the coordinated co-analysis of those specimens. For example, linking specimens of parasites/pathogens to specimens of the hosts from which they were collected, allows for a powerful new understanding of coevolution, including pathogen range expansion and shifts to new hosts. Similarly, linking specimens of pollinators, their food plants, and their predators can help untangle complex food webs and multi-trophic interactions. Third, linking derived data to their associated voucher specimens increases information richness, density, and robustness, thereby allowing for novel types of analyses, strengthening validation through linked independent data and thus, improving confidence levels and risk assessment. For example, digital representations of specimens, which incorporate e.g., images, CT scans, or vocalizations, may capture important information that otherwise is lost during preservation, such as coloration or behavior. 
In addition, permanently linking genetic and genomic data to the specimen of the individual from which they were derived—something that is currently done inconsistently—allows for detailed studies of the connections between genotype and phenotype. Furthermore, persistent links between physical specimens, additional information, and associated transactions are the building blocks for documenting and preserving chains of custody. These links will also facilitate data cleaning and updating, as well as the maintenance of digital specimens and their derived and associated datasets, as ever-expanding research questions and applied uses materialize over time. The resulting high-quality data resources are needed for fact-based decision-making and forecasting based on monitoring, forensics and prediction workflows in conservation, sustainable management and policy-making. Finally, linking specimens to diverse but associated datasets allows for detailed, often transdisciplinary, studies of topics ranging from local adaptation, through the forces driving range expansion and contraction (critically important to our understanding of the consequences of climate change), to social vectors in disease transmission. A network of extended digital specimens will enable new and critically important research and applications in all of these categories, as well as science and uses that we cannot yet envision.
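The linking idea described above can be pictured, very schematically, as a record that carries a persistent identifier plus typed links to related resources. The sketch below is purely illustrative and not an actual Digital Extended Specimen implementation; all identifiers and relation names are invented.

```python
# Illustrative sketch of an "extended" digital specimen: its own persistent
# identifier plus typed links to derived data, related specimens, and
# specimen-independent resources.
from dataclasses import dataclass, field

@dataclass
class DigitalSpecimen:
    pid: str                                   # persistent identifier of the digital specimen
    physical_specimen_id: str                  # e.g., a collection catalogue number
    links: dict = field(default_factory=dict)  # relation type -> list of PIDs/URLs

    def link(self, relation: str, target: str) -> None:
        self.links.setdefault(relation, []).append(target)

# Hypothetical usage: connect a host specimen to a parasite specimen,
# a DNA sequence record, and a CT scan held elsewhere.
host = DigitalSpecimen(pid="https://example.org/ds/123", physical_specimen_id="EX:ABC123")
host.link("hasParasite", "https://example.org/ds/456")
host.link("hasSequence", "https://example.org/sequence/789")
host.link("hasImage", "https://example.org/media/ct-scan-789")
```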


Author(s):  
John Waller ◽  
Nikolay Volik ◽  
Federico Mendez ◽  
Andrea Hahn

GBIF (Global Biodiversity Information Facility) is the largest aggregator of biological occurrence data in the world. GBIF was officially established in 2001 and has since aggregated 1.8 billion occurrence records from almost 2000 publishers. GBIF relies heavily on Darwin Core (DwC) for organising the data it receives.

GBIF Data Processing Pipelines
Every occurrence record that gets published to GBIF goes through a series of three processing steps before it becomes available on GBIF.org: source downloading, parsing into verbatim occurrences, and interpreting verbatim values. Once all records are available in the standard verbatim form, they go through a set of interpretations. In 2018, GBIF processing underwent a significant rewrite in order to improve speed and maintainability. One of the main goals of this rewrite was to improve the consistency between GBIF's processing and that of the Living Atlases. In connection with this, GBIF's current data validator fell out of sync with GBIF pipelines processing.

New GBIF Data Validator
The current GBIF data validator is a service that allows anyone with a GBIF-relevant dataset to receive a report on the syntactical correctness and the validity of the content contained within the dataset. By submitting a dataset to the validator, users can go through the validation and interpretation procedures usually associated with publishing in GBIF and quickly determine potential issues in their data, without having to publish it. GBIF is planning to rework the current validator because it does not exactly match current GBIF pipelines processing.

Planned Changes
The new validator will match the processing of the GBIF pipelines project. Validations will be saved and will show up on user pages, similar to the way downloads and derived datasets appear now (no more bookmarking validations!), and a downloadable report of the issues found will be produced.

Suggested Changes/Ideas
One of the main guiding philosophies for the new validator user interface will be avoiding information overload. The current validator is often quite verbose in its feedback, highlighting data issues that may or may not be fixable or particularly important. The new validator will: generate a map of record geolocations; present issues in order of importance; give "What", "Where", "When" flags priority; and give possible solutions or suggested fixes for flagged records. We see the hosted portal environment as a way to quickly implement a pre-publication validation environment that is interactive and visual.

Potential New Data Quality Flags
The GBIF team has been compiling a list of new data quality flags. Not all of the suggested flags are easy to implement, so GBIF cannot promise the flags will get implemented, even if they are a great idea. The advantage of the new processing pipelines is that almost any new data quality flag or processing step in pipelines will also be available for the data validator. Easy new potential flags include:

country centroid flag: country/province centroids are a known data quality problem.
any zero coordinate flag: sometimes publishers leave either the latitude or longitude field as zero when it should have been left blank or NULL.
default coordinate uncertainty in meters flag: sometimes a default value or code is used for dwc:coordinateUncertaintyInMeters, which might indicate that it is incorrect. This is especially the case for the values 301, 3036, 999 and 9999.
no higher taxonomy flag: often publishers will leave out the higher taxonomy of a record. This can cause problems for matching to the GBIF backbone taxonomy.
null coordinate uncertainty in meters flag: there has been some discussion that GBIF should encourage publishers more to fill in dwc:coordinateUncertaintyInMeters, because every record, even one taken from a Global Positioning System (GPS) reading, has an associated dwc:coordinateUncertaintyInMeters.

It is also nice when a data quality flag has an escape hatch, such that a data publisher can get rid of false positives or remove a flag by filling in a value. Batch-type validations that are doable in pipelines, but probably not in the validator, include:

outlier: outliers are a known data quality problem. There are generally two types: environmental outliers and distance outliers. Currently GBIF does not flag either type.
record is sensitive species: a sensitive species would be a record where the species is considered vulnerable in some way, usually due to poaching threat or because the species is found in only one area.
gridded dataset: rasterized or gridded datasets are common on GBIF. These are datasets where location information is pinned to a low-resolution grid. This is already available with an experimental API (Application Programming Interface).

Conclusion
Data quality and data processing are moving targets. Variable source data will always be an issue when aggregating large amounts of data.
With GBIF's new processing architecture, we hope that new features and data quality flags can be added more easily. Time and staffing resources are always in short supply, so we plan to prioritise the feedback we give to publishers, in order for them to work on correcting the most important and fixable issues. With new GBIF projects like the vocabulary server, we also hope that GBIF data processing can have more community participation.
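To make the proposed flags concrete, the sketch below shows the logic of two of the "easy" checks applied to a single Darwin Core-style record. It is not GBIF's implementation (the pipelines are written in Java), and the flag names are illustrative.

```python
# Illustrative sketch of record-level checks corresponding to the proposed
# "any zero coordinate", "default coordinate uncertainty" and
# "null coordinate uncertainty" flags. Not GBIF's actual pipeline code.
SUSPICIOUS_UNCERTAINTY_DEFAULTS = {301.0, 3036.0, 999.0, 9999.0}

def flag_record(record: dict) -> list:
    """Return data-quality flags raised for a single occurrence record."""
    flags = []
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat == 0 or lon == 0:
        flags.append("ZERO_COORDINATE")
    uncertainty = record.get("coordinateUncertaintyInMeters")
    if uncertainty is None:
        flags.append("NULL_COORDINATE_UNCERTAINTY")
    elif float(uncertainty) in SUSPICIOUS_UNCERTAINTY_DEFAULTS:
        flags.append("DEFAULT_COORDINATE_UNCERTAINTY")
    return flags

# flag_record({"decimalLatitude": 0, "decimalLongitude": 23.5})
# -> ['ZERO_COORDINATE', 'NULL_COORDINATE_UNCERTAINTY']
```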


Author(s):  
Marcus Guidoti ◽  
Carolina Sokolowicz ◽  
Felipe Simoes ◽  
Valdenar Gonçalves ◽  
Tatiana Ruschel ◽  
...  

Plazi's TreatmentBank is a research infrastructure and partner of the recent European Union-funded Biodiversity Community Integrated Knowledge Library (BiCIKL) project to provide a single knowledge portal to open, interlinked and machine-readable, findable, accessible, interoperable and reusable (FAIR) data. Plazi is liberating published biodiversity data that is trapped in so-called flat formats, such as the portable document format (PDF), to increase its FAIRness. This can pose a variety of challenges for both data mining and curation of the extracted data. The automation of such a complex process requires internal organization and a well established workflow of specific steps (e.g., decoding of the PDF, extraction of data) to handle the challenges that the immense variety of graphic layouts in the biodiversity publishing landscape can impose. These challenges may vary according to the origin of the document: scanned documents that were not initially digital need optical character recognition in order to be processed. Processing a document can either be an individual, one-time-only process, or a batch process, in which a template for a specific document type must be produced. Templates consist of a set of parameters that tell Plazi-dedicated software how to read a document and where to find key pieces of information for the extraction process, such as the related metadata. These parameters aim to improve the outcome of the data extraction process and lead to more consistent results than manual extraction. In order to produce such templates, a set of tests and accompanying statistics is evaluated, and these same statistics are constantly checked against ongoing processing tasks in order to assess template performance in a continuous manner. In addition to these steps that are intrinsically associated with the automated process, different granularity levels (e.g., a low granularity level might consist of a treatment and its subsections, versus a high granularity level that includes material citations down to named entities such as collection codes, collectors and collecting dates) were defined to accommodate the specific needs of particular projects and user requirements. The higher the granularity level, the more thoroughly checked the resulting data is expected to be. Additionally, steps related to quality control (qc), such as the "pre-qc", "qc" and "extended qc" stages, were designed and implemented to ensure data quality and enhanced data accuracy. Data on all these different stages of the processing workflow are constantly being collected and assessed in order to improve these very same stages, aiming for a more reliable and efficient operation. This is also associated with a current Data Architecture plan to move this data assessment to a cloud provider to promote real-time assessment and constant analysis of template performance and of the processing stages as a whole. In this talk, the steps of this entire process are explained in detail, highlighting how data are being used to improve these steps towards a more efficient, accurate, and less costly operation.
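To illustrate what such a template might contain, the sketch below shows a purely hypothetical parameter set for one journal layout. The field names are invented for illustration and do not reflect Plazi's actual template format.

```python
# Hypothetical sketch of a batch-processing "template": parameters telling the
# extraction software where to find key pieces of information for a given
# journal layout. All keys and values are illustrative assumptions.
journal_template = {
    "journal": "Example Journal of Taxonomy",
    "requires_ocr": False,              # scanned, non-digital PDFs would set this to True
    "metadata": {
        "title_region": {"page": 1, "font_size_min": 14},
        "authors_follow_title": True,
    },
    "treatments": {
        "start_pattern": r"^[A-Z][a-z]+ [a-z]+ (sp\. nov\.|comb\. nov\.)?",
        "subsections": ["Materials examined", "Diagnosis", "Description"],
    },
    "granularity": "high",              # e.g., down to material citations and named entities
    "quality_control": ["pre-qc", "qc", "extended qc"],
}
```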

