Unity in Variety: Developing a collection description standard by consensus

Author(s):  
Matt Woodburn ◽  
Deborah L Paul ◽  
Wouter Addink ◽  
Steven J Baskauf ◽  
Stanley Blum ◽  
...  

Digitisation and publication of museum specimen data is happening worldwide, but is far from complete. Museums can start by sharing what they know about their holdings at a higher level, long before each object has its own record. Information about what is held in collections worldwide is needed by many stakeholders, including collections managers, funders, researchers, policy-makers, industry, and educators. To aggregate this information from collections, the data need to be standardised (Johnston and Robinson 2002). So, the Biodiversity Information Standards (TDWG) Collection Descriptions (CD) Task Group is developing a data standard for describing collections, which will make it possible to provide: automated metrics, using standardised collection descriptions and/or data derived from specimen datasets (e.g., counts of specimens), and a global registry of physical collections (whether digitised or not). Outputs will include a data model to underpin the new standard, and guidance and reference implementations for the practical use of the standard in institutional and collaborative data infrastructures. The Task Group employs a community-driven approach to standard development. With international participation, workshops at the Natural History Museum (London 2019) and the MOBILISE workshop (Warsaw 2020) allowed over 50 people to contribute to this work. Our group organised online "barbecues" (BBQs) so that many more could contribute to standard definitions and address data model design challenges. Cloud-based tools (e.g., GitHub, Google Sheets) are used to organise and publish the group's work and make it easy to participate. A Wikibase instance is also used to test and demonstrate the model using real data. There is a range of global, regional, and national initiatives interested in the standard (see the Task Group charter). Some, like GRSciColl (now at the Global Biodiversity Information Facility (GBIF)), Index Herbariorum (IH), and the iDigBio US Collections List, are existing catalogues. Others, including the Consortium of European Taxonomic Facilities (CETAF) and the Distributed System of Scientific Collections (DiSSCo), include collection descriptions as a key part of their near-term development plans. As part of the EU-funded SYNTHESYS+ project, GBIF organised a virtual workshop, Advancing the Catalogue of the World's Natural History Collections, to gather international input for such a resource, which would use this CD standard. Some major complexities present themselves in designing a standardised approach to representing collection description data. It is not the first time that the natural science collections community has tried to address them (see the TDWG Natural Collections Description standard). Beyond the natural sciences, the library community in particular has given thought to this (Heaney 2001, Johnston and Robinson 2002), noting significant difficulties. One hurdle is that collections may be broken down into different degrees of granularity according to different criteria, and may also overlap, so that a single object can be represented in more than one collection description. Managing statistics such as numbers of objects is complex due to data gaps and variable degrees of certainty about collection contents.
It also takes considerable effort from collections staff to generate structured data about their undigitised holdings. We need to support simple, high-level collection summaries as well as detailed quantitative data, and to be able to update them as needed. We need a simple approach, but one that can also handle the complexities of data, scope, and social needs, for digitised and undigitised collections. The data standard itself is a defined set of classes and properties that can be used to represent groups of collection objects and their associated information. These incorporate common characteristics ('dimensions') by which we want to describe, group and break down our collections, metrics for quantifying those collections, and properties such as persistent identifiers for tracking collections and managing their digital counterparts. Existing terms from other standards (e.g., Darwin Core, ABCD) are re-used where possible. The data model (Fig. 1) underpinning the standard defines the relationships between those different classes, and ensures that the structure as well as the content are comparable across different datasets. It centres around the core concept of an 'object group', representing a set of physical objects that is defined by one or more dimensions (e.g., taxonomy and geographic origin) and linked to other entities such as the holding institution. Quantitative data about the group's contents (e.g., counts of objects or taxa) are attached to the object group, along with more qualitative information describing the contents of the group as a whole. In this presentation, we will describe the draft standard and data model, with examples of early adoption for real-world and example data. We will also discuss the vision of how the new standard may be adopted and its potential impact on collection discoverability across the collections community.
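To make the model concrete, here is a minimal sketch (in Python) of how an 'object group' defined by taxonomic and geographic dimensions, with attached quantitative metrics, might be represented. The field names and identifiers are illustrative placeholders, not the normative terms of the draft CD standard.

```python
# A minimal, hypothetical sketch of an 'object group' record as described above.
# Field names and identifiers are illustrative, not the normative terms of the draft CD standard.

object_group = {
    "objectGroupId": "https://example.org/collection/entomology-uk-lepidoptera",  # placeholder PID
    "holdingInstitution": "https://ror.org/example",  # placeholder link to the holding institution
    "description": "Pinned Lepidoptera collected in the United Kingdom",
    # 'Dimensions' used to define the group (e.g. taxonomy and geographic origin)
    "dimensions": {
        "taxon": "Lepidoptera",
        "geographicOrigin": "United Kingdom",
    },
    # Quantitative metrics attached to the group
    "metrics": [
        {"measurementType": "objectCount", "value": 120000, "accuracy": "estimated"},
        {"measurementType": "taxonCount", "value": 2500, "accuracy": "estimated"},
    ],
}

def total_objects(groups):
    """Sum the (possibly estimated) object counts across a set of object groups."""
    return sum(
        metric["value"]
        for group in groups
        for metric in group["metrics"]
        if metric["measurementType"] == "objectCount"
    )

print(total_objects([object_group]))  # -> 120000
```

A real implementation would use the standard's own class and property names and persistent identifiers in place of the placeholders above.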

Author(s):  
Matt Woodburn ◽  
Deborah L Paul ◽  
William Ulate ◽  
Niels Raes

Aggregating the content of museum and scientific collections worldwide offers us the opportunity to realise a virtual museum of our planet and the life upon it through space and time. By mapping specimen-level data records to standards and publishing this information, an increasing number of collections contribute to a digitally accessible wealth of knowledge. Visualising these digital records by parameters such as collection type and geographic origin helps collections and institutions to better understand their digital holdings and compare them to other such collections, as well as enabling researchers to find specimens and specimen data quickly (Singer et al. 2018). At the higher level of collections, related people and their activities, and especially the great majority of material that is yet to be digitised, we know much less. Many collections hold material not yet digitally discoverable in any form. For those that do publish collection-level data, it is commonly text-based data without the Globally Unique Identifiers (GUIDs) or the controlled vocabularies that would support quantitative collection metrics and aid discovery of related expertise and publications. To best understand and plan for our world's bio- and geodiversity represented in collections, we need standardised, quantitative collection-level metadata. Various groups planet-wide are actively developing tools to capture this much-needed metadata, including information about the backlog, and more detailed information about institutions and their activities (e.g., staffing, space, species-level inventories, geographic and taxonomic expertise, and related publications) (Smith et al. 2018). The Biodiversity Information Standards organization (TDWG) Collection Descriptions (CD) Data Standard Task Group aims to provide a data standard for describing natural scientific collections, which will make it possible to provide: automated metrics, using standardised collection descriptions and/or data derived from specimen datasets (e.g., counts of specimens), and a global registry of physical collections (either digitised or non-digitised). The group will also produce a data model to underpin the new standard, and provide guidance and reference implementations for the practical use of the standard in institutional and collaborative data infrastructures. Our task group includes members from a myriad of groups with a stake in mobilising such data at local, regional, domain-specific and global levels. With such a standard adopted, it will be possible to effectively share data across different community resources. So far, we have carried out landscape analyses of existing collection description frameworks, and amassed a portfolio of use cases from the group as well as from a range of other sources, including the Collection Descriptions Dashboard working group of ICEDIG ("Innovation and consolidation for large scale digitisation of natural heritage"), iDigBio (Integrated Digitized Biocollections), the Smithsonian, Index Herbariorum, the Field Museum, GBIF (Global Biodiversity Information Facility), GRBio (Global Registry of Biodiversity Repositories) and fishfindR.net. These were used to develop a draft data model, and together they inform the first iteration of the draft CD data standard.
A variety of challenges present themselves in developing this standard. Some relate to the standard development process itself, such as identifying (and often learning) effective tools and methods for collaborative working and communication across globally distributed volunteers. Others concern defining the scope and gaining consensus from stakeholders across a wide range of disciplines, while maintaining achievable goals. Further challenges arise from the requirement to develop a data model and standard that support such a variety of use cases and priorities, while retaining interoperability and manageability of the data. We will present some of these challenges and methods for addressing them, and summarise the progress and draft outputs of the group so far. We will also discuss the vision of how the new standard may be adopted and its potential impact on collection discoverability across the natural science collections community.
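As an illustration of the 'automated metrics' use case mentioned above, the following sketch derives simple collection-level counts from specimen-level records. The records are invented; 'family' and 'country' are Darwin Core terms, but the aggregation itself is just an example and not part of the standard.

```python
# Illustrative only: deriving collection-level counts from specimen-level records.
# 'family' and 'country' are Darwin Core terms; the records themselves are made up.
from collections import Counter

specimen_records = [
    {"family": "Apidae", "country": "Belgium"},
    {"family": "Apidae", "country": "Belgium"},
    {"family": "Orchidaceae", "country": "Brazil"},
]

# Count digitised specimens per (family, country) 'dimension' combination
metrics = Counter((rec["family"], rec["country"]) for rec in specimen_records)

for (family, country), count in metrics.items():
    print(f"{family} / {country}: {count} digitised specimens")
```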


Author(s):  
Katharine Barker ◽  
Jonas Astrin ◽  
Gabriele Droege ◽  
Jonathan Coddington ◽  
Ole Seberg

Most successful research programs depend on easily accessible and standardized research infrastructures. Until recently, access to tissue or DNA samples with standardized metadata and of a sufficiently high quality has been a major bottleneck for genomic research. The Global Genome Biodiversity Network (GGBN) fills this critical gap by offering standardized, legal access to samples. Presently, GGBN's core activity is enabling access to searchable DNA and tissue collections across natural history museums and botanic gardens. Activities are gradually being expanded to encompass all kinds of biodiversity biobanks, such as culture collections, zoological gardens, aquaria, arboreta, and environmental biobanks. Broadly speaking, these collections all provide long-term storage and standardized public access to samples useful for molecular research. GGBN facilitates sample search and discovery for its distributed member collections through a single entry point. It stores standardized information on mostly geo-referenced, vouchered samples, their physical location, availability, quality, and the necessary legal information on over 50,000 species of Earth's biodiversity, from unicellular to multicellular organisms. The GGBN Data Portal and the GGBN Data Standard are complementary to existing infrastructures such as the Global Biodiversity Information Facility (GBIF) and the International Nucleotide Sequence Database Collaboration (INSDC). Today, many well-known open-source collection management databases, such as Arctos, Specify, and Symbiota, are implementing the GGBN data standard. GGBN continues to increase its collections strategically, based on the needs of the research community, adding over 1.3 million online records in 2018 alone; today, two million sample records are available through GGBN. Together with the Consortium of European Taxonomic Facilities (CETAF), the Society for the Preservation of Natural History Collections (SPNHC), Biodiversity Information Standards (TDWG), and Synthesis of Systematic Resources (SYNTHESYS+), GGBN provides best practices for biorepositories on meeting the requirements of the Nagoya Protocol on Access and Benefit Sharing (ABS). Through collaboration with the Biodiversity Heritage Library (BHL), GGBN is exploring options for tagging publications that reference GGBN collections and associated specimens, made searchable through GGBN's document library. Through its collaborative efforts, standards, and best practices, GGBN aims to facilitate trust and transparency in the use of genetic resources.
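For illustration only, the sketch below shows the kind of sample-level information described above (voucher, geo-reference, physical location, availability, quality and legal/ABS details) as a simple record, plus a trivial availability filter standing in for a portal-style search. The field names are placeholders and do not reproduce the GGBN Data Standard vocabulary.

```python
# Hypothetical sketch of a biobank sample record carrying the kinds of information
# described above (voucher, geo-reference, location, availability, quality, legal/ABS details).
# Field names are placeholders and are NOT the normative GGBN Data Standard terms.

sample = {
    "sampleId": "DNA-0001",                               # placeholder identifier
    "sampleType": "DNA",
    "voucherSpecimen": "Herbarium sheet EX-0000001",      # made-up voucher reference
    "scientificName": "Arabidopsis thaliana",
    "decimalLatitude": 52.45,                             # geo-reference of the source material
    "decimalLongitude": 13.30,
    "physicalLocation": "Freezer 12, rack 3",
    "availability": "available on request",
    "qualityMeasure": {"dnaConcentrationNgPerUl": 35.0},
    "legal": {
        "nagoyaRelevant": True,
        "permitStatus": "permit available",               # ABS / permit documentation status
    },
}

def available_samples(samples):
    """Simple availability filter, standing in for a portal-style search."""
    return [s for s in samples if s["availability"].startswith("available")]

print(len(available_samples([sample])))  # -> 1
```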


Author(s):  
Matt Woodburn ◽  
Gabriele Droege ◽  
Sharon Grant ◽  
Quentin Groom ◽  
Janeen Jones ◽  
...  

The utopian vision is of a future where a digital representation of each object in our collections is accessible through the internet and sustainably linked to other digital resources. This is a long-term goal, however, and in the meantime there is an urgent need to share data about our collections at a higher level with a range of stakeholders (Woodburn et al. 2020). To sustainably achieve this, and to aggregate this information across all natural science collections, the data need to be standardised (Johnston and Robinson 2002). To this end, the Biodiversity Information Standards (TDWG) Collection Descriptions (CD) Interest Group has developed a data standard for describing collections, which is approaching formal review for ratification as a new TDWG standard. It proposes 20 classes (Suppl. material 1) and over 100 properties that can be used to describe, categorise, quantify, link and track digital representations of natural science collections, from high-level approximations to detailed breakdowns, depending on the purpose of a particular implementation. The wide range of use cases identified for representing collection description data means that a flexible approach to the standard and the underlying modelling concepts is essential. These are centred around the 'ObjectGroup' (Fig. 1), a class that may represent any group (of any size) of physical collection objects which share one or more common characteristics. This generic definition of the 'collection' in 'collection descriptions' is an important factor in making the standard flexible enough to support the breadth of use cases. For any use case or implementation, only a subset of classes and properties within the standard is likely to be relevant. In some cases, this subset may have little overlap with those selected for other use cases. This additional need for flexibility means that very few classes and properties, representing the core concepts, are proposed to be mandatory. Metrics, facts and narratives are represented in a normalised structure using an extended MeasurementOrFact class, so that these can be user-defined rather than constrained to a set identified by the standard. Finally, rather than a rigid underlying data model forming part of the normative standard, documentation will be developed to provide guidance on how the classes in the standard may be related and quantified according to relational, dimensional and graph-like models. So, in summary, the standard has, by design, been made flexible enough to be used in a number of different ways. The corresponding risk is that it could be used in ways that may not deliver what is needed in terms of outputs, manageability and interoperability with other resources of collection-level or object-level data. To mitigate this, it is key for any new implementer of the standard to establish how it should be used in that particular instance, and to define any necessary constraints within the wider scope of the standard and model. This is the concept of the 'collection description scheme': a profile that defines elements such as which classes and properties should be included, which should be mandatory, and which should be repeatable; which controlled vocabularies and hierarchies should be used to make the data interoperable; and how the collections should be broken down into individual ObjectGroups and interlinked, and how the various classes should be related to each other.
Various factors might influence these decisions, including the types of information that are relevant to the use case, whether quantitative metrics need to be captured and aggregated across collection descriptions, and how many resources can be dedicated to amassing and maintaining the data. This process has particular relevance to the Distributed System of Scientific Collections (DiSSCo) consortium, the design of which incorporates use cases for storing, interlinking and reporting on the collections of its member institutions. These include helping users of the European Loans and Visits System (ELViS) (Islam 2020) to discover specimens for physical and digital loans by providing descriptions and breakdowns of the collections of holding institutions, and monitoring digitisation progress across European collections through a dynamic Collections Digitisation Dashboard. In addition, DiSSCo will be part of a global collections data ecosystem requiring interoperation with other infrastructures such as the GBIF (Global Biodiversity Information Facility) Registry of Scientific Collections, the CETAF (Consortium of European Taxonomic Facilities) Registry of Collections and Index Herbariorum. In this presentation, we will introduce the draft standard and discuss the process of defining new collection description schemes using the standard and data model, focusing on DiSSCo requirements as examples of real-world collection description use cases.
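A 'collection description scheme' as described above is essentially a profile over the standard, and can be sketched as a machine-readable object together with a validation step. The example below is hypothetical; the class, property and vocabulary names are placeholders rather than terms from the standard.

```python
# Hypothetical sketch of a 'collection description scheme': a profile that constrains
# which classes/properties are used, which are mandatory or repeatable, and which
# controlled vocabularies apply. Names are illustrative, not normative.

scheme = {
    "classes": {
        "ObjectGroup": {
            "mandatory": ["objectGroupId", "holdingInstitution", "preservationMethod"],
            "repeatable": ["metrics"],
            "vocabularies": {
                # property -> controlled list used to keep the data interoperable
                "preservationMethod": ["pinned", "dried", "fluid-preserved", "frozen"],
            },
        },
    },
}

def validate(record, scheme, cls="ObjectGroup"):
    """Return a list of problems with a record, judged against the scheme."""
    rules = scheme["classes"][cls]
    problems = [f"missing mandatory property: {p}"
                for p in rules["mandatory"] if p not in record]
    for prop, allowed in rules["vocabularies"].items():
        if prop in record and record[prop] not in allowed:
            problems.append(f"{prop} not in controlled vocabulary: {record[prop]}")
    return problems

record = {"objectGroupId": "og-1", "holdingInstitution": "inst-1", "preservationMethod": "pickled"}
print(validate(record, scheme))  # -> ['preservationMethod not in controlled vocabulary: pickled']
```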


Author(s):  
David Fichtmüller ◽  
Fabian Reimeier ◽  
Anton Güntsch

In the ABCD 3.0 project, the ABCD (Access to Biological Collection Data) Standard (Access to Biological Collections Data task group 2007) was transformed from a classic XML Schema into an OWL (Web Ontology Language) ontology (alongside an updated, semantic-aware XML version). While it was initially planned to use the established TDWG Terms wiki as the editing and development platform for the ABCD ontology, the rise of Wikidata and its underlying platform, Wikibase, caused us to reconsider this decision and switch to a Wikibase installation instead. This proved to be a crucial decision, as Wikibase turned out to be a well-suited platform for collaboratively importing, developing and exporting this complex semantic standard. This experience is potentially of interest to maintainers of other Biodiversity Information Standards (TDWG) standards and to the Technical Architecture Group. In this presentation we will explain our technical setup and how we used Wikibase, alongside its related tools, to model the ABCD Ontology. We will introduce the tools we used for importing existing concepts from the previous ABCD versions, running maintenance queries (e.g., checking the ontology for consistency or for missing information about concepts), and exporting the ontology into the OWL/XML format. Finally, we will discuss the lessons we learned and how our setup can be improved for future uses.
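As an example of the kind of maintenance query mentioned above, the sketch below asks a Wikibase SPARQL endpoint for items that have an English label but no English description. The endpoint URL is a placeholder, and the query is illustrative rather than one taken from the ABCD 3.0 project's own tooling.

```python
# Illustrative maintenance query against a Wikibase SPARQL endpoint:
# list items that have an English label but no English description.
# The endpoint URL is a placeholder; the query is an example, not ABCD 3.0 project tooling.
import requests

ENDPOINT = "https://example-wikibase.org/query/sparql"  # placeholder endpoint

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
SELECT ?item ?label WHERE {
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  FILTER NOT EXISTS { ?item schema:description ?d . FILTER(LANG(?d) = "en") }
}
LIMIT 50
"""

response = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"}, timeout=60)
for row in response.json()["results"]["bindings"]:
    # Print the entity IRI and its English label for follow-up editing
    print(row["item"]["value"], row["label"]["value"])
```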


Author(s):  
Matt Woodburn ◽  
Sarah Vincent ◽  
Helen Hardy ◽  
Clare Valentine

The natural science collections community has identified an increasing need for shared, structured and interoperable data standards that can be used to describe the totality of institutional collection holdings, whether digitised or not. Major international initiatives, including the Global Biodiversity Information Facility (GBIF), the Distributed System of Scientific Collections (DiSSCo) and the Consortium of European Taxonomic Facilities (CETAF), consider the current lack of standards to be a major barrier, which must be overcome to further their strategic aims and contribute to an open, discoverable catalogue of global collections. The Biodiversity Information Standards (TDWG) Collection Descriptions (CD) group is looking to address this issue with a new data standard for collection descriptions. At an institutional level, this concept of collection descriptions aligns strongly with the need for a structured, more data-driven approach to assessing and working with collections, both to identify and prioritise investment and effort, and to monitor the impact of the work. Use cases include planning conservation and collection moves, prioritising specimen digitisation activities, and informing collection development strategy. The data can be integrated with the collection description framework for ongoing assessments of the state of the collection. This approach was pioneered with the 'Move the Dots' methodology by the Smithsonian National Museum of Natural History, started in 2009 and run annually since. The collection is broken down into several hundred discrete subcollections, for each of which the number of objects is estimated and a numeric rank allocated according to a range of assessment criteria. This method has since been adopted by several other institutions, including Naturalis Biodiversity Center, the Museum für Naturkunde and the Natural History Museum, London (NHM). First piloted in 2016, and now implemented as a core framework, the NHM's adaptation, 'Join the Dots', divides the collection into approximately 2,600 'collection units'. The breakdown uses formal controlled lists and hierarchies, primarily taxonomy, type of object, storage location and (where relevant) stratigraphy, which are mapped to external authorities such as the Catalogue of Life and the Paleobiology Database. The collection breakdown is enhanced with estimates of the number of items, and ranks from 1 to 5 for each collection unit against 17 different criteria. These are grouped into four categories: 'Condition', 'Information' (including digital records), 'Importance and Significance' and 'Outreach'. Although it requires a significant time investment from collections staff to provide the estimates and assessments, this methodology has yielded a rich dataset that supports both discoverability (collection descriptions) and management (collection assessment). Links to further datasets about the building infrastructure and environmental conditions also make it a powerful resource for planning activities such as collection moves, pest monitoring and building work. We have developed dynamic dashboards to provide rich visualisations for exploring, analysing and communicating the data. As an ongoing, embedded activity for collections staff, there will also be a build-up of historical data going forward, enabling us to see trends, track changes to the collection, and measure the impact of projects and events.
The concept of Join the Dots also offers a generic, institution-agnostic model for enhancing the collection description framework with additional metrics that add value for strategic management and resourcing of the collection. In the design and implementation, we have faced challenges that should be highly relevant to the TDWG CD group, such as managing the dynamic breakdown of collections across multiple dimensions. We also face some that are yet to be resolved, such as the need for a robust model for managing the evolving dataset over time. We intend to contribute these use cases to the development of the new TDWG data standard and to be an early adopter and reference case. We envisage that this could constitute a common model that, where resources are available, provides the ability to add greater depth and utility to the world catalogue of collections.
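To illustrate how such assessment data can feed dashboards, the sketch below aggregates hypothetical 'collection units' with item estimates and 1-5 scores into a simple summary. The criterion names, groupings and figures are placeholders and do not reproduce the actual Join the Dots scheme.

```python
# Hypothetical aggregation of 'Join the Dots'-style assessments:
# each collection unit carries an estimated item count and 1-5 scores per criterion.
# Unit names, criteria and figures are placeholders, not the NHM's actual scheme.
from statistics import mean

collection_units = [
    {
        "unit": "British Lepidoptera",
        "estimatedItems": 500_000,
        "scores": {"condition": 4, "digitalRecords": 2, "significance": 5, "outreach": 3},
    },
    {
        "unit": "Palaeozoic corals",
        "estimatedItems": 80_000,
        "scores": {"condition": 3, "digitalRecords": 1, "significance": 4, "outreach": 2},
    },
]

# Dashboard-style summary: mean score per criterion and total estimated items
criteria = collection_units[0]["scores"].keys()
summary = {c: round(mean(u["scores"][c] for u in collection_units), 2) for c in criteria}
summary["totalEstimatedItems"] = sum(u["estimatedItems"] for u in collection_units)

print(summary)
```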


Author(s):  
Mathias Dillen ◽  
Elspeth Haston ◽  
Nicole Kearney ◽  
Deborah L Paul ◽  
Joaquim Santos ◽  
...  

The natural history specimens of the world have been documented on paper labels, often physically attached to the specimen itself. As we transcribe these data to make them digital and more useful for analysis, we make interpretations. Sometimes these interpretations are trivial, because the label is unambiguous, but often the meaning is not so clear, even if it is easily read. One key element that suffers from considerable ambiguity is people's names. Though a person is indivisible, their name can change, is rarely unique and can be written in many ways. Yet knowing the people associated with data is incredibly useful. Data on people can be used to validate other data, simplify data capture, link together data across domains, reduce duplication of effort and facilitate data-gap analysis. In addition, people data enable the discovery of individuals unique to our collections, the collective charting of the history of scientific researchers and the provision of credit to the people who deserve it (Groom et al. 2020). We foresee a future where the people associated with collections are not ambiguous, are shared globally, and data of all kinds are linked through the people who generate them. The TDWG People in Biodiversity Data Task Group is therefore working on a guide to the disambiguation of people in natural history collections. The ultimate goal is to connect the various strings of characters on specimen labels and other documentation to persistent identifiers (PIDs) that unambiguously link a name "string" to the identity of a person. In working towards this goal, 150 volunteers in the Bionomia project have linked 21 million specimens to persistent identifiers for their collectors and determiners. An additional 2 million specimens with links to identifiers for people have already emerged directly from collections that make use of the recently ratified Darwin Core terms recordedByID and identifiedByID. Furthermore, the CETAF Botany Pilot, conducted among a group of European herbaria and museums, has connected over 1.4 million specimens to disambiguated collectors (Güntsch et al. 2021). Still, given the estimated 2 billion (Ariño 2010) natural history specimens globally, there is much more disambiguation to be done. The process of disambiguation starts with a trigger, which is often the transcription of a specimen's label data. Unambiguous identification of the collector may facilitate this transcription, as it offers knowledge of their biographical details and collecting habits, allowing us to infer missing information such as collecting date or locality. Another trigger might be the flagging of inconsistent data during data entry or by data quality processes, revealing for instance that multiple collectors have been conflated. A disambiguation trigger is followed by the gathering of data, then the evaluation of the results and finally the documentation of the new information. Disambiguation is not always straightforward and there are many pitfalls. It requires access to biographical data, and it requires identifiers to be minted. In the case of living people, they have to cooperate with being disambiguated, and we have to follow legal and ethical guidelines. In the case of dead people, particularly those long dead, disambiguation may require considerable research. We will present the progress made by the People in Biodiversity Data Task Group and their recommendations for disambiguation in collections.
We want to encourage other institutions to engage in a global effort to link people to persistent identifiers, and so to collaboratively improve all collection data.
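As a small illustration of the linking described above, the sketch below shows a specimen record that carries both the verbatim name strings and persistent identifiers using the ratified Darwin Core terms recordedByID and identifiedByID, with a helper to split the pipe-delimited identifier lists. The identifiers and names are made up for the example.

```python
# Illustration of linking people on specimen records to persistent identifiers.
# recordedBy/recordedByID and identifiedBy/identifiedByID are Darwin Core terms;
# the identifiers and names below are made up for the example.

occurrence = {
    "catalogNumber": "BR0000001234",
    "recordedBy": "E. Durand",                                  # verbatim collector string from the label
    "recordedByID": "https://orcid.org/0000-0000-0000-0000",    # placeholder PID
    "identifiedBy": "J. Santos | M. Dillen",
    "identifiedByID": "https://www.wikidata.org/wiki/Q00000001 | https://orcid.org/0000-0000-0000-0001",
}

def person_ids(record, field="recordedByID"):
    """Split a pipe-delimited Darwin Core ID field into a list of identifiers."""
    value = record.get(field, "")
    return [pid.strip() for pid in value.split("|") if pid.strip()]

print(person_ids(occurrence, "identifiedByID"))
```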


Author(s):  
Peter C. G. Veenstra

The Pipeline Open Data Standard (PODS) Association develops and advances global pipeline data standards and best practices supporting data management and reporting for the oil and gas industry. This presentation provides an overview of the PODS Association and a detailed overview of the transformed PODS Pipeline Data Model resulting from the PODS Next Generation initiative. The PODS Association's Next Generation, or Next Gen, initiative is focused on a complete re-design and modernization of the PODS Pipeline Data Model. The re-design is driven by the PODS Association Strategy objectives as defined in its 2016-2019 Strategic Plan, and reflects nearly 20 years of PODS Pipeline Data Model implementation experience and lessons learned. The Next Gen Data Model is designed to be the system of record for pipeline centerlines and pressurized containment assets for the safe transport of product, allowing pipeline operators to:
• achieve greater agility in building and extending the data model,
• respond to new business requirements,
• interoperate through standard data models and a consistent application interface,
• share data within and between organizations using well-defined data exchange specifications,
• optimize performance for management of bulk loading, reroutes, inspection data and history.
The presentation will introduce the Next Gen Data Model design principles and its conceptual, logical and physical structures, with a focus on transformational changes from prior versions of the Model. Support for multiple platforms, including but not limited to Esri ArcGIS, open-source GIS and relational database management systems, will be described. Alignment with Esri's ArcGIS Platform and ArcGIS for Pipeline Referencing (APR) will be a main topic of discussion, along with how PODS Next Gen can be leveraged to benefit pipeline integrity, risk assessment, reporting and data maintenance. The end goal of a PODS implementation is the realization of efficient data management, transfer and exchange, making the operation of a pipeline safer and more cost effective.
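As a rough illustration of the linear referencing concept at the core of pipeline data models like this one (locating assets and events by measure along a centerline), the sketch below is a hypothetical example in Python; it does not reproduce PODS Next Gen structures or naming.

```python
# Minimal, hypothetical sketch of linear referencing along a pipeline centerline:
# events (e.g. valves, inspection anomalies) are located by measure along a route.
# Illustrative only; this does not reproduce the PODS Next Gen model or its naming.

centerline = {
    "routeId": "PIPE-001",
    "lengthMeters": 12_000.0,
}

events = [
    {"routeId": "PIPE-001", "eventType": "valve", "fromMeasure": 3_500.0, "toMeasure": 3_500.0},
    {"routeId": "PIPE-001", "eventType": "inline inspection anomaly", "fromMeasure": 7_210.0, "toMeasure": 7_214.5},
]

def events_in_segment(events, route_id, start, end):
    """Return events whose measures overlap a segment of the given route."""
    return [
        e for e in events
        if e["routeId"] == route_id and e["toMeasure"] >= start and e["fromMeasure"] <= end
    ]

print(events_in_segment(events, "PIPE-001", 7_000.0, 8_000.0))
```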


