A Data Standard for Dynamic Collection Descriptions

Author(s):  
Matt Woodburn ◽  
Gabriele Droege ◽  
Sharon Grant ◽  
Quentin Groom ◽  
Janeen Jones ◽  
...  

The utopian vision is of a future where a digital representation of each object in our collections is accessible through the internet and sustainably linked to other digital resources. This is a long-term goal, however, and in the meantime there is an urgent need to share data about our collections at a higher level with a range of stakeholders (Woodburn et al. 2020). To achieve this sustainably, and to aggregate this information across all natural science collections, the data need to be standardised (Johnston and Robinson 2002). To this end, the Biodiversity Information Standards (TDWG) Collection Descriptions (CD) Interest Group has developed a data standard for describing collections, which is approaching formal review for ratification as a new TDWG standard. It proposes 20 classes (Suppl. material 1) and over 100 properties that can be used to describe, categorise, quantify, link and track digital representations of natural science collections, from high-level approximations to detailed breakdowns, depending on the purpose of a particular implementation.

The wide range of use cases identified for representing collection description data means that a flexible approach to the standard and the underlying modelling concepts is essential. These are centred around the ‘ObjectGroup’ (Fig. 1), a class that may represent any group (of any size) of physical collection objects which have one or more common characteristics. This generic definition of the ‘collection’ in ‘collection descriptions’ is an important factor in making the standard flexible enough to support the breadth of use cases. For any use case or implementation, only a subset of classes and properties within the standard is likely to be relevant, and in some cases this subset may have little overlap with those selected for other use cases. This additional need for flexibility means that very few classes and properties, representing the core concepts, are proposed to be mandatory. Metrics, facts and narratives are represented in a normalised structure using an extended MeasurementOrFact class, so that these can be user-defined rather than constrained to a set identified by the standard. Finally, rather than prescribing a rigid underlying data model as part of the normative standard, documentation will be developed to provide guidance on how the classes in the standard may be related and quantified according to relational, dimensional and graph-like models.

In summary, the standard has, by design, been made flexible enough to be used in a number of different ways. The corresponding risk is that it could be used in ways that may not deliver what is needed in terms of outputs, manageability and interoperability with other resources of collection-level or object-level data. To mitigate this, it is key for any new implementer of the standard to establish how it should be used in that particular instance, and to define any necessary constraints within the wider scope of the standard and model. This is the concept of the ‘collection description scheme,’ a profile that defines elements such as:
which classes and properties should be included, which should be mandatory, and which should be repeatable; which controlled vocabularies and hierarchies should be used to make the data interoperable; how the collections should be broken down into individual ObjectGroups and interlinked; and how the various classes should be related to each other. Various factors might influence these decisions, including the types of information that are relevant to the use case, whether quantitative metrics need to be captured and aggregated across collection descriptions, and how many resources can be dedicated to amassing and maintaining the data. This process has particular relevance to the Distributed System of Scientific Collections (DiSSCo) consortium, the design of which incorporates use cases for storing, interlinking and reporting on the collections of its member institutions. These include helping users of the European Loans and Visits System (ELViS) (Islam 2020) to discover specimens for physical and digital loans by providing descriptions and breakdowns of the collections of holding institutions, and monitoring digitisation progress across European collections through a dynamic Collections Digitisation Dashboard. In addition, DiSSCo will be part of a global collections data ecosystem requiring interoperation with other infrastructures such as the GBIF (Global Biodiversity Information Facility) Registry of Scientific Collections, the CETAF (Consortium of European Taxonomic Facilities) Registry of Collections and Index Herbariorum. In this presentation, we will introduce the draft standard, discuss the process of defining new collection description schemes using the standard and data model, and focus on DiSSCo requirements as examples of real-world collection description use cases.
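
To make the scheme idea concrete, here is a minimal sketch in Python of how an implementer might represent an ObjectGroup record and check it against a simple collection description scheme. The class names ObjectGroup and MeasurementOrFact come from the draft standard, but the specific field names, controlled vocabulary and example values below are illustrative assumptions, not normative terms.

```python
# Minimal sketch (not the normative TDWG CD model) of an ObjectGroup record
# plus a lightweight collection description scheme. Field names and the
# example vocabulary are assumptions made for illustration only.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MeasurementOrFact:
    """A user-defined metric, fact or narrative attached to an ObjectGroup."""
    measurement_type: str           # e.g. "objectCount"
    value: str                      # strings allow facts/narratives as well as numbers
    unit: Optional[str] = None

@dataclass
class ObjectGroup:
    """Any group of physical collection objects sharing common characteristics."""
    identifier: str
    holding_institution: str
    dimensions: dict = field(default_factory=dict)    # e.g. {"taxon": "Coleoptera", "region": "Europe"}
    measurements: list = field(default_factory=list)  # list of MeasurementOrFact

# A 'collection description scheme' as a profile: mandatory properties and the
# controlled vocabulary each dimension must draw from.
SCHEME = {
    "mandatory": ["identifier", "holding_institution"],
    "dimension_vocabularies": {
        "region": {"Europe", "Africa", "Asia", "Americas", "Oceania", "Antarctica"},
    },
}

def conforms(group: ObjectGroup, scheme: dict) -> list:
    """Return a list of scheme violations (an empty list means the record conforms)."""
    problems = []
    for prop in scheme["mandatory"]:
        if not getattr(group, prop, None):
            problems.append(f"missing mandatory property: {prop}")
    for dim, vocab in scheme["dimension_vocabularies"].items():
        value = group.dimensions.get(dim)
        if value is not None and value not in vocab:
            problems.append(f"dimension '{dim}' value '{value}' not in controlled vocabulary")
    return problems

beetles = ObjectGroup(
    identifier="og:entomology-coleoptera-europe",
    holding_institution="Example Museum",
    dimensions={"taxon": "Coleoptera", "region": "Europe"},
    measurements=[MeasurementOrFact("objectCount", "120000", "specimens")],
)
print(conforms(beetles, SCHEME))   # -> [] when the record satisfies the scheme
```

In a real implementation the scheme would reference the standard's own controlled vocabularies and identifier schemes; the point here is only that a profile of mandatory properties and permitted values can be validated mechanically.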

Author(s):  
Matt Woodburn ◽  
Deborah L Paul ◽  
William Ulate ◽  
Niels Raes

Aggregating content of museum and scientific collections worldwide offers us the opportunity to realize a virtual museum of our planet and the life upon it through space and time. By mapping specimen-level data records to standards and publishing this information, an increasing number of collections contribute to a digitally accessible wealth of knowledge. Visualizing these digital records by parameters such as collection type and geographic origin helps collections and institutions to better understand their digital holdings and compare them to other such collections, as well as enabling researchers to find specimens and specimen data quickly (Singer et al. 2018). At the higher level of collections, related people and their activities, and especially the great majority of material that is yet to be digitised, we know much less. Many collections hold material not yet digitally discoverable in any form. For those that do publish collection-level data, it is commonly text-based data without the Globally Unique Identifiers (GUIDs) or the controlled vocabularies that would support quantitative collection metrics and aid discovery of related expertise and publications. To best understand and plan for our world’s bio- and geodiversity represented in collections, we need standardised, quantitative collections-level metadata. Various groups planet-wide are actively developing tools to capture this much-needed metadata, including information about the backlog, and more detailed information about institutions and their activities (e.g. staffing, space, species-level inventories, geographic and taxonomic expertise, and related publications) (Smith et al. 2018).

The Biodiversity Information Standards organization (TDWG) Collection Descriptions (CD) Data Standard Task Group aims to provide a data standard for describing natural scientific collections, which will enable: automated metrics, using standardised collection descriptions and/or data derived from specimen datasets (e.g., counts of specimens); and a global registry of physical collections (either digitised or non-digitised). The group will also produce a data model to underpin the new standard, and provide guidance and reference implementations for the practical use of the standard in institutional and collaborative data infrastructures. Our task group includes members from a myriad of groups with a stake in mobilizing such data at local, regional, domain-specific and global levels. With such a standard adopted, it will be possible to effectively share data across different community resources.

So far, we have carried out landscape analyses of existing collection description frameworks, and amassed a portfolio of use cases from the group as well as from a range of other sources, including the Collection Descriptions Dashboard working group of ICEDIG ("Innovation and consolidation for large scale digitisation of natural heritage"), iDigBio (Integrated Digitized Biocollections), the Smithsonian, Index Herbariorum, the Field Museum, GBIF (Global Biodiversity Information Facility), GRBio (Global Registry of Biodiversity Repositories) and fishfindR.net. These were used to develop a draft data model and, between them, inform the first iteration of the draft CD data standard.
A variety of challenges present themselves in developing this standard. Some relate to the standard development process itself, such as identifying (and often learning) effective tools and methods for collaborative working and communication across globally distributed volunteers. Others concern defining the scope of the standard and gaining consensus from stakeholders across a wide range of disciplines, while maintaining achievable goals. Further challenges arise from the requirement to develop a data model and standard that support such a variety of use cases and priorities, while retaining interoperability and manageability of the data. We will present some of these challenges and methods for addressing them, and summarise the progress and draft outputs of the group so far. We will also discuss the vision of how the new standard may be adopted and its potential impact on collections discoverability across the natural science collections community.


Author(s):  
Matt Woodburn ◽  
Sarah Vincent ◽  
Helen Hardy ◽  
Clare Valentine

The natural science collections community has identified an increasing need for shared, structured and interoperable data standards that can be used to describe the totality of institutional collection holdings, whether digitised or not. Major international initiatives - including the Global Biodiversity Information Facility (GBIF), the Distributed System of Scientific Collections (DiSSCo) and the Consortium of European Taxonomic Facilities (CETAF) - consider the current lack of standards to be a major barrier, which must be overcome to further their strategic aims and contribute to an open, discoverable catalogue of global collections. The Biodiversity Information Standards (TDWG) Collection Descriptions (CD) group is looking to address this issue with a new data standard for collection descriptions.

At an institutional level, this concept of collection descriptions aligns strongly with the need for a structured, more data-driven approach to assessing and working with collections, both to identify and prioritise investment and effort, and to monitor the impact of the work. Use cases include planning conservation and collection moves, prioritising specimen digitisation activities, and informing collection development strategy. The data can be integrated with the collection description framework for ongoing assessments of the state of the collection. This approach was pioneered with the ‘Move the Dots’ methodology by the Smithsonian National Museum of Natural History, started in 2009 and run annually since. The collection is broken down into several hundred discrete subcollections; for each, the number of objects is estimated and a numeric rank allocated according to a range of assessment criteria. This method has since been adopted by several other institutions, including Naturalis Biodiversity Centre, Museum für Naturkunde and the Natural History Museum, London (NHM).

First piloted in 2016, and now implemented as a core framework, the NHM’s adaptation, ‘Join the Dots’, divides the collection into approximately 2,600 ‘collection units’. The breakdown uses formal controlled lists and hierarchies, primarily taxonomy, type of object, storage location and (where relevant) stratigraphy, which are mapped to external authorities such as the Catalogue of Life and the Paleobiology Database. The collection breakdown is enhanced with estimates of the number of items, and ranks from 1 to 5 for each collection unit against 17 different criteria. These are grouped into four categories of ‘Condition’, ‘Information’ (including digital records), ‘Importance and Significance’ and ‘Outreach’. Although requiring significant time investment from collections staff to provide the estimates and assessments, this methodology has yielded a rich dataset that supports both discoverability (collection descriptions) and management (collection assessment). Links to further datasets about the building infrastructure and environmental conditions also make it a powerful resource for planning activities such as collection moves, pest monitoring and building work. We have developed dynamic dashboards to provide rich visualisations for exploring, analysing and communicating the data. As an ongoing, embedded activity for collections staff, there will also be a build-up of historical data going forward, enabling us to see trends, track changes to the collection, and measure the impact of projects and events.
The concept of Join the Dots also offers a generic, institution-agnostic model for enhancing the collection description framework with additional metrics that add value for strategic management and resourcing of the collection. In the design and implementation, we’ve faced challenges that should be highly relevant to the TDWG CD group, such as managing the dynamic breakdown of collections across multiple dimensions. We also face some that are yet to be resolved, such as a robust model for managing the evolving dataset over time. We intend to contribute these use cases into the development of the new TDWG data standard and be an early adopter and reference case. We envisage that this could constitute a common model that, where resources are available, provides the ability to add greater depth and utility to the world catalogue of collections.
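
As a rough illustration of the kind of dataset this approach produces, the following Python sketch aggregates rank scores for a couple of invented collection units. The category groupings mirror the four categories named above, but the unit names, item counts, criteria counts and ranks are made up for illustration and are not NHM data.

```python
# Illustrative sketch of aggregating 'Join the Dots'-style assessment data.
# Unit names, item counts and ranks below are invented, not real NHM figures.
from statistics import mean

# Each collection unit: an estimated item count plus 1-5 ranks per criterion,
# grouped into the four assessment categories.
collection_units = [
    {"unit": "Coleoptera: Carabidae", "items": 85000,
     "ranks": {"Condition": [3, 4], "Information": [2, 2, 3],
               "Importance and Significance": [5, 4], "Outreach": [3]}},
    {"unit": "Palaeobotany: Carboniferous", "items": 42000,
     "ranks": {"Condition": [2, 3], "Information": [1, 2, 2],
               "Importance and Significance": [4, 4], "Outreach": [2]}},
]

def category_summary(units, category):
    """Mean rank per collection unit for one assessment category."""
    return {u["unit"]: round(mean(u["ranks"][category]), 2) for u in units}

print(category_summary(collection_units, "Information"))
# e.g. {'Coleoptera: Carabidae': 2.33, 'Palaeobotany: Carboniferous': 1.67}
```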


Author(s):  
Katharine Barker ◽  
Jonas Astrin ◽  
Gabriele Droege ◽  
Jonathan Coddington ◽  
Ole Seberg

Most successful research programs depend on easily accessible and standardized research infrastructures. Until recently, access to tissue or DNA samples with standardized metadata and of sufficiently high quality has been a major bottleneck for genomic research. The Global Genome Biodiversity Network (GGBN) fills this critical gap by offering standardized, legal access to samples. Presently, GGBN’s core activity is enabling access to searchable DNA and tissue collections across natural history museums and botanic gardens. Activities are gradually being expanded to encompass all kinds of biodiversity biobanks, such as culture collections, zoological gardens, aquaria, arboreta, and environmental biobanks. Broadly speaking, these collections all provide long-term storage and standardized public access to samples useful for molecular research. GGBN facilitates sample search and discovery for its distributed member collections through a single entry point. It stores standardized information on mostly geo-referenced, vouchered samples, their physical location, availability, quality, and the necessary legal information on over 50,000 species of Earth’s biodiversity, from unicellular to multicellular organisms. The GGBN Data Portal and the GGBN Data Standard are complementary to existing infrastructures such as the Global Biodiversity Information Facility (GBIF) and the International Nucleotide Sequence Database Collaboration (INSDC). Today, many well-known open-source collection management databases, such as Arctos, Specify, and Symbiota, are implementing the GGBN data standard. GGBN continues to increase its collections strategically, based on the needs of the research community, adding over 1.3 million online records in 2018 alone; today, two million sample records are available through GGBN. Together with the Consortium of European Taxonomic Facilities (CETAF), the Society for the Preservation of Natural History Collections (SPNHC), Biodiversity Information Standards (TDWG), and Synthesis of Systematic Resources (SYNTHESYS+), GGBN provides best practices for biorepositories on meeting the requirements of the Nagoya Protocol on Access and Benefit Sharing (ABS). Through collaboration with the Biodiversity Heritage Library (BHL), GGBN is exploring options for tagging publications that reference GGBN collections and associated specimens, made searchable through GGBN’s document library. Through its collaborative efforts, standards, and best practices, GGBN aims to facilitate trust and transparency in the use of genetic resources.


Author(s):  
Matt Woodburn ◽  
Deborah L Paul ◽  
Wouter Addink ◽  
Steven J Baskauf ◽  
Stanley Blum ◽  
...  

Digitisation and publication of museum specimen data are happening worldwide, but are far from complete. Museums can start by sharing what they know about their holdings at a higher level, long before each object has its own record. Information about what is held in collections worldwide is needed by many stakeholders, including collections managers, funders, researchers, policy-makers, industry, and educators. To aggregate this information from collections, the data need to be standardised (Johnston and Robinson 2002). So, the Biodiversity Information Standards (TDWG) Collection Descriptions (CD) Task Group is developing a data standard for describing collections, which will provide: automated metrics, using standardised collection descriptions and/or data derived from specimen datasets (e.g., counts of specimens); and a global registry of physical collections (either digitised or non-digitised). Outputs will include a data model to underpin the new standard, and guidance and reference implementations for the practical use of the standard in institutional and collaborative data infrastructures.

The Task Group employs a community-driven approach to standard development. With international participation, workshops at the Natural History Museum (London 2019) and the MOBILISE workshop (Warsaw 2020) allowed over 50 people to contribute to this work. Our group organized online "barbecues" (BBQs) so that many more could contribute to standard definitions and address data model design challenges. Cloud-based tools (e.g., GitHub, Google Sheets) are used to organise and publish the group's work and make it easy to participate. A Wikibase instance is also used to test and demonstrate the model using real data.

There is a range of global, regional, and national initiatives interested in the standard (see the Task Group charter). Some, like GRSciColl (now at the Global Biodiversity Information Facility (GBIF)), Index Herbariorum (IH), and the iDigBio US Collections List, are existing catalogues. Others, including the Consortium of European Taxonomic Facilities (CETAF) and the Distributed System of Scientific Collections (DiSSCo), include collection descriptions as a key part of their near-term development plans. As part of the EU-funded SYNTHESYS+ project, GBIF organized a virtual workshop, Advancing the Catalogue of the World's Natural History Collections, to get international input for such a resource, which would use this CD standard.

Some major complexities present themselves in designing a standardised approach to represent collection descriptions data. It is not the first time that the natural science collections community has tried to address them (see the TDWG Natural Collections Description standard). Beyond the natural sciences, the library community in particular has given thought to this (Heaney 2001, Johnston and Robinson 2002), noting significant difficulties. One hurdle is that collections may be broken down into different degrees of granularity according to different criteria, and may also overlap, so that a single object can be represented in more than one collection description. Managing statistics such as numbers of objects is complex due to data gaps and variable degrees of certainty about collection contents.
It also takes considerable effort from collections staff to generate structured data about their undigitised holdings. We need to support simple, high-level collection summaries as well as detailed quantitative data, and to be able to update them as needed. We need a simple approach, but one that can also handle the complexities of data, scope, and social needs, for digitised and undigitised collections. The data standard itself is a defined set of classes and properties that can be used to represent groups of collection objects and their associated information. These incorporate common characteristics ('dimensions') by which we want to describe, group and break down our collections, metrics for quantifying those collections, and properties such as persistent identifiers for tracking collections and managing their digital counterparts. Existing terms from other standards (e.g. Darwin Core, ABCD) are re-used where possible. The data model (Fig. 1) underpinning the standard defines the relationships between those different classes, and ensures that the structure as well as the content are comparable across different datasets. It centres around the core concept of an 'object group', representing a set of physical objects that is defined by one or more dimensions (e.g., taxonomy and geographic origin) and linked to other entities such as the holding institution. Quantitative data about the group's contents (e.g., counts of objects or taxa) are attached to the object group, along with more qualitative information describing the contents of the group as a whole. In this presentation, we will describe the draft standard and data model with examples of early adoption for real-world and example data. We will also discuss the vision of how the new standard may be adopted and its potential impact on collection discoverability across the collections community.
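
One motivation for a shared structure is that object groups published by different institutions become directly comparable, so that collection-level metrics can be computed automatically. The following Python sketch shows the sort of aggregation this enables; it uses invented records and illustrative field names rather than the standard's normative terms.

```python
# Illustrative sketch of an automated metric over object groups that share a
# common structure. Field names and example records are assumptions, not
# terms or data from the draft standard.
from collections import defaultdict

object_groups = [
    {"institution": "Museum A", "dimensions": {"taxon": "Aves", "region": "Africa"}, "object_count": 15000},
    {"institution": "Museum B", "dimensions": {"taxon": "Aves", "region": "Africa"}, "object_count": 9000},
    {"institution": "Museum B", "dimensions": {"taxon": "Aves", "region": "Asia"}, "object_count": 4000},
]

def counts_by_dimension(groups, dimension):
    """Aggregate object counts across institutions for one shared dimension."""
    totals = defaultdict(int)
    for g in groups:
        key = g["dimensions"].get(dimension, "unknown")
        totals[key] += g["object_count"]
    return dict(totals)

print(counts_by_dimension(object_groups, "region"))  # {'Africa': 24000, 'Asia': 4000}
```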


2019 ◽  
Vol 214 ◽  
pp. 07030
Author(s):  
Marco Aldinucci ◽  
Stefano Bagnasco ◽  
Matteo Concas ◽  
Stefano Lusso ◽  
Sergio Rabellino ◽  
...  

Obtaining CPU cycles on an HPC cluster is nowadays relatively simple and sometimes even cheap for academic institutions. However, in most cases providers of HPC services would not allow changes to the configuration, implementation of special features, or lower-level control of the computing infrastructure, for example for testing experimental configurations. The variety of use cases proposed by several departments of the University of Torino, including ones from solid-state chemistry, computational biology, genomics and many others, called for different and sometimes conflicting configurations; furthermore, several R&D activities in the field of scientific computing, with topics ranging from GPU acceleration to Cloud Computing technologies, needed a platform on which to be carried out. The Open Computing Cluster for Advanced data Manipulation (OCCAM) is a multi-purpose, flexible HPC cluster designed and operated by a collaboration between the University of Torino and the Torino branch of the Istituto Nazionale di Fisica Nucleare. It is aimed at providing a flexible and reconfigurable infrastructure to cater to a wide range of different scientific computing needs, as well as a platform for R&D activities on computational technologies themselves. We describe some of the use cases that prompted the design and construction of the system, its architecture, and a first characterisation of its performance using some synthetic benchmark tools and a few realistic use-case tests.


Author(s):  
Arthur Chapman ◽  
Lee Belbin ◽  
Paula Zermoglio ◽  
John Wieczorek ◽  
Paul Morris ◽  
...  

The quality of biodiversity data publicly accessible via aggregators such as GBIF (Global Biodiversity Information Facility), the ALA (Atlas of Living Australia), iDigBio (Integrated Digitized Biocollections), and OBIS (Ocean Biogeographic Information System) is often questioned, especially by the research community. The Data Quality Interest Group, established by Biodiversity Information Standards (TDWG) and GBIF, has been engaged in four main activities: developing a framework for the assessment and management of data quality using a fitness-for-use approach; defining a core set of standardised tests and associated assertions based on Darwin Core terms; gathering and classifying user stories to form contextual-themed use cases, such as species distribution modelling, agrobiodiversity, and invasive species; and developing a standardised format for building and managing controlled vocabularies of values. Using the developed framework, data quality profiles have been built from use cases to represent user needs. Quality assertions can then be used to filter data suitable for a purpose. The assertions can also be used to provide feedback to data providers and custodians to assist in improving data quality at the source. In a case study, two different implementations of the tests and assertions based around the Darwin Core "Event Date" terms were run against GBIF data to demonstrate that the tests are implementation agnostic, can be run on large aggregated datasets, and can make biodiversity data more fit for typical research uses.
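
As a simplified illustration of what such a test looks like in practice (this is not one of the Interest Group's official test implementations), the following Python sketch flags dwc:eventDate values that are empty, not interpretable as ISO 8601 dates, or outside a plausible range.

```python
# Simplified illustration of a data quality assertion on dwc:eventDate.
# The status labels and the plausible-range cutoff are assumptions for this
# sketch, not the official TDWG/BDQ test definitions.
from datetime import date, datetime

def validate_event_date(event_date: str, earliest: date = date(1600, 1, 1)):
    """Return a (status, comment) assertion for a single dwc:eventDate value."""
    if not event_date:
        return "EMPTY", "dwc:eventDate is not supplied"
    for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            parsed = datetime.strptime(event_date, fmt).date()
            if parsed < earliest or parsed > date.today():
                return "NOT_COMPLIANT", f"date {parsed} outside plausible range"
            return "COMPLIANT", f"interpretable as {parsed.isoformat()}"
        except ValueError:
            continue
    return "NOT_COMPLIANT", "value is not an interpretable ISO 8601 date"

for value in ["2018-05-12", "1492", "12/05/2018", ""]:
    print(value, validate_event_date(value))
```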


Author(s):  
Mareike Petersen ◽  
Sabine von Mering ◽  
Julia Pim Reis ◽  
Falko Glöckler

In the last two decades, various projects and initiatives have conducted research on how to share, exchange, and link information from natural science collection objects. This profound (technical) knowledge, standards, tools, and best practices are essential to the development of any new research infrastructure facilitating research on bio- and geodiversity. However, the knowledge and research results are usually not easily accessible at a single point and particularly not in a well-curated form. Here, the Knowledgebase developed for the Distributed System of Scientific Collections (DiSSCo) comes into play. This information hub will act as trusted source for project outcomes and other relevant resources (e.g., web services, Persistent Identifier Systems, controlled vocabularies, domain-specific ontologies and standards) for users and developers of DiSSCo and other research infrastructures worldwide. In this talk, we will present the current version of the DiSSCo Knowledgebase, its developmental approach, and the opportunity for this source to act as an e-service for various stakeholder groups interested in and working with natural science collections worldwide.


Electronics ◽  
2021 ◽  
Vol 10 (5) ◽  
pp. 592
Author(s):  
Radek Silhavy ◽  
Petr Silhavy ◽  
Zdenka Prokopova

Software size estimation is a nontrivial task, important for software project planning and management, and is typically based on data analysis or on an algorithmic estimation approach. In this paper, a new method called Actors and Use Cases Size Estimation is proposed, based only on the number of actors and use cases. The method uses stepwise regression and leads to a very significant reduction in errors when estimating the size of software systems compared to Use Case Points-based methods. Because the proposed method is independent of Use Case Points, it eliminates the effect of inaccurate determination of Use Case Points components, since those components are not used in the proposed method.
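
To sketch the general idea (not the published model or its coefficients), the following Python example fits a regression of software size on the number of actors and use cases over a set of invented historical projects, then estimates the size of a new project. The paper uses stepwise regression; plain ordinary least squares is used here only to keep the illustration short.

```python
# Illustrative sketch: estimate software size from actor and use-case counts.
# The historical data, size unit and resulting coefficients are made up and
# are not those of the Actors and Use Cases Size Estimation method.
import numpy as np

# Historical projects: [number_of_actors, number_of_use_cases] -> known size
X = np.array([[3, 12], [5, 20], [8, 35], [4, 18], [10, 50], [6, 25]], dtype=float)
y = np.array([4200, 6800, 11500, 6100, 16000, 8300], dtype=float)  # e.g. SLOC

# Ordinary least squares with an intercept term
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, b_actors, b_use_cases = coeffs

new_project = np.array([1.0, 7, 30])   # intercept term, 7 actors, 30 use cases
estimate = new_project @ coeffs
print(f"size ~ {intercept:.0f} + {b_actors:.0f}*actors + {b_use_cases:.0f}*use_cases")
print(f"estimated size for the new project: {estimate:.0f}")
```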


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3871
Author(s):  
Jiri Pokorny ◽  
Khanh Ma ◽  
Salwa Saafi ◽  
Jakub Frolka ◽  
Jose Villa ◽  
...  

Automated systems have been seamlessly integrated into several industries as part of their industrial automation processes. Employing automated systems, such as autonomous vehicles, allows industries to increase productivity, benefit from a wide range of technologies and capabilities, and improve workplace safety. So far, most existing systems consider utilizing one type of autonomous vehicle. In this work, we propose a collaboration of different types of unmanned vehicles in maritime offshore scenarios. Providing high capacity, extended coverage, and better quality of service, autonomous collaborative systems can enable emerging maritime use cases, such as remote monitoring and navigation assistance. Motivated by these potential benefits, we propose the deployment of an Unmanned Surface Vehicle (USV) and an Unmanned Aerial Vehicle (UAV) in an autonomous collaborative communication system. Specifically, we design high-speed, directional communication links between a terrestrial control station and the two unmanned vehicles. Using measurement and simulation results, we evaluate the performance of the designed links in different communication scenarios and show the benefits of employing multiple autonomous vehicles in the proposed communication system.


Religions ◽  
2019 ◽  
Vol 10 (6) ◽  
pp. 389
Author(s):  
James Robert Brown

Religious notions have long played a role in epistemology. Theological thought experiments, in particular, have been effective in a wide range of situations in the sciences. Some of these are merely picturesque, others have been heuristically important, and still others, as I will argue, have played a role that could be called essential. I will illustrate the difference between heuristic and essential with two examples. One of these stems from the Newton–Leibniz debate over the nature of space and time; the other is a thought experiment of my own constructed with the aim of making a case for a more liberal view of evidence in mathematics.

