Exposing the Dark Data of Undigitized Collections: A TDWG global standard for collection descriptions

Author(s):  
Matt Woodburn ◽  
Deborah L Paul ◽  
William Ulate ◽  
Niels Raes

Aggregating content of museum and scientific collections worldwide offers us the opportunity to realize a virtual museum of our planet and the life upon it through space and time. By mapping specimen-level data records to standards and publishing this information, an increasing number of collections contribute to a digitally accessible wealth of knowledge. Visualizing these digital records by parameters such as collection type and geographic origin helps collections and institutions to better understand their digital holdings and compare them to other such collections, as well as enabling researchers to find specimens and specimen data quickly (Singer et al. 2018). At the higher level of collections, related people and their activities, and especially the great majority of material that is yet to be digitised, we know much less. Many collections hold material not yet digitally discoverable in any form. For those that do publish collection-level data, it is commonly text-based data without the Globally Unique Identifiers (GUIDs) or the controlled vocabularies that would support quantitative collection metrics and aid discovery of related expertise and publications. To best understand and plan for our world’s bio- and geodiversity represented in collections, we need standardised, quantitative collections-level metadata. Various groups worldwide are actively developing tools to capture this much-needed metadata, including information about the backlog, and more detailed information about institutions and their activities (e.g. staffing, space, species-level inventories, geographic and taxonomic expertise, and related publications) (Smith et al. 2018).
The Biodiversity Information Standards organization (TDWG) Collection Descriptions (CD) Data Standard Task Group aims to provide a data standard for describing natural scientific collections. The standard is intended to enable automated metrics, using standardised collection descriptions and/or data derived from specimen datasets (e.g., counts of specimens), and a global registry of physical collections (either digitised or non-digitised). The group will also produce a data model to underpin the new standard, and provide guidance and reference implementations for the practical use of the standard in institutional and collaborative data infrastructures. Our task group includes members from a myriad of groups with a stake in mobilizing such data at local, regional, domain-specific and global levels. With such a standard adopted, it will be possible to effectively share data across different community resources. So far, we have carried out landscape analyses of existing collection description frameworks, and amassed a portfolio of use cases from the group as well as from a range of other sources, including the Collection Descriptions Dashboard working group of ICEDIG ("Innovation and consolidation for large scale digitisation of natural heritage"), iDigBio (Integrated Digitized Biocollections), Smithsonian, Index Herbariorum, the Field Museum, GBIF (Global Biodiversity Information Facility), GRBio (Global Registry of Biodiversity Repositories) and fishfindR.net. These were used to develop a draft data model, and between them inform the first iteration of the draft CD data standard. A variety of challenges present themselves in developing this standard.
Some relate to the standard development process itself, such as identifying (often learning) effective tools and methods for collaborative working and communication across globally distributed volunteers. Others concern the scope and gaining consensus from stakeholders, across a wide range of disciplines, while maintaining achievable goals. Further challenges arise from the requirement to develop a data model and standard that support such a variety of use cases and priorities, while retaining interoperability and manageability of the data. We will present some of these challenges and methods for addressing them, and summarise the progress and draft outputs of the group so far. We will also discuss the vision of how the new standard may be adopted and its potential impact on collections discoverability across the natural science collections community.

Author(s):  
Matt Woodburn ◽  
Gabriele Droege ◽  
Sharon Grant ◽  
Quentin Groom ◽  
Janeen Jones ◽  
...  

The utopian vision is of a future where a digital representation of each object in our collections is accessible through the internet and sustainably linked to other digital resources. This is a long-term goal, however, and in the meantime there is an urgent need to share data about our collections at a higher level with a range of stakeholders (Woodburn et al. 2020). To sustainably achieve this, and to aggregate this information across all natural science collections, the data need to be standardised (Johnston and Robinson 2002). To this end, the Biodiversity Information Standards (TDWG) Collection Descriptions (CD) Interest Group has developed a data standard for describing collections, which is approaching formal review for ratification as a new TDWG standard. It proposes 20 classes (Suppl. material 1) and over 100 properties that can be used to describe, categorise, quantify, link and track digital representations of natural science collections, from high-level approximations to detailed breakdowns depending on the purpose of a particular implementation. The wide range of use cases identified for representing collection description data means that a flexible approach to the standard and the underlying modelling concepts is essential. These concepts are centered around the ‘ObjectGroup’ (Fig. 1), a class that may represent any group (of any size) of physical collection objects that have one or more common characteristics. This generic definition of the ‘collection’ in ‘collection descriptions’ is an important factor in making the standard flexible enough to support the breadth of use cases. For any use case or implementation, only a subset of classes and properties within the standard is likely to be relevant. In some cases, this subset may have little overlap with those selected for other use cases. This additional need for flexibility means that very few classes and properties, representing the core concepts, are proposed to be mandatory.
Metrics, facts and narratives are represented in a normalised structure using an extended MeasurementOrFact class, so that these can be user-defined rather than constrained to a set identified by the standard. Finally, rather than a rigid underlying data model as part of the normative standard, documentation will be developed to provide guidance on how the classes in the standard may be related and quantified according to relational, dimensional and graph-like models. So, in summary, the standard has, by design, been made flexible enough to be used in a number of different ways. The corresponding risk is that it could be used in ways that may not deliver what is needed in terms of outputs, manageability and interoperability with other resources of collection-level or object-level data. To mitigate this, it is key for any new implementer of the standard to establish how it should be used in that particular instance, and define any necessary constraints within the wider scope of the standard and model. This is the concept of the ‘collection description scheme,’ a profile that defines elements such as: which classes and properties should be included, which should be mandatory, and which should be repeatable; which controlled vocabularies and hierarchies should be used to make the data interoperable; how the collections should be broken down into individual ObjectGroups and interlinked, and how the various classes should be related to each other.
Various factors might influence these decisions, including the types of information that are relevant to the use case, whether quantitative metrics need to be captured and aggregated across collection descriptions, and how many resources can be dedicated to amassing and maintaining the data. This process has particular relevance to the Distributed System of Scientific Collections (DiSSCo) consortium, the design of which incorporates use cases for storing, interlinking and reporting on the collections of its member institutions. These include helping users of the European Loans and Visits System (ELViS) (Islam 2020) to discover specimens for physical and digital loans by providing descriptions and breakdowns of the collections of holding institutions, and monitoring digitisation progress across European collections through a dynamic Collections Digitisation Dashboard. In addition, DiSSCo will be part of a global collections data ecosystem requiring interoperation with other infrastructures such as the GBIF (Global Biodiversity Information Facility) Registry of Scientific Collections, the CETAF (Consortium of European Taxonomic Facilities) Registry of Collections and Index Herbariorum. In this presentation, we will introduce the draft standard and discuss the process of defining new collection description schemes using the standard and data model, and focus on DiSSCo requirements as examples of real-world collection descriptions use cases.
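The idea of a collection description scheme as a constraining profile can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class layout, the property names (`name`, `preservationMethod`) and the vocabulary values are hypothetical and are not taken from the draft standard. It only shows how a scheme might declare mandatory properties and controlled vocabularies and then check an ObjectGroup against them.

```python
from dataclasses import dataclass, field


@dataclass
class MeasurementOrFact:
    """A user-defined metric, fact or narrative attached to an object group."""
    measurement_type: str  # hypothetical label, e.g. "objectCount"
    value: str
    unit: str = ""


@dataclass
class ObjectGroup:
    """Any group of physical collection objects sharing some characteristic."""
    identifier: str
    properties: dict = field(default_factory=dict)
    measurements: list = field(default_factory=list)


@dataclass
class Scheme:
    """Hypothetical profile: which properties are required, allowed, controlled."""
    mandatory: set
    allowed: set
    vocabularies: dict = field(default_factory=dict)  # property -> permitted values

    def validate(self, group: ObjectGroup) -> list:
        """Return a list of human-readable problems (empty if the group conforms)."""
        problems = []
        for name in sorted(self.mandatory - set(group.properties)):
            problems.append(f"missing mandatory property: {name}")
        for name, value in group.properties.items():
            if name not in self.allowed | self.mandatory:
                problems.append(f"property not in scheme: {name}")
            elif name in self.vocabularies and value not in self.vocabularies[name]:
                problems.append(f"value {value!r} not in vocabulary for {name}")
        return problems


# Illustrative values only; none of these names come from the standard.
scheme = Scheme(
    mandatory={"name"},
    allowed={"preservationMethod"},
    vocabularies={"preservationMethod": {"dried", "fluid", "frozen"}},
)
group = ObjectGroup(
    identifier="og-1",
    properties={"preservationMethod": "pickled", "colour": "green"},
    measurements=[MeasurementOrFact("objectCount", "1200")],
)
for problem in scheme.validate(group):
    print(problem)
```

Keeping the constraints in the scheme rather than in the ObjectGroup class mirrors the separation described above: the standard stays permissive, and each implementation adds its own rules.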


Author(s):  
Matt Woodburn ◽  
Deborah L Paul ◽  
Wouter Addink ◽  
Steven J Baskauf ◽  
Stanley Blum ◽  
...  

Digitisation and publication of museum specimen data is happening worldwide, but is far from complete. Museums can start by sharing what they know about their holdings at a higher level, long before each object has its own record. Information about what is held in collections worldwide is needed by many stakeholders including collections managers, funders, researchers, policy-makers, industry, and educators. To aggregate this information from collections, the data need to be standardised (Johnston and Robinson 2002). So, the Biodiversity Information Standards (TDWG) Collection Descriptions (CD) Task Group is developing a data standard for describing collections, which gives the ability to provide: automated metrics, using standardised collection descriptions and/or data derived from specimen datasets (e.g., counts of specimens) and a global registry of physical collections (i.e., digitised or non-digitised). Outputs will include a data model to underpin the new standard, and guidance and reference implementations for the practical use of the standard in institutional and collaborative data infrastructures. The Task Group employs a community-driven approach to standard development. With international participation, workshops at the Natural History Museum (London 2019) and the MOBILISE workshop (Warsaw 2020) allowed over 50 people to contribute to this work. Our group organized online "barbecues" (BBQs) so that many more could contribute to standard definitions and address data model design challenges. Cloud-based tools (e.g., GitHub, Google Sheets) are used to organise and publish the group's work and make it easy to participate. A Wikibase instance is also used to test and demonstrate the model using real data.
There are a range of global, regional, and national initiatives interested in the standard (see Task Group charter). Some, like GRSciColl (now at the Global Biodiversity Information Facility (GBIF)), Index Herbariorum (IH), and the iDigBio US Collections List are existing catalogues. Others, including the Consortium of European Taxonomic Facilities (CETAF) and the Distributed System of Scientific Collections (DiSSCo), include collection descriptions as a key part of their near-term development plans. As part of the EU-funded SYNTHESYS+ project, GBIF organized a virtual workshop: Advancing the Catalogue of the World's Natural History Collections to get international input for such a resource that would use this CD standard. Some major complexities present themselves in designing a standardised approach to represent collection descriptions data. It is not the first time that the natural science collections community has tried to address them (see the TDWG Natural Collections Description standard). Beyond natural sciences, the library community in particular gave thought to this (Heaney 2001, Johnston and Robinson 2002), noting significant difficulties. One hurdle is that collections may be broken down into different degrees of granularity according to different criteria, and may also overlap so that a single object can be represented in more than one collection description. Managing statistics such as numbers of objects is complex due to data gaps and variable degrees of certainty about collection contents. It also takes considerable effort from collections staff to generate structured data about their undigitised holdings. We need to support simple, high-level collection summaries as well as detailed quantitative data, and to be able to update as needed. We need a simple approach, but one that can also handle the complexities of data, scope, and social needs, for digitised and undigitised collections. 
The data standard itself is a defined set of classes and properties that can be used to represent groups of collection objects and their associated information. These incorporate common characteristics ('dimensions') by which we want to describe, group and break down our collections, metrics for quantifying those collections, and properties such as persistent identifiers for tracking collections and managing their digital counterparts. Existing terms from other standards (e.g. Darwin Core, ABCD) are re-used if possible. The data model (Fig. 1) underpinning the standard defines the relationships between those different classes, and ensures that the structure as well as the content are comparable across different datasets. It centres around the core concept of an 'object group', representing a set of physical objects that is defined by one or more dimensions (e.g., taxonomy and geographic origin), and linked to other entities such as the holding institution. To the object group, quantitative data about its contents are attached (e.g. counts of objects or taxa), along with more qualitative information describing the contents of the group as a whole. In this presentation, we will describe the draft standard and data model with examples of early adoption for real-world and example data. We will also discuss the vision of how the new standard may be adopted and its potential impact on collection discoverability across the collections community.
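As a rough illustration of the object group concept described above, the following Python sketch represents a few object groups, each defined by one or more dimensions and carrying optional counts, and sums a metric by one dimension. The field names, the `objectCount` metric and all example values are hypothetical and are not drawn from the draft standard or any real dataset.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ObjectGroup:
    """A set of physical objects defined by one or more dimensions."""
    identifier: str
    institution: str
    dimensions: dict                            # e.g. {"taxon": ..., "region": ...}
    counts: dict = field(default_factory=dict)  # metric name -> number
    description: str = ""                       # qualitative information


def total_by_dimension(groups, dimension, metric):
    """Sum one metric over object groups, keyed by the value of one dimension.

    Groups lacking the dimension or the metric are skipped, which reflects
    that some descriptions are qualitative only.
    """
    totals = defaultdict(int)
    for g in groups:
        if dimension in g.dimensions and metric in g.counts:
            totals[g.dimensions[dimension]] += g.counts[metric]
    return dict(totals)


# Invented example data for illustration.
groups = [
    ObjectGroup("og-1", "Museum A", {"taxon": "Coleoptera", "region": "Europe"},
                {"objectCount": 12000}),
    ObjectGroup("og-2", "Museum A", {"taxon": "Coleoptera", "region": "Africa"},
                {"objectCount": 3000}),
    ObjectGroup("og-3", "Museum B", {"taxon": "Lepidoptera", "region": "Europe"},
                {"objectCount": 5000}),
    ObjectGroup("og-4", "Museum B", {"taxon": "Lepidoptera"},
                description="Uncounted historical donation"),
]

by_taxon = total_by_dimension(groups, "taxon", "objectCount")
by_region = total_by_dimension(groups, "region", "objectCount")
print(by_taxon)
print(by_region)
```

A simple sum like this is only valid when the object groups do not overlap. Because a single object can be represented in more than one collection description, a real implementation would need a rule for avoiding double counting.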


Author(s):  
David Fichtmüller ◽  
Fabian Reimeier ◽  
Anton Güntsch

In the ABCD 3.0 Project the ABCD (Access to Biological Collection Data) Standard (Access to Biological Collections Data task group 2007) was transformed from a classic XML Schema into an OWL (Web Ontology Language) ontology (alongside an updated semantic-aware XML version). While it was initially planned to use the established TDWG Terms wiki as the editing and development platform for the ABCD ontology, the rise of Wikidata and its underlying platform Wikibase caused us to reconsider this decision and switch to a Wikibase installation instead. This proved to be a crucial decision, as Wikibase turned out to be a well-suited platform to collaboratively import, develop and export this complex semantic standard. This experience is potentially of interest to maintainers of other Biodiversity Information Standards (TDWG) standards and the Technical Architecture Group. In this presentation we will explain our technical setup and how we used Wikibase, alongside its related tools, to model the ABCD Ontology. We will introduce the tools we used for importing existing concepts from the previous ABCD versions, running maintenance queries (e.g. for checking the ontology for consistency or missing information about concepts), and exporting the ontology into the OWL/XML format. Finally we will discuss the lessons we learned and how our setup can be improved for future uses.


Author(s):  
Matt Woodburn ◽  
Sarah Vincent ◽  
Helen Hardy ◽  
Clare Valentine

The natural science collections community has identified an increasing need for shared, structured and interoperable data standards that can be used to describe the totality of institutional collection holdings, whether digitised or not. Major international initiatives - including the Global Biodiversity Information Facility (GBIF), the Distributed System of Scientific Collections (DiSSCo) and the Consortium of European Taxonomic Facilities (CETAF) - consider the current lack of standards to be a major barrier, which must be overcome to further their strategic aims and contribute to an open, discoverable catalogue of global collections. The Biodiversity Information Standards (TDWG) Collection Descriptions (CD) group is looking to address this issue with a new data standard for collection descriptions. At an institutional level, this concept of collection descriptions aligns strongly with the need to use a structured and more data-driven approach to assessing and working with collections, both to identify and prioritise investment and effort, and to monitor the impact of the work. Use cases include planning conservation and collection moves, prioritising specimen digitisation activities, and informing collection development strategy. The data can be integrated with the collection description framework for ongoing assessments of the state of the collection. This approach was pioneered with the ‘Move the Dots’ methodology by the Smithsonian National Museum of Natural History, started in 2009 and run annually since. The collection is broken down into several hundred discrete subcollections, for each of which the number of objects was estimated and a numeric rank allocated according to a range of assessment criteria. This method has since been adopted by several other institutions, including Naturalis Biodiversity Centre, Museum für Naturkunde and Natural History Museum, London (NHM). 
First piloted in 2016, and now implemented as a core framework, the NHM’s adaptation, ‘Join the Dots’, divides the collection into approximately 2,600 ‘collection units’. The breakdown uses formal controlled lists and hierarchies, primarily taxonomy, type of object, storage location and (where relevant) stratigraphy, which are mapped to external authorities such as the Catalogue of Life and Paleobiology Database. The collection breakdown is enhanced with estimations of number of items, and ranks from 1 to 5 for each collection unit against 17 different criteria. These are grouped into four categories of ‘Condition’, ‘Information’ (including digital records), ‘Importance and Significance’ and ‘Outreach’. Although requiring significant time investment from collections staff to provide the estimates and assessments, this methodology has yielded a rich dataset that supports both discoverability (collection descriptions) and management (collection assessment). Links to further datasets about the building infrastructure and environmental conditions also make it into a powerful resource for planning activities such as collections moves, pest monitoring and building work. We have developed dynamic dashboards to provide rich visualisations for exploring, analysing and communicating the data. As an ongoing, embedded activity for collections staff, there will also be a build-up of historical data going forward, enabling us to see trends, track changes to the collection, and measure the impact of projects and events. The concept of Join the Dots also offers a generic, institution-agnostic model for enhancing the collection description framework with additional metrics that add value for strategic management and resourcing of the collection. In the design and implementation, we’ve faced challenges that should be highly relevant to the TDWG CD group, such as managing the dynamic breakdown of collections across multiple dimensions. 
We also face some that are yet to be resolved, such as a robust model for managing the evolving dataset over time. We intend to contribute these use cases into the development of the new TDWG data standard and be an early adopter and reference case. We envisage that this could constitute a common model that, where resources are available, provides the ability to add greater depth and utility to the world catalogue of collections.
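The structure of a Join the Dots style assessment, with collection units carrying an estimated item count and ranks from 1 to 5 against criteria grouped into categories, can be sketched in Python as follows. The abstract does not say how ranks are combined, so the averaging and the weighting by estimated item count are assumptions made for illustration, as are the criterion names and example values.

```python
from dataclasses import dataclass


@dataclass
class CollectionUnit:
    name: str
    estimated_items: int
    ranks: dict  # criterion name -> rank from 1 to 5


# Two of the four categories named in the text; criterion names are hypothetical.
CATEGORIES = {
    "Condition": ["housing", "physical_state"],
    "Information": ["digital_records", "labelling"],
}


def category_score(unit, category):
    """Mean rank of the assessed criteria in one category (None if unassessed)."""
    values = [unit.ranks[c] for c in CATEGORIES[category] if c in unit.ranks]
    for v in values:
        if not 1 <= v <= 5:
            raise ValueError(f"rank out of range 1-5: {v}")
    return sum(values) / len(values) if values else None


def weighted_category_score(units, category):
    """Mean category score across units, weighted by estimated item count."""
    weighted = 0.0
    total_items = 0
    for unit in units:
        score = category_score(unit, category)
        if score is not None:
            weighted += score * unit.estimated_items
            total_items += unit.estimated_items
    return weighted / total_items if total_items else None


# Invented example units for illustration.
units = [
    CollectionUnit("Unit A", 1000,
                   {"housing": 2, "physical_state": 4, "digital_records": 1}),
    CollectionUnit("Unit B", 3000, {"housing": 5, "physical_state": 5}),
]
print(weighted_category_score(units, "Condition"))
print(weighted_category_score(units, "Information"))
```

Storing ranks per criterion and deriving category scores on demand keeps the raw assessments intact, which matters if the dataset is to be compared across repeated assessments over time.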


Author(s):  
David Fichtmueller ◽  
Anton Güntsch ◽  
Stanley Blum

ABCD (Access to Biological Collection Data, Holetschek et al. 2012) and DwC (Darwin Core, Wieczorek et al. 2012), are TDWG standards for documenting the occurrence of organisms in nature and/or collections, whether as specimens or observations (i.e., unit-level data), and are used for a wide range of applications. Since 2019, the working group has been investigating ways to enable a closer link and integration of these standards (Blum et al. 2019). This presentation will summarize the results of the September 2020 workshop of the ABCD/DwC Alignment Working Group as part of the TDWG 2020 Virtual Conference Working Sessions Week. Prior to the workshop, we will have collected use cases for the application of the standards, which will have been analysed and discussed in the workshop itself. On this basis, smaller working groups will have been formed to address technical, organisational, and sociological aspects relevant for an alignment and future maintenance of ABCD and DwC. The results of these working groups, as well as the general summary of the workshop and the planned next steps of the ABCD/DwC Alignment Working Group will be shown and discussed in this presentation.


Author(s):  
Eun-Young Mun ◽  
Anne E. Ray

Integrative data analysis (IDA) is a promising new approach in psychological research and has been well received in the field of alcohol research. This chapter provides a larger unifying research synthesis framework for IDA. Major advantages of IDA of individual participant-level data include better and more flexible ways to examine subgroups, model complex relationships, deal with methodological and clinical heterogeneity, and examine infrequently occurring behaviors. However, between-study heterogeneity in measures, designs, and samples and systematic study-level missing data are significant barriers to IDA and, more broadly, to large-scale research synthesis. Based on the authors’ experience working on the Project INTEGRATE data set, which combined individual participant-level data from 24 independent college brief alcohol intervention studies, it is also recognized that IDA investigations require a wide range of expertise and considerable resources and that some minimum standards for reporting IDA studies may be needed to improve transparency and quality of evidence.


2021 ◽  
Author(s):  
José-Vicente Tomás-Miquel ◽  
Jordi Capó-Vicedo

Scholars have widely recognised the importance of academic relationships between students at the university. While much of the past research has focused on studying their influence on different aspects such as the students’ academic performance or their emotional stability, less is known about their dynamics and the factors that influence the formation and dissolution of linkages between university students in academic networks. In this paper, we try to shed light on this issue by exploring through stochastic actor-oriented models and student-level data the influence that a set of proximity factors may have on formation of these relationships over the entire period in which students are enrolled at the university. Our findings confirm that the establishment of academic relationships is derived, in part, from a wide range of proximity dimensions of a social, personal, geographical, cultural and academic nature. Furthermore, and unlike previous studies, this research also empirically confirms that the specific stage in which the student is at the university determines the influence of these proximity factors on the dynamics of academic relationships. In this regard, beyond cultural and geographic proximities that only influence the first years at the university, students shape their relationships as they progress in their studies from similarities in more strategic aspects such as academic and personal closeness. These results may have significant implications for both academic research and university policies.


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3871
Author(s):  
Jiri Pokorny ◽  
Khanh Ma ◽  
Salwa Saafi ◽  
Jakub Frolka ◽  
Jose Villa ◽  
...  

Automated systems have been seamlessly integrated into several industries as part of their industrial automation processes. Employing automated systems, such as autonomous vehicles, allows industries to increase productivity, benefit from a wide range of technologies and capabilities, and improve workplace safety. So far, most of the existing systems consider utilizing one type of autonomous vehicle. In this work, we propose a collaboration of different types of unmanned vehicles in maritime offshore scenarios. Providing high capacity, extended coverage, and better quality of services, autonomous collaborative systems can enable emerging maritime use cases, such as remote monitoring and navigation assistance. Motivated by these potential benefits, we propose the deployment of an Unmanned Surface Vehicle (USV) and an Unmanned Aerial Vehicle (UAV) in an autonomous collaborative communication system. Specifically, we design high-speed, directional communication links between a terrestrial control station and the two unmanned vehicles. Using measurement and simulation results, we evaluate the performance of the designed links in different communication scenarios and we show the benefits of employing multiple autonomous vehicles in the proposed communication system.


2013 ◽  
Vol 103 (5) ◽  
pp. 479-487 ◽  
Author(s):  
Efrén Remesal ◽  
Blanca B. Landa ◽  
María del Mar Jiménez-Gasco ◽  
Juan A. Navas-Cortés

Populations of Sclerotium rolfsii, the causal organism of Sclerotium root-rot on a wide range of hosts, can be placed into mycelial compatibility groups (MCGs). In this study, we evaluated three different molecular approaches to unequivocally identify each of 12 previously identified MCGs. These included restriction fragment length polymorphism (RFLP) patterns of the internal transcribed spacer (ITS) region of nuclear ribosomal DNA (rDNA) and sequence analysis of two protein-coding genes: translation elongation factor 1α (EF1α) and RNA polymerase II subunit two (RPB2). A collection of 238 single-sclerotial isolates representing 12 MCGs of S. rolfsii was obtained from diseased sugar beet plants from Chile, Italy, Portugal, and Spain. ITS-RFLP analysis using four restriction enzymes (AluI, HpaII, RsaI, and MboI) displayed a low degree of variability among MCGs. Only three different restriction profiles were identified among S. rolfsii isolates, with no correlation to MCG or to geographic origin. Based on nucleotide polymorphisms, the RPB2 gene was more variable among MCGs than the EF1α gene. Thus, 10 of 12 MCGs could be characterized utilizing the RPB2 region only, while the EF1α region resolved 7 MCGs. However, the analysis of combined partial sequences of EF1α and RPB2 genes allowed discrimination among each of the 12 MCGs. All isolates belonging to the same MCG showed identical nucleotide sequences that differed by at least one nucleotide from those of any other MCG. The consistency of our results in identifying the MCG of a given S. rolfsii isolate using the combined sequences of EF1α and RPB2 genes was confirmed using blind trials. Our study demonstrates that sequence variation in the protein-coding genes EF1α and RPB2 may be exploited as a diagnostic tool for MCG typing in S. rolfsii as well as to identify previously undescribed MCGs.
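The reasoning that two loci together can discriminate groups that neither resolves alone can be shown with a small Python sketch. The group names and haplotype labels below are invented placeholders, not data from the study. A group counts as resolved when no other group shares its profile.

```python
def resolved_groups(profiles):
    """Return, sorted, the groups whose profile is shared with no other group."""
    by_profile = {}
    for group, profile in profiles.items():
        by_profile.setdefault(profile, []).append(group)
    return sorted(g for members in by_profile.values()
                  if len(members) == 1 for g in members)


# Toy haplotype labels per group at each locus (purely illustrative).
locus_1 = {"g1": "H1", "g2": "H1", "g3": "H2", "g4": "H3"}
locus_2 = {"g1": "R1", "g2": "R2", "g3": "R3", "g4": "R3"}

# The combined profile is the pair of haplotypes across both loci.
combined = {g: (locus_1[g], locus_2[g]) for g in locus_1}

print(resolved_groups(locus_1))   # groups unique at locus 1 only
print(resolved_groups(locus_2))   # groups unique at locus 2 only
print(resolved_groups(combined))  # groups unique on the combined profile
```

In this toy case each locus leaves one pair of groups indistinguishable, but the pairs differ, so the combined profile separates all four.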


Author(s):  
Katharine Barker ◽  
Jonas Astrin ◽  
Gabriele Droege ◽  
Jonathan Coddington ◽  
Ole Seberg

Most successful research programs depend on easily accessible and standardized research infrastructures. Until recently, access to tissue or DNA samples with standardized metadata and of a sufficiently high quality has been a major bottleneck for genomic research. The Global Genome Biodiversity Network (GGBN) fills this critical gap by offering standardized, legal access to samples. Presently, GGBN’s core activity is enabling access to searchable DNA and tissue collections across natural history museums and botanic gardens. Activities are gradually being expanded to encompass all kinds of biodiversity biobanks such as culture collections, zoological gardens, aquaria, arboreta, and environmental biobanks. Broadly speaking, these collections all provide long-term storage and standardized public access to samples useful for molecular research. GGBN facilitates sample search and discovery for its distributed member collections through a single entry point. It stores standardized information on mostly geo-referenced, vouchered samples, their physical location, availability, quality, and the necessary legal information on over 50,000 species of Earth’s biodiversity, from unicellular to multicellular organisms. The GGBN Data Portal and the GGBN Data Standard are complementary to existing infrastructures such as the Global Biodiversity Information Facility (GBIF) and the International Nucleotide Sequence Database Collaboration (INSDC). Today, many well-known open-source collection management databases such as Arctos, Specify, and Symbiota are implementing the GGBN data standard. GGBN continues to increase its collections strategically, based on the needs of the research community, adding over 1.3 million online records in 2018 alone, and today two million sample data are available through GGBN.
Together with Consortium of European Taxonomic Facilities (CETAF), Society for the Preservation of Natural History Collections (SPNHC), Biodiversity Information Standards (TDWG), and Synthesis of Systematic Resources (SYNTHESYS+), GGBN provides best practices for biorepositories on meeting the requirements of the Nagoya Protocol on Access and Benefit Sharing (ABS). By collaboration with the Biodiversity Heritage Library (BHL), GGBN is exploring options for tagging publications that reference GGBN collections and associated specimens, made searchable through GGBN’s document library. Through its collaborative efforts, standards, and best practices GGBN aims at facilitating trust and transparency in the use of genetic resources.

