20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration

2018 ◽  
Vol 4 ◽  
pp. e164 ◽  
Author(s):  
Anne E. Thessen ◽  
Jorrit H. Poelen ◽  
Matthew Collins ◽  
Jen Hammock

Biodiversity information is made available through numerous databases that each have their own data models, web services, and data types. Combining data across databases leads to new insights, but is not easy because each database uses its own system of identifiers. In the absence of stable and interoperable identifiers, databases are often linked using taxonomic names. This labor-intensive, error-prone, and lengthy process relies on accessible versions of nomenclatural authorities and fuzzy-matching algorithms. Approaching the challenge of linking diverse data requires more than technology. New social collaborations such as the Global Unified Open Data Architecture (GUODA), which combines the skills of computer engineers from iDigBio, server resources from the Advanced Computing and Information Systems (ACIS) Lab, global-scale data presentation from EOL, and independent developers and researchers, are needed to make concrete progress on finding relationships between biodiversity datasets. This paper discusses a technical solution developed by the GUODA collaboration for faster linking across databases, with a use case linking Wikidata and the Global Biotic Interactions database (GloBI). The GUODA infrastructure is a 12-node high-performance computing cluster with about 192 threads, 12 TB of storage, and 288 GB of memory. Using GUODA, 20 GB of compressed JSON from Wikidata was processed and linked to GloBI in about 10–11 minutes. Instead of comparing name strings or relying on a single identifier, Wikidata and GloBI were linked by comparing graphs of biodiversity identifiers external to each system. This method added 119,957 Wikidata links to GloBI, a 13.7% increase in GloBI's outgoing name links. Wikidata and GloBI were compared to the Open Tree of Life Reference Taxonomy to examine consistency and coverage. Parsing the Wikidata, Open Tree of Life Reference Taxonomy, and GloBI archives and calculating consistency metrics took minutes on the GUODA platform. As a model collaboration, GUODA has the potential to revolutionize biodiversity science by bringing diverse, technically minded people together with high-performance computing resources that are accessible from a laptop or desktop. However, participating in such a collaboration still requires basic programming skills.
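
The linking technique described above, matching graphs of external biodiversity identifiers rather than comparing name strings, can be sketched in a few lines. The sketch below is not the GUODA pipeline: it is a minimal, single-machine illustration in Python, and the Wikidata property IDs used (P685 for NCBI, P846 for GBIF, P830 for EOL, P815 for ITIS), the GloBI export file name, and its column names are assumptions made only for the example.

```python
# Minimal sketch (not the GUODA implementation): stream a compressed Wikidata
# JSON dump, collect external taxon identifiers per item, and join them against
# a hypothetical GloBI name-link table keyed by the same identifier strings.
# The property IDs and the "globi_links.tsv" layout are illustrative assumptions.
import bz2
import csv
import json

TAXON_ID_PROPS = {"P685": "NCBI", "P846": "GBIF", "P830": "EOL", "P815": "ITIS"}

def wikidata_identifier_graph(dump_path):
    """Map 'SCHEME:value' identifier strings to the Wikidata item IDs that carry them."""
    graph = {}
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            if not line.startswith("{"):
                continue  # skip the surrounding JSON array brackets
            item = json.loads(line)
            for prop, scheme in TAXON_ID_PROPS.items():
                for claim in item.get("claims", {}).get(prop, []):
                    value = claim["mainsnak"].get("datavalue", {}).get("value")
                    if value:
                        graph.setdefault(f"{scheme}:{value}", set()).add(item["id"])
    return graph

def link_globi_to_wikidata(globi_tsv, graph):
    """Yield (GloBI name, Wikidata QID) pairs that share an external identifier."""
    with open(globi_tsv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            for qid in graph.get(row["externalId"], ()):
                yield row["name"], qid

# Example usage:
# graph = wikidata_identifier_graph("wikidata-latest-all.json.bz2")
# for name, qid in link_globi_to_wikidata("globi_links.tsv", graph):
#     print(name, qid)
```

At scale, both the identifier graph and the GloBI name links would be built and joined in parallel across the cluster's nodes rather than streamed on a single machine, which is what makes processing a 20 GB compressed dump in minutes feasible.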


2014 ◽  
Vol 9 (2) ◽  
pp. 17-27 ◽  
Author(s):  
Ritu Arora ◽  
Maria Esteva ◽  
Jessica Trelogan

The process of developing a digital collection in the context of a research project often follows a pipeline pattern in which data growth, data types, and data authenticity must be assessed iteratively in relation to the different research steps and in the interest of archiving. Throughout a project's lifecycle, curators organize newly generated data, clean and integrate legacy data where it exists, and decide what data will be preserved for the long term. Although these actions should be part of a well-oiled data management workflow, there are practical challenges in doing so if the collection is very large and heterogeneous, or is accessed by several researchers concurrently. There is a need for data management solutions that help curators run efficient, on-demand analyses of their collections so that they remain well informed about their evolving characteristics. In this paper, we describe our efforts towards developing a workflow that leverages open science High Performance Computing (HPC) resources to routinely and efficiently conduct data management tasks on large collections. We demonstrate that HPC resources and techniques can significantly reduce the time needed for critical data management tasks and enable dynamic archiving throughout the research process. We use a large archaeological data collection with a long and complex formation history as our test case. We share our experiences in adopting open science HPC resources for large-scale data management, which entails understanding how to use the open-source HPC environment and training users. These experiences can be generalized to meet the needs of other data curators working with large collections.
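
As a rough illustration of the kind of on-demand collection analysis described above, the sketch below tallies file counts and total bytes per file type across a collection, fanning subdirectories out to worker processes. This is not the workflow from the paper; the paths, worker count, and use of Python's multiprocessing module are assumptions made for the example, and on an HPC node the worker count would simply be matched to the allocated cores.

```python
# Illustrative sketch only: profile a large collection in parallel, tallying
# file counts and sizes per extension so curators can track how it is evolving.
from collections import Counter
from multiprocessing import Pool
from pathlib import Path

def profile_directory(directory):
    """Return (file count, total bytes) per file extension for one directory tree."""
    counts, sizes = Counter(), Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "<none>"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return counts, sizes

def profile_collection(root, workers=8):
    """Fan subdirectories out to worker processes and merge their tallies."""
    subdirs = [p for p in Path(root).iterdir() if p.is_dir()] or [Path(root)]
    total_counts, total_sizes = Counter(), Counter()
    with Pool(processes=workers) as pool:
        for counts, sizes in pool.map(profile_directory, subdirs):
            total_counts.update(counts)
            total_sizes.update(sizes)
    return total_counts, total_sizes

# Example usage (wrap in `if __name__ == "__main__":` when run as a script):
# counts, sizes = profile_collection("/scratch/archaeology_collection", workers=16)
# for ext, n in counts.most_common(10):
#     print(f"{ext}\t{n} files\t{sizes[ext] / 1e9:.2f} GB")
```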


2001 ◽  
Vol 22 (11) ◽  
pp. 2171-2191 ◽  
Author(s):  
S. N. V. Kalluri ◽  
Z. Zhang ◽  
J. Jájá ◽  
S. Liang ◽  
J. R. G. Townshend

Author(s):  
Vinay Gavirangaswamy ◽  
Aakash Gupta ◽  
Mark Terwilliger ◽  
Ajay Gupta

Research into risky decision making (RDM) has become a multidisciplinary effort. Conversations cut across fields such as psychology, economics, insurance, and marketing. This broad interest highlights the need for collaborative investigation of RDM to understand and manipulate the situations within which it manifests. A holistic understanding of RDM has been impeded by the independent development of diverse RDM research methodologies across different fields. There is no software specific to RDM that combines paradigms and analytical tools based on recent developments in high-performance computing technologies. This paper presents a toolkit called RDMTk, developed specifically for the study of risky decision making. RDMTk provides a free environment that can be used to manage globally based experiments while fostering collaborative research. The incorporation of machine learning and high-performance computing (HPC) technologies in the toolkit further opens possibilities such as scalable algorithms and big-data problems arising from global-scale experiments.


MRS Bulletin ◽  
1997 ◽  
Vol 22 (10) ◽  
pp. 5-6
Author(s):  
Horst D. Simon

Recent events in the high-performance computing industry have raised concern among scientists and the general public about a crisis, or a lack of leadership, in the field. That concern is understandable considering the industry's history from 1993 to 1996. Cray Research, the historic leader in supercomputing technology, was unable to survive financially as an independent company and was acquired by Silicon Graphics. Two ambitious new companies that introduced new technologies in the late 1980s and early 1990s, Thinking Machines and Kendall Square Research, were commercial failures and went out of business. And Intel, which introduced its Paragon supercomputer in 1994, discontinued production only two years later. During the same time frame, scientists who had finished the laborious task of writing scientific codes to run on vector parallel supercomputers learned that those codes would have to be rewritten if they were to run on the next-generation, highly parallel architecture. Scientists who are not yet involved in high-performance computing are understandably hesitant about committing their time and energy to such an apparently unstable enterprise. However, beneath the commercial chaos of the last several years, a technological revolution has been occurring. The good news is that the revolution is over, leading to five to ten years of predictable stability, steady improvements in system performance, and increased productivity for scientific applications. It is time for scientists who were sitting on the fence to jump in and reap the benefits of the new technology.

