Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data

2017 ◽  
Author(s):  
Julie A McMurry ◽  
Nick Juty ◽  
Niklas Blomberg ◽  
Tony Burdett ◽  
Tom Conlin ◽  
...  

Abstract: In many disciplines, data is highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline ten lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers; we also outline important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
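The resolvability lesson can be sketched with a toy compact-identifier (CURIE) resolver. The prefix-to-URL registry below is a hypothetical illustration modeled on identifiers.org-style conventions, not an actual registry:

```python
# Minimal sketch of compact-identifier (prefix:accession) resolution.
# The registry entries here are illustrative assumptions, not a real
# resolver's data.

PREFIX_REGISTRY = {
    "uniprot": "https://www.uniprot.org/uniprot/{id}",
    "pubmed": "https://pubmed.ncbi.nlm.nih.gov/{id}",
}

def resolve(curie: str) -> str:
    """Expand a prefix:accession pair into a resolvable URL."""
    prefix, accession = curie.split(":", 1)
    try:
        template = PREFIX_REGISTRY[prefix.lower()]
    except KeyError:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return template.format(id=accession)

url = resolve("uniprot:P0DTC2")
```

Keeping the prefix-to-URL mapping in a shared registry, rather than hard-coding full URLs, is one way providers decouple persistent identifiers from the web locations that serve them.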

Web Services ◽  
2019 ◽  
pp. 953-978
Author(s):  
Krishnan Umachandran ◽  
Debra Sharon Ferdinand-James

Continued technological advancements of the 21st century afford massive data generation in sectors of our economy, including the domains of agriculture, manufacturing, and education. However, harnessing such large-scale data with modern technologies for effective decision-making is an evolving science that requires knowledge of Big Data management and analytics. Big Data in agriculture, manufacturing, and education is varied, spanning voluminous text, images, and graphs. Applying Big Data science techniques (e.g., functional algorithms) to extract intelligence affords decision makers quick responses to productivity, market-resilience, and student-enrollment challenges in today's unpredictable markets. This chapter employs data science to explore potential solutions to Big Data applications in agriculture, manufacturing, and, to a lesser extent, education, using modern technological tools such as Hadoop, Hive, Sqoop, and MongoDB.
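The map/shuffle/reduce pattern that Hadoop applies at cluster scale can be illustrated in miniature in plain Python; the corpus below is made up for illustration:

```python
# Toy word count in the map/shuffle/reduce style that Hadoop runs at
# cluster scale. The three-phase structure mirrors the paradigm; the
# sample records are illustrative only.
from collections import defaultdict

def map_phase(records):
    # Emit (key, 1) pairs, one per word occurrence.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

corpus = ["crop yield data", "market data", "student enrollment data"]
counts = reduce_phase(shuffle(map_phase(corpus)))
```

The same three-phase decomposition is what lets frameworks like Hadoop distribute each phase across many machines without changing the program's logic.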



2019 ◽  
Author(s):  
Yasset Perez-Riverol ◽  
Pablo Moreno

Abstract: The recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming an increasingly complex and convoluted process involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics in recent years, and this trend is likely to continue. However, most computational proteomics and metabolomics tools are targeted at and designed for single desktop applications, limiting the scalability and reproducibility of the data analysis. In this paper we overview the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the analysis. We discuss the combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis. Finally, we introduce to the proteomics and metabolomics communities a new approach for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments: Galaxy and Nextflow.
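The workflow-environment idea can be sketched as a dependency-ordered step runner. Step names and actions here are hypothetical; real engines such as Nextflow additionally launch each step inside its declared software container:

```python
# Minimal sketch of what a workflow engine does: run each step only after
# its declared dependencies have completed. Step names and actions are
# hypothetical placeholders for real proteomics/metabolomics tools.

def run_workflow(steps):
    """steps: name -> (depends_on, action). Returns execution order."""
    done, order = set(), []

    def visit(name):
        if name in done:
            return
        depends_on, action = steps[name]
        for dep in depends_on:
            visit(dep)   # ensure upstream steps run first
        action()
        done.add(name)
        order.append(name)

    for name in steps:
        visit(name)
    return order

results = []
steps = {
    "peak_picking": ([], lambda: results.append("picked")),
    "alignment":    (["peak_picking"], lambda: results.append("aligned")),
    "quantify":     (["alignment"], lambda: results.append("quantified")),
}
order = run_workflow(steps)
```

Declaring dependencies instead of a fixed script is what lets engines rerun only stale steps and distribute independent steps in parallel, which is central to the reproducibility argument above.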


2021 ◽  
Author(s):  
Giona Casiraghi ◽  
Vahan Nanumyan

Abstract: A fundamental issue of network data science is the ability to discern observed features that can be expected at random from those beyond such expectations. Configuration models play a crucial role here, allowing us to compare observations against degree-corrected null models. Nonetheless, existing formulations have limited applications to large-scale data analysis, either because they require expensive Monte Carlo simulations or because they lack the flexibility required to model real-world systems. With the generalized hypergeometric ensemble, we address both problems. To achieve this, we map the configuration model to an urn problem in which edges are represented as balls in an appropriately constructed urn. Doing so, we obtain a random graph model that reproduces and extends the properties of standard configuration models, with the critical advantage of a closed-form probability distribution.
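A hedged sketch of the urn mapping, using the univariate hypergeometric distribution as a simplification of the generalized (multivariate) ensemble the abstract describes; the degree values and ball counts below are made up for illustration:

```python
# Urn picture in miniature: each candidate edge slot is a ball, observed
# edges are a draw without replacement, and the closed form is the
# hypergeometric pmf. This univariate version is a simplification of the
# generalized ensemble; numbers below are illustrative assumptions.
from math import comb

def hypergeom_pmf(k, N, K, m):
    """P(X = k) when drawing m balls from N total, K of them 'successes'."""
    return comb(K, k) * comb(N - K, m - k) / comb(N, m)

# Illustrative setup: a node pair with degrees d_i = d_j = 4 contributes
# K = d_i * d_j = 16 balls out of N = 100 in the urn; we sample m = 10 edges
# and ask how likely it is that k = 2 of them land on this pair.
p = hypergeom_pmf(2, 100, 16, 10)
```

Because the probability is available in closed form, no Monte Carlo sampling is needed to decide whether an observed edge count between two nodes exceeds the null-model expectation.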


MRS Advances ◽  
2020 ◽  
Vol 5 (7) ◽  
pp. 347-353
Author(s):  
Roger H. French ◽  
Laura S. Bruckman

Abstract: Data science has advanced significantly in recent years and allows scientists to harness large-scale data analysis techniques using open source coding frameworks. Data science is a tool that should be taught to science and engineering students in addition to their chosen domain knowledge. An applied data science minor allows students to understand data and data handling as well as statistics and model development. This move will improve reproducibility and openness of research as well as allow for greater interdisciplinarity and more analyses focusing on critical scientific challenges.


F1000Research ◽  
2014 ◽  
Vol 3 ◽  
pp. 44 ◽  
Author(s):  
Jose M. Villaveces ◽  
Rafael C. Jimenez ◽  
Bianca H. Habermann

Summary: Protein interaction networks have become an essential tool in large-scale data analysis, integration, and the visualization of high-throughput data in the context of complex cellular networks. Many individual databases are available that provide information on binary interactions of proteins and small molecules. Community efforts such as PSICQUIC aim to unify and standardize information emanating from these public databases. Here we introduce PsicquicGraph, an open-source, web-based visualization component for molecular interactions from PSICQUIC services. Availability: PsicquicGraph is freely available at the BioJS Registry for download and enhancement. Instructions on how to use the tool are available here http://goo.gl/kDaIgZ and the source code can be found at http://github.com/biojs/biojs and DOI:10.5281/zenodo.7709.
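PSICQUIC services return interactions as tab-separated PSI-MI TAB (MITAB) records, where the first two columns identify the interactors. A minimal sketch of extracting interactor pairs; the sample line is illustrative, not real interaction data:

```python
# Minimal MITAB parsing sketch: interactor A and B identifiers occupy the
# first two tab-separated columns of each record. The sample record is an
# illustrative assumption, not actual database output.

def parse_mitab(lines):
    pairs = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        a, b = cols[0], cols[1]  # interactor A and B identifiers
        pairs.append((a, b))
    return pairs

sample = ["uniprotkb:P04637\tuniprotkb:Q00987\t-\n"]
pairs = parse_mitab(sample)
```

A shared tabular format like this is what lets a client such as PsicquicGraph draw a single network from many independent PSICQUIC providers.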


2020 ◽  
Vol 13 (12) ◽  
pp. 2993-2996
Author(s):  
El Kindi Rezig ◽  
Ashrita Brahmaroutu ◽  
Nesime Tatbul ◽  
Mourad Ouzzani ◽  
Nan Tang ◽  
...  
