Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data

2017 ◽  
Author(s):  
Julie A McMurry ◽  
Nick Juty ◽  
Niklas Blomberg ◽  
Tony Burdett ◽  
Tom Conlin ◽  
...  

Abstract: In many disciplines, data is highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline ten lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers; we also outline important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
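The resolvability lesson can be sketched with a toy compact-identifier (CURIE) resolver. The prefix-to-URL registry below is a hypothetical illustration modeled on identifiers.org-style conventions, not an actual registry:

```python
# Minimal sketch of compact-identifier (prefix:accession) resolution.
# The registry entries here are illustrative assumptions, not a real
# resolver's data.

PREFIX_REGISTRY = {
    "uniprot": "https://www.uniprot.org/uniprot/{id}",
    "pubmed": "https://pubmed.ncbi.nlm.nih.gov/{id}",
}

def resolve(curie: str) -> str:
    """Expand a prefix:accession pair into a resolvable URL."""
    prefix, accession = curie.split(":", 1)
    try:
        template = PREFIX_REGISTRY[prefix.lower()]
    except KeyError:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return template.format(id=accession)

url = resolve("uniprot:P0DTC2")
```

Keeping the prefix-to-URL mapping in a shared registry, rather than hard-coding full URLs, is one way providers decouple persistent identifiers from the web locations that serve them.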

Web Services ◽  
2019 ◽  
pp. 953-978
Author(s):  
Krishnan Umachandran ◽  
Debra Sharon Ferdinand-James

Continued technological advancements of the 21st century afford massive data generation in sectors of our economy, including the domains of agriculture, manufacturing, and education. However, harnessing such large-scale data with modern technologies for effective decision-making is an evolving science that requires knowledge of Big Data management and analytics. Big Data in agriculture, manufacturing, and education is varied, spanning voluminous text, images, and graphs. Applying Big Data science techniques (e.g., functional algorithms) to extract intelligence affords decision makers quick responses to productivity, market-resilience, and student-enrollment challenges in today's unpredictable markets. This chapter employs data science to explore potential solutions to Big Data applications in agriculture, manufacturing, and, to a lesser extent, education, using modern technological tools such as Hadoop, Hive, Sqoop, and MongoDB.
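The map/shuffle/reduce pattern that Hadoop applies at cluster scale can be illustrated in miniature in plain Python; the corpus below is made up for illustration:

```python
# Toy word count in the map/shuffle/reduce style that Hadoop runs at
# cluster scale. The three-phase structure mirrors the paradigm; the
# sample records are illustrative only.
from collections import defaultdict

def map_phase(records):
    # Emit (key, 1) pairs, one per word occurrence.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

corpus = ["crop yield data", "market data", "student enrollment data"]
counts = reduce_phase(shuffle(map_phase(corpus)))
```

The same three-phase decomposition is what lets frameworks like Hadoop distribute each phase across many machines without changing the program's logic.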



2019 ◽  
Author(s):  
Yasset Perez-Riverol ◽  
Pablo Moreno

Abstract: The recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming an increasingly complex and convoluted process involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics in recent years, and this trend is likely to continue. However, most computational proteomics and metabolomics tools are targeted at and designed for single desktop applications, limiting the scalability and reproducibility of the data analysis. In this paper we overview the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the analysis. We discuss the combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis. Finally, we introduce to the proteomics and metabolomics communities a new approach for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments: Galaxy and Nextflow.
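The workflow-environment idea can be sketched as a dependency-ordered step runner. Step names and actions here are hypothetical; real engines such as Nextflow additionally launch each step inside its declared software container:

```python
# Minimal sketch of what a workflow engine does: run each step only after
# its declared dependencies have completed. Step names and actions are
# hypothetical placeholders for real proteomics/metabolomics tools.

def run_workflow(steps):
    """steps: name -> (depends_on, action). Returns execution order."""
    done, order = set(), []

    def visit(name):
        if name in done:
            return
        depends_on, action = steps[name]
        for dep in depends_on:
            visit(dep)   # ensure upstream steps run first
        action()
        done.add(name)
        order.append(name)

    for name in steps:
        visit(name)
    return order

results = []
steps = {
    "peak_picking": ([], lambda: results.append("picked")),
    "alignment":    (["peak_picking"], lambda: results.append("aligned")),
    "quantify":     (["alignment"], lambda: results.append("quantified")),
}
order = run_workflow(steps)
```

Declaring dependencies instead of a fixed script is what lets engines rerun only stale steps and distribute independent steps in parallel, which is central to the reproducibility argument above.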


2021 ◽  
Author(s):  
Giona Casiraghi ◽  
Vahan Nanumyan

Abstract: A fundamental issue of network data science is the ability to discern observed features that can be expected at random from those beyond such expectations. Configuration models play a crucial role here, allowing us to compare observations against degree-corrected null models. Nonetheless, existing formulations have limited applications to large-scale data analysis, either because they require expensive Monte Carlo simulations or because they lack the flexibility required to model real-world systems. With the generalized hypergeometric ensemble, we address both problems. To achieve this, we map the configuration model to an urn problem in which edges are represented as balls in an appropriately constructed urn. Doing so, we obtain a random graph model that reproduces and extends the properties of standard configuration models, with the critical advantage of a closed-form probability distribution.
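A hedged sketch of the urn mapping, using the univariate hypergeometric distribution as a simplification of the generalized (multivariate) ensemble the abstract describes; the degree values and ball counts below are made up for illustration:

```python
# Urn picture in miniature: each candidate edge slot is a ball, observed
# edges are a draw without replacement, and the closed form is the
# hypergeometric pmf. This univariate version is a simplification of the
# generalized ensemble; numbers below are illustrative assumptions.
from math import comb

def hypergeom_pmf(k, N, K, m):
    """P(X = k) when drawing m balls from N total, K of them 'successes'."""
    return comb(K, k) * comb(N - K, m - k) / comb(N, m)

# Illustrative setup: a node pair with degrees d_i = d_j = 4 contributes
# K = d_i * d_j = 16 balls out of N = 100 in the urn; we sample m = 10 edges
# and ask how likely it is that k = 2 of them land on this pair.
p = hypergeom_pmf(2, 100, 16, 10)
```

Because the probability is available in closed form, no Monte Carlo sampling is needed to decide whether an observed edge count between two nodes exceeds the null-model expectation.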


MRS Advances ◽  
2020 ◽  
Vol 5 (7) ◽  
pp. 347-353
Author(s):  
Roger H. French ◽  
Laura S. Bruckman

Abstract: Data science has advanced significantly in recent years and allows scientists to harness large-scale data analysis techniques using open source coding frameworks. Data science is a tool that should be taught to science and engineering students in addition to their chosen domain knowledge. An applied data science minor allows students to understand data and data handling as well as statistics and model development. This move will improve reproducibility and openness of research as well as allow for greater interdisciplinarity and more analyses focusing on critical scientific challenges.


F1000Research ◽  
2014 ◽  
Vol 3 ◽  
pp. 44 ◽  
Author(s):  
Jose M. Villaveces ◽  
Rafael C. Jimenez ◽  
Bianca H. Habermann

Summary: Protein interaction networks have become an essential tool in large-scale data analysis, integration, and the visualization of high-throughput data in the context of complex cellular networks. Many individual databases are available that provide information on binary interactions of proteins and small molecules. Community efforts such as PSICQUIC aim to unify and standardize information emanating from these public databases. Here we introduce PsicquicGraph, an open-source, web-based visualization component for molecular interactions from PSICQUIC services. Availability: PsicquicGraph is freely available at the BioJS Registry for download and enhancement. Instructions on how to use the tool are available here http://goo.gl/kDaIgZ and the source code can be found at http://github.com/biojs/biojs and DOI:10.5281/zenodo.7709.
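PSICQUIC services return interactions as tab-separated PSI-MI TAB (MITAB) records, where the first two columns identify the interactors. A minimal sketch of extracting interactor pairs; the sample line is illustrative, not real interaction data:

```python
# Minimal MITAB parsing sketch: interactor A and B identifiers occupy the
# first two tab-separated columns of each record. The sample record is an
# illustrative assumption, not actual database output.

def parse_mitab(lines):
    pairs = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        a, b = cols[0], cols[1]  # interactor A and B identifiers
        pairs.append((a, b))
    return pairs

sample = ["uniprotkb:P04637\tuniprotkb:Q00987\t-\n"]
pairs = parse_mitab(sample)
```

A shared tabular format like this is what lets a client such as PsicquicGraph draw a single network from many independent PSICQUIC providers.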


2020 ◽  
Vol 13 (12) ◽  
pp. 2993-2996
Author(s):  
El Kindi Rezig ◽  
Ashrita Brahmaroutu ◽  
Nesime Tatbul ◽  
Mourad Ouzzani ◽  
Nan Tang ◽  
...  
