Shared Data Science Infrastructure for Genomics Data

2018 ◽  
Author(s):  
Hamid Bagheri ◽  
Usha Muppirala ◽  
Andrew J Severin ◽  
Hridesh Rajan

Abstract Background: Creating a computational infrastructure that scales well enough to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting, and analyzing relevant data. Shared data science infrastructures like Boa can be used to process and parse data contained in large data repositories more efficiently. The main features of Boa are inspired by existing languages for data-intensive computing, and it can easily integrate data from biological data repositories. Results: Here, we present an implementation of Boa for genomic research (BoaG) on a relatively small data repository: RefSeq’s 97,716 annotation (GFF) and assembly (FASTA) files and their metadata. We used BoaG to query the entire RefSeq dataset, gain insight into the RefSeq genome assemblies and gene model annotations, and show that assembly quality with the same assembler varies by species. Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The shared data science infrastructure BoaG can give researchers access to explore data efficiently in ways previously possible only for the most well-funded research groups. As a proof of concept for much larger datasets, we demonstrate the efficiency of BoaG in exploring the RefSeq database of genome assemblies and annotations to identify interesting features of gene annotation.
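
BoaG's own query syntax is not reproduced here. As a rough illustration of the kind of aggregation such a query expresses in a few lines, the following Python sketch counts gene models per species across a directory of GFF files; the file paths and the species-from-filename convention are assumptions, not the BoaG data layout.

```python
# Illustrative only: a per-species aggregation of the sort a BoaG query
# performs over RefSeq annotation metadata, written in plain Python.
# The directory layout and file-naming scheme below are assumptions.
import glob
from collections import Counter

gene_counts = Counter()

for path in glob.glob("refseq/*.gff"):
    species = path.split("/")[-1].split(".")[0]  # assumed naming scheme
    with open(path) as gff:
        for line in gff:
            if line.startswith("#"):
                continue  # skip GFF header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 2 and fields[2] == "gene":
                gene_counts[species] += 1

for species, n in gene_counts.most_common(10):
    print(species, n)
```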

2019 ◽  
Author(s):  
Hamid Bagheri ◽  
Usha Muppirala ◽  
Rick Masonbrink ◽  
Andrew J Severin ◽  
Hridesh Rajan

Abstract Background: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting, and analyzing relevant data. Shared data science infrastructures like BoaG are needed to efficiently process and parse data contained in large data repositories. The main features of BoaG are inspired by existing languages for data-intensive computing, and it can easily integrate data from biological data repositories. Results: As a proof of concept, Boa for genomics (BoaG) has been implemented to analyze RefSeq’s 153,848 annotation (GFF) and assembly (FASTA) file metadata. BoaG provides a substantial improvement over existing solutions such as Python and MongoDB: its domain-specific language runs on Hadoop infrastructure, giving a smaller storage footprint that scales well and requiring fewer lines of code. We execute scripts through BoaG to answer questions about the genomes in RefSeq: we identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. BoaG databases significantly reduce the storage required for the raw data and significantly speed up queries over large datasets, thanks to the automated parallelization and distribution that the Hadoop infrastructure provides during computation. Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The shared data science infrastructure BoaG gives researchers greater access to explore data efficiently in new ways. We demonstrate the potential of the domain-specific language BoaG by using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how BoaG could be used with large biological datasets.
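
The speed-up claimed here comes from Hadoop-style automatic parallelization: per-record map work plus an associative reduce. A toy sketch of that pattern, answering the "most common bacterial assembler" question over in-memory records; the record fields ("assembler", "kingdom", "year") are assumptions for illustration, not the actual BoaG schema.

```python
# Toy map/reduce over assembly metadata records, mimicking how BoaG's
# Hadoop backend parallelizes aggregation. Field names are assumptions.
from collections import Counter
from functools import reduce

records = [
    {"assembler": "SPAdes", "kingdom": "bacteria", "year": 2017},
    {"assembler": "SPAdes", "kingdom": "bacteria", "year": 2018},
    {"assembler": "Velvet", "kingdom": "bacteria", "year": 2015},
    {"assembler": "Canu",   "kingdom": "animals",  "year": 2018},
]

def map_phase(record):
    # Emit a partial count per record; independent per record, so a
    # Hadoop-style runtime can distribute this step freely.
    if record["kingdom"] == "bacteria":
        return Counter({record["assembler"]: 1})
    return Counter()

def reduce_phase(a, b):
    # Merge partial counts; associative, so partial results can be
    # combined in any order across workers.
    return a + b

totals = reduce(reduce_phase, map(map_phase, records), Counter())
print(totals.most_common(1))  # most commonly used bacterial assembler
```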


Metabolomics ◽  
2019 ◽  
Vol 15 (10) ◽  
Author(s):  
Kevin M. Mendez ◽  
Leighton Pritchard ◽  
Stacey N. Reinke ◽  
David I. Broadhurst

Abstract Background: A lack of transparency and reporting standards in the scientific community has led to increasing and widespread concerns about the reproducibility and integrity of results. As an omics science that generates vast amounts of data and relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The metabolomics community has made substantial efforts to align with FAIR data standards by promoting open data formats, data repositories, online spectral libraries, and metabolite databases. Open data analysis platforms also exist; however, they tend to be inflexible and rely on the user to adequately report their methods and results. To enable FAIR data science in metabolomics, methods and results need to be disseminated transparently, in a manner that is rapid, reusable, and fully integrated with the published work. To ensure broad use within the community, such a framework also needs to be inclusive of, and intuitive for, computational novices and experts alike. Aim of Review: To encourage metabolomics researchers from all backgrounds to take control of their own data science, mould it to their personal requirements, and enthusiastically share resources through open science. Key Scientific Concepts of Review: This tutorial introduces the concept of interactive web-based computational laboratory notebooks. The reader is guided through a set of experiential tutorials specifically targeted at metabolomics researchers, based around the Jupyter Notebook web application, the GitHub data repository, and the Binder cloud computing platform.
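
As a small, hedged illustration of the notebook practice this tutorial advocates, a first Jupyter cell can record the environment and fix random seeds so that a copy relaunched elsewhere (for example via Binder) behaves like the original. The packages shown are generic examples, not a prescription from the tutorial itself.

```python
# Minimal reproducibility preamble for a shared notebook: record the
# environment in view and fix seeds so stochastic steps repeat exactly
# when the notebook is relaunched (e.g., on Binder).
import sys
import random
import platform

import numpy as np

print("Python:", sys.version.split()[0], "on", platform.platform())
print("NumPy:", np.__version__)

SEED = 42  # fixed seed so random draws are identical on every run
random.seed(SEED)
np.random.seed(SEED)
```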


Data ◽  
2021 ◽  
Vol 6 (2) ◽  
pp. 15
Author(s):  
Ana Trisovic ◽  
Katherine Mika ◽  
Ceilyn Boyd ◽  
Sebastian Feger ◽  
Mercè Crosas

Sharing data and code for reuse has become increasingly important in scientific work over the past decade. However, in practice, shared data and code may be unusable, or published results obtained from them may be irreproducible. Data repository features and services contribute significantly to the quality, longevity, and reusability of datasets. This paper presents a combination of original and secondary data analysis studies focusing on computational reproducibility, data curation, and gamified design elements that can be employed to indicate and improve the quality of shared data and code. The findings of these studies are sorted into three approaches that can be valuable to data repositories, archives, and other research dissemination platforms.


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 672 ◽  
Author(s):  
Ben Busby ◽  
Matthew Lesko ◽  
Lisa Federer ◽  
...

In genomics, bioinformatics and other areas of data science, gaps exist between extant public datasets and the open-source software tools built by the community to analyze similar data types.  The purpose of biological data science hackathons is to assemble groups of genomics or bioinformatics professionals and software developers to rapidly prototype software to address these gaps.  The only two rules for the NCBI-assisted hackathons run so far are that 1) data either must be housed in public data repositories or be deposited to such repositories shortly after the hackathon’s conclusion, and 2) all software comprising the final pipeline must be open-source or open-use.  Proposed topics, as well as suggested tools and approaches, are distributed to participants at the beginning of each hackathon and refined during the event.  Software, scripts, and pipelines are developed and published on GitHub, a web service providing publicly available, free-usage tiers for collaborative software development. The code resulting from each hackathon is published at https://github.com/NCBI-Hackathons/ with separate directories or repositories for each team.
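
For readers who want to browse the output of past events programmatically, the public GitHub REST API can list the repositories under that organization. A minimal, unauthenticated sketch (subject to GitHub's rate limits for anonymous requests); this is a convenience script, not part of the hackathon tooling itself.

```python
# List public repositories in the NCBI-Hackathons GitHub organization
# via the public REST API. Unauthenticated calls are rate-limited.
import json
import urllib.request

url = "https://api.github.com/orgs/NCBI-Hackathons/repos?per_page=100"
with urllib.request.urlopen(url) as resp:
    repos = json.load(resp)

for repo in repos:
    print(repo["full_name"], "-", repo.get("description") or "(no description)")
```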


2019 ◽  
pp. 1-9
Author(s):  
Jerome Jourquin ◽  
Stephanie Birkey Reffey ◽  
Cheryl Jernigan ◽  
Mia Levy ◽  
Glendon Zinser ◽  
...  

Integrating different types of data, including electronic health records, imaging data, administrative and claims databases, large data repositories, the Internet of Things, genomics, and other omics data, is both a challenge and an opportunity that must be tackled head on. We explore some of the challenges and opportunities in optimizing data integration to accelerate breast cancer discovery and improve patient outcomes. Susan G. Komen convened three meetings (2015, 2017, and 2018) with various stakeholders to discuss challenges, opportunities, and next steps to enhance the use of big data in the field of breast cancer. Meeting participants agreed that big data approaches can enhance the identification of better therapies, improve outcomes, reduce disparities, and optimize precision medicine. One challenge is that databases must be shared, linked with each other, standardized, and interoperable. Patients want to be active participants in research and their own care, and to control how their data are used. Many patients have privacy concerns and do not understand how sharing their data can help to effectively drive discovery. Public education is essential, and breast cancer researchers who are skilled in using and analyzing big data are needed. Patient advocacy groups can play multiple roles to help maximize and leverage big data to better serve patients. Komen is committed to educating patients on big data issues, encouraging data sharing by all stakeholders, assisting in training the next generation of data science breast cancer researchers, and funding research projects that will use real-life data in real time to revolutionize the way breast cancer is understood and treated.


2017 ◽  
Vol 25 (1) ◽  
pp. 17-24 ◽  
Author(s):  
Hossein Estiri ◽  
Kari A Stephens ◽  
Jeffrey G Klann ◽  
Shawn N Murphy

Abstract Objective: To provide an open-source, interoperable, and scalable data quality assessment tool for evaluation and visualization of completeness and conformance in electronic health record (EHR) data repositories. Materials and Methods: This article describes the tool’s design and architecture and gives an overview of its outputs using a sample dataset of 200,000 randomly selected patient records with an encounter since January 1, 2010, extracted from the Research Patient Data Registry (RPDR) at Partners HealthCare. All the code and instructions to run the tool and interpret its results are provided in the Supplementary Appendix. Results: DQe-c produces a web-based report that summarizes data completeness and conformance in a given EHR data repository through descriptive graphics and tables. Results from running the tool on the sample RPDR data are organized into 4 sections: load and test details, completeness test, data model conformance test, and test of missingness in key clinical indicators. Discussion: Open science, interoperability across major clinical informatics platforms, and scalability to large databases are key design considerations for DQe-c. Iterative implementation of the tool across different institutions directed us to improve its scalability and interoperability and to find ways to facilitate local setup. Conclusion: EHR data quality assessment has been hampered by ad hoc processes. The architecture and implementation of DQe-c offer valuable insights for developing reproducible and scalable data science tools to assess, manage, and process data in clinical data repositories.
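
DQe-c's own implementation is not reproduced here. As a rough sketch of the completeness test its report summarizes, the pandas snippet below tabulates, for each field in an EHR extract, what fraction of records is populated; the table and column names are hypothetical, not DQe-c's data model.

```python
# Sketch of a column-level completeness test in the spirit of DQe-c's
# report: percentage of non-missing values per field. Column names are
# hypothetical examples, not DQe-c's actual schema.
import pandas as pd

records = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "birth_date": ["1980-01-02", None, "1975-07-30", "1990-11-11"],
    "sex":        ["F", "M", None, None],
})

completeness = records.notna().mean().mul(100).round(1)
print(completeness.sort_values().to_string())
# Fields with low percentages would be flagged for review in the report.
```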


2019 ◽  
Author(s):  
Joel Vizueta ◽  
Alejandro Sánchez-Gracia ◽  
Julio Rozas

Abstract Gene annotation is a critical bottleneck in genomic research, especially for the comprehensive study of very large gene families in the genomes of non-model organisms. Despite the recent progress in automatic methods, the tools developed for this task often produce inaccurate annotations, such as fused, chimeric, partial or even completely absent gene models for many family copies, which require considerable extra effort to be amended. Here we present BITACORA, a bioinformatics solution that integrates sequence similarity search tools and Perl scripts to facilitate both the curation of these inaccurate annotations and the identification of previously undetected gene family copies directly from DNA sequences. We tested the performance of the BITACORA pipeline in annotating the members of two chemosensory gene families of different sizes in seven available chelicerate genome drafts. Despite the relatively high fragmentation of some of these drafts, BITACORA was able to improve the annotation of many members of these families and detected thousands of new chemoreceptors encoded in genome sequences. The program generates an output file in general feature format (GFF), with both curated and novel gene models, and a FASTA file with the predicted proteins. These outputs can be easily integrated into genomic annotation editors, greatly facilitating subsequent manual annotation and downstream evolutionary analyses.
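
BITACORA's Perl internals are not shown here. As a hedged sketch of the underlying idea (locating candidate gene family loci from sequence similarity searches), the snippet below merges overlapping tabular BLAST hits on each scaffold into candidate regions; the input is standard BLAST tabular output (-outfmt 6), while the file name and the 1 kb merge distance are illustrative assumptions.

```python
# Merge overlapping TBLASTN hits (BLAST tabular, -outfmt 6) on each
# scaffold into candidate gene-family loci, the kind of first pass a
# pipeline like BITACORA automates. Merge distance is illustrative.
from collections import defaultdict

MERGE_DISTANCE = 1000  # join hits closer than 1 kb into one locus

hits = defaultdict(list)  # scaffold -> list of (start, end)
with open("family_vs_genome.blast.tsv") as tsv:  # hypothetical file name
    for line in tsv:
        f = line.rstrip("\n").split("\t")
        sstart, send = sorted((int(f[8]), int(f[9])))  # subject coordinates
        hits[f[1]].append((sstart, send))

for scaffold, spans in hits.items():
    spans.sort()
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start - merged[-1][1] <= MERGE_DISTANCE:
            merged[-1][1] = max(merged[-1][1], end)  # extend current locus
        else:
            merged.append([start, end])
    for start, end in merged:
        print(f"{scaffold}\t{start}\t{end}\tcandidate_locus")
```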
