Shared Data Science Infrastructure for Genomics Data

10.21203/rs.2.4295/v2 ◽

2019 ◽

Author(s):

Hamid Bagheri ◽

Usha Muppirala ◽

Rick Masonbrink ◽

Andrew J Severin ◽

Hridesh Rajan

Keyword(s):

Data Science ◽

Bacterial Genome ◽

Large Data ◽

Biological Data ◽

Domain Specific Language ◽

Data Repositories ◽

Specific Language ◽

Domain Specific ◽

Shared Data ◽

Genome Assemblies

Abstract Background: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boa_g are needed to efficiently process and parse data contained in large data repositories. The main features of Boa_g are inspired by existing languages for data-intensive computing, and Boa_g can easily integrate data from biological data repositories. Results: As a proof of concept, Boa for genomics, Boa_g, has been implemented to analyze RefSeq’s 153,848 annotation (GFF) and assembly (FASTA) file metadata. Boa_g provides a massive improvement over existing solutions like Python and MongoDB by utilizing a domain-specific language that uses Hadoop infrastructure for a smaller storage footprint, scales well, and requires fewer lines of code. We execute scripts through Boa_g to answer questions about the genomes in RefSeq. We identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. Boa_g databases provide a significant reduction in the storage required for the raw data and a significant speedup in querying large datasets, due to the automated parallelization and distribution that Hadoop infrastructure provides during computations. Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, Boa_g, gives researchers greater access to efficiently explore data in new ways. We demonstrate the potential of the domain-specific language Boa_g using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how Boa_g could be used with large biological datasets.


Shared Data Science Infrastructure for Genomics Data

10.21203/rs.2.4295/v1 ◽

2019 ◽

Author(s):

Hamid Bagheri ◽

Usha Muppirala ◽

Rick Masonbrink ◽

Andrew J Severin ◽

Hridesh Rajan

Keyword(s):

Data Science ◽

Bacterial Genome ◽

Large Data ◽

Biological Data ◽

Domain Specific Language ◽

Data Repositories ◽

Specific Language ◽

Domain Specific ◽

Shared Data ◽

Genome Assemblies

Abstract Background: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boa_g are needed to efficiently process and parse data contained in large data repositories. The main features of Boa_g are inspired by existing languages for data-intensive computing, and Boa_g can easily integrate data from biological data repositories. Results: As a proof of concept, Boa for genomics, Boa_g, has been implemented to analyze RefSeq’s 153,848 annotation (GFF) and assembly (FASTA) file metadata. Boa_g provides a massive improvement over existing solutions like Python and MongoDB by utilizing a domain-specific language that uses Hadoop infrastructure for a smaller storage footprint, scales well, and requires fewer lines of code. We execute scripts through Boa_g to answer questions about the genomes in RefSeq. We identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. Boa_g databases provide a significant reduction in the storage required for the raw data and a significant speedup in querying large datasets, due to the automated parallelization and distribution that Hadoop infrastructure provides during computations. Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, Boa_g, gives researchers greater access to efficiently explore data in new ways. We demonstrate the potential of the domain-specific language Boa_g using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how Boa_g could be used with large biological datasets.


Shared Data Science Infrastructure for Genomics Data

10.1101/307777 ◽

2018 ◽

Author(s):

Hamid Bagheri ◽

Usha Muppirala ◽

Andrew J Severin ◽

Hridesh Rajan

Keyword(s):

Data Science ◽

Gene Annotation ◽

Large Data ◽

Biological Data ◽

Genomic Research ◽

Data Repository ◽

Small Data ◽

Data Repositories ◽

Shared Data ◽

Genome Assemblies

Abstract Background: Creating a computational infrastructure that scales well to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared Data Science Infrastructures like Boa can be used to more efficiently process and parse data contained in large data repositories. The main features of Boa are inspired by existing languages for data-intensive computing, and Boa can easily integrate data from biological data repositories. Results: Here, we present an implementation of Boa for Genomic research (BoaG) on a relatively small data repository: RefSeq’s 97,716 annotation (GFF) and assembly (FASTA) files and metadata. We used BoaG to query the entire RefSeq dataset, gain insight into the RefSeq genome assemblies and gene model annotations, and show that assembly quality using the same assembler varies depending on species. Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, BoaG, can give researchers greater access to efficiently explore data in ways previously not possible for anyone but the most well-funded research groups. We demonstrate the efficiency of BoaG in exploring the RefSeq database of genome assemblies and annotations to identify interesting features of gene annotation, as a proof of concept for much larger datasets.


Glinda: Supporting Data Science with Live Programming, GUIs and a Domain-specific Language

Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems ◽

10.1145/3411764.3445267 ◽

2021 ◽

Author(s):

Robert A DeLine

Keyword(s):

Data Science ◽

Domain Specific Language ◽

Specific Language ◽

Domain Specific ◽

Live Programming


Domain-Specific Language Abstractions for Compression

2021 Data Compression Conference (DCC) ◽

10.1109/dcc50243.2021.00077 ◽

2021 ◽

Author(s):

Jessica Ray ◽

Ajay Brahmakshatriya ◽

Richard Wang ◽

Shoaib Kamil ◽

Albert Reuther ◽

...

Keyword(s):

Domain Specific Language ◽

Specific Language ◽

Domain Specific


Domain-general cognitive control and domain-specific language control in bilingual aphasia: A systematic quantitative literature review

Journal of Neurolinguistics ◽

10.1016/j.jneuroling.2021.101021 ◽

2021 ◽

Vol 60 ◽

pp. 101021

Author(s):

Vishnu KK Nair ◽

Tegan Rayner ◽

Samantha Siyambalapitiya ◽

Britta Biedermann

Keyword(s):

Literature Review ◽

Cognitive Control ◽

Domain Specific Language ◽

Specific Language ◽

Domain Specific ◽

Language Control ◽

Bilingual Aphasia


Proxemic Environments Modelling based on a Graphical Domain-Specific Language

2020 IEEE/ACS 17th International Conference on Computer Systems and Applications (AICCSA) ◽

10.1109/aiccsa50499.2020.9316496 ◽

2020 ◽

Author(s):

Paulo Perez ◽

Philippe Roose ◽

Yudith Cardinale ◽

Marc Dalmau ◽

Dominique Masson ◽

...

Keyword(s):

Domain Specific Language ◽

Specific Language ◽

Domain Specific


Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring

2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) ◽

10.1109/cgo51591.2021.9370313 ◽

2021 ◽

Author(s):

Mahwish Arif ◽

Ruoyu Zhou ◽

Hsi-Ming Ho ◽

Timothy M. Jones

Keyword(s):

Domain Specific Language ◽

Specific Language ◽

Domain Specific


A domain-specific language for parallel and grid computing

Proceedings of the 2008 AOSD workshop on Domain-specific aspect languages - DSAL '08 ◽

10.1145/1404927.1404929 ◽

2008 ◽

Cited By ~ 1

Author(s):

João L. Sobral ◽

Miguel P. Monteiro

Keyword(s):

Grid Computing ◽

Domain Specific Language ◽

Specific Language ◽

Domain Specific


RML: Theory and practice of a domain specific language for runtime verification

Science of Computer Programming ◽

10.1016/j.scico.2021.102610 ◽

2021 ◽

Vol 205 ◽

pp. 102610

Author(s):

Davide Ancona ◽

Luca Franceschini ◽

Angelo Ferrando ◽

Viviana Mascardi

Keyword(s):

Runtime Verification ◽

Theory And Practice ◽

Domain Specific Language ◽

Specific Language ◽

Domain Specific
