CyVerse for Reproducible Research: RNA-Seq Analysis

Plant Bioinformatics - Methods in Molecular Biology ◽ 10.1007/978-1-0716-2067-0_3 ◽ 2022 ◽ pp. 57-79

Author(s): Jason Williams

Keyword(s): Data Storage ◽ High Performance ◽ Lessons Learned ◽ Data Availability ◽ Reproducible Research ◽ Rna Seq ◽ Data Intensive ◽ Interactive Computing ◽ Computing Environments ◽ Performance Computing

Abstract: Posing complex research questions poses complex reproducibility challenges. Datasets may need to be managed over long periods of time. Reliable and secure repositories are needed for data storage. Sharing big data requires advance planning and becomes complex when collaborators are spread across institutions and countries. Many complex analyses require the larger compute resources only provided by cloud and high-performance computing infrastructure. Finally, at publication, funder and publisher requirements must be met for data availability and accessibility and computational reproducibility. For all of these reasons, cloud-based cyberinfrastructures are an important component for satisfying the needs of data-intensive research. Learning how to incorporate these technologies into your research skill set will allow you to work with data analysis challenges that are often beyond the resources of individual research institutions. One of the advantages of CyVerse is that there are many solutions for high-powered analyses that do not require knowledge of command line (i.e., Linux) computing. In this chapter we will highlight CyVerse capabilities by analyzing RNA-Seq data. The lessons learned will translate to doing RNA-Seq in other computing environments and will focus on how CyVerse infrastructure supports reproducibility goals (e.g., metadata management, containers), team science (e.g., data sharing features), and flexible computing environments (e.g., interactive computing, scaling).

A comprehensive framework to capture the arcana of neuroimaging analysis

10.1101/447649 ◽ 2018 ◽ Cited By ~ 3

Author(s): Thomas G. Close ◽ Phillip G. D. Ward ◽ Francesco Sforazzini ◽ Wojtek Goscinski ◽ Zhaolin Chen ◽ ...

Keyword(s): Data Storage ◽ High Performance ◽ Software Framework ◽ Mri Contrast ◽ Analysis Methods ◽ Wide Range ◽ Computing Environments ◽ Comprehensive Framework ◽ Performance Computing ◽ Generic Analysis

Abstract: Mastering the “arcana of neuroimaging analysis”, the obscure knowledge required to apply an appropriate combination of software tools and parameters to analyse a given neuroimaging dataset, is a time-consuming process. Therefore, it is not typically feasible to invest the additional effort required to generalise workflow implementations to accommodate the various acquisition parameters, data storage conventions and computing environments in use at different research sites, limiting the reusability of published workflows. We present a novel software framework, Abstraction of Repository-Centric ANAlysis (Arcana), which enables the development of complex, “end-to-end” workflows that are adaptable to new analyses and portable to a wide range of computing infrastructures. Analysis templates for specific image types (e.g. MRI contrast) are implemented as Python classes, which define a range of potential derivatives and analysis methods. Arcana retrieves data from imaging repositories, which can be BIDS datasets, XNAT instances or plain directories, and stores selected derivatives and associated provenance back into a repository for reuse by subsequent analyses. Workflows are constructed using Nipype and can be executed on local workstations or in high performance computing environments. Generic analysis methods can be consolidated within common base classes to facilitate code-reuse and collaborative development, which can be specialised for study-specific requirements via class inheritance. Arcana provides a framework in which to develop unified neuroimaging workflows that can be reused across a wide range of research studies and sites.

A Framework for Multitasking Data-Intensive Management Services in High Performance Computing Environments

2015 IEEE First International Conference on Big Data Computing Service and Applications ◽ 10.1109/bigdataservice.2015.42 ◽ 2015

Author(s): Sivakumar Kulasekaran ◽ Maria Esteva ◽ Jessica Trelogan ◽ Si Liu

Keyword(s): High Performance Computing ◽ High Performance ◽ Intensive Management ◽ Data Intensive ◽ Computing Environments ◽ Performance Computing ◽ Management Services

aRNApipe: A balanced, efficient and distributed pipeline for processing RNA-seq data in high performance computing environments

Bioinformatics ◽ 10.1093/bioinformatics/btx023 ◽ 2017 ◽ pp. btx023 ◽ Cited By ~ 5

Author(s): Arnald Alonso ◽ Brittany N. Lasseigne ◽ Kelly Williams ◽ Josh Nielsen ◽ Ryne C. Ramaker ◽ ...

Keyword(s): High Performance Computing ◽ High Performance ◽ Rna Seq ◽ Computing Environments ◽ Performance Computing

On the Use of Containers in High Performance Computing Environments

2020 IEEE 13th International Conference on Cloud Computing (CLOUD) ◽ 10.1109/cloud49709.2020.00048 ◽ 2020

Author(s): Subil Abraham ◽ Arnab K. Paul ◽ Redwan Ibne Seraj Khan ◽ Ali R. Butt

Keyword(s): High Performance Computing ◽ High Performance ◽ Computing Environments ◽ Performance Computing

Performance Evaluation of Container-Based Virtualization for High Performance Computing Environments

2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing ◽ 10.1109/pdp.2013.41 ◽ 2013 ◽ Cited By ~ 197

Author(s): M. G. Xavier ◽ M. V. Neves ◽ F. D. Rossi ◽ T. C. Ferreto ◽ T. Lange ◽ ...

Keyword(s): Performance Evaluation ◽ High Performance Computing ◽ High Performance ◽ Computing Environments ◽ Performance Computing

Performance Evaluations of Distributed File Systems for Scientific Big Data in FUSE Environment

Electronics ◽ 10.3390/electronics10121471 ◽ 2021 ◽ Vol 10 (12) ◽ pp. 1471

Author(s): Jun-Yeong Lee ◽ Moon-Hyun Kim ◽ Syed Asif Raza Shah ◽ Sang-Un Ahn ◽ Heejun Yoon ◽ ...

Keyword(s): Data Storage ◽ Scale Up ◽ File Systems ◽ Performance Evaluations ◽ Distributed File Systems ◽ Data Intensive Computing ◽ Data Intensive ◽ Tremendous Amount ◽ Computing Environments ◽ And Performance

Data are important and ever growing in data-intensive scientific environments. Such research data growth requires data storage systems that play pivotal roles in data management and analysis for scientific discoveries. Redundant Array of Independent Disks (RAID), a well-known storage technology combining multiple disks into a single large logical volume, has been widely used for the purpose of data redundancy and performance improvement. However, this requires RAID-capable hardware or software to build up a RAID-enabled disk array. In addition, it is difficult to scale up the RAID-based storage. In order to mitigate such a problem, many distributed file systems have been developed and are being actively used in various environments, especially in data-intensive computing facilities, where a tremendous amount of data have to be handled. In this study, we investigated and benchmarked various distributed file systems, such as Ceph, GlusterFS, Lustre and EOS for data-intensive environments. In our experiment, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect the read and write performance depending on the features of data, which have to be considered in data-intensive computing environments.

Identification of Biologically Significant Elements Using Correlation Networks in High Performance Computing Environments

Bioinformatics and Biomedical Engineering - Lecture Notes in Computer Science ◽ 10.1007/978-3-319-16480-9_58 ◽ 2015 ◽ pp. 607-619 ◽ Cited By ~ 1

Author(s): Kathryn Dempsey Cooper ◽ Sachin Pawaskar ◽ Hesham H. Ali

Keyword(s): High Performance Computing ◽ High Performance ◽ Correlation Networks ◽ Computing Environments ◽ Performance Computing

Toward high performance computing in unconventional computing environments

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing - HPDC '10 ◽ 10.1145/1851476.1851569 ◽ 2010

Author(s): Brent Rood ◽ Nathan Gnanasambandam ◽ Michael J. Lewis ◽ Naveen Sharma

Keyword(s): High Performance Computing ◽ High Performance ◽ Unconventional Computing ◽ Computing Environments ◽ Performance Computing

Constructing numerical software libraries for high-performance computing environments

Parallel Scientific Computing - Lecture Notes in Computer Science ◽ 10.1007/bfb0030144 ◽ 1994 ◽ pp. 147-168

Author(s): Jaeyoung Choi ◽ Jack J. Dongarra ◽ Roldan Pozo ◽ David W. Walker

Keyword(s): High Performance Computing ◽ High Performance ◽ Software Libraries ◽ Numerical Software ◽ Computing Environments ◽ Performance Computing

Empirical Performance Analysis of HPC Benchmarks Across Variations in Cloud Computing

International Journal of Cloud Applications and Computing ◽ 10.4018/ijcac.2013010102 ◽ 2013 ◽ Vol 3 (1) ◽ pp. 13-26 ◽ Cited By ~ 4

Author(s): Sanjay P. Ahuja ◽ Sindhu Mani

Keyword(s): Data Storage ◽ High Performance ◽ Large Data ◽ Extensive Study ◽ Memory Bandwidth ◽ Platform As A Service ◽ Data Intensive ◽ Computational Performance ◽ Empirical Performance ◽ Data Intensive Applications

High Performance Computing (HPC) applications are scientific applications that require significant CPU capabilities. They are also data-intensive applications requiring large data storage. While many researchers have examined the performance of Amazon’s EC2 platform across some HPC benchmarks, an extensive study and their comparison between Amazon’s EC2 and Microsoft’s Windows Azure is largely missing with metrics such as memory bandwidth, I/O performance, and communication and computational performance. The purpose of this paper is to implement existing benchmarks to evaluate and analyze these metrics for EC2 and Windows Azure that span both Infrastructure-as-a-Service and Platform-as-a-Service types. This was accomplished by running MPI versions of STREAM, Interleaved or Random (IOR) and NAS Parallel (NPB) benchmarks on small and medium instance types. In addition a new EC2 medium instance type (m1.medium) was also included in the analysis. These benchmarks measure the memory bandwidth, I/O performance, communication and computational performance.
