CyVerse for Reproducible Research: RNA-Seq Analysis

Author(s):  
Jason Williams

Abstract: Posing complex research questions poses complex reproducibility challenges. Datasets may need to be managed over long periods of time. Reliable and secure repositories are needed for data storage. Sharing big data requires advance planning and becomes complex when collaborators are spread across institutions and countries. Many complex analyses require the larger compute resources provided only by cloud and high-performance computing infrastructure. Finally, at publication, funder and publisher requirements for data availability, accessibility, and computational reproducibility must be met. For all of these reasons, cloud-based cyberinfrastructures are an important component for satisfying the needs of data-intensive research. Learning how to incorporate these technologies into your research skill set will allow you to work with data analysis challenges that are often beyond the resources of individual research institutions. One of the advantages of CyVerse is that there are many solutions for high-powered analyses that do not require knowledge of command-line (i.e., Linux) computing. In this chapter we will highlight CyVerse capabilities by analyzing RNA-Seq data. The lessons learned will translate to doing RNA-Seq in other computing environments and will focus on how CyVerse infrastructure supports reproducibility goals (e.g., metadata management, containers), team science (e.g., data sharing features), and flexible computing environments (e.g., interactive computing, scaling).

2018 ◽  
Author(s):  
Thomas G. Close ◽  
Phillip G. D. Ward ◽  
Francesco Sforazzini ◽  
Wojtek Goscinski ◽  
Zhaolin Chen ◽  
...  

Abstract: Mastering the “arcana of neuroimaging analysis”, the obscure knowledge required to apply an appropriate combination of software tools and parameters to analyse a given neuroimaging dataset, is a time-consuming process. Therefore, it is not typically feasible to invest the additional effort required to generalise workflow implementations to accommodate the various acquisition parameters, data storage conventions and computing environments in use at different research sites, which limits the reusability of published workflows. We present a novel software framework, Abstraction of Repository-Centric ANAlysis (Arcana), which enables the development of complex, “end-to-end” workflows that are adaptable to new analyses and portable to a wide range of computing infrastructures. Analysis templates for specific image types (e.g. MRI contrast) are implemented as Python classes, which define a range of potential derivatives and analysis methods. Arcana retrieves data from imaging repositories, which can be BIDS datasets, XNAT instances or plain directories, and stores selected derivatives and associated provenance back into a repository for reuse by subsequent analyses. Workflows are constructed using Nipype and can be executed on local workstations or in high-performance computing environments. Generic analysis methods can be consolidated within common base classes to facilitate code reuse and collaborative development, and these base classes can be specialised for study-specific requirements via class inheritance. Arcana provides a framework in which to develop unified neuroimaging workflows that can be reused across a wide range of research studies and sites.
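The class-based design described in this abstract can be illustrated with a minimal sketch. This is not Arcana's actual API: the class names, the `derivatives` mapping, the `derive` method, and the file name are all hypothetical, and an in-memory dictionary stands in for the repository that stores derivatives for reuse.

```python
class BaseAnalysis:
    """Hypothetical generic analysis template (not Arcana's actual API)."""

    # Maps each derivative name to the method that produces it.
    derivatives = {"brain_mask": "make_brain_mask"}

    def __init__(self, inputs):
        self.inputs = dict(inputs)
        self._cache = {}  # stands in for a repository of stored derivatives

    def derive(self, name):
        # Reuse a stored derivative if one exists; otherwise compute and store it.
        if name not in self._cache:
            method = getattr(self, self.derivatives[name])
            self._cache[name] = method()
        return self._cache[name]

    def make_brain_mask(self):
        # Placeholder for a real processing step.
        return f"mask({self.inputs['image']})"


class T1Analysis(BaseAnalysis):
    """Study-specific specialisation: inherits the mask, adds a segmentation."""

    derivatives = {**BaseAnalysis.derivatives, "segmentation": "make_segmentation"}

    def make_segmentation(self):
        # Depends on a derivative defined in the base class.
        return f"segment({self.inputs['image']}, {self.derive('brain_mask')})"


analysis = T1Analysis({"image": "sub-01_T1w.nii.gz"})
result = analysis.derive("segmentation")
print(result)
```

Requesting the segmentation triggers the inherited mask step first and stores both results, so a later request for either derivative reuses the stored value instead of recomputing it.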


2017 ◽  
pp. btx023 ◽  
Author(s):  
Arnald Alonso ◽  
Brittany N. Lasseigne ◽  
Kelly Williams ◽  
Josh Nielsen ◽  
Ryne C. Ramaker ◽  
...  

Electronics ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1471
Author(s):  
Jun-Yeong Lee ◽  
Moon-Hyun Kim ◽  
Syed Asif Raza Shah ◽  
Sang-Un Ahn ◽  
Heejun Yoon ◽  
...  

Data are important and ever growing in data-intensive scientific environments. This growth in research data requires storage systems that play pivotal roles in data management and analysis for scientific discoveries. Redundant Array of Independent Disks (RAID), a well-known storage technology that combines multiple disks into a single large logical volume, has been widely used for data redundancy and performance improvement. However, it requires RAID-capable hardware or software to build a RAID-enabled disk array, and RAID-based storage is difficult to scale up. To mitigate this problem, many distributed file systems have been developed and are actively used in various environments, especially in data-intensive computing facilities, where tremendous amounts of data have to be handled. In this study, we investigated and benchmarked various distributed file systems, such as Ceph, GlusterFS, Lustre and EOS, for data-intensive environments. In our experiment, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect read and write performance depending on the features of the data, and that have to be considered in data-intensive computing environments.
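As a rough illustration of what a read and write throughput measurement involves, the following sketch times one sequential write and one sequential read of a single file. It is not the benchmark used in the study: the function name, file name, and the size and block parameters are assumptions chosen for illustration. To probe a distributed file system, `directory` would point at its mount point (for example a FUSE mount); here a temporary local directory is used so the example runs anywhere.

```python
import os
import tempfile
import time


def measure_throughput(directory, size_mb=8, block_kb=256):
    """Write then read one file sequentially; return (write_MBps, read_MBps)."""
    block = os.urandom(block_kb * 1024)
    n_blocks = (size_mb * 1024) // block_kb
    path = os.path.join(directory, "bench.tmp")  # hypothetical scratch file

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # include the time to push data to storage
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(block_kb * 1024):
            total += len(chunk)
    read_s = time.perf_counter() - start

    os.remove(path)
    if total != n_blocks * block_kb * 1024:
        raise RuntimeError("read back fewer bytes than were written")
    return size_mb / write_s, size_mb / read_s


with tempfile.TemporaryDirectory() as scratch:
    write_mbps, read_mbps = measure_throughput(scratch)
print(f"write {write_mbps:.1f} MB/s, read {read_mbps:.1f} MB/s")
```

The read pass here is likely to be served from the operating system's page cache, so its figure overstates what the storage delivers. A realistic benchmark would use files larger than memory or drop caches between passes, and would vary file size and access pattern, since the abstract reports that performance depends on the features of the data.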


2013 ◽  
Vol 3 (1) ◽  
pp. 13-26 ◽  
Author(s):  
Sanjay P. Ahuja ◽  
Sindhu Mani

High Performance Computing (HPC) applications are scientific applications that require significant CPU capabilities. They are also data-intensive applications requiring large data storage. While many researchers have examined the performance of Amazon’s EC2 platform across some HPC benchmarks, an extensive comparison of Amazon’s EC2 and Microsoft’s Windows Azure on metrics such as memory bandwidth, I/O performance, and communication and computational performance is largely missing. The purpose of this paper is to implement existing benchmarks to evaluate and analyze these metrics for EC2 and Windows Azure, which span both Infrastructure-as-a-Service and Platform-as-a-Service types. This was accomplished by running MPI versions of the STREAM, Interleaved or Random (IOR) and NAS Parallel (NPB) benchmarks on small and medium instance types. In addition, a new EC2 medium instance type (m1.medium) was included in the analysis. These benchmarks measure memory bandwidth, I/O performance, and communication and computational performance.
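To make the memory bandwidth metric concrete, here is a minimal sketch of a STREAM-style triad kernel, `a[i] = b[i] + scalar * c[i]`, with the usual accounting of two reads and one write per element. It is not the MPI version of STREAM run in the paper, and the function names and array size are assumptions. In pure Python the loop is bound by the interpreter, so the printed figure shows how the metric is computed and is not a meaningful hardware measurement.

```python
import time
from array import array


def triad(a, b, c, scalar):
    """STREAM-style triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bandwidth(n=200_000, scalar=3.0, repeats=3):
    """Return (best_bytes_per_second, a) over several timed triad runs."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, scalar)
        best = min(best, time.perf_counter() - start)
    # Two reads and one write of one double per element.
    bytes_moved = 3 * a.itemsize * n
    return bytes_moved / best, a


bandwidth, a = triad_bandwidth()
print(f"triad: {bandwidth / 1e6:.1f} MB/s (interpreter-bound, illustrative only)")
```

Taking the best of several runs reduces the influence of one-off delays such as the first touch of freshly allocated memory.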

