Advances in Systems Analysis, Software Engineering, and High Performance Computing - Data Intensive Distributed Computing
Latest Publications


TOTAL DOCUMENTS: 12 (FIVE YEARS: 0)
H-INDEX: 2 (FIVE YEARS: 0)
Published by: IGI Global
ISBN: 9781615209712, 9781615209729

Author(s): Ewa Deelman, Ann Chervenak

Scientific applications in astronomy, earthquake science, gravitational-wave physics, and other domains have embraced workflow technologies to do large-scale science. Workflows enable researchers to collaboratively design, manage, and obtain results that involve hundreds of thousands of steps, access terabytes of data, and generate similar amounts of intermediate and final data products. Although workflow systems are able to facilitate the automated generation of data products, many issues still remain to be addressed. These issues exist in different forms in the workflow lifecycle. This chapter describes the workflow lifecycle as consisting of a workflow generation phase, where the analysis is defined; a workflow planning phase, where resources needed for execution are selected; a workflow execution phase, where the actual computations take place; and a final phase, where results, metadata, and provenance are stored. The authors discuss the issues related to data management at each step of the workflow lifecycle. They describe challenge problems and illustrate them in the context of real-life applications. They discuss the challenges, possible solutions, and open issues faced when mapping and executing large-scale workflows on current cyberinfrastructure. They particularly emphasize the issues related to the management of data throughout the workflow lifecycle.
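
To make the lifecycle concrete, here is a minimal Python sketch that walks a toy two-task workflow through generation, planning, execution, and provenance capture; all class and function names (Task, generate_workflow, plan, execute) are illustrative assumptions, not the chapter's actual software.

```python
# Illustrative sketch only: phase names follow the lifecycle described above,
# but every class and function here is hypothetical, not the chapter's software.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    name: str
    inputs: List[str]
    outputs: List[str]
    deps: List[str] = field(default_factory=list)
    site: str = ""          # filled in during planning

def generate_workflow() -> Dict[str, Task]:
    """Workflow generation: define the analysis as tasks and data dependencies."""
    return {
        "extract": Task("extract", ["raw.dat"], ["clean.dat"]),
        "analyze": Task("analyze", ["clean.dat"], ["result.dat"], deps=["extract"]),
    }

def plan(workflow: Dict[str, Task], sites: List[str]) -> None:
    """Workflow planning: map each task to an execution site (here, trivially round-robin)."""
    for i, task in enumerate(workflow.values()):
        task.site = sites[i % len(sites)]

def execute(workflow: Dict[str, Task], provenance: List[dict]) -> None:
    """Workflow execution plus provenance capture for each produced data product."""
    done = set()
    while len(done) < len(workflow):
        for task in workflow.values():
            if task.name not in done and all(d in done for d in task.deps):
                # ... run the task on task.site and stage its outputs ...
                for out in task.outputs:
                    provenance.append({"product": out, "task": task.name, "site": task.site})
                done.add(task.name)

provenance: List[dict] = []
wf = generate_workflow()
plan(wf, sites=["cluster-a", "cluster-b"])
execute(wf, provenance)
print(provenance)
```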


Author(s): Jason Leigh, Andrew Johnson, Luc Renambot, Venkatram Vishwanath, Tom Peterka, ...

An effective visualization is best achieved through the creation of a proper representation of data and the interactive manipulation and querying of the visualization. Large-scale data visualization is particularly challenging because the size of the data is several orders of magnitude larger than what can be managed on an average desktop computer. Large-scale data visualization therefore requires the use of distributed computing. By leveraging the widespread expansion of the Internet and other national and international high-speed network infrastructure such as the National LambdaRail, Internet2, and the Global Lambda Integrated Facility, data and service providers began to migrate toward a model of widespread distribution of resources. This chapter introduces different instantiations of the visualization pipeline and the historical motivation for their creation. The authors examine individual components of the pipeline in detail to understand the technical challenges that must be solved in order to ensure continued scalability. They discuss distributed data management issues that are specifically relevant to large-scale visualization. They also introduce key data rendering techniques and explain, through case studies, approaches for scaling them by leveraging distributed computing. Lastly, they describe advanced display technologies that are now considered the “lenses” for examining large-scale data.
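
As a rough illustration of what the stages of a visualization pipeline look like, the sketch below chains filter, map, render, and display steps over synthetic data; the stage functions and NumPy-based stand-ins are assumptions for illustration only, and in a distributed deployment each stage could run on a separate resource.

```python
# Hypothetical sketch of the classic visualization pipeline (filter -> map -> render -> display);
# the stage names mirror the pipeline idea, the implementation is purely illustrative.
import numpy as np

def filter_stage(volume, threshold):
    """Data filtering: keep only the subset of samples above a value of interest."""
    return volume[volume > threshold]

def map_stage(samples):
    """Mapping: convert raw samples to a renderable representation (here, normalized scalars)."""
    rng = samples.max() - samples.min()
    return (samples - samples.min()) / rng if rng else samples

def render_stage(mapped, bins=16):
    """Rendering: reduce the mapped data to pixels (here, a 1-D histogram as a stand-in image)."""
    image, _ = np.histogram(mapped, bins=bins)
    return image

def display_stage(image):
    """Display: in practice a tiled wall or remote viewer; here, plain text output."""
    print(" ".join(f"{int(v):3d}" for v in image))

# In a distributed deployment each stage could run on a different resource,
# streaming its output to the next stage over the network.
volume = np.random.rand(100_000)
display_stage(render_stage(map_stage(filter_stage(volume, 0.5))))
```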


Author(s): Esma Yildirim, Mehmet Balman, Tevfik Kosar

With the continuous increase in the data requirements of scientific and commercial applications, access to remote and distributed data has become a major bottleneck for end-to-end application performance. Traditional distributed computing systems closely couple data access and computation, and generally, data access is considered a side effect of computation. The limitations of traditional distributed computing systems and CPU-oriented scheduling and workflow management tools in managing complex data handling have motivated a newly emerging era: data-aware distributed computing. In this chapter, the authors elaborate on how the most crucial distributed computing components, such as scheduling, workflow management, and end-to-end throughput optimization, can become “data-aware.” In this new computing paradigm, called data-aware distributed computing, data placement activities are represented as full-featured jobs in the end-to-end workflow, and they are queued, managed, scheduled, and optimized via a specialized data-aware scheduler. As part of this new paradigm, the authors present a set of tools for mitigating the data bottleneck in distributed computing systems, which consists of three main components: a data-aware scheduler, which provides capabilities such as planning, scheduling, resource reservation, job execution, and error recovery for data movement tasks; integration of these capabilities to the other layers in distributed computing, such as workflow planning; and further optimization of data movement tasks via dynamic tuning of underlying protocol transfer parameters.
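
The sketch below illustrates, under stated assumptions, the core idea of treating data placement as a first-class, schedulable job with queuing, retry-based error recovery, and dynamic tuning of a transfer parameter; the names (DataPlacementJob, transfer, data_aware_scheduler) and the failure model are hypothetical, not the chapter's scheduler.

```python
# Illustrative sketch of data placement as a first-class, schedulable job;
# all names and the random failure model are assumptions, not the chapter's tools.
import queue
import random

class DataPlacementJob:
    def __init__(self, src, dst, parallel_streams=4):
        self.src, self.dst = src, dst
        self.parallel_streams = parallel_streams   # tunable protocol transfer parameter
        self.attempts = 0

def transfer(job):
    """Stand-in for an actual wide-area transfer; fails randomly to exercise recovery."""
    job.attempts += 1
    return random.random() > 0.3

def data_aware_scheduler(jobs, max_retries=3):
    """Queue, schedule, retry, and tune data movement tasks independently of computation."""
    q = queue.Queue()
    for j in jobs:
        q.put(j)
    while not q.empty():
        job = q.get()
        if transfer(job):
            print(f"moved {job.src} -> {job.dst} ({job.parallel_streams} streams)")
        elif job.attempts < max_retries:
            job.parallel_streams = min(job.parallel_streams * 2, 32)  # dynamic tuning on retry
            q.put(job)                                                # error recovery: requeue
        else:
            print(f"gave up on {job.src} after {job.attempts} attempts")

data_aware_scheduler([DataPlacementJob("siteA:/data/run1", "siteB:/scratch/run1")])
```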


Author(s): Ioan Raicu, Ian Foster, Yong Zhao, Alex Szalay, Philip Little, ...

Many-task computing aims to bridge the gap between two computing paradigms: high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e., the reliance on parallel file systems with static configurations) do not scale to today’s largest systems for data-intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a “data diffusion” approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real-world performance, and develop a competitive online caching eviction policy. They also present extensive empirical experiments exploring the benefits of data diffusion under both static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.
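
A toy sketch of the locality-aware dispatch idea follows: tasks are sent to nodes that already cache their input, and each node evicts in least-recently-used order. The real data diffusion model and its competitive online eviction policy are considerably richer; NodeCache and dispatch are illustrative names, not the chapter's implementation.

```python
# Toy sketch of locality-aware dispatch with an LRU eviction policy; illustrative only.
from collections import OrderedDict

class NodeCache:
    """Per-node cache of data objects, evicted in least-recently-used order."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, name):
        hit = name in self.items
        if hit:
            self.items.move_to_end(name)
        else:
            if len(self.items) >= self.capacity:
                self.items.popitem(last=False)   # evict the least-recently-used object
            self.items[name] = True
        return hit

def dispatch(task_file, nodes):
    """Prefer a node that already caches the task's input (data locality); else pick any."""
    for node_id, cache in nodes.items():
        if task_file in cache.items:
            return node_id
    return next(iter(nodes))

nodes = {"n1": NodeCache(2), "n2": NodeCache(2)}
for f in ["a", "b", "a", "c", "a"]:
    node = dispatch(f, nodes)
    hit = nodes[node].access(f)
    print(f"task on {f}: node {node}, cache {'hit' if hit else 'miss'}")
```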


Author(s): Ann Chervenak, Robert Schuler

Management of the large data sets produced by data-intensive scientific applications is complicated by the fact that participating institutions are often geographically distributed and separated by distinct administrative domains. A key data management problem in these distributed collaborations has been the creation and maintenance of replicated data sets. This chapter provides an overview of replica management schemes used in large, data-intensive, distributed scientific collaborations. Early replica management strategies focused on the development of robust, highly scalable catalogs for maintaining replica locations. In recent years, more sophisticated, application-specific replica management systems have been developed to support the requirements of scientific Virtual Organizations. These systems have motivated interest in application-independent, policy-driven schemes for replica management that can be tailored to meet the performance and reliability requirements of a range of scientific collaborations. The authors discuss the data replication solutions to meet the challenges associated with increasingly large data sets and the requirement to run data analysis at geographically distributed sites.
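
As a minimal illustration of the catalog idea behind early replica management strategies, the sketch below keeps a mapping from logical file names to physical replica locations with register, lookup, and unregister operations; the interface is an assumption for illustration, not a specific replica location service API.

```python
# Minimal sketch of a replica location catalog: logical file name -> physical replica URLs.
# The interface and URL forms are illustrative assumptions only.
from collections import defaultdict

class ReplicaCatalog:
    def __init__(self):
        self.replicas = defaultdict(set)

    def register(self, lfn, pfn):
        """Record that a physical copy (pfn) of a logical file (lfn) exists."""
        self.replicas[lfn].add(pfn)

    def lookup(self, lfn):
        """Return all known physical locations for a logical file."""
        return sorted(self.replicas[lfn])

    def unregister(self, lfn, pfn):
        """Remove a replica entry, e.g. after a policy-driven cleanup."""
        self.replicas[lfn].discard(pfn)

catalog = ReplicaCatalog()
catalog.register("lfn://survey/image42.fits", "gsiftp://siteA/data/image42.fits")
catalog.register("lfn://survey/image42.fits", "gsiftp://siteB/archive/image42.fits")
print(catalog.lookup("lfn://survey/image42.fits"))
```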


Author(s): Arcot Rajasekar, Mike Wan, Reagan Moore, Wayne Schroeder

Service-oriented architectures (SOA) enable orchestration of loosely coupled and interoperable functional software units to develop and execute complex but agile applications. Data management on a distributed data grid can be viewed as a set of operations that are performed across all stages in the life-cycle of a data object. The set of such operations depends on the type of objects, based on their physical and discipline-centric characteristics. In this chapter, the authors define server-side functions, called micro-services, which are orchestrated into conditional workflows for achieving large-scale data management specific to collections of data. Micro-services communicate with each other using parameter exchange, in-memory data structures, a database-based persistent information store, and a network messaging system that uses a serialization protocol for communicating with remote micro-services. The orchestration of the workflow is done by a distributed rule engine that chains and executes the workflows and maintains transactional properties through recovery micro-services. The authors discuss the micro-service-oriented architecture, compare the micro-service approach with traditional SOA, and describe the use of micro-services for implementing policy-based data management systems.
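
The sketch below illustrates only the chaining-with-recovery idea: micro-service-like steps run in order, and if one fails, the recovery steps for the already-completed ones run in reverse. It is a hypothetical stand-in, not the distributed rule engine described in the chapter.

```python
# Hedged sketch of chaining micro-services with recovery actions on failure;
# this mirrors the transactional idea only, not the actual rule engine.
def chain(steps):
    """Run (micro_service, recovery) pairs in order; on failure, undo completed steps."""
    completed = []
    for action, recovery in steps:
        try:
            action()
            completed.append(recovery)
        except Exception as exc:
            print(f"step failed ({exc}); running recovery micro-services")
            for undo in reversed(completed):
                undo()
            return False
    return True

def checksum(): print("micro-service: checksum the data object")
def replicate(): raise RuntimeError("remote resource unavailable")
def undo_checksum(): print("recovery: discard recorded checksum")
def undo_replicate(): print("recovery: remove partial replica")

ok = chain([(checksum, undo_checksum), (replicate, undo_replicate)])
print("workflow committed" if ok else "workflow rolled back")
```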


Author(s): Huadong Liu, Jinzhu Gao, Jian Huang, Micah Beck, Terry Moore

The emergence of high-resolution simulation, where simulation outputs have grown to terascale levels and beyond, raises major new challenges for the visualization community, which serves computational scientists who need adequate visualization services provided on demand. Many existing algorithms for parallel visualization were not designed to operate optimally on time-shared parallel systems or on heterogeneous systems; they are usually optimized for systems that are homogeneous and have been reserved for exclusive use. This chapter explores the possibility of developing parallel visualization algorithms that can use distributed, heterogeneous processors to visualize cutting-edge simulation datasets. The authors study how to effectively support multiple concurrent users operating on the same large dataset, with each focusing on a dynamically varying subset of the data. From a system design point of view, they observe that a distributed cache offers various advantages, including improved scalability. They develop basic scheduling mechanisms that achieve fault tolerance, load balancing, optimal use of resources, and flow control using system-level back-off, while still enforcing deadline-driven (i.e., time-critical) visualization.
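
To illustrate deadline-driven scheduling on heterogeneous processors, the toy sketch below assigns work pieces to whichever processor would finish them earliest and sheds pieces that would miss the frame deadline; all names, costs, and speeds are hypothetical and do not reproduce the chapter's scheduling mechanisms.

```python
# Illustrative sketch: deadline-driven assignment of visualization work to
# heterogeneous processors, with load shedding as a crude form of flow control.
def schedule_frame(pieces, processors, deadline):
    """pieces: {name: cost}; processors: {name: speed}; deadline: time budget per frame."""
    load = {p: 0.0 for p in processors}          # time at which each processor becomes free
    assignment, dropped = {}, []
    for piece, cost in sorted(pieces.items(), key=lambda kv: -kv[1]):
        # pick the processor that would finish this piece earliest
        best = min(processors, key=lambda p: load[p] + cost / processors[p])
        finish = load[best] + cost / processors[best]
        if finish <= deadline:
            load[best] = finish
            assignment[piece] = best
        else:
            dropped.append(piece)   # shed load rather than miss the deadline
    return assignment, dropped

pieces = {"block0": 4.0, "block1": 2.0, "block2": 6.0, "block3": 1.0}
processors = {"cpu-fast": 2.0, "cpu-slow": 1.0}
print(schedule_frame(pieces, processors, deadline=4.0))
```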


Author(s): Judy Qiu, Jaliya Ekanayake, Thilina Gunarathne, Jong Youl Choi, Seung-Hee Bae, ...

Data-intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes, including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in the Life Sciences. The recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing, leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large data sets coming from high-throughput devices of steadily increasing power. They show results for both clustering and dimension reduction algorithms, and the use of MapReduce on modest-size problems. They identify two key areas where further research is essential, and propose to develop new O(N log N) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model that combines the best features of MapReduce with those of high performance environments such as MPI.
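
A small sketch of the iterative MapReduce pattern follows, using a k-means-like loop in plain Python in which the map phase emits (nearest-centroid, point) pairs and the reduce phase averages each group. It illustrates the programming model only; it does not use Twister, Hadoop, MPI, or the chapter's data-mining algorithms.

```python
# Sketch of the "iterative MapReduce" pattern via a 1-D k-means-like loop;
# purely illustrative, not a real MapReduce runtime.
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) pairs."""
    pairs = []
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        pairs.append((nearest, p))
    return pairs

def reduce_phase(pairs, k):
    """Reduce: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    return [sum(groups[i]) / len(groups[i]) if groups[i] else 0.0 for i in range(k)]

def iterative_mapreduce(points, centroids, iterations=10):
    """Repeat map/reduce over the same (static) input; the iteration is the key idea."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), len(centroids))
    return centroids

print(iterative_mapreduce([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], centroids=[0.0, 5.0]))
```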


Author(s): Suraj Pandey, Rajkumar Buyya

This chapter presents a comprehensive survey of algorithms, techniques, and frameworks used for scheduling and management of data-intensive application workflows. Many complex scientific experiments are expressed in the form of workflows for structured, repeatable, controlled, scalable, and automated executions. This chapter focuses on workflows whose tasks process huge amounts of data, usually ranging from hundreds of megabytes to petabytes. Scientists are already using Grid systems that schedule these workflows onto globally distributed resources to optimize various objectives: minimizing the total makespan of the workflow, minimizing the cost and usage of network bandwidth, minimizing the cost of computation and storage, meeting the deadline of the application, and so forth. This chapter lists and describes the techniques used in each of these systems for processing huge amounts of data. A survey of workflow management techniques is useful for understanding how these Grid systems work, and it provides insights into the performance optimization of scientific applications dealing with data-intensive workloads.
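
As one concrete example of the makespan objective, the sketch below applies a simple min-min-style greedy heuristic to hypothetical task costs and resource speeds; real workflow schedulers of the kind surveyed in the chapter also account for task dependencies, data transfer, and cost models.

```python
# Hedged sketch of makespan minimization with a min-min style greedy heuristic;
# task names, costs, and resource speeds are made up for illustration.
def min_min_schedule(task_costs, resources):
    """task_costs: {task: base_cost}; resources: {name: speed}. Returns (schedule, makespan)."""
    ready = {r: 0.0 for r in resources}          # time at which each resource becomes free
    schedule = {}
    remaining = dict(task_costs)
    while remaining:
        # among all unscheduled tasks, pick the one with the earliest possible completion time
        task, resource, finish = min(
            ((t, r, ready[r] + c / resources[r]) for t, c in remaining.items() for r in resources),
            key=lambda x: x[2],
        )
        schedule[task] = (resource, finish)
        ready[resource] = finish
        del remaining[task]
    return schedule, max(ready.values())

tasks = {"stage-in": 10.0, "align": 40.0, "merge": 20.0}
grid = {"siteA": 4.0, "siteB": 1.0}
print(min_min_schedule(tasks, grid))
```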


Author(s): Ismail Akturk, Xinqi Wang, Tevfik Kosar

The unbounded increase in the size of data generated by scientific applications necessitates collaboration and sharing among the nation’s education and research institutions. Simply purchasing high-capacity, high-performance storage systems and adding them to the existing infrastructure of the collaborating institutions does not solve the underlying and highly challenging data-handling problem. Scientists are compelled to spend a great deal of time and energy on solving basic data-handling issues, such as the physical location of data, how to access it, and/or how to move it to visualization and/or compute resources for further analysis. This chapter presents the design and implementation of a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides lightweight clients that enable easy, transparent, and scalable access. In PetaShare, the authors have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability. The authors also present a high-level cross-domain metadata schema that provides a structured, systematic view of the multiple science domains supported by PetaShare.
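
The sketch below illustrates, in a few lines, the idea of asynchronously replicated multi-master metadata: each site accepts writes locally and ships them to peers later, with naive last-writer-wins conflict handling. The site names, data model, and conflict policy are assumptions for illustration, not PetaShare's implementation.

```python
# Toy sketch of asynchronously replicated multi-master metadata;
# every detail here (names, clock, conflict policy) is a simplifying assumption.
import itertools

_clock = itertools.count(1)   # global logical clock standing in for real timestamps

class MetadataSite:
    def __init__(self, name):
        self.name = name
        self.store = {}       # key -> (timestamp, value)
        self.outbox = []      # updates not yet shipped to peers

    def write(self, key, value):
        """Accept the write locally (multi-master), queue it for asynchronous propagation."""
        entry = (next(_clock), value)
        self.store[key] = entry
        self.outbox.append((key, entry))

    def replicate_to(self, peer):
        """Asynchronous propagation: apply queued updates at the peer, newest write wins."""
        for key, entry in self.outbox:
            if key not in peer.store or peer.store[key][0] < entry[0]:
                peer.store[key] = entry
        self.outbox.clear()

site_a, site_b = MetadataSite("site-a"), MetadataSite("site-b")
site_a.write("/petashare/coastal/run7", "size=2TB")
site_b.write("/petashare/biology/seq42", "size=300GB")
site_a.replicate_to(site_b)
site_b.replicate_to(site_a)
print(site_a.store == site_b.store)   # both sites now expose the same name space
```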

