Workload Performance Characterization and Test Strategy of High-Performance Fault-Tolerant Computers Based on BIBbench

Abstract: Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model that provides explicit supervision and recovery of actors with isolated state. We investigate scalable transparent fault tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work stealing protocol using the SPIN model checker. HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault tolerant execution without any refactoring. Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and a High Performance Computing (HPC) platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well, e.g. up to 1,400 cores on the HPC platform; reliability and recovery overheads are consistently low even at scale.
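
As a rough illustration of the point made in the abstract above, that a stateless task can be recovered simply by running it again without changing the result, the following Python sketch simulates a supervisor that re-executes a task when its node fails. It is not HdpH-RS code and does not model the work stealing protocol; `NodeFailure`, `run_on_node`, `supervise`, and `fault_plan` are hypothetical names introduced only for this example.

```python
from itertools import chain, repeat


class NodeFailure(Exception):
    """Signals that the (simulated) node running a task has died."""


def run_on_node(task, args, node_dies):
    # Hypothetical stand-in for remote execution: either the node dies
    # before producing a result, or the stateless task runs to completion.
    if node_dies:
        raise NodeFailure("simulated node failure")
    return task(*args)


def supervise(task, args, fault_plan=(), max_attempts=5):
    """Run a stateless task, re-executing it after each simulated failure.

    fault_plan is an iterable of booleans, one per attempt; True means the
    node dies on that attempt. Attempts beyond the plan never fail.
    Returns (result, number_of_attempts_used).
    """
    plan = chain(fault_plan, repeat(False))
    for attempt in range(1, max_attempts + 1):
        try:
            return run_on_node(task, args, next(plan)), attempt
        except NodeFailure:
            # The task holds no state, so recovery is just another attempt.
            continue
    raise RuntimeError(f"task failed on all {max_attempts} attempts")


def sum_squares(n):
    # A pure function: the result depends only on the argument.
    return sum(i * i for i in range(n))
```

Calling `supervise(sum_squares, (10,), fault_plan=[True, True])` returns the same value as a fault-free run, only with a higher attempt count, which mirrors the claim that recovery does not change the program's result.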

Exploiting P2P and Grid Computing Technologies for Resource Sharing to Support High Performance Distributed System

Handbook of Research on P2P and Grid Systems for Service-Oriented Computing ◽

10.4018/978-1-61520-686-5.ch019 ◽

2010 ◽

pp. 450-475

Author(s):

Liangxiu Han

Keyword(s):

Resource Sharing ◽

High Performance ◽

Large Scale ◽

Fault Tolerant ◽

Resource Discovery ◽

Distributed Computing Systems ◽

Computing Systems ◽

Efficient Resource ◽

Service Oriented ◽

Important Design

This chapter identifies challenges and requirements for resource sharing to support high performance distributed Service-Oriented Computing (SOC) systems. The chapter draws attention to two popular and important design paradigms, Grid and Peer-to-Peer (P2P) computing systems, which are evolving as two practical solutions for supporting wide-area resource sharing over the Internet. As a fundamental task of resource sharing, efficient resource discovery plays an important role in the SOC setting. The chapter presents resource discovery in Grid and P2P environments through an overview of related systems, both historical and emerging. The chapter then discusses the exploitation of both technologies to facilitate resource discovery within large-scale distributed computing systems in a flexible, scalable, fault-tolerant, interoperable and secure fashion.

Design of fault-tolerant large-scale VOD servers: With emphasis on high-performance and low-cost

IEEE Transactions on Parallel and Distributed Systems ◽

10.1109/71.920587 ◽

2001 ◽

Vol 12 (4) ◽

pp. 363-386 ◽

Cited By ~ 19

Author(s):

L. Golubchik ◽

R.R. Muntz ◽

Cheng-Fu Chou ◽

S. Berson

Keyword(s):

High Performance ◽

Large Scale ◽

Fault Tolerant ◽

Low Cost

A Fault Tolerant Decentralized Scheduling in Large Scale Distributed Systems

Handbook of Research on P2P and Grid Systems for Service-Oriented Computing ◽

10.4018/978-1-61520-686-5.ch024 ◽

2010 ◽

pp. 566-588 ◽

Cited By ~ 2

Author(s):

Florin Pop

Keyword(s):

Distributed Systems ◽

High Performance ◽

Large Scale ◽

Fault Tolerant ◽

Optimal Algorithm ◽

Distributed Applications ◽

Distributed Scheduling ◽

Agent Based ◽

Decentralized Scheduling ◽

Optimization Schemes

This chapter presents a fault tolerant framework for application scheduling in large scale distributed systems (LSDS). Due to the specific characteristics and requirements of distributed systems, a good scheduling model should be dynamic. More specifically, it should adapt its scheduling decisions to changes in resource state, which are commonly captured through monitoring. The scheduler and the monitor are two important middleware components that coordinate their actions to ensure the high performance execution of distributed applications. The chapter presents and analyses an agent-based architecture for scheduling in large scale distributed systems, and then presents user and resource management. The optimization schemes for scheduling consider a near-optimal algorithm for distributed scheduling, and the chapter presents a solution for scheduling optimization. The chapter also covers and explains the fault tolerance cases for Grid environments and describes two possible scenarios for the scheduling system.

Transparent Throughput Elasticity for Modern Cloud Storage

Applying Integration Techniques and Methods in Distributed Systems and Technologies - Advances in Computer and Electrical Engineering ◽

10.4018/978-1-5225-8295-3.ch007 ◽

2019 ◽

pp. 156-191

Author(s):

Bogdan Nicolae ◽

Pierre Riteau ◽

Zhuo Zhen ◽

Kate Keahey

Keyword(s):

Cloud Storage ◽

High Performance ◽

Large Scale ◽

Real Life ◽

Past Experience ◽

Performance Characteristics ◽

Data Intensive ◽

Complex Landscape ◽

Application Data ◽

Caching Mechanism

Storage elasticity on the cloud is a crucial feature in the age of data-intensive computing, especially when considering fluctuations of I/O throughput. In this chapter, the authors explore how to transparently boost I/O bandwidth during peak utilization to deliver high performance without over-provisioning storage resources. The proposal relies on short-lived virtual disks with better performance characteristics (at a higher cost) that act during peaks as a caching layer for the persistent virtual disks where the application data is stored during runtime. First, they show how this idea can be achieved efficiently at the block-device level, using a caching mechanism that leverages iterative behavior and learns from past experience. Second, they introduce a corresponding performance and cost prediction methodology. They demonstrate the benefits of their proposal both for micro-benchmarks and for two real-life applications using large-scale experiments. They conclude with a discussion of how these techniques can be generalized for the increasingly complex landscape of modern cloud storage.

Performance modeling and benchmarking of bank intermediary business on high-performance fault-tolerant computers

2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W) ◽

10.1109/dsnw.2011.5958847 ◽

2011 ◽

Author(s):

Bo Li ◽

Haiying Zhou ◽

Decheng Zuo ◽

Zhan Zhang ◽

Peng Zhou ◽

...

Keyword(s):

High Performance ◽

Performance Modeling ◽

Fault Tolerant ◽

Intermediary Business

Exploiting P2P and Grid Computing Technologies for Resource Sharing to Support High Performance Distributed System

Grid and Cloud Computing ◽

10.4018/978-1-4666-0879-5.ch602 ◽

2012 ◽

pp. 1289-1314

Author(s):

Liangxiu Han

Keyword(s):

Resource Sharing ◽

High Performance ◽

Large Scale ◽

Fault Tolerant ◽

Resource Discovery ◽

Distributed Computing Systems ◽

Computing Systems ◽

Efficient Resource ◽

Service Oriented ◽

Important Design

This chapter identifies challenges and requirements for resource sharing to support high performance distributed Service-Oriented Computing (SOC) systems. The chapter draws attention to two popular and important design paradigms, Grid and Peer-to-Peer (P2P) computing systems, which are evolving as two practical solutions for supporting wide-area resource sharing over the Internet. As a fundamental task of resource sharing, efficient resource discovery plays an important role in the SOC setting. The chapter presents resource discovery in Grid and P2P environments through an overview of related systems, both historical and emerging. The chapter then discusses the exploitation of both technologies to facilitate resource discovery within large-scale distributed computing systems in a flexible, scalable, fault-tolerant, interoperable and secure fashion.

Study for Performance Benchmark of Bank Intermediary Business on High-Performance Fault-Tolerant Computers

International Symposium on Parallel and Distributed Processing with Applications ◽

10.1109/ispa.2010.23 ◽

2010 ◽

Cited By ~ 1

Author(s):

Bo Li ◽

Haiying Zhou ◽

Decheng Zuo ◽

Zhan Zhang

Keyword(s):

High Performance ◽

Fault Tolerant ◽

Intermediary Business ◽

Performance Benchmark

The Structure and Properties of MoSi2 Thin Film in MOS Process

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s1431927600001379 ◽

1980 ◽

Vol 38 ◽

pp. 326-327

Author(s):

C.K. Wu ◽

P. Chang ◽

N. Godinho

Keyword(s):

Thin Film ◽

Integrated Circuits ◽

High Performance ◽

Large Scale ◽

Process Development ◽

Structure And Properties ◽

Metal Silicides ◽

High Oxidation ◽

Important Approach ◽

High Oxidation Resistance

Recently, the use of refractory metal silicides as low resistivity, high temperature and high oxidation resistance gate materials in large scale integrated circuits (LSI) has become an important approach in advanced MOS process development (1). This research is a systematic study on the structure and properties of molybdenum silicide thin film and its applicability to high performance LSI fabrication.

RECOMMENDATIONS FOR THE CHOICE OF MILKING INSTALLATIONS IN LOOSE HOUSING SYSTEMS OF COWS

Molochnoe i miasnoe skotovodstvo ◽

10.33943/mms.2020.12.24.001 ◽

2020 ◽

Author(s):

В.В. ГОРДЕЕВ ◽

В.Е. ХАЗАНОВ

Keyword(s):

Dairy Cows ◽

High Performance ◽

Large Scale ◽

Dairy Farms ◽

Economic Indicators ◽

Technical Level ◽

Housing Systems ◽

Working Shift ◽

Technical And Economic Indicators

The choice of the type and size of a milking installation needs to take into account the maximum planned number of dairy cows, the size of a technological group, the number of milkings per day, the duration of one milking, and the length of the operators' working shift. An analysis of the technical and economic indicators of the currently most common types of milking installations of the same technical level showed that the Carousel installation (1) had the best specific indicators, while the Herringbone installation (2) required higher labour inputs and costs. The Parallel installation (3) fell in between. Based on the throughput and the required number of operators, Herringbone is recommended for farms with up to 600 dairy cows, Parallel for farms with no more than 1200 dairy cows, and Carousel for farms with more than 1200 dairy cows. Carousel was found to be the most practical, high-performance, easily automated and, therefore, promising system for milking parlours, especially on large-scale dairy farms.
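
The herd-size recommendation in the abstract above can be written as a small lookup. The Python sketch below is only an illustration of that rule: the function name `recommend_installation` is hypothetical, and treating both the 600 and 1200 thresholds as inclusive upper bounds is an assumption about how the boundary cases should be read.

```python
def recommend_installation(dairy_cows):
    """Map a planned dairy herd size to the installation type named in the abstract.

    Assumption: 600 and 1200 are treated as inclusive upper bounds.
    """
    if dairy_cows < 0:
        raise ValueError("herd size cannot be negative")
    if dairy_cows <= 600:
        return "Herringbone"
    if dairy_cows <= 1200:
        return "Parallel"
    return "Carousel"
```

The rule only reflects throughput and operator count as summarised in the abstract; the other factors it lists, such as group size and shift length, are not modelled here.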
