An Overview of Thermal and Mechanical Design, Control, and Testing of the World's Most Powerful and Fastest Supercomputer

2020 ◽  
Vol 143 (1) ◽  
Author(s):  
Anil Yuksel ◽  
Vic Mahaney ◽  
Chris Marroquin ◽  
Shurong Tian ◽  
Mark Hoffmeyer ◽  
...  

Abstract A new era of computing has begun with the development of high-performance computing (HPC), artificial intelligence (AI), machine learning (ML), and cognitive systems. Dramatic increases in the power density of electronic components have driven the design of efficient thermal management technologies for these systems. In 2018, IBM designed and delivered the world's most powerful and fastest supercomputers, Summit and Sierra, with 200 petaflops of peak computing performance on LINPACK benchmarks. These systems, known as the IBM POWER AC922, are both air and liquid cooled; in the liquid-cooled configuration, water cools the high-power electronic components, including the IBM POWER9 processors and NVIDIA graphics processing units (GPUs). In this paper, we present an overview of the thermal and mechanical design strategies applied to these systems. Testing and experimental analysis are presented and compared with computational modeling, and thermal control strategies are investigated to optimize overall system efficiency. For the air-cooled systems, we discuss the fan and heat sink designs, as well as the preheating effect on the PCIe section. For the liquid-cooled systems, which use a unique cold plate design to cool the processors and GPUs with water, we examine the water flow path design for the central processing units (CPUs) and the GPUs, and the thermal performance of the cold plate. An overview of cooling assemblies such as thermal interface materials (TIMs) and air baffles is also given. Unit and rack manifolds and the rear door heat exchanger (RDHx) are investigated, and water flow and pressure distributions at the node and rack level are provided.
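
As a rough illustration of the liquid-cooling thermal budget described above, the sketch below estimates a processor's junction temperature from the coolant's caloric rise plus a series thermal-resistance stack (die to TIM to cold plate to water). Every number in it (power, flow rate, resistances) is a hypothetical placeholder, not an AC922 measurement.

```cpp
// Minimal sketch: junction temperature of a water-cooled processor from a
// caloric rise plus a series thermal-resistance stack. All values assumed.
#include <cstdio>

int main() {
    double p_chip   = 250.0;   // processor power dissipation [W] (assumed)
    double p_node   = 2500.0;  // heat picked up by the water upstream [W]
    double m_dot    = 0.05;    // water mass flow rate [kg/s]
    double cp_water = 4186.0;  // specific heat of water [J/(kg K)]
    double t_inlet  = 25.0;    // facility water inlet temperature [C]

    // Series thermal resistances from junction to coolant [K/W] (assumed)
    double r_tim  = 0.02;      // thermal interface material
    double r_cold = 0.04;      // cold plate conduction + convection to water

    // Caloric temperature rise of the water before it reaches this chip
    double t_water_local = t_inlet + p_node / (m_dot * cp_water);

    // Junction temperature via the resistance stack
    double t_junction = t_water_local + p_chip * (r_tim + r_cold);

    printf("local water temp: %.1f C, junction temp: %.1f C\n",
           t_water_local, t_junction);
    return 0;
}
```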



2018 ◽  
Vol 11 (11) ◽  
pp. 4621-4635 ◽  
Author(s):  
Istvan Z. Reguly ◽  
Daniel Giles ◽  
Devaraj Gopinathan ◽  
Laure Quivy ◽  
Joakim H. Beck ◽  
...  

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and implementation: a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved by keeping the scientific code separate from the various parallel implementations, which eases maintenance. The code has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 shows an ability to deliver productivity as well as performance and portability to its users across a number of platforms.
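
As a hedged illustration of what a DSL such as OP2 ultimately generates for a platform like CUDA, the kernel below sketches a per-cell finite-volume update. It is not VOLNA-OP2 source code; the data layout (precomputed net flux per cell) and the explicit Euler step are assumptions made for brevity.

```cuda
// Illustrative per-cell update kernel of the kind a DSL such as OP2
// generates for a finite-volume solver. Not VOLNA-OP2 source code.
__global__ void fv_cell_update(int ncell, double dt,
                               const double* __restrict__ area,     // cell areas
                               const double* __restrict__ flux_sum, // net face flux per cell
                               double* __restrict__ h)              // water depth
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < ncell) {
        // Explicit Euler step: dU/dt = -(1/A) * sum of face fluxes
        h[c] -= dt / area[c] * flux_sum[c];
    }
    // Launched as: fv_cell_update<<<(ncell + 255) / 256, 256>>>(...)
}
```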


Author(s):  
Ana Moreton–Fernandez ◽  
Hector Ortega–Arranz ◽  
Arturo Gonzalez–Escribano

Nowadays the use of hardware accelerators, such as graphics processing units or Xeon Phi coprocessors, is key to solving computationally costly problems that require high-performance computing. However, programming efficient deployments for these kinds of devices is a very complex task that relies on manual management of memory transfers and configuration parameters. The programmer has to study in depth the particular data that needs to be computed at each moment, across different computing platforms, while also considering architectural details. We introduce the controller concept as an abstract entity that lets the programmer manage the communications and kernel-launching details on hardware accelerators easily and transparently. The model also makes it possible to define and launch central processing unit kernels on multi-core processors with the same abstraction and methodology used for the accelerators. Internally, it combines different native programming models and technologies to exploit the potential of each kind of device. Additionally, the model helps the programmer select proper values for the several configuration parameters that must be chosen when a kernel is launched; this is done through a qualitative characterization of the kernel code to be executed. Finally, we present the implementation of the controller model in a prototype library, together with its application in several case studies. Its use has reduced development and porting costs, with very low overheads in execution time compared to manually programmed and optimized solutions that directly use CUDA and OpenMP.
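
To make the idea concrete, here is a minimal, hypothetical controller-style wrapper in CUDA C++. The class and method names are invented for illustration and do not come from the authors' library; the point is only to show the shape of the abstraction.

```cuda
// A hypothetical "controller"-style wrapper: one entity hides allocation,
// transfers, and launch geometry from the user. Names are invented.
#include <cuda_runtime.h>
#include <vector>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

class Controller {
public:
    // Allocate a device buffer and copy host data into it.
    float* to_device(const std::vector<float>& host) {
        float* d = nullptr;
        cudaMalloc(&d, host.size() * sizeof(float));
        cudaMemcpy(d, host.data(), host.size() * sizeof(float),
                   cudaMemcpyHostToDevice);
        return d;
    }
    // Launch the kernel; the controller, not the user, picks the geometry.
    void launch_saxpy(int n, float a, const float* x, float* y) {
        int block = 256;
        int grid  = (n + block - 1) / block;
        saxpy<<<grid, block>>>(n, a, x, y);
    }
    // Copy a device buffer back and release it.
    void to_host(std::vector<float>& host, float* d) {
        cudaMemcpy(host.data(), d, host.size() * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d);
    }
};
```

A user would call to_device, launch_saxpy, and to_host in sequence, never touching cudaMalloc, cudaMemcpy, or launch geometry directly; that division of labor is the abstraction the paper argues for.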


2015 ◽  
Vol 8 (9) ◽  
pp. 2815-2827 ◽  
Author(s):  
S. Xu ◽  
X. Huang ◽  
L.-Y. Oey ◽  
F. Xu ◽  
H. Fu ◽  
...  

Abstract. Graphics processing units (GPUs) are an attractive solution in many scientific applications due to their high performance. However, most existing GPU conversions of climate models use GPUs for only a few computationally intensive regions. In the present study, we redesign the mpiPOM (a parallel version of the Princeton Ocean Model) for GPUs. Specifically, we first convert the model from its original Fortran form to new Compute Unified Device Architecture C (CUDA-C) code, then we optimize the code on each GPU, the communications between GPUs, and the I/O between the GPUs and the central processing units (CPUs). We show that the performance of the new model on a workstation containing four GPUs is comparable to that on a powerful cluster with 408 standard CPU cores, while reducing energy consumption by a factor of 6.8.
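
The conversion described here turns Fortran loop nests into CUDA-C kernels. The sketch below shows the general shape of such a port for a 2D stencil update on an ocean-model-like grid; it is illustrative only, not actual mpiPOM code, and the variable names are assumptions.

```cuda
// Illustration of a Fortran loop nest ported to a CUDA-C kernel (not actual
// mpiPOM code): a 2D diffusion update of sea-surface elevation on an
// (ni x nj) grid, one thread per interior grid point.
__global__ void diffuse2d(int ni, int nj, double k,   // k = nu*dt/dx^2 <= 0.25
                          const double* __restrict__ e_in,
                          double* __restrict__ e_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < ni - 1 && j > 0 && j < nj - 1) {
        int id = j * ni + i;
        // 5-point Laplacian stencil
        e_out[id] = e_in[id] + k * (e_in[id - 1] + e_in[id + 1] +
                                    e_in[id - ni] + e_in[id + ni] -
                                    4.0 * e_in[id]);
    }
}
```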


2020 ◽  
Vol 22 (5) ◽  
pp. 1217-1235 ◽  
Author(s):  
M. Morales-Hernández ◽  
M. B. Sharif ◽  
S. Gangrade ◽  
T. T. Dullo ◽  
S.-C. Kao ◽  
...  

Abstract This work presents a vision of future water resources hydrodynamics codes that can fully utilize the strengths of modern high-performance computing (HPC). Advances in computing power, formerly driven by improvements to central processing units, now focus on parallel computing and, in particular, the use of graphics processing units (GPUs). However, this shift to a parallel framework requires refactoring the code to make efficient use of the data, and even changing the nature of the algorithm that solves the system of equations. These concepts, along with other features such as computational precision, dry-region management, and input/output data handling, are analyzed in this paper. A 2D multi-GPU flood code applied to a large-scale test case is used to corroborate our statements and to identify the challenges facing the next generation of parallel water resources codes.
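
One of the features mentioned above, dry-region management, maps naturally onto a GPU kernel in which threads assigned to dry cells exit early rather than evaluating the full update. The following is a minimal sketch under assumed data structures, not the paper's code.

```cuda
// Sketch of dry-region handling in a flood-model kernel (assumptions, not
// the paper's code): threads on dry cells with no inflow return early.
__global__ void flood_step(int ncell, double h_dry, double dt,
                           const double* __restrict__ rhs,  // net depth tendency
                           double* __restrict__ h)          // water depth
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= ncell) return;
    if (h[c] < h_dry && rhs[c] <= 0.0) return;   // dry cell, nothing to do
    h[c] = fmax(0.0, h[c] + dt * rhs[c]);        // clamp to non-negative depth
}
```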


2019 ◽  
Vol 8 (9) ◽  
pp. 386 ◽  
Author(s):  
Natalija Stojanovic ◽  
Dragan Stojanovic

Watershed analysis, as a fundamental component of digital terrain analysis, is based on the Digital Elevation Model (DEM), a grid (raster) model of the Earth's surface and topography. Watershed analysis consists of computationally and data-intensive algorithms that need to be implemented using parallel and high-performance computing methods and techniques. In this paper, the Multiple Flow Direction (MFD) algorithm for watershed analysis is implemented and evaluated on multi-core Central Processing Units (CPUs) and many-core Graphics Processing Units (GPUs), yielding significant improvements in performance and energy usage. The implementation is based on NVIDIA CUDA (Compute Unified Device Architecture) for the GPU, as well as on OpenACC (Open ACCelerators), a directive-based parallel programming model and standard for parallel computing. Both phases of the workflow, (i) iterative DEM preprocessing and (ii) the iterative MFD algorithm itself, are parallelized and run on multi-core CPU and GPU. The proposed solutions are evaluated with respect to execution time, energy consumption, and the programming effort required for parallelization, for different sizes of input data. The experimental evaluation has shown not only the advantage of using OpenACC over CUDA programming in implementing watershed analysis on a GPU in terms of performance, energy consumption, and programming effort, but also significant benefits in implementing it on the multi-core CPU.
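
For concreteness, the kernel below sketches the core MFD step, partitioning each cell's flow among downslope neighbors with slope-proportional weights, as one might write it in CUDA (the OpenACC variant would express the same loops with directives). The layout, the exponent value, and the omission of diagonal-distance weighting are assumptions, not the paper's implementation.

```cuda
// Illustrative MFD flow-partitioning kernel (assumed, not the paper's code):
// each cell splits its flow among lower neighbors in proportion to drop^p.
// Distance weighting for diagonal neighbors is omitted for brevity.
// atomicAdd on double requires compute capability 6.0 or higher.
__global__ void mfd_distribute(int ni, int nj, double p,   // p ~ 1.1 is common
                               const double* __restrict__ dem,
                               const double* __restrict__ flow_in,
                               double* flow_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || i >= ni - 1 || j <= 0 || j >= nj - 1) return;
    int id = j * ni + i;

    const int off[8] = { -1, 1, -ni, ni, -ni - 1, -ni + 1, ni - 1, ni + 1 };
    double w[8], wsum = 0.0;
    for (int n = 0; n < 8; ++n) {            // slope-weighted fractions
        double drop = dem[id] - dem[id + off[n]];
        w[n] = drop > 0.0 ? pow(drop, p) : 0.0;
        wsum += w[n];
    }
    if (wsum == 0.0) return;                 // pit or flat cell: no outflow
    for (int n = 0; n < 8; ++n)
        if (w[n] > 0.0)                      // scatter requires atomics
            atomicAdd(&flow_out[id + off[n]], flow_in[id] * w[n] / wsum);
}
```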


2018 ◽  
Vol 28 (04) ◽  
pp. 1850016 ◽  
Author(s):  
Christian Schmitt ◽  
Moritz Schmid ◽  
Sebastian Kuckuk ◽  
Harald Köstler ◽  
Jürgen Teich ◽  
...  

Field-programmable gate arrays (FPGAs) are a rapidly growing accelerator technology, and not only in the field of high-performance computing (HPC). However, they use a completely different programming paradigm and tool set than central processing units (CPUs) or even graphics processing units (GPUs), adding extra development steps and requiring special knowledge, which hinders widespread use in scientific computing. To bridge this programmability gap, domain-specific languages (DSLs) are a popular choice for generating low-level implementations from an abstract algorithm description. In this work, we demonstrate our approach for generating numerical solver implementations based on the multigrid method for FPGAs from the same code base that is also used to generate code for CPUs using a hybrid MPI/OpenMP parallelization. Our approach yields a hardware design that can compute up to 11 V-cycles per second with an input grid size of 4096×4096, solving on the coarsest grid with the conjugate gradient (CG) method, on a mid-range FPGA, beating vectorized, multi-threaded execution on an Intel Xeon processor.
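
To clarify the solver structure being generated, here is a self-contained toy V-cycle for a 1D Poisson problem in plain C++. It mirrors the smooth/restrict/recurse/prolongate/correct pattern of the multigrid method the paper targets; heavy smoothing on the coarsest grid stands in for the CG solve the paper uses, so this is a structural sketch, not the DSL's generated code.

```cpp
// Toy 1D Poisson V-cycle: -u'' = f on [0,1], u(0) = u(1) = 0.
// Structure-only sketch of the multigrid method; not generated DSL code.
#include <cstdio>
#include <vector>
using Vec = std::vector<double>;

// Damped in-place sweeps approximating Jacobi smoothing.
void smooth(Vec& u, const Vec& f, double h, int iters) {
    for (int it = 0; it < iters; ++it)
        for (size_t i = 1; i + 1 < u.size(); ++i)
            u[i] += 0.6 * (0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]) - u[i]);
}

void v_cycle(Vec& u, const Vec& f, double h, int level) {
    if (level == 0) {                        // coarsest grid: heavy smoothing
        smooth(u, f, h, 50);                 // stands in for the CG solve
        return;
    }
    smooth(u, f, h, 3);                      // pre-smoothing
    size_t nc = (u.size() - 1) / 2 + 1;
    Vec r(u.size(), 0.0), rc(nc, 0.0), ec(nc, 0.0);
    for (size_t i = 1; i + 1 < u.size(); ++i)   // residual r = f - A u
        r[i] = f[i] + (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h);
    for (size_t i = 1; i + 1 < nc; ++i)         // full-weighting restriction
        rc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1]);
    v_cycle(ec, rc, 2.0 * h, level - 1);        // coarse-grid recursion
    for (size_t i = 1; i + 1 < nc; ++i)         // prolongate and correct
        u[2 * i] += ec[i];
    for (size_t i = 0; i + 1 < nc; ++i)
        u[2 * i + 1] += 0.5 * (ec[i] + ec[i + 1]);
    smooth(u, f, h, 3);                         // post-smoothing
}

int main() {
    const int n = 129;                  // 2^7 + 1 grid points
    const double h = 1.0 / (n - 1);
    Vec u(n, 0.0), f(n, 1.0);
    for (int c = 0; c < 10; ++c)
        v_cycle(u, f, h, 5);            // 5 coarsening levels: 129 -> 5 points
    printf("u(0.5) = %.6f (exact: 0.125)\n", u[n / 2]);
    return 0;
}
```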


2020 ◽  
Vol 12 (12) ◽  
pp. 1918 ◽  
Author(s):  
Alina L. Machidon ◽  
Octavian M. Machidon ◽  
Cătălin B. Ciobanu ◽  
Petre L. Ogrutan

Remote sensing data volumes have grown explosively in the past decade. This has led to the need for efficient dimensionality reduction techniques: mathematical procedures that transform high-dimensional data into a meaningful, reduced representation. Projection Pursuit (PP) based algorithms have been shown to be efficient solutions for performing dimensionality reduction on large datasets by searching for low-dimensional projections of the data in which meaningful structures are exposed. However, PP faces computational difficulties with very large datasets, which are common in hyperspectral imaging, raising the challenge of implementing such algorithms using the latest high-performance computing approaches. In this paper, a PP-based geometrical approximated Principal Component Analysis algorithm (gaPCA) for hyperspectral image analysis is implemented and assessed on multi-core Central Processing Units (CPUs), Graphics Processing Units (GPUs), and multi-core CPUs using Single Instruction, Multiple Data (SIMD) AVX2 (Advanced Vector eXtensions) intrinsics, which provide significant improvements in performance and energy usage over the single-core implementation. The paper thus presents a cross-platform and cross-language perspective, with several implementations of the gaPCA algorithm in Matlab, Python, and C++, and GPU implementations based on NVIDIA Compute Unified Device Architecture (CUDA). The proposed solutions are evaluated with respect to execution time and energy consumption. The experimental evaluation has shown not only the advantage of using CUDA programming to implement the gaPCA algorithm on a GPU in terms of performance and energy consumption, but also significant benefits in implementing it on the multi-core CPU using AVX2 intrinsics.
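
One plausible building block of a GPU port of a geometric PCA approximation is a furthest-point search over pixel vectors, since such methods rest on finding mutually distant points. The CUDA kernel below is an assumed illustration of that step, not the authors' code; the row-major layout and the host-side argmax are assumptions.

```cuda
// Assumed building block of a gaPCA-style GPU port (not the authors' code):
// each thread computes the squared Euclidean distance from one pixel vector
// to a reference pixel; the host then takes the argmax of `out` to drive the
// search for the most distant pair of points.
__global__ void dist2_to_ref(int n, int d,
                             const float* __restrict__ data, // n x d, row-major
                             const float* __restrict__ ref,  // length d
                             float* __restrict__ out)        // length n
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < d; ++k) {   // loop over hyperspectral bands
        float diff = data[i * d + k] - ref[k];
        acc += diff * diff;
    }
    out[i] = acc;
}
```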


Data mining is a widely used process in many leading technologies of this information era, and Eclat growth is one of the best-performing data mining algorithms. This work is intended to create a streamlined interface for the Eclat growth algorithm to run in multi-core processor-based cloud computing environments. Recent improvements in processor manufacturing technology make it possible to create multi-core high-performance Central Processing Units (CPUs) and Graphics Processing Units (GPUs), and many cloud services already provide access to virtual machines with these high-power processors. The process of blending these technologies with Eclat growth is proposed here under the name "Multi-core Processing Cloud Eclat Growth" (MPCEG), to achieve higher processing speeds without compromising standard data mining metrics such as Accuracy, Precision, Recall, and F1-Score. New procedures for cloud parallel processing, GPU utilization, elimination of floating-point arithmetic errors by fixed-point replacement on GPUs, and hierarchical offloading aggregation are introduced in the construction of the proposed MPCEG.
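
A GPU-friendly core of Eclat-style mining is tidset intersection: with each item's transaction set stored as a bitmap, the support of a candidate itemset is the popcount of the AND of two bitmaps. The kernel below is a hedged sketch of that operation under an assumed bitmap layout, not MPCEG's actual code.

```cuda
// Hedged sketch of Eclat-style support counting on the GPU (assumed, not
// MPCEG's code): each thread ANDs one 32-bit word of two tidset bitmaps and
// accumulates the bit count into a device counter initialized to zero.
__global__ void tidset_support(int nwords,
                               const unsigned int* __restrict__ tidset_a,
                               const unsigned int* __restrict__ tidset_b,
                               int* support)
{
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w < nwords) {
        int bits = __popc(tidset_a[w] & tidset_b[w]); // transactions in both
        if (bits) atomicAdd(support, bits);
    }
}
```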


2009 ◽  
Vol 1196 ◽  
Author(s):  
Dae-Hyeong Kim ◽  
Yun-Soung Kim ◽  
Zhuangjian Liu ◽  
Jizhou Song ◽  
Hoon-Sik Kim ◽  
...  

Abstract Electronic systems that offer elastic mechanical responses to high strain deformations are of growing interest, due to their ability to enable new electrical, optical and biomedical devices and other applications whose requirements are impossible to satisfy with conventional wafer-based technologies or even with those that offer simple bendability. This talk describes materials and mechanical design strategies for classes of electronic circuits that offer extremely high flexibility and stretchability over large area, enabling them to accommodate even demanding deformation modes, such as twisting and linear stretching to ‘rubber-band’ levels of strain over 100%. The use of printed single crystalline silicon nanomaterials for the semiconductor provides performance in flexible and stretchable complementary metal-oxide-semiconductor (CMOS) integrated circuits approaching that of conventional devices with comparable feature sizes formed on silicon wafers. Comprehensive theoretical studies of the mechanics reveal the way in which the structural designs enable these extreme mechanical properties without fracturing the intrinsically brittle active materials or even inducing significant changes in their electrical properties. The results, as demonstrated through electrical measurements of arrays of transistors, CMOS inverters, ring oscillators and differential amplifiers, suggest a valuable route to high performance stretchable electronics that can be integrated with nearly arbitrary substrates. We show examples ranging from plastic and rubber, to vinyl, leather and paper, with capability for large area coverage.

