Hierarchical algorithms on hierarchical architectures

Author(s):  
D. E. Keyes ◽  
H. Ltaief ◽  
G. Turkiyyah

A traditional goal of algorithmic optimality, squeezing out flops, has been superseded by evolution in architecture. Flops no longer serve as a reasonable proxy for all aspects of complexity. Instead, algorithms must now squeeze memory, data transfers, and synchronizations, while extra flops on locally cached data represent only small costs in time and energy. Hierarchically low-rank matrices realize a rarely achieved combination of optimal storage complexity and high computational intensity for a wide class of formally dense linear operators that arise in applications for which exascale computers are being constructed. They may be regarded as algebraic generalizations of the fast multipole method. Methods based on these hierarchical data structures and their simpler cousins, tile low-rank matrices, are well proportioned for early exascale computer architectures, which are provisioned for high processing power relative to memory capacity and memory bandwidth. They are ushering in a renaissance of computational linear algebra. A challenge is that emerging hardware architectures possess hierarchies of their own that do not generally align with those of the algorithm. We describe modules of a software toolkit, hierarchical computations on manycore architectures, that illustrate these features and are intended as building blocks of applications, such as matrix-free higher-order methods in optimization and large-scale spatial statistics. Some modules of this open-source project have been adopted in the software libraries of major vendors. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
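
As a rough illustration of the tile low-rank idea mentioned above, the NumPy sketch below compresses the off-diagonal tiles of a dense kernel matrix with truncated SVDs and reports the storage saved. It is a hypothetical, CPU-only sketch of the data layout, not the API or the algorithms of the toolkit described in the article.

```python
# Hypothetical NumPy sketch of tile low-rank (TLR) compression: off-diagonal
# tiles of a dense kernel matrix are replaced by truncated SVD factors (U, V)
# whenever they are numerically low-rank. Not the toolkit's API.
import numpy as np

def compress_tile(tile, tol):
    """Return factors (U, V) with tile ~= U @ V, keeping singular values > tol * s_max."""
    U, s, Vt = np.linalg.svd(tile, full_matrices=False)
    k = max(1, int(np.sum(s > tol * s[0])))
    return U[:, :k] * s[:k], Vt[:k, :]

def tlr_compress(A, nb, tol=1e-6):
    """Split A into an nb x nb grid of tiles; keep diagonal tiles dense,
    store off-diagonal tiles in low-rank form."""
    b = A.shape[0] // nb
    tiles = {}
    for i in range(nb):
        for j in range(nb):
            T = A[i*b:(i+1)*b, j*b:(j+1)*b]
            tiles[(i, j)] = T.copy() if i == j else compress_tile(T, tol)
    return tiles

# Example: a smooth kernel matrix, 1/(1 + |x - y|), is highly compressible
# away from the diagonal.
x = np.linspace(0.0, 1.0, 512)
A = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))
tiles = tlr_compress(A, nb=8)
stored = sum(t.size if isinstance(t, np.ndarray) else t[0].size + t[1].size
             for t in tiles.values())
print(f"stored entries: {stored} vs dense: {A.size}")
```

A fully hierarchical (H-matrix) format refines this flat tiling level by level; either layout trades a few extra flops on compressed blocks for large savings in memory and data movement, which is the abstract's point.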


Author(s):  
Valentin Cristea ◽  
Ciprian Dobre ◽  
Corina Stratan ◽  
Florin Pop

The latest advances in network and distributed-system technologies now allow integration of a vast variety of services with almost unlimited processing power, using large amounts of data. Sharing of resources is often viewed as the key goal for distributed systems, and in this context the sharing of stored data appears as the most important aspect of distributed resource sharing. Scientific applications are the first to take advantage of such environments, as the requirements of current and future high-performance computing experiments are pressing in terms of ever higher volumes of data to be stored and managed. While these new environments reveal huge opportunities for large-scale distributed data storage and management, they also raise important technical challenges that need to be addressed. The ability to support persistent storage of data on behalf of users, the consistent distribution of up-to-date data, the reliable replication of fast-changing datasets or the efficient management of large data transfers are just some of these new challenges. In this chapter we discuss whether the existing distributed computing infrastructure is adequate for supporting the required data storage and management functionalities. We highlight the issues raised by storing data over large distributed environments and discuss recent research efforts dealing with the challenges of data retrieval, replication and fast data transfers. The interaction of data management with other data-sensitive emerging technologies, such as workflow management, is also addressed.
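
As a toy illustration of one data-management decision discussed in the chapter, the sketch below selects a replica that is sufficiently up to date while minimizing the estimated transfer time; the fields, freshness rule and scoring are invented for illustration and are not taken from the chapter.

```python
# Toy sketch of replica selection: pick a replica that is fresh enough while
# minimizing the estimated transfer time. All fields and rules are invented.
from dataclasses import dataclass

@dataclass
class Replica:
    site: str
    version: int           # dataset version held at this site
    bandwidth_mbps: float  # measured bandwidth from the requesting site
    size_mb: float         # size of the requested dataset

def select_replica(replicas, latest_version, max_staleness=0):
    """Return the sufficiently up-to-date replica with the shortest estimated transfer."""
    fresh = [r for r in replicas if latest_version - r.version <= max_staleness]
    if not fresh:
        raise LookupError("no sufficiently up-to-date replica available")
    return min(fresh, key=lambda r: r.size_mb * 8.0 / r.bandwidth_mbps)

replicas = [Replica("siteA", 12, 400.0, 2048.0),
            Replica("siteB", 11, 950.0, 2048.0),
            Replica("siteC", 12, 120.0, 2048.0)]
print(select_replica(replicas, latest_version=12).site)   # siteA
```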


2002 ◽  
Vol 1 (4) ◽  
pp. 403-420 ◽  
Author(s):  
D. Stanescu ◽  
J. Xu ◽  
M.Y. Hussaini ◽  
F. Farassat

The purpose of this paper is to demonstrate the feasibility of computing the fan inlet noise field around a real twin-engine aircraft, which includes the radiation of the main spinning modes from the engine as well as the reflection and scattering by the fuselage and the wing. This first-cut large-scale computation is based on time domain and frequency domain approaches that employ spectral element methods for spatial discretization. The numerical algorithms are designed to exploit high-performance computers such as the IBM SP4. Although the simulations could not match the exact conditions of the only available experimental data set, they are able to predict the trends of the measured noise field fairly well.
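
For readers unfamiliar with the spatial discretization mentioned above, the sketch below builds the classical one-dimensional ingredient of a spectral element method, the Gauss–Lobatto–Legendre (GLL) nodes and the associated differentiation matrix, using NumPy. This is a generic textbook construction, not the authors' aeroacoustics solver.

```python
# Generic 1D building block of a spectral element discretization:
# Gauss-Lobatto-Legendre (GLL) nodes and the nodal differentiation matrix.
# Illustrative only; unrelated to the solver used in the paper.
import numpy as np
from numpy.polynomial import legendre as leg

def gll_nodes(N):
    """GLL nodes on [-1, 1] for polynomial order N: endpoints plus roots of P_N'."""
    c = np.zeros(N + 1)
    c[N] = 1.0                                 # coefficients of P_N
    interior = leg.legroots(leg.legder(c))
    return np.concatenate(([-1.0], np.sort(interior), [1.0]))

def gll_diff_matrix(x):
    """Differentiation matrix D with (D @ f)(x_i) = f'(x_i) for polynomials up to N."""
    N = len(x) - 1
    c = np.zeros(N + 1)
    c[N] = 1.0
    LN = leg.legval(x, c)                      # P_N evaluated at the nodes
    D = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        for j in range(N + 1):
            if i != j:
                D[i, j] = LN[i] / (LN[j] * (x[i] - x[j]))
    D[0, 0] = -N * (N + 1) / 4.0
    D[N, N] = N * (N + 1) / 4.0
    return D

x = gll_nodes(8)
D = gll_diff_matrix(x)
print(np.max(np.abs(D @ x**2 - 2 * x)))       # ~1e-14: exact differentiation of x^2
```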


2011 ◽  
Vol 3 (4) ◽  
pp. 53-70
Author(s):  
Makoto Yoshida ◽  
Kazumine Kojima

Large numbers of loosely coupled PCs can be organized into clusters and form desktop computing grids by sharing their processing power; the power of the individual PCs, the distribution of transactions, network scale, network delays and the code migration algorithms all characterize the performance of such grids. This article describes design methodologies for workload management in distributed desktop computing grids. Based on code migration experiments, a transfer policy for computation was determined and several simulations of location policies were examined; the design methodologies for distributed desktop computing grids are derived from the simulation results. A language for distributed desktop computing is designed to realize these methodologies.
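
A minimal sketch of the two scheduling decisions the article studies, a threshold-based transfer policy and a probe-based location policy, is given below; the threshold, probe count and node model are assumptions for illustration, not the policies derived in the article.

```python
# Hypothetical sketch of the two scheduling decisions: a threshold-based
# transfer policy (should this PC offload a newly arrived task?) and a
# probe-based location policy (which PC should receive it?).
import random

QUEUE_THRESHOLD = 5          # backlog above which a node tries to migrate work

def transfer_policy(queue_len, threshold=QUEUE_THRESHOLD):
    """Local decision: migrate the incoming task if the backlog is too long."""
    return queue_len > threshold

def location_policy(nodes, probes=3):
    """Probe a few random nodes and choose the least loaded one."""
    candidates = random.sample(nodes, min(probes, len(nodes)))
    return min(candidates, key=lambda n: n["queue"])

# Toy usage: 8 PCs with random backlogs; PC 0 receives a task and decides.
nodes = [{"id": i, "queue": random.randint(0, 10)} for i in range(8)]
if transfer_policy(nodes[0]["queue"]):
    target = location_policy(nodes[1:])
    target["queue"] += 1     # task migrates to the chosen PC
else:
    nodes[0]["queue"] += 1   # task is executed locally
print(nodes)
```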


Author(s):  
Anne Benoit ◽  
Laurent Lefèvre ◽  
Anne-Cécile Orgerie ◽  
Issam Raïs

Large-scale distributed systems (high-performance computing centers, networks, data centers) are expected to consume huge amounts of energy. In order to address this issue, shutdown policies constitute an appealing approach, able to dynamically adapt the resource set to the actual workload. However, multiple constraints have to be taken into account for such policies to be applied on real infrastructures: the time and energy cost of switching nodes on and off, the power and energy consumption bounds imposed by the electricity grid or the cooling system, and the availability of renewable energy. In this article, we propose models translating these various constraints into different shutdown policies that can be combined to satisfy multiple constraints at once. Our models and their combinations are validated through simulations on a real workload trace.
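
As a minimal illustration of the first constraint listed above, the sketch below computes the break-even idle time beyond which powering a node off saves energy; the formula is the standard amortization argument and the numbers are invented, not the article's models.

```python
# Break-even reasoning behind a shutdown policy: powering a node off only saves
# energy if the predicted idle period exceeds the time needed to amortize the
# off/on overhead. All figures below are invented for illustration.

def break_even_time(e_switch, p_idle, p_off):
    """Idle duration (s) above which shutting down saves energy.

    e_switch : extra energy of one off+on cycle, including reboot (J)
    p_idle   : power drawn by an idle but powered-on node (W)
    p_off    : residual power drawn when the node is off (W)
    (transition durations themselves are neglected in this sketch)
    """
    return e_switch / (p_idle - p_off)

def should_shutdown(predicted_idle_s, e_switch=2000.0, p_idle=100.0, p_off=5.0):
    """Shutdown decision for one node, given a predicted idle period."""
    return predicted_idle_s > break_even_time(e_switch, p_idle, p_off)

# A 2 kJ off+on cycle on a node idling at 100 W (5 W when off) pays off only
# for idle periods longer than ~21 s.
print(break_even_time(2000.0, 100.0, 5.0))   # ~21.05
print(should_shutdown(60.0))                 # True
```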


2021 ◽  
Vol 13 (2) ◽  
pp. 36
Author(s):  
Chao Zhou ◽  
Tao Zhang

In real applications, massive data with graph structures are often incomplete due to various restrictions. Graph data imputation algorithms have therefore been widely used in fields such as social networks, sensor networks and MRI to solve the graph data completion problem. To preserve the relationships in the data, the data are represented as a graph-tensor, in which each matrix is the value attached to a vertex of a weighted graph. The convolutional imputation algorithm has been proposed to solve the low-rank graph-tensor completion problem in which some data matrices are entirely unobserved. However, this imputation algorithm has limited application scope because it is compute-intensive and performs poorly on CPUs. In this paper, we propose a scheme to run the convolutional imputation algorithm with higher time performance on GPUs (Graphics Processing Units) by exploiting the many-core parallelism of the CUDA architecture. We propose optimization strategies that achieve coalesced memory access for the graph Fourier transform (GFT) computation and improve the utilization of GPU SM resources for the singular value decomposition (SVD) computation. Furthermore, we design a scheme to extend the GPU-optimized implementation to multiple GPUs for large-scale computing. Experimental results show that the GPU implementation is both fast and accurate. On synthetic data of varying sizes, the GPU-optimized implementation running on a single Quadro RTX6000 GPU achieves up to 60.50× speedups over the GPU-baseline implementation. The multi-GPU implementation achieves up to 1.81× speedups on two GPUs versus the GPU-optimized implementation on a single GPU. On the ego-Facebook dataset, the GPU-optimized implementation achieves up to 77.88× speedups over the GPU-baseline implementation. Meanwhile, the GPU implementation and the CPU implementation achieve similarly low recovery errors.
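
The sketch below is a plain NumPy (CPU) rendering of the two kernels the GPU implementation targets, the graph Fourier transform of a graph-tensor and an independent SVD-based shrinkage per graph frequency; it is a simplified stand-in under assumed shapes and thresholds, not the paper's CUDA code or the full imputation iteration.

```python
# Plain NumPy (CPU) sketch of the two kernels the GPU implementation accelerates:
# a graph Fourier transform (GFT) along the graph mode of a graph-tensor, then an
# independent SVD-based singular-value shrinkage per graph frequency.
import numpy as np

def gft(tensor, laplacian):
    """GFT: project the graph mode onto the Laplacian eigenvectors.

    tensor    : (n_vertices, m, p) array, one m x p matrix per vertex
    laplacian : symmetric n x n graph Laplacian
    """
    _, U = np.linalg.eigh(laplacian)                 # columns = GFT basis
    return np.einsum('vk,vmp->kmp', U, tensor), U

def shrink_per_frequency(spec, tau):
    """Soft-threshold the singular values of each frequency slice."""
    out = np.empty_like(spec)
    for k in range(spec.shape[0]):
        Uk, s, Vt = np.linalg.svd(spec[k], full_matrices=False)
        out[k] = (Uk * np.maximum(s - tau, 0.0)) @ Vt
    return out

# Toy example: a 6-vertex cycle graph carrying a 4 x 5 matrix at each vertex.
n, m, p = 6, 4, 5
A = np.eye(n, k=1) + np.eye(n, k=-1)
A[0, -1] = A[-1, 0] = 1.0
lap = np.diag(A.sum(axis=1)) - A
X = np.random.randn(n, m, p)
spec, U = gft(X, lap)
X_low = np.einsum('vk,kmp->vmp', U, shrink_per_frequency(spec, tau=0.5))
```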


Author(s):  
Tony Hey ◽  
Keith Butler ◽  
Sam Jackson ◽  
Jeyarajan Thiyagalingam

This paper reviews some of the challenges posed by the huge growth of experimental data generated by the new generation of large-scale experiments at UK national facilities at the Rutherford Appleton Laboratory (RAL) site at Harwell near Oxford. Such ‘Big Scientific Data’ comes from the Diamond Light Source and Electron Microscopy Facilities, the ISIS Neutron and Muon Facility and the UK's Central Laser Facility. Increasingly, scientists are now required to use advanced machine learning and other AI technologies both to automate parts of the data pipeline and to help find new scientific discoveries in the analysis of their data. For commercially important applications, such as object recognition, natural language processing and automatic translation, deep learning has made dramatic breakthroughs. Google's DeepMind has now used deep learning technology to develop its AlphaFold tool to make predictions for protein folding. Remarkably, it has been able to achieve some spectacular results for this specific scientific problem. Can deep learning be similarly transformative for other scientific problems? After a brief review of some initial applications of machine learning at the RAL, we focus on challenges and opportunities for AI in advancing materials science. Finally, we discuss the importance of developing some realistic machine learning benchmarks using Big Scientific Data coming from several different scientific domains. We conclude with some initial examples of our ‘scientific machine learning’ benchmark suite and of the research challenges these benchmarks will enable. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.


Author(s):  
Anish Varghese ◽  
Bob Edwards ◽  
Gaurav Mitra ◽  
Alistair P Rendell

Energy efficiency is the primary impediment on the path to exascale computing. Consequently, the high-performance computing community is increasingly interested in low-power high-performance embedded systems as building blocks for large-scale high-performance systems. The Adapteva Epiphany architecture integrates low-power RISC cores on a 2D mesh network and promises up to 70 GFLOPS/Watt of theoretical performance. However, with just 32 KB of memory per eCore for storing both data and code, programming the Epiphany system presents significant challenges. In this paper we evaluate the performance of a 64-core Epiphany system with a variety of basic compute and communication micro-benchmarks. Further, we implemented two well-known application kernels: a 5-point star-shaped heat stencil with a peak performance of 65.2 GFLOPS and a matrix multiplication reaching 65.3 GFLOPS, both in single precision across the 64 Epiphany cores. We discuss strategies for implementing high-performance computing application kernels on such memory-constrained low-power devices and compare the Epiphany with competing low-power systems. With future Epiphany revisions expected to house thousands of cores on a single chip, understanding the merits of such an architecture is of prime importance to the exascale initiative.
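
For reference, a NumPy sketch of the 5-point star-shaped heat stencil kernel is shown below; the actual Epiphany implementation must partition the grid across the 64 eCores and their 32 KB local memories, details that this single-array sketch deliberately omits.

```python
# Single-array NumPy reference for the 5-point star-shaped heat stencil kernel
# on a float32 grid. Illustrative only; not the Epiphany implementation.
import numpy as np

def heat_step(u, alpha=np.float32(0.1)):
    """One explicit Jacobi update of the 5-point stencil on the interior points."""
    v = u.copy()
    v[1:-1, 1:-1] = u[1:-1, 1:-1] + alpha * (
        u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
        - 4.0 * u[1:-1, 1:-1]
    )
    return v

u = np.zeros((128, 128), dtype=np.float32)
u[0, :] = 1.0                      # heated boundary row
for _ in range(100):
    u = heat_step(u)
print(u[1:4, 64])                  # heat diffusing inward from the hot edge
```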


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Shichao Zhang ◽  
Hui Liu ◽  
Jianyong Yu ◽  
Bingyun Li ◽  
Bin Ding

Two-dimensional network-structured carbon nanoscale building blocks, going beyond graphene, are of fundamental importance, and creating such structures and developing their applications have broad implications in environment, electronics and energy. Here, we report a facile route, based on electro-spraying/netting, to self-assemble two-dimensional carbon nanostructured networks on a large scale. Manipulation of the dynamic ejection, deformation and assembly of charged droplets, by control of Taylor cone instability and the micro-electric field, enables the creation of networks that combine the nanoscale diameters of one-dimensional carbon nanotubes with the lateral infinity of two-dimensional graphene. The macro-sized (meter-level) carbon nanostructured networks show extraordinary nanostructural properties, remarkable flexibility (soft polymeric mechanics despite a hard inorganic matrix), nanoscale-level conductivity, and outstanding performance in areas as different as filters, separators, absorbents, and wearable electrodes, supercapacitors and cells. This work should make possible the innovative design of high-performance, multi-functional carbon nanomaterials for various applications.


Sensors ◽  
2021 ◽  
Vol 21 (3) ◽  
pp. 851
Author(s):  
Adrian Bekasiewicz ◽  
Slawomir Koziel

The design of Butler matrices dedicated to Internet of Things and fifth-generation (5G) mobile systems, where small size and high performance are of primary concern, is a challenging task that often exceeds the capabilities of conventional techniques. The lack of appropriate, unified design approaches is a serious bottleneck for the development of Butler structures for contemporary applications. In this work, a low-cost bottom-up procedure for the rigorous and unattended design of miniaturized 4 × 4 Butler matrices is proposed. The presented approach exploits numerical algorithms (governed by a set of suitable objective functions) to control the synthesis, implementation, optimization, and fine-tuning of the structure and its individual building blocks. The framework is demonstrated using two miniaturized matrices with nonstandard output-port phase differences. Numerical results indicate that the computational cost of the design process using the presented framework is over 80% lower compared to the conventional approach. The footprints of the optimized matrices are only 696 and 767 mm², respectively. Their small size and operation frequency of around 2.6 GHz make the circuits potentially useful for mobile devices working within the sub-6 GHz 5G spectrum. Both structures have been benchmarked against state-of-the-art designs from the literature in terms of performance and size. Measurements of a fabricated Butler matrix prototype are also provided.
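
As a loose illustration of the kind of phase-difference objective such a framework might evaluate, the sketch below scores the output-phase progressions of an idealized DFT-like 4 × 4 beamforming matrix against target values; the matrix, the targets and the error metric are stand-ins and do not reproduce the article's objective functions or its nonstandard phase differences.

```python
# Stand-in illustration of a phase-difference objective for a 4x4 beamforming
# network. The ideal matrix below is a scaled 4-point DFT (an idealization of a
# Butler-like response, up to fixed per-port phases); the target progressions are
# those of this stand-in, not the nonstandard values of the article's designs.
import numpy as np

def ideal_dft_4x4():
    k, n = np.meshgrid(np.arange(4), np.arange(4), indexing='ij')
    return np.exp(-2j * np.pi * k * n / 4.0) / 2.0

def phase_progression(column):
    """Output-to-output phase differences (deg) when one input port is excited."""
    return np.degrees(np.diff(np.unwrap(np.angle(column))))

def phase_error(matrix, targets_deg):
    """Sum of squared deviations (wrapped to +/-180 deg) from the target progressions."""
    err = 0.0
    for j, target in enumerate(targets_deg):
        d = phase_progression(matrix[:, j]) - target
        d = (d + 180.0) % 360.0 - 180.0
        err += float(np.sum(d ** 2))
    return err

B = ideal_dft_4x4()
for j in range(4):
    print("input port", j, "progression (deg):", np.round(phase_progression(B[:, j])))
print("error vs. its own targets:", phase_error(B, [0.0, -90.0, 180.0, 90.0]))  # ~0
```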

