MAGMA templates for scalable linear algebra on emerging architectures

Author(s): Mohammed Al Farhan, Ahmad Abdelfattah, Stanimire Tomov, Mark Gates, Dalal Sukkari, et al.

As accelerator- and wide-vector-based computing resources become widely deployed, there is strong demand for science and engineering applications to take advantage of them. Doing so, however, has been extremely challenging due to the diversity of systems and their extreme concurrency, complex memory hierarchies, costly data movement, and heterogeneous node architectures. To address these challenges, we design a programming model and describe its ease of use in developing a new MAGMA Templates library that delivers high-performance, scalable linear algebra portable across current and emerging architectures. MAGMA Templates derives its performance and portability by (1) building on existing state-of-the-art linear algebra libraries, such as MAGMA, SLATE, Trilinos, and vendor-optimized math libraries, and (2) giving users seamless access to the latest algorithms and architecture-specific optimizations through a single, easy-to-use C++-based API.
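To make the "single C++-based API" idea concrete, here is a minimal, hypothetical sketch of how a templated, backend-dispatching routine might be expressed. The namespace, tag types, and function names are assumptions for illustration only, not the actual MAGMA Templates interface.

    // Hypothetical illustration only: the namespace, tag types, and function
    // names below are assumptions for this sketch, not the MAGMA Templates API.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    namespace magma_templates {

    // Tag types standing in for the backends a single API could dispatch to.
    struct cpu_backend {};
    struct gpu_backend {};

    // One templated entry point; specializations (or overloads) would forward
    // to an optimized library such as a vendor BLAS, MAGMA, or SLATE.
    template <typename Backend, typename T>
    void gemm(std::size_t m, std::size_t n, std::size_t k,
              const T* A, const T* B, T* C);

    // Reference CPU specialization: a plain triple loop standing in for a call
    // into an optimized host BLAS.
    template <>
    void gemm<cpu_backend, double>(std::size_t m, std::size_t n, std::size_t k,
                                   const double* A, const double* B, double* C) {
      for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
          double acc = 0.0;
          for (std::size_t p = 0; p < k; ++p)
            acc += A[i * k + p] * B[p * n + j];
          C[i * n + j] = acc;
        }
    }

    }  // namespace magma_templates

    int main() {
      const std::size_t m = 2, n = 2, k = 2;
      std::vector<double> A = {1, 2, 3, 4}, B = {5, 6, 7, 8}, C(m * n, 0.0);
      // Switching backends would only require changing the tag parameter.
      magma_templates::gemm<magma_templates::cpu_backend, double>(
          m, n, k, A.data(), B.data(), C.data());
      std::printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    }

The point of the sketch is the single entry point: a user changes the backend tag rather than the call site, which is the kind of seamless portability the abstract describes.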

2020, Vol. 38 (3-4), pp. 1-31
Author(s): Won Wook Song, Youngseok Yang, Jeongyoon Eo, Jangho Seo, Joo Yeon Kim, et al.

Optimizing the scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces for applying the optimizations, but they do not ensure that application semantics remain correct and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficiently fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance while also ensuring correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of the dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored to specific deployment scenarios. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project.


2001, Vol. 11 (02n03), pp. 203-221
Author(s): Mark Baker, Garry Smith

Java is becoming an increasingly popular language for developing distributed and parallel scientific and engineering applications. Jini is a Java-based infrastructure developed by Sun that can allegedly provide all the services necessary to support distributed applications. The aim of this paper is to explore and investigate the services and properties that Jini actually provides and match these against the needs of high-performance distributed and parallel applications written in Java. The motivation for this work is the need to develop a distributed infrastructure to support an MPI-like interface to Java known as MPJ. In the first part of the paper we discuss the needs of MPJ, the parallel environment that we wish to support; in particular we look at aspects such as reliability and ease of use. We then move on to sketch out the Jini architecture and review the components and services that Jini provides. In the third part of the paper we critically explore a Jini infrastructure that could be used to support MPJ. Here we are particularly concerned with Jini's ability to reliably support a cocoon of MPJ processes executing in a heterogeneous environment. In the final part of the paper we summarise our findings and report on future work being undertaken on Jini and MPJ.


Electronics, 2021, Vol. 10 (19), pp. 2386
Author(s): Raúl Nozal, Jose Luis Bosque

Heterogeneous systems are the core architecture of most computing systems, from high-performance computing nodes to embedded devices, due to their excellent performance and energy efficiency. Efficiently programming these systems has become a major challenge due to the complexity of their architectures and the effort required to provide them with co-execution capabilities that applications can fully exploit. There are many proposals to simplify the programming and management of acceleration devices and multi-core CPUs. However, in many cases, portability and ease of use compromise the efficiency of the different devices, even more so when co-executing. Intel oneAPI, a new and powerful standards-based unified programming model built on top of SYCL, addresses these issues. In this paper, oneAPI is extended with co-execution strategies that run the same kernel across different devices, enabling both static and dynamic load-balancing policies. This work evaluates performance and energy efficiency for a well-known set of regular and irregular HPC benchmarks, using two heterogeneous systems composed of an integrated GPU and a CPU. Static and dynamic load balancers are integrated and evaluated, highlighting single and co-execution strategies and the most significant key points of this promising technology. Experimental results show that co-execution is worthwhile when using dynamic algorithms, and that efficiency improves even further when unified shared memory is used.
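As a rough illustration of the static co-execution idea evaluated above, the following SYCL sketch runs the same saxpy kernel concurrently on a CPU queue and a GPU queue, each taking half of the work over USM shared allocations. The 50/50 split, the kernel, and the per-queue allocations are assumptions for illustration, not the authors' benchmark code.

    #include <sycl/sycl.hpp>
    #include <cstdio>

    int main() {
      // One queue per device; the kernel's iteration range is split statically.
      sycl::queue cpu_q{sycl::cpu_selector_v};
      sycl::queue gpu_q{sycl::gpu_selector_v};

      const size_t n = 1 << 20;
      const size_t split = n / 2;   // assumed static 50/50 partition
      const float a = 3.0f;

      // USM shared allocations, one pair per queue/context, so each device
      // works on its own chunk. (With both devices in a single context, one
      // shared allocation could serve both kernels.)
      float* x_g = sycl::malloc_shared<float>(split, gpu_q);
      float* y_g = sycl::malloc_shared<float>(split, gpu_q);
      float* x_c = sycl::malloc_shared<float>(n - split, cpu_q);
      float* y_c = sycl::malloc_shared<float>(n - split, cpu_q);
      for (size_t i = 0; i < split; ++i)     { x_g[i] = 1.0f; y_g[i] = 2.0f; }
      for (size_t i = 0; i < n - split; ++i) { x_c[i] = 1.0f; y_c[i] = 2.0f; }

      // The same saxpy kernel runs concurrently on both devices.
      auto e_gpu = gpu_q.parallel_for(sycl::range<1>{split},
          [=](sycl::id<1> i) { y_g[i] = a * x_g[i] + y_g[i]; });
      auto e_cpu = cpu_q.parallel_for(sycl::range<1>{n - split},
          [=](sycl::id<1> i) { y_c[i] = a * x_c[i] + y_c[i]; });
      e_gpu.wait();
      e_cpu.wait();

      std::printf("y[0] = %.1f, y[n-1] = %.1f\n", y_g[0], y_c[n - split - 1]);
      sycl::free(x_g, gpu_q); sycl::free(y_g, gpu_q);
      sycl::free(x_c, cpu_q); sycl::free(y_c, cpu_q);
    }

A dynamic policy would instead hand out chunks of the range to whichever queue finishes first, which is where the paper reports the largest co-execution gains.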


2020, Vol. 13 (3), pp. 313-318
Author(s): Dhanapal Angamuthu, Nithyanandam Pandian

Background: Cloud computing is the modern trend in high-performance computing. It has become very popular due to characteristics such as anywhere availability, elasticity, ease of use, and cost-effectiveness. Although the cloud grants various benefits, it has associated issues and challenges that prevent organizations from adopting it.

Objective: The objective of this paper is to cover several perspectives of cloud computing. This includes a basic definition of the cloud and a classification of clouds based on delivery and deployment models. The broad classes of issues and challenges that organizations face when adopting the cloud computing model are explored, for example data-related issues and service-availability issues. The detailed sub-classifications of each issue and challenge are discussed; for instance, data-related issues are further classified into data security, data integrity, data location, and multitenancy issues. The paper also covers the typical problem of vendor lock-in and analyzes the various possible insider attacks unique to the cloud environment.

Results: Guidelines and recommendations for the different issues and challenges are discussed. Most importantly, potential research areas in the cloud domain are explored.

Conclusion: This paper discusses cloud computing, its classifications, and the several issues and challenges faced in adopting the cloud, together with guidelines and recommendations for addressing them and potential research areas in the cloud domain. This helps researchers, academics, and industry to focus on and address the current challenges faced by customers.


Author(s): Jack Dongarra, Laura Grigori, Nicholas J. Higham

A number of features of today’s high-performance computers make it challenging to exploit these machines fully for computational science. These include increasing core counts but stagnant clock frequencies; the high cost of data movement; use of accelerators (GPUs, FPGAs, coprocessors), making architectures increasingly heterogeneous; and multiple precisions of floating-point arithmetic, including half-precision. Moreover, as well as maximizing speed and accuracy, minimizing energy consumption is an important criterion. New generations of algorithms are needed to tackle these challenges. We discuss some approaches that we can take to develop numerical algorithms for high-performance computational science, with a view to exploiting the next generation of supercomputers. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
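One well-known concrete instance of such algorithms is mixed-precision iterative refinement, where the bulk of the work is done in low precision and accuracy is recovered with residuals computed in higher precision. The toy sketch below (a float Jacobi-style inner correction refining a double-precision iterate on a tiny 2x2 system) is purely illustrative and is not taken from the article.

    #include <cmath>
    #include <cstdio>

    int main() {
      // Toy mixed-precision iterative refinement on a tiny system Ax = b:
      // the cheap inner correction is computed in float, while the residual
      // r = b - A*x is accumulated in double.
      const int n = 2;
      const double A[n][n] = {{4.0, 1.0}, {1.0, 3.0}};
      const double b[n]    = {1.0, 2.0};

      // Low-precision inner "solve": one Jacobi-style sweep in float, used as
      // an inexact correction (an assumption made to keep the toy short).
      auto low_precision_correction = [&](const double r[], double z[]) {
        for (int i = 0; i < n; ++i)
          z[i] = static_cast<float>(r[i]) / static_cast<float>(A[i][i]);
      };

      double x[n] = {0.0, 0.0};
      for (int iter = 0; iter < 50; ++iter) {
        // Residual in the higher (double) precision.
        double r[n];
        for (int i = 0; i < n; ++i) {
          double s = b[i];
          for (int j = 0; j < n; ++j) s -= A[i][j] * x[j];
          r[i] = s;
        }
        if (std::sqrt(r[0] * r[0] + r[1] * r[1]) < 1e-12) {
          std::printf("converged after %d refinement steps\n", iter);
          break;
        }
        // Apply the low-precision correction to the high-precision iterate.
        double z[n];
        low_precision_correction(r, z);
        for (int i = 0; i < n; ++i) x[i] += z[i];
      }
      std::printf("x = (%.12f, %.12f)\n", x[0], x[1]);
    }

The pattern scales up directly: in practice the expensive step (for example an LU factorization) is done in a reduced precision, while the cheap residual computation keeps the final answer at working accuracy.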


2021, Vol. 47 (2), pp. 1-28
Author(s): Goran Flegar, Hartwig Anzt, Terry Cojean, Enrique S. Quintana-Ortí

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up the computations. For algorithms whose performance is bound by memory bandwidth, the idea of compressing their data before (and after) memory accesses has received considerable attention. One approach is to store an approximate operator, such as a preconditioner, in lower than working precision, ideally without impacting the algorithm's output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner, which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats which optimize the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
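To illustrate the kind of per-block decision such a preconditioner makes (not Ginkgo's actual implementation), the hypothetical sketch below picks a storage precision for each inverted diagonal block from an estimate of its conditioning, falling back to the working precision when a block is too ill-conditioned for a reduced format. The thresholds, safety margin, and names are assumptions of this sketch.

    #include <cstdio>

    // Hypothetical sketch of adaptive-precision storage selection for a
    // block-Jacobi preconditioner: each inverted diagonal block is kept in the
    // cheapest format whose unit roundoff is still small relative to the
    // block's estimated condition number. Thresholds, the safety margin, and
    // the enum are assumptions of this sketch, not Ginkgo's internal rules.
    enum class storage_precision { fp64, fp32, fp16 };

    storage_precision select_precision(double block_cond_estimate) {
      const double u32 = 6.0e-8;     // approximate unit roundoff of fp32
      const double u16 = 4.9e-4;     // approximate unit roundoff of fp16
      const double margin = 1.0e-2;  // assumed safety margin

      if (block_cond_estimate * u16 < margin) return storage_precision::fp16;
      if (block_cond_estimate * u32 < margin) return storage_precision::fp32;
      return storage_precision::fp64;  // ill-conditioned block: keep working precision
    }

    int main() {
      const char* name[] = {"fp64", "fp32", "fp16"};
      const double conds[] = {5.0, 3.0e3, 1.0e9};  // sample per-block estimates
      for (double c : conds)
        std::printf("block cond ~ %.1e -> store in %s\n",
                    c, name[static_cast<int>(select_precision(c))]);
    }

Since the preconditioner application is memory-bound, storing well-conditioned blocks in a reduced format cuts the data volume read per iteration, which is where the reported runtime savings come from.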

