The Strong Scaling Advantage of FPGAs in HPC for N-body Simulations

2022 ◽  
Vol 15 (1) ◽  
pp. 1-30
Author(s):  
Johannes Menzel ◽  
Christian Plessl ◽  
Tobias Kenter

N-body methods are one of the essential algorithmic building blocks of high-performance and parallel computing. Previous research has shown promising performance for implementing n-body simulations with pairwise force calculations on FPGAs. However, to avoid challenges with accumulation and memory access patterns, the presented designs calculate each pair of forces twice, along with both force sums of the involved particles. Also, they require large problem instances with hundreds of thousands of particles to reach their respective peak performance, limiting the applicability for strong scaling scenarios. This work addresses both issues by presenting a novel FPGA design that uses each calculated force twice and overlaps data transfers and computations in a way that allows peak performance to be reached even for small problem instances, outperforming previous single-precision results even in double precision, and scaling linearly over multiple interconnected FPGAs. For a comparison across architectures, we provide an equally optimized CPU reference, which for large problems actually achieves higher peak performance per device. However, given the strong scaling advantages of the FPGA design, in parallel setups with a few thousand particles per device, the FPGA platform achieves the highest performance and power efficiency.
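
To make the "each force used twice" idea concrete, the following C++ sketch shows a symmetric pairwise accumulation on the CPU: every pair interaction is evaluated once and applied with opposite sign to both particles via Newton's third law. The particle layout, gravitational constant G, and softening term eps2 are illustrative assumptions; this is neither the FPGA design nor the paper's CPU reference.

#include <cmath>
#include <cstddef>
#include <vector>

struct Particle {
    double x, y, z;    // position
    double m;          // mass
    double fx, fy, fz; // accumulated force
};

// Symmetric pairwise accumulation: each of the N*(N-1)/2 pair forces is
// computed once and applied to both particles (Newton's third law), in
// contrast to designs that evaluate every pair twice.
void accumulate_forces(std::vector<Particle>& p, double G, double eps2) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[j].x - p[i].x;
            const double dy = p[j].y - p[i].y;
            const double dz = p[j].z - p[i].z;
            const double r2 = dx * dx + dy * dy + dz * dz + eps2; // softened squared distance
            const double inv_r = 1.0 / std::sqrt(r2);
            const double s = G * p[i].m * p[j].m * inv_r * inv_r * inv_r;
            p[i].fx += s * dx;  p[i].fy += s * dy;  p[i].fz += s * dz;
            p[j].fx -= s * dx;  p[j].fy -= s * dy;  p[j].fz -= s * dz;
        }
    }
}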

2019 ◽  
Vol 2019 ◽  
pp. 1-18 ◽  
Author(s):  
Mike Ashworth ◽  
Graham D. Riley ◽  
Andrew Attwood ◽  
John Mawer

In recent years, there has been renewed interest in the use of field-programmable gate arrays (FPGAs) for high-performance computing (HPC). In this paper, we explore the techniques required by traditional HPC programmers in porting HPC applications to FPGAs, using as an example the LFRic weather and climate model. We report on the first steps in porting LFRic to the FPGAs of the EuroExa architecture. We have used Vivado High-Level Synthesis to implement a matrix-vector kernel from the LFRic code on a Xilinx UltraScale+ development board containing an XCZU9EG multiprocessor system-on-chip. We describe the porting of the code, discuss the optimization decisions, and report performance of 5.34 Gflop/s with double precision and 5.58 Gflop/s with single precision. We discuss sources of inefficiencies, comparisons with peak performance, comparisons with CPU and GPU performance (taking into account power and price), comparisons with published techniques, and comparisons with published performance, and we conclude with some comments on the prospects for future progress with FPGA acceleration of the weather forecast model. The realization of practical exascale-class high-performance computing systems requires significant improvements in the energy efficiency of such systems and their components. This has generated interest in computer architectures which utilize accelerators alongside traditional CPUs. FPGAs offer huge potential as an accelerator which can deliver performance for scientific applications at high levels of energy efficiency. The EuroExa project is developing and building a high-performance architecture based upon ARM CPUs with FPGA acceleration targeting exascale-class performance within a realistic power budget.
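
For readers unfamiliar with high-level synthesis, the sketch below shows the general shape of a pipelined matrix-vector kernel as one might express it in C++ for an HLS tool; the array sizes, loop pragmas, and interface are illustrative assumptions and not the actual LFRic kernel from the paper.

constexpr int NDOF = 8;    // assumed: degrees of freedom handled per column
constexpr int NCOL = 1024; // assumed: number of columns processed per call

// Dense matrix-vector product y = A*x applied column by column, written in the
// style of a high-level-synthesis kernel. Pragmas and sizes are illustrative.
void matvec_kernel(const double A[NCOL][NDOF][NDOF],
                   const double x[NCOL][NDOF],
                   double y[NCOL][NDOF]) {
columns:
    for (int c = 0; c < NCOL; ++c) {
rows:
        for (int i = 0; i < NDOF; ++i) {
#pragma HLS PIPELINE II=1
            double acc = 0.0;
inner:
            for (int j = 0; j < NDOF; ++j) {
#pragma HLS UNROLL
                acc += A[c][i][j] * x[c][j]; // multiply-accumulate, unrolled across j
            }
            y[c][i] = acc;
        }
    }
}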


Author(s):  
Anish Varghese ◽  
Bob Edwards ◽  
Gaurav Mitra ◽  
Alistair P Rendell

Energy efficiency is the primary impediment in the path to exascale computing. Consequently, the high-performance computing community is increasingly interested in low-power high-performance embedded systems as building blocks for large-scale high-performance systems. The Adapteva Epiphany architecture integrates low-power RISC cores on a 2D mesh network and promises up to 70 GFLOPS/Watt of theoretical performance. However, with just 32 KB of memory per eCore for storing both data and code, programming the Epiphany system presents significant challenges. In this paper we evaluate the performance of a 64-core Epiphany system with a variety of basic compute and communication micro-benchmarks. Further, we implemented two well-known application kernels: a 5-point star-shaped heat stencil, with a peak performance of 65.2 GFLOPS, and matrix multiplication, with 65.3 GFLOPS, in single precision across 64 Epiphany cores. We discuss strategies for implementing high-performance computing application kernels on such memory-constrained low-power devices and compare the Epiphany with competing low-power systems. With future Epiphany revisions expected to house thousands of cores on a single chip, understanding the merits of such an architecture is of prime importance to the exascale initiative.
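
As an illustration of the first kernel, a minimal C++ version of one 5-point (Jacobi-style) heat stencil sweep is sketched below; the grid dimensions and diffusion factor are assumed values, and a real Epiphany implementation would additionally tile the grid into the 32 KB per-core memories and exchange halos over the mesh network.

#include <vector>

// One Jacobi sweep of the 5-point heat stencil: each interior point is updated
// from its four neighbours and itself. Sizes and the diffusion factor 'alpha'
// are illustrative; an Epiphany version would tile this into per-core buffers.
void heat_step(const std::vector<double>& in, std::vector<double>& out,
               int nx, int ny, double alpha) {
    for (int i = 1; i < ny - 1; ++i) {
        for (int j = 1; j < nx - 1; ++j) {
            const int c = i * nx + j;
            out[c] = in[c] + alpha * (in[c - 1] + in[c + 1] +
                                      in[c - nx] + in[c + nx] - 4.0 * in[c]);
        }
    }
}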


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Esteban Gonzalez-Valencia ◽  
Ignacio Del Villar ◽  
Pedro Torres

With the goal of ultimate control over light propagation, photonic crystals currently represent the primary building blocks for novel nanophotonic devices. Bloch surface waves (BSWs) in periodic dielectric multilayer structures with a surface defect are a well-known phenomenon, which opens new opportunities for controlling light propagation and has many applications in the physical and biological sciences. However, most of the reported structures based on BSWs require depositing a large number of alternating layers or exploiting a large refractive index (RI) contrast between the materials constituting the multilayer structure, thereby increasing the complexity and cost of manufacturing. The combination of fiber-optic-based platforms with nanotechnology is opening the opportunity for the development of high-performance photonic devices that strongly enhance the light-matter interaction compared to other optical platforms. Here, we report a BSW-supporting platform that uses geometrically modified commercial optical fibers, such as D-shaped optical fibers, where a few-layer structure is deposited on the flat surface using metal oxides with a moderate difference in RI. In this novel fiber-optic platform, BSWs are excited through the evanescent field of the core-guided fundamental mode, which indicates that the proposed structure can be used as a sensing probe, with the other intrinsic advantages of fiber-optic sensors, such as light weight, multiplexing capacity, and ease of integration into an optical network. As a demonstration, fiber-optic BSW excitation is shown to be suitable for measuring RI variations. The designed structure is easy to manufacture and could be adapted to a wide range of applications in the fields of telecommunications, environment, health, and material characterization.
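
For context, the reflectance of such a dielectric multilayer is commonly modelled with the transfer-matrix method, with a BSW showing up as a sharp dip when the reflectance is scanned against the effective index of the exciting (evanescent) wave. The C++ sketch below is a generic TE-polarization transfer-matrix calculation under assumed layer data; it is not the design procedure used in the paper.

#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

struct Layer { double n; double d_nm; }; // refractive index and thickness (nm), illustrative values

// TE-polarization reflectance of a multilayer between an incidence medium n0
// and a final medium ns, at effective index neff = n0*sin(theta). Layers are
// ordered from the incidence medium towards the final medium. Uses the
// standard characteristic (transfer) matrix of each layer.
double reflectance_TE(double n0, double ns, const std::vector<Layer>& layers,
                      double lambda_nm, double neff) {
    const double pi = 3.14159265358979323846;
    const double k0 = 2.0 * pi / lambda_nm;
    auto q = [&](double n) { return std::sqrt(cd(n * n - neff * neff, 0.0)); };

    // Accumulate the product of per-layer characteristic matrices.
    cd m11 = 1, m12 = 0, m21 = 0, m22 = 1;
    for (const Layer& L : layers) {
        const cd qj = q(L.n);
        const cd phi = k0 * qj * L.d_nm;
        const cd c = std::cos(phi), s = std::sin(phi);
        const cd a11 = c,                 a12 = cd(0, 1) * s / qj;
        const cd a21 = cd(0, 1) * qj * s, a22 = c;
        const cd n11 = m11 * a11 + m12 * a21, n12 = m11 * a12 + m12 * a22;
        const cd n21 = m21 * a11 + m22 * a21, n22 = m21 * a12 + m22 * a22;
        m11 = n11; m12 = n12; m21 = n21; m22 = n22;
    }
    const cd q0 = q(n0), qs = q(ns);
    const cd r = (q0 * m11 + q0 * qs * m12 - m21 - qs * m22) /
                 (q0 * m11 + q0 * qs * m12 + m21 + qs * m22);
    return std::norm(r); // |r|^2
}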


2020 ◽  
Vol 11 (SPL4) ◽  
pp. 2756-2767
Author(s):  
Vijaya Vemani ◽  
Mounika P ◽  
Poulami Das ◽  
Anand Kumar Tengli

Amino acids, the building blocks of the body, play a crucial role in the preservation of normal physiological functions. Fruit juices contain a number of valuable nutritional phytoconstituents, such as vitamins, minerals, microelements, organic acids, antioxidants, flavonoids, amino acids, and other components. Due to the growing population and demand, the quality of fruit juices is decreasing. One unethical and harmful practice, adulteration or food fraudulence, has been adopted by much of the food and beverage industry. Amino acids are among the most important phytochemicals of fruit and fruit juices: they affect organoleptic properties such as color, odor, and taste, and the total amino acid content also supports authenticity checks by governing bodies. Consequently, the main aim of the present review is to provide information on the importance of amino acids, how they are adulterated, the potential analytical approaches to detecting amino acids, and which methods are generally accepted by the food industry. Based on the literature reviewed, we presume that reversed-phase high-performance liquid chromatography with pre-column derivatization is the most widely adopted method for quality checking, owing to its advantages over older and newer analytical approaches: it is simple, rapid, and cost-effective, shows little or no sample matrix effect, and offers high sensitivity, accuracy, and precision.


2013 ◽  
Vol 23 (04) ◽  
pp. 1340011 ◽  
Author(s):  
FAISAL SHAHZAD ◽  
MARKUS WITTMANN ◽  
MORITZ KREUTZER ◽  
THOMAS ZEISER ◽  
GEORG HAGER ◽  
...  

The road to exascale computing poses many challenges for the High Performance Computing (HPC) community. Each step on the exascale path is mainly the result of a higher level of parallelism of the basic building blocks (i.e., CPUs, memory units, networking components, etc.). The reliability of each of these basic components does not increase at the same rate as the rate of hardware parallelism. This results in a reduction of the mean time to failure (MTTF) of the whole system. A fault tolerance environment is thus indispensable to run large applications on such clusters. Checkpoint/Restart (C/R) is the classic and most popular method to minimize failure damage. Its ease of implementation makes it useful, but typically it introduces significant overhead to the application. Several efforts have been made to reduce the C/R overhead. In this paper we compare various C/R techniques for their overheads by implementing them on two different categories of applications. These approaches are based on parallel-file-system (PFS)-level checkpoints (synchronous/asynchronous) and node-level checkpoints. We utilize the Scalable Checkpoint/Restart (SCR) library for the comparison of node-level checkpoints. For asynchronous PFS-level checkpoints, we use the Damaris library, the SCR asynchronous feature, and application-based checkpointing via dedicated threads. Our baseline for overhead comparison is the naïve application-based synchronous PFS-level checkpointing method. A 3D lattice-Boltzmann (LBM) flow solver and a Lanczos eigenvalue solver are used as prototypical applications in which all the techniques considered here may be applied.
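
For reference, the baseline mentioned above (naïve application-based synchronous PFS-level checkpointing) has the general shape sketched below in C++ with MPI: every rank writes its state to the parallel file system and all ranks wait until the writes complete. The file path, naming scheme, and serialized state are illustrative assumptions; the SCR- and Damaris-based variants change the write path and overlap, not this basic structure.

#include <cstdio>
#include <string>
#include <vector>
#include <mpi.h>

// Naive application-level, synchronous PFS checkpoint: every rank blocks while
// writing its own state file to the parallel file system. This is the baseline
// against which node-level (SCR) and asynchronous (Damaris, dedicated-thread)
// checkpointing are compared. Paths and naming are illustrative only.
void write_checkpoint(const std::vector<double>& state, int step) {
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const std::string path = "/pfs/ckpt/step" + std::to_string(step) +
                             ".rank" + std::to_string(rank) + ".bin";
    std::FILE* f = std::fopen(path.c_str(), "wb");
    if (f != nullptr) {
        std::fwrite(state.data(), sizeof(double), state.size(), f);
        std::fclose(f);
    }
    // Synchronous: no rank resumes computing before all checkpoints are on disk.
    MPI_Barrier(MPI_COMM_WORLD);
}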


2021 ◽  
Vol 21 (11) ◽  
pp. 281
Author(s):  
Qiao Wang ◽  
Chen Meng

We present a GPU-accelerated cosmological simulation code, PhotoNs-GPU, based on the Particle-Mesh Fast Multipole Method (PM-FMM) algorithm, and focus on GPU utilization and optimization. An interpolation method for the truncated gravity is introduced to speed up the special functions in the kernels. We verify the GPU code in mixed precision and at different levels of the interpolation method on the GPU. A run with single precision is roughly two times faster than double precision for current practical cosmological simulations, but it can induce a small unbiased noise in the power spectrum. Compared with the CPU version of PhotoNs and Gadget-2, the efficiency of the new code is significantly improved. With all optimizations of memory access, kernel functions, and concurrency management activated, the peak performance of our test runs reaches 48% of the theoretical speed and the average performance approaches ∼35% on the GPU.
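
The interpolation idea mentioned above amounts to replacing an expensive special function in the short-range force kernel with a precomputed lookup table. The C++ sketch below illustrates this for the erfc-based truncation factor commonly used in particle-mesh force splittings; the table resolution, cutoff, and splitting scale are assumptions, and the code is not taken from PhotoNs-GPU.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Lookup table with linear interpolation for the short-range truncation factor
// g(r) that multiplies the 1/r^2 force in a particle-mesh force splitting.
// Here g(r) = erfc(r/(2*rs)) + (r/(rs*sqrt(pi))) * exp(-r^2/(4*rs^2)) is the
// erfc-based form used in TreePM-style codes; resolution and splitting scale
// 'rs' are illustrative.
class TruncationTable {
public:
    TruncationTable(double r_cut, double rs, int n) : dr_(r_cut / n), tab_(n + 1) {
        const double pi = 3.14159265358979323846;
        for (int i = 0; i <= n; ++i) {
            const double r = i * dr_;
            tab_[i] = std::erfc(r / (2.0 * rs)) +
                      (r / (rs * std::sqrt(pi))) * std::exp(-r * r / (4.0 * rs * rs));
        }
    }

    // Linear interpolation between the two nearest samples, clamped at the cutoff.
    double operator()(double r) const {
        const double x = std::min(r / dr_, static_cast<double>(tab_.size() - 1));
        const std::size_t i = std::min<std::size_t>(static_cast<std::size_t>(x), tab_.size() - 2);
        const double t = x - static_cast<double>(i);
        return (1.0 - t) * tab_[i] + t * tab_[i + 1];
    }

private:
    double dr_;               // table spacing in radius
    std::vector<double> tab_; // sampled truncation factor
};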

