The Strong Scaling Advantage of FPGAs in HPC for N-body Simulations

2022 ◽  
Vol 15 (1) ◽  
pp. 1-30
Author(s):  
Johannes Menzel ◽  
Christian Plessl ◽  
Tobias Kenter

N-body methods are one of the essential algorithmic building blocks of high-performance and parallel computing. Previous research has shown promising performance for implementing n-body simulations with pairwise force calculations on FPGAs. However, to avoid challenges with accumulation and memory access patterns, the presented designs calculate each pair of forces twice, along with both force sums of the involved particles. Also, they require large problem instances with hundreds of thousands of particles to reach their respective peak performance, limiting the applicability for strong scaling scenarios. This work addresses both issues by presenting a novel FPGA design that uses each calculated force twice and overlaps data transfers and computations in a way that allows peak performance to be reached even for small problem instances, outperforming previous single-precision results even in double precision, and scaling linearly over multiple interconnected FPGAs. For a comparison across architectures, we provide an equally optimized CPU reference, which for large problems actually achieves higher peak performance per device. However, given the strong scaling advantages of the FPGA design, in parallel setups with a few thousand particles per device, the FPGA platform achieves the highest performance and power efficiency.
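
To make the "each force used twice" idea concrete, the following C++ sketch shows a symmetric pairwise accumulation on the CPU: every pair interaction is evaluated once and applied with opposite sign to both particles via Newton's third law. The particle layout, gravitational constant G, and softening term eps2 are illustrative assumptions; this is neither the FPGA design nor the paper's CPU reference.

#include <cmath>
#include <cstddef>
#include <vector>

struct Particle {
    double x, y, z;    // position
    double m;          // mass
    double fx, fy, fz; // accumulated force
};

// Symmetric pairwise accumulation: each of the N*(N-1)/2 pair forces is
// computed once and applied to both particles (Newton's third law), in
// contrast to designs that evaluate every pair twice.
void accumulate_forces(std::vector<Particle>& p, double G, double eps2) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[j].x - p[i].x;
            const double dy = p[j].y - p[i].y;
            const double dz = p[j].z - p[i].z;
            const double r2 = dx * dx + dy * dy + dz * dz + eps2; // softened squared distance
            const double inv_r = 1.0 / std::sqrt(r2);
            const double s = G * p[i].m * p[j].m * inv_r * inv_r * inv_r;
            p[i].fx += s * dx;  p[i].fy += s * dy;  p[i].fz += s * dz;
            p[j].fx -= s * dx;  p[j].fy -= s * dy;  p[j].fz -= s * dz;
        }
    }
}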

2019 ◽  
Vol 2019 ◽  
pp. 1-18 ◽  
Author(s):  
Mike Ashworth ◽  
Graham D. Riley ◽  
Andrew Attwood ◽  
John Mawer

In recent years, there has been renewed interest in the use of field-programmable gate arrays (FPGAs) for high-performance computing (HPC). In this paper, we explore the techniques required by traditional HPC programmers in porting HPC applications to FPGAs, using as an example the LFRic weather and climate model. We report on the first steps in porting LFRic to the FPGAs of the EuroExa architecture. We have used Vivado High-Level Synthesis to implement a matrix-vector kernel from the LFRic code on a Xilinx UltraScale+ development board containing an XCZU9EG multiprocessor system-on-chip. We describe the porting of the code, discuss the optimization decisions, and report performance of 5.34 Gflop/s with double precision and 5.58 Gflop/s with single precision. We discuss sources of inefficiencies, comparisons with peak performance, comparisons with CPU and GPU performance (taking into account power and price), comparisons with published techniques, and comparisons with published performance, and we conclude with some comments on the prospects for future progress with FPGA acceleration of the weather forecast model. The realization of practical exascale-class high-performance computing systems requires significant improvements in the energy efficiency of such systems and their components. This has generated interest in computer architectures which utilize accelerators alongside traditional CPUs. FPGAs offer huge potential as an accelerator which can deliver performance for scientific applications at high levels of energy efficiency. The EuroExa project is developing and building a high-performance architecture based upon ARM CPUs with FPGA acceleration targeting exascale-class performance within a realistic power budget.
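
For readers unfamiliar with high-level synthesis, the sketch below shows the general shape of a pipelined matrix-vector kernel as one might express it in C++ for an HLS tool; the array sizes, loop pragmas, and interface are illustrative assumptions and not the actual LFRic kernel from the paper.

constexpr int NDOF = 8;    // assumed: degrees of freedom handled per column
constexpr int NCOL = 1024; // assumed: number of columns processed per call

// Dense matrix-vector product y = A*x applied column by column, written in the
// style of a high-level-synthesis kernel. Pragmas and sizes are illustrative.
void matvec_kernel(const double A[NCOL][NDOF][NDOF],
                   const double x[NCOL][NDOF],
                   double y[NCOL][NDOF]) {
columns:
    for (int c = 0; c < NCOL; ++c) {
rows:
        for (int i = 0; i < NDOF; ++i) {
#pragma HLS PIPELINE II=1
            double acc = 0.0;
inner:
            for (int j = 0; j < NDOF; ++j) {
#pragma HLS UNROLL
                acc += A[c][i][j] * x[c][j]; // multiply-accumulate, unrolled across j
            }
            y[c][i] = acc;
        }
    }
}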


Author(s):  
Anish Varghese ◽  
Bob Edwards ◽  
Gaurav Mitra ◽  
Alistair P Rendell

Energy efficiency is the primary impediment in the path to exascale computing. Consequently, the high-performance computing community is increasingly interested in low-power high-performance embedded systems as building blocks for large-scale high-performance systems. The Adapteva Epiphany architecture integrates low-power RISC cores on a 2D mesh network and promises up to 70 GFLOPS/Watt of theoretical performance. However, with just 32 KB of memory per eCore for storing both data and code, programming the Epiphany system presents significant challenges. In this paper we evaluate the performance of a 64-core Epiphany system with a variety of basic compute and communication micro-benchmarks. Further, we implemented two well-known application kernels: a 5-point star-shaped heat stencil, with a peak performance of 65.2 GFLOPS, and matrix multiplication, with 65.3 GFLOPS, in single precision across 64 Epiphany cores. We discuss strategies for implementing high-performance computing application kernels on such memory-constrained low-power devices and compare the Epiphany with competing low-power systems. With future Epiphany revisions expected to house thousands of cores on a single chip, understanding the merits of such an architecture is of prime importance to the exascale initiative.
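
As an illustration of the first kernel, a minimal C++ version of one 5-point (Jacobi-style) heat stencil sweep is sketched below; the grid dimensions and diffusion factor are assumed values, and a real Epiphany implementation would additionally tile the grid into the 32 KB per-core memories and exchange halos over the mesh network.

#include <vector>

// One Jacobi sweep of the 5-point heat stencil: each interior point is updated
// from its four neighbours and itself. Sizes and the diffusion factor 'alpha'
// are illustrative; an Epiphany version would tile this into per-core buffers.
void heat_step(const std::vector<double>& in, std::vector<double>& out,
               int nx, int ny, double alpha) {
    for (int i = 1; i < ny - 1; ++i) {
        for (int j = 1; j < nx - 1; ++j) {
            const int c = i * nx + j;
            out[c] = in[c] + alpha * (in[c - 1] + in[c + 1] +
                                      in[c - nx] + in[c + nx] - 4.0 * in[c]);
        }
    }
}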


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Esteban Gonzalez-Valencia ◽  
Ignacio Del Villar ◽  
Pedro Torres

With the goal of ultimate control over light propagation, photonic crystals currently represent the primary building blocks for novel nanophotonic devices. Bloch surface waves (BSWs) in periodic dielectric multilayer structures with a surface defect are a well-known phenomenon, which opens new opportunities for controlling light propagation and has many applications in the physical and biological sciences. However, most of the reported structures based on BSWs require depositing a large number of alternating layers or exploiting a large refractive index (RI) contrast between the materials constituting the multilayer structure, thereby increasing the complexity and cost of manufacturing. The combination of fiber-optic-based platforms with nanotechnology is opening the opportunity for the development of high-performance photonic devices that strongly enhance the light-matter interaction compared to other optical platforms. Here, we report a BSW-supporting platform that uses geometrically modified commercial optical fibers, such as D-shaped optical fibers, where a few-layer structure is deposited on the flat surface using metal oxides with a moderate difference in RI. In this novel fiber-optic platform, BSWs are excited through the evanescent field of the core-guided fundamental mode, which indicates that the proposed structure can be used as a sensing probe, with the other intrinsic advantages of fiber-optic sensors, such as light weight, multiplexing capacity, and ease of integration into an optical network. As a demonstration, fiber-optic BSW excitation is shown to be suitable for measuring RI variations. The designed structure is easy to manufacture and could be adapted to a wide range of applications in the fields of telecommunications, environment, health, and material characterization.
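
For context, the reflectance of such a dielectric multilayer is commonly modelled with the transfer-matrix method, with a BSW showing up as a sharp dip when the reflectance is scanned against the effective index of the exciting (evanescent) wave. The C++ sketch below is a generic TE-polarization transfer-matrix calculation under assumed layer data; it is not the design procedure used in the paper.

#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

struct Layer { double n; double d_nm; }; // refractive index and thickness (nm), illustrative values

// TE-polarization reflectance of a multilayer between an incidence medium n0
// and a final medium ns, at effective index neff = n0*sin(theta). Layers are
// ordered from the incidence medium towards the final medium. Uses the
// standard characteristic (transfer) matrix of each layer.
double reflectance_TE(double n0, double ns, const std::vector<Layer>& layers,
                      double lambda_nm, double neff) {
    const double pi = 3.14159265358979323846;
    const double k0 = 2.0 * pi / lambda_nm;
    auto q = [&](double n) { return std::sqrt(cd(n * n - neff * neff, 0.0)); };

    // Accumulate the product of per-layer characteristic matrices.
    cd m11 = 1, m12 = 0, m21 = 0, m22 = 1;
    for (const Layer& L : layers) {
        const cd qj = q(L.n);
        const cd phi = k0 * qj * L.d_nm;
        const cd c = std::cos(phi), s = std::sin(phi);
        const cd a11 = c,                 a12 = cd(0, 1) * s / qj;
        const cd a21 = cd(0, 1) * qj * s, a22 = c;
        const cd n11 = m11 * a11 + m12 * a21, n12 = m11 * a12 + m12 * a22;
        const cd n21 = m21 * a11 + m22 * a21, n22 = m21 * a12 + m22 * a22;
        m11 = n11; m12 = n12; m21 = n21; m22 = n22;
    }
    const cd q0 = q(n0), qs = q(ns);
    const cd r = (q0 * m11 + q0 * qs * m12 - m21 - qs * m22) /
                 (q0 * m11 + q0 * qs * m12 + m21 + qs * m22);
    return std::norm(r); // |r|^2
}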


2020 ◽  
Vol 11 (SPL4) ◽  
pp. 2756-2767
Author(s):  
Vijaya Vemani ◽  
Mounika P ◽  
Poulami Das ◽  
Anand Kumar Tengli

Amino acids, the building blocks of the body, play a crucial role in the preservation of normal physiological functions. Fruit juices contain a number of valuable nutritional phytoconstituents, such as vitamins, minerals, microelements, organic acids, antioxidants, flavonoids, amino acids, and other components. Due to the growing population and demand, the quality of fruit juices is decreasing. One unethical and harmful practice, adulteration or food fraudulence, has been adopted by much of the food and beverage industry. Amino acids are among the most important phytochemicals of fruit and fruit juices: they affect organoleptic properties such as color, odor, and taste, and the total amino acid content also supports authenticity checks by governing bodies. Consequently, the main aim of the present review is to provide information on the importance of amino acids, how they are adulterated, the potential analytical approaches to detecting amino acids, and which methods are generally accepted by the food industry. Based on the literature reviewed, we presume that reversed-phase high-performance liquid chromatography with pre-column derivatization is the most widely adopted method for quality checking, owing to its advantages over older and newer analytical approaches: it is simple, rapid, and cost-effective, shows little or no sample matrix effect, and offers high sensitivity, accuracy, and precision.


2013 ◽  
Vol 23 (04) ◽  
pp. 1340011 ◽  
Author(s):  
FAISAL SHAHZAD ◽  
MARKUS WITTMANN ◽  
MORITZ KREUTZER ◽  
THOMAS ZEISER ◽  
GEORG HAGER ◽  
...  

The road to exascale computing poses many challenges for the High Performance Computing (HPC) community. Each step on the exascale path is mainly the result of a higher level of parallelism of the basic building blocks (i.e., CPUs, memory units, networking components, etc.). The reliability of each of these basic components does not increase at the same rate as the rate of hardware parallelism. This results in a reduction of the mean time to failure (MTTF) of the whole system. A fault tolerance environment is thus indispensable to run large applications on such clusters. Checkpoint/Restart (C/R) is the classic and most popular method to minimize failure damage. Its ease of implementation makes it useful, but typically it introduces significant overhead to the application. Several efforts have been made to reduce the C/R overhead. In this paper we compare various C/R techniques for their overheads by implementing them on two different categories of applications. These approaches are based on parallel-file-system (PFS)-level checkpoints (synchronous/asynchronous) and node-level checkpoints. We utilize the Scalable Checkpoint/Restart (SCR) library for the comparison of node-level checkpoints. For asynchronous PFS-level checkpoints, we use the Damaris library, the SCR asynchronous feature, and application-based checkpointing via dedicated threads. Our baseline for overhead comparison is the naïve application-based synchronous PFS-level checkpointing method. A 3D lattice-Boltzmann (LBM) flow solver and a Lanczos eigenvalue solver are used as prototypical applications in which all the techniques considered here may be applied.
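
For reference, the baseline mentioned above (naïve application-based synchronous PFS-level checkpointing) has the general shape sketched below in C++ with MPI: every rank writes its state to the parallel file system and all ranks wait until the writes complete. The file path, naming scheme, and serialized state are illustrative assumptions; the SCR- and Damaris-based variants change the write path and overlap, not this basic structure.

#include <cstdio>
#include <string>
#include <vector>
#include <mpi.h>

// Naive application-level, synchronous PFS checkpoint: every rank blocks while
// writing its own state file to the parallel file system. This is the baseline
// against which node-level (SCR) and asynchronous (Damaris, dedicated-thread)
// checkpointing are compared. Paths and naming are illustrative only.
void write_checkpoint(const std::vector<double>& state, int step) {
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const std::string path = "/pfs/ckpt/step" + std::to_string(step) +
                             ".rank" + std::to_string(rank) + ".bin";
    std::FILE* f = std::fopen(path.c_str(), "wb");
    if (f != nullptr) {
        std::fwrite(state.data(), sizeof(double), state.size(), f);
        std::fclose(f);
    }
    // Synchronous: no rank resumes computing before all checkpoints are on disk.
    MPI_Barrier(MPI_COMM_WORLD);
}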


2021 ◽  
Vol 21 (11) ◽  
pp. 281
Author(s):  
Qiao Wang ◽  
Chen Meng

We present a GPU-accelerated cosmological simulation code, PhotoNs-GPU, based on the Particle-Mesh Fast Multipole Method (PM-FMM) algorithm, and focus on GPU utilization and optimization. An interpolation method for the truncated gravity is introduced to speed up the special functions in the kernels. We verify the GPU code in mixed precision and at different levels of the interpolation method on the GPU. A run with single precision is roughly two times faster than double precision for current practical cosmological simulations, but it can induce a small unbiased noise in the power spectrum. Compared with the CPU version of PhotoNs and Gadget-2, the efficiency of the new code is significantly improved. With all optimizations of memory access, kernel functions, and concurrency management activated, the peak performance of our test runs reaches 48% of the theoretical speed and the average performance approaches ∼35% on the GPU.
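
The interpolation idea mentioned above amounts to replacing an expensive special function in the short-range force kernel with a precomputed lookup table. The C++ sketch below illustrates this for the erfc-based truncation factor commonly used in particle-mesh force splittings; the table resolution, cutoff, and splitting scale are assumptions, and the code is not taken from PhotoNs-GPU.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Lookup table with linear interpolation for the short-range truncation factor
// g(r) that multiplies the 1/r^2 force in a particle-mesh force splitting.
// Here g(r) = erfc(r/(2*rs)) + (r/(rs*sqrt(pi))) * exp(-r^2/(4*rs^2)) is the
// erfc-based form used in TreePM-style codes; resolution and splitting scale
// 'rs' are illustrative.
class TruncationTable {
public:
    TruncationTable(double r_cut, double rs, int n) : dr_(r_cut / n), tab_(n + 1) {
        const double pi = 3.14159265358979323846;
        for (int i = 0; i <= n; ++i) {
            const double r = i * dr_;
            tab_[i] = std::erfc(r / (2.0 * rs)) +
                      (r / (rs * std::sqrt(pi))) * std::exp(-r * r / (4.0 * rs * rs));
        }
    }

    // Linear interpolation between the two nearest samples, clamped at the cutoff.
    double operator()(double r) const {
        const double x = std::min(r / dr_, static_cast<double>(tab_.size() - 1));
        const std::size_t i = std::min<std::size_t>(static_cast<std::size_t>(x), tab_.size() - 2);
        const double t = x - static_cast<double>(i);
        return (1.0 - t) * tab_[i] + t * tab_[i + 1];
    }

private:
    double dr_;               // table spacing in radius
    std::vector<double> tab_; // sampled truncation factor
};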

