double precision
Recently Published Documents


TOTAL DOCUMENTS: 366 (FIVE YEARS: 96)

H-INDEX: 21 (FIVE YEARS: 3)

2022 ◽  
Vol 15 (1) ◽  
pp. 1-30
Author(s):  
Johannes Menzel ◽  
Christian Plessl ◽  
Tobias Kenter

N-body methods are one of the essential algorithmic building blocks of high-performance and parallel computing. Previous research has shown promising performance for implementing n-body simulations with pairwise force calculations on FPGAs. However, to avoid challenges with accumulation and memory access patterns, the presented designs calculate each pair of forces twice, along with both force sums of the involved particles. They also require large problem instances with hundreds of thousands of particles to reach their respective peak performance, limiting their applicability to strong scaling scenarios. This work addresses both issues with a novel FPGA design that uses each calculated force twice and overlaps data transfers and computations in a way that allows peak performance to be reached even for small problem instances, outperforming previous single-precision results even in double precision, and scaling linearly over multiple interconnected FPGAs. For a comparison across architectures, we provide an equally optimized CPU reference, which for large problems actually achieves higher peak performance per device. However, given the strong scaling advantages of the FPGA design, in parallel setups with a few thousand particles per device the FPGA platform achieves the highest performance and power efficiency.
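For readers unfamiliar with the pair-reuse idea, a minimal sketch in plain NumPy is given below; the Plummer-style softening, G = 1 and all names are illustrative assumptions and do not reproduce the paper's FPGA kernel.

```python
import numpy as np

def pairwise_forces(pos, mass, eps=1e-3):
    """Compute gravitational forces, evaluating each pair only once.

    Each force F_ij is added to particle i and, with opposite sign, to
    particle j (Newton's third law), mirroring the 'use each calculated
    force twice' idea. The softening eps and G = 1 are illustrative
    assumptions.
    """
    n = pos.shape[0]
    forces = np.zeros_like(pos)            # double precision by default
    for i in range(n):
        for j in range(i + 1, n):
            r = pos[j] - pos[i]
            dist2 = r @ r + eps * eps
            f = mass[i] * mass[j] * r / dist2**1.5
            forces[i] += f                  # force on i due to j
            forces[j] -= f                  # reuse: equal and opposite on j
    return forces

# toy usage
rng = np.random.default_rng(0)
pos = rng.standard_normal((8, 3))
mass = np.ones(8)
print(pairwise_forces(pos, mass))
```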


2021 ◽  
Vol 12 (1) ◽  
pp. 378
Author(s):  
Enrique Cantó Navarro ◽  
Rafael Ramos Lara ◽  
Mariano López García

This paper describes three different approaches for the implementation of an online signature verification system on a low-cost FPGA. The system is based on an algorithm that operates on real numbers using the double-precision floating-point IEEE 754 format. The double-precision computations are replaced by simpler formats, without affecting the biometric performance, in order to permit efficient implementations on low-cost FPGA families. The first approach is an embedded system based on MicroBlaze, a 32-bit soft-core microprocessor designed for Xilinx FPGAs, which can be configured to include a single-precision floating-point unit (FPU). The second implementation attaches a hardware accelerator to the embedded system to reduce the execution time of floating-point vector operations. The last approach is a custom computing system built from a large set of arithmetic circuits that replace the floating-point data with a more efficient representation based on a fixed-point format. The latter system provides a very high runtime acceleration factor at the expense of using a large number of FPGA resources, a complex development cycle, and no flexibility, since it cannot be adapted to other biometric algorithms. By contrast, the first system provides just the opposite features, while the second approach is a compromise between the two. The experimental results show that the hardware accelerator and the custom computing system reduce the execution time by factors of ×7.6 and ×201, respectively, but increase the FPGA logic resources by factors of ×2.3 and ×5.2 in comparison with the MicroBlaze embedded system.
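A small sketch of the fixed-point replacement idea is given below; the Q15.16 format and the dot-product kernel are assumptions for illustration and are not the paper's custom datapath.

```python
import numpy as np

FRAC_BITS = 16        # illustrative Q15.16 format, not the paper's exact choice
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Quantise IEEE 754 doubles to fixed point with 16 fractional bits."""
    return np.round(np.asarray(x, dtype=np.float64) * SCALE).astype(np.int64)

def fixed_mul(a, b):
    """Multiply two fixed-point values and rescale the result."""
    return (a * b) >> FRAC_BITS

def fixed_dot(a, b):
    """Dot product of two fixed-point vectors (a stand-in kernel)."""
    return np.sum(fixed_mul(a, b))

x = np.linspace(-1.0, 1.0, 8)
y = np.cos(x)
print(fixed_dot(to_fixed(x), to_fixed(y)) / SCALE)   # fixed-point result
print(float(np.dot(x, y)))                           # double-precision reference
```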


2021 ◽  
Vol 21 (11) ◽  
pp. 281
Author(s):  
Qiao Wang ◽  
Chen Meng

Abstract We present a GPU-accelerated cosmological simulation code, PhotoNs-GPU, based on the Particle Mesh Fast Multipole Method (PM-FMM) algorithm, and focus on GPU utilization and optimization. An interpolation method for the truncated gravity is introduced to speed up the special functions in the kernels. We verify the GPU code in mixed precision and at different levels of the interpolation method on GPU. A run in single precision is roughly two times faster than in double precision for current practical cosmological simulations, but it can induce a small unbiased noise in the power spectrum. Compared with the CPU version of PhotoNs and Gadget-2, the efficiency of the new code is significantly improved. With all optimizations of memory access, kernel functions and concurrency management activated, the peak performance of our test runs reaches 48% of the theoretical speed and the average performance approaches ∼35% on GPU.
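The interpolated-method idea can be illustrated by tabulating a special function once and replacing kernel evaluations with a linear table lookup; the erfc form of the truncation factor and the grid size below are assumptions for illustration only.

```python
import numpy as np
from scipy.special import erfc   # special function standing in for the truncation kernel

# Tabulate the truncated-gravity correction g(r) = erfc(r) once on a grid
# (the functional form and the grid are illustrative assumptions).
r_tab = np.linspace(0.0, 6.0, 4097)
g_tab = erfc(r_tab)
dr = r_tab[1] - r_tab[0]

def g_interp(r):
    """Piecewise-linear table lookup replacing the expensive special function."""
    idx = np.clip((r / dr).astype(np.int64), 0, len(r_tab) - 2)
    t = (r - r_tab[idx]) / dr
    return (1.0 - t) * g_tab[idx] + t * g_tab[idx + 1]

r = np.random.default_rng(1).uniform(0.0, 6.0, 10)
print(np.max(np.abs(g_interp(r) - erfc(r))))   # interpolation error of the table
```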


2021 ◽  
pp. 1-43
Author(s):  
E. Adam Paxton ◽  
Matthew Chantry ◽  
Milan Klöwer ◽  
Leo Saffin ◽  
Tim Palmer

Abstract Motivated by recent advances in operational weather forecasting, we study the efficacy of low-precision arithmetic for climate simulations. We develop a framework to measure rounding error in a climate model, which provides a stress test for a low-precision version of the model, and we apply our method to a variety of models including the Lorenz system; a shallow water approximation for flow over a ridge; and a coarse-resolution spectral global atmospheric model with simplified parameterisations (SPEEDY). Although double precision (52 significant bits) is standard across operational climate models, in our experiments we find that single precision (23 sbits) is more than enough and that as low as half precision (10 sbits) is often sufficient. For example, SPEEDY can be run with 12 sbits across the code with negligible rounding error, and with 10 sbits if minor errors are accepted, amounting to less than 0.1 mm/6hr for average grid-point precipitation. Our test is based on the Wasserstein metric, which provides stringent non-parametric bounds on rounding error accounting for annual means as well as extreme weather events. In addition, by testing models using both round-to-nearest (RN) and stochastic rounding (SR), we find that SR can mitigate rounding error across a range of applications, and thus our results also provide some evidence that SR could be relevant to next-generation climate models. Further research is needed to test whether our results generalise to higher resolutions and alternative numerical schemes. However, the results open a promising avenue towards the use of low-precision hardware for improved climate modelling.
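A software sketch of stochastic rounding to half precision, contrasted with round-to-nearest, is given below; it is only a NumPy emulation and not the rounding machinery used in the paper.

```python
import numpy as np

def stochastic_round_to_half(x, rng):
    """Round float64 values to float16 stochastically (software sketch).

    Each value lands on one of its two neighbouring float16 numbers with
    probability proportional to its distance from the other neighbour,
    so the rounding error is unbiased in expectation.
    """
    x = np.asarray(x, dtype=np.float64)
    lo = x.astype(np.float16)                              # nearest float16
    # push 'lo' down one ulp wherever the nearest float16 overshoots x
    lo = np.where(lo.astype(np.float64) > x,
                  np.nextafter(lo, np.float16(-np.inf)), lo).astype(np.float16)
    hi = np.nextafter(lo, np.float16(np.inf))
    gap = hi.astype(np.float64) - lo.astype(np.float64)
    p_up = np.divide(x - lo.astype(np.float64), gap,
                     out=np.zeros_like(x), where=gap > 0)
    return np.where(rng.random(x.shape) < p_up, hi, lo)

rng = np.random.default_rng(0)
x = np.full(100_000, 0.1)                                  # 0.1 is inexact in float16
rn_err = x.astype(np.float16).astype(np.float64).mean() - 0.1
sr_err = stochastic_round_to_half(x, rng).astype(np.float64).mean() - 0.1
print(rn_err, sr_err)                                      # SR mean error is far closer to zero
```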


SIMULATION ◽  
2021 ◽  
pp. 003754972110544
Author(s):  
Joseph D. Richardson

Unpredictable pseudo-random number generators (PRNGs) are presented based on dissociated components with only coincidental interaction. The first component involves pointers taken from series of floating-point numbers (float streams) arising from arithmetic. The pointers are formed by isolating generalized digits sufficiently far from the most significant digits in the float streams and may be combined into multi-digit pointers. The pointers indicate draw locations in the second component, entropy decks having one or more cards corresponding to the elements used to assemble random numbers. Like playing cards, the decks are cut and riffle-shuffled based on rules using digits appearing in the simulations. The various ordering states of the cards provide entropy to the PRNGs. The dual nature of the PRNGs is novel, since they can operate either entirely on pointer variability against fixed decks or on shuffling variability with fixed pointer locations. Each component, pointers and dynamic entropy, is dissociated from the other and independently shown to pass stringent statistical tests with the other held fixed; a “gold standard” mode involves changing the coincidental interaction between these two strong emulators of randomness by either cutting or shuffling prior to each draw. Gold standard modes may be useful in cryptography and in assessing the tests themselves. One PRNG contains [Formula: see text] states in the entropy pool, another generates integers approximately 50% faster than the Advanced Encryption Standard (AES) PRNG with similar empirical performance, and a third generates full double-precision floats at speeds comparable to the unsigned integer rates of the AES PRNG.
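A toy sketch of the two dissociated components (a float stream supplying pointers, and an entropy deck that is cut before each draw) is given below; the arithmetic map, digit offsets and deck size are invented for illustration and do not reproduce any of the paper's generators.

```python
import random

DECK = list(range(256))            # one card per output byte value
random.shuffle(DECK)               # initial ordering state of the deck

def float_stream():
    """Floats from simple arithmetic; sqrt(2) as the seed is an assumption."""
    x = 2.0 ** 0.5
    while True:
        x = (x * 997.0) % 1.0      # keep only the fractional part
        yield x

def pointer(value, offset=6, width=2):
    """Isolate generalised digits far from the most significant digits."""
    digits = f"{value:.15f}".split(".")[1]
    return int(digits[offset:offset + width]) % len(DECK)

def draw(stream):
    """Cut the deck at a data-dependent position, then draw via a pointer."""
    cut = pointer(next(stream), offset=10)
    DECK[:] = DECK[cut:] + DECK[:cut]          # cut = entropy update
    return DECK[pointer(next(stream))]         # pointer selects the card

s = float_stream()
print([draw(s) for _ in range(16)])            # 16 pseudo-random bytes
```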


Author(s):  
M. W. Jahn ◽  
P. E. Bradley

Abstract. To simulate environmental processes, noise and flooding in cities, as well as the behaviour of buildings and infrastructure, ‘watertight’ volumetric models are a prerequisite. They ensure topologically consistent 3D models and allow the definition of proper topological operations. However, in many existing city or other geo-information models, topologically unchecked boundary representations are used to store spatial entities. In order to obtain consistent topological models, including their ‘fillings’, this paper presents a triangulation combined with overlay and path-finding methods, climbing up the dimensions beginning with the wireframe model. The algorithms developed for this task are presented, using the philosophy of graph databases and the Property Graph Model. Examples to illustrate the algorithms are given, and experiments are performed on a data set from Erfurt, Thuringia (Germany), providing complex building geometries. The heavy influence of double-precision arithmetic on the results, in particular the positional and angular precision, is discussed at the end.
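The angular-precision issue mentioned at the end can be illustrated with a few lines of double-precision arithmetic; the edge directions below are made up for illustration.

```python
import math

def angle_acos(ux, uy, vx, vy):
    """Angle via acos of the normalised dot product (cancellation-prone)."""
    dot = ux * vx + uy * vy
    return math.acos(dot / (math.hypot(ux, uy) * math.hypot(vx, vy)))

def angle_atan2(ux, uy, vx, vy):
    """Angle via atan2 of cross and dot products (numerically stable)."""
    cross = ux * vy - uy * vx
    dot = ux * vx + uy * vy
    return math.atan2(abs(cross), dot)

u = (1.0, 0.0)
v = (1.0, 1e-9)                      # nearly collinear edge directions
print(angle_acos(*u, *v))            # collapses to 0.0: double precision loses the angle
print(angle_atan2(*u, *v))           # ~1e-9, the expected tiny angle
```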


Electronics ◽  
2021 ◽  
Vol 10 (18) ◽  
pp. 2209
Author(s):  
Noureddine Ait Said ◽  
Mounir Benabdenbi ◽  
Katell Morin-Allory

Using standard Floating-Point (FP) formats for computation leads to significant hardware overhead, since these formats are over-designed for error-resilient workloads such as iterative algorithms. Hence, hardware FP Unit (FPU) architectures need run-time variable-precision capabilities. In this work, we propose a new method, called Variable Precision in Time (VPT), and an FPU architecture that enable designers to automatically tune the precision of FP computations at run-time, leading to significant savings in power consumption, execution time, and energy. In spite of its circuit area overhead, the proposed approach simplifies the integration of variable precision into existing software workloads at any level of the software stack (OS, RTOS, or application level): it only requires lightweight software support and relies solely on traditional assembly instructions, without the need for a specialized compiler or custom instructions. We apply the technique to the Jacobi and Gauss–Seidel iterative methods, taking full advantage of the suggested FPU. For each algorithm, two modified versions are proposed: a conservative one and a relaxed one. Both algorithms are analyzed and compared statistically to understand the effects of VPT on iterative applications. The implementations demonstrate up to 70.67% power consumption savings, up to 59.80% execution time savings, and up to 88.20% total energy savings with respect to the reference double-precision implementation, with no accuracy loss.
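A software analogue of the variable-precision idea applied to Jacobi iterations is sketched below; the precision-promotion rule and the NumPy dtype emulation are assumptions and do not represent the proposed VPT hardware.

```python
import numpy as np

def jacobi_vpt(A, b, tol=1e-10, max_iter=10_000):
    """Jacobi iterations whose working precision is raised on demand.

    Start in float16 and promote to float32, then float64, once the
    residual stops improving. The promotion rule is an illustrative
    assumption, not the paper's VPT policy.
    """
    precisions = [np.float16, np.float32, np.float64]
    level = 0
    x = np.zeros_like(b, dtype=precisions[level])
    D = np.diag(A)
    R = A - np.diag(D)
    prev_res = np.inf
    res = np.inf
    for _ in range(max_iter):
        dt = precisions[level]
        x = ((b - R.astype(dt) @ x.astype(dt)) / D.astype(dt)).astype(dt)
        res = np.linalg.norm(b - A @ x.astype(np.float64))
        if res < tol:
            break
        if res >= 0.99 * prev_res and level < len(precisions) - 1:
            level += 1                      # residual stalled: raise precision
        prev_res = res
    return x.astype(np.float64), res

# diagonally dominant test system
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50.0 * np.eye(50)
b = rng.standard_normal(50)
x, res = jacobi_vpt(A, b)
print(res)
```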


Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency. Therefore, it is worthwhile to design a high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate more and more cores, modern CPUs use non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method to reduce the number of cross-die and cross-chip memory accesses. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows task independence and data localization on NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the most remarkable improvement being 21.9%.
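A pure-NumPy stand-in for the two-level partitioning behind NUMA-aware DGEMM is sketched below; real thread pinning and node-local buffer allocation are only described in comments, and the partitioning scheme is an illustrative assumption rather than the OpenBLAS implementation.

```python
import numpy as np

def numa_aware_dgemm(A, B, num_nodes=2, threads_per_node=4):
    """Two-level partitioned DGEMM sketch.

    Outer level: rows of C are split between NUMA nodes so each node's
    threads write only to its local partition of C. Inner level: each
    node's partition is further split among that node's threads. In a
    real implementation the per-node buffers would be allocated in the
    node's local memory and the threads pinned to its cores (not shown).
    """
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    node_rows = np.array_split(np.arange(m), num_nodes)        # outer split
    for node, rows in enumerate(node_rows):
        thread_rows = np.array_split(rows, threads_per_node)   # inner split
        for tid, r in enumerate(thread_rows):
            # each 'thread' computes an independent panel of its node's C block
            C[r, :] = A[r, :] @ B
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((256, 128)), rng.standard_normal((128, 64))
print(np.allclose(numa_aware_dgemm(A, B), A @ B))
```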


2021 ◽  
Author(s):  
Meiyu Xu ◽  
Dayong Lu ◽  
Xiaoyun Sun

Abstract In the past few decades, quantum computation has become increasingly attractive due to its remarkable performance. Quantum image scaling is considered a common geometric transformation in quantum image processing; however, no quantum version for floating-point data exists. Is there a corresponding scaling for 2-D and 3-D floating-point data? The answer is yes. In this paper, we present a quantum scaling-up-and-down scheme for floating-point data using the trilinear interpolation method in 3-D space. This scheme offers better performance (in terms of the precision of floating-point numbers) for realizing quantum floating-point algorithms compared to previous classical approaches. The Converter module we propose can convert fixed-point numbers to floating-point numbers of arbitrary size with p + q qubits based on the IEEE 754 format, instead of 32-bit single precision, 64-bit double precision or 128-bit extended precision. Usually, nearest-neighbor interpolation and bilinear interpolation are used to achieve quantum image scaling algorithms, but they are not applicable in high-dimensional space. This paper proposes trilinear interpolation of floating-point numbers in 3-D space to achieve quantum algorithms of scaling up and down for 3-D floating-point data. Finally, the circuits of quantum scaling up and down for 3-D floating-point data are designed.
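The classical trilinear interpolation that the quantum circuits realise can be sketched in a few lines; the example volume below is an assumption chosen so the result can be checked by hand.

```python
import numpy as np

def trilinear(volume, x, y, z):
    """Classical trilinear interpolation at fractional coordinates (x, y, z).

    The classical counterpart of the interpolation the paper realises with
    quantum circuits; shown here only to illustrate the arithmetic.
    """
    x0, y0, z0 = int(np.floor(x)), int(np.floor(y)), int(np.floor(z))
    dx, dy, dz = x - x0, y - y0, z - z0
    c = 0.0
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                w = ((1 - dx) if i == 0 else dx) * \
                    ((1 - dy) if j == 0 else dy) * \
                    ((1 - dz) if k == 0 else dz)
                c += w * volume[x0 + i, y0 + j, z0 + k]
    return c

vol = np.fromfunction(lambda i, j, k: i + 2 * j + 3 * k, (4, 4, 4))
print(trilinear(vol, 1.25, 2.5, 0.75))   # matches 1.25 + 2*2.5 + 3*0.75 = 8.5 exactly
```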

