An automatic implementation of the mixed precision in NEMO 4.2

Author(s):  
Stella Valentina Paronuzzi Ticco ◽  
Oriol Tintó Prims ◽  
Mario Acosta Cobos ◽  
Miguel Castrillo Melguizo

At the beginning of 2021 a mixed-precision version of the NEMO code was included in the official NEMO repository. The implementation followed the approach presented in Tintó et al. (2019). The proposed optimization, while far from trivial, is not new and is quite popular nowadays. For historical reasons, many computational models over-engineer the numerical precision, which leads to a sub-optimal exploitation of computational infrastructures. Correcting this mismatch yields a considerable payback in terms of efficiency and throughput: we are not only taking a step toward a more environmentally friendly science, we are sometimes pushing the horizon of experiment feasibility a little further. To include the necessary changes smoothly in the official release, an automatic workflow has been implemented: we attempt to minimize the number of changes required and, at the same time, maximize the number of variables that can be computed in single precision. Here we present a general sketch of the tool and workflow used.

Starting from the original code, we automatically produce a new version in which the user can specify the precision of each declared real variable. With this new executable, a numerical-precision analysis can be performed: a search algorithm specially designed for this task drives a workflow manager toward a list of variables that are safe to switch to single precision. The algorithm compares the result of each intermediate step of the workflow with reliable results from a double-precision version of the same code, detecting which variables need to retain higher accuracy.

The result of this analysis is then used to make the modifications needed in the code to produce the desired working mixed-precision version, while keeping the number of necessary changes low. Finally, the previous double-precision and the new mixed-precision versions are compared, including a computational comparison and a scientific validation, to prove that the new version can be used for operational configurations without losing accuracy while increasing the computational performance dramatically.
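The search step can be pictured as a divide-and-conquer loop over the list of real variables. The sketch below is illustrative only: `run_and_validate` is a hypothetical callable standing in for the workflow manager, assumed to run the emulated binary with the given set of variables in single precision (everything else in double) and to report whether all intermediate outputs stay within tolerance of the double-precision reference.

```python
from typing import Callable, List, Set


def find_double_precision_vars(all_vars: List[str],
                               run_and_validate: Callable[[Set[str]], bool]) -> Set[str]:
    """Divide-and-conquer search for variables that must stay in double precision.

    run_and_validate(single_vars) is assumed to run the emulated binary with
    `single_vars` in single precision (everything else in double) and return
    True if every intermediate output stays within tolerance of the
    double-precision reference.
    """
    must_stay_double: Set[str] = set()

    def search(candidates: List[str]) -> None:
        if not candidates:
            return
        if run_and_validate(set(candidates)):
            return                                   # whole group tolerates single precision
        if len(candidates) == 1:
            must_stay_double.add(candidates[0])      # isolated a precision-sensitive variable
            return
        mid = len(candidates) // 2                   # otherwise split the group and recurse
        search(candidates[:mid])
        search(candidates[mid:])

    search(list(all_vars))
    return must_stay_double
```

Groups that validate as a whole are accepted in one shot, so the number of model runs grows with the number of sensitive variables rather than with the total number of variables; a complete analysis would also re-check the combined accepted set for interactions between groups, which this simple sketch omits.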

2020 ◽  
Author(s):  
Oriol Tintó ◽  
Stella Valentina Paronuzzi Ticco ◽  
Mario C. Acosta ◽  
Miguel Castrillo ◽  
Kim Serradell ◽  
...  

One of the requirements for continuing to improve the science produced with NEMO is to enhance its computational performance. The interest in improving its capability to use the computational infrastructure efficiently is two-fold: on one side, there are experiments that would only be possible if a certain throughput threshold is achieved; on the other, any development that increases efficiency helps save resources while reducing the environmental impact of our experiments. One of the opportunities that has raised interest in the last few years is the optimization of numerical precision. Historical reasons led many computational models to over-engineer the numerical precision: correcting this mismatch can pay off in terms of efficiency and throughput. In this direction, research was carried out to safely reduce the numerical precision in NEMO, which led to a mixed-precision version of the model. The implementation has been developed following the approach proposed by Tintó et al. (2019), in which the variables that require double precision are identified automatically and the remaining ones are switched to single precision. The implementation will be released in 2020, and this work presents its evaluation in terms of both performance and scientific results.


2019 ◽  
Vol 12 (7) ◽  
pp. 3135-3148 ◽  
Author(s):  
Oriol Tintó Prims ◽  
Mario C. Acosta ◽  
Andrew M. Moore ◽  
Miguel Castrillo ◽  
Kim Serradell ◽  
...  

Abstract. Mixed-precision approaches can provide substantial speed-ups for both computing- and memory-bound codes with little effort. Most scientific codes have overengineered the numerical precision, leading to a situation in which models are using more resources than required without knowing where they are required and where they are not. Consequently, it is possible to improve computational performance by establishing a more appropriate choice of precision. The only input that is needed is a method to determine which real variables can be represented with fewer bits without affecting the accuracy of the results. This paper presents a novel method that enables modern and legacy codes to benefit from a reduction of the precision of certain variables without sacrificing accuracy. It consists of a simple idea: we reduce the precision of a group of variables and measure how it affects the outputs. Then we can evaluate the level of precision that they truly need. Modifying and recompiling the code for each case that has to be evaluated would require a prohibitive amount of effort. Instead, the method presented in this paper relies on the use of a tool called a reduced-precision emulator (RPE) that can significantly streamline the process. Using the RPE and a list of parameters containing the precisions that will be used for each real variable in the code, it is possible within a single binary to emulate the effect on the outputs of a specific choice of precision. When we are able to emulate the effects of reduced precision, we can proceed with the design of the tests that will give us knowledge of the sensitivity of the model variables regarding their numerical precision. The number of possible combinations is prohibitively large and therefore impossible to explore. The alternative of performing a screening of the variables individually can provide certain insight about the required precision of variables, but, on the other hand, other complex interactions that involve several variables may remain hidden. Instead, we use a divide-and-conquer algorithm that identifies the parts that require high precision and establishes a set of variables that can handle reduced precision. This method has been tested using two state-of-the-art ocean models, the Nucleus for European Modelling of the Ocean (NEMO) and the Regional Ocean Modeling System (ROMS), with very promising results. Obtaining this information is crucial to build an actual mixed-precision version of the code in the next phase that will bring the promised performance benefits.
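The emulation idea can be illustrated with a few lines of NumPy. This is not the RPE itself, which is a Fortran library with its own rounding behaviour and exponent handling; the function below is only a minimal stand-in that zeroes trailing significand bits of double-precision values.

```python
import numpy as np


def truncate_significand(x, bits):
    """Zero the trailing (52 - bits) significand bits of float64 values.

    Illustrative stand-in for a reduced-precision emulator: bits=52 keeps
    full double precision, bits=23 roughly mimics single precision.  The
    real RPE also rounds and restricts the exponent range, which this
    plain truncation does not.
    """
    x = np.ascontiguousarray(x, dtype=np.float64)
    mask = np.uint64(~((1 << (52 - bits)) - 1) & 0xFFFFFFFFFFFFFFFF)
    return (x.view(np.uint64) & mask).view(np.float64)


# Emulate a field held at single-precision significand width.
field = np.array([3.141592653589793, 2.718281828459045])
print(truncate_significand(field, 23))
```

In the actual workflow, wrapping selected real variables in such a reduced-precision type is what allows a single binary, driven by a list of per-variable precisions, to mimic many different precision choices without recompilation.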


2019 ◽  
Author(s):  
Oriol Tintó Prims ◽  
Mario C. Acosta ◽  
Andrew M. Moore ◽  
Miguel Castrillo ◽  
Kim Serradell ◽  
...  

Abstract. Mixed-precision approaches can provide substantial speed-ups for both computing- and memory-bound codes with little effort. Most scientific codes have overengineered the numerical precision, leading to a situation where models are using more resources than required without knowing where these resources are unnecessary and where they are really needed. Consequently, there is the possibility of obtaining performance benefits from a more appropriate choice of precision, and the only thing that is needed is a method to determine which real variables can be represented with fewer bits without affecting the accuracy of the results. This paper presents a novel method to enable modern and legacy codes to benefit from a reduction of precision without sacrificing accuracy. It consists of a simple idea: if we can measure how reducing the precision of a group of variables affects the outputs, we can evaluate the level of precision this group of variables needs. Modifying and recompiling the code for each case that has to be evaluated would require a prohibitive amount of effort. Instead, the method presented in this paper relies on the use of a tool called the Reduced Precision Emulator (RPE) that can significantly streamline the process. Using the RPE and a list of parameters containing the precisions that will be used for each real variable in the code, it is possible within a single binary to emulate the effect on the outputs of a specific choice of precision. Once we are able to emulate the effects of reduced precision, we can proceed with the design of the tests required to obtain knowledge about all the variables in the model. The number of possible combinations is prohibitively large and impossible to explore. The alternative of performing a screening of the variables individually can give some insight into the precision needed by each variable, but more complex interactions that involve several variables may remain hidden. Instead, we use a divide-and-conquer algorithm that identifies the parts that cannot handle reduced precision and builds a set of variables that can. The method has been put to the test using two state-of-the-art ocean models, NEMO and ROMS, with very promising results. Obtaining this information is crucial for subsequently building an actual mixed-precision version of the code that will bring the promised performance benefits.


2020 ◽  
Author(s):  
Jiayi Lai

The next generation of weather and climate models will have an unprecedented level of resolution and model complexity, which also increases the requirements on computation and memory speed. Reducing the precision of certain variables and using mixed-precision methods in atmospheric models can greatly improve computing and memory speed. However, in order to ensure the accuracy of the results, most models have over-designed their numerical precision, with the result that the resources occupied are much larger than those actually required. Previous studies have shown that the precision necessary for an accurate weather model has a clear scale dependence, with large spatial scales requiring higher precision than small scales; even at large scales the necessary precision is far below that of double precision. However, it is difficult to find a principled method for assigning different precisions to different variables so as to avoid unnecessary waste. This paper takes CESM1.2.1 as a research object, conducts a large number of reduced-precision tests, and proposes a new discrimination method similar to the CFL criterion. This method allows the verification of individual variables, thereby determining which variables can use a lower level of precision without degrading the accuracy of the results.


2020 ◽  
Vol 148 (4) ◽  
pp. 1541-1552 ◽  
Author(s):  
Sam Hatfield ◽  
Andrew McRae ◽  
Tim Palmer ◽  
Peter Düben

Abstract The use of single-precision arithmetic in ECMWF’s forecasting model gave a 40% reduction in wall-clock time over double precision, with no decrease in forecast quality. However, the use of reduced precision in 4D-Var data assimilation is relatively unexplored, and there are potential issues with using single precision in the tangent-linear and adjoint models. Here, we present the results of reducing numerical precision in an incremental 4D-Var data assimilation scheme, with an underlying two-layer quasigeostrophic model. The minimizer used is the conjugate gradient method. We show how reducing precision increases the asymmetry between the tangent-linear and adjoint models. For ill-conditioned problems, this leads to a loss of orthogonality among the residuals of the conjugate gradient algorithm, which slows the convergence of the minimization procedure. However, we also show that a standard technique, reorthogonalization, eliminates these issues and therefore could allow the use of single-precision arithmetic. This work is carried out within ECMWF’s data assimilation framework, the Object-Oriented Prediction System.
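The reorthogonalization mentioned above can be sketched for a generic symmetric positive-definite system: keep the normalized residuals and project each new residual against them (modified Gram-Schmidt) before continuing. The Python sketch below is illustrative and is not the ECMWF OOPS code; the incremental 4D-Var problem is replaced by a plain system `A`, `b`.

```python
import numpy as np


def cg_reorth(A, b, tol=1e-6, max_iter=200, dtype=np.float32):
    """Conjugate gradients with explicit reorthogonalization of the residuals.

    In exact arithmetic the residuals are mutually orthogonal; in reduced
    precision this degrades for ill-conditioned A and slows convergence.
    Projecting each new residual against the stored, normalized residuals
    (modified Gram-Schmidt) restores orthogonality at the cost of extra
    memory and dot products.
    """
    A = A.astype(dtype)
    b = b.astype(dtype)
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    residuals = [r / np.linalg.norm(r)]              # normalized residual history
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        for q in residuals:                          # reorthogonalization step
            r_new = r_new - (q @ r_new) * q
        if np.linalg.norm(r_new) < tol * np.linalg.norm(b):
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        residuals.append(r_new / np.linalg.norm(r_new))
        r = r_new
    return x


A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(cg_reorth(A, b))   # ~ [0.0909, 0.6364]
```

The price is one stored vector and a handful of extra dot products per iteration, the overhead traded against being able to run the minimization in reduced precision.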


Geophysics ◽  
1996 ◽  
Vol 61 (2) ◽  
pp. 357-364 ◽  
Author(s):  
Horst Holstein ◽  
Ben Ketteridge

Analytical formulas for the gravity anomaly of a uniform polyhedral body are subject to numerical error that increases with distance from the target, while the anomaly decreases. This leads to a limited range of target distances in which the formulas are operational, beyond which the calculations are dominated by rounding error. We analyze the sources of error and propose a combination of numerical and analytical procedures that exhibit advantages over existing methods, namely (1) errors that diminish with distance, (2) enhanced operating range, and (3) algorithmic simplicity. The latter is achieved by avoiding the need to transform coordinates and the need to discriminate between projected observation points that lie inside, on, or outside a target facet boundary. Our error analysis is verified in computations based on a published code and on a code implementing our methods. The former requires a numerical precision of one part in [Formula: see text] (double precision) in problems of geophysical interest, whereas our code requires a precision of one part in [Formula: see text] (single precision) to give comparable results, typically in half the execution time.


Author(s):  
Diego Rossinelli ◽  
Christian Conti ◽  
Petros Koumoutsakos

Particle–mesh interpolations are fundamental operations for particle-in-cell codes, as implemented in vortex methods, plasma dynamics and electrostatics simulations. In these simulations, the mesh is used to solve the field equations and the gradients of the fields are used in order to advance the particles. The time integration of particle trajectories is performed through an extensive resampling of the flow field at the particle locations. The computational performance of this resampling turns out to be limited by the memory bandwidth of the underlying computer architecture. We investigate how mesh–particle interpolation can be efficiently performed on graphics processing units (GPUs) and multicore central processing units (CPUs), and we present two implementation techniques. The single-precision results for the multicore CPU implementation show an acceleration of 45–70×, depending on system size, and an acceleration of 85–155× for the GPU implementation over an efficient single-threaded C++ implementation. In double precision, we observe a performance improvement of 30–40× for the multicore CPU implementation and 20–45× for the GPU implementation. With respect to the 16-threaded standard C++ implementation, the present CPU technique leads to a performance increase of roughly 2.8–3.7× in single precision and 1.7–2.4× in double precision, whereas the GPU technique leads to an improvement of 9× in single precision and 2.2–2.8× in double precision.
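Stripped of the kernel order and dimensionality, the mesh-to-particle resampling is a gather weighted by interpolation coefficients. The sketch below shows the 1-D linear case in NumPy purely to fix ideas; the implementations discussed in the paper are multi-dimensional, higher order, and tuned for GPU and multicore memory systems.

```python
import numpy as np


def gather_linear_1d(field, h, xp):
    """Sample a 1-D mesh field at particle positions with linear (hat) weights.

    field -- values on mesh nodes located at i * h, i = 0 .. field.size - 1
    xp    -- particle positions inside the mesh
    """
    s = xp / h                                              # positions in mesh units
    i = np.clip(np.floor(s).astype(np.int64), 0, field.size - 2)
    w = s - i                                               # fractional offset from node i
    return (1.0 - w) * field[i] + w * field[i + 1]          # memory-bound gather


# A single-precision field on 11 nodes, sampled at two particle locations.
mesh = np.linspace(0.0, 1.0, 11, dtype=np.float32) ** 2
print(gather_linear_1d(mesh, h=0.1, xp=np.array([0.25, 0.73], dtype=np.float32)))
```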


2018 ◽  
Author(s):  
Pavel Pokhilko ◽  
Evgeny Epifanovsky ◽  
Anna I. Krylov

Using a single-precision floating-point representation reduces the size of data and computation time by a factor of two relative to the double precision conventionally used in electronic structure programs. For large-scale calculations, such as those encountered in many-body theories, the reduced memory footprint alleviates memory and input/output bottlenecks. The reduced size of data can lead to additional gains due to improved parallel performance on CPUs and various accelerators. However, using single precision can potentially reduce the accuracy of computed observables. Here we report an implementation of coupled-cluster and equation-of-motion coupled-cluster methods with single and double excitations in single precision. We consider both the standard implementation and one using Cholesky decomposition or resolution-of-the-identity representations of the electron-repulsion integrals. Numerical tests illustrate that when single precision is used in correlated calculations, the loss of accuracy is insignificant, and a pure single-precision implementation can be used for computing energies, analytic gradients, excited states, and molecular properties. In addition to pure single-precision calculations, our implementation allows one to follow a single-precision calculation with clean-up iterations, fully recovering double-precision results while retaining significant savings.
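The clean-up strategy can be illustrated on a far simpler fixed-point iteration than the coupled-cluster amplitude equations: converge in single precision first, then restart the same iteration in double precision from the single-precision answer. The Jacobi solver below is only an analogy under that framing and is not taken from the paper's implementation.

```python
import numpy as np


def jacobi(A, b, x0, tol, max_iter, dtype):
    """Plain Jacobi iterations carried out at the requested precision."""
    A = A.astype(dtype)
    b = b.astype(dtype)
    x = x0.astype(dtype)
    D = np.diag(A)                        # diagonal part
    R = A - np.diag(D)                    # off-diagonal part
    for _ in range(max_iter):
        x_new = (b - R @ x) / D
        if np.linalg.norm(x_new - x) < tol * np.linalg.norm(b):
            return x_new
        x = x_new
    return x


# Converge cheaply in single precision, then run a few "clean-up"
# iterations in double precision starting from the single-precision answer.
rng = np.random.default_rng(0)
A = np.eye(50) * 10 + rng.standard_normal((50, 50)) * 0.1   # diagonally dominant
b = rng.standard_normal(50)
x_sp = jacobi(A, b, np.zeros(50), tol=1e-6, max_iter=200, dtype=np.float32)
x_dp = jacobi(A, b, x_sp, tol=1e-14, max_iter=200, dtype=np.float64)
print(np.linalg.norm(A @ x_dp - b))      # residual at full double precision
```

Because the double-precision phase starts close to the solution, only a few extra iterations are needed to recover full accuracy, which mirrors the savings reported for the clean-up iterations.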


2021 ◽  
Vol 21 (11) ◽  
pp. 281
Author(s):  
Qiao Wang ◽  
Chen Meng

Abstract We present a GPU-accelerated cosmological simulation code, PhotoNs-GPU, based on the Particle-Mesh Fast Multipole Method (PM-FMM), and focus on GPU utilization and optimization. A suitable interpolation method for the truncated gravity is introduced to speed up the special functions in the kernels. We verify the GPU code in mixed precision and at different levels of the interpolation method on the GPU. A run with single precision is roughly two times faster than double precision for current practical cosmological simulations, but it can introduce a small unbiased noise in the power spectrum. Compared with the CPU version of PhotoNs and Gadget-2, the efficiency of the new code is significantly improved. With all the optimizations of memory access, kernel functions and concurrency management activated, the peak performance of our test runs reaches 48% of the theoretical speed and the average performance approaches ∼35% on the GPU.


2020 ◽  
Author(s):  
Alessandro Cotronei ◽  
Thomas Slawig

Abstract. We converted the radiation part of the atmospheric model ECHAM to single-precision arithmetic. We analyzed different conversion strategies and finally used a step-by-step change of all modules, subroutines and functions. We found that a small code portion still requires higher-precision arithmetic. We generated code that can be easily changed from double to single precision and vice versa, essentially via a simple switch in one module. We compared the output of the single-precision version at coarse resolution with observational data and with the original double-precision code. The results of both versions are comparable. We extensively tested different parallelization options with respect to the possible performance gain, in both coarse and low resolution. The single-precision radiation itself was accelerated by about 40%, whereas the speed-up for the whole ECHAM model using the converted radiation reached 18% in the best configuration. We further measured the energy consumption, which could also be reduced.
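A precision switch concentrated in one module can be mimicked in a few lines; the snippet below is a hypothetical Python analogue of a Fortran working-precision kind parameter, not ECHAM code.

```python
# precision.py -- the single place where the working precision is chosen,
# loosely analogous to a Fortran working-precision kind parameter.
# All names here are illustrative, not taken from ECHAM.
import numpy as np

USE_SINGLE = True
WP = np.float32 if USE_SINGLE else np.float64   # working precision
DP = np.float64                                 # for the few parts that must stay double


def zeros(shape):
    """Allocate model arrays in the working precision."""
    return np.zeros(shape, dtype=WP)
```

Every array allocation and constant then refers to `WP`, so flipping `USE_SINGLE` converts the whole code at once, except for the code paths explicitly kept in `DP`.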

