speedup factor
Recently Published Documents


TOTAL DOCUMENTS: 34 (five years: 9)

H-INDEX: 6 (five years: 1)

2021 ◽ Author(s): Xingwu Liu, Zizhao Chen, Xin Han, Zhenyu Sun, Zhishan Guo

2021 ◽ Vol 7 ◽ pp. e769 ◽ Author(s): Bérenger Bramas

The way developers implement their algorithms, and how these implementations behave on modern CPUs, is governed by the design and organization of the underlying hardware. The vectorization units (SIMD) are among the few CPU components that can and must be explicitly controlled. In the HPC community, x86 CPUs and their vectorization instruction sets were the de facto standard for decades. Each new release of an instruction set usually doubled the vector length and added new operations, and each generation pushed developers to adapt and improve previous implementations. The release of the ARM Scalable Vector Extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the coming years. Second, SVE’s interface differs from the x86 extensions in several aspects: it provides different instructions, uses a predicate to control most operations, and has a vector size that is only known at execution time. Therefore, using SVE raises new challenges in adapting algorithms, including those that are already well optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates, how we manage the non-static vector size, and how we efficiently implement the sorting kernels. Our approach only needs an auxiliary array of size O(log N) for the recursive calls in the partitioning phase, both in the sequential and in the parallel case. We test the performance of our approach on a modern ARMv8.2 (A64FX) CPU and assess the different layers of our implementation by sorting/partitioning integers, double-precision floating-point numbers, and key/value pairs of integers. Our results show that our approach is faster than the GNU C++ sort algorithm by a speedup factor of 4 on average.
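The hybrid structure the abstract describes — recurse with a partitioning step, switch to a bitonic sorting network once partitions are small — can be sketched in scalar Python. This is only an illustration of the control flow under assumed details (threshold value, pivot choice); the paper's actual kernels are vectorized with SVE predicates and partition in place.

```python
import math

SMALL = 16  # assumed threshold below which the bitonic network is used

def bitonic_small_sort(chunk):
    """Sort a small list with a bitonic sorting network, padding to a
    power-of-two length with +inf sentinels (the network requires 2^k items)."""
    if len(chunk) < 2:
        return list(chunk)
    n = 1 << math.ceil(math.log2(len(chunk)))
    data = list(chunk) + [float("inf")] * (n - len(chunk))
    k = 2
    while k <= n:                      # stage: build/merge bitonic runs
        j = k // 2
        while j > 0:                   # compare-exchange distance
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data[: len(chunk)]

def hybrid_sort(values):
    """Hybrid Quicksort/Bitonic sort.  Partitions live on an explicit stack;
    pushing the larger side first keeps the stack depth O(log N), as in the
    paper.  (The paper partitions in place with SVE; this scalar sketch
    uses throwaway lists for clarity.)"""
    a = list(values)
    stack = [(0, len(a))]
    while stack:
        lo, hi = stack.pop()
        if hi - lo <= SMALL:
            a[lo:hi] = bitonic_small_sort(a[lo:hi])
            continue
        pivot = a[(lo + hi) // 2]
        seg = a[lo:hi]
        left = [x for x in seg if x < pivot]
        mid = [x for x in seg if x == pivot]
        right = [x for x in seg if x > pivot]
        a[lo:hi] = left + mid + right
        p0, p1 = lo + len(left), lo + len(left) + len(mid)
        big, small = (lo, p0), (p1, hi)
        if p0 - lo < hi - p1:
            big, small = small, big
        stack.append(big)    # larger side deferred
        stack.append(small)  # smaller side processed next
    return a
```

In the SVE version, the compare-exchange steps of the network and the partitioning loop are expressed with predicated vector instructions, which is what makes the vector-length-agnostic design possible.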


2021 ◽ Vol 8 ◽ Author(s): Federico Errica, Marco Giulini, Davide Bacciu, Roberto Menichetti, Alessio Micheli, ...

The limits of molecular dynamics (MD) simulations of macromolecules are steadily pushed forward by the relentless development of computer architectures and algorithms. The consequent explosion in the number and extent of MD trajectories induces the need for automated methods to rationalize the raw data and make quantitative sense of them. Recently, an algorithmic approach was introduced by some of us to identify the subset of a protein’s atoms, or mapping, that enables the most informative description of the system. This method relies on the computation, for a given reduced representation, of the associated mapping entropy, that is, a measure of the information loss due to such a simplification; albeit relatively straightforward, this calculation can be time-consuming. Here, we describe the implementation of a deep learning approach aimed at accelerating the calculation of the mapping entropy. We rely on Deep Graph Networks, which provide extreme flexibility in handling structured input data and whose predictions prove to be accurate and remarkably efficient. The trained network produces a speedup factor as large as 10⁵ with respect to the algorithmic computation of the mapping entropy, enabling the reconstruction of its landscape by means of the Wang–Landau sampling scheme. Applications of this method reach far beyond this one, as the proposed pipeline is easily transferable to the computation of arbitrary properties of a molecular structure.
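Wang–Landau sampling, which the authors use to reconstruct the mapping-entropy landscape once the network makes each evaluation cheap, can be sketched on a toy system. This is a hypothetical one-dimensional example (the density of states of independent up/down spins, where the exact answer is a binomial coefficient), not the mapping-entropy setting; sweep length and flatness threshold are assumed values.

```python
import math
import random

def wang_landau(n_spins=10, f_final=1e-3, flat=0.8, seed=1):
    """Wang-Landau estimate of ln g(E) for n_spins independent up/down
    spins, with E = number of 'up' spins, so the exact g(E) = C(n, E).
    The modification factor ln f is halved whenever the visit histogram
    is sufficiently flat, until it drops below f_final."""
    random.seed(seed)
    n_E = n_spins + 1
    ln_g = [0.0] * n_E          # running estimate of ln g(E)
    hist = [0] * n_E            # visit histogram at the current ln f level
    state = [0] * n_spins       # all spins down
    E = 0
    ln_f = 1.0
    while ln_f > f_final:
        for _ in range(10000):  # one sweep of single-spin-flip proposals
            i = random.randrange(n_spins)
            E_new = E + (1 if state[i] == 0 else -1)
            # accept with probability min(1, g(E)/g(E_new))
            if ln_g[E] >= ln_g[E_new] or \
               random.random() < math.exp(ln_g[E] - ln_g[E_new]):
                state[i] ^= 1
                E = E_new
            ln_g[E] += ln_f
            hist[E] += 1
        mean = sum(hist) / n_E
        if min(hist) > flat * mean:  # histogram flat enough: refine f
            hist = [0] * n_E
            ln_f /= 2.0
    # fix the arbitrary additive constant so that ln g(0) = ln C(n,0) = 0
    return [x - ln_g[0] for x in ln_g]
```

In the paper's pipeline the role of `E` is played by the mapping entropy of a candidate reduced representation, evaluated by the trained network instead of the expensive algorithmic computation.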


2021 ◽ pp. 104743 ◽ Author(s): Xingwu Liu, Xin Han, Liang Zhao, Zhishan Guo

2021 ◽ Vol 247 ◽ pp. 04003 ◽ Author(s): Andrew Cox, Albrecht Kyrieleis, Sam Powell-Gill, Simon Richards, Francesco Tantillo

The primary goal of this paper is to increase the efficiency of criticality and burnup calculations in the ANSWERS MONK® Monte Carlo code [1]. Two ways of achieving this goal are investigated as part of the H2020 McSAFE Project: creating a unified energy grid for all materials in the model, and reducing the spread in the variances of the fluxes for depletable materials using a generated optimised importance map. The average tracking speedup factor across all cycles of all burnup calculations run using the unified energy grid, at base temperature, was found to be 1.96. For criticality calculations at 400 K with runtime Doppler broadening, the unified grid approach gave a total speedup factor of 7.32, demonstrating the potential of this method to reduce calculation time for models with runtime Doppler broadening. The use of the generated optimised importance map has been demonstrated to significantly reduce the spread in the standard deviations of the fluxes in the fuel pins across two different test cases. If a solution is required in which the standard deviation in no fuel pin exceeds 5%, the number of scoring stages required was more than halved, highlighting the potential of the outlined methodology to speed up burnup credit calculations.
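The unified-grid idea can be sketched as follows: each material's cross sections live on their own energy grid, so locating the interval for a particle's energy normally costs one binary search per material. A unified grid precomputes, for every merged energy point, the corresponding index into each material's grid, reducing tracking-time lookup to a single binary search. This is a hypothetical Python sketch of that data structure, not MONK's implementation.

```python
import bisect

def build_unified_grid(material_grids):
    """Merge per-material energy grids into one sorted unified grid and,
    for each unified point, store the interval index into every material's
    own grid.  index_map[m][k] is material m's interval for unified bin k."""
    unified = sorted({e for g in material_grids for e in g})
    index_map = []
    for grid in material_grids:
        row = [max(0, min(bisect.bisect_right(grid, e) - 1, len(grid) - 2))
               for e in unified]
        index_map.append(row)
    return unified, index_map

def material_index(unified, index_map, m, energy):
    """One binary search on the unified grid resolves the interval
    for material m (instead of one search per material grid)."""
    k = bisect.bisect_right(unified, energy) - 1
    k = min(max(k, 0), len(unified) - 2)
    return index_map[m][k]
```

The trade-off is memory: the index map stores one integer per material per unified point, which is why a unified grid pays off most when lookups dominate, e.g. with runtime Doppler broadening.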


10.29007/hb5r ◽ 2019 ◽ Author(s): Mohammad Alkhamis, Amirali Baniasadi

cn.MOPS is a frequently cited model-based algorithm used to quantitatively detect copy-number variations in next-generation DNA-sequencing data. Previous work implemented the algorithm as an R package and achieved considerable yet limited performance improvement by employing multi-CPU parallelism (the maximum achievable speedup was experimentally determined to be 9.24). In this paper, we propose an alternative acceleration mechanism. Using one CPU core and a GPU device, our solution, gcn.MOPS, achieves a speedup factor of 159 and reduces memory usage by more than half compared to cn.MOPS running on one CPU core.


2019 ◽ Vol 24 (1) ◽ pp. 131-142 ◽ Author(s): E. Tengs, F. Charrassier, M. Holst, Pål-Tore Storli

Abstract As part of an ongoing study into hydropower runner failure, a submerged, vibrating blade is investigated both experimentally and numerically. The numerical simulations performed are fully coupled acoustic-structural simulations in ANSYS Mechanical. In order to speed up the simulations, a model order reduction technique based on Krylov subspaces is implemented. This paper presents a comparison between the full ANSYS harmonic response and the reduced order model, and shows excellent agreement. The speedup factor obtained by using the reduced order model is shown to be between one and two orders of magnitude. The number of dimensions in the reduced subspace needed for accurate results is investigated, and confirms what is found in other studies on similar model order reduction applications. In addition, experimental results are available for validation and show a good match when not too far from the resonance peak.
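The Krylov projection idea can be sketched on an undamped harmonic response problem (K − ω²M)x = f: build an orthonormal basis V of the Krylov space spanned by K⁻¹f, (K⁻¹M)K⁻¹f, …, then solve the tiny projected system Vᵀ(K − ω²M)V y = Vᵀf at each frequency. This is a toy pure-Python illustration under assumed matrices (a spring chain); the paper's systems are damped acoustic-structural models in ANSYS.

```python
def solve(A, b):
    """Dense Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fac * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def krylov_basis(K, M, f, r):
    """Orthonormal basis of span{K^-1 f, (K^-1 M) K^-1 f, ...} (r vectors),
    built with modified Gram-Schmidt."""
    V, v = [], solve(K, f)
    for _ in range(r):
        for u in V:
            c = dot(u, v)
            v = [a - c * b for a, b in zip(v, u)]
        nrm = dot(v, v) ** 0.5
        if nrm < 1e-12:
            break
        v = [a / nrm for a in v]
        V.append(v)
        v = solve(K, matvec(M, v))
    return V

def reduced_response(K, M, f, V, w2):
    """Galerkin-project (K - w2*M) x = f onto span(V); return x ~= V y.
    The reduced system has dimension len(V), not len(f)."""
    r, n = len(V), len(f)
    Kr = [[dot(V[i], matvec(K, V[j])) - w2 * dot(V[i], matvec(M, V[j]))
           for j in range(r)] for i in range(r)]
    fr = [dot(V[i], f) for i in range(r)]
    y = solve(Kr, fr)
    return [sum(V[i][j] * y[i] for i in range(r)) for j in range(n)]
```

The speedup comes from assembling V once and then solving an r×r system per frequency instead of an n×n one; the reduced model matches the leading moments of the transfer function at ω = 0, which is why accuracy is best at low frequency.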


2019 ◽ Author(s): Adrián Pousa

Asymmetric multicore processors (AMPs) are a low-power alternative to conventional multicore processors built from identical cores, but they also pose major challenges for system software. AMPs combine complex high-performance cores with simple low-power cores. Most existing scheduling algorithms for AMPs try to optimize overall throughput; however, these algorithms degrade other aspects such as fairness or energy efficiency. The main goal of this doctoral thesis is to overcome these limitations by designing more flexible scheduling strategies for AMPs. We also show the impact that optimizing one metric has on the others. To improve overall throughput, fairness, or energy efficiency on AMPs, the scheduler must take into account the benefit each application obtains from the different cores of an AMP. Since not all running threads in a workload derive the same relative benefit (speedup factor, SF) from using a high-performance core, this diversity of SFs must be taken into account when optimizing the various objectives. The operating system (OS) must determine the SF of each running thread effectively. In this thesis we propose a general methodology for building accurate SF-estimation models based on hardware performance counters. Most existing scheduling algorithms for AMPs have been evaluated using simulators, emulated asymmetric platforms, or user-mode scheduler prototypes. By contrast, in this doctoral thesis we evaluate the proposed algorithms in a realistic setting: using implementations of the algorithms in the kernel of real OSs, running on real asymmetric multicore hardware.
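The throughput-oriented baseline the thesis argues against can be sketched in a few lines: given each thread's speedup factor, grant the big cores to the threads that benefit most from them. This is a minimal hypothetical sketch of SF-driven placement, not the thesis's algorithms.

```python
def assign_threads(sf, n_big):
    """Throughput-only SF-driven placement: the n_big threads with the
    highest speedup factor (relative benefit from a big core) run on big
    cores; the rest run on little cores.  sf[t] is thread t's SF."""
    order = sorted(range(len(sf)), key=lambda t: sf[t], reverse=True)
    return order[:n_big], order[n_big:]
```

As the abstract notes, a policy like this degrades fairness: low-SF threads never see a big core, so their slowdown relative to running alone is much larger than that of high-SF threads. The fairness- and energy-aware schedulers proposed in the thesis trade some aggregate speedup to balance these objectives.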


2019 ◽ Vol 11 (1) ◽ pp. 49-70 ◽ Author(s): Mohsin Altaf Wani, Manzoor Ahmad

Modern GPUs perform computation at a very high rate compared to CPUs; as a result, they are increasingly used for general-purpose parallel computation. Constructing a statically optimal binary search tree is an optimization problem: find the arrangement of nodes in a binary search tree that minimizes the average search time. Knuth's modification to the dynamic programming algorithm improves the time complexity to O(n²). We develop a multi-GPU implementation of this algorithm using different approaches. Choosing the GPU implementation best suited to a given workload provides a speedup of up to four times over the other GPU-based implementations. We achieve a speedup factor of 409 on an older GTX 570 and of 745 on a more modern GTX 1060 when compared to a conventional single-threaded CPU-based implementation.
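The underlying CPU algorithm is the classic optimal-BST dynamic program with Knuth's speedup: the optimal root of the subproblem over keys i..j is monotone, root[i][j−1] ≤ root[i][j] ≤ root[i+1][j], which shrinks the total work from O(n³) to O(n²). A sequential Python sketch (the papers' GPU versions parallelize the cells within each diagonal):

```python
def optimal_bst_cost(freq):
    """Knuth's O(n^2) DP for the optimal static BST.
    freq[k] is the access frequency of key k; the returned value is the
    minimal weighted search cost (root at depth 1).  cost[i][j] covers
    keys i..j inclusive; root[i][j] is the chosen root, searched only in
    the Knuth window [root[i][j-1], root[i+1][j]]."""
    n = len(freq)
    cost = [[0.0] * n for _ in range(n)]
    root = [[0] * n for _ in range(n)]
    pref = [0.0]                       # prefix sums for subtree weights
    for f in freq:
        pref.append(pref[-1] + f)
    w = lambda i, j: pref[j + 1] - pref[i]
    for i in range(n):
        cost[i][i] = freq[i]
        root[i][i] = i
    for length in range(2, n + 1):     # fill by increasing interval length
        for i in range(n - length + 1):
            j = i + length - 1
            best, arg = float("inf"), i
            for r in range(root[i][j - 1], root[i + 1][j] + 1):
                left = cost[i][r - 1] if r > i else 0.0
                right = cost[r + 1][j] if r < j else 0.0
                if left + right < best:
                    best, arg = left + right, r
            cost[i][j] = best + w(i, j)  # every key's depth grows by 1
            root[i][j] = arg
    return cost[0][n - 1]
```

On a GPU, all cells of one anti-diagonal (one `length`) are independent and can be computed concurrently, which is the parallelism the multi-GPU implementations exploit.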

