A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)

2021 ◽  
Vol 7 ◽  
pp. e769
Author(s):  
Bérenger Bramas

How developers implement their algorithms, and how those implementations behave on modern CPUs, is governed by the design and organization of the CPUs themselves. The vectorization units (SIMD) are among the few parts of a CPU that can and must be explicitly controlled. In the HPC community, x86 CPUs and their vectorization instruction sets were the de facto standard for decades. Each new release of an instruction set usually doubled the vector length and added new operations, and each generation pushed developers to adapt and improve previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the coming years. Second, SVE’s interface differs from the x86 extensions in several respects: it provides different instructions, uses a predicate to control most operations, and has a vector size that is known only at execution time. Therefore, using SVE raises new challenges in adapting algorithms, including those that are already well optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates and how we manage the non-static vector size. We also explain how we efficiently implement the sorting kernels. Our approach needs only an array of size O(log N) for the recursive calls in the partitioning phase, both in the sequential and in the parallel case. We test the performance of our approach on a modern ARMv8.2 (A64FX) CPU and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results show that our approach is on average 4 times faster than the GNU C++ sort algorithm.
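The hybrid scheme described in this abstract can be illustrated with a minimal scalar sketch. This is not the paper's SVE implementation: it uses no vector intrinsics or predicates, and the names `kSmall`, `bitonic_sort_small`, and `hybrid_sort`, the threshold of 16, the middle-element pivot, and the padding strategy are all assumptions made for illustration. It shows only the overall structure: partitioning with an explicit stack of pending ranges, and a bitonic network for small partitions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical threshold: ranges of at most kSmall elements go to the bitonic network.
// It must be a power of two for the network below.
constexpr std::size_t kSmall = 16;

// Scalar bitonic sorting network over a fixed-size buffer.
// The input is padded with the maximum int so that any n <= kSmall works.
void bitonic_sort_small(int* data, std::size_t n) {
    assert(n <= kSmall);
    int buf[kSmall];
    std::fill(buf, buf + kSmall, std::numeric_limits<int>::max());
    std::copy(data, data + n, buf);
    for (std::size_t k = 2; k <= kSmall; k *= 2) {        // size of the bitonic sequences
        for (std::size_t j = k / 2; j > 0; j /= 2) {      // compare-exchange distance
            for (std::size_t i = 0; i < kSmall; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((buf[i] > buf[partner]) == ascending) {
                        std::swap(buf[i], buf[partner]);
                    }
                }
            }
        }
    }
    std::copy(buf, buf + n, data);  // the padding values end up at the tail
}

// Hybrid sort: partition large ranges, sort small ranges with the network.
void hybrid_sort(std::vector<int>& v) {
    if (v.size() < 2) return;
    // Explicit stack of pending [lo, hi) ranges instead of recursion.
    std::vector<std::pair<std::size_t, std::size_t>> pending;
    pending.emplace_back(0, v.size());
    while (!pending.empty()) {
        std::size_t lo = pending.back().first;
        std::size_t hi = pending.back().second;
        pending.pop_back();
        while (hi - lo > kSmall) {
            const int pivot = v[lo + (hi - lo) / 2];  // assumed pivot choice
            // Three-way split: [< pivot][== pivot][> pivot].
            auto mid1 = std::partition(v.begin() + lo, v.begin() + hi,
                                       [pivot](int x) { return x < pivot; });
            auto mid2 = std::partition(mid1, v.begin() + hi,
                                       [pivot](int x) { return x == pivot; });
            const std::size_t m1 = static_cast<std::size_t>(mid1 - v.begin());
            const std::size_t m2 = static_cast<std::size_t>(mid2 - v.begin());
            // Defer the larger side and continue with the smaller one,
            // the classic way to keep the number of pending ranges logarithmic.
            if (m1 - lo < hi - m2) {
                pending.emplace_back(m2, hi);
                hi = m1;
            } else {
                pending.emplace_back(lo, m1);
                lo = m2;
            }
        }
        if (hi > lo) bitonic_sort_small(v.data() + lo, hi - lo);
    }
}
```

Calling `hybrid_sort(values)` on a `std::vector<int>` sorts it in place in ascending order. A vectorized version would replace both `std::partition` calls and the inner compare-exchange loop with predicate-controlled vector operations whose width is queried at run time, which is the part the paper addresses.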

2020 ◽  
Vol 33 (109) ◽  
pp. 21-31
Author(s):  
І. Ya. Zeleneva ◽  
Т. V. Golub ◽  
T. S. Diachuk ◽  
А. Ye. Didenko

The purpose of this study is to develop an effective structure and the internal functional blocks of a digital computing device, an adder, that performs addition and subtraction on floating-point numbers represented in the IEEE Std 754-2008 format. To improve the characteristics of the adder, the circuit uses pipelining, that is, division into stages, each of which performs a specific action on the numbers. This allows addition and subtraction operations to be performed on several numbers at the same time, which increases computational performance and also makes the adder suitable for use in modern synchronous circuits. Each block of the pipelined adder structure on the FPGA is synthesized as a separate project of a digital functional unit, so the overall task is divided into separate subtasks, which facilitates experimental testing and phased debugging of the entire device. Experimental studies were performed using the Quartus II EDA tool. The developed circuit was modeled on FPGAs of the Stratix III and Cyclone III families. A functionally similar device from Altera served as the counterpart for comparison. A comparative analysis is made, and reasoned conclusions are drawn that the performance improvement is achieved through the pipelined structure of the adder. Implementing floating-point arithmetic on programmable logic integrated circuits, in particular on FPGAs, has advantages such as flexibility of use and low production costs, and it also makes it possible to solve problems for which no ready-made solutions exist in the form of standard devices on the market. The developed adder has a wide scope of application, since most modern computing devices need to process floating-point numbers.
The proposed pipelined model of the adder is quite simple to implement on an FPGA and can be an alternative to using built-in multipliers and processor cores in cases where the complex functionality of these devices is redundant for a specific task.


2016 ◽  
Vol 51 (1) ◽  
pp. 555-567
Author(s):  
Marc Andrysco ◽  
Ranjit Jhala ◽  
Sorin Lerner

2003 ◽  
Vol 12 (03) ◽  
pp. 333-351 ◽  
Author(s):  
B. Mesman ◽  
Q. Zhao ◽  
N. Busa ◽  
K. Leijten-Nowak

In current System-on-Chip (SoC) design, the main engineering trade-off is between hardware efficiency and design effort. Hardware efficiency traditionally concerns cost versus performance (in high-volume electronics), but recently energy consumption has emerged as a dominant criterion, even in products without batteries. "The" most effective way to increase HW efficiency is to exploit application characteristics in the HW. The traditional view of HW design, however, tends to consider it a time-consuming and tedious task. Given the current lack of HW designers and the pressure of time-to-market, there is clearly a desire to balance carefully the merits and the effort of tuning the HW to the application. This paper discusses methods and tool support for HW application-tuning at different levels of granularity. Furthermore, we treat several ways of applying reconfigurable HW to allow both silicon reuse and the ability to tune the HW to the application after fabrication. Our main focus is on a methodology for application-tuning the architecture of DSP datapaths. Our primary contribution lies in reusing and generalizing this methodology to application-tuning DSP instruction sets, and in providing tool support for efficient compilation for these instruction sets. Furthermore, we propose an architecture for a reconfigurable instruction decoder, enabling application-tuning of the instruction set after fabrication.


2004 ◽  
Vol 39 (4) ◽  
pp. 360-371 ◽  
Author(s):  
William D. Clinger
