SIMD, Single Instruction Multiple Data

Author(s):  
Wesley Petersen ◽  
Peter Arbenz

The single instruction, multiple data (SIMD) mode is the simplest method of parallelism and is now becoming the most common. In most cases, SIMD means the same as vectorization. Ten years ago, vector computers were expensive but reasonably simple to program. Today, encouraged by multimedia applications, vector hardware is commonly available in Intel Pentium III and Pentium 4 PCs and Apple/Motorola G-4 machines. In this chapter, we will cover both old and new, and find that the old programming paradigms were simpler because CMOS or ECL memories permitted easy non-unit-stride memory access. Most of the ideas are the same, so the simpler programming methodology makes it easy to understand the concepts. As PC and Mac compilers improve, perhaps automatic vectorization will become as effective as on the older non-cache machines. In the meantime, on PCs and Macs we will often need to use intrinsics ([23, 22, 51]). At first it seems that intrinsics keep a programmer close to the hardware, which is not a bad thing, but this is somewhat misleading: hardware control in this method of programming is only indirect. Actual register assignments are made by the compiler and may not be quite what the programmer wants. SSE2 or AltiVec programming serves to illustrate a form of instruction-level parallelism we wish to emphasize. This form, SIMD or vectorization, has single instructions which operate on multiple data. There are variants on this theme which use templates or macros consisting of multiple instructions carefully scheduled to accomplish the same objective, but which are not, strictly speaking, SIMD; for example, see Section 1.2.2.1. Intrinsics are C macros which contain one or more SIMD instructions to execute certain operations on multiple data, usually four words at a time in our case. Data are explicitly declared __m128 datatypes in the Intel SSE case and vector variables with the G-4 AltiVec. Our examples will show you how this works. Four basic concepts are important. Consistent with our notion that examples are the best way to learn, several will be illustrated:

• from linear algebra, the Level 1 basic linear algebra subprograms (BLAS)
  — vector updates (-axpy)
  — reduction operations and linear searches
• recurrence formulae and polynomial evaluations
• uniform random number generation.
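As a concrete illustration of the kind of intrinsic-based code the chapter discusses, here is a minimal sketch of the Level 1 BLAS vector update y = a*x + y written with Intel SSE intrinsics; the function name and the alignment/length assumptions are ours, not the authors'.

```c
#include <xmmintrin.h>  /* SSE intrinsics: the __m128 type and _mm_* macros */

/* y <- a*x + y, four single-precision words at a time.
   Assumes n is a multiple of 4 and x, y are 16-byte aligned. */
void saxpy_sse(int n, float a, const float *x, float *y)
{
    __m128 va = _mm_set1_ps(a);                   /* broadcast a into all 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(&x[i]);           /* load 4 consecutive floats   */
        __m128 vy = _mm_load_ps(&y[i]);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);  /* one instruction, 4 data     */
        _mm_store_ps(&y[i], vy);
    }
}
```

Note how the register-sized chunks of x and y are explicit in the source, while the actual register allocation is still left to the compiler, exactly the indirect hardware control described above.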

2009 ◽  
Vol 17 (1-2) ◽  
pp. 31-42 ◽  
Author(s):  
Jakub Kurzak ◽  
Jack Dongarra

The QR factorization is one of the most important operations in dense linear algebra, offering a numerically stable method for solving linear systems of equations, including overdetermined and underdetermined systems. Modern implementations of the QR factorization, such as the one in the LAPACK library, suffer from performance limitations due to the use of matrix–vector operations in the panel factorization phase. These limitations can be remedied by using the idea of updating the QR factorization, yielding an algorithm which is much more scalable and much more suitable for implementation on a multi-core processor. It is demonstrated how the potential of the Cell Broadband Engine can be utilized to the fullest by employing the new algorithmic approach and successfully exploiting the capabilities of the chip in terms of single-instruction-multiple-data parallelism, instruction-level parallelism, and thread-level parallelism.
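To make the algorithmic idea concrete, here is a hedged sketch of the driver loop of an updating ("tiled") QR factorization. The four tile kernels are hypothetical placeholders named after the small routines of the tiled-QR literature (GEQRT, UNMQR, TSQRT, TSMQR), left as stubs here; this is not the authors' Cell code, only the loop structure that replaces LAPACK's matrix–vector panel factorization with small, cache-resident tile operations.

```c
#include <stddef.h>

/* The matrix is an NT x NT grid of B x B tiles. Each kernel touches a
   constant number of tiles, so updates for different (i, j) pairs are
   independent and can be scheduled across cores or SPEs. The kernel
   bodies are stubs: only the dependency structure is the point. */
enum { B = 64, NT = 8 };
typedef struct { double v[B][B]; } tile_t;

static void tile_geqrt(tile_t *Akk, tile_t *Tkk) { (void)Akk; (void)Tkk; }
static void tile_unmqr(const tile_t *Akk, const tile_t *Tkk, tile_t *Akj)
{ (void)Akk; (void)Tkk; (void)Akj; }
static void tile_tsqrt(tile_t *Akk, tile_t *Aik, tile_t *Tik)
{ (void)Akk; (void)Aik; (void)Tik; }
static void tile_tsmqr(tile_t *Akj, tile_t *Aij, const tile_t *Aik, const tile_t *Tik)
{ (void)Akj; (void)Aij; (void)Aik; (void)Tik; }

void tiled_qr(tile_t A[NT][NT], tile_t T[NT][NT])
{
    for (int k = 0; k < NT; k++) {
        tile_geqrt(&A[k][k], &T[k][k]);               /* QR of the diagonal tile */
        for (int j = k + 1; j < NT; j++)              /* apply Q^T along row k   */
            tile_unmqr(&A[k][k], &T[k][k], &A[k][j]);
        for (int i = k + 1; i < NT; i++) {
            tile_tsqrt(&A[k][k], &A[i][k], &T[i][k]); /* annihilate tile A[i][k] */
            for (int j = k + 1; j < NT; j++)          /* update trailing tiles   */
                tile_tsmqr(&A[k][j], &A[i][j], &A[i][k], &T[i][k]);
        }
    }
}
```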


2021 ◽  
Vol 11 (3) ◽  
pp. 1225
Author(s):  
Woohyong Lee ◽  
Jiyoung Lee ◽  
Bo Kyung Park ◽  
R. Young Chul Kim

Geekbench is one of the most frequently referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic, but some of them aim to simulate real-world behavior. In the mobile world, its microarchitectural behavior has rarely been reported, since hardware profiling features are largely unavailable to the public. Thus, although Geekbench is a popular mobile performance workload, its microarchitectural characteristics on mobile devices are hard to find. In this paper, a thorough experimental study of Geekbench performance characterization is reported with detailed performance metrics. This study also identifies mobile system-on-chip (SoC) microarchitecture impacts, such as the cache subsystem, instruction-level parallelism, and branch performance. From the study, we could identify the bottlenecks of the workloads, especially in the cache subsystem: changing the data set size significantly impacts the performance score on some systems and will ruin the fairness of the CPU benchmark. In the experiment, Samsung's Exynos9820-based platform was used as the device under test, with binaries built using the Android Native Development Kit (NDK). The Exynos9820 is a superscalar processor capable of dual-issuing some instructions. To support the performance analysis, we enabled the collection of performance events through performance monitoring unit (PMU) registers. The PMU is a set of hardware performance counters built into microprocessors to store the counts of hardware-related activities. Throughout the experiment, functional and microarchitectural performance profiles were fully studied; the ARM DS-5 tool was used to collect runtime PMU profiles, including OS-level performance data. After the comparative study, users will understand more about mobile architecture behavior, which will help in evaluating which benchmark is preferable for a fair performance comparison.
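The paper collected PMU profiles with ARM DS-5; as a minimal, self-contained illustration of reading the same kind of hardware counters programmatically, the sketch below uses the Linux perf_event_open(2) interface (which Android kernels also expose) to measure cycles and instructions around a region of interest. The event choices and the dummy workload loop are our own illustrative assumptions, not the paper's setup.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Open one hardware counter on the calling thread; the kernel programs
   the underlying PMU register for us. Returns a file descriptor. */
static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = config;
    attr.disabled = 1;          /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;    /* count user-space activity only          */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    if (cyc < 0 || ins < 0) { perror("perf_event_open"); return 1; }

    ioctl(cyc, PERF_EVENT_IOC_RESET, 0);  ioctl(ins, PERF_EVENT_IOC_RESET, 0);
    ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0); ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;              /* placeholder region of interest */
    for (int i = 0; i < 1000000; i++) x += i * 0.5;

    ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0); ioctl(ins, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0, instrs = 0;
    if (read(cyc, &cycles, sizeof cycles) != sizeof cycles) return 1;
    if (read(ins, &instrs, sizeof instrs) != sizeof instrs) return 1;
    printf("cycles=%llu instructions=%llu IPC=%.2f\n",
           (unsigned long long)cycles, (unsigned long long)instrs,
           cycles ? (double)instrs / (double)cycles : 0.0);
    return 0;
}
```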


Author(s):  
Dennis Wolf ◽  
Andreas Engel ◽  
Tajas Ruschke ◽  
Andreas Koch ◽  
Christian Hochberger

Coarse-Grained Reconfigurable Arrays (CGRAs), or Architectures, are a concept for hardware accelerators based on the idea of distributing workload over Processing Elements. These processors exploit instruction-level parallelism while being energy efficient due to their simple internal structure. However, their incorporation into a complete computing system raises severe challenges at the hardware and software levels. This article evaluates in detail a CGRA integrated into a control engineering environment targeting a Xilinx Zynq System on Chip (SoC). Besides the actual application execution performance, the practicability of the configuration toolchain is validated. Challenges of the real-world integration are discussed and practical insights are highlighted.
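As a rough conceptual sketch of what "distributing workload over Processing Elements" means, the toy model below evaluates one static CGRA configuration: each PE holds a single operation and reads its operands from the outputs of producer PEs. This is our illustration of the general idea only, not the architecture or toolchain evaluated in the article.

```c
#include <stdio.h>

/* Each PE is statically configured with one operation; operand routing
   is expressed as indices of producer PEs, mimicking a spatial fabric. */
enum op { OP_CONST, OP_ADD, OP_MUL };

typedef struct {
    enum op op;
    int const_val;      /* used by OP_CONST                */
    int src0, src1;     /* indices of producer PEs         */
} pe_cfg;

int main(void)
{
    /* Configuration computing (a + b) * (a + 4) with a = 3, b = 5. */
    pe_cfg grid[6] = {
        { OP_CONST, 3, 0, 0 },   /* PE0: a          */
        { OP_CONST, 5, 0, 0 },   /* PE1: b          */
        { OP_CONST, 4, 0, 0 },   /* PE2: literal 4  */
        { OP_ADD,   0, 0, 1 },   /* PE3: a + b      */
        { OP_ADD,   0, 0, 2 },   /* PE4: a + 4      */
        { OP_MUL,   0, 3, 4 },   /* PE5: PE3 * PE4  */
    };
    int out[6];
    for (int i = 0; i < 6; i++) {        /* PEs fire in dataflow order;
                                            PE3 and PE4 could run in
                                            parallel on real hardware   */
        switch (grid[i].op) {
        case OP_CONST: out[i] = grid[i].const_val; break;
        case OP_ADD:   out[i] = out[grid[i].src0] + out[grid[i].src1]; break;
        case OP_MUL:   out[i] = out[grid[i].src0] * out[grid[i].src1]; break;
        }
    }
    printf("result = %d\n", out[5]);     /* (3+5)*(3+4) = 56 */
    return 0;
}
```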


2020 ◽  
Author(s):  
Yan Gao ◽  
Yongzhuang Liu ◽  
Yanmei Ma ◽  
Bo Liu ◽  
Yadong Wang ◽  
...  

Summary: Partial order alignment, which aligns a sequence to a directed acyclic graph, is now frequently used as a key component in long-read error correction and assembly. We present abPOA (adaptive banded Partial Order Alignment), a Single Instruction Multiple Data (SIMD) based C library for fast partial order alignment using adaptive banded dynamic programming. It can work as a stand-alone multiple sequence alignment and consensus calling tool or be easily integrated into any long-read error correction and assembly workflow. Compared to a state-of-the-art tool (SPOA), abPOA is up to 15 times faster with a comparable alignment accuracy.

Availability and implementation: abPOA is implemented in C. A stand-alone tool and a C/Python software interface are freely available at https://github.com/yangao07/abPOA.

Contact: [email protected] or [email protected]
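Much of the speedup comes from banding: the aligner fills only a strip of the dynamic-programming matrix rather than all of it. The sketch below is our own minimal fixed-band, sequence-to-sequence version of that idea (abPOA's actual band is adaptive and applies to sequence-to-graph alignment); the scores and the band half-width W are illustrative assumptions.

```c
#include <stdio.h>
#include <string.h>
#include <limits.h>

/* Banded global alignment: only cells within +/-W of the diagonal are
   filled, so work drops from O(n*m) to O(n*W). Cells outside the band
   keep the sentinel NEG and never contribute a best score. */
enum { W = 8, MATCH = 2, MISMATCH = -4, GAP = -4, NEG = INT_MIN / 2 };

int banded_align(const char *a, const char *b)
{
    int n = (int)strlen(a), m = (int)strlen(b);
    static int S[256][256];               /* assumes n, m < 256 for brevity */
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= m; j++) S[i][j] = NEG;
    S[0][0] = 0;
    for (int i = 0; i <= n; i++) {
        int lo = i - W < 0 ? 0 : i - W;   /* fixed band around the diagonal */
        int hi = i + W > m ? m : i + W;
        for (int j = lo; j <= hi; j++) {
            if (i == 0 && j == 0) continue;
            int best = NEG;
            if (i > 0 && j > 0) {         /* match or mismatch              */
                int s = S[i-1][j-1] + (a[i-1] == b[j-1] ? MATCH : MISMATCH);
                if (s > best) best = s;
            }
            if (i > 0 && S[i-1][j] + GAP > best) best = S[i-1][j] + GAP;
            if (j > 0 && S[i][j-1] + GAP > best) best = S[i][j-1] + GAP;
            S[i][j] = best;
        }
    }
    return S[n][m];
}

int main(void)
{
    printf("score=%d\n", banded_align("GATTACA", "GATCACA"));
    return 0;
}
```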


2017 ◽  
Vol 26 (09) ◽  
pp. 1750129 ◽  
Author(s):  
Mohamed Najoui ◽  
Mounir Bahtat ◽  
Anas Hatim ◽  
Said Belkouch ◽  
Noureddine Chabini

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in several signal processing applications, and its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. However, on some processor architectures, such as very long instruction word (VLIW) processors, compiler efficiency alone is not enough to make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on a VLIW architecture. To overcome the compiler's limited ability to parallelize most of the Givens arithmetic operations, we propose a low-level instruction scheme that maximizes the parallelism rate and minimizes clock cycles. The key contributions of this work are as follows: (i) a new fast, parallel design of the Givens algorithm based on VLIW features (i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP)), including the cache memory properties; (ii) an efficient data management approach to avoid cache misses and memory bank conflicts. Two DSP platforms, the C6678 and the AK2H12, were used as implementation targets. The introduced parallel QR implementation method achieves, on average, more than 12× and 6× speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme is at least 3.65 and 2.5 times faster than the recent CPU and DSP implementations, respectively.
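For reference, the scalar building block the paper accelerates is the Givens rotation: choose c and s so the rotation zeroes one subdiagonal entry, then update the two affected rows. The plain-C version below is our own reference sketch, not the paper's VLIW code; note that the column updates in the inner loop are mutually independent, which is exactly the ILP/DLP the proposed instruction scheduling exploits.

```c
#include <math.h>
#include <stdio.h>

#define N 4

/* Build G = [c s; -s c] such that applying G to (a, b) zeroes b. */
static void givens(double a, double b, double *c, double *s)
{
    if (b == 0.0) { *c = 1.0; *s = 0.0; return; }
    double r = hypot(a, b);               /* sqrt(a^2 + b^2), overflow-safe */
    *c = a / r;  *s = b / r;
}

/* Givens QR: A is overwritten by the upper-triangular factor R. */
static void qr_givens(double A[N][N])
{
    for (int k = 0; k < N - 1; k++)
        for (int i = k + 1; i < N; i++) {
            double c, s;
            givens(A[k][k], A[i][k], &c, &s);   /* zero A[i][k]            */
            for (int j = k; j < N; j++) {       /* rotate rows k and i;
                                                   columns are independent */
                double t1 = A[k][j], t2 = A[i][j];
                A[k][j] =  c * t1 + s * t2;
                A[i][j] = -s * t1 + c * t2;
            }
        }
}

int main(void)
{
    double A[N][N] = {{4,1,2,3},{1,3,0,1},{2,0,5,2},{3,1,2,6}};
    qr_givens(A);
    for (int i = 0; i < N; i++)
        printf("% .3f % .3f % .3f % .3f\n", A[i][0], A[i][1], A[i][2], A[i][3]);
    return 0;
}
```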

