VLIW DSP-Based Low-Level Instruction Scheme of Givens QR Decomposition for Real-Time Processing

2017 ◽  
Vol 26 (09) ◽  
pp. 1750129 ◽  
Author(s):  
Mohamed Najoui ◽  
Mounir Bahtat ◽  
Anas Hatim ◽  
Said Belkouch ◽  
Noureddine Chabini

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in signal processing applications, and its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. However, on some processor architectures such as very long instruction word (VLIW), the compiler alone cannot make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on the VLIW architecture. To overcome the compiler's limits in parallelizing most of the Givens arithmetic operations, we propose a low-level instruction scheme that maximizes the parallelism rate and minimizes clock cycles. The key contributions of this work are as follows: (i) a new parallel and fast version of the Givens algorithm designed around the VLIW features (instruction-level parallelism (ILP) and data-level parallelism (DLP)) and the cache memory properties; (ii) an efficient data-management approach that avoids cache misses and memory bank conflicts. Two DSP platforms, the C6678 and the AK2H12, were used as implementation targets. The introduced parallel QR implementation method achieves, on average, more than 12× and 6× speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme is at least 3.65 and 2.5 times faster than recent CPU and DSP implementations, respectively.
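
For orientation, the sketch below shows the scalar Givens-rotation QR kernel that such implementations start from: each rotation zeroes one subdiagonal element and is applied to two rows at a time. It is only a plain-C reference with made-up matrix size and test values; the paper's contribution lies in hand-scheduling these operations at the VLIW instruction level, which is not reproduced here.

```c
/* Plain-C reference for Givens-rotation QR: A is overwritten with R and
 * Qt accumulates Q^T (so that Qt * A_original = R). Illustrative only;
 * sizes and test data are arbitrary, and none of the paper's VLIW
 * scheduling, SIMD, or cache tuning is reflected here. */
#include <math.h>
#include <stdio.h>

#define N 4

static void givens_qr(double A[N][N], double Qt[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            Qt[i][j] = (i == j) ? 1.0 : 0.0;   /* Qt starts as the identity */

    for (int j = 0; j < N; j++) {              /* zero column j below the diagonal */
        for (int i = N - 1; i > j; i--) {
            double a = A[i - 1][j], b = A[i][j];
            double r = hypot(a, b);
            if (r == 0.0) continue;
            double c = a / r, s = -b / r;      /* rotation sending (a,b) to (r,0) */
            for (int k = 0; k < N; k++) {      /* apply it to rows i-1 and i */
                double t1 = A[i - 1][k], t2 = A[i][k];
                A[i - 1][k] = c * t1 - s * t2;
                A[i][k]     = s * t1 + c * t2;
                double q1 = Qt[i - 1][k], q2 = Qt[i][k];
                Qt[i - 1][k] = c * q1 - s * q2;
                Qt[i][k]     = s * q1 + c * q2;
            }
        }
    }
}

int main(void) {
    double A[N][N] = {{4, 1, 2, 3}, {1, 3, 0, 1}, {2, 0, 5, 2}, {3, 1, 2, 6}};
    double Qt[N][N];
    givens_qr(A, Qt);
    printf("R diagonal: %.3f %.3f %.3f %.3f\n", A[0][0], A[1][1], A[2][2], A[3][3]);
    return 0;
}
```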

2001 ◽  
Vol 01 (02) ◽  
pp. 251-271
Author(s):  
KWANG-MAN OH ◽  
JEONG-DAN CHOI ◽  
CHAN-SU LEE ◽  
CHAN-JONG PARK ◽  
EE-TAEK LEE

This paper presents an efficient and simple quad edge conversion method for polygonal (manifold) objects. In a wide variety of applications such as scientific visualization, virtual reality, and computer-aided geometric design, polygonal objects are expected to be visualized and manipulated within a given time constraint. To meet these expectations, an efficient data structure is necessary, alongside high-performance graphics hardware and real-time processing techniques such as simplification and levels of detail. The quad edge data structure is very efficient for handling polygonal objects, even though it was originally designed to handle subdivisions of manifold objects such as Delaunay triangulations and Voronoi diagrams. It has not been used widely, however, because there is no efficient algorithm for converting conventional polygonal objects to quad edges. In this paper, we propose a new incremental quad edge conversion algorithm that processes the triangles one by one. Since the quad edge structure has only splice as its topological operator, the conversion of each triangle is done by applying three splice operations, one per vertex. As an application of the quad edge, a simplification of conventional polygonal objects is implemented, including removing, moving, replacing, and inserting vertices and edges.
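
To make the role of splice concrete, here is a minimal plain-C version of the Guibas-Stolfi quad edge record together with the three-splice triangle construction mentioned above. This is the textbook data structure, not the paper's incremental conversion routine, and the field names are our own.

```c
/* Minimal Guibas-Stolfi quad edge record with its single topological
 * operator, splice. Sketch only: vertex/face data handling is omitted. */
#include <stdlib.h>

typedef struct Edge Edge;
struct Edge {
    Edge *onext;   /* next edge counterclockwise around the origin      */
    int   rotidx;  /* position of this directed edge in its quad (0..3) */
    void *data;    /* origin vertex (even rotidx) or left face (odd)    */
};

typedef struct { Edge e[4]; } QuadEdge;

static Edge *rot(Edge *e) {                /* dual edge, rotated 90 degrees */
    QuadEdge *q = (QuadEdge *)(e - e->rotidx);
    return &q->e[(e->rotidx + 1) & 3];
}
static Edge *sym(Edge *e)   { return rot(rot(e)); }    /* reversed edge */
static Edge *onext(Edge *e) { return e->onext; }

static Edge *make_edge(void) {             /* an isolated edge */
    QuadEdge *q = malloc(sizeof *q);
    for (int i = 0; i < 4; i++) { q->e[i].rotidx = i; q->e[i].data = NULL; }
    q->e[0].onext = &q->e[0];              /* primal rings: edge alone at each end */
    q->e[2].onext = &q->e[2];
    q->e[1].onext = &q->e[3];              /* dual ring: its two sides form one face loop */
    q->e[3].onext = &q->e[1];
    return &q->e[0];
}

/* splice: merges or separates the origin rings of a and b
 * (and, through the duals, the corresponding face rings) */
static void splice(Edge *a, Edge *b) {
    Edge *alpha = rot(onext(a)), *beta = rot(onext(b));
    Edge *t1 = onext(b), *t2 = onext(a), *t3 = onext(beta), *t4 = onext(alpha);
    a->onext = t1;  b->onext = t2;
    alpha->onext = t3;  beta->onext = t4;
}

/* One triangle needs exactly three splices, one per shared vertex. */
static void build_triangle(Edge **ea, Edge **eb, Edge **ec) {
    Edge *a = make_edge(), *b = make_edge(), *c = make_edge();
    splice(sym(a), b);    /* dest(a) and org(b) become the same vertex */
    splice(sym(b), c);
    splice(sym(c), a);
    *ea = a; *eb = b; *ec = c;
}

int main(void) {
    Edge *a, *b, *c;
    build_triangle(&a, &b, &c);
    /* dest(a) and org(b) now share one origin ring: onext(sym(a)) == b */
    return onext(sym(a)) == b ? 0 : 1;
}
```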


Author(s):  
Masa-aki Fukase ◽  
Tomoaki Sato

In developing cutting-edge VLSI processors, parallelism is one of the most important global strategies for achieving power-conscious high performance. These features are even more critical for ubiquitous systems, which face great demands for multimedia mobile processing. One of the most important issues for such systems is instruction scheduling, because the floating-point units indispensable for multimedia mobile applications have longer latency than integer units. Although software scheduling has been unavoidable for fully exploiting the hardware parallelism of regular scalar units, it is awkward in practice. We therefore describe in this article a double scheme that achieves instruction-scheduling-free ILP (instruction-level parallelism) and apply it to HCgorilla, a ubiquitous processor we have developed. The double scheme consists of multifunctionalizing the scalar units and wave-pipelining the resulting multifunctional unit (MFU). Multifunctionalization removes the need for instruction scheduling, while wave-pipelining recovers the clock-speed reduction that would otherwise be caused by scaling up the multifunctional circuit. HCgorilla with the built-in waved MFU is promising for wide-range dynamic ILP at a rate higher than regular processors.


2018 ◽  
Vol 18 (3-4) ◽  
pp. 438-451
Author(s):  
MARC DAHLEM ◽  
ANOOP BHAGYANATH ◽  
KLAUS SCHNEIDER

Conventional processor architectures are restricted in exploiting instruction level parallelism (ILP) due to the relatively low number of programmer-visible registers. Therefore, more recent processor architectures expose their datapaths so that the compiler (1) can schedule parallel instructions to different processing units and (2) can make effective use of local storage of the processing units. Among these architectures, the Synchronous Control Asynchronous Dataflow (SCAD) architecture is a new exposed datapath architecture whose processing units are equipped with first-in first-out (FIFO) buffers at their input and output ports. In contrast to register-based machines, the optimal code generation for SCAD is still a matter of research. In particular, SAT and SMT solvers were used to generate optimal resource-constrained and optimal time-constrained schedules for SCAD, respectively. As Answer Set Programming (ASP) offers better flexibility in handling such scheduling problems, we focus in this paper on using an answer set solver for both resource- and time-constrained optimal SCAD code generation. As a major benefit of using ASP, we are able to generate all optimal schedules for a given program, which allows one to study their properties. Furthermore, the experimental results of this paper demonstrate that the answer set solver can compete with SAT solvers and outperforms SMT solvers. This paper is under consideration for acceptance in TPLP.
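
As a rough illustration of what an optimal resource-constrained schedule is, the toy program below enumerates start cycles for a five-operation dependence DAG by brute force and keeps the schedule with the smallest makespan. It is not the ASP (or SAT/SMT) encoding used in the paper and it ignores SCAD's FIFO buffers entirely; the latencies, dependences, and issue-width limit are invented for the example.

```c
/* Brute-force optimal resource-constrained scheduling of a tiny DAG.
 * Resource model: at most UNITS operations may issue in the same cycle.
 * All task data below are made up for illustration. */
#include <stdio.h>
#include <limits.h>

#define N 5          /* number of operations in the toy DAG */
#define UNITS 2      /* operations that may issue per cycle  */
#define MAXT 16      /* enumeration horizon                  */

static const int lat[N] = {1, 2, 1, 2, 1};     /* latency of each op          */
static const int dep[N][N] = {                 /* dep[i][j] = 1 means j -> i  */
    {0,0,0,0,0},
    {1,0,0,0,0},
    {1,0,0,0,0},
    {0,1,1,0,0},
    {0,0,0,1,0},
};

static int start[N], best[N], best_makespan = INT_MAX;

static int makespan(const int *s) {
    int m = 0;
    for (int i = 0; i < N; i++)
        if (s[i] + lat[i] > m) m = s[i] + lat[i];
    return m;
}

/* how many already-placed ops issue in this cycle */
static int issued_at(const int *s, int upto, int cycle) {
    int c = 0;
    for (int i = 0; i <= upto; i++)
        if (s[i] == cycle) c++;
    return c;
}

static void search(int op) {
    if (op == N) {
        int m = makespan(start);
        if (m < best_makespan) {
            best_makespan = m;
            for (int i = 0; i < N; i++) best[i] = start[i];
        }
        return;
    }
    for (int t = 0; t < MAXT; t++) {
        int ok = 1;
        for (int j = 0; j < op; j++)            /* predecessors must have finished */
            if (dep[op][j] && t < start[j] + lat[j]) ok = 0;
        if (ok && issued_at(start, op - 1, t) >= UNITS) ok = 0;
        if (!ok) continue;
        start[op] = t;
        search(op + 1);
    }
}

int main(void) {
    search(0);
    printf("optimal makespan: %d cycles\n", best_makespan);
    for (int i = 0; i < N; i++)
        printf("op %d starts at cycle %d\n", i, best[i]);
    return 0;
}
```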


2010 ◽  
Vol 57 (3) ◽  
pp. 314-338 ◽  
Author(s):  
Ben Abdallah Abderazek ◽  
Masashi Masuda ◽  
Arquimedes Canedo ◽  
Kenichi Kuroda

Author(s):  
M. KAMARAJU ◽  
M. ALEKHYA ◽  
K. LAL KISHORE

The main objective of this work is to implement a 32-bit pipelined RISC processor without interlocking stages. It is developed around S.I.M.E. (Single Instruction Multiple Execution), a scheme in which a single instruction drives multiple executions, and is based on the VLIW (Very Long Instruction Word) architecture, an optimal choice for obtaining a high performance level in embedded systems. In a VLIW-based architecture, the effectiveness of the processor depends on the ability of the compiler to provide sufficient instruction-level parallelism (ILP). The processor has been designed in VHDL and synthesized using Xilinx tools.


2008 ◽  
Vol 16 (4) ◽  
pp. 277-285 ◽  
Author(s):  
Ida M.B. Nielsen ◽  
Curtis L. Janssen

Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. Here we discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We illustrate these ideas with a hybrid distributed-memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
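
As a small illustration of the multi-threading half of that hybrid model, the sketch below parallelizes a dense matrix multiply across cores with OpenMP; in the full hybrid scheme each such node-level kernel would operate on a block of matrices distributed over MPI ranks, which is omitted here. The matrix size and data are placeholders, and this is not the authors' code.

```c
/* Shared-memory part of a hybrid matrix multiply: one rank would own a
 * block of C and use threads (here OpenMP) across its cores.
 * Compile with -fopenmp to enable threading; without it, the pragma is
 * ignored and the code runs sequentially. */
#include <stdio.h>
#include <stdlib.h>

#define N 512

static void matmul(const double *A, const double *B, double *C) {
    /* each thread computes a band of rows of C */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i * N + k];
            for (int j = 0; j < N; j++)
                C[i * N + j] += a * B[k * N + j];
        }
}

int main(void) {
    double *A = calloc(N * N, sizeof *A);
    double *B = calloc(N * N, sizeof *B);
    double *C = calloc(N * N, sizeof *C);
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 1.0; }
    matmul(A, B, C);
    printf("C[0][0] = %.1f (expected %d)\n", C[0], N);
    free(A); free(B); free(C);
    return 0;
}
```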


2013 ◽  
Vol 22 (10) ◽  
pp. 1340036
Author(s):  
ZIBIN DAI ◽  
LONGMEI NAN ◽  
XUAN YANG ◽  
XIAONAN LI

By analyzing the operational characteristics of linear feedback shift registers (LFSRs) in many public stream cipher algorithms, and the bottleneck they create on general-purpose processors, this paper proposes dedicated instructions and a reconfigurable hardware cell that execute LFSR operations in parallel with high performance. The LFSR instructions support different data widths and operating modes. Both instruction-level parallelism based on a VLIW architecture and intra-instruction parallelism, obtained by advancing the LFSR several steps at a time, are exploited. Corresponding reconfigurable hardware units that support the implementation of each instruction through configuration are also developed. The circuit can be used as an important acceleration unit in specialized stream cipher processing.
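
For readers unfamiliar with the operation being accelerated, the fragment below steps a 32-bit Galois-configuration LFSR several bits per call in plain C. The tap mask is an illustrative maximal-length polynomial, not one taken from the paper; the proposed instructions and reconfigurable cells would perform such multi-step updates within a single hardware operation.

```c
/* Sketch: advancing a 32-bit Galois LFSR k bits per call.
 * TAPS is an example maximal-length feedback mask (taps 32, 22, 2, 1),
 * chosen for illustration only. */
#include <stdint.h>
#include <stdio.h>

#define TAPS 0x80200003u

static uint32_t lfsr_step_k(uint32_t state, int k) {
    for (int i = 0; i < k; i++) {
        uint32_t lsb = state & 1u;     /* output bit             */
        state >>= 1;
        if (lsb) state ^= TAPS;        /* apply the feedback taps */
    }
    return state;
}

int main(void) {
    uint32_t s = 0xACE1u;              /* arbitrary nonzero seed */
    for (int i = 0; i < 4; i++) {
        s = lfsr_step_k(s, 8);         /* advance 8 bits at once */
        printf("state after %2d bits: 0x%08x\n", (i + 1) * 8, (unsigned)s);
    }
    return 0;
}
```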


2010 ◽  
Vol 34 (6) ◽  
pp. 228-236 ◽  
Author(s):  
Yu Zhang ◽  
Dongdong Chen ◽  
Younhee Choi ◽  
Li Chen ◽  
Seok-Bum Ko

2021 ◽  
Vol 11 (3) ◽  
pp. 1225
Author(s):  
Woohyong Lee ◽  
Jiyoung Lee ◽  
Bo Kyung Park ◽  
R. Young Chul Kim

Geekbench is one of the most referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic, but some of them aim to simulate real-world behavior. In the mobile world, its microarchitectural behavior has rarely been reported, since hardware profiling features are of limited availability to the public. Although Geekbench is a popular mobile performance workload, its microarchitectural characteristics on mobile devices are therefore hard to find. In this paper, a thorough experimental study of Geekbench performance characterization is reported with detailed performance metrics. This study also identifies mobile system-on-chip (SoC) microarchitecture impacts, such as the cache subsystem, instruction-level parallelism, and branch performance. The study reveals the bottlenecks of the workloads, especially in the cache subsystem: changing the data set size significantly impacts the performance score on some systems and undermines the fairness of the CPU benchmark. In the experiment, Samsung's Exynos9820-based platform was used as the test device with Android Native Development Kit (NDK) built binaries. The Exynos9820 is a superscalar processor capable of dual-issuing some instructions. To aid performance analysis, we enabled the collection of performance events through performance monitoring unit (PMU) registers. The PMU is a set of hardware performance counters built into microprocessors to store counts of hardware-related activities. Throughout the experiment, functional and microarchitectural performance profiles were fully studied. The ARM DS5 tool was used for collecting runtime PMU profiles, including OS-level performance data. After this comparative study, users will understand more about mobile architecture behavior, which will help evaluate which benchmark is preferable for fair performance comparison.
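
For context on what such PMU counters look like from software, here is a minimal Linux example that counts retired instructions around a toy workload using the perf_event_open system call. The paper itself collected its profiles with ARM DS5 on the Exynos9820's PMU registers; this generic sketch only shows the counter mechanism, and the workload loop is arbitrary.

```c
/* Count retired instructions for a small workload via Linux perf_event_open.
 * Requires appropriate perf_event_paranoid settings or privileges. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_open(struct perf_event_attr *attr) {
    /* this process, any CPU, no event group, no flags */
    return syscall(__NR_perf_event_open, attr, 0, -1, -1, 0);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)perf_open(&attr);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                    /* workload being measured */
    for (int i = 0; i < 1000000; i++) x += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) { perror("read"); return 1; }
    printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```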

