VLIW DSP-Based Low-Level Instruction Scheme of Givens QR Decomposition for Real-Time Processing

2017 ◽  
Vol 26 (09) ◽  
pp. 1750129 ◽  
Author(s):  
Mohamed Najoui ◽  
Mounir Bahtat ◽  
Anas Hatim ◽  
Said Belkouch ◽  
Noureddine Chabini

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in signal processing applications, and its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. However, on some processor architectures such as very long instruction word (VLIW), the compiler alone cannot make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on the VLIW architecture. To overcome the compiler's limits in parallelizing most of the Givens arithmetic operations, we propose a low-level instruction scheme that maximizes the parallelism rate and minimizes clock cycles. The key contributions of this work are as follows: (i) a new parallel and fast version of the Givens algorithm designed around the VLIW features (instruction-level parallelism (ILP) and data-level parallelism (DLP)) and the cache memory properties; (ii) an efficient data-management approach that avoids cache misses and memory bank conflicts. Two DSP platforms, the C6678 and the AK2H12, were used as implementation targets. The introduced parallel QR implementation method achieves, on average, more than 12× and 6× speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme is at least 3.65 and 2.5 times faster than recent CPU and DSP implementations, respectively.
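
For orientation, the sketch below shows the scalar Givens-rotation QR kernel that such implementations start from: each rotation zeroes one subdiagonal element and is applied to two rows at a time. It is only a plain-C reference with made-up matrix size and test values; the paper's contribution lies in hand-scheduling these operations at the VLIW instruction level, which is not reproduced here.

```c
/* Plain-C reference for Givens-rotation QR: A is overwritten with R and
 * Qt accumulates Q^T (so that Qt * A_original = R). Illustrative only;
 * sizes and test data are arbitrary, and none of the paper's VLIW
 * scheduling, SIMD, or cache tuning is reflected here. */
#include <math.h>
#include <stdio.h>

#define N 4

static void givens_qr(double A[N][N], double Qt[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            Qt[i][j] = (i == j) ? 1.0 : 0.0;   /* Qt starts as the identity */

    for (int j = 0; j < N; j++) {              /* zero column j below the diagonal */
        for (int i = N - 1; i > j; i--) {
            double a = A[i - 1][j], b = A[i][j];
            double r = hypot(a, b);
            if (r == 0.0) continue;
            double c = a / r, s = -b / r;      /* rotation sending (a,b) to (r,0) */
            for (int k = 0; k < N; k++) {      /* apply it to rows i-1 and i */
                double t1 = A[i - 1][k], t2 = A[i][k];
                A[i - 1][k] = c * t1 - s * t2;
                A[i][k]     = s * t1 + c * t2;
                double q1 = Qt[i - 1][k], q2 = Qt[i][k];
                Qt[i - 1][k] = c * q1 - s * q2;
                Qt[i][k]     = s * q1 + c * q2;
            }
        }
    }
}

int main(void) {
    double A[N][N] = {{4, 1, 2, 3}, {1, 3, 0, 1}, {2, 0, 5, 2}, {3, 1, 2, 6}};
    double Qt[N][N];
    givens_qr(A, Qt);
    printf("R diagonal: %.3f %.3f %.3f %.3f\n", A[0][0], A[1][1], A[2][2], A[3][3]);
    return 0;
}
```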

2001 ◽  
Vol 01 (02) ◽  
pp. 251-271
Author(s):  
KWANG-MAN OH ◽  
JEONG-DAN CHOI ◽  
CHAN-SU LEE ◽  
CHAN-JONG PARK ◽  
EE-TAEK LEE

This paper presents an efficient and simple quad edge conversion method for polygonal (manifold) objects. In a wide variety of applications such as scientific visualization, virtual reality, and computer-aided geometric design, polygonal objects are expected to be visualized and manipulated within a given time constraint. To meet these expectations, an efficient data structure is necessary, alongside high-performance graphics hardware and real-time processing techniques such as simplification and levels of detail. The quad edge data structure is very efficient for handling polygonal objects, even though it was originally designed to handle subdivisions of manifold objects such as Delaunay triangulations and Voronoi diagrams. It has not been used widely, however, because there is no efficient algorithm for converting conventional polygonal objects to quad edges. In this paper, we propose a new incremental quad edge conversion algorithm that processes the triangles one by one. Since the quad edge structure has only splice as its topological operator, the conversion of each triangle is done by applying three splice operations, one per vertex. As an application of the quad edge, a simplification of conventional polygonal objects is implemented, including removing, moving, replacing, and inserting vertices and edges.
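
To make the role of splice concrete, here is a minimal plain-C version of the Guibas-Stolfi quad edge record together with the three-splice triangle construction mentioned above. This is the textbook data structure, not the paper's incremental conversion routine, and the field names are our own.

```c
/* Minimal Guibas-Stolfi quad edge record with its single topological
 * operator, splice. Sketch only: vertex/face data handling is omitted. */
#include <stdlib.h>

typedef struct Edge Edge;
struct Edge {
    Edge *onext;   /* next edge counterclockwise around the origin      */
    int   rotidx;  /* position of this directed edge in its quad (0..3) */
    void *data;    /* origin vertex (even rotidx) or left face (odd)    */
};

typedef struct { Edge e[4]; } QuadEdge;

static Edge *rot(Edge *e) {                /* dual edge, rotated 90 degrees */
    QuadEdge *q = (QuadEdge *)(e - e->rotidx);
    return &q->e[(e->rotidx + 1) & 3];
}
static Edge *sym(Edge *e)   { return rot(rot(e)); }    /* reversed edge */
static Edge *onext(Edge *e) { return e->onext; }

static Edge *make_edge(void) {             /* an isolated edge */
    QuadEdge *q = malloc(sizeof *q);
    for (int i = 0; i < 4; i++) { q->e[i].rotidx = i; q->e[i].data = NULL; }
    q->e[0].onext = &q->e[0];              /* primal rings: edge alone at each end */
    q->e[2].onext = &q->e[2];
    q->e[1].onext = &q->e[3];              /* dual ring: its two sides form one face loop */
    q->e[3].onext = &q->e[1];
    return &q->e[0];
}

/* splice: merges or separates the origin rings of a and b
 * (and, through the duals, the corresponding face rings) */
static void splice(Edge *a, Edge *b) {
    Edge *alpha = rot(onext(a)), *beta = rot(onext(b));
    Edge *t1 = onext(b), *t2 = onext(a), *t3 = onext(beta), *t4 = onext(alpha);
    a->onext = t1;  b->onext = t2;
    alpha->onext = t3;  beta->onext = t4;
}

/* One triangle needs exactly three splices, one per shared vertex. */
static void build_triangle(Edge **ea, Edge **eb, Edge **ec) {
    Edge *a = make_edge(), *b = make_edge(), *c = make_edge();
    splice(sym(a), b);    /* dest(a) and org(b) become the same vertex */
    splice(sym(b), c);
    splice(sym(c), a);
    *ea = a; *eb = b; *ec = c;
}

int main(void) {
    Edge *a, *b, *c;
    build_triangle(&a, &b, &c);
    /* dest(a) and org(b) now share one origin ring: onext(sym(a)) == b */
    return onext(sym(a)) == b ? 0 : 1;
}
```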


Author(s):  
Masa-aki Fukase ◽  
Tomoaki Sato

In developing cutting-edge VLSI processors, parallelism is one of the most important global strategies for achieving power-conscious high performance. These features are even more critical for ubiquitous systems, which face great demands for multimedia mobile processing. One of the most important issues for such systems is instruction scheduling, because the floating-point units indispensable for multimedia mobile applications have longer latency than integer units. Although software scheduling has been unavoidable for fully exploiting the hardware parallelism of regular scalar units, it is awkward in practice. We therefore describe in this article a double scheme that achieves instruction-scheduling-free ILP (instruction-level parallelism) and apply it to HCgorilla, a ubiquitous processor we have developed. The double scheme consists of multifunctionalizing the scalar units and wave-pipelining the resulting multifunctional unit (MFU). Multifunctionalization removes the need for instruction scheduling, while wave-pipelining recovers the clock-speed reduction that would otherwise be caused by scaling up the multifunctional circuit. HCgorilla with the built-in waved MFU is promising for wide-range dynamic ILP at a rate higher than regular processors.


2018 ◽  
Vol 18 (3-4) ◽  
pp. 438-451
Author(s):  
MARC DAHLEM ◽  
ANOOP BHAGYANATH ◽  
KLAUS SCHNEIDER

Conventional processor architectures are restricted in exploiting instruction level parallelism (ILP) due to the relatively low number of programmer-visible registers. Therefore, more recent processor architectures expose their datapaths so that the compiler (1) can schedule parallel instructions to different processing units and (2) can make effective use of local storage of the processing units. Among these architectures, the Synchronous Control Asynchronous Dataflow (SCAD) architecture is a new exposed datapath architecture whose processing units are equipped with first-in first-out (FIFO) buffers at their input and output ports. In contrast to register-based machines, the optimal code generation for SCAD is still a matter of research. In particular, SAT and SMT solvers were used to generate optimal resource-constrained and optimal time-constrained schedules for SCAD, respectively. As Answer Set Programming (ASP) offers better flexibility in handling such scheduling problems, we focus in this paper on using an answer set solver for both resource- and time-constrained optimal SCAD code generation. As a major benefit of using ASP, we are able to generate all optimal schedules for a given program, which allows one to study their properties. Furthermore, the experimental results of this paper demonstrate that the answer set solver can compete with SAT solvers and outperforms SMT solvers. This paper is under consideration for acceptance in TPLP.
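
As a rough illustration of what an optimal resource-constrained schedule is, the toy program below enumerates start cycles for a five-operation dependence DAG by brute force and keeps the schedule with the smallest makespan. It is not the ASP (or SAT/SMT) encoding used in the paper and it ignores SCAD's FIFO buffers entirely; the latencies, dependences, and issue-width limit are invented for the example.

```c
/* Brute-force optimal resource-constrained scheduling of a tiny DAG.
 * Resource model: at most UNITS operations may issue in the same cycle.
 * All task data below are made up for illustration. */
#include <stdio.h>
#include <limits.h>

#define N 5          /* number of operations in the toy DAG */
#define UNITS 2      /* operations that may issue per cycle  */
#define MAXT 16      /* enumeration horizon                  */

static const int lat[N] = {1, 2, 1, 2, 1};     /* latency of each op          */
static const int dep[N][N] = {                 /* dep[i][j] = 1 means j -> i  */
    {0,0,0,0,0},
    {1,0,0,0,0},
    {1,0,0,0,0},
    {0,1,1,0,0},
    {0,0,0,1,0},
};

static int start[N], best[N], best_makespan = INT_MAX;

static int makespan(const int *s) {
    int m = 0;
    for (int i = 0; i < N; i++)
        if (s[i] + lat[i] > m) m = s[i] + lat[i];
    return m;
}

/* how many already-placed ops issue in this cycle */
static int issued_at(const int *s, int upto, int cycle) {
    int c = 0;
    for (int i = 0; i <= upto; i++)
        if (s[i] == cycle) c++;
    return c;
}

static void search(int op) {
    if (op == N) {
        int m = makespan(start);
        if (m < best_makespan) {
            best_makespan = m;
            for (int i = 0; i < N; i++) best[i] = start[i];
        }
        return;
    }
    for (int t = 0; t < MAXT; t++) {
        int ok = 1;
        for (int j = 0; j < op; j++)            /* predecessors must have finished */
            if (dep[op][j] && t < start[j] + lat[j]) ok = 0;
        if (ok && issued_at(start, op - 1, t) >= UNITS) ok = 0;
        if (!ok) continue;
        start[op] = t;
        search(op + 1);
    }
}

int main(void) {
    search(0);
    printf("optimal makespan: %d cycles\n", best_makespan);
    for (int i = 0; i < N; i++)
        printf("op %d starts at cycle %d\n", i, best[i]);
    return 0;
}
```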


2010 ◽  
Vol 57 (3) ◽  
pp. 314-338 ◽  
Author(s):  
Ben Abdallah Abderazek ◽  
Masashi Masuda ◽  
Arquimedes Canedo ◽  
Kenichi Kuroda

Author(s):  
M. KAMARAJU ◽  
M. ALEKHYA ◽  
K. LAL KISHORE

The main objective of this work is to implement a 32-bit pipelined RISC processor without interlocking stages. It is developed around S.I.M.E. (Single Instruction Multiple Execution), a scheme in which a single instruction drives multiple executions, and is based on the VLIW (Very Long Instruction Word) architecture, an optimal choice for obtaining a high performance level in embedded systems. In a VLIW-based architecture, the effectiveness of the processor depends on the ability of the compiler to provide sufficient instruction-level parallelism (ILP). The processor has been designed in VHDL and synthesized using Xilinx tools.


2008 ◽  
Vol 16 (4) ◽  
pp. 277-285 ◽  
Author(s):  
Ida M.B. Nielsen ◽  
Curtis L. Janssen

Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. Here we discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We illustrate these ideas with a hybrid distributed-memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
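
As a small illustration of the multi-threading half of that hybrid model, the sketch below parallelizes a dense matrix multiply across cores with OpenMP; in the full hybrid scheme each such node-level kernel would operate on a block of matrices distributed over MPI ranks, which is omitted here. The matrix size and data are placeholders, and this is not the authors' code.

```c
/* Shared-memory part of a hybrid matrix multiply: one rank would own a
 * block of C and use threads (here OpenMP) across its cores.
 * Compile with -fopenmp to enable threading; without it, the pragma is
 * ignored and the code runs sequentially. */
#include <stdio.h>
#include <stdlib.h>

#define N 512

static void matmul(const double *A, const double *B, double *C) {
    /* each thread computes a band of rows of C */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i * N + k];
            for (int j = 0; j < N; j++)
                C[i * N + j] += a * B[k * N + j];
        }
}

int main(void) {
    double *A = calloc(N * N, sizeof *A);
    double *B = calloc(N * N, sizeof *B);
    double *C = calloc(N * N, sizeof *C);
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 1.0; }
    matmul(A, B, C);
    printf("C[0][0] = %.1f (expected %d)\n", C[0], N);
    free(A); free(B); free(C);
    return 0;
}
```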


2013 ◽  
Vol 22 (10) ◽  
pp. 1340036
Author(s):  
ZIBIN DAI ◽  
LONGMEI NAN ◽  
XUAN YANG ◽  
XIAONAN LI

By analyzing the operational characteristics of linear feedback shift registers (LFSRs) in many public stream cipher algorithms, and the bottleneck they create on general-purpose processors, this paper proposes dedicated instructions and a reconfigurable hardware cell that execute LFSR operations in parallel with high performance. The LFSR instructions support different data widths and operating modes. Both instruction-level parallelism based on a VLIW architecture and intra-instruction parallelism, obtained by advancing the LFSR several steps at a time, are exploited. Corresponding reconfigurable hardware units that support the implementation of each instruction through configuration are also developed. The circuit can be used as an important acceleration unit in specialized stream cipher processing.
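
For readers unfamiliar with the operation being accelerated, the fragment below steps a 32-bit Galois-configuration LFSR several bits per call in plain C. The tap mask is an illustrative maximal-length polynomial, not one taken from the paper; the proposed instructions and reconfigurable cells would perform such multi-step updates within a single hardware operation.

```c
/* Sketch: advancing a 32-bit Galois LFSR k bits per call.
 * TAPS is an example maximal-length feedback mask (taps 32, 22, 2, 1),
 * chosen for illustration only. */
#include <stdint.h>
#include <stdio.h>

#define TAPS 0x80200003u

static uint32_t lfsr_step_k(uint32_t state, int k) {
    for (int i = 0; i < k; i++) {
        uint32_t lsb = state & 1u;     /* output bit             */
        state >>= 1;
        if (lsb) state ^= TAPS;        /* apply the feedback taps */
    }
    return state;
}

int main(void) {
    uint32_t s = 0xACE1u;              /* arbitrary nonzero seed */
    for (int i = 0; i < 4; i++) {
        s = lfsr_step_k(s, 8);         /* advance 8 bits at once */
        printf("state after %2d bits: 0x%08x\n", (i + 1) * 8, (unsigned)s);
    }
    return 0;
}
```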


2010 ◽  
Vol 34 (6) ◽  
pp. 228-236 ◽  
Author(s):  
Yu Zhang ◽  
Dongdong Chen ◽  
Younhee Choi ◽  
Li Chen ◽  
Seok-Bum Ko

2021 ◽  
Vol 11 (3) ◽  
pp. 1225
Author(s):  
Woohyong Lee ◽  
Jiyoung Lee ◽  
Bo Kyung Park ◽  
R. Young Chul Kim

Geekbench is one of the most referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic, but some of them aim to simulate real-world behavior. In the mobile world, its microarchitectural behavior has rarely been reported, since hardware profiling features are of limited availability to the public. Although Geekbench is a popular mobile performance workload, its microarchitectural characteristics on mobile devices are therefore hard to find. In this paper, a thorough experimental study of Geekbench performance characterization is reported with detailed performance metrics. This study also identifies mobile system-on-chip (SoC) microarchitecture impacts, such as the cache subsystem, instruction-level parallelism, and branch performance. The study reveals the bottlenecks of the workloads, especially in the cache subsystem: changing the data set size significantly impacts the performance score on some systems and undermines the fairness of the CPU benchmark. In the experiment, Samsung's Exynos9820-based platform was used as the test device with Android Native Development Kit (NDK) built binaries. The Exynos9820 is a superscalar processor capable of dual-issuing some instructions. To aid performance analysis, we enabled the collection of performance events through performance monitoring unit (PMU) registers. The PMU is a set of hardware performance counters built into microprocessors to store counts of hardware-related activities. Throughout the experiment, functional and microarchitectural performance profiles were fully studied. The ARM DS5 tool was used for collecting runtime PMU profiles, including OS-level performance data. After this comparative study, users will understand more about mobile architecture behavior, which will help evaluate which benchmark is preferable for fair performance comparison.
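
For context on what such PMU counters look like from software, here is a minimal Linux example that counts retired instructions around a toy workload using the perf_event_open system call. The paper itself collected its profiles with ARM DS5 on the Exynos9820's PMU registers; this generic sketch only shows the counter mechanism, and the workload loop is arbitrary.

```c
/* Count retired instructions for a small workload via Linux perf_event_open.
 * Requires appropriate perf_event_paranoid settings or privileges. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_open(struct perf_event_attr *attr) {
    /* this process, any CPU, no event group, no flags */
    return syscall(__NR_perf_event_open, attr, 0, -1, -1, 0);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)perf_open(&attr);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                    /* workload being measured */
    for (int i = 0; i < 1000000; i++) x += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) { perror("read"); return 1; }
    printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```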

