Implementing and Optimizing of Entire System Toolkit of VLIW DSP Processors for Embedded Sensor-Based Systems

2015 ◽  
Vol 2015 ◽  
pp. 1-7
Author(s):  
Xu Yang ◽  
Mingbin Zeng ◽  
Yanjun Zhang

VLIW DSPs can greatly enhance instruction-level parallelism, providing the capacity to meet the performance and energy-efficiency requirements of sensor-based systems. However, exploiting VLIW DSPs in the sensor-based domain imposes heavy challenges on software toolkit design. In this paper, we present our methods and experience in developing the system toolkit flow for a VLIW DSP designed specifically for sensor-based systems. Our system toolkit includes a compiler, assembler, linker, debugger, and simulator. We present experimental results for the compiler framework, which incorporates several state-of-the-art optimization techniques for this VLIW DSP. The results indicate that our framework substantially improves performance and energy consumption compared with code generated without it.

Author(s):  
S. Rajagopalan ◽  
S.P. Rajan ◽  
S. Malik ◽  
S. Rigo ◽  
G. Araujo ◽  
...  

2012 ◽  
Vol 4 (3) ◽  
pp. 48-62
Author(s):  
Slo-Li Chu ◽  
Chih-Chieh Hsiao

Heterogeneous platforms consisting of a CPU and add-on streaming processors are widely used in modern computer systems. These add-on processors provide substantially more computation capability and memory bandwidth than conventional multi-core platforms, and general-purpose computations can also be offloaded onto them. However, programming these streaming processors to realize their potential performance is challenging because of their diverse underlying architectural characteristics. Several optimization techniques are applied on OpenCL-compatible heterogeneous platforms to achieve thread-level, data-level, and instruction-level parallelism. The architectural implications of these techniques and optimization principles are discussed. Finally, a case study of the MRI-Q benchmark illustrates the capabilities of these optimization techniques. The experimental results reveal that the speedup from the non-optimized to the optimized kernel varies from 8 to 63 across the target platforms.
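The abstract does not show the kernel transformations themselves; the C sketch below illustrates, under stated assumptions, the kind of restructuring such studies apply to an MRI-Q-style accumulation loop: hoisting loop-invariant voxel coordinates and two-way unrolling with independent accumulators to expose instruction- and data-level parallelism (thread-level parallelism would come from splitting the outer voxel loop across OpenCL work-items). The function names, problem sizes, and unroll factor are illustrative assumptions, not the authors' code.

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979f
#define NK 256  /* k-space samples (illustrative size) */
#define NX 64   /* output voxels   (illustrative size) */

/* Baseline kernel: straightforward accumulation, one sin/cos pair per sample. */
static void mriq_naive(const float kx[], const float ky[], const float kz[],
                       const float mag[], const float x[], const float y[],
                       const float z[], float qr[], float qi[])
{
    for (int v = 0; v < NX; v++) {
        float re = 0.0f, im = 0.0f;
        for (int k = 0; k < NK; k++) {
            float ang = 2.0f * PI * (kx[k] * x[v] + ky[k] * y[v] + kz[k] * z[v]);
            re += mag[k] * cosf(ang);
            im += mag[k] * sinf(ang);
        }
        qr[v] = re;
        qi[v] = im;
    }
}

/* Restructured kernel: voxel coordinates are hoisted into registers and the
 * inner loop is unrolled two-way with independent accumulators, so the two
 * sin/cos chains can overlap (instruction-level parallelism) and the compiler
 * is free to vectorize them (data-level parallelism). Assumes NK is even. */
static void mriq_unrolled(const float kx[], const float ky[], const float kz[],
                          const float mag[], const float x[], const float y[],
                          const float z[], float qr[], float qi[])
{
    for (int v = 0; v < NX; v++) {
        const float xv = x[v], yv = y[v], zv = z[v];
        float re0 = 0.0f, im0 = 0.0f, re1 = 0.0f, im1 = 0.0f;
        for (int k = 0; k < NK; k += 2) {
            float a0 = 2.0f * PI * (kx[k]   * xv + ky[k]   * yv + kz[k]   * zv);
            float a1 = 2.0f * PI * (kx[k+1] * xv + ky[k+1] * yv + kz[k+1] * zv);
            re0 += mag[k]   * cosf(a0);  im0 += mag[k]   * sinf(a0);
            re1 += mag[k+1] * cosf(a1);  im1 += mag[k+1] * sinf(a1);
        }
        qr[v] = re0 + re1;
        qi[v] = im0 + im1;
    }
}

int main(void)
{
    static float kx[NK], ky[NK], kz[NK], mag[NK], x[NX], y[NX], z[NX];
    static float qr0[NX], qi0[NX], qr1[NX], qi1[NX];
    for (int k = 0; k < NK; k++) {
        kx[k] = 0.01f * k; ky[k] = 0.02f * k; kz[k] = 0.03f * k;
        mag[k] = 1.0f / (k + 1);
    }
    for (int v = 0; v < NX; v++) { x[v] = 0.1f * v; y[v] = 0.2f * v; z[v] = 0.3f * v; }
    mriq_naive(kx, ky, kz, mag, x, y, z, qr0, qi0);
    mriq_unrolled(kx, ky, kz, mag, x, y, z, qr1, qi1);
    printf("voxel 0: naive=(%f, %f)  restructured=(%f, %f)\n",
           qr0[0], qi0[0], qr1[0], qi1[0]);
    return 0;
}

Both versions compute the same result; the restructured form only reorders independent work so that the hardware (or an OpenCL compiler) can schedule it in parallel.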


2012 ◽  
Vol 47 (6) ◽  
pp. 347-358 ◽  
Author(s):  
Jun Liu ◽  
Yuanrui Zhang ◽  
Ohyoung Jang ◽  
Wei Ding ◽  
Mahmut Kandemir

2021 ◽  
Vol 11 (3) ◽  
pp. 1225
Author(s):  
Woohyong Lee ◽  
Jiyoung Lee ◽  
Bo Kyung Park ◽  
R. Young Chul Kim

Geekbench is one of the most referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic, but some aim to simulate real-world behavior. In the mobile world, its microarchitectural behavior has rarely been reported because hardware profiling features are not widely available to the public, so Geekbench's microarchitectural characteristics on mobile devices are hard to find despite its popularity as a mobile performance workload. In this paper, a thorough experimental study of Geekbench performance characterization is reported with detailed performance metrics. The study also identifies the impact of mobile system on chip (SoC) microarchitecture, such as the cache subsystem, instruction-level parallelism, and branch performance. From the study, we could locate the bottlenecks of the workloads, especially in the cache subsystem: a change in data set size significantly affects the performance score on some systems and can undermine the fairness of the CPU benchmark. In the experiment, Samsung's Exynos9820-based platform was used as the test device, with binaries built using the Android Native Development Kit (NDK). The Exynos9820 is a superscalar processor capable of dual-issuing some instructions. To support the performance analysis, we enabled the collection of performance events with performance monitoring unit (PMU) registers; the PMU is a set of hardware performance counters built into microprocessors to record counts of hardware-related activities. Throughout the experiment, functional and microarchitectural performance profiles were fully studied. The ARM DS5 tool was used to collect runtime PMU profiles, including OS-level performance data. After the comparative study, users will better understand mobile architecture behavior, which will help in evaluating which benchmark is preferable for a fair performance comparison.
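In the paper the profiles are collected with ARM DS5 and the Exynos PMU; as a minimal, hypothetical illustration of the same counter-based methodology on a generic Linux system, the C sketch below uses the perf_event_open syscall to count retired instructions and L1 data-cache read misses around a cache-unfriendly loop. The event selection and workload are assumptions for illustration only, not the authors' setup.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* perf_event_open has no glibc wrapper, so invoke it via syscall().
 * Counts the given event for the calling thread, user space only. */
static int perf_open(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd_insn = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
    int fd_miss = perf_open(PERF_TYPE_HW_CACHE,
                            PERF_COUNT_HW_CACHE_L1D |
                            (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                            (PERF_COUNT_HW_CACHE_RESULT_MISS << 16));
    if (fd_insn < 0 || fd_miss < 0) { perror("perf_event_open"); return 1; }

    /* Illustrative workload: a strided walk over a buffer larger than a
     * typical L1 data cache, so most loads miss in L1D. */
    enum { N = 1 << 22, STRIDE = 64 };
    char *buf = malloc(N);
    memset(buf, 1, N);

    ioctl(fd_insn, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_insn, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_ENABLE, 0);

    long sum = 0;
    for (int i = 0; i < N; i += STRIDE)
        sum += buf[i];

    ioctl(fd_insn, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t insns = 0, misses = 0;
    read(fd_insn, &insns, sizeof(insns));
    read(fd_miss, &misses, sizeof(misses));
    printf("instructions=%llu  l1d_read_misses=%llu  (sum=%ld)\n",
           (unsigned long long)insns, (unsigned long long)misses, sum);
    free(buf);
    return 0;
}

Reading the two counters before and after a region of interest is the same basic pattern the paper follows with its PMU-based profiling, just exposed through a different tool.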


Author(s):  
Dennis Wolf ◽  
Andreas Engel ◽  
Tajas Ruschke ◽  
Andreas Koch ◽  
Christian Hochberger

Coarse-Grained Reconfigurable Arrays (CGRAs), or Architectures, are a concept for hardware accelerators based on the idea of distributing workload over processing elements. These processors exploit instruction-level parallelism while being energy efficient due to their simple internal structure. However, their incorporation into a complete computing system raises severe challenges at the hardware and software level. This article evaluates in detail a CGRA integrated into a control engineering environment targeting a Xilinx Zynq System on Chip (SoC). Besides the actual application execution performance, the practicability of the configuration toolchain is validated. Challenges of the real-world integration are discussed and practical insights are highlighted.


Materials ◽  
2022 ◽  
Vol 15 (1) ◽  
pp. 384
Author(s):  
António Gaspar-Cunha ◽  
José A. Covas ◽  
Janusz Sikora

Given the global economic and societal importance of the polymer industry, the continuous search for improvements in the various processing techniques is of paramount practical importance. This review evaluates the application of optimization methodologies to the main polymer processing operations. The most important characteristics related to the use of optimization techniques, such as the nature of the objective function, the type of optimization algorithm, the modelling approach used to evaluate the solutions, and the parameters to optimize, are discussed. The aim is to identify the most important features of an optimization system for polymer processing problems and to define the best procedure for each particular practical situation. For this purpose, the state of the art of the optimization methodologies usually employed is first presented, followed by an extensive review of the literature dealing with the major processing techniques; the discussion is completed by considering both the characteristics identified and the available optimization methodologies. This first part of the review focuses on extrusion, namely single- and twin-screw extruders, extrusion dies, and calibrators. It is concluded that there is a set of methodologies that can be confidently applied in polymer processing with very good performance and without demanding computational requirements.


2017 ◽  
Vol 26 (09) ◽  
pp. 1750129 ◽  
Author(s):  
Mohamed Najoui ◽  
Mounir Bahtat ◽  
Anas Hatim ◽  
Said Belkouch ◽  
Noureddine Chabini

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in several signal processing applications, and its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. However, on some processor architectures, such as very long instruction word (VLIW), compiler efficiency alone is not enough to make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on a VLIW architecture. To overcome the compiler's limited ability to parallelize most of the Givens arithmetic operations, we propose a low-level instruction scheme that maximizes the parallelism rate and minimizes clock cycles. The key contributions of this work are as follows: (i) a new, fast, parallel design of the Givens algorithm based on VLIW features (i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP)), taking cache memory properties into account; (ii) an efficient data management approach that avoids cache misses and memory bank conflicts. Two DSP platforms, the C6678 and AK2H12, were used as implementation targets. The introduced parallel QR implementation achieves, on average, more than 12× and 6× speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme is at least 3.65 and 2.5 times faster than recent CPU and DSP implementations, respectively.
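The paper's contribution is a hand-scheduled, VLIW-specific implementation; for reference, the plain C sketch below shows an unoptimized Givens-rotation QR factorization of a small matrix, i.e., the kind of baseline that such low-level instruction schemes accelerate. The matrix size, test values, and accumulation of Q are illustrative assumptions, not the authors' kernels or data layout.

#include <math.h>
#include <stdio.h>
#include <string.h>

#define N 4  /* illustrative matrix size */

/* Plain Givens-rotation QR: on return, A holds the upper-triangular factor R
 * and Q holds the orthogonal factor, with A_original = Q * R. */
static void givens_qr(double A[N][N], double Q[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            Q[i][j] = (i == j) ? 1.0 : 0.0;

    for (int col = 0; col < N - 1; col++) {
        /* Zero the sub-diagonal entries of this column from the bottom up,
         * each rotation acting on the adjacent row pair (row-1, row). */
        for (int row = N - 1; row > col; row--) {
            double a = A[row - 1][col], b = A[row][col];
            double r = hypot(a, b);
            if (r == 0.0) continue;
            double c = a / r, s = b / r;
            for (int k = 0; k < N; k++) {          /* A := G * A   */
                double t1 = A[row - 1][k], t2 = A[row][k];
                A[row - 1][k] =  c * t1 + s * t2;
                A[row][k]     = -s * t1 + c * t2;
            }
            for (int k = 0; k < N; k++) {          /* Q := Q * G^T */
                double q1 = Q[k][row - 1], q2 = Q[k][row];
                Q[k][row - 1] =  c * q1 + s * q2;
                Q[k][row]     = -s * q1 + c * q2;
            }
        }
    }
}

int main(void)
{
    double A[N][N] = {{4, 1, 2, 0}, {1, 3, 0, 1}, {2, 0, 5, 2}, {0, 1, 2, 6}};
    double R[N][N], Q[N][N];
    memcpy(R, A, sizeof(A));
    givens_qr(R, Q);

    /* Verify the factorization: max |(Q*R - A)[i][j]| should be ~1e-15. */
    double err = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double qr = 0.0;
            for (int k = 0; k < N; k++) qr += Q[i][k] * R[k][j];
            double d = fabs(qr - A[i][j]);
            if (d > err) err = d;
        }
    printf("max reconstruction error = %.3e\n", err);
    return 0;
}

Each rotation touches only two rows of A (and two columns of Q), which is exactly the independence the paper exploits when packing Givens updates into VLIW instruction slots.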

