Implementing and Optimizing of Entire System Toolkit of VLIW DSP Processors for Embedded Sensor-Based Systems

2015 ◽  
Vol 2015 ◽  
pp. 1-7
Author(s):  
Xu Yang ◽  
Mingbin Zeng ◽  
Yanjun Zhang

VLIW DSPs can greatly enhance instruction-level parallelism, providing the capacity to meet the performance and energy-efficiency requirements of sensor-based systems. However, exploiting VLIW DSPs in the sensor-based domain imposes heavy challenges on software toolkit design. In this paper, we present our methods and experience in developing the system toolkit flow for a VLIW DSP designed specifically for sensor-based systems. Our system toolkit includes a compiler, assembler, linker, debugger, and simulator. We present experimental results for the compiler framework, which incorporates several state-of-the-art optimization techniques for this VLIW DSP. The results indicate that our framework substantially improves performance and energy consumption compared with code generated without it.

Author(s):  
S. Rajagopalan ◽  
S.P. Rajan ◽  
S. Malik ◽  
S. Rigo ◽  
G. Araujo ◽  
...  

2012 ◽  
Vol 4 (3) ◽  
pp. 48-62
Author(s):  
Slo-Li Chu ◽  
Chih-Chieh Hsiao

Heterogeneous platforms consisting of a CPU and add-on streaming processors are widely used in modern computer systems. These add-on processors provide substantially more computation capability and memory bandwidth than conventional multi-core platforms, and general-purpose computations can also be offloaded onto them. However, programming these streaming processors to realize their potential performance is challenging because of their diverse underlying architectural characteristics. Several optimization techniques are applied on OpenCL-compatible heterogeneous platforms to achieve thread-level, data-level, and instruction-level parallelism. The architectural implications of these techniques and optimization principles are discussed. Finally, a case study of the MRI-Q benchmark illustrates the capabilities of these optimization techniques. The experimental results reveal that the speedup from the non-optimized to the optimized kernel varies from 8 to 63 across the target platforms.
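The abstract does not show the kernel transformations themselves; the C sketch below illustrates, under stated assumptions, the kind of restructuring such studies apply to an MRI-Q-style accumulation loop: hoisting loop-invariant voxel coordinates and two-way unrolling with independent accumulators to expose instruction- and data-level parallelism (thread-level parallelism would come from splitting the outer voxel loop across OpenCL work-items). The function names, problem sizes, and unroll factor are illustrative assumptions, not the authors' code.

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979f
#define NK 256  /* k-space samples (illustrative size) */
#define NX 64   /* output voxels   (illustrative size) */

/* Baseline kernel: straightforward accumulation, one sin/cos pair per sample. */
static void mriq_naive(const float kx[], const float ky[], const float kz[],
                       const float mag[], const float x[], const float y[],
                       const float z[], float qr[], float qi[])
{
    for (int v = 0; v < NX; v++) {
        float re = 0.0f, im = 0.0f;
        for (int k = 0; k < NK; k++) {
            float ang = 2.0f * PI * (kx[k] * x[v] + ky[k] * y[v] + kz[k] * z[v]);
            re += mag[k] * cosf(ang);
            im += mag[k] * sinf(ang);
        }
        qr[v] = re;
        qi[v] = im;
    }
}

/* Restructured kernel: voxel coordinates are hoisted into registers and the
 * inner loop is unrolled two-way with independent accumulators, so the two
 * sin/cos chains can overlap (instruction-level parallelism) and the compiler
 * is free to vectorize them (data-level parallelism). Assumes NK is even. */
static void mriq_unrolled(const float kx[], const float ky[], const float kz[],
                          const float mag[], const float x[], const float y[],
                          const float z[], float qr[], float qi[])
{
    for (int v = 0; v < NX; v++) {
        const float xv = x[v], yv = y[v], zv = z[v];
        float re0 = 0.0f, im0 = 0.0f, re1 = 0.0f, im1 = 0.0f;
        for (int k = 0; k < NK; k += 2) {
            float a0 = 2.0f * PI * (kx[k]   * xv + ky[k]   * yv + kz[k]   * zv);
            float a1 = 2.0f * PI * (kx[k+1] * xv + ky[k+1] * yv + kz[k+1] * zv);
            re0 += mag[k]   * cosf(a0);  im0 += mag[k]   * sinf(a0);
            re1 += mag[k+1] * cosf(a1);  im1 += mag[k+1] * sinf(a1);
        }
        qr[v] = re0 + re1;
        qi[v] = im0 + im1;
    }
}

int main(void)
{
    static float kx[NK], ky[NK], kz[NK], mag[NK], x[NX], y[NX], z[NX];
    static float qr0[NX], qi0[NX], qr1[NX], qi1[NX];
    for (int k = 0; k < NK; k++) {
        kx[k] = 0.01f * k; ky[k] = 0.02f * k; kz[k] = 0.03f * k;
        mag[k] = 1.0f / (k + 1);
    }
    for (int v = 0; v < NX; v++) { x[v] = 0.1f * v; y[v] = 0.2f * v; z[v] = 0.3f * v; }
    mriq_naive(kx, ky, kz, mag, x, y, z, qr0, qi0);
    mriq_unrolled(kx, ky, kz, mag, x, y, z, qr1, qi1);
    printf("voxel 0: naive=(%f, %f)  restructured=(%f, %f)\n",
           qr0[0], qi0[0], qr1[0], qi1[0]);
    return 0;
}

Both versions compute the same result; the restructured form only reorders independent work so that the hardware (or an OpenCL compiler) can schedule it in parallel.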


2012 ◽  
Vol 47 (6) ◽  
pp. 347-358 ◽  
Author(s):  
Jun Liu ◽  
Yuanrui Zhang ◽  
Ohyoung Jang ◽  
Wei Ding ◽  
Mahmut Kandemir

2021 ◽  
Vol 11 (3) ◽  
pp. 1225
Author(s):  
Woohyong Lee ◽  
Jiyoung Lee ◽  
Bo Kyung Park ◽  
R. Young Chul Kim

Geekbench is one of the most referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic, but some aim to simulate real-world behavior. In the mobile world, its microarchitectural behavior has rarely been reported because hardware profiling features are not widely available to the public, so Geekbench's microarchitectural characteristics on mobile devices are hard to find despite its popularity as a mobile performance workload. In this paper, a thorough experimental study of Geekbench performance characterization is reported with detailed performance metrics. The study also identifies the impact of mobile system on chip (SoC) microarchitecture, such as the cache subsystem, instruction-level parallelism, and branch performance. From the study, we could locate the bottlenecks of the workloads, especially in the cache subsystem: a change in data set size significantly affects the performance score on some systems and can undermine the fairness of the CPU benchmark. In the experiment, Samsung's Exynos9820-based platform was used as the test device, with binaries built using the Android Native Development Kit (NDK). The Exynos9820 is a superscalar processor capable of dual-issuing some instructions. To support the performance analysis, we enabled the collection of performance events with performance monitoring unit (PMU) registers; the PMU is a set of hardware performance counters built into microprocessors to record counts of hardware-related activities. Throughout the experiment, functional and microarchitectural performance profiles were fully studied. The ARM DS5 tool was used to collect runtime PMU profiles, including OS-level performance data. After the comparative study, users will better understand mobile architecture behavior, which will help in evaluating which benchmark is preferable for a fair performance comparison.
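In the paper the profiles are collected with ARM DS5 and the Exynos PMU; as a minimal, hypothetical illustration of the same counter-based methodology on a generic Linux system, the C sketch below uses the perf_event_open syscall to count retired instructions and L1 data-cache read misses around a cache-unfriendly loop. The event selection and workload are assumptions for illustration only, not the authors' setup.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* perf_event_open has no glibc wrapper, so invoke it via syscall().
 * Counts the given event for the calling thread, user space only. */
static int perf_open(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd_insn = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
    int fd_miss = perf_open(PERF_TYPE_HW_CACHE,
                            PERF_COUNT_HW_CACHE_L1D |
                            (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                            (PERF_COUNT_HW_CACHE_RESULT_MISS << 16));
    if (fd_insn < 0 || fd_miss < 0) { perror("perf_event_open"); return 1; }

    /* Illustrative workload: a strided walk over a buffer larger than a
     * typical L1 data cache, so most loads miss in L1D. */
    enum { N = 1 << 22, STRIDE = 64 };
    char *buf = malloc(N);
    memset(buf, 1, N);

    ioctl(fd_insn, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_insn, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_ENABLE, 0);

    long sum = 0;
    for (int i = 0; i < N; i += STRIDE)
        sum += buf[i];

    ioctl(fd_insn, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t insns = 0, misses = 0;
    read(fd_insn, &insns, sizeof(insns));
    read(fd_miss, &misses, sizeof(misses));
    printf("instructions=%llu  l1d_read_misses=%llu  (sum=%ld)\n",
           (unsigned long long)insns, (unsigned long long)misses, sum);
    free(buf);
    return 0;
}

Reading the two counters before and after a region of interest is the same basic pattern the paper follows with its PMU-based profiling, just exposed through a different tool.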


Author(s):  
Dennis Wolf ◽  
Andreas Engel ◽  
Tajas Ruschke ◽  
Andreas Koch ◽  
Christian Hochberger

Coarse-Grained Reconfigurable Arrays (CGRAs), or Architectures, are a concept for hardware accelerators based on the idea of distributing workload over processing elements. These processors exploit instruction-level parallelism while being energy efficient due to their simple internal structure. However, their incorporation into a complete computing system raises severe challenges at the hardware and software level. This article evaluates in detail a CGRA integrated into a control engineering environment targeting a Xilinx Zynq System on Chip (SoC). Besides the actual application execution performance, the practicability of the configuration toolchain is validated. Challenges of the real-world integration are discussed and practical insights are highlighted.


Materials ◽  
2022 ◽  
Vol 15 (1) ◽  
pp. 384
Author(s):  
António Gaspar-Cunha ◽  
José A. Covas ◽  
Janusz Sikora

Given the global economic and societal importance of the polymer industry, the continuous search for improvements in the various processing techniques is of paramount practical importance. This review evaluates the application of optimization methodologies to the main polymer processing operations. The most important characteristics related to the use of optimization techniques, such as the nature of the objective function, the type of optimization algorithm, the modelling approach used to evaluate the solutions, and the parameters to optimize, are discussed. The aim is to identify the most important features of an optimization system for polymer processing problems and to define the best procedure for each particular practical situation. For this purpose, the state of the art of the optimization methodologies usually employed is first presented, followed by an extensive review of the literature dealing with the major processing techniques; the discussion is completed by considering both the characteristics identified and the available optimization methodologies. This first part of the review focuses on extrusion, namely single- and twin-screw extruders, extrusion dies, and calibrators. It is concluded that there is a set of methodologies that can be confidently applied in polymer processing with very good performance and without demanding computational requirements.


2017 ◽  
Vol 26 (09) ◽  
pp. 1750129 ◽  
Author(s):  
Mohamed Najoui ◽  
Mounir Bahtat ◽  
Anas Hatim ◽  
Said Belkouch ◽  
Noureddine Chabini

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in several signal processing applications, and its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. However, on some processor architectures, such as very long instruction word (VLIW), compiler efficiency alone is not enough to make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on a VLIW architecture. To overcome the compiler's limited ability to parallelize most of the Givens arithmetic operations, we propose a low-level instruction scheme that maximizes the parallelism rate and minimizes clock cycles. The key contributions of this work are as follows: (i) a new, fast, parallel design of the Givens algorithm based on VLIW features (i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP)), taking cache memory properties into account; (ii) an efficient data management approach that avoids cache misses and memory bank conflicts. Two DSP platforms, the C6678 and AK2H12, were used as implementation targets. The introduced parallel QR implementation achieves, on average, more than 12× and 6× speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme is at least 3.65 and 2.5 times faster than recent CPU and DSP implementations, respectively.
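The paper's contribution is a hand-scheduled, VLIW-specific implementation; for reference, the plain C sketch below shows an unoptimized Givens-rotation QR factorization of a small matrix, i.e., the kind of baseline that such low-level instruction schemes accelerate. The matrix size, test values, and accumulation of Q are illustrative assumptions, not the authors' kernels or data layout.

#include <math.h>
#include <stdio.h>
#include <string.h>

#define N 4  /* illustrative matrix size */

/* Plain Givens-rotation QR: on return, A holds the upper-triangular factor R
 * and Q holds the orthogonal factor, with A_original = Q * R. */
static void givens_qr(double A[N][N], double Q[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            Q[i][j] = (i == j) ? 1.0 : 0.0;

    for (int col = 0; col < N - 1; col++) {
        /* Zero the sub-diagonal entries of this column from the bottom up,
         * each rotation acting on the adjacent row pair (row-1, row). */
        for (int row = N - 1; row > col; row--) {
            double a = A[row - 1][col], b = A[row][col];
            double r = hypot(a, b);
            if (r == 0.0) continue;
            double c = a / r, s = b / r;
            for (int k = 0; k < N; k++) {          /* A := G * A   */
                double t1 = A[row - 1][k], t2 = A[row][k];
                A[row - 1][k] =  c * t1 + s * t2;
                A[row][k]     = -s * t1 + c * t2;
            }
            for (int k = 0; k < N; k++) {          /* Q := Q * G^T */
                double q1 = Q[k][row - 1], q2 = Q[k][row];
                Q[k][row - 1] =  c * q1 + s * q2;
                Q[k][row]     = -s * q1 + c * q2;
            }
        }
    }
}

int main(void)
{
    double A[N][N] = {{4, 1, 2, 0}, {1, 3, 0, 1}, {2, 0, 5, 2}, {0, 1, 2, 6}};
    double R[N][N], Q[N][N];
    memcpy(R, A, sizeof(A));
    givens_qr(R, Q);

    /* Verify the factorization: max |(Q*R - A)[i][j]| should be ~1e-15. */
    double err = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double qr = 0.0;
            for (int k = 0; k < N; k++) qr += Q[i][k] * R[k][j];
            double d = fabs(qr - A[i][j]);
            if (d > err) err = d;
        }
    printf("max reconstruction error = %.3e\n", err);
    return 0;
}

Each rotation touches only two rows of A (and two columns of Q), which is exactly the independence the paper exploits when packing Givens updates into VLIW instruction slots.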

