Research on Cipher Coprocessor Instruction Level Parallelism Compiler

Design Technique ◽

Improve Performance ◽

Specific Instruction ◽

Very Long Instruction Word ◽

Important Method ◽

Reconfigurable Design ◽

An important approach to studying cipher coprocessors focuses on the processor's system architecture in combination with reconfigurable design techniques. Improving the performance of a cipher coprocessor is a key concern. A specific-instruction cipher coprocessor is designed based on a very long instruction word (VLIW) structure and reconfigurable design techniques. This paper studies an instruction-level parallelism compilation technique for the cipher coprocessor, which enhances its performance by increasing instruction-level parallelism.

CODE OPTIMIZATION METHOD FOR QUALCOMM HEXAGON PROCESSOR, SUPPORTING INSTRUCTION LEVEL PARALLELISM AND BUILT WITH VLIW (Very Long Instruction Word) ARCHITECTURE

ITNOU: Information technologies in education, science and management ◽

10.47501/itnou.2021.1.105-115 ◽

2021 ◽

Vol 115 ◽

pp. 105-115

Author(s):

Tatiana Nikolaevna Romanova ◽

Dmitry Igorevich Gorin ◽

Keyword(s):

Optimization Method ◽

Code Optimization ◽

Very Long Instruction Word ◽

Running Time ◽

Level Parallelism ◽

Packet Density

A method is proposed for optimizing how a machine word is filled with independent instructions; it increases program performance by packing the maximum number of independent instructions into each packet. The paper also confirms the hypothesis that when the compiler switches to random register allocation, packet density increases, which in turn reduces the program's running time.

A NOVEL IMPLEMENTATION OF 32-BIT VLIW-MISC PROCESSOR ON FPGA

International Journal of Computer and Communication Technology ◽

10.47893/ijcct.2016.1339 ◽

2016 ◽

pp. 60-63

Author(s):

M. KAMARAJU ◽

M. ALEKHYA ◽

K.LAL KISHORE

Keyword(s):

Embedded Systems ◽

High Performance ◽

Optimal Choice ◽

Performance Level ◽

Very Long Instruction Word ◽

Risc Processor ◽

Level Parallelism ◽

High Performance Level

The main objective of this work is to implement a 32-bit pipelined RISC processor without interlocking stages. It is built on S.I.M.E (Single Instruction Multiple Execution), in which a single instruction triggers multiple executions, and on the VLIW (Very Long Instruction Word) architecture, which is an optimal choice for attaining a high performance level in embedded systems. In a VLIW-based architecture, the effectiveness of the processor depends on the ability of compilers to provide sufficient instruction-level parallelism (ILP). The processor has been designed in VHDL and synthesized using the Xilinx tool.

DESIGN AND IMPLEMENTATION OF CONFIGURABLE LFSR INSTRUCTIONS TARGETED AT STREAM CIPHER PROCESSING

Journal of Circuits System and Computers ◽

10.1142/s0218126613400367 ◽

2013 ◽

Vol 22 (10) ◽

pp. 1340036

Author(s):

ZIBIN DAI ◽

LONGMEI NAN ◽

XUAN YANG ◽

XIAONAN LI

Keyword(s):

High Performance ◽

Stream Cipher ◽

Reconfigurable Hardware ◽

System Structure ◽

Linear Feedback ◽

Specific Instruction ◽

Design And Implementation ◽

Operation Characteristic ◽

By analyzing the operating characteristics of linear feedback shift registers (LFSRs) in many public stream cipher algorithms and the bottlenecks of implementing them on a general-purpose processor, this paper proposes specific instructions and reconfigurable hardware cells that execute LFSR operations in parallel with high performance. The LFSR instructions support different operand data widths and different operating modes. Instruction-level parallelism based on the VLIW system structure, and parallelism within an instruction obtained by performing several steps at a time, are also exploited. Corresponding reconfigurable hardware units, which are configured to support the implementation of each instruction, are also developed. The circuit can serve as an important acceleration unit in special-purpose processing for stream ciphers.
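To make the idea of performing several LFSR steps at one time concrete, here is a minimal Python sketch of a 16-bit Fibonacci LFSR with a single-step function and a word-parallel multi-step function. The register width, tap positions, seed, and the names `step` and `step_k` are assumptions chosen for illustration; they are not taken from the paper or its instruction set.

```python
N = 16                # register width in bits (assumed)
TAPS = (0, 2, 3, 5)   # bit positions XORed to form the feedback (assumed)


def step(state: int):
    """Advance the LFSR by one step; return (new_state, output_bit)."""
    fb = 0
    for t in TAPS:
        fb ^= (state >> t) & 1
    out = state & 1
    return (state >> 1) | (fb << (N - 1)), out


def step_k(state: int, k: int):
    """Advance the LFSR by k steps using word-wide XORs.

    Valid while every feedback bit depends only on bits of the original
    state, i.e. for 1 <= k <= N - max(TAPS).
    Bit j of the returned output word is the bit emitted at step j.
    """
    if not 1 <= k <= N - max(TAPS):
        raise ValueError("k too large for this tap set")
    mask = (1 << k) - 1
    fb = 0
    for t in TAPS:
        # k feedback bits are computed at once from shifted copies of the state.
        fb ^= (state >> t) & mask
    out = state & mask
    return (state >> k) | (fb << (N - k)), out


seed = 0xACE1
ref_state, ref_bits = seed, 0
for j in range(8):
    ref_state, bit = step(ref_state)
    ref_bits |= bit << j

print(step_k(seed, 8) == (ref_state, ref_bits))
```

The multi-step version replaces eight dependent single-bit updates with one XOR per tap on 8-bit slices, which is the software analogue of computing several feedback bits in parallel in hardware.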

Microarchitectural Characterization on a Mobile Workload

Applied Sciences ◽

10.3390/app11031225 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1225

Author(s):

Woohyong Lee ◽

Jiyoung Lee ◽

Bo Kyung Park ◽

R. Young Chul Kim

Keyword(s):

Performance Monitoring ◽

Performance Metrics ◽

Performance Comparison ◽

Data Set ◽

Performance Events ◽

Hardware Performance Counters ◽

On Chip ◽

The Comparative Study ◽

Geekbench is one of the most referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic, but some of them aim to simulate real-world behavior. In the mobile world, its microarchitectural behavior has rarely been reported because the hardware profiling features available to the public are limited. Although Geekbench is a popular mobile performance workload, its microarchitectural characteristics on mobile devices are hard to find. In this paper, a thorough experimental study of Geekbench performance characterization is reported with detailed performance metrics. This study also identifies mobile system on chip (SoC) microarchitecture impacts, such as the cache subsystem, instruction-level parallelism, and branch performance. The study reveals the bottlenecks of the workloads, especially in the cache subsystem. This means that a change in data set size directly and significantly affects the performance score on some systems and undermines the fairness of the CPU benchmark. In the experiment, Samsung’s Exynos9820-based platform was used as the tested device with binaries built using the Android Native Development Kit (NDK). The Exynos9820 is a superscalar processor capable of dual issuing some instructions. To help performance analysis, we enable the capability to collect performance events with performance monitoring unit (PMU) registers. The PMU is a set of hardware performance counters which are built into microprocessors to store the counts of hardware-related activities. Throughout the experiment, functional and microarchitectural performance profiles were fully studied. This paper describes the details of the mobile performance studies above. In our experiment, the ARM DS5 tool was used for collecting runtime PMU profiles including OS-level performance data. After the comparative study is completed, users will understand more about the mobile architecture behavior, and this will help to evaluate which benchmark is preferable for fair performance comparison.

UltraSynth: Insights of a CGRA Integration into a Control Engineering Environment

Journal of Signal Processing Systems ◽

10.1007/s11265-021-01641-7 ◽

2021 ◽

Author(s):

Dennis Wolf ◽

Andreas Engel ◽

Tajas Ruschke ◽

Andreas Koch ◽

Christian Hochberger

Keyword(s):

Computing System ◽

Coarse Grained ◽

Control Engineering ◽

Processing Elements ◽

Actual Application ◽

Reconfigurable Arrays ◽

Engineering Environment ◽

On Chip ◽

Coarse Grained Reconfigurable Arrays (CGRAs) or Architectures are a concept for hardware accelerators based on the idea of distributing workload over Processing Elements. These processors exploit instruction level parallelism, while being energy efficient due to their simplistic internal structure. However, the incorporation into a complete computing system raises severe challenges at the hardware and software level. This article evaluates a CGRA integrated into a control engineering environment targeting a Xilinx Zynq System on Chip (SoC) in detail. Besides the actual application execution performance, the practicability of the configuration toolchain is validated. Challenges of the real-world integration are discussed and practical insights are highlighted.

NCOR: An FPGA-Friendly Nonblocking Data Cache for Soft Processors with Runahead Execution

International Journal of Reconfigurable Computing ◽

10.1155/2012/915178 ◽

2012 ◽

Vol 2012 ◽

pp. 1-12 ◽

Cited By ~ 2

Author(s):

Kaveh Aasaraai ◽

Andreas Moshovos

Keyword(s):

High Efficiency ◽

Main Memory ◽

Data Cache ◽

Improve Performance ◽

Data Caches ◽

Content Addressable Memories ◽

Processor Designs ◽

Level Parallelism ◽

Order Execution ◽

Runahead Execution

Soft processors often use data caches to reduce the gap between processor and main memory speeds. To achieve high efficiency, simple, blocking caches are used. Such caches are not appropriate for processor designs such as Runahead and out-of-order execution that require non-blocking caches to tolerate main memory latencies. Instead, these processors use non-blocking caches to extract memory level parallelism and improve performance. However, conventional non-blocking cache designs are expensive and slow on FPGAs as they use content-addressable memories (CAMs). This work proposes NCOR, an FPGA-friendly non-blocking cache that exploits the key properties of Runahead execution. NCOR does not require CAMs and utilizes smart cache controllers. A 4 KB NCOR operates at 329 MHz on Stratix III FPGAs while it uses only 270 logic elements. A 32 KB NCOR operates at 278 MHz and uses 269 logic elements.

Proceedings of the third international conference on Architectural support for programming languages and operating systems - ASPLOS-III ◽

Available instruction-level parallelism for superscalar and superpipelined machines

10.1145/70082.68207 ◽

1989 ◽

Cited By ~ 165

Author(s):

N. P. Jouppi ◽

D. W. Wall

Topic 8 Parallel Computer Architecture and Instruction-Level Parallelism

Euro-Par 2003 Parallel Processing - Lecture Notes in Computer Science ◽

10.1007/978-3-540-45209-6_78 ◽

2003 ◽

pp. 541-542

Author(s):

Stamatis Vassiliadis ◽

Nikitas Dimopoulos ◽

Jean-Francois Collard ◽

Arndt Bode

Keyword(s):

Computer Architecture ◽

Parallel Computer ◽

VLIW DSP-Based Low-Level Instruction Scheme of Givens QR Decomposition for Real-Time Processing

Journal of Circuits System and Computers ◽

10.1142/s0218126617501298 ◽

2017 ◽

Vol 26 (09) ◽

pp. 1750129 ◽

Cited By ~ 2

Author(s):

Mohamed Najoui ◽

Mounir Bahtat ◽

Anas Hatim ◽

Said Belkouch ◽

Noureddine Chabini

Keyword(s):

High Performance ◽

Qr Decomposition ◽

Numerical Linear Algebra ◽

Management Approach ◽

Real Time Processing ◽

Low Level ◽

Processor Architectures ◽

Efficient Data ◽

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in several signal processing applications. Its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned in order to take advantage of the architectural features of these new processors. However, in some processor architectures such as very long instruction word (VLIW), compiler efficiency is not enough to make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on the VLIW architecture. To overcome the compiler's limited ability to parallelize most of the Givens arithmetic operations, we propose a low-level instruction scheme that could maximize the parallelism rate and minimize clock cycles. The key contributions of this work are as follows: (i) A new parallel and fast version of the Givens algorithm based on the VLIW features (i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP)), including the cache memory properties. (ii) An efficient data management approach to avoid cache misses and memory bank conflicts. Two DSP platforms, C6678 and AK2H12, were used as targets for implementation. The introduced parallel QR implementation method achieves, on average, more than 12× and 6× speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme implementation is at least 3.65 and 2.5 times faster than the recent CPU and DSP implementations, respectively.
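For readers unfamiliar with the kernel this abstract refers to, the following Python sketch shows a plain, textbook Givens-rotation QR decomposition. It is only the standard sequential algorithm, not the paper's low-level VLIW instruction scheme; the function name `givens_qr`, the list-of-lists matrix layout, and the order in which entries are eliminated are assumptions made for the example.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def givens_qr(a: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (Q, R) with A = Q R, Q orthogonal and R upper triangular."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]  # working copy that becomes R
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero the entries below the diagonal in column j, bottom row first.
        for i in range(m - 1, j, -1):
            x, y = r[j][j], r[i][j]
            if y == 0.0:
                continue  # already zero, no rotation needed
            h = math.hypot(x, y)
            c, s = x / h, y / h
            # Apply the rotation to rows j and i of R.
            for k in range(n):
                rj, ri = r[j][k], r[i][k]
                r[j][k] = c * rj + s * ri
                r[i][k] = -s * rj + c * ri
            # Accumulate the transposed rotation into columns j and i of Q.
            for k in range(m):
                qj, qi = q[k][j], q[k][i]
                q[k][j] = c * qj + s * qi
                q[k][i] = -s * qj + c * qi
    return q, r


A = [[6.0, 5.0, 0.0],
     [5.0, 1.0, 4.0],
     [0.0, 4.0, 3.0]]
Q, R = givens_qr(A)
```

Each rotation touches only two rows, and the inner loops over `k` consist of independent multiply-add pairs. That regular, dependence-free inner work is the kind of arithmetic a VLIW or SIMD schedule can in principle spread across functional units.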

Software thread integration for instruction-level parallelism

ACM Transactions on Embedded Computing Systems ◽

10.1145/2512466 ◽

2013 ◽

Vol 13 (1) ◽

pp. 1-23

Author(s):

Won So ◽

Alexander G. Dean

Keyword(s):

Level Parallelism ◽

Software Thread Integration