DESIGN AND IMPLEMENTATION OF CONFIGURABLE LFSR INSTRUCTIONS TARGETED AT STREAM CIPHER PROCESSING

By analyzing the operating characteristics of linear feedback shift registers (LFSRs) in many public stream cipher algorithms, and the bottlenecks that arise when they are implemented on a general-purpose processor, this paper proposes dedicated instructions and a reconfigurable hardware cell that execute LFSR operations in parallel with high performance. The LFSR instructions support different operand widths and different operating modes. Both instruction-level parallelism, based on a VLIW architecture, and parallelism within an instruction, obtained by advancing the LFSR several steps at a time, are exploited. Corresponding reconfigurable hardware units, which implement each instruction through configuration, are also developed. The circuit can serve as an important acceleration unit in a special-purpose processor for stream ciphers.
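To make the idea of advancing an LFSR several steps at a time concrete, here is a minimal Python sketch. It is not the paper's instruction set or hardware: the function names `lfsr_step` and `lfsr_multi_step`, the right-shifting Fibonacci form, and the example tap set are all assumptions chosen for illustration. The width and taps are parameters, loosely mirroring the configurable operand widths mentioned in the abstract.

```python
def lfsr_step(state, taps, width):
    """Advance a right-shifting Fibonacci LFSR by one step.

    state: current register contents (bit 0 is the output end)
    taps:  bit positions XORed together to form the feedback bit
    Returns (new_state, output_bit).
    """
    out = state & 1
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1
    return (state >> 1) | (fb << (width - 1)), out


def lfsr_multi_step(state, taps, width, k):
    """Advance the same LFSR by k steps with word-level operations.

    Valid only while every feedback bit depends on original state bits,
    i.e. k <= width - max(taps). Returns (new_state, output_word), where
    bit i of output_word is the bit emitted at step i.
    """
    if not 1 <= k <= width - max(taps):
        raise ValueError("k must satisfy 1 <= k <= width - max(taps)")
    mask = (1 << k) - 1
    fb_word = 0
    for t in taps:
        fb_word ^= state >> t      # k feedback bits computed at once
    fb_word &= mask
    out_word = state & mask        # the k bits shifted out
    return (state >> k) | (fb_word << (width - k)), out_word


# Hypothetical 16-bit configuration used only as an example.
state, word = lfsr_multi_step(0xACE1, (0, 2, 3, 5), 16, 8)
```

The multi-step version produces the same state and output bits as calling `lfsr_step` k times, but with a fixed number of shifts and XORs, which is the kind of unrolling a hardware unit can evaluate in a single cycle.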

VLIW DSP-Based Low-Level Instruction Scheme of Givens QR Decomposition for Real-Time Processing

Journal of Circuits, Systems and Computers ◽ 10.1142/s0218126617501298 ◽ 2017 ◽ Vol 26 (09) ◽ pp. 1750129 ◽ Cited By ~ 2

Author(s): Mohamed Najoui ◽ Mounir Bahtat ◽ Anas Hatim ◽ Said Belkouch ◽ Noureddine Chabini

Keyword(s): High Performance ◽ QR Decomposition ◽ Numerical Linear Algebra ◽ Instruction Level Parallelism ◽ Management Approach ◽ Real Time Processing ◽ Low Level ◽ Processor Architectures ◽ Efficient Data ◽ Level Parallelism

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in signal processing applications, and its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. However, in some processor architectures, such as very long instruction word (VLIW), the compiler alone is not efficient enough to make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on a VLIW architecture. To overcome the compiler's limited ability to parallelize most of the Givens arithmetic operations, we propose a low-level instruction scheme that maximizes the parallelism rate and minimizes clock cycles. The key contributions of this work are as follows: (i) a new parallel and fast version of the Givens algorithm based on VLIW features, i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP), and taking the cache memory properties into account; (ii) an efficient data management approach that avoids cache misses and memory bank conflicts. Two DSP platforms, C6678 and AK2H12, were used as implementation targets. The proposed parallel QR implementation achieves, on average, speedups of more than 12× and 6× over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme is at least 3.65 and 2.5 times faster than recent CPU and DSP implementations, respectively.
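For readers unfamiliar with the kernel being optimized, the following is a minimal reference sketch of Givens QR in plain Python. It is the textbook algorithm, not the paper's low-level VLIW instruction scheme; the function name `givens_qr` and the list-of-lists matrix representation are assumptions made for illustration.

```python
import math


def givens_qr(a):
    """Factor an m x n matrix (list of rows) as A = Q R using Givens rotations.

    Returns (q, r) with q an m x m orthogonal matrix and r upper triangular.
    """
    m, n = len(a), len(a[0])
    r = [[float(v) for v in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero the entries below the diagonal of column j, bottom up.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            if y == 0.0:
                continue
            h = math.hypot(x, y)
            c, s = x / h, y / h
            # Rotate rows i-1 and i of R; each column k is independent.
            for k in range(n):
                u, v = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * u + s * v
                r[i][k] = -s * u + c * v
            # Accumulate the transposed rotation into columns i-1 and i of Q.
            for k in range(m):
                u, v = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * u + s * v
                q[k][i] = -s * u + c * v
    return q, r


q, r = givens_qr([[6.0, 5.0, 0.0], [5.0, 1.0, 4.0], [0.0, 4.0, 3.0]])
```

The two inner loops over `k` perform the same multiply-add pattern on independent elements, which is the sort of work that data-level and instruction-level parallelism can spread across functional units.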

The Design and Implementation of a Heterogeneous Multi-Core Security Chip Architecture Based on Shared Memory System

Applied Mechanics and Materials ◽ 10.4028/www.scientific.net/amm.668-669.1314 ◽ 2014 ◽ Vol 668-669 ◽ pp. 1314-1318

Author(s): Lei Zhang ◽ Ren Ping Dong ◽ Chang Zhang ◽ Ya Ping Yu

Keyword(s): Low Power ◽ Computer Architecture ◽ Shared Memory ◽ High Performance ◽ Stream Cipher ◽ IC Card ◽ Task Partitioning ◽ Core System ◽ Design And Implementation ◽ Encryption And Decryption

Traditional SoC chips cannot deliver the encryption and decryption speed and the low power consumption that today's diverse computing needs demand, so we present a heterogeneous multi-core system, based on shared memory, on the Xilinx Virtex-5 platform. This paper studies in depth the heterogeneous multi-core cryptographic architecture, static task partitioning, the scheduling strategy, and the communication mechanism between cores. A three-core system is designed and built on shared memory to implement the ZUC stream cipher algorithm on the Virtex-5 platform. The three MicroBlaze cores handle inter-core communication, execution of the ZUC algorithm, and interfacing with the IC card to read keys. The three-core design exploits parallelism at the hardware, software, and computer-architecture levels to improve the performance of the algorithm and achieve high-performance green computing.

Research on Cipher Coprocessor Instruction Level Parallelism Compiler

Applied Mechanics and Materials ◽ 10.4028/www.scientific.net/amm.130-134.2907 ◽ 2011 ◽ Vol 130-134 ◽ pp. 2907-2910

Author(s): Hong Yan Li

Keyword(s): System Architecture ◽ Instruction Level Parallelism ◽ Design Technique ◽ Improve Performance ◽ Specific Instruction ◽ Very Long Instruction Word ◽ Important Method ◽ Reconfigurable Design ◽ Level Parallelism

An important approach to studying cipher coprocessors is to focus on the processor's system architecture in combination with reconfigurable design techniques. Improving the performance of the cipher coprocessor is a key concern. Based on a very long instruction word (VLIW) structure and reconfigurable design techniques, a cipher coprocessor with specific instructions is designed. This paper studies an instruction-level parallelism compilation technique that enhances the coprocessor's performance by increasing instruction-level parallelism.
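As a rough illustration of what a compiler does when it raises instruction-level parallelism for a VLIW target, the sketch below greedily packs operations into fixed-width bundles while respecting data dependences. It is a generic list-scheduling toy, not the compilation technique of this paper; the function name `pack_bundles`, the unit-latency assumption, and the operation names are all hypothetical.

```python
def pack_bundles(ops, deps, width):
    """Greedily pack operations into VLIW bundles of at most `width` slots.

    ops:   operation names in program order (each op appears after the
           operations it depends on)
    deps:  mapping from an op to the set of ops whose results it needs
    Assumes every operation has a latency of one cycle.
    Returns a list of bundles, one per cycle.
    """
    cycle_of = {}
    bundles = []
    for op in ops:
        # An op may issue only after all of its producers have issued.
        earliest = 0
        for d in deps.get(op, ()):
            earliest = max(earliest, cycle_of[d] + 1)
        # Take the first bundle at or after that cycle with a free slot.
        c = earliest
        while c < len(bundles) and len(bundles[c]) >= width:
            c += 1
        while len(bundles) <= c:
            bundles.append([])
        bundles[c].append(op)
        cycle_of[op] = c
    return bundles


# Hypothetical five-operation fragment on a 2-issue machine.
schedule = pack_bundles(
    ["a", "b", "c", "d", "e"],
    {"c": {"a", "b"}, "d": {"c"}, "e": {"a"}},
    width=2,
)
# schedule == [["a", "b"], ["c", "e"], ["d"]]
```

With two issue slots the five operations finish in three cycles instead of five, which is the kind of gain the abstract attributes to increasing instruction-level parallelism.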

A Ubiquitous Processor Built-in a Waved Multifunctional Unit

ECTI Transactions on Computer and Information Technology (ECTI-CIT) ◽ 10.37936/ecti-cit.201041.54218 ◽ 1970 ◽ Vol 4 (1) ◽ pp. 1-7

Author(s): Masa-aki Fukase ◽ Tomoaki Sato

Keyword(s): Mobile Applications ◽ High Performance ◽ Scale Up ◽ Instruction Scheduling ◽ Instruction Level Parallelism ◽ Floating Point ◽ Ubiquitous Systems ◽ Wide Range ◽ Wave Pipelining ◽ Level Parallelism

In developing cutting-edge VLSI processors, parallelism is one of the most important standard strategies for achieving power-conscious high performance. These features are even more critical for ubiquitous systems, which place great demands on mobile multimedia processing. One of the most important issues for ubiquitous systems is therefore instruction scheduling, because the floating-point units indispensable for mobile multimedia applications have longer latency than integer units. Although software parallelism has been unavoidable for fully utilizing the hardware parallelism between regular scalar units, it has been awkward in practice. In this article we therefore describe a double scheme for achieving ILP (instruction-level parallelism) that is free of instruction scheduling, and apply it to HCgorilla, a ubiquitous processor we have developed. The double scheme consists of multifunctionalizing the scalar units and wave-pipelining the resulting multifunctional unit (MFU). Multifunctionalization removes the need for instruction scheduling, and wave-pipelining recovers the clock speed lost to the scale-up of the multifunctional circuit. HCgorilla with the built-in waved MFU is promising for wide-range dynamic ILP at a rate higher than that of regular processors.

Natural instruction level parallelism-aware compiler for high-performance QueueCore processor architecture

The Journal of Supercomputing ◽ 10.1007/s11227-010-0409-z ◽ 2010 ◽ Vol 57 (3) ◽ pp. 314-338 ◽ Cited By ~ 3

Author(s): Ben Abdallah Abderazek ◽ Masashi Masuda ◽ Arquimedes Canedo ◽ Kenichi Kuroda

Keyword(s): High Performance ◽ Instruction Level Parallelism ◽ Processor Architecture ◽ Level Parallelism

A NOVEL IMPLEMENTATION OF 32-BIT VLIW-MISC PROCESSOR ON FPGA

International Journal of Computer and Communication Technology ◽ 10.47893/ijcct.2016.1339 ◽ 2016 ◽ pp. 60-63

Author(s): M. Kamaraju ◽ M. Alekhya ◽ K. Lal Kishore

Keyword(s): Embedded Systems ◽ High Performance ◽ Optimal Choice ◽ Performance Level ◽ Instruction Level Parallelism ◽ Very Long Instruction Word ◽ RISC Processor ◽ Level Parallelism ◽ High Performance Level

The main objective of this work is to implement a 32-bit pipelined RISC processor without interlocking stages. It is developed using S.I.M.E (Single Instruction Multiple Execution), in which a single instruction triggers multiple executions, and is based on the VLIW (Very Long Instruction Word) architecture, an optimal choice for obtaining a high performance level in embedded systems. In a VLIW-based architecture, the effectiveness of the processor depends on the ability of the compiler to provide sufficient instruction-level parallelism (ILP). The processor has been designed in VHDL and synthesized using the Xilinx tool.

Multicore Challenges and Benefits for High Performance Scientific Computing

Scientific Programming ◽ 10.1155/2008/450818 ◽ 2008 ◽ Vol 16 (4) ◽ pp. 277-285 ◽ Cited By ~ 5

Author(s): Ida M.B. Nielsen ◽ Curtis L. Janssen

Keyword(s): Message Passing ◽ High Performance ◽ Programming Model ◽ Instruction Level Parallelism ◽ Performance Improvements ◽ Processor Performance ◽ Multiple Threads ◽ Moller Plesset ◽ Multicore Chips ◽ Level Parallelism

Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposing extreme levels of software parallelism. Here we discuss the architecture of parallel computers constructed from many multicore chips, as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We illustrate these ideas with a hybrid distributed-memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
