DESIGN AND IMPLEMENTATION OF CONFIGURABLE LFSR INSTRUCTIONS TARGETED AT STREAM CIPHER PROCESSING

2013 ◽  
Vol 22 (10) ◽  
pp. 1340036
Author(s):  
ZIBIN DAI ◽  
LONGMEI NAN ◽  
XUAN YANG ◽  
XIAONAN LI

By analyzing the operating characteristics of the linear feedback shift registers (LFSRs) used in many public stream cipher algorithms, and the bottleneck they create when implemented on a general-purpose processor, this paper proposes dedicated instructions and reconfigurable hardware cells that execute LFSR operations in parallel with high performance. The LFSR instructions support different operand data widths and different operating modes. The design exploits both instruction-level parallelism, based on a VLIW architecture, and intra-instruction parallelism, by performing several LFSR steps at once. Corresponding reconfigurable hardware units, which are configured to implement each instruction efficiently, are also developed. The circuit can serve as an important acceleration unit in processors specialized for stream cipher processing.
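
To make the LFSR operations concrete, here is a minimal software sketch, assuming a Fibonacci-style LFSR whose width and tap positions are parameters (loosely mirroring the configurable data widths described above). The names `lfsr_step` and `lfsr_steps` are hypothetical, and the loop in `lfsr_steps` only emulates in software what a multi-step instruction would compute in a single operation; it is not the paper's hardware design.

```python
def lfsr_step(state: int, taps: int, width: int) -> tuple[int, int]:
    """Advance a Fibonacci LFSR by one step (hypothetical helper).

    state: current register contents (width bits)
    taps:  bit mask of the tapped positions
    Returns (new_state, output_bit); the output is the LSB shifted out.
    """
    mask = (1 << width) - 1
    out = state & 1
    fb = bin(state & taps).count("1") & 1   # XOR (parity) of the tapped bits
    state = ((state >> 1) | (fb << (width - 1))) & mask
    return state, out


def lfsr_steps(state: int, taps: int, width: int, k: int) -> tuple[int, int]:
    """Advance k steps; return (new_state, k output bits packed LSB-first).

    A multi-step LFSR instruction would produce this result in one operation;
    here it is simply k single steps in sequence.
    """
    bits = 0
    for i in range(k):
        state, out = lfsr_step(state, taps, width)
        bits |= out << i
    return state, bits
```

With `width=4` and `taps=0b0011`, the feedback implements the recurrence s(n+4) = s(n) XOR s(n+1), whose characteristic polynomial x^4 + x + 1 is primitive, so any nonzero seed cycles through all 15 nonzero states.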

2017 ◽  
Vol 26 (09) ◽  
pp. 1750129 ◽  
Author(s):  
Mohamed Najoui ◽  
Mounir Bahtat ◽  
Anas Hatim ◽  
Said Belkouch ◽  
Noureddine Chabini

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in signal processing applications, and its implementation has a significant impact on system performance. As processor architectures continue to gain ground in high-performance computing, QRD algorithms must be redesigned to take advantage of the architectural features of these new processors. However, on some architectures, such as very long instruction word (VLIW), compiler efficiency is not sufficient to make effective use of the available computational resources. This paper presents an efficient, optimized approach to implementing Givens QRD on a low-power platform based on a VLIW architecture. To overcome the compiler's limits in parallelizing most of the Givens arithmetic operations, we propose a low-level instruction scheme that maximizes the degree of parallelism and minimizes clock cycles. The key contributions of this work are as follows: (i) a new, fast parallel design of the Givens algorithm that exploits VLIW features (i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP)) as well as cache memory properties; (ii) an efficient data management approach that avoids cache misses and memory bank conflicts. Two DSP platforms, the C6678 and the AK2H12, were used as implementation targets. On average, the proposed parallel QR implementation achieves speedups of more than 12× and 6× over the standard algorithm and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed implementation is at least 3.65 and 2.5 times faster than recent CPU and DSP implementations, respectively.
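
As a baseline for the parallel scheme described above, here is a plain sequential Givens QR, a sketch that assumes a dense m x n matrix stored as nested Python lists. The names `givens` and `givens_qr` are hypothetical, and none of the paper's C66x-specific optimizations (ILP/DLP restructuring, cache-aware data layout) are reproduced.

```python
import math


def givens(a: float, b: float) -> tuple[float, float]:
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r


def givens_qr(A: list[list[float]]) -> tuple[list[list[float]], list[list[float]]]:
    """Unoptimized Givens QR: returns (Q, R) with A = Q R.

    Q is m x m orthogonal, R is m x n upper triangular.
    """
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        for i in range(m - 1, j, -1):          # zero R[i][j], bottom to top
            c, s = givens(R[i - 1][j], R[i][j])
            for k in range(n):                 # rotate rows i-1 and i of R
                x, y = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * x + s * y
                R[i][k] = -s * x + c * y
            for k in range(m):                 # accumulate Q <- Q G^T
                x, y = Q[k][i - 1], Q[k][i]
                Q[k][i - 1] = c * x + s * y
                Q[k][i] = -s * x + c * y
    return Q, R
```

Each rotation touches only two rows, so rotations on disjoint row pairs are independent of one another; that independence is the kind of parallelism a hand-scheduled VLIW implementation can expose where a compiler may not.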


2014 ◽  
Vol 668-669 ◽  
pp. 1314-1318
Author(s):  
Lei Zhang ◽  
Ren Ping Dong ◽  
Chang Zhang ◽  
Ya Ping Yu

Traditional SoC chips cannot meet the encryption/decryption speed and low-power requirements of today's diverse computing needs. We therefore present a heterogeneous multi-core system based on shared memory on the Xilinx Virtex-5 platform. This paper studies in depth the heterogeneous multi-core cipher architecture, static task partitioning, the scheduling strategy, and the inter-core communication mechanism. A three-core system is designed and built on shared memory to implement the ZUC stream cipher algorithm on the Virtex-5 platform. The three MicroBlaze cores are responsible for inter-core communication, for executing the ZUC algorithm, and for interfacing with an IC card to read keys. The three-core design fully exploits hardware, software, and architectural parallelism at all levels to improve the algorithm's performance and achieve high-performance green computing.
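
The ZUC keystream generator is built around a 16-cell LFSR over the prime field GF(2^31 - 1). The sketch below shows only its work-mode feedback as given in the 3GPP ZUC specification, s16 = 2^15*s15 + 2^17*s13 + 2^21*s10 + 2^20*s4 + (1 + 2^8)*s0 mod (2^31 - 1). It omits bit reorganization, the nonlinear function F, and initialization mode, and it is not taken from the paper's MicroBlaze code; all helper names are hypothetical.

```python
P = (1 << 31) - 1  # ZUC's LFSR cells hold elements of GF(2^31 - 1)


def mul_pow2(x: int, k: int) -> int:
    """x * 2^k mod (2^31 - 1), computed as a 31-bit left rotation of x."""
    return ((x << k) | (x >> (31 - k))) & P


def add_mod(a: int, b: int) -> int:
    """a + b mod (2^31 - 1) via end-around carry (inputs in [0, P])."""
    c = a + b
    return (c & P) + (c >> 31)


def lfsr_feedback(s: list[int]) -> int:
    """Work-mode feedback s16 computed from the 16-cell state s[0..15]."""
    v = mul_pow2(s[15], 15)
    v = add_mod(v, mul_pow2(s[13], 17))
    v = add_mod(v, mul_pow2(s[10], 21))
    v = add_mod(v, mul_pow2(s[4], 20))
    v = add_mod(v, mul_pow2(s[0], 8))
    v = add_mod(v, s[0])
    return P if v == 0 else v      # the spec replaces 0 by 2^31 - 1


def lfsr_work_step(s: list[int]) -> list[int]:
    """Shift the register by one cell and append the new feedback value."""
    return s[1:] + [lfsr_feedback(s)]
```

Because 2^31 is congruent to 1 modulo 2^31 - 1, multiplication by a power of two reduces to a 31-bit rotation and modular addition to an end-around carry, so this step needs only simple integer operations.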


2011 ◽  
Vol 130-134 ◽  
pp. 2907-2910
Author(s):  
Hong Yan Li

An important approach to studying cipher coprocessors focuses on the processor's system architecture combined with reconfigurable design techniques, and improving cipher coprocessor performance is a key concern. Based on a very long instruction word (VLIW) structure and reconfigurable design techniques, a specific-instruction cipher coprocessor is designed. This paper studies instruction-level parallelism (ILP) compilation techniques for the cipher coprocessor, which enhance its performance by increasing the instruction-level parallelism.
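
ILP compilation for a VLIW coprocessor ultimately packs independent operations into the same long instruction word. A minimal greedy list-scheduling sketch follows, assuming unit latency for every operation and a fixed number of identical issue slots; the function name and the toy dependence graph are hypothetical and not taken from the paper.

```python
def schedule_bundles(ops: list[str], deps: dict[str, set[str]], slots: int) -> list[list[str]]:
    """Greedy list scheduling of operations into VLIW bundles.

    ops:   operation names in program (priority) order
    deps:  op -> set of ops whose results it needs (unit latency assumed)
    slots: number of issue slots per long instruction word
    An op may issue only once all of its predecessors finished in earlier
    bundles, so ops placed in the same bundle are mutually independent.
    """
    done: set[str] = set()
    remaining = list(ops)
    bundles: list[list[str]] = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) == slots:
                break
            if deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("cyclic or unsatisfiable dependencies")
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
    return bundles
```

For a toy round where x1, x2, x3 are independent, y1 needs x1 and x2, y2 needs x3, and z needs y1 and y2, two slots give four bundles and three slots give three, which illustrates how wider instruction words pay off only when the compiler finds enough independent work.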


Author(s):  
Masa-aki Fukase ◽  
Tomoaki Sato

In developing cutting-edge VLSI processors, parallelism is one of the most important and widely adopted strategies for achieving power-conscious high performance. These features are even more critical for ubiquitous systems, which face heavy demand for multimedia mobile processing. One of the most important issues for ubiquitous systems is therefore instruction scheduling, because the floating-point units indispensable for multimedia mobile applications have longer latencies than integer units. Although software parallelism has been essential to fully utilize the hardware parallelism among regular scalar units, it has been cumbersome to achieve. In this article, we therefore describe a double scheme that achieves ILP (instruction-level parallelism) free of instruction scheduling, and we apply it to HCgorilla, a ubiquitous processor we have previously developed. The double scheme consists of multifunctionalizing the scalar units and wave-pipelining the resulting multifunctional unit (MFU). Multifunctionalization removes the need for instruction scheduling, and wave-pipelining recovers the clock-speed reduction caused by scaling up the multifunctional circuit. HCgorilla with the built-in wave-pipelined MFU is promising for wide-range dynamic ILP at a rate higher than that of regular processors.
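
Why multifunctional units relax scheduling can be seen in a toy in-order issue model. The sketch below assumes fully pipelined units, no data dependencies, and a simple "stall on the first op that finds no capable free unit" rule; it is a hypothetical illustration, not a model of HCgorilla's actual pipeline or of wave-pipelining.

```python
def issue_cycles(ops: list[str], units: list[set[str]]) -> int:
    """Cycles needed to issue ops in order under a toy model (hypothetical).

    ops:   operation classes in program order, e.g. "int" or "fp"
    units: one set per functional unit, listing the classes it accepts
    Each cycle, ops issue in order while some free unit accepts the next op;
    the first op without a free capable unit stalls until the next cycle.
    Data dependencies and multi-cycle latencies are ignored.
    """
    for op in ops:
        if not any(op in u for u in units):
            raise ValueError(f"no unit can execute {op!r}")
    cycles, i = 0, 0
    while i < len(ops):
        cycles += 1
        free = list(units)          # every unit is free at the start of a cycle
        while i < len(ops):
            unit = next((u for u in free if ops[i] in u), None)
            if unit is None:
                break               # in-order stall
            free.remove(unit)
            i += 1
    return cycles


TYPED = [{"int"}, {"fp"}]               # one specialized unit of each kind
MFU = [{"int", "fp"}, {"int", "fp"}]    # two multifunctional units
```

For the sequence fp, fp, int, fp, fp, int, the typed configuration needs four cycles while the two MFUs need three, and for six consecutive fp operations the gap widens to six versus three, without any reordering of the instruction stream.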


2010 ◽  
Vol 57 (3) ◽  
pp. 314-338 ◽  
Author(s):  
Ben Abdallah Abderazek ◽  
Masashi Masuda ◽  
Arquimedes Canedo ◽  
Kenichi Kuroda

Author(s):  
M. KAMARAJU ◽  
M. ALEKHYA ◽  
K. LAL KISHORE

The main objective of this work is to implement a 32-bit pipelined RISC processor without interlocked pipeline stages. It is developed using S.I.M.E (Single Instruction Multiple Execution), a scheme in which a single instruction can trigger multiple executions, and it is based on the VLIW (Very Long Instruction Word) architecture, an optimal choice for obtaining high performance in embedded systems. In a VLIW-based architecture, the effectiveness of the processor depends on the ability of the compiler to provide sufficient instruction-level parallelism (ILP). The processor was designed in VHDL and synthesized using Xilinx tools.
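
Without pipeline interlocks, the hardware does not stall on data hazards, so the compiler must keep dependent instructions far enough apart. A minimal sketch of that padding step follows, assuming straight-line code, a single uniform result latency, and handling of read-after-write hazards only; the function name and instruction encoding are hypothetical.

```python
def insert_nops(program, latency):
    """Pad straight-line code with NOPs for a pipeline without interlocks.

    program: list of (dest, srcs) tuples, e.g. ("r1", ("r2", "r3"));
             dest may be None for instructions that write no register
    latency: a result written by the instruction in slot i can be read by
             an instruction in slot i + latency or later (assumed uniform)
    Returns the padded program; each inserted NOP is represented as None.
    """
    out = []
    ready = {}  # register -> first slot where its latest value is readable
    for dest, srcs in program:
        earliest = max((ready.get(r, 0) for r in srcs), default=0)
        while len(out) < earliest:  # no hardware interlock: software stalls
            out.append(None)
        out.append((dest, srcs))
        if dest is not None:
            ready[dest] = len(out) - 1 + latency
    return out
```

With a latency of 3, the sequence r1 = r2 op r3; r4 = r1 op r5; r6 = r2 op r3 needs two NOPs between the first two instructions. A scheduler that moved the independent third instruction into one of those slots would save a cycle, which is exactly the kind of compiler-provided ILP the design depends on.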


2008 ◽  
Vol 16 (4) ◽  
pp. 277-285 ◽  
Author(s):  
Ida M.B. Nielsen ◽  
Curtis L. Janssen

Until recently, performance gains in processors were achieved largely through improvements in clock speed and instruction-level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized through multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve application performance requires exposing extreme levels of software parallelism. Here we discuss the architecture of parallel computers built from many multicore chips, as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multithreading programming model. We illustrate these ideas with a hybrid distributed-memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
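
The hybrid model splits work at two levels: across processes that own separate memory, and across threads within each process. The single-process sketch below only simulates that decomposition for a matrix multiply, assuming a 1-D row-block distribution of A with B replicated on every rank; it is not the authors' algorithm, the function names are hypothetical, and a real code would use an MPI library for the broadcast and gather. CPython threads will not speed up this pure-Python arithmetic; the point is the data layout.

```python
from concurrent.futures import ThreadPoolExecutor


def local_matmul(A_rows, B, n_threads):
    """Thread-level step: multiply this rank's row block of A by all of B."""
    cols = list(zip(*B))  # columns of B, computed once per rank

    def row_times_B(row):
        return [sum(a * b for a, b in zip(row, col)) for col in cols]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(row_times_B, A_rows))


def hybrid_matmul(A, B, n_ranks, n_threads):
    """Simulate a row-block distribution of A over n_ranks 'processes'.

    Each simulated rank owns a contiguous block of A's rows and a copy of B
    (standing in for a broadcast); the concatenation of the partial results
    stands in for a gather to rank 0. Ranks run sequentially here, whereas
    in a real distributed-memory code they would run concurrently.
    """
    m = len(A)
    C = []
    for r in range(n_ranks):
        lo, hi = r * m // n_ranks, (r + 1) * m // n_ranks
        C.extend(local_matmul(A[lo:hi], B, n_threads))
    return C
```

Keeping B replicated makes each rank's work fully independent at the cost of memory; distributed algorithms such as SUMMA instead block B as well and exchange panels between ranks.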


2010 ◽  
Vol 34 (6) ◽  
pp. 228-236 ◽  
Author(s):  
Yu Zhang ◽  
Dongdong Chen ◽  
Younhee Choi ◽  
Li Chen ◽  
Seok-Bum Ko
