A Ubiquitous Processor Built-in a Waved Multifunctional Unit

Author(s):  
Masa-aki Fukase ◽  
Tomoaki Sato

In developing cutting-edge VLSI processors, parallelism is one of the most important strategies for achieving power-conscious high performance. This is even more critical for ubiquitous systems, which face great demands for multimedia mobile processing. A key issue for such systems is instruction scheduling, because the floating-point units indispensable for multimedia mobile applications have longer latency than integer units. Software parallelism has been necessary to fully exploit hardware parallelism among regular scalar units, but it is cumbersome. We therefore describe in this article a twofold scheme for achieving instruction-level parallelism (ILP) free of instruction scheduling, and apply it to HCgorilla, a ubiquitous processor we have developed. The scheme consists of multifunctionalizing the scalar units and wave-pipelining the resulting multifunctional unit (MFU). The multifunctionalization removes the need for instruction scheduling, while the wave-pipelining recovers the clock speed that would otherwise be lost to the scaled-up multifunctional circuit. HCgorilla with the built-in waved MFU is promising for wide-range dynamic ILP at rates higher than regular processors.
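The speed recovered by wave-pipelining is usually reasoned about with the classic timing constraint: the clock period is bounded by the spread between the longest and shortest combinational path delays rather than by the longest path alone. The sketch below is a minimal, illustrative model of that constraint; the delay values, register timing parameters, and the helper name wave_pipeline_clock are hypothetical and not taken from the HCgorilla design.

```python
# Minimal model of the wave-pipelining timing constraint (illustrative only).
# In a wave-pipelined block, several data "waves" coexist in the combinational
# logic, so the clock period is limited by the path-delay spread, not the
# total latency.

def wave_pipeline_clock(d_max, d_min, t_setup=0.1, t_skew=0.05, t_cq=0.1):
    """Return the minimum clock period (ns) and the number of waves in flight.

    d_max, d_min : longest / shortest combinational path delay (ns)
    t_setup, t_skew, t_cq : register setup, clock skew, clock-to-Q (ns)
    All parameter values here are hypothetical.
    """
    # Classic constraint: T_clk >= (D_max - D_min) + register/skew overheads
    t_clk = (d_max - d_min) + t_setup + t_skew + t_cq
    # Approximate number of waves simultaneously travelling through the logic
    waves_in_flight = max(1, int(d_max // t_clk))
    return t_clk, waves_in_flight


if __name__ == "__main__":
    # A conventional single-stage design with the same logic needs T_clk >= D_max.
    t_clk, waves = wave_pipeline_clock(d_max=4.0, d_min=3.2)
    print(f"wave-pipelined T_clk ~ {t_clk:.2f} ns, ~{waves} waves in flight")
    print(f"conventional single-stage T_clk ~ {4.0 + 0.1 + 0.05 + 0.1:.2f} ns")
```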

2017 ◽  
Vol 26 (09) ◽  
pp. 1750129 ◽  
Author(s):  
Mohamed Najoui ◽  
Mounir Bahtat ◽  
Anas Hatim ◽  
Said Belkouch ◽  
Noureddine Chabini

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in signal processing applications, and its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. In some processor architectures, however, such as very long instruction word (VLIW), compiler efficiency alone is not enough to make effective use of the available computational resources. This paper presents an efficient, optimized approach to implementing Givens QRD on a low-power VLIW platform. To overcome the compiler's limits in parallelizing most of the Givens arithmetic operations, we propose a low-level instruction scheme that maximizes the parallelism rate and minimizes clock cycles. The key contributions of this work are: (i) a new fast, parallel design of the Givens algorithm based on the VLIW features (instruction-level parallelism (ILP) and data-level parallelism (DLP)) and the cache memory properties, and (ii) an efficient data-management approach that avoids cache misses and memory bank conflicts. Two DSP platforms, the C6678 and the AK2H12, were used as implementation targets. The proposed parallel QR implementation achieves, on average, more than 12× and 6× speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed implementation is at least 3.65 and 2.5 times faster than recent CPU and DSP implementations, respectively.
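For reference, the Givens approach zeroes one subdiagonal element at a time with a plane rotation; the serial sketch below shows the arithmetic the paper parallelizes at the instruction level. It is a plain NumPy illustration of textbook Givens QR, not the authors' VLIW-optimized kernel.

```python
import numpy as np

def givens_qr(A):
    """Textbook QR decomposition via Givens rotations (illustrative, serial).

    Returns Q, R with A = Q @ R. Each rotation zeroes one element below the
    diagonal; these rotations are the operations a VLIW scheme can parallelize.
    """
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, -b / r
            G = np.array([[c, -s], [s, c]])          # 2x2 plane rotation
            R[i - 1:i + 1, :] = G @ R[i - 1:i + 1, :]  # zero R[i, j]
            Q[:, i - 1:i + 1] = Q[:, i - 1:i + 1] @ G.T  # accumulate Q
    return Q, R

if __name__ == "__main__":
    A = np.random.rand(6, 4)
    Q, R = givens_qr(A)
    print(np.allclose(Q @ R, A), np.allclose(np.tril(R, -1), 0))
```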


2010 ◽  
Vol 57 (3) ◽  
pp. 314-338 ◽  
Author(s):  
Ben Abdallah Abderazek ◽  
Masashi Masuda ◽  
Arquimedes Canedo ◽  
Kenichi Kuroda

2011 ◽  
Vol 2011 ◽  
pp. 1-13 ◽  
Author(s):  
Mateus B. Rutzig ◽  
Antonio C. S. Beck ◽  
Felipe Madruga ◽  
Marco A. Alves ◽  
Henrique C. Freitas ◽  
...  

Limits on instruction-level parallelism and ever higher transistor density sustain the increasing need for multiprocessor systems: they are rapidly taking over both the general-purpose and embedded processor domains. Current multiprocessing systems are composed either of many homogeneous, simple cores or of complex superscalar, simultaneous-multithreading processing elements. As parallel applications become increasingly common in the embedded and general-purpose domains and multiprocessing systems must handle a wide range of application classes, there is no consensus on which hardware solutions best exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) together. Therefore, in this work, we extend the DIM (dynamic instruction merging) technique to a multiprocessing scenario, demonstrating the need for adaptable ILP exploitation even in TLP-oriented architectures. We successfully coupled a dynamic reconfigurable system to a SPARC-based multiprocessor and obtained performance gains of up to 40%, even for applications that exhibit a great deal of thread-level parallelism.


Author(s):  
M. KAMARAJU ◽  
M. ALEKHYA ◽  
K. LAL KISHORE

The main objective of this work is to implement a 32-bit pipelined RISC processor without interlocking stages. The processor uses S.I.M.E. (Single Instruction Multiple Execution), in which a single instruction triggers multiple executions, and is based on the VLIW (Very Long Instruction Word) architecture, an optimal choice for obtaining high performance in embedded systems. In a VLIW-based architecture, the effectiveness of the processor depends on the ability of the compiler to expose sufficient instruction-level parallelism (ILP). The processor has been designed in VHDL and synthesized using the Xilinx tool.


2011 ◽  
Vol 2011 ◽  
pp. 1-15 ◽  
Author(s):  
Jeffrey Kingyens ◽  
J. Gregory Steffan

We propose a soft processor programming model and architecture inspired by graphics processing units (GPUs) and well matched to the strengths of FPGAs, namely highly parallel and pipelinable computation. In particular, our soft processor architecture exploits multithreading, vector operations, and predication to supply a floating-point pipeline of 64 stages via hardware support for up to 256 concurrent thread contexts. The key new contributions of our architecture are mechanisms for managing threads and register files that maximize data-level and instruction-level parallelism while overcoming the port limitations of FPGA block memories as well as memory and pipeline latency. Through simulation of a system that (i) is programmable via NVIDIA's high-level Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and (iii) is realizable on an XtremeData XD1000 FPGA-based accelerator system, we demonstrate the potential for such a system to achieve 100% utilization of a deeply pipelined floating-point datapath.
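The architecture relies on having enough thread contexts to keep the deep floating-point pipeline full while individual threads wait out pipeline and memory latency. The sketch below is a rough occupancy model of that idea; the 64-stage depth and 256-context figures come from the abstract, but the round-robin issue model and the memory-latency parameter are simplifying assumptions of ours.

```python
# Rough model of latency hiding by multithreading in a deeply pipelined
# soft processor: with round-robin issue, each thread has one instruction in
# flight, so utilization grows with the number of hardware thread contexts.

def pipeline_utilization(threads, pipeline_depth=64, avg_stall=0):
    """Fraction of cycles the pipeline issues useful work (simplified model).

    A thread cannot issue again until its previous instruction clears the
    pipeline (pipeline_depth cycles) plus any stall (e.g. memory latency).
    Values other than the 64-stage depth are illustrative assumptions.
    """
    cycles_per_thread_turnaround = pipeline_depth + avg_stall
    return min(1.0, threads / cycles_per_thread_turnaround)


if __name__ == "__main__":
    for t in (16, 64, 128, 256):
        u = pipeline_utilization(t, pipeline_depth=64, avg_stall=32)
        print(f"{t:3d} thread contexts -> ~{u:5.1%} datapath utilization")
```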


2008 ◽  
Vol 16 (4) ◽  
pp. 277-285 ◽  
Author(s):  
Ida M.B. Nielsen ◽  
Curtis L. Janssen

Until recently, performance gains in processors were achieved largely by improvements in clock speed and instruction-level parallelism, so applications could obtain performance increases with relatively minor changes simply by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements come from multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology requires exposing extreme levels of software parallelism. Here we discuss the architecture of parallel computers constructed from many multicore chips, as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We illustrate these ideas with a hybrid distributed-memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
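To make the hybrid model concrete, the sketch below splits a matrix multiply across processes (standing in for message-passing ranks) and across threads within each process. It is a minimal Python/NumPy illustration of the programming pattern, not the authors' distributed-memory implementation, and all names in it are ours.

```python
# Hybrid-parallel matrix multiply sketch: processes play the role of
# message-passing ranks (each owning a block of rows of A), and threads
# inside each process split that block further. Illustrative only.
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool
import numpy as np

def _rank_multiply(args):
    """Work done by one 'rank': multiply its row block of A by B using threads."""
    a_block, b, n_threads = args
    rows = np.array_split(np.arange(a_block.shape[0]), n_threads)
    out = np.empty((a_block.shape[0], b.shape[1]))

    def worker(idx):
        # NumPy releases the GIL inside dot, so the threads genuinely overlap.
        out[idx, :] = a_block[idx, :] @ b

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(worker, rows))
    return out

def hybrid_matmul(a, b, n_ranks=2, n_threads=2):
    blocks = np.array_split(a, n_ranks, axis=0)          # distribute rows of A
    with Pool(n_ranks) as pool:                          # "ranks" as processes
        results = pool.map(_rank_multiply, [(blk, b, n_threads) for blk in blocks])
    return np.vstack(results)                            # gather the result

if __name__ == "__main__":
    A, B = np.random.rand(200, 120), np.random.rand(120, 80)
    print(np.allclose(hybrid_matmul(A, B), A @ B))
```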


Author(s):  
Charanjit Singh ◽  
Balwinder Singh

In this paper, a new high-speed control circuit is proposed that acts as a critical path for the data moving from input to output, in order to improve the performance of wave-pipelined circuits. Wave pipelining is a high-performance circuit design method that implements pipelining in logic without the use of intermediate registers. It has been widely used in recent years and offers significant gains in speed, efficiency, and economy. Wave pipelining appears in a wide range of applications, including digital filters, network routers, multipliers, fast convolvers, modems, image processing, control systems, and radar. In previous work, the operating speed of a wave-pipelined circuit was increased by three tasks: adjustment of the clock period, adjustment of the clock skew, and equalization of path delays. The path-delay equalization task can be performed theoretically, but the real challenge is accomplishing it in the presence of various delays. The main objective of this paper is therefore to solve the path-delay equalization problem by inserting a control circuit into the wave-pipelined circuit that acts as the critical path for the data moving from input to output. The proposed technique is evaluated for DSP applications by designing a 4-tap FIR filter using the distributed arithmetic algorithm (DAA) and comparing it with 4-tap FIR filter designs using conventional pipelining and no pipelining. Synthesis and simulation results based on Xilinx ISE Navigator 12.3 show that the wave-pipelined DAA-based filter is faster than the non-pipelined one by a factor of 1.43, while the conventionally pipelined filter is faster than the non-pipelined one by a factor of 1.61, but at the cost of a 200% increase in logic utilization. Thus, the wave-pipelined DA filters designed with the proposed control circuit can operate at a higher frequency than the non-pipelined design, though lower than the conventionally pipelined one; the speed gain of conventional pipelining over wave pipelining comes at the cost of increased area and power dissipation. When latency is considered, the wave-pipelined filters designed with the proposed scheme have the lowest latency of the three schemes.
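As background for the evaluation, distributed arithmetic replaces the multiply-accumulate of an N-tap FIR filter with a lookup table of partial coefficient sums addressed bit-serially by the input samples. The sketch below is a bit-level software model of a 4-tap DA filter; the coefficient values, word length, and helper names are illustrative assumptions, not the filter reported in the paper.

```python
# Bit-level software model of a 4-tap FIR filter computed with distributed
# arithmetic (DA): the multiply-accumulate is replaced by a 16-entry lookup
# table of coefficient sums, addressed one input bit-plane at a time.
# For simplicity this model assumes unsigned integer inputs; a hardware DA
# filter would handle the two's-complement sign bit with a subtraction.

COEFFS = [3, -1, 4, 2]      # illustrative integer tap coefficients
WORD_BITS = 8               # illustrative input word length

# Precompute LUT[addr] = sum of the coefficients selected by the 4 address bits.
LUT = [sum(c for k, c in enumerate(COEFFS) if (addr >> k) & 1)
       for addr in range(1 << len(COEFFS))]

def da_fir(samples):
    """Filter a sequence of unsigned WORD_BITS-wide samples with DA."""
    taps = [0] * len(COEFFS)            # delay line: x[n], x[n-1], ...
    out = []
    for x in samples:
        taps = [x] + taps[:-1]
        acc = 0
        for b in range(WORD_BITS):      # process one bit-plane per "cycle"
            addr = sum(((taps[k] >> b) & 1) << k for k in range(len(COEFFS)))
            acc += LUT[addr] << b       # shift-accumulate the LUT output
        out.append(acc)
    return out

if __name__ == "__main__":
    xs = [10, 0, 255, 17, 3, 128]
    direct, taps = [], [0] * len(COEFFS)
    for x in xs:                         # reference: direct multiply-accumulate
        taps = [x] + taps[:-1]
        direct.append(sum(c * t for c, t in zip(COEFFS, taps)))
    print(da_fir(xs) == direct)          # True: DA matches the direct form
```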

