scholarly journals Boosting Parallel Applications Performance on Applying DIM Technique in a Multiprocessing Environment

2011 ◽  
Vol 2011 ◽  
pp. 1-13 ◽  
Author(s):  
Mateus B. Rutzig ◽  
Antonio C. S. Beck ◽  
Felipe Madruga ◽  
Marco A. Alves ◽  
Henrique C. Freitas ◽  
...  

Limits of instruction-level parallelism and higher transistor density sustain the increasing need for multiprocessor systems: they are rapidly taking over both general-purpose and embedded processor domains. Current multiprocessing systems are composed either of many homogeneous and simple cores or of complex superscalar, simultaneous multithread processing elements. As parallel applications are becoming increasingly present in embedded and general-purpose domains and multiprocessing systems must handle a wide range of different application classes, there is no consensus over which are the best hardware solutions to better exploit instruction-level parallelism (TLP) and thread-level parallelism (TLP) together. Therefore, in this work, we have expanded the DIM (dynamic instruction merging) technique to be used in a multiprocessing scenario, proving the need for an adaptable ILP exploitation even in TLP architectures. We have successfully coupled a dynamic reconfigurable system to an SPARC-based multiprocessor and obtained performance gains of up to 40%, even for applications that show a great level of parallelism at thread level.

Author(s):  
Eduardo H. M. Cruz ◽  
Matthias Diener ◽  
Laércio L. Pilla ◽  
Philippe O. A. Navaux

Current and future architectures rely on thread-level parallelism to sustain performance growth. These architectures have introduced a complex memory hierarchy, consisting of several cores organized hierarchically with multiple cache levels and NUMA nodes. These memory hierarchies can have an impact on the performance and energy efficiency of parallel applications as the importance of memory access locality is increased. In order to improve locality, the analysis of the memory access behavior of parallel applications is critical for mapping threads and data. Nevertheless, most previous work relies on indirect information about the memory accesses, or does not combine thread and data mapping, resulting in less accurate mappings. In this paper, we propose the Sharing-Aware Memory Management Unit (SAMMU), an extension to the memory management unit that allows it to detect the memory access behavior in hardware. With this information, the operating system can perform online mapping without any previous knowledge about the behavior of the application. In the evaluation with a wide range of parallel applications (NAS Parallel Benchmarks and PARSEC Benchmark Suite), performance was improved by up to 35.7% (10.0% on average) and energy efficiency was improved by up to 11.9% (4.1% on average). These improvements happened due to a substantial reduction of cache misses and interconnection traffic.


2014 ◽  
Vol 2014 ◽  
pp. 1-13 ◽  
Author(s):  
Mouna Baklouti ◽  
Mohamed Abid

To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, they are mostly based on custom-logic design methodology. Designing parallel multicore systems using available standards intellectual properties yet maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs) are a cheap and fast option to develop and test such systems. This paper describes a FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of making the SoC using the NIOS II soft-processor core from Altera is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and also some parallel applications are used for testing speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication).


VLSI Design ◽  
2016 ◽  
Vol 2016 ◽  
pp. 1-12 ◽  
Author(s):  
Yumin Hou ◽  
Hu He ◽  
Xu Yang ◽  
Deyuan Guo ◽  
Xu Wang ◽  
...  

This paper proposes FuMicro, a fused microarchitecture integrating both in-order superscalar and Very Long Instruction Word (VLIW) in a single core. A processor with FuMicro microarchitecture can work under alternative in-order superscalar and VLIW mode, using the same pipeline and the same Instruction Set Architecture (ISA). Small modification to the compiler is made to expand the register file in VLIW mode. The decision of mode switch is made by software, and this does not need extra hardware. VLIW code can be exploited in the form of library function and the users will be exposed under only superscalar mode; by this means, we can provide the users with a convenient development environment. FuMicro could serve as a universal microarchitecture for it can be applied to different ISAs. In this paper, we focus on the implementation of FuMicro with ARM ISA. This architecture is evaluated on gem5, which is a cycle accurate microarchitecture simulation platform. By adopting FuMicro microarchitecture, the performance can be improved on an average of 10%, with the best performance improvement being 47.3%, compared with that under pure in-order superscalar mode. The result shows that FuMicro microarchitecture can improve Instruction Level Parallelism (ILP) significantly, making it promising to expand digital signal processing capability on a General Purpose Processor.


Author(s):  
Sanket Dessai ◽  
Krishna Bhushan Vutukuru

Graphical Processing Units (GPUs) have become an integral part of today’s mainstream computing systems. They are also being used as reprogrammable General Purpose GPUs (GP-GPUs) to perform complex scientific computations. Reconfigurability is an attractive approach to embedded systems allowing hardware level modification. Hence, there is a high demand for GPU designs based on reconfigurable hardware. Stream processor consists of clusters of functional units which provide a bandwidth hierarchy, supporting hundreds of arithmetic units. The arithmetic cluster units are designed to exploit instruction level parallelism and subword parallelism within a cluster and data parallelism across the clusters.For decreasing the area and power, a single controller is used to control data flow between clusters and between host processor and GPU. The designed of stream processor unit has been carried out in Verilog on Altera Quartus II and simulated using ModelSim tools. The functionality of the modelled blocks is verified using test inputs in the simulator.The simulated execution time of 8-bit pipelined multiplier is 60 ps and 100 ns for 8-bit pipelined adder while operating at 90 MHz.


2021 ◽  
Author(s):  
Anita Tino

As the multi-core computing era continues to progress, the need to increase single- thread performance, throughput, and seemingly adapt to thread-level parallelism (TLP) remain important issues. Though the number of cores on each processor continues to increase, expected performance gains have lagged. Accordingly, com- puting systems often include Simultaneously Multi-Threaded (SMT) processors as a compromise between sequential and parallel performance on a single core. These processors effectively improve the throughput and utilization of a core, however often at the expense of single-thread performance as threads per core scale. Accordingly, applications which require higher single-thread performance must often resort to single-thread core multi-processor systems which incur additional area overhead and power dissipation. In attempts to improve single- and multi-thread core efficiency, this work introduces the concept of a Configurable Simultaneously Single-Threaded (Multi-)Engine Processor (ConSSTEP). ConSSTEP is a nuanced approach to multi- threaded processors, achieving performance gains and energy efficiency by invoking low overhead reconfigurable properties with full software compatibility. Experimen- tal results demonstrate that ConSSTEP is able to increase single-thread Instruc- tions Per Cycle (IPC) up to 1.39x and 2.4x for 2-thread and 4-thread workloads, respectively, improving throughput and providing up to 2x energy efficiency when compared to a conventional SMT processor.


Author(s):  
Masa-aki Fukase ◽  
Tomoaki Sato

In developing cutting edge VLSI processors, parallelism is one of the most important global standard strategies to achieve power conscious high performance. These features are more critical for ubiquitous systems with great demands for multimedia mobile processing. Then, one of most important issues for ubiquitous systems is instruction scheduling, because floating point units indispensable for multimedia mobile applications take longer latency than integer units. Although software parallelism has been inevitable to fully utilize hardware parallelism between regular scalar units, it has been really awkward. Thus, we describe in this article a double scheme to achieve instruction scheduling free ILP (instruction level parallelism) and apply the double scheme to a ubiquitous processor HCgorilla we have so far developed. The double scheme is the multifunctionalization of scalar units and making a resultant multifunctional unit (MFU) wave-pipeline. The multifunctionalization frees the instruction scheduling, and the wave-pipelining recovers the reduction of clock speed to be caused by the scale up of a multifunctional circuit. HCgorilla built-in the waved MFU is promising for wide-range dynamic ILP at a rate higher than regular processors.


2018 ◽  
Vol 28 (02) ◽  
pp. 1950020 ◽  
Author(s):  
Yumin Hou ◽  
Xu Wang ◽  
Jiawei Fu ◽  
Junping Ma ◽  
Hu He ◽  
...  

In order to expand the computation capability of digital signal processing on a General Purpose Processor (GPP), we propose a fused microarchitecture that improves Instruction Level Parallelism (ILP) by supporting both in-order superscalar and very long instruction word (VLIW) dispatch methods in a single pipeline. This design is based on ARMv7-A&R Instruction Set Architecture (ISA). To provide a performance comparison, we first design an in-order superscalar processor, considering that ARM GPPs always adopt superscalar approaches. And then we expand VLIW dispatch method based on this processor, to realize the fused microarchitecture. The two designs are both evaluated on the Xilinx 7-series FPGA (XC7K325T-2FFG900C), using Xilinx Vivado design suite. The results show that, compared with the superscalar processor, the processor working under VLIW mode can improve the performance by 15% and 8%, respectively, when running EEMBC and DSPstone benchmarks. We also run the two benchmarks on ARM Cortex-A9 processor, which is integrated in the Zynq-7000 AP SoC device on Xilinx ZC706 evaluation board. The processor in VLIW mode shows 44% and 30% performance improvements than ARM Cortex-A9. The fused microarchitecture adopts a combined bimodal and PAp branch prediction method. This method achieves 93.7% prediction accuracy with limited hardware overhead.


2021 ◽  
Author(s):  
Anita Tino

As the multi-core computing era continues to progress, the need to increase single- thread performance, throughput, and seemingly adapt to thread-level parallelism (TLP) remain important issues. Though the number of cores on each processor continues to increase, expected performance gains have lagged. Accordingly, com- puting systems often include Simultaneously Multi-Threaded (SMT) processors as a compromise between sequential and parallel performance on a single core. These processors effectively improve the throughput and utilization of a core, however often at the expense of single-thread performance as threads per core scale. Accordingly, applications which require higher single-thread performance must often resort to single-thread core multi-processor systems which incur additional area overhead and power dissipation. In attempts to improve single- and multi-thread core efficiency, this work introduces the concept of a Configurable Simultaneously Single-Threaded (Multi-)Engine Processor (ConSSTEP). ConSSTEP is a nuanced approach to multi- threaded processors, achieving performance gains and energy efficiency by invoking low overhead reconfigurable properties with full software compatibility. Experimen- tal results demonstrate that ConSSTEP is able to increase single-thread Instruc- tions Per Cycle (IPC) up to 1.39x and 2.4x for 2-thread and 4-thread workloads, respectively, improving throughput and providing up to 2x energy efficiency when compared to a conventional SMT processor.


1997 ◽  
Vol 15 (3) ◽  
pp. 322-354 ◽  
Author(s):  
Jack L. Lo ◽  
Joel S. Emer ◽  
Henry M. Levy ◽  
Rebecca L. Stamm ◽  
Dean M. Tullsen ◽  
...  

Sign in / Sign up

Export Citation Format

Share Document