instruction level parallelism
Recently Published Documents


TOTAL DOCUMENTS: 165 (FIVE YEARS 11)
H-INDEX: 15 (FIVE YEARS 0)

Author(s):  
Antonio Fuentes-Alventosa ◽  
Juan Gómez-Luna ◽  
José Maria González-Linares ◽  
Nicolás Guil ◽  
R. Medina-Carnicer

Abstract: CAVLC (Context-Adaptive Variable Length Coding) is a high-performance entropy method for video and image compression. It is the most commonly used entropy method in the video standard H.264. In recent years, several hardware accelerators for CAVLC have been designed. In contrast, high-performance software implementations of CAVLC (e.g., GPU-based) are scarce. A high-performance GPU-based implementation of CAVLC is desirable in several scenarios. On the one hand, it can be exploited as the entropy component in GPU-based H.264 encoders, which are a very suitable solution when GPU built-in H.264 hardware encoders lack certain necessary functionality, such as data encryption and information hiding. On the other hand, a GPU-based implementation of CAVLC can be reused in a wide variety of GPU-based compression systems for encoding images and videos in formats other than H.264, such as medical images. This is not possible with hardware implementations of CAVLC, as they are non-separable components of hardware H.264 encoders. In this paper, we present CAVLCU, an efficient implementation of CAVLC on GPU, which is based on four key ideas. First, we use only one kernel to avoid the long-latency global memory accesses required to transmit intermediate results among different kernels, and the costly launches and terminations of additional kernels. Second, we apply an efficient synchronization mechanism for thread-blocks that process adjacent frame regions (in horizontal and vertical dimensions) to share results in global memory space. (To prevent confusion, a block of pixels of a frame is referred to simply as a block, and a GPU thread block as a thread-block.) Third, we fully exploit the available global memory bandwidth by using vectorized loads to move the quantized transform coefficients directly to registers. Fourth, we use register tiling to implement the zigzag sorting, thus obtaining high instruction-level parallelism. An exhaustive experimental evaluation showed that our approach is between 2.5× and 5.4× faster than the only state-of-the-art GPU-based implementation of CAVLC.
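The zigzag sorting mentioned in the abstract above is a fixed reordering of the coefficients of a block. The following is a minimal Python sketch of that reordering for a 4x4 block, assuming the conventional anti-diagonal zigzag order; the function names `zigzag_order` and `zigzag_scan` are hypothetical, and this is not the CAVLCU kernel. It shows only the permutation itself, not register tiling, vectorized loads, or any other GPU-specific detail.

```python
def zigzag_order(n=4):
    """Return raster indices of an n x n block in zigzag order.

    Assumption: the conventional zigzag that walks anti-diagonals,
    alternating direction on each one.
    """
    order = []
    for s in range(2 * n - 1):          # s = row + col of the anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:                  # even diagonals run bottom-left to top-right
            rows = reversed(rows)
        order.extend(r * n + (s - r) for r in rows)
    return order


def zigzag_scan(block):
    """Reorder 16 raster-order coefficients of a 4x4 block into zigzag order."""
    if len(block) != 16:
        raise ValueError("expected 16 coefficients")
    return [block[i] for i in zigzag_order(4)]


if __name__ == "__main__":
    print(zigzag_scan(list(range(16))))
```

Because the permutation is known in advance, every output position reads from a fixed input position, so the individual moves do not depend on one another.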


Author(s):  
Krishan Kumar ◽  
Renu

Multithreading is the ability of a central processing unit (CPU) or a single core within a multi-core processor to execute multiple processes or threads concurrently, appropriately supported by the operating system. This approach differs from multiprocessing: with multithreading, processes and threads share the resources of a single core or of multiple cores, namely the computing units, the CPU caches, and the translation lookaside buffer (TLB). Whereas multiprocessing systems include multiple complete processing units, multithreading aims to increase utilization of a single core by using thread-level as well as instruction-level parallelism. The objective of this research is to increase the efficiency of scheduling dependent tasks using enhanced multithreading. Gang scheduling of parallel implicit-deadline periodic task systems upon identical multiprocessor platforms is considered. In this scheduling problem, parallel tasks use several processors simultaneously. The first algorithm is based on linear programming and is the first one to be proved optimal for the considered gang scheduling problem. Furthermore, it runs in polynomial time for a fixed number m of processors, and an efficient implementation is fully detailed. The second algorithm is an approximation algorithm based on a fixed-priority rule that is competitive under resource augmentation analysis in order to compute an optimal schedule pattern. Precisely, its speedup factor is bounded by (2 − 1/m). Both algorithms are also evaluated through intensive numerical experiments. In our research, we have enhanced the capability of gang scheduling by integrating a multi-core processor and cache, and we simulate its performance in MATLAB.


Author(s):  
Dennis Wolf ◽  
Andreas Engel ◽  
Tajas Ruschke ◽  
Andreas Koch ◽  
Christian Hochberger

Abstract: Coarse Grained Reconfigurable Arrays (CGRAs), or Architectures, are a concept for hardware accelerators based on the idea of distributing the workload over Processing Elements. These processors exploit instruction-level parallelism while remaining energy efficient thanks to their simple internal structure. However, incorporating them into a complete computing system raises severe challenges at the hardware and software levels. This article evaluates in detail a CGRA integrated into a control engineering environment targeting a Xilinx Zynq System on Chip (SoC). Besides the actual application execution performance, the practicability of the configuration toolchain is validated. Challenges of the real-world integration are discussed and practical insights are highlighted.


2021 ◽  
Vol 11 (3) ◽  
pp. 1225
Author(s):  
Woohyong Lee ◽  
Jiyoung Lee ◽  
Bo Kyung Park ◽  
R. Young Chul Kim

Geekbench is one of the most referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic, but some of them aim to simulate real-world behavior. In the mobile world, its microarchitectural behavior has rarely been reported because hardware profiling features are not widely available to the public. Although Geekbench is a popular mobile performance workload, its microarchitectural characteristics on mobile devices are hard to find. In this paper, a thorough experimental study of Geekbench performance characterization is reported with detailed performance metrics. This study also identifies the impact of the mobile system on chip (SoC) microarchitecture, such as the cache subsystem, instruction-level parallelism, and branch performance. The study revealed the bottlenecks of the workloads, especially in the cache subsystem. This means that a change in data set size directly and significantly affects the performance score on some systems and can undermine the fairness of the CPU benchmark. In the experiment, Samsung's Exynos9820-based platform was used as the tested device, with binaries built using the Android Native Development Kit (NDK). The Exynos9820 is a superscalar processor capable of dual-issuing some instructions. To help performance analysis, we enabled the collection of performance events with performance monitoring unit (PMU) registers. The PMU is a set of hardware performance counters built into microprocessors to store the counts of hardware-related activities. Throughout the experiment, functional and microarchitectural performance profiles were fully studied. This paper describes the details of the mobile performance studies above. In our experiment, the ARM DS5 tool was used for collecting runtime PMU profiles, including OS-level performance data. With the comparative study, users will understand more about mobile architecture behavior, which will help them evaluate which benchmark is preferable for a fair performance comparison.


Author(s):  
Tatiana Nikolaevna Romanova ◽  
Dmitry Igorevich Gorin

A method for optimizing the filling of a machine word with independent instructions is proposed, which increases program performance by packing the maximum number of independent instructions into a single packet. The paper also confirms the hypothesis that, when the compiler switches to random register allocation, the packet density increases, which in turn decreases the program's running time.
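The idea of filling packets with independent instructions, and the effect of register allocation on packet density, can be illustrated with a small Python sketch. This is a generic greedy packer written under stated assumptions, not the authors' method: each instruction is modeled as a hypothetical `(dest, sources)` pair, the packet width is a free parameter, and an instruction is placed strictly after any earlier instruction it conflicts with through a read-after-write, write-after-write, or write-after-read dependence.

```python
def depends(later, earlier):
    """True if `later` conflicts with `earlier` (RAW, WAW, or WAR)."""
    dest_e, srcs_e = earlier
    dest_l, srcs_l = later
    return dest_e in srcs_l or dest_e == dest_l or dest_l in srcs_e


def pack(instrs, width):
    """Greedily pack instructions, in program order, into packets of `width` slots.

    Returns a list of packets, each a list of instruction indices.
    """
    packets = []
    packet_of = []                      # packet index chosen for each instruction
    for i, ins in enumerate(instrs):
        # Earliest packet that comes strictly after every conflicting instruction.
        earliest = 0
        for j in range(i):
            if depends(ins, instrs[j]):
                earliest = max(earliest, packet_of[j] + 1)
        p = earliest
        while p < len(packets) and len(packets[p]) >= width:
            p += 1                      # skip packets that are already full
        if p == len(packets):
            packets.append([])
        packets[p].append(i)
        packet_of.append(p)
    return packets


if __name__ == "__main__":
    # Same computation twice; the second version reuses register r1,
    # which adds false (WAW/WAR) dependences.
    distinct = [("r1", ("r0",)), ("r2", ("r0",)), ("r3", ("r1", "r2")), ("r4", ("r0",))]
    reused = [("r1", ("r0",)), ("r2", ("r0",)), ("r3", ("r1", "r2")), ("r1", ("r0",))]
    print(pack(distinct, 2))
    print(pack(reused, 2))
```

In this toy input, reusing `r1` for the last instruction forces an extra packet, while giving it a fresh register lets it share a packet with another instruction. That is one plausible reading of why spreading values over more registers could raise packet density.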


Author(s):  
V. Venkata Nagendra Reddy ◽  
A. Sudhakar ◽  
Dr. P. Sivakumar

Our paper proposes a new processor architecture, based on VLIW, for enhancing performance. VLIW is a complex architecture because of the enormous number of registers, slices, flip-flops, counters, operands, ALUs, and MUXes it uses. The VLIW processor has a five-stage pipeline: (1) the fetch stage, which fetches a 128-bit instruction from instruction memory; (2) the decode stage, also called the operand-read stage because all operands are read in this stage; (3) the execute stage, in which parallel execution units perform four operations; (4) the memory stage, which loads data from or stores data to memory; and (5) the write-back stage, in which the outputs of the preceding stages are collected and written back to the register file for storage. The design is implemented on an FPGA of the Spartan-6 family (XC6SLX-3CSG324 device). In the proposed architecture, performance is increased by reducing the CPU time to XST completion of the design.
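To make the fetch and execute stages above more concrete, here is a minimal Python sketch of splitting one 128-bit instruction word into four operation slots. The equal 32-bit slot width, the slot ordering (slot 0 in the least significant bits), and the function name `split_bundle` are all assumptions for illustration; the abstract states only that the instruction is 128 bits wide and that four operations execute in parallel.

```python
SLOT_BITS = 32          # assumed slot width: 128 bits / 4 operations
NUM_SLOTS = 4


def split_bundle(word128):
    """Split a 128-bit instruction word into four 32-bit operation slots.

    Assumption: slot 0 occupies the least significant 32 bits.
    """
    if not 0 <= word128 < (1 << (SLOT_BITS * NUM_SLOTS)):
        raise ValueError("expected an unsigned 128-bit value")
    mask = (1 << SLOT_BITS) - 1
    return [(word128 >> (SLOT_BITS * k)) & mask for k in range(NUM_SLOTS)]


if __name__ == "__main__":
    bundle = 0x00000004_00000003_00000002_00000001
    print([hex(slot) for slot in split_bundle(bundle)])
```

Each slot would then be handed to its own execution unit in the same cycle, which is where the instruction-level parallelism of a VLIW design comes from.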


Author(s):  
Lin Li ◽  
Shengbing Zhang ◽  
Juan Wu

To meet the application demands of high-resolution image recognition and efficient localization processing in the aviation and aerospace fields, and to solve the problem of insufficient parallelism in existing research, an extensible multiprocessor-cluster deep learning processor architecture based on VLIW is designed by optimizing the computation of each layer of the deep convolutional neural network model. The design adopts parallel processing of feature maps and neurons, instruction-level parallelism based on very long instruction word (VLIW), data-level parallelism across multiprocessor clusters, and pipelining. Test results on an FPGA prototype system show that the processor can effectively complete image classification and object detection applications. The peak performance of the processor reaches 128 GOP/s when it operates at 200 MHz. On the selected benchmarks, the processor is at least about 12× faster than a CPU and 7× faster than a GPU. Compared with the results of the software framework, the average error in the processor's test accuracy is less than 1%.


2019 ◽  
Author(s):  
Anil Kumar Bheemaiah

A new algorithm of data dependencies and ILP is defined with the sense index of a thread in true-parallelism(™), from the definitions of Quasi-Parallelism, which is the sensitivity and sense indices defined for true scalability between single/multi-cores. The application to the CUDA architecture is delineated in formal architectural definitions.
Keywords: CUDA architectures, superscalar, ILP, data prediction, sense sensitivity index.
What: Out-of-order processing in a pipeline can be optimized with the sense-boarding processor. In this single- to multi-core scalable architecture, the processor is thread-centric, with sleeping and active threads. Sleeping threads have a sense() function associated with them. Unlike their human counterparts, snoring is a useful feature that helps keep sensitive threads awake and running. Sense-boarding is a scheduling algorithm that tracks the sensitivity indices of threads to snoring and helps schedule threads with dependency relationships for out-of-order execution.
How: Sense-boarding is a board-based dependency for instruction-level parallelism in multi-thread vector processing in out-of-order single-core/multicore symmetries. Inter-thread dependencies of data are marked in a board data structure by maps to define sensitivity and sense indices; sense functionality is useful in the case of dependencies, resource waiting, and speculative execution, or in data generation and prediction. Sense determines the relationship in instruction-level parallelism to sensitive out-of-order execution and data speculation. The application to the CUDA architecture for stream processing in GPUs is also mentioned.
Algorithms are: Instruction-level parallelism in sense-sensitivity index metrics: data speculation, dirty caches, parallel pipeline algorithms. Scalability in single-core/multi-core implementations. CUDA multi-core architectures for stream speculation and instruction-level parallelism.
Why: Sleep is rest, and sense a measure of thread parallel-ness. While threads sleep for the right time, the awake ones perform in quasi-parallelism as HPC. Asynchronous with Lamport clocks.

