PERFORMANCE EVALUATION OF BLAS ON THE TRIDENT PROCESSOR

2005 ◽  
Vol 15 (04) ◽  
pp. 407-414
Author(s):  
MOSTAFA I. SOLIMAN ◽  
STANISLAV G. SEDUKHIN

Different subtasks of an application usually have different computational, memory, and I/O requirements, which result in different needs for computer capabilities. Thus, a more appropriate approach for achieving both high performance and a simple programming model is to design a processor with a multi-level instruction set architecture (ISA), which leads to high performance and minimal executable code size. Since the fundamental data structures of a wide variety of existing applications are scalars, vectors, and matrices, our research Trident processor has a three-level ISA executed on zero-, one-, and two-dimensional arrays of data. These levels express a large amount of fine-grain data parallelism directly to the processor, instead of extracting it dynamically with complicated logic or statically with compilers. This reduces design complexity and provides a high-level programming interface to the hardware. In this paper, the performance of the Trident processor is evaluated on BLAS, which represents the kernel operations of many data-parallel applications. We show that the Trident processor proportionally reduces the number of clock cycles per floating-point operation as the number of execution datapaths increases.
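The scalar/vector/matrix distinction above corresponds naturally to the three classical BLAS levels. As a rough illustration (plain Python stand-ins, not Trident instructions), the levels differ in the ratio of computation to data:

```python
# Illustrative sketch of the three BLAS levels that correspond to the
# Trident processor's scalar-, vector-, and matrix-level ISA.
# Plain-Python stand-ins, not Trident code.

def axpy(alpha, x, y):
    """Level-1 BLAS: y <- alpha*x + y (O(n) ops on O(n) data)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(A, x):
    """Level-2 BLAS: matrix-vector product y <- A*x (O(n^2) ops)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def gemm(A, B):
    """Level-3 BLAS: matrix-matrix product C <- A*B (O(n^3) ops on O(n^2) data)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]
```

The increasing computation-to-data ratio from Level 1 to Level 3 is what lets additional execution datapaths be kept busy, which is consistent with the proportional reduction in cycles per floating-point operation reported above.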

Author(s):  
Breno A. de Melo Menezes ◽  
Nina Herrmann ◽  
Herbert Kuchen ◽  
Fernando Buarque de Lima Neto

Parallel implementations of swarm intelligence algorithms such as ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of algorithmic skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold, and zip) that are later converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain-specific language Musket can cope with the development of a parallel implementation of ACO, and how that implementation compares to a low-level one. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with parallelization aspects, Musket generates high-performance code with execution times similar to those of low-level implementations.
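The skeleton patterns named above can be illustrated with a minimal sketch, in plain sequential Python rather than Musket's generated parallel code; a skeleton compiler would translate these same patterns into CUDA. The ACO-style probability computation at the end is a hypothetical example, not the paper's actual benchmark:

```python
from functools import reduce

# Minimal sketch of the map / zip / fold skeletons. A framework such as
# Musket converts these patterns into efficient parallel code; here they
# run sequentially for illustration only.

def sk_map(f, xs):          # apply f to every element independently
    return [f(x) for x in xs]

def sk_zip(f, xs, ys):      # combine two collections element-wise
    return [f(x, y) for x, y in zip(xs, ys)]

def sk_fold(f, init, xs):   # reduce a collection to a single value
    return reduce(f, xs, init)

# Hypothetical ACO-style step expressed purely with skeletons:
# weight = pheromone^alpha * (1/distance)^beta, normalized by the fold-sum.
pheromone = [0.5, 1.0, 2.0]
distance = [2.0, 1.0, 4.0]
alpha, beta = 1.0, 1.0
weights = sk_zip(lambda p, d: (p ** alpha) * ((1.0 / d) ** beta),
                 pheromone, distance)
total = sk_fold(lambda a, b: a + b, 0.0, weights)
probs = sk_map(lambda w: w / total, weights)
```

Because each pattern fixes the data-access structure (independent elements for map, pairwise for zip, associative reduction for fold), the compiler can parallelize them without any low-level CUDA code from the user.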


2000 ◽  
Vol 09 (03) ◽  
pp. 343-367
Author(s):  
STEPHEN W. RYAN ◽  
ARVIND K. BANSAL

This paper describes a system to distribute and retrieve multimedia knowledge on a cluster of heterogeneous high-performance architectures distributed over the Internet. The knowledge is represented using facts and rules in an associative logic-programming model. Associative computation facilitates the distribution of facts and rules and exploits coarse-grain data-parallel computation. Associative logic programming uses a flat data model that can be easily mapped onto heterogeneous architectures. The paper describes an abstract instruction set for the distributed version of associative logic programming and the corresponding implementation. The implementation uses a message-passing library for architecture independence within a cluster, object-oriented programming for modularity and portability, and Java as a front-end interface to provide a graphical user interface, multimedia capability, and remote access via the Internet. Performance results on a cluster of IBM RS 6000 workstations are presented. The results show that distributing the data improves performance almost linearly for a small number of processors in a cluster.
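The core idea of distributing facts and querying them by content can be sketched as follows. This is a hypothetical plain-Python illustration of associative (search-by-content) retrieval over partitioned facts, not the paper's abstract instruction set:

```python
# Hypothetical sketch: logic-programming facts distributed across
# (simulated) workers and retrieved associatively, i.e. by content
# rather than by address. Illustration only.

facts = [
    ("parent", "tom", "bob"),
    ("parent", "bob", "ann"),
    ("parent", "bob", "pat"),
]

# Distribute facts round-robin over the workers.
NUM_WORKERS = 2
partitions = [facts[i::NUM_WORKERS] for i in range(NUM_WORKERS)]

def match(fact, query):
    """Content match; None in the query acts as a wildcard."""
    return all(q is None or q == f for f, q in zip(fact, query))

def associative_query(query):
    """Each worker searches its own partition in parallel (simulated
    here sequentially); results are merged at the front end."""
    return [f for part in partitions for f in part if match(f, query)]
```

Because every partition is searched independently by content, adding workers divides the search work, which is consistent with the near-linear speedup reported for small processor counts.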


Author(s):  
Miaoqing Huang ◽  
Chenggang Lai ◽  
Xuan Shi ◽  
Zhijun Hao ◽  
Haihang You

Coprocessors based on the Intel Many Integrated Core (MIC) architecture have been adopted in many high-performance computer clusters. Typical parallel programming models, such as MPI and OpenMP, are supported on MIC processors to achieve parallelism. In this work, we conduct a detailed study of the performance and scalability of MIC processors under different programming models using the Beacon computer cluster. Our findings are as follows. (1) The native MPI programming model on the MIC processors is typically better than the offload programming model, which offloads the workload to MIC cores using OpenMP. (2) On top of the native MPI programming model, multithreading inside each MPI process can further improve the performance of parallel applications on computer clusters with MIC coprocessors. (3) Given a fixed number of MPI processes, it is a good strategy to schedule these MPI processes on as few MIC processors as possible to reduce cross-processor communication overhead. (4) The hybrid MPI programming model, in which data processing is distributed to both MIC cores and CPU cores, can outperform the native MPI programming model.
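Finding (3) can be made concrete with a small sketch. Assuming an all-to-all communication pattern (a hypothetical workload, not Beacon's actual scheduler), packing ranks onto few coprocessors leaves fewer communicating pairs that must cross a processor boundary:

```python
# Sketch of finding (3): packing MPI ranks onto as few coprocessors as
# possible reduces cross-processor communication. Hypothetical
# placement policies for illustration only.

def packed_placement(num_ranks, cores_per_mic):
    """Fill each MIC completely before using the next one."""
    return [rank // cores_per_mic for rank in range(num_ranks)]

def spread_placement(num_ranks, num_mics):
    """Round-robin the ranks across all available MICs."""
    return [rank % num_mics for rank in range(num_ranks)]

def cross_processor_pairs(placement):
    """Count rank pairs that land on different processors, assuming an
    all-to-all communication pattern."""
    n = len(placement)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if placement[i] != placement[j])
```

For 8 ranks and 4 cores per MIC, the packed placement uses two MICs and incurs 16 cross-processor pairs, while spreading the same ranks over four MICs incurs 24, illustrating why packing reduces communication overhead.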


Author(s):  
Yao Wu ◽  
Long Zheng ◽  
Brian Heilig ◽  
Guang R Gao

As the attention given to big data grows, cluster computing systems for the distributed processing of large data sets have become a mainstream and critical requirement in high-performance distributed system research. One of the most successful systems is Hadoop, which uses MapReduce as its programming/execution model and uses disks as intermediate storage to process huge volumes of data. Spark, as an in-memory computing engine, can solve iterative and interactive problems more efficiently. However, there is currently a consensus that neither is the final solution for big data, owing to their MapReduce-like programming model, synchronous execution model, the constraint of supporting only batch processing, and so on. A new solution, in particular a fundamental evolution, is needed to bring big data solutions into a new era. In this paper, we introduce a new cluster computing system called HAMR, which supports both batch and streaming processing. To achieve better performance, HAMR integrates high-performance computing approaches, i.e., dataflow fundamentals, into a big data solution. More specifically, HAMR is designed entirely around in-memory computing to reduce unnecessary disk access overhead; task scheduling and memory management are fine-grained to expose more parallelism; and asynchronous execution improves the efficiency of computation resource usage and better balances the workload across the whole cluster. The experimental results show that HAMR can outperform Hadoop MapReduce and Spark by up to 19x and 7x, respectively, in the same cluster environment. Furthermore, HAMR can handle data sizes that scale well beyond the capabilities of Spark.
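The dataflow execution model mentioned above can be sketched in miniature: a task becomes runnable as soon as all of its producers have finished, with no global bulk-synchronous barrier between stages. This is a toy illustration of the general dataflow idea, not HAMR's actual runtime:

```python
from collections import deque

# Toy dataflow scheduler: each task runs once all of its inputs are
# ready, asynchronously with respect to unrelated tasks. Illustrates
# the dataflow execution model in general, not HAMR's implementation.

def dataflow_run(tasks, deps):
    """tasks: {name: fn(inputs_dict) -> value}
    deps:  {name: [producer names]}
    Returns (results, execution order)."""
    remaining = {t: len(deps[t]) for t in tasks}   # unmet input counts
    consumers = {t: [u for u in tasks if t in deps[u]] for t in tasks}
    ready = deque(t for t, n in remaining.items() if n == 0)
    results, order = {}, []
    while ready:
        t = ready.popleft()
        results[t] = tasks[t]({d: results[d] for d in deps[t]})
        order.append(t)
        for u in consumers[t]:                     # unblock consumers
            remaining[u] -= 1
            if remaining[u] == 0:
                ready.append(u)
    return results, order
```

In a real engine the ready tasks would execute concurrently across the cluster; the point of the sketch is that readiness is tracked per task, so independent branches never wait on each other as they would at a MapReduce-style synchronous barrier.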


1995 ◽  
Vol 04 (01n02) ◽  
pp. 33-53 ◽  
Author(s):  
ARVIND K. BANSAL

Associative computation is characterized by the seamless intertwining of search-by-content and data-parallel computation. The search-by-content paradigm is natural to scalable high-performance heterogeneous computing, since the use of tagged data avoids the need for explicit addressing mechanisms. In this paper, the author presents an algebra for associative logic programming, an associative resolution scheme, and a generic framework of an associative abstract instruction set. The model is based on the integration of data alignment and the use of two types of bags: bags of data elements, and filter bags of Boolean values that select and restrict computation on the data elements. The use of filter bags integrated with data alignment reduces computation and data transfer overhead, and the use of tagged data reduces the overhead of preparing data before transmission. The abstract instruction set is illustrated with an example, and performance results are presented for a simulation in a homogeneous address space.
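The two bag types described above can be sketched directly: a bag of data elements, and an aligned filter bag of Booleans that selects which elements participate in a data-parallel operation. This is a plain-Python illustration of the model, not the paper's abstract instruction set:

```python
# Sketch of data-element bags and filter bags. The filter bag is a bag
# of Booleans aligned with the data bag; operations apply only where
# the filter is True, so unselected elements are neither recomputed
# nor transferred. Illustration of the model only.

def make_filter(data, predicate):
    """Build a filter bag: one Boolean per data element."""
    return [predicate(x) for x in data]

def filtered_map(f, data, mask):
    """Apply f only where the filter bag is True; keep other elements
    unchanged."""
    return [f(x) if m else x for x, m in zip(data, mask)]

def filter_and(mask_a, mask_b):
    """Filter bags compose with element-wise Boolean operations, so
    successive restrictions cost one pass each, not a data reshuffle."""
    return [a and b for a, b in zip(mask_a, mask_b)]
```

Because the filter bag stays aligned with the data bag, restricting a computation never moves data, which is the source of the reduced data-transfer overhead claimed above.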


2020 ◽  
Vol 38 (3-4) ◽  
pp. 1-31
Author(s):  
Won Wook Song ◽  
Youngseok Yang ◽  
Jeongyoon Eo ◽  
Jangho Seo ◽  
Joo Yeon Kim ◽  
...  

Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project.
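The combination of an intermediate representation and compiler optimization passes can be sketched as follows. All names here are hypothetical illustrations of the general structure, not Apache Nemo's actual API:

```python
# Minimal sketch of the "IR + optimization pass" structure: a dataflow
# is a list of vertices carrying properties, and a pass rewrites those
# properties without touching application logic, so correctness is
# preserved by construction. Hypothetical names, not Nemo's API.

def make_vertex(name, **props):
    return {"name": name, "props": dict(props)}

def parallelism_pass(dag, default=4):
    """A pass that assigns a parallelism property to every vertex that
    does not already set one; the application semantics are untouched."""
    for v in dag:
        v["props"].setdefault("parallelism", default)
    return dag

def apply_passes(dag, passes):
    """Passes compose: each consumes and produces the same IR, which is
    what makes optimizations reusable across applications."""
    for p in passes:
        dag = p(dag)
    return dag
```

Because a pass only edits execution properties on the IR, the same pass can be reapplied to any dataflow, which is the composability and reusability the evaluation above refers to.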


1997 ◽  
Vol 6 (1) ◽  
pp. 153-158 ◽  
Author(s):  
Terry W. Clark ◽  
Reinhard v. Hanxleden ◽  
Ken Kennedy

To efficiently parallelize a scientific application with a data-parallel compiler requires certain structural properties in the source program, and conversely, the absence of others. A recent parallelization effort of ours reinforced this observation and motivated this correspondence. Specifically, we have transformed a Fortran 77 version of GROMOS, a popular dusty-deck program for molecular dynamics, into Fortran D, a data-parallel dialect of Fortran. During this transformation we have encountered a number of difficulties that probably are neither limited to this particular application nor do they seem likely to be addressed by improved compiler technology in the near future. Our experience with GROMOS suggests a number of points to keep in mind when developing software that may at some time in its life cycle be parallelized with a data-parallel compiler. This note presents some guidelines for engineering data-parallel applications that are compatible with Fortran D or High Performance Fortran compilers.


2021 ◽  
Vol 17 (2) ◽  
pp. 145-158
Author(s):  
Ahmad Qawasmeh ◽  
Salah Taamneh ◽  
Ashraf H. Aljammal ◽  
Nabhan Hamadneh ◽  
Mustafa Banikhalaf ◽  
...  

Different high-performance techniques, such as profiling, tracing, and instrumentation, have been used to tune and enhance the performance of parallel applications. However, these techniques do not show how to explore the potential for parallelism in a given application. Animating and visualizing the execution of a sequential algorithm provides a thorough understanding of its usage and functionality. In this work, an interactive web-based educational animation tool was developed to assist users in analyzing sequential algorithms to detect parallel regions, regardless of the parallel programming model used. The tool simplifies the learning of algorithms and helps students analyze programs efficiently. Our statistical t-test study on a sample of students showed a significant improvement in their perception of the mechanism and parallelism of applications, and an increase in their willingness to learn algorithms and parallel programming.


2015 ◽  
Vol 76 (12) ◽  
Author(s):  
K. Umapathy ◽  
V. Balaji ◽  
V. Duraisamy ◽  
S. S. Saravanakumar

This paper presents an FPGA implementation of wavelet-based medical image fusion using the high-level language C. The high-level instruction set of the image processor is based on image algebra operations such as convolution, additive max-min, and multiplicative max-min, which are used to increase processing speed. The FPGA-based processor accelerates the extraction of texture features, and the high-level C programming language is used for the hardware design. The proposed hardware architecture reduces hardware utilization and is well suited to low-power applications. The paper describes the user's programming interface and outlines the approach for dynamically generating FPGA architectures for the image co-processor. It also presents sample implementation results (speed, area) for different neighborhood operations.
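The image-algebra neighborhood operations named above can be sketched in one dimension for brevity. Additive max is the "max-plus" analogue of convolution (a grayscale dilation) and additive min its dual; this is a plain-Python illustration, not the FPGA instruction set:

```python
# 1-D sketch of two image-algebra neighborhood operations: additive max
# ("max-plus" convolution, a grayscale dilation) and additive min.
# Neighbors that fall past the end of the signal are ignored.
# Illustration only, not the paper's FPGA implementation.

def additive_max(signal, kernel):
    """out[i] = max over j of (signal[i+j] + kernel[j])."""
    n, k = len(signal), len(kernel)
    return [max(signal[i + j] + kernel[j] for j in range(k) if i + j < n)
            for i in range(n)]

def additive_min(signal, kernel):
    """out[i] = min over j of (signal[i+j] + kernel[j])."""
    n, k = len(signal), len(kernel)
    return [min(signal[i + j] + kernel[j] for j in range(k) if i + j < n)
            for i in range(n)]
```

Each output element depends only on a fixed-size neighborhood, which is exactly the regularity that makes such operations attractive to pipeline in FPGA hardware.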


2018 ◽  
Vol 2018 ◽  
pp. 1-11
Author(s):  
Haoduo Yang ◽  
Huayou Su ◽  
Qiang Lan ◽  
Mei Wen ◽  
Chunyuan Zhang

The growing use of graphs in many fields has sparked broad interest in developing high-level graph analytics programs. Existing GPU implementations either have limited performance or compromise on productivity. HPGraph, our high-performance bulk-synchronous graph analytics framework for the GPU, provides an abstraction focused on mapping vertex programs to generalized sparse matrix operations on the GPU backend. HPGraph strikes a balance between performance and productivity by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that lets users implement various graph algorithms with relatively little effort. We evaluate the performance of HPGraph on four graph primitives (BFS, SSSP, PageRank, and TC). Our experiments show that HPGraph matches or even exceeds the performance of high-performance GPU graph libraries such as MapGraph, nvGraph, and Gunrock. HPGraph also runs significantly faster than advanced CPU graph libraries.
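The "vertex program as sparse matrix operation" idea can be illustrated with BFS: each level is the analogue of a Boolean sparse-matrix/vector product that expands the frontier through the adjacency structure, masking out visited vertices. This is a plain-Python sketch of the general technique, not HPGraph's GPU code:

```python
# Sketch of BFS expressed in the sparse-matrix style: the frontier is a
# sparse "vector", and each level expands it through the adjacency
# structure (the analogue of a Boolean SpMV), masking visited vertices.
# Plain-Python illustration only.

def bfs_spmv(adj, source):
    """adj: {u: [v, ...]} directed adjacency lists.
    Returns {vertex: BFS level} for every reached vertex."""
    levels = {source: 0}
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        # "SpMV" step: gather all out-neighbors of the frontier,
        # keeping only vertices not yet assigned a level.
        frontier = {v for u in frontier for v in adj.get(u, [])
                    if v not in levels}
        for v in frontier:
            levels[v] = level
    return levels
```

Phrasing BFS this way is what lets a framework reuse tuned sparse-matrix primitives as the backend for many different vertex programs.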

