High Performance Parallel Programming of a GA Using Multi-core Technology

A novel master-multi-SIMD architecture and its kernel (template) based parallel programming flow are introduced as a parallel signal processing platform. The platform is named ePUMA (embedded Parallel DSP processor architecture with Unique Memory Access). Its essential technique is to separate data-access kernels from arithmetic computing kernels, so that the run-time cost of data access can be minimized by running it in parallel with algorithm computation. The SIMD memory subsystem architecture based on the proposed flow dramatically improves the total computing performance. The hardware system and programming flow introduced in this article primarily target low-power, high-performance embedded parallel computing with low silicon cost for communications and similar real-time signal processing.
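
The idea of running data access in parallel with computation can be illustrated with a small double-buffering sketch. This is not the ePUMA toolchain or its kernel format; the function names `load_block`, `compute_block`, and `pipelined_sum_of_squares` are hypothetical, and a Python thread only imitates the scheduling structure (real overlap would come from dedicated memory hardware, not from the interpreter).

```python
from concurrent.futures import ThreadPoolExecutor


def load_block(data, start, size):
    # Hypothetical stand-in for a data-access kernel: fetch one block.
    return data[start:start + size]


def compute_block(block):
    # Hypothetical stand-in for an arithmetic kernel: sum of squares.
    return sum(x * x for x in block)


def pipelined_sum_of_squares(data, block_size):
    """Request block i while block i-1 is being computed."""
    total = 0
    pending = None  # future for the block that was requested last
    with ThreadPoolExecutor(max_workers=1) as loader:
        for start in range(0, len(data), block_size):
            # Issue the next load before computing on the previous block.
            nxt = loader.submit(load_block, data, start, block_size)
            if pending is not None:
                total += compute_block(pending.result())
            pending = nxt
        # Drain the last outstanding block.
        if pending is not None:
            total += compute_block(pending.result())
    return total
```

The loop always issues the next load before it starts computing, so the load of one block and the computation on the previous block are in flight at the same time.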

Introduction to high-performance computing and parallel programming

SciVee ◽

10.4016/19132.01 ◽

2010 ◽

Keyword(s):

High Performance Computing ◽

Parallel Programming ◽

High Performance ◽

Performance Computing

Teaching High Performance Computing through Parallel Programming Marathons

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) ◽

10.1109/ipdpsw.2019.00058 ◽

2019 ◽

Cited By ~ 1

Author(s):

Leandro Marzulo ◽

Calebe Bianchini ◽

Leandro Santiago ◽

Victor Ferreira ◽

Brunno Goldstein ◽

...

Keyword(s):

High Performance Computing ◽

Parallel Programming ◽

High Performance ◽

Performance Computing

Run-Time and Compiler Support for Programming in Adaptive Parallel Environments

Scientific Programming ◽

10.1155/1997/926796 ◽

1997 ◽

Vol 6 (2) ◽

pp. 215-227 ◽

Cited By ~ 11

Author(s):

Guy Edjlali ◽

Gagan Agrawal ◽

Alan Sussman ◽

Jim Humphries ◽

Joel Saltz

Keyword(s):

Parallel Programming ◽

High Performance ◽

Navier Stokes ◽

Programming Environments ◽

Data Parallel ◽

Adaptive Environment ◽

Run Time ◽

Time Required ◽

The Cost ◽

Performance Results

For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at run-time. In this article, we discuss run-time support for data-parallel programming in such an adaptive environment. Executing programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. We have developed a run-time library to provide this support. We discuss how the run-time library can be used by compilers of High Performance Fortran (HPF)-like languages to generate code for an adaptive environment. We present performance results for a Navier-Stokes solver and a multigrid template run on a network of workstations and an IBM SP-2. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant compared to the time required for the actual computation. Overall, our work establishes the feasibility of compiling HPF for a network of nondedicated workstations, which are likely to be an important resource for parallel programming in the future.
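
The two tasks named in the abstract, recomputing loop bounds and redistributing data when the processor count changes, can be sketched for the simplest case of a one-dimensional block distribution. This is an illustrative assumption, not the authors' run-time library; `block_bounds` and `redistribution_plan` are hypothetical names.

```python
def block_bounds(n, p, rank):
    """Half-open index range [lo, hi) owned by `rank` when n elements
    are block-distributed over p processors (remainder goes to low ranks)."""
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def redistribution_plan(n, old_p, new_p):
    """List the (src, dst, lo, hi) transfers needed when the processor
    count changes from old_p to new_p. Data that stays on the same rank
    is not listed."""
    plan = []
    for src in range(old_p):
        s_lo, s_hi = block_bounds(n, old_p, src)
        for dst in range(new_p):
            d_lo, d_hi = block_bounds(n, new_p, dst)
            # Overlap of the old owner's range with the new owner's range.
            lo, hi = max(s_lo, d_lo), min(s_hi, d_hi)
            if lo < hi and src != dst:
                plan.append((src, dst, lo, hi))
    return plan
```

For example, shrinking from 4 to 2 processors with 10 elements gives each surviving rank new loop bounds of 5 elements, and the plan lists only the index ranges that must move between ranks.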

Study of parallel programming models on computer clusters with Intel MIC coprocessors

The International Journal of High Performance Computing Applications ◽

10.1177/1094342015580864 ◽

2015 ◽

Vol 31 (4) ◽

pp. 303-315 ◽

Cited By ~ 3

Author(s):

Miaoqing Huang ◽

Chenggang Lai ◽

Xuan Shi ◽

Zhijun Hao ◽

Haihang You

Keyword(s):

Parallel Programming ◽

High Performance ◽

Programming Model ◽

Fixed Number ◽

Parallel Applications ◽

Programming Models ◽

Communication Overhead ◽

Computer Clusters ◽

Parallel Programming Models ◽

Intel Mic

Coprocessors based on the Intel Many Integrated Core (MIC) Architecture have been adopted in many high-performance computer clusters. Typical parallel programming models, such as MPI and OpenMP, are supported on MIC processors to achieve parallelism. In this work, we conduct a detailed study of the performance and scalability of the MIC processors under different programming models using the Beacon computer cluster. Our findings are as follows. (1) The native MPI programming model on the MIC processors is typically better than the offload programming model, which offloads the workload to MIC cores using OpenMP. (2) On top of the native MPI programming model, multithreading inside each MPI process can further improve the performance for parallel applications on computer clusters with MIC coprocessors. (3) Given a fixed number of MPI processes, it is a good strategy to schedule these MPI processes to as few MIC processors as possible to reduce the cross-processor communication overhead. (4) The hybrid MPI programming model, in which data processing is distributed to both MIC cores and CPU cores, can outperform the native MPI programming model.
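
Finding (3) can be illustrated with a toy placement model. The sketch below is not taken from the study: it assumes a hypothetical fixed number of slots per device and assumes every pair of ranks communicates, then counts how many pairs end up on different devices under a compact versus a round-robin placement.

```python
def compact_placement(num_ranks, slots_per_device):
    """Fill each device before using the next one (fewest devices)."""
    return [rank // slots_per_device for rank in range(num_ranks)]


def scattered_placement(num_ranks, num_devices):
    """Round-robin ranks over all devices, for comparison."""
    return [rank % num_devices for rank in range(num_ranks)]


def cross_device_pairs(placement):
    """Count rank pairs placed on different devices (assumes all-pairs
    communication, which is a simplification)."""
    n = len(placement)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if placement[i] != placement[j]
    )
```

With 8 ranks, packing them onto two devices of 4 slots leaves fewer cross-device pairs than spreading them over four devices, which is the direction of the effect the abstract reports.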

Easy PRAM-Based High-Performance Parallel Programming with ICE

IEEE Transactions on Parallel and Distributed Systems ◽

10.1109/tpds.2017.2754376 ◽

2018 ◽

Vol 29 (2) ◽

pp. 377-390 ◽

Cited By ~ 4

Author(s):

Fady Ghanim ◽

Uzi Vishkin ◽

Rajeev Barua

Keyword(s):

Parallel Programming ◽

High Performance

Designing High-Performance Fuzzy Controllers Combining IP Cores and Soft Processors

Advances in Fuzzy Systems ◽

10.1155/2012/475894 ◽

2012 ◽

Vol 2012 ◽

pp. 1-11 ◽

Cited By ~ 4

Author(s):

Oscar Montiel-Ross ◽

Jorge Quiñones ◽

Roberto Sepúlveda

Keyword(s):

Parallel Programming ◽

High Speed ◽

High Performance ◽

Hardware Description Language ◽

System Response ◽

Description Language ◽

Tuning Method ◽

Hardware Description ◽

Ip Cores ◽

Fuzzy Pd

This paper presents a methodology for integrating a fuzzy coprocessor described in VHDL (VHSIC Hardware Description Language) into a soft processor embedded in an FPGA. The integration increases the throughput of the whole system, because the controller uses parallelism at the circuitry level for high-speed-demanding applications, while the rest of the application can be written in C/C++. We used the ARM 32-bit soft processor, which allows sequential and parallel programming. The FLC coprocessor incorporates a tuning method that allows the system response to be manipulated. We show experimental results using a fuzzy PD+I controller as the embedded coprocessor.