rMPI: Message Passing on Multicore Processors with On-Chip Interconnect

Author(s):  
James Psota ◽  
Anant Agarwal
Author(s):  
Ram Prasad Mohanty ◽  
Ashok Kumar Turuk ◽  
Bibhudatta Sahoo

The growing number of cores increases the demand for a powerful memory subsystem, which in turn drives up cache sizes in multicore processors. Caches give the processing elements faster, higher-bandwidth local memory to work with. In this chapter, we analyze the impact of cache size on the performance of multicore processors by varying the L1 and L2 cache sizes of a multicore processor with an internal network (MPIN), referenced from the Niagara architecture. As the number of cores increases, traditional on-chip interconnects such as the bus and crossbar prove inefficient and suffer from poor scalability. To overcome the scalability and efficiency issues of these conventional interconnects, a ring-based design has been proposed. We analyze the effect of the interconnect on the performance of multicore processors and propose a novel scalable on-chip interconnection mechanism (INoC). Benchmark results are obtained with a full-system simulator and show that, compared with the MPIN, the proposed INoC significantly reduces execution time.
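To illustrate why bus- and crossbar-style interconnects lose ground as core counts grow, the toy model below compares average hop counts on a bidirectional ring against a fully connected crossbar. This is only a sketch of the scalability argument; the chapter's actual INoC design and simulator results are not reproduced here, and the function names are my own.

```python
def ring_avg_hops(n: int) -> float:
    """Average shortest-path hops between distinct cores on a bidirectional ring."""
    total = sum(min(abs(i - j), n - abs(i - j))
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def crossbar_avg_hops(n: int) -> float:
    """A crossbar connects every pair of cores directly: always one hop."""
    return 1.0

# A ring's average distance grows with n, while the crossbar stays flat
# (its cost is instead paid in wiring, which scales as n^2).
for n in (4, 8, 16, 64):
    print(n, crossbar_avg_hops(n), round(ring_avg_hops(n), 2))
```

The crossbar's constant latency comes at quadratic wiring cost, which is the efficiency problem the abstract refers to; the ring trades a linearly growing average distance for constant per-node wiring.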


2016 ◽  
Vol 15 (1) ◽  
pp. 33-36 ◽  
Author(s):  
P. Garcia ◽  
T. Gomes ◽  
J. Monteiro ◽  
A. Tavares ◽  
M. Ekpanyapong

2013 ◽  
Vol 2013 ◽  
pp. 1-7 ◽  
Author(s):  
Xizhong Wang ◽  
Deyun Chen

We introduce a parallel chaos-based encryption algorithm that takes advantage of multicore processors. The chaotic cryptosystem is generated by the piecewise linear chaotic map (PWLCM). The parallel algorithm is designed with a master/slave communication model using the Message Passing Interface (MPI), and it is suitable not only for multicore processors but also for single-processor architectures. Experimental results show that the chaos-based cryptosystem possesses good statistical properties. The parallel algorithm performs much better than the serial one and is useful for encrypting and decrypting large files or multimedia.
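The core of such a cryptosystem can be sketched as a PWLCM-driven keystream XORed with the plaintext. This is a minimal illustration of the map and the stream-cipher structure, not the authors' exact algorithm or parameters; in the MPI version, each slave would encrypt an independent chunk with its own seed, while here everything runs serially.

```python
def pwlcm(x: float, p: float) -> float:
    """Piecewise linear chaotic map on [0, 1) with control parameter p in (0, 0.5)."""
    if x >= 0.5:
        x = 1.0 - x        # the map is symmetric about 0.5
    return x / p if x < p else (x - p) / (0.5 - p)

def keystream(seed: float, p: float, n: int) -> bytes:
    """Iterate the map and quantize each state to one keystream byte."""
    out, x = bytearray(), seed
    for _ in range(n):
        x = pwlcm(x, p)
        out.append(int(x * 256) % 256)
    return bytes(out)

def crypt(data: bytes, seed: float, p: float) -> bytes:
    """XOR with the chaotic keystream; encryption and decryption are identical."""
    return bytes(a ^ b for a, b in zip(data, keystream(seed, p, len(data))))

msg = b"multicore chaos"
enc = crypt(msg, seed=0.3141, p=0.27)
assert crypt(enc, seed=0.3141, p=0.27) == msg  # round-trip recovers the plaintext
```

Because XOR is its own inverse, the same routine serves for both directions; the (seed, p) pair plays the role of the secret key.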


Author(s):  
Martin J. Chorley ◽  
David W. Walker ◽  
Martyn F. Guest

Hybrid programming, whereby shared-memory and message-passing programming techniques are combined within a single parallel application, has often been discussed as a method for increasing code performance on clusters of symmetric multiprocessors (SMPs). This paper examines whether the hybrid model brings any performance benefit for clusters based on multicore processors. A molecular dynamics application has been parallelized using both MPI and hybrid MPI/OpenMP programming models. The performance of this application has been examined on two high-end multicore clusters using both InfiniBand and Gigabit Ethernet interconnects. The hybrid model has been found to perform well on the higher-latency Gigabit Ethernet connection but offers no performance benefit on the low-latency InfiniBand interconnect. The changes in performance are attributed to the differing communication profiles of the hybrid and MPI codes.


2013 ◽  
Vol 22 (05) ◽  
pp. 1350036
Author(s):  
F. A. ESCOBAR-JUZGA ◽  
F. E. SEGURA-QUIJANO

Networks-on-Chip (NoCs) are commonly used to integrate complex embedded systems and multiprocessor platforms owing to their scalability and versatility. Functional-level modeling tools use SystemC to perform hardware–software co-design and error correction concurrently, reducing time to market. This work analyzes a JPEG encoding algorithm mapped onto a configurable M × N mesh/torus NoC platform described in SystemC with the transaction-level modeling (TLM) standard; timing constraints for both the router and the network interface controller are assigned according to a hardware description language (HDL) model written for this purpose. Processing nodes are also described as SystemC threads, and their computation delays are assigned according to the number and cost of the operations they perform. The programming model employed is message passing. We start by describing and profiling the JPEG algorithm as a task graph; four partitioning proposals are then mapped onto three NoCs of different sizes. Our analysis covers changes in topology, virtual-channel depth, routing algorithm, network speed, and task-to-node assignment. Through several high-level simulations we evaluate the impact of each parameter and show that, for the proposed model, most of the improvement comes from the algorithm partitioning.
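Among the routing algorithms such studies typically sweep, deterministic XY (dimension-order) routing on a mesh is the simplest baseline. The sketch below shows that algorithm only; the paper's router is a timed SystemC/TLM model, and the function name here is my own.

```python
def xy_route(src, dst):
    """Hop sequence from src to dst on a mesh: resolve X first, then Y.

    Coordinates are (x, y) tuples; XY routing is deadlock-free on a mesh
    because it never turns from the Y dimension back into X.
    """
    (sx, sy), (dx, dy) = src, dst
    path = [(sx, sy)]
    while sx != dx:                     # travel along the X dimension first
        sx += 1 if dx > sx else -1
        path.append((sx, sy))
    while sy != dy:                     # then along Y
        sy += 1 if dy > sy else -1
        path.append((sx, sy))
    return path

print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

A torus variant would additionally consider wrap-around links when choosing the direction in each dimension, shortening paths at the mesh edges.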


Electronics ◽  
2021 ◽  
Vol 10 (21) ◽  
pp. 2681
Author(s):  
Joonmoo Huh ◽  
Deokwoo Lee

Shared memory is the most popular parallel programming model for multicore processors, while message passing is generally used for large distributed machines. However, as the number of cores on a chip increases, the relative merits of shared memory versus message passing change, and we argue that message passing becomes a viable, high-performing parallel programming model. To test this hypothesis, we compare a shared-memory architecture with a new message-passing architecture on a suite of applications tuned for each system independently. Perhaps surprisingly, when optimized for both models the applications studied in this work behave very similarly, and both versions execute efficiently on multicore architectures despite their differing implementations. Furthermore, if the hardware is tuned to support message passing, with bulk message transfer, elimination of unnecessary coherence overheads, and effective support for global operations, then some applications perform much better on a message-passing architecture. Leveraging these insights, we design a message-passing architecture that supports both memory-to-memory and cache-to-cache messaging in hardware. With the new architecture, message passing outperforms its shared-memory counterpart on many of the applications, owing to the unique advantages of the message-passing hardware over cache coherence. In the best case, message passing is up to 34% faster than its shared-memory counterpart, and on average it is 10% faster. In the worst case, message passing is slower on two applications, CG (conjugate gradient) and FT (Fourier transform), because it handles their data-sharing patterns less well than shared memory does.
Overall, our analysis demonstrates the importance of considering message passing as a high-performing, hardware-supported programming model for future multicore architectures.
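The programming-model contrast the paper draws at the hardware level can be sketched in software: shared-memory code performs many fine-grained updates to shared data under synchronization, while message-passing code sends a few bulk messages. This toy (with invented worker names, using Python threads and a queue) only illustrates the two styles, not the paper's architectures or measurements.

```python
import queue
import threading

# Shared-memory style: workers update one global counter under a lock
# (a software analog of coherence traffic on fine-grained shared data).
counter = 0
lock = threading.Lock()

def shared_worker(items):
    global counter
    for x in items:
        with lock:
            counter += x

# Message-passing style: each worker sends a single bulk partial-sum
# message, and the master combines them; no fine-grained sharing.
inbox = queue.Queue()

def mp_worker(items):
    inbox.put(sum(items))

data = [list(range(i, i + 100)) for i in range(0, 400, 100)]

for target in (shared_worker, mp_worker):
    threads = [threading.Thread(target=target, args=(chunk,)) for chunk in data]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

mp_total = sum(inbox.get() for _ in data)
print(counter, mp_total)  # both equal sum(range(400)) = 79800
```

Both styles compute the same reduction; the paper's point is that when the bulk-transfer pattern is supported directly in hardware, the message-passing version can avoid coherence overhead entirely.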

