The ROSACE case study: From Simulink specification to multi/many-core execution

Due to the growing demand for high performance and low power in embedded systems, many-core architectures have been proposed as the most suitable solutions. As the design focus of many-core embedded systems shifts from computation-centric to communication-centric, Network-on-Chip (NoC) is one of the best interconnect techniques for such architectures because of its scalability and high communication bandwidth. Formalized and optimized system-level design methods for NoC-based many-core embedded systems are needed to improve system performance and reduce power consumption. To examine these design optimization methods in depth, this chapter presents a case study of optimizing many-core embedded systems based on a 3-dimensional (3D) NoC with an irregular vertical link distribution topology, through task mapping, core placement, routing, and topology generation. Results of cycle-accurate simulation experiments demonstrate the validity and efficiency of the design methods. For the case study configuration, applying the design optimization methods saves up to 60% of the vertical links while maintaining system efficiency, compared with 3D NoCs that have full vertical link connection.

Performance evaluation of many‐core systems: case study with TILEPro64

IET Computers & Digital Techniques ◽

10.1049/iet-cdt.2012.0101 ◽

2013 ◽

Vol 7 (4) ◽

pp. 143-154

Author(s):

Han‐Yee Kim ◽

Young‐Hwan Kim ◽

HeonChang Yu ◽

Taeweon Suh

Keyword(s):

Performance Evaluation ◽

Many Core

Many-core needs fine-grained scheduling: A case study of query processing on Intel Xeon Phi processors

Journal of Parallel and Distributed Computing ◽

10.1016/j.jpdc.2017.09.005 ◽

2018 ◽

Vol 120 ◽

pp. 395-404 ◽

Cited By ~ 3

Author(s):

Xuntao Cheng ◽

Bingsheng He ◽

Mian Lu ◽

Chiew Tong Lau

Keyword(s):

Query Processing ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Fine Grained ◽

Many Core ◽

Intel Xeon

Exploring performance and energy tradeoffs for irregular applications: A case study on the Tilera many-core architecture

Journal of Parallel and Distributed Computing ◽

10.1016/j.jpdc.2016.06.006 ◽

2017 ◽

Vol 104 ◽

pp. 234-251 ◽

Cited By ~ 1

Author(s):

Ajay Panyala ◽

Daniel Chavarría-Miranda ◽

Joseph B. Manzano ◽

Antonino Tumeo ◽

Mahantesh Halappanavar

Keyword(s):

Irregular Applications ◽

Energy Tradeoffs ◽

Many Core

Energy Efficiency Effects of Vectorization in Data Reuse Transformations for Many-Core Processors—A Case Study †

Journal of Low Power Electronics and Applications ◽

10.3390/jlpea7010005 ◽

2017 ◽

Vol 7 (1) ◽

pp. 5 ◽

Cited By ~ 1

Author(s):

Abdullah Al Hasib ◽

Lasse Natvig ◽

Per Kjeldsberg ◽

Juan Cebrián

Keyword(s):

Energy Efficiency ◽

Data Reuse ◽

Many Core ◽

Efficiency Effects

FFT on XMT: Case Study of a Bandwidth-Intensive Regular Algorithm on a Highly-Parallel Many Core

2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) ◽

10.1109/ipdpsw.2016.157 ◽

2016 ◽

Author(s):

James Edwards ◽

Uzi Vishkin

Keyword(s):

Many Core

Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study

Scientific Programming ◽

10.1155/2015/904983 ◽

2015 ◽

Vol 2015 ◽

pp. 1-12

Author(s):

Hari Radhakrishnan ◽

Damian W. I. Rouson ◽

Karla Morris ◽

Sameer Shende ◽

Stavros C. Kassinos

Keyword(s):

Distributed Memory ◽

Profile Analysis ◽

Multicore Processors ◽

Rapid Evolution ◽

Model Verification ◽

Parallel Application ◽

Linear Speedup ◽

And Performance ◽

Many Core

This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
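
The change from a sequential summation to a binary tree reduction described in this abstract can be illustrated with a minimal sketch. It is not the paper's Fortran coarray code: it is a plain Python simulation in which each list element stands for the partial sum held by one image, and the function name `tree_sum` and its return shape are assumptions made for illustration. The sketch shows why the tree form scales better: the number of communication rounds grows with the base-2 logarithm of the image count, not with the image count itself.

```python
def tree_sum(partials):
    """Pairwise (binary tree) reduction of per-image partial sums.

    Hypothetical helper: `partials` simulates one value per image.
    Returns (total, rounds), where `rounds` is the number of combining
    steps that a parallel run would execute one after another.
    """
    values = list(partials)
    if not values:
        raise ValueError("need at least one partial sum")

    n = len(values)
    stride = 1
    rounds = 0
    while stride < n:
        # Within a round, every addition touches a distinct pair of slots,
        # so in a real parallel run all of them could proceed at once.
        for i in range(0, n - stride, 2 * stride):
            values[i] += values[i + stride]
        stride *= 2
        rounds += 1
    return values[0], rounds


if __name__ == "__main__":
    total, rounds = tree_sum(range(1, 33))  # 32 simulated images
    print(total, rounds)
```

With 32 simulated images, the tree needs 5 rounds, whereas a sequential accumulation on a single image would need 31 dependent additions.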