High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study

Driven by the growing demand for high performance and low power in embedded systems, many-core architectures have been proposed as the most suitable solutions. As the design focus of many-core embedded systems shifts from computation to communication, Network-on-Chip (NoC) has become one of the best interconnect techniques for such architectures because of its scalability and high communication bandwidth. Formalized and optimized system-level design methods for NoC-based many-core embedded systems are needed to improve system performance and reduce power consumption. To examine these design optimization methods in depth, this chapter presents a case study of optimizing many-core embedded systems based on a 3-Dimensional (3D) NoC with an irregular vertical link distribution, through task mapping, core placement, routing, and topology generation. Results of cycle-accurate simulation experiments demonstrate the validity and efficiency of the design methods. For the case study configuration, applying the design optimization methods saves up to 60% of the vertical links while maintaining system efficiency, compared with 3D NoCs that have full vertical link connections.

HIGH PERFORMANCE AND SCALABLE RADIX SORTING: A CASE STUDY OF IMPLEMENTING DYNAMIC PARALLELISM FOR GPU COMPUTING

Parallel Processing Letters ◽

10.1142/s0129626411000187 ◽

2011 ◽

Vol 21 (02) ◽

pp. 245-272 ◽

Cited By ~ 106

Author(s):

DUANE MERRILL ◽

ANDREW GRIMSHAW

Keyword(s):

High Performance ◽

Gpu Computing ◽

State Of The Art ◽

Design Strategies ◽

Kernel Fusion ◽

Parallel Prefix ◽

Scan Data ◽

Many Core ◽

Global Data

The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs.
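
The relationship between a radix sorting pass and a prefix scan can be illustrated with a minimal sequential sketch. It is not the authors' GPU implementation: it models none of the kernel fusion, multi-scan, or serialization features described in the abstract, and the names `exclusive_scan`, `radix_sort`, `key_bits`, and `radix_bits` are hypothetical. Each pass counts the keys per digit bin, turns the counts into starting offsets with an exclusive prefix scan, and then scatters the keys stably.

```python
from itertools import accumulate


def exclusive_scan(counts):
    """Exclusive prefix sum: out[i] is the sum of counts[0..i-1]."""
    return [0] + list(accumulate(counts))[:-1]


def radix_sort(keys, key_bits=32, radix_bits=4):
    """Least-significant-digit radix sort of non-negative ints < 2**key_bits.

    Illustrative sequential sketch only; parameter names are assumptions.
    """
    num_bins = 1 << radix_bits
    mask = num_bins - 1
    for shift in range(0, key_bits, radix_bits):
        # Histogram: how many keys fall into each digit bin for this pass.
        counts = [0] * num_bins
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # The exclusive scan converts bin counts into each bin's start offset.
        offsets = exclusive_scan(counts)
        # Stable scatter: keys with equal digits keep their relative order.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

Calling `radix_sort([170, 45, 75, 90, 802, 24, 2, 66])` returns the keys in ascending order. On a GPU the histogram, scan, and scatter steps would each run in parallel; the sketch only shows how they depend on one another.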

Analysis and Identification of Possible Automation Approaches for Embedded Systems Design Flows

Information ◽

10.3390/info11020120 ◽

2020 ◽

Vol 11 (2) ◽

pp. 120

Author(s):

Augusto Y. Horita ◽

Denis S. Loubach ◽

Ricardo Bonna

Keyword(s):

Embedded Systems ◽

High Performance ◽

Development Process ◽

Systems Design ◽

Design Flow ◽

Formal Models ◽

Models Of Computation ◽

Implementation Phase ◽

Novel Method

Sophisticated, high-performance embedded systems are present in an increasing number of application domains. In this context, formal-based design methods have been studied to make the development process robust and scalable. Models of computation (MoCs) allow an application to be modeled at a high abstraction level on a formal basis. This enables analysis before the application moves to the implementation phase. Different tools and frameworks supporting MoCs have been developed. Some of them can simulate the models and also verify their functionality and feasibility before the next design steps. In view of this, we present a novel method for the analysis and identification of possible automation approaches applicable to embedded systems design flows supported by formal models of computation. A comprehensive case study shows the potential and applicability of our method.

Evaluation by Neutron Radiation of the NMR-MPar Fault-Tolerance Approach Applied to Applications Running on a 28-nm Many-Core Processor

Electronics ◽

10.3390/electronics7110312 ◽

2018 ◽

Vol 7 (11) ◽

pp. 312 ◽

Cited By ~ 1

Author(s):

Vanessa Vargas ◽

Pablo Ramos ◽

Raoul Velazco

Keyword(s):

Fault Tolerance ◽

High Performance ◽

Matrix Multiplication ◽

Neutron Radiation ◽

High Energy ◽

Particle Accelerator ◽

Natural Radiation ◽

Tolerance Approach ◽

Many Core ◽

28 Nm

Currently, there is special interest in validating the use of Commercial-Off-The-Shelf (COTS) multi/many-core processors for critical applications, thanks to their high performance, low power consumption, and affordability. However, the continuous shrinking of transistor geometry and the increasing complexity of these devices dramatically increase their sensitivity to natural radiation and thus diminish their reliability. One of the most common effects of natural radiation is the Single Event Upset, a bit-flip in memory content that produces unexpected results at the application level. For this reason, manufacturers and users implement hardware and software error-mitigation techniques on multi/many-core processors. In this context, the present work evaluates a new fault-tolerance approach based on N-Modular Redundancy (NMR) and partitioning, called NMR-MPar, by means of 14 MeV neutron radiation ground testing, which emulates the effects of the high-energy neutrons present at avionics altitudes. For evaluation purposes, a case study is implemented on the 28 nm CMOS KALRAY MPPA-256 many-core processor running two complementary benchmark applications: a distributed Matrix Multiplication and the Traveling Salesman Problem. Radiation experiments were conducted in the GENEPI2 particle accelerator. The correctness of the application results when an error is detected confirms the approach's effectiveness and supports its use in avionics applications.
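
The general idea of N-modular redundancy can be sketched as a majority vote over the results of redundant replicas. The abstract does not describe the internals of NMR-MPar, so the following is only a generic illustration under stated assumptions: the function names `matmul` and `nmr_vote` are hypothetical, the partitioning aspect is not modeled, and a bit-flip is simulated by hand.

```python
from collections import Counter


def matmul(a, b):
    """Multiply two matrices given as tuples of row tuples (hashable results)."""
    return tuple(
        tuple(sum(x * y for x, y in zip(row, col)) for col in zip(*b))
        for row in a
    )


def nmr_vote(results):
    """Majority-vote over replica results (generic NMR sketch, not NMR-MPar).

    Returns (value, unanimous). Raises RuntimeError when no value is
    produced by a strict majority of the replicas.
    """
    if not results:
        raise ValueError("at least one replica result is required")
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("no strict majority among replicas")
    return value, votes == len(results)


a = ((1, 2), (3, 4))
b = ((5, 6), (7, 8))
good = matmul(a, b)
# Simulate a Single Event Upset: flip one bit of one element in one replica.
corrupted = (good[0], (good[1][0], good[1][1] ^ 4))
voted, unanimous = nmr_vote([good, corrupted, good])
```

Here `voted` equals the correct product and `unanimous` is `False`, which signals that one replica disagreed and was outvoted. With three replicas a single faulty result is masked; if all three disagree, the vote fails and the error is detected but not corrected.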

Efficient Instruction and Data Caching for High Performance Embedded Processors

Jornada de Jóvenes Investigadores del I3A ◽

10.26754/jji-i3a.201201788 ◽

1970 ◽

pp. 9

Author(s):

A. Ferrerón Labari ◽

D. Suárez Gracia ◽

V. Viñals Yúfera

Keyword(s):

Embedded Systems ◽

Power Consumption ◽

Low Power ◽

Interconnection Networks ◽

High Performance ◽

Critical Issue ◽

Content Management ◽

Structure Design ◽

Portable Devices ◽

On Chip

In recent years, embedded systems have evolved to offer capabilities previously found only in high-performance systems. Portable devices already have on-chip multiprocessors (such as the PowerPC 476FP or ARM Cortex A9 MP), usually multi-threaded, and a powerful on-chip multi-level cache memory hierarchy. As most of these systems are battery-powered, power consumption becomes a critical issue. Achieving both high performance and low power consumption is a highly complex challenge for which some proposals have already been made. Suarez et al. proposed a new on-chip cache hierarchy, the LP-NUCA (Low Power NUCA), which reduces access latency by taking advantage of NUCA (Non-Uniform Cache Architecture) properties. The key points are decoupling the functionality and using three specialized networks-on-chip. This structure has proved efficient for data hierarchies, achieving good performance and reducing energy consumption. Instruction caches, on the other hand, have different requirements and characteristics from data caches, which conflict with the requirements of low-power embedded systems, especially in SMT (simultaneous multi-threading) environments. We want to study the benefits of using small tiled caches for the instruction hierarchy, so we propose a new design, ID-LP-NUCAs. We therefore need to completely re-evaluate our previous design in terms of structure design, interconnection networks (including topologies, flow control, and routing), content management (with special interest in hardware/software content allocation policies), and structure sharing. In CMP (chip multiprocessor) environments with parallel workloads, coherence plays an important role and must be taken into consideration.

THE USE OF CASE STUDY METHOD IN TEACHING THE TERMINOLOGY IN THE FIELD OF AGRICULTURAL MACHINERY

10.36078/987654370 ◽

2019 ◽

pp. 123-130

Keyword(s):

Case Studies ◽

High Performance ◽

Mechanical Engineering ◽

The United States ◽

Machine Building ◽

Soil Tillage ◽

Case Study Method ◽

Agricultural Machine ◽

Manufacturing Machine

Scientific research in mechanical engineering, covering machine manufacturing, soil tillage, sowing, and harvesting, is taken into account in manufacturing the local universal tractor, which is designed on the basis of high ergonomic indicators. This research responds to the agrotechnical requirements of plant cultivation and transportation through the development of new high-performance, energy-saving machines, the creation of multifunctional machines that carry out soil cultivation and several planting operations simultaneously, and the integration of agricultural machine designs. For this reason, this article explores the use of case studies in teaching agricultural terminology by analyzing research in machine building. The case study method was first used in 1870 at Harvard Law School in the United States. The article also gives examples of agricultural machine-building terms and describes the teaching of terminology, case methods, the case study process, and the case study method itself. Research in the field of mechanical engineering and the use of case studies in teaching terminology are also analyzed. In addition, requirements for the development of case study tasks are given in terms of their practical didactic nature. We also give case study models that allow students' activities to be analyzed and evaluated.

Geophysical Parameters Retrieval From Sentinel-1 Sar Data: A Case Study For High Performance Computing At EODC

24th High Performance Computing Symposium ◽

10.22360/springsim.2016.hpc.026 ◽

2016 ◽

Cited By ~ 1

Keyword(s):

High Performance Computing ◽

High Performance ◽

Sar Data ◽

Performance Computing

Neighborhood Energy Modeling and Monitoring: A Case Study

Energies ◽

10.3390/en14123716 ◽

2021 ◽

Vol 14 (12) ◽

pp. 3716

Author(s):

Francesco Causone ◽

Rossano Scoccia ◽

Martina Pelle ◽

Paola Colombo ◽

Mario Motta ◽

...

Keyword(s):

High Performance ◽

High Efficiency ◽

Early Stage ◽

Energy Performance ◽

Monitoring Plan ◽

Performance Targets ◽

Carrier Energy ◽

Zero Carbon ◽

Energy Grid

Cities and nations worldwide are committing to energy- and carbon-neutral objectives that imply a huge contribution from buildings. High-performance targets, either zero energy or zero carbon, are typically difficult for single buildings to reach, but groups of properly managed buildings might reach these ambitious goals. For this purpose we need tools and experience to model, monitor, manage, and optimize buildings and their neighborhood-level systems. The paper describes the activities pursued for the deployment of an advanced energy management system for a multi-carrier energy grid of an existing neighborhood in the Milan area. The activities included: (i) development of a detailed monitoring plan, (ii) deployment of the monitoring plan, and (iii) development of a virtual model of the neighborhood and simulation of its energy performance. Comparisons against early-stage energy monitoring data proved promising, and the generation system showed high efficiency (EER equal to 5.84), to be further exploited.

Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+

Journal of Computer Science and Technology ◽

10.1007/s11390-020-0741-6 ◽

2021 ◽

Vol 36 (1) ◽

pp. 33-43

Author(s):

Jian-Bin Fang ◽

Xiang-Ke Liao ◽

Chun Huang ◽

De-Zun Dong

Keyword(s):

Performance Evaluation ◽

Many Core
