Multicore Architectures: Latest Research Papers

Graph structures are a natural representation of important and pervasive data. While graph applications have significant parallelism, their characteristic pointer-indirect loads to neighbor data hinder scalability to large datasets on multicore systems. A scalable and efficient system must tolerate latency while leveraging data parallelism across millions of vertices. Modern Out-of-Order (OoO) cores inherently tolerate a fraction of long latencies, but become clogged when running severely memory-bound applications. Combined with large power/area footprints, this limits their parallel scaling potential and, consequently, the gains that existing software frameworks can achieve. Conversely, accelerator and memory hierarchy designs provide performant hardware specializations, but cannot support diverse application demands. To address these shortcomings, we present GraphAttack, a hardware-software data supply approach that accelerates graph applications on in-order multicore architectures. GraphAttack proposes compiler passes to (1) identify idiomatic long-latency loads and (2) slice programs along these loads into data Producer/Consumer threads to map onto pairs of parallel cores. Each pair shares a communication queue; the Producer asynchronously issues long-latency loads, whose results are buffered in the queue and used by the Consumer. This scheme drastically increases memory-level parallelism (MLP) to mitigate latency bottlenecks. In equal-area comparisons, GraphAttack outperforms OoO cores, do-all parallelism, prefetching, and prior decoupling approaches, achieving a 2.87× speedup and 8.61× gain in energy efficiency across a range of graph applications. These improvements scale; GraphAttack achieves a 3× speedup over 64 parallel cores. Lastly, it has pragmatic design principles; it enhances in-order architectures that are gaining increasing open-source support.
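
The Producer/Consumer slicing described in this abstract can be loosely illustrated in software. The Python sketch below is only an analogy, not GraphAttack's compiler passes or its hardware queue: the toy CSR graph, the names `producer`, `consumer`, and `neighbor_sums`, and the queue depth are all assumptions made for illustration. One thread performs the indirect neighbor loads and pushes the results into a bounded queue, while the other only consumes already-loaded values.

```python
import queue
import threading

# Hypothetical toy graph in CSR form (not taken from the paper).
row_ptr = [0, 2, 3, 5]       # vertex v owns neighbors[row_ptr[v]:row_ptr[v + 1]]
neighbors = [1, 2, 2, 0, 1]
values = [10, 20, 30]        # per-vertex data reached through indirect loads

DONE = object()              # sentinel marking the end of the stream


def producer(q):
    """Access slice: walks the edge list and performs the indirect loads."""
    for v in range(len(row_ptr) - 1):
        for e in range(row_ptr[v], row_ptr[v + 1]):
            q.put((v, values[neighbors[e]]))  # stands in for a long-latency load
    q.put(DONE)


def consumer(q, out):
    """Compute slice: uses loaded values and never dereferences neighbors."""
    while True:
        item = q.get()
        if item is DONE:
            break
        v, loaded = item
        out[v] += loaded


def neighbor_sums(queue_depth=4):
    """Sum the neighbor values of every vertex using one producer/consumer pair."""
    q = queue.Queue(maxsize=queue_depth)  # bounded, like a fixed-size queue
    out = [0] * (len(row_ptr) - 1)
    threads = [
        threading.Thread(target=producer, args=(q,)),
        threading.Thread(target=consumer, args=(q, out)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Calling `neighbor_sums()` on this toy graph returns `[50, 30, 30]`. Python threads do not expose memory-level parallelism, so the sketch shows only the division of work between the two slices, not the performance effect reported in the abstract.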

Effective On-Chip Communication for Message Passing Programs on Multi-Core Processors

Electronics ◽

10.3390/electronics10212681 ◽

2021 ◽

Vol 10 (21) ◽

pp. 2681

Author(s):

Joonmoo Huh ◽

Deokwoo Lee

Keyword(s):

Parallel Programming ◽

Shared Memory ◽

Message Passing ◽

Programming Model ◽

Multicore Architectures ◽

Worst Case ◽

High Performing ◽

Parallel Programming Model ◽

On Chip ◽

Sharing Patterns

Shared memory is the most popular parallel programming model for multi-core processors, while message passing is generally used for large distributed machines. However, as the number of cores on a chip increases, the relative merits of shared memory versus message passing change, and we argue that message passing becomes a viable, high-performing parallel programming model. To test this hypothesis, we compare a shared memory architecture with a new message passing architecture on a suite of applications tuned for each system independently. Perhaps surprisingly, the fundamental behaviors of the applications studied in this work, when optimized for both models, are very similar to each other, and both could execute efficiently on multicore architectures even though many of the implementations differ. Furthermore, if hardware is tuned to support message passing by supporting bulk message transfer and eliminating unnecessary coherence overheads, and if effective support is available for global operations, then some applications would perform much better on a message passing architecture. Leveraging our insights, we design a message passing architecture that supports both memory-to-memory and cache-to-cache messaging in hardware. With the new architecture, message passing is able to outperform its shared memory counterpart on many of the applications due to the unique advantages of the message passing hardware as compared to cache coherence. In the best case, message passing achieves up to a 34% increase in speed over its shared memory counterpart, and it achieves an average 10% increase in speed. In the worst case, message passing is slowed down in two applications, CG (conjugate gradient) and FT (Fourier transform), because it does not handle their unique data sharing patterns as well as its shared memory counterpart. Overall, our analysis demonstrates the importance of considering message passing as a high-performing and hardware-supported programming model on future multicore architectures.

Performance vs Programming Effort between Rust and C on Multicore Architectures: Case Study in N-Body

10.1109/clei53233.2021.9640225 ◽

2021 ◽

Author(s):

Manuel Costanzo ◽

Enzo Rucci ◽

Marcelo Naiouf ◽

Armando De Giusti

Keyword(s):

Multicore Architectures ◽

Programming Effort

Shared Memory Verification for Multicore Chip Designs

10.5753/ctd.2021.15760 ◽

2021 ◽

Author(s):

Marleson Graf ◽

Luiz C. V. dos Santos

Keyword(s):

Integrated Circuits ◽

Shared Memory ◽

Test Generation ◽

Computer Aided Design ◽

General Purpose ◽

Functional Verification ◽

Multicore Architectures ◽

Consistency Model ◽

Novel Approach ◽

On Chip

A multicore chip usually provides a shared memory abstraction implemented by a cache coherence protocol. On-chip coherence can scale gracefully as the number of cores grows, and it plays a major role for general purpose applications. In addition, multicore architectures are likely to relax constraints on store atomicity and on the ordering between loads and stores. As a result, the validation of shared memory faces two main challenges: the higher number of valid execution behaviors and the larger state space of the coherence protocol. This dissertation addresses those challenges and targets an important design automation phase: the (pre-silicon) functional verification of the shared memory subsystem of a multicore chip, whose behavior is specified by a memory consistency model (MCM). The main scientific contribution is a novel approach to the building of MCM checkers, along with technical contributions on random test generation and directed test generation. The contributions were reported in two papers in a premier IEEE/ACM conference and two articles in the most prestigious IEEE journal on Computer Aided Design of Integrated Circuits and Systems.

Performance improvement and analysis of snoopy cache coherence based multicore architectures

International Journal of Systems Assurance Engineering and Management ◽

10.1007/s13198-021-01177-w ◽

2021 ◽

Author(s):

Amit D. Joshi ◽

N. Ramasubramanian

Keyword(s):

Performance Improvement ◽

Cache Coherence ◽

Multicore Architectures

High-performance parallel implementations of flow accumulation algorithms for multicore architectures

Computers & Geosciences ◽

10.1016/j.cageo.2021.104741 ◽

2021 ◽

Vol 151 ◽

pp. 104741

Author(s):

Bartłomiej Kotyra ◽

Łukasz Chabudziński ◽

Przemysław Stpiczyński

Keyword(s):

High Performance ◽

Multicore Architectures ◽

Flow Accumulation ◽

Parallel Implementations

Multithreading Based Parallel Processing for Image Geometric Coregistration in SAR Interferometry

Remote Sensing ◽

10.3390/rs13101963 ◽

2021 ◽

Vol 13 (10) ◽

pp. 1963

Author(s):

Pasquale Imperatore ◽

Eugenio Sansosti

Keyword(s):

Parallel Algorithm ◽

Shared Memory ◽

Multicore Processors ◽

Sar Interferometry ◽

Sar Image ◽

Multicore Architectures ◽

Image Coregistration ◽

Functional Scheme ◽

Sar Data ◽

Computationally Intensive

Within the framework of multi-temporal Synthetic Aperture Radar (SAR) interferometric processing, image coregistration is a fundamental operation that might be extremely time-consuming. This paper explores the possibility of addressing fast and accurate SAR image geometric coregistration, with sub-pixel accuracy and in the presence of a complex 3-D object scene, by exploiting the parallelism offered by shared-memory architectures. An efficient and scalable processor is proposed by designing a parallel algorithm incorporating thread-level parallelism for solving the inherent computationally intensive problem. The adopted functional scheme is first mathematically framed and then investigated in detail in terms of its computational structures. Subsequently, a parallel version of the algorithm is designed, according to a fork-join model, by suitably taking into account the granularity of the decomposition, load-balancing, and different scheduling strategies. The developed parallel algorithm implements parallelism at the thread-level by using OpenMP (Open Multi-Processing) and it is specifically targeted at shared-memory multiprocessors. The parallel performance of the implemented multithreading-based SAR image coregistration prototype processor is experimentally investigated and quantitatively assessed by processing high-resolution X-band COSMO-SkyMed SAR data and using two different multicore architectures. The effectiveness of the developed multithreaded prototype solution in fully benefitting from the computing power offered by multicore processors has successfully been demonstrated via a suitable experimental performance analysis conducted in terms of parallel speedup and efficiency. The demonstrated scalable performance and portability of the developed parallel processor confirm its potential for operational use in the interferometric SAR data processing at large scales.
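
The fork-join decomposition mentioned in this abstract can be sketched as follows. The paper's implementation uses OpenMP; this Python version is only a structural analogy. The per-row work (`resample_row`, a linear interpolation at a sub-pixel shift), the function names, and the `chunk_rows` granularity parameter are all hypothetical stand-ins and are not taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def resample_row(row, shift):
    """Placeholder per-row work: linear interpolation at a sub-pixel shift."""
    out = []
    last = len(row) - 1
    for x in range(len(row)):
        pos = min(max(x + shift, 0.0), float(last))  # clamp to the row extent
        i = min(int(pos), last - 1) if last > 0 else 0
        frac = pos - i
        nxt = row[min(i + 1, last)]
        out.append((1.0 - frac) * row[i] + frac * nxt)
    return out


def process_chunk(rows, shift):
    """Work done by one task: resample a contiguous block of rows."""
    return [resample_row(r, shift) for r in rows]


def coregister(image, shift, workers=4, chunk_rows=2):
    """Fork-join over row blocks; chunk_rows sets the task granularity."""
    chunks = [image[i:i + chunk_rows] for i in range(0, len(image), chunk_rows)]
    with ThreadPoolExecutor(max_workers=workers) as pool:       # fork
        results = pool.map(process_chunk, chunks, [shift] * len(chunks))
        out = [row for chunk in results for row in chunk]       # join, order kept
    return out
```

Smaller `chunk_rows` values create more, finer tasks, which helps load balancing at the cost of scheduling overhead; larger values do the opposite. In CPython the global interpreter lock prevents pure-Python threads from speeding up this kind of computation, so the sketch shows the decomposition only, not the speedups reported in the abstract.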

Modeling and Analysis of Automotive Cyber-physical Systems

10.21941/kcss/2021/2 ◽

2021 ◽

Author(s):

Max Jonas Friese

Keyword(s):

Cyber Physical Systems ◽

Multicore Architectures ◽

Industrial Practice ◽

Modeling And Analysis ◽

Physical Systems ◽

Measurement Analysis ◽

Wide Range ◽

Latency Analysis ◽

End To End ◽

Time Systems

Based on advances in scheduling analysis in the 1970s, a whole area of research has evolved: formal end-to-end latency analysis in real-time systems. Although multiple approaches from the scientific community have successfully been applied in industrial practice, a gap is emerging between the means provided by formally backed approaches and the needs of the automotive industry, where cyber-physical systems have taken over from classic embedded systems. They are accompanied by a shift to heterogeneous platforms built upon multicore architectures. Scientific techniques are often still based on overly simple system models, and estimates of important end-to-end latencies have only been tightened recently. To this end, we present an expressive system model and formally describe the problem of end-to-end latency analysis in modern automotive cyber-physical systems. Based on this, we examine approaches to formally estimate tight end-to-end latencies in Chapter 4 and Chapter 5. The developed approaches cover a wide range of relevant systems. We show that our approach for the estimation of latencies of task chains dominates existing approaches in terms of tightness of the results. In the last chapter, we make a brief digression to measurement analysis, since measurement and simulation are an important part of verification in current industrial practice.

Accelerated Analysis of Simulation Dumps through Parallelization on Multicore Architectures

2021 24th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS) ◽

10.1109/ddecs52668.2021.9417048 ◽

2021 ◽

Author(s):

D. Appello ◽

P. Bernardi ◽

A. Calabrese ◽

S. Littardi ◽

G. Pollaccia ◽

...

Keyword(s):

Multicore Architectures
