memory hierarchies
Recently Published Documents


TOTAL DOCUMENTS

158
(FIVE YEARS 13)

H-INDEX

21
(FIVE YEARS 2)

Author(s):  
Seher Acer ◽  
Ariful Azad ◽  
Erik G Boman ◽  
Aydın Buluç ◽  
Karen D. Devine ◽  
...  

Combinatorial algorithms in general and graph algorithms in particular play a critical enabling role in numerous scientific applications. However, the irregular memory access nature of these algorithms makes them one of the hardest algorithmic kernels to implement on parallel systems. With tens of billions of hardware threads and deep memory hierarchies, the exascale computing systems in particular pose extreme challenges in scaling graph algorithms. The codesign center on combinatorial algorithms, ExaGraph, was established to design and develop methods and techniques for efficient implementation of key combinatorial (graph) algorithms chosen from a diverse set of exascale applications. Algebraic and combinatorial methods have a complementary role in the advancement of computational science and engineering, including playing an enabling role on each other. In this paper, we survey the algorithmic and software development activities performed under the auspices of ExaGraph from both a combinatorial and an algebraic perspective. In particular, we detail our recent efforts in porting the algorithms to manycore accelerator (GPU) architectures. We also provide a brief survey of the applications that have benefited from the scalable implementations of different combinatorial algorithms to enable scientific discovery at scale. We believe that several applications will benefit from the algorithmic and software tools developed by the ExaGraph team.


2021 ◽  
Vol 18 (1) ◽  
pp. 1-27
Author(s):  
Sooraj Puthoor ◽  
Mikko H. Lipasti

Sequential consistency (SC) is the most intuitive memory consistency model and the easiest for programmers and hardware designers to reason about. However, the strict memory ordering restrictions imposed by SC make it less attractive from a performance standpoint. Additionally, prior high-performance SC implementations required complex hardware structures to support speculation and recovery. In this article, we introduce the lockstep SC consistency model (LSC), a new memory model based on SC but carefully defined to accommodate the data parallel lockstep execution paradigm of GPUs. We also describe an efficient LSC implementation for an APU system-on-chip (SoC) and show that our implementation performs close to the baseline relaxed model. Evaluation of our implementation shows that the geometric mean performance cost for lockstep SC is just 0.76% for GPU execution and 6.11% for the entire APU SoC compared to a baseline with a weaker memory consistency model. Adoption of LSC in future APU and SoC designs will reduce the burden on programmers trying to write correct parallel programs, while also simplifying the implementation and verification of systems with heterogeneous processing elements and complex memory hierarchies. 1


Electronics ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. 2061
Author(s):  
Seung-Ho Lim ◽  
Hyunchul Seok ◽  
Ki-Woong Park

The key challenges of manycore systems are the large amount of memory and high bandwidth required to run many applications. Three-dimesnional integrated on-chip memory is a promising candidate for addressing these challenges. The advent of on-chip memory has provided new opportunities to rethink traditional memory hierarchies and their management. In this study, we propose a polymorphic memory as a hybrid approach when using on-chip memory. In contrast to previous studies, we use the on-chip memory as both a main memory (called M1 memory) and a Dynamic Random Access Memory (DRAM) cache (called M2 cache). The main memory consists of M1 memory and a conventional DRAM memory called M2 memory. To achieve high performance when running many applications on this memory architecture, we propose management techniques for the main memory with M1 and M2 memories and for polymorphic memory with dynamic memory allocations for many applications in a manycore system. The first technique is to move frequently accessed pages to M1 memory via hardware monitoring in a memory controller. The second is M1 memory partitioning to mitigate contention problems among many processes. Finally, we propose a method to use M2 cache between a conventional last-level cache and M2 memory, and we determine the best cache size for improving the performance with polymorphic memory. The proposed schemes are evaluated with the SPEC CPU2006 benchmark, and the experimental results show that the proposed approaches can improve the performance under various workloads of the benchmark. The performance evaluation confirms that the average performance improvement of polymorphic memory is 21.7%, with 0.026 standard deviation for the normalized results, compared to the previous method of using on-chip memory as a last-level cache.


2020 ◽  
Vol 20 (3) ◽  
pp. 223-230
Author(s):  
Jan Mühlig ◽  
Michael Müller ◽  
Olaf Spinczyk ◽  
Jens Teubner

Abstract Emerging hardware platforms are characterized by large degrees of parallelism, complex memory hierarchies, and increasing hardware heterogeneity. Their theoretical peak data processing performance can only be unleashed if the different pieces of systems software collaborate much more closely and if their traditional dependencies and interfaces are redesigned. We have developed the key concepts and a prototype implementation of a novel system software stack named mxkernel. For MxKernel, efficient large scale data processing capabilities are a primary design goal. To achieve this, heterogeneity and parallelism become first-class citizens and deep memory hierarchies are considered from the very beginning. Instead of a classical “thread” model, mxkernel provides a simpler control flow abstraction: mxtasks model closed units of work, for which mxkernel will guarantee the required execution semantics, such exclusive access to a specific object in memory. They can be a very elegant abstraction also for heterogeneity and resource sharing. Furthermore, mxtasks are annotated with metadata, such as code variants (to support heterogeneity), memory access behavior (to improve cache efficiency and support memory hierarchies), or dependencies between mxtasks (to improve scheduling and avoid synchronization cost). With precisely the required metadata available, mxkernel can provide a lightweight, yet highly efficient form of resource management, even across applications, operating system, and database. Based on the mxkernel prototype we present preliminary results from this ambitious undertaking. We argue that threads are an ill-suited control flow abstraction for our modern computer architectures and that a task-based execution model is to be favored.


Author(s):  
Mohammed Al Farhan ◽  
Ahmad Abdelfattah ◽  
Stanimire Tomov ◽  
Mark Gates ◽  
Dalal Sukkari ◽  
...  

With the acquisition and widespread use of more resources that rely on accelerator/wide vector–based computing, there has been a strong demand for science and engineering applications to take advantage of these latest assets. This, however, has been extremely challenging due to the diversity of systems to support their extreme concurrency, complex memory hierarchies, costly data movement, and heterogeneous node architectures. To address these challenges, we design a programming model and describe its ease of use in the development of a new MAGMA Templates library that delivers high-performance scalable linear algebra portable on current and emerging architectures. MAGMA Templates derives its performance and portability by (1) building on existing state-of-the-art linear algebra libraries, like MAGMA, SLATE, Trilinos, and vendor-optimized math libraries, and (2) providing access (seamlessly to the users) to the latest algorithms and architecture-specific optimizations through a single, easy-to-use C++-based API.


2020 ◽  
pp. 1-1 ◽  
Author(s):  
Nicola Capodieci ◽  
Paolo Burgio ◽  
Roberto Cavicchioli ◽  
Ignacio Sanudo Olmedo ◽  
Marco Solieri ◽  
...  

Algorithms ◽  
2019 ◽  
Vol 12 (11) ◽  
pp. 229
Author(s):  
Mattia D’Emidio ◽  
Daniele Frigioni

The purpose of this special issue of Algorithms was to attract papers presenting original research in the area of algorithm engineering. In particular, submissions concerning the design, analysis, implementation, tuning, and experimental evaluation of discrete algorithms and data structures, and/or addressing methodological issues and standards in algorithmic experimentation were encouraged. Papers dealing with advanced models of computing, including memory hierarchies, cloud architectures, and parallel processing were also welcome. In this regard, we solicited contributions from all most prominent areas of applied algorithmic research, which include but are not limited to graphs, databases, computational geometry, big data, networking, combinatorial aspects of scientific computing, and computational problems in the natural sciences or engineering.


Mathematics ◽  
2019 ◽  
Vol 7 (5) ◽  
pp. 441 ◽  
Author(s):  
Mohammadali Asadi ◽  
Alexander Brandt ◽  
Robert H. C. Moir ◽  
Marc Moreno Maza

We provide a comprehensive presentation of algorithms, data structures, and implementation techniques for high-performance sparse multivariate polynomial arithmetic over the integers and rational numbers as implemented in the freely available Basic Polynomial Algebra Subprograms (BPAS) library. We report on an algorithm for sparse pseudo-division, based on the algorithms for division with remainder, multiplication, and addition, which are also examined herein. The pseudo-division and division with remainder operations are extended to multi-divisor pseudo-division and normal form algorithms, respectively, where the divisor set is assumed to form a triangular set. Our operations make use of two data structures for sparse distributed polynomials and sparse recursively viewed polynomials, with a keen focus on locality and memory usage for optimized performance on modern memory hierarchies. Experimentation shows that these new implementations compare favorably against competing implementations, performing between a factor of 3 better (for multiplication over the integers) to more than 4 orders of magnitude better (for pseudo-division with respect to a triangular set).


2019 ◽  
Vol 12 (2) ◽  
pp. 101-109
Author(s):  
Mohamed Salah Souahi ◽  
Mohamed Ben Mohammed

Background: The last three decades were marked by a spectacular evolution of CPUs. Both cores number on chip and shared Low Level Cache (LLC) size are increasing what makes LLC the bottleneck's system. One major weakness of future cache memory hierarchies will be to carry out memory blocks availability for vertical requests, with no consideration to horizontal proximity to cores. Simulations show that some LLC accesses cost more latency cycles than off-chip accesses. Objective: This paper presents a new adaptive and blocks behavior aware process, called NUCA-2A. It manages blocks in LLC in a purpose of reducing it's latency, and it's inner bandwidth, by studying each block's behavior, and by placing it in the most suitable location among LLC banks. Methods: LLC accesses are classified basing on each one's specific behavior. Authors establish also a two levels horizontal hierarchy in LLC. This work consists to place blocks in the zones that matches the best their behaviors. Results: In contrast to the classic S-NUCA scheme, NUCA-2A makes a reduction of up to 60,39% of global LLC latency as well as 40,74% of average inner traffic. It makes also an average speedup of 17,89 % in term of number of instructions executed by cycle. Conclusion: Behaviors study gives encouraging results. Several methods are in use in different fields to forecast a behavior basing on previous observations. We are working on a prefetching model that permits blocks migration to and from privileged banks.


Sign in / Sign up

Export Citation Format

Share Document