Access Patterns: Recently Published Documents

Total documents: 489 (last five years: 112) · H-index: 25 (last five years: 2)

2022 · Vol. 15 (1) · pp. 1-30
Author(s): Johannes Menzel, Christian Plessl, Tobias Kenter

N-body methods are one of the essential algorithmic building blocks of high-performance and parallel computing. Previous research has shown promising performance for implementing n-body simulations with pairwise force calculations on FPGAs. However, to avoid challenges with accumulation and memory access patterns, the presented designs calculate each pair of forces twice, once for each force sum of the two particles involved. They also require large problem instances with hundreds of thousands of particles to reach their respective peak performance, which limits their applicability in strong scaling scenarios. This work addresses both issues with a novel FPGA design that uses each calculated force twice and overlaps data transfers with computation so that peak performance is reached even for small problem instances. The design outperforms previous single-precision results even in double precision and scales linearly over multiple interconnected FPGAs. For a comparison across architectures, we provide an equally optimized CPU reference, which for large problems achieves higher peak performance per device. However, given the strong scaling advantages of the FPGA design, in parallel setups with a few thousand particles per device the FPGA platform achieves the highest performance and power efficiency.
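The saving from "using each calculated force twice" follows from Newton's third law: the force particle j exerts on particle i is the negative of the force i exerts on j, so each pair needs only one evaluation. Below is a minimal CPU-side C++ sketch of that symmetric accumulation; it illustrates the idea only and is not the authors' FPGA pipeline. The gravitational constant and softening term eps are illustrative assumptions.

```cpp
#include <cmath>
#include <vector>

struct Particle { double x, y, z, mass; };
struct Force    { double x = 0, y = 0, z = 0; };

// Pairwise gravitational interactions: each computed force is accumulated
// into BOTH particles (Newton's third law), so every pair is evaluated once.
// `f` must have the same size as `p` and start zero-initialized.
void accumulate_forces(const std::vector<Particle>& p, std::vector<Force>& f,
                       double G = 6.674e-11, double eps = 1e-9) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[j].x - p[i].x;
            const double dy = p[j].y - p[i].y;
            const double dz = p[j].z - p[i].z;
            const double r2 = dx * dx + dy * dy + dz * dz + eps; // softened distance
            const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            const double s = G * p[i].mass * p[j].mass * inv_r3;
            f[i].x += s * dx; f[i].y += s * dy; f[i].z += s * dz; // force on i
            f[j].x -= s * dx; f[j].y -= s * dy; f[j].z -= s * dz; // equal and opposite on j
        }
    }
}
```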


2021 · Vol. 18 (4) · pp. 1-24
Author(s): Sriseshan Srikanth, Anirudh Jain, Thomas M. Conte, Erik P. Debenedictis, Jeanine Cook

Sparse data applications have irregular access patterns that stymie modern memory architectures. Although hyper-sparse workloads have received considerable attention in the past, the moderately sparse workloads prevalent in machine learning, graph processing, and HPC have not. Whereas the former can bypass the cache hierarchy entirely, the latter fit in the cache. This article makes the observation that intelligent, near-processor cache management can improve bandwidth utilization for data-irregular accesses, thereby accelerating moderately sparse workloads. We propose SortCache, a processor-centric approach that accelerates sparse workloads by introducing accelerators which leverage the on-chip cache subsystem with minimal programmer intervention.
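The data-irregular accesses in question are typified by indexed gathers such as sparse matrix-vector multiplication, where the column indices of the nonzeros scatter reads across the dense vector. The C++ sketch below shows a plain CSR SpMV purely to illustrate that access pattern; it is not part of the SortCache design, and the structure and function names are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Compressed Sparse Row matrix: nonzero values, their column indices,
// and per-row offsets into those arrays.
struct CsrMatrix {
    std::vector<double> values;
    std::vector<int>    col_idx;
    std::vector<int>    row_ptr;   // size = rows + 1
};

// y = A * x. The read x[col_idx[k]] jumps around memory with the sparsity
// pattern; this index-driven access is the irregularity the abstract targets.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.row_ptr.size() - 1, 0.0);
    for (std::size_t row = 0; row + 1 < A.row_ptr.size(); ++row) {
        double sum = 0.0;
        for (int k = A.row_ptr[row]; k < A.row_ptr[row + 1]; ++k) {
            sum += A.values[k] * x[A.col_idx[k]];  // irregular gather from x
        }
        y[row] = sum;
    }
    return y;
}
```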


Author(s): Geoffrey Messier, Leslie Tutty, Caleb John

This paper explores how best to identify clients for housing services based on their homeless shelter access patterns. We use the number of shelter stays and the number of episodes of shelter use for a client within a specified time window, and apply thresholds to these values to determine whether that individual is a good candidate for housing support. Using new housing referral impact metrics, we explore a range of threshold and time window values to determine which combination both maximizes impact and identifies good candidates for housing as soon as possible. New insights are also provided into the characteristics of the "under-the-radar" client group, who are typically not identified for housing support.
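As a rough illustration of the screening rule described above, the sketch below counts stays and episodes inside a sliding window and flags a client once both exceed thresholds. The window length, thresholds, and the episode-gap definition are assumptions made for illustration, not the paper's calibrated values.

```cpp
#include <algorithm>
#include <vector>

// Returns true if, within some window of `window_days`, the client has at
// least `min_stays` shelter stays spread over at least `min_episodes`
// episodes (a new episode starts after a gap longer than `episode_gap_days`).
// All default parameter values below are assumed, not taken from the paper.
bool flag_for_housing(std::vector<int> stay_days,        // day numbers of stays
                      int window_days      = 90,
                      int min_stays        = 20,
                      int min_episodes     = 3,
                      int episode_gap_days = 30) {
    std::sort(stay_days.begin(), stay_days.end());
    for (std::size_t start = 0; start < stay_days.size(); ++start) {
        int stays = 0, episodes = 0, last_day = -1;
        for (std::size_t i = start; i < stay_days.size(); ++i) {
            if (stay_days[i] - stay_days[start] >= window_days) break;
            ++stays;
            if (last_day < 0 || stay_days[i] - last_day > episode_gap_days)
                ++episodes;                      // gap closes an episode
            last_day = stay_days[i];
        }
        if (stays >= min_stays && episodes >= min_episodes) return true;
    }
    return false;
}
```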


2021 · Vol. 11 (21) · pp. 10377
Author(s): Hyeonseong Choi, Jaehwan Lee

To achieve high accuracy in deep learning, it is often necessary to use a large-scale training model. However, due to the limits of GPU memory, it is difficult to train such models within a single GPU. NVIDIA introduced CUDA Unified Memory with CUDA 6 to overcome this limit by virtually combining GPU memory and CPU memory, and CUDA 8 added memory advise options to use Unified Memory more efficiently. In this work, we propose a newly optimized scheme based on CUDA Unified Memory that uses GPU memory efficiently by applying different memory advice to each data type according to its access pattern in deep learning training. We apply CUDA Unified Memory to PyTorch to evaluate large-scale training models with the expanded GPU memory, and conduct comprehensive experiments on how to utilize Unified Memory efficiently by applying memory advice. As a result, when the data used for deep learning are divided into three types and memory advice is applied to each according to its access pattern, the deep learning execution time is reduced by 9.4% compared to default Unified Memory.
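For reference, the Unified Memory and advise APIs mentioned above look roughly as follows. This is a minimal C++/CUDA-runtime sketch of attaching different advice to different allocations; the particular advice chosen per buffer is an assumed policy for illustration, not the paper's exact scheme.

```cpp
#include <cuda_runtime.h>

// Allocate two buffers in Unified Memory and attach different advice to each,
// depending on how the training loop is expected to access them.
void setup_unified_buffers(float** weights, float** activations,
                           size_t w_bytes, size_t a_bytes, int gpu_id) {
    cudaMallocManaged(weights, w_bytes);       // migratable CPU/GPU allocation
    cudaMallocManaged(activations, a_bytes);

    // Weights: read by the GPU on every step, rarely touched from the host,
    // so prefer keeping them resident on the GPU (assumed policy).
    cudaMemAdvise(*weights, w_bytes, cudaMemAdviseSetPreferredLocation, gpu_id);
    cudaMemAdvise(*weights, w_bytes, cudaMemAdviseSetAccessedBy, gpu_id);

    // Activations: mostly read after being produced, so mark them read-mostly
    // to allow cheap read-only copies (assumed policy).
    cudaMemAdvise(*activations, a_bytes, cudaMemAdviseSetReadMostly, gpu_id);
}
```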


Entropy · 2021 · Vol. 23 (11) · pp. 1459
Author(s): Behrouz Zolfaghari, Vikrant Singh, Brijesh Kumar Rai, Khodakhast Bibak, Takeshi Koshiba

The idea behind network caching is to reduce network traffic during peak hours by transmitting frequently requested content items to end users during off-peak hours. However, due to limited cache sizes and unpredictable access patterns, this might not fully eliminate the need for data transmission during peak hours. Coded caching was introduced to further reduce peak-hour traffic. It is based on sending coded content that can be decoded in different ways by different users, which allows the server to serve multiple requests by transmitting a single content item. Research on coded caching traditionally adopts a simple network topology consisting of a single server, a single hub, a shared link connecting the server to the hub, and private links connecting the users to the hub. Building on the results of Sengupta et al. (IEEE Trans. Inf. Forensics Secur., 2015), we propose and evaluate a more complex system model that takes both throughput and security into consideration by combining these ideas. We demonstrate that the achievable rates in the proposed model are within a constant multiplicative and additive gap of the minimum secure rates.
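The "single transmission serves multiple requests" idea can be seen in the textbook two-user example of (non-secure) coded caching: each file is split in half, each user caches a different half of every file, and one XOR serves both requests. The C++ sketch below demonstrates only that basic mechanism, not the secure scheme analyzed in this paper.

```cpp
#include <cassert>
#include <string>

// Two files A and B, each split into halves. User 1 caches {A1, B1}, user 2
// caches {A2, B2}. User 1 requests A, user 2 requests B. The server sends the
// single coded message A2 XOR B1; each user decodes its missing half locally.
std::string xor_bytes(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    std::string out(a.size(), '\0');
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = static_cast<char>(a[i] ^ b[i]);
    return out;
}

int main() {
    const std::string A1 = "AAAA", A2 = "aaaa", B1 = "BBBB", B2 = "bbbb";

    const std::string broadcast = xor_bytes(A2, B1);       // one transmission

    const std::string user1_A2 = xor_bytes(broadcast, B1); // user 1 uses cached B1
    const std::string user2_B1 = xor_bytes(broadcast, A2); // user 2 uses cached A2

    assert(user1_A2 == A2 && user2_B1 == B1);               // both requests served
    return 0;
}
```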


2021 · pp. 69-82
Author(s): Abraham Castillo-García, Lisbeth Rodríguez-Mazahua, Felipe Castro-Medina, Beatriz A. Olivares-Zepahua, María A. Abud-Figueroa

2021 · Vol. 11 (20) · pp. 9495
Author(s): Tadeusz Tomczak

The performance of lattice-Boltzmann solver implementations usually depends mainly on memory access patterns. Achieving high performance therefore requires complex code that handles careful data placement and ordering of memory transactions. In this work, we analyse the performance of an implementation based on a new approach called the data-oriented language, which allows complex memory access patterns to be combined with simple source code. As a use case, we present and provide the source code of a solver for the D2Q9 lattice and show its performance on a GTX Titan Xp GPU for dense and sparse geometries of up to 4096² nodes. The obtained results are promising: around 1000 lines of code allowed us to achieve performance in the range of 0.6 to 0.7 of the maximum theoretical memory bandwidth (over 2.5 and 5.0 GLUPS for double and single precision, respectively) for meshes larger than 1024² nodes, which is close to the current state of the art. However, we also observed relatively high and sometimes hard-to-predict overheads, especially for sparse data structures. An additional issue was the rather long compilation time, which extended the duration of short simulations, and the lack of access to low-level optimisation mechanisms.
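The memory access pattern in question comes largely from the streaming step of the D2Q9 lattice, where each of the nine distribution values is pulled from a different neighbour. The C++ sketch below shows pull-style streaming with periodic wrap-around (collision omitted); it only illustrates the access pattern and is not the paper's data-oriented implementation, and the array-of-structure layout used here is an assumption.

```cpp
#include <array>
#include <vector>

// D2Q9 lattice velocities: rest, four axis-aligned, four diagonal directions.
constexpr std::array<std::array<int, 2>, 9> C = {{
    {0, 0}, {1, 0}, {0, 1}, {-1, 0}, {0, -1},
    {1, 1}, {-1, 1}, {-1, -1}, {1, -1}
}};

// Pull-style streaming: each node reads distribution q from the neighbour the
// velocity C[q] points away from. The nine strided reads per node dominate
// the memory traffic of the solver.
void stream(const std::vector<double>& f_src, std::vector<double>& f_dst,
            int nx, int ny) {
    for (int y = 0; y < ny; ++y) {
        for (int x = 0; x < nx; ++x) {
            for (int q = 0; q < 9; ++q) {
                const int xs = (x - C[q][0] + nx) % nx;   // periodic wrap
                const int ys = (y - C[q][1] + ny) % ny;
                f_dst[(std::size_t(y) * nx + x) * 9 + q] =
                    f_src[(std::size_t(ys) * nx + xs) * 9 + q];
            }
        }
    }
}
```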


2021 · Vol. 12 (1)
Author(s): Giacomo Pedretti, Catherine E. Graves, Sergey Serebryakov, Ruibin Mao, Xia Sheng, et al.

Tree-based machine learning techniques, such as Decision Trees and Random Forests, are top performers in several domains, as they do well with limited training datasets and offer improved interpretability compared to Deep Neural Networks (DNN). However, these models are difficult to optimize for fast inference at scale without accuracy loss in von Neumann architectures due to non-uniform memory access patterns. Recently, we proposed a novel analog content addressable memory (CAM) based on emerging memristor devices for fast look-up table operations. Here, we propose for the first time to use the analog CAM as an in-memory computational primitive to accelerate tree-based model inference. We demonstrate an efficient mapping algorithm leveraging the new analog CAM capabilities such that each root-to-leaf path of a Decision Tree is programmed into a row. This new in-memory compute concept enables few-cycle model inference, dramatically increasing throughput by 10³× over conventional approaches.
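The mapping described above flattens each root-to-leaf path into a conjunction of per-feature intervals, which is the kind of condition an analog CAM row can match in a single parallel step. The C++ sketch below emulates that representation in software; the interval structure, function names, and class labels are illustrative assumptions, not the authors' hardware encoding.

```cpp
#include <vector>

// One CAM "row": an acceptable [low, high) interval per feature plus the
// class label stored at the corresponding leaf of the decision tree.
struct PathRow {
    std::vector<double> low, high;  // one interval per feature
    int label;
};

// Emulated CAM lookup: every row checks the input against its intervals. In
// the analog CAM, all rows would perform this comparison in parallel.
int classify(const std::vector<PathRow>& rows, const std::vector<double>& x) {
    for (const PathRow& row : rows) {
        bool match = true;
        for (std::size_t f = 0; f < x.size() && match; ++f)
            match = (x[f] >= row.low[f]) && (x[f] < row.high[f]);
        if (match) return row.label;  // exactly one path matches a valid input
    }
    return -1;  // no match (should not occur for a complete tree)
}
```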


2021 · Vol. 11 (10) · pp. 2639-2645
Author(s): T. Sivaprakasam, M. Ramasamy

In FFT algorithms, memory access patterns prevent many architectures from achieving high machine utilization, particularly when parallel processing is needed to reach the desired efficiency. Starting from a high-performance FFT core, the on-chip memory hierarchy for the multi-core FFT processor is co-designed and connected on chip. We show that the proposed Floating Processing Factor (FPPE) achieves a higher operating rate and lower power for health informatics applications. A built-in test mechanism supports autonomous detection and omission of faulty cores, making graceful degradation of the multi-core architecture feasible. Experimental results illustrate that the proposed design scales well in terms of performance and hardware overhead, making it suitable for many-core systems with more than a thousand processing cores at low power and high speed.
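The problematic access pattern referred to here is the butterfly stride of the FFT itself: an iterative radix-2 transform pairs elements whose distance doubles every stage. The C++ sketch below is a plain in-place radix-2 Cooley-Tukey FFT, included only to make that strided pattern concrete; it has no connection to the paper's multi-core hardware design.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// In-place iterative radix-2 Cooley-Tukey FFT (n must be a power of two).
// Each stage pairs elements len/2 apart, so the memory stride changes every
// stage -- the access pattern the abstract says limits utilization.
void fft(std::vector<std::complex<double>>& a) {
    const std::size_t n = a.size();
    const double PI = std::acos(-1.0);

    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }

    // Butterfly stages: pair distance len/2 doubles each pass (1, 2, ..., n/2).
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double angle = -2.0 * PI / static_cast<double>(len);
        const std::complex<double> wlen(std::cos(angle), std::sin(angle));
        for (std::size_t i = 0; i < n; i += len) {
            std::complex<double> w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const std::complex<double> u = a[i + k];
                const std::complex<double> v = a[i + k + len / 2] * w;
                a[i + k]           = u + v;
                a[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}
```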

