efficient memory
Recently Published Documents

TOTAL DOCUMENTS: 329 (five years: 67)
H-INDEX: 23 (five years: 3)
Author(s): Shashi Kant Ratnakar, Subhajit Sanfui, Deepak Sharma

Abstract: Topology optimization has been successful in generating optimal topologies for various structures arising in real-world applications. Since these applications can have complex and large domains, topology optimization incurs a high computational cost due to the unstructured meshes used to discretize these domains and their finite element analysis (FEA). This paper addresses this challenge by developing three GPU-based element-by-element strategies targeting unstructured all-hexahedral meshes for the matrix-free preconditioned conjugate gradient (PCG) finite element solver. These strategies perform the sparse matrix-vector multiplication (SpMV) arising in the FEA solver by allocating multiple GPU compute threads per element, and they use the GPU's shared memory for efficient memory transactions. The proposed strategies are tested with the solid isotropic material with penalization (SIMP) method on four examples of 3D structural topology optimization. Results demonstrate speedups of up to 8.2× over standard GPU-based SpMV strategies from the literature.
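The core of such a matrix-free solver is applying the stiffness operator element by element, so the global sparse matrix is never assembled. A minimal single-threaded sketch of that kernel (names like `ebe_matvec`, `elems`, and `ke` are illustrative, not from the paper; on a GPU the per-element loop becomes parallel threads and the scatter becomes atomic adds):

```python
import numpy as np

def ebe_matvec(elems, ke, x):
    """y = K @ x computed element by element, without assembling K.

    elems : (n_el, nodes_per_el) int array of global node ids per element
    ke    : (nodes_per_el, nodes_per_el) element stiffness (1 dof per node here)
    x     : (n_nodes,) input vector
    """
    y = np.zeros_like(x)
    for conn in elems:               # on a GPU: one or more threads per element
        y_local = ke @ x[conn]       # gather nodal values, local multiply
        np.add.at(y, conn, y_local)  # scatter-add (atomic adds on a GPU)
    return y

# Tiny 1D bar with two 2-node elements; dense K would be tridiagonal.
elems = np.array([[0, 1], [1, 2]])
ke = np.array([[1.0, -1.0], [-1.0, 1.0]])
x = np.array([1.0, 2.0, 3.0])
print(ebe_matvec(elems, ke, x))
```

Inside a PCG loop, this routine replaces every product with the assembled matrix, which is what makes the approach memory-light and amenable to per-element GPU parallelism.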


Author(s): Yunhee Woo, Dongyoung Kim, Jaemin Jeong, Jeong-Gun Lee

2021, Vol. 16 (1)
Author(s): Peng Fang, Fang Wang, Zhan Shi, Dan Feng, Qianxu Yi, et al.

2021, Vol. 17 (3), pp. 1-32
Author(s): Xingda Wei, Rong Chen, Haibo Chen, Binyu Zang

RDMA (Remote Direct Memory Access) has gained considerable interest for network-attached in-memory key-value stores. However, traversing the remote tree-based index of an ordered key-value store over RDMA becomes a critical obstacle, causing an order-of-magnitude slowdown and limited scalability due to multiple network round trips. Using an index cache with conventional wisdom (caching partial data and traversing it locally) usually has limited effect because of unavoidable capacity misses, massive random accesses, and costly cache invalidations. We argue that a machine learning (ML) model is a perfect cache structure for the tree-based index, termed a learned cache. Based on it, we design and implement XStore, an RDMA-based ordered key-value store with a new hybrid architecture that retains a tree-based index at the server to handle dynamic workloads (e.g., inserts) and leverages a learned cache at the client to handle static workloads (e.g., gets and scans). The key idea is to decouple ML model retraining from index updating by maintaining a layer of indirection from logical to actual positions of key-value pairs, which allows a stale learned cache to continue predicting a correct position for a lookup key. XStore ensures correctness using a validation mechanism with a fallback path and further uses speculative execution to minimize the cost of cache misses. Evaluations with YCSB benchmarks and production workloads show that a single XStore server can achieve over 80 million read-only requests per second, outperforming state-of-the-art RDMA-based ordered key-value stores (namely DrTM-Tree, Cell, and eRPC+Masstree) by up to 5.9× (from 3.7×). For workloads with inserts, XStore still provides up to 3.5× (from 2.7×) throughput speedup, achieving 53M reqs/s. The learned cache also reduces client-side memory usage and provides an efficient memory-performance tradeoff, e.g., saving 99% of memory at the cost of 20% of peak throughput.
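The learned-cache idea can be illustrated without RDMA: a model predicts a position with a bounded error, the lookup validates within that error window, and a fallback path (standing in for the server-side tree traversal) handles misses or stale predictions. This is a heavily simplified, single-machine sketch; the class name, the trivial linear model, and the error-window scheme are illustrative assumptions, not XStore's actual implementation:

```python
import bisect

class LearnedCache:
    """Toy learned cache over a sorted key-value layout."""

    def __init__(self, keys, values):
        self.entries = sorted(zip(keys, values))   # sorted logical layout
        ks = [k for k, _ in self.entries]
        n = len(ks)
        # "Train" a trivial linear model: position ~= slope*key + intercept.
        self.slope = (n - 1) / (ks[-1] - ks[0]) if ks[-1] != ks[0] else 0.0
        self.intercept = -self.slope * ks[0]
        # The maximum prediction error bounds the search window.
        self.err = max(abs(i - self._predict(k)) for i, k in enumerate(ks))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def get(self, key):
        pos = self._predict(key)
        lo = max(0, pos - self.err)
        hi = min(len(self.entries), pos + self.err + 1)
        # Validation: check the predicted window actually holds the key.
        for k, v in self.entries[lo:hi]:
            if k == key:
                return v
        # Fallback path: full binary search, standing in for asking the server.
        i = bisect.bisect_left(self.entries, (key,))
        if i < len(self.entries) and self.entries[i][0] == key:
            return self.entries[i][1]
        return None

cache = LearnedCache([1, 5, 9, 13], ["a", "b", "c", "d"])
print(cache.get(9))   # hit via the model's predicted window
print(cache.get(7))   # miss: falls back, returns None
```

The indirection layer in the paper plays the role the sorted array plays here: because the model predicts a logical position rather than a physical address, entries can move without immediately invalidating the model.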


2021, Vol. 16 (2), pp. 1-9
Author(s): Stephanie Soldavini, Christian Pilato

The never-ending demand for high performance and energy efficiency is pushing designers towards increasing levels of heterogeneity and specialization in modern computing systems. In such systems, creating efficient memory architectures is one of the major opportunities for optimizing modern workloads (e.g., computer vision, machine learning, and graph analytics) that are extremely data-driven. However, designers need proper design methods to tackle the increasing design complexity and to address new challenges, such as the security and privacy of the data to be processed. This paper overviews current trends in the design of domain-specific memory architectures. Domain-specific architectures are tailored to a given application domain, introducing hardware accelerators and custom memory modules while maintaining a certain level of flexibility. We describe the major components, common challenges, and state-of-the-art design methodologies for building domain-specific memory architectures. We also discuss the most relevant research projects, providing a classification based on our main topics.


Author(s): Liangchao Guo, Boyuan Mu, Ming-Zheng Li, Baidong Yang, Ruo-Si Chen, et al.

Author(s):  
B Ajay Kumar

DSP systems typically perform a large number of multiplications because they process many discrete signals. Combinational multiplier circuits consume considerable power, since they contain many intermediate blocks (usually full adders and AND gates); they also occupy more area and incur more delay, and there is usually a tradeoff between area and delay. To make the multiplier more efficient, memory-based multipliers are often preferred. Memory-based multipliers use techniques such as anti-symmetric product coding (APC) and odd multiple storage (OMS), in which the precomputed products are stored in a lookup table (LUT) according to the chosen technique. To optimize the required memory, we combine the APC and OMS techniques for better storage and retrieval of data. In this project, we show how the combined technique improves multiplier performance: it reduces the LUT size to one-fourth that of a standard LUT. It is demonstrated that the proposed LUT architecture for small input sizes can be used to perform high-precision multiplication efficiently with input operand decomposition.
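The roughly four-fold LUT reduction can be seen in a small behavioral model. For a 5-bit input X, APC codes it as u = X - 16 so that A*X = A*u + 16*A and A*(-u) = -(A*u), halving the table; OMS then stores only odd |u|, deriving even multiples by shifting, halving it again (8 stored words instead of 32). This is a behavioral sketch under those textbook definitions, not the paper's hardware architecture; the coefficient and function names are mine:

```python
A = 13   # fixed coefficient (illustrative)
L = 5    # input width in bits

# OMS: the LUT stores only A*m for odd m in 1..15 -> 8 entries instead of 32.
lut = {m: A * m for m in range(1, 2 ** (L - 1), 2)}

def apc_oms_multiply(x):
    """Compute A * x for 0 <= x < 2**L using the reduced LUT."""
    assert 0 <= x < 2 ** L
    base = A << (L - 1)          # the constant 16*A added back at the end (APC)
    u = x - 2 ** (L - 1)         # APC offset code, u in [-16, 15]
    if u == 0:
        return base
    sign = -1 if u < 0 else 1
    m, shift = abs(u), 0
    while m % 2 == 0:            # OMS: peel powers of two off the multiple
        m >>= 1
        shift += 1
    return sign * (lut[m] << shift) + base

print(apc_oms_multiply(31))      # equals A * 31
```

Note that |u| = 16 needs no extra entry: it reduces to lut[1] shifted left four times, which is why the stored odd multiples suffice for the whole input range.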


Algorithms, 2021, Vol. 14 (7), pp. 215
Author(s): Faraz Bhatti, Thomas Greiner

A plenoptic camera based system captures the light field, which can be exploited to estimate the 3D depth of the scene. This process generally involves a significant number of recurrent operations and thus requires high computational power. A general-purpose processor, due to its largely sequential architecture, therefore suffers from long execution times. A desktop graphics processing unit (GPU) can be employed to resolve this problem; however, it is an expensive solution with respect to power consumption and therefore cannot be used in mobile applications with low energy budgets. In this paper, we propose a modified plenoptic depth estimation algorithm that works on a single frame recorded by the camera, together with a corresponding FPGA-based hardware design. For this purpose, the algorithm is restructured for parallelization and pipelining. In combination with efficient memory access, the results show good performance and lower power consumption compared to other systems.

