memory system
Recently Published Documents


TOTAL DOCUMENTS

1482
(FIVE YEARS 187)

H-INDEX

69
(FIVE YEARS 6)

2022 ◽  
Vol 19 (1) ◽  
pp. 1-23
Author(s):  
Yaosheng Fu ◽  
Evgeny Bolotin ◽  
Niladrish Chatterjee ◽  
David Nellans ◽  
Stephen W. Keckler

As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design trying to address diverging architectural requirements between FP32 (or larger)-based HPC and FP16 (or smaller)-based DL workloads results in sub-optimal configurations for either of the application domains. We argue that a C omposable O n- PA ckage GPU (COPA-GPU) architecture to provide domain-specialized GPU products is the most practical solution to these diverging requirements. A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with memory system specialization per application domain. We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4× higher off-die bandwidth, 32× larger on-package cache, and 2.3× higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs. This work explores the microarchitectural design necessary to enable composable GPUs and evaluates the benefits composability can provide to HPC, DL training, and DL inference. We show that when compared to a converged GPU design, a DL-optimized COPA-GPU featuring a combination of 16× larger cache capacity and 1.6× higher DRAM bandwidth scales per-GPU training and inference performance by 31% and 35%, respectively, and reduces the number of GPU instances by 50% in scale-out training scenarios.


2022 ◽  
Vol 21 (1) ◽  
pp. 1-18
Author(s):  
Fei Wen ◽  
Mian Qin ◽  
Paul Gratz ◽  
Narasimha Reddy

Hybrid memory systems, comprised of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of current mobile applications. Recently emerging NVM technologies, such as phase-change memories (PCM), memristor, and 3D XPoint, have higher capacity density, minimal static power consumption and lower cost per GB. However, NVM has longer access latency and limited write endurance as opposed to DRAM. The different characteristics of distinct memory classes render a new challenge for memory system design. Ideally, pages should be placed or migrated between the two types of memories according to the data objects’ access properties. Prior system software approaches exploit the program information from OS but at the cost of high software latency incurred by related kernel processes. Hardware approaches can avoid these latencies, however, hardware’s vision is constrained to a short time window of recent memory requests, due to the limited on-chip resources. In this work, we propose OpenMem: a hardware-software cooperative approach that combines the execution time advantages of pure hardware approaches with the data object properties in a global scope. First, we built a hardware-based memory manager unit (HMMU) that can learn the short-term access patterns by online profiling, and execute data migration efficiently. Then, we built a heap memory manager for the heterogeneous memory systems that allows the programmer to directly customize each data object’s allocation to a favorable memory device within the presumed object life cycle. With the programmer’s hints guiding the data placement at allocation time, data objects with similar properties will be congregated to reduce unnecessary page migrations. We implemented the whole system on the FPGA board with embedded ARM processors. In testing under a set of benchmark applications from SPEC 2017 and PARSEC, experimental results show that OpenMem reduces 44.6% energy consumption with only a 16% performance degradation compared to the all-DRAM memory system. The amount of writes to the NVM is reduced by 14% versus the HMMU-only, extending the NVM device lifetime.


2022 ◽  
Vol 2022 ◽  
pp. 1-13
Author(s):  
Jianhua Li ◽  
Guanlong Liu ◽  
Zhiyuan Zhen ◽  
Zihao Shen ◽  
Shiliang Li ◽  
...  

Molecular docking aims to predict possible drug candidates for many diseases, and it is computationally intensive. Particularly, in simulating the ligand-receptor binding process, the binding pocket of the receptor is divided into subcubes, and when the ligand is docked into all cubes, there are many molecular docking tasks, which are extremely time-consuming. In this study, we propose a heterogeneous parallel scheme of molecular docking for the binding process of ligand to receptor to accelerate simulating. The parallel scheme includes two layers of parallelism, a coarse-grained layer of parallelism implemented in the message-passing interface (MPI) and a fine-grained layer of parallelism focused on the graphics processing unit (GPU). At the coarse-grain layer of parallelism, a docking task inside one lattice is assigned to one unique MPI process, and a grouped master-slave mode is used to allocate and schedule the tasks. Meanwhile, at the fine-gained layer of parallelism, GPU accelerators undertake the computationally intensive computing of scoring functions and related conformation spatial transformations in a single docking task. The results of the experiments for the ligand-receptor binding process show that on a multicore server with GPUs the parallel program has achieved a speedup ratio as high as 45 times in flexible docking and as high as 54.5 times in semiflexible docking, and on a distributed memory system, the docking time for flexible docking and that for semiflexible docking gradually decrease as the number of nodes used in the parallel program gradually increases. The scalability of the parallel program is also verified in multiple nodes on a distributed memory system and is approximately linear.


Author(s):  
Guan Wang ◽  
Hua Li ◽  
Qijian Zhang ◽  
Fengjuan Zhu ◽  
Junwei Yuan ◽  
...  

Artificial retinal materials play an important role in vision repair and robotics. However, the inability to convert characteristic visible light signals into electrical signals has seriously hindered the development and...


2021 ◽  
Author(s):  
Fei Wen ◽  
Mian Qin ◽  
Paul Gratz ◽  
Narasimha Reddy

Public Health ◽  
2021 ◽  
Vol 201 ◽  
pp. 108-114
Author(s):  
Xiaojun Guo ◽  
Houxue Shen ◽  
Sifeng Liu ◽  
Naiming Xie ◽  
Yingjie Yang ◽  
...  

Cortex ◽  
2021 ◽  
Author(s):  
Qing Ma ◽  
Edmund T. Rolls ◽  
Chu-Chung Huang ◽  
Wei Cheng ◽  
Jianfeng Feng

Sign in / Sign up

Export Citation Format

Share Document