memory bandwidth
Recently Published Documents

TOTAL DOCUMENTS: 261 (last five years: 53)
H-INDEX: 20 (last five years: 3)

2022, Vol 21 (1), pp. 1-29
Author(s): Lanshun Nie, Chenghao Fan, Shuang Lin, Li Zhang, Yajuan Li, ...

With the hardware and workload consolidation trend in embedded systems and the rapid development of edge computing, there has been increasing interest in supporting parallel real-time tasks to better utilize multi-core platforms while meeting stringent real-time constraints. For parallel real-time tasks, the federated scheduling paradigm, which assigns each parallel task a set of dedicated cores, achieves good theoretical bounds by ensuring exclusive use of processing resources to reduce interference. However, because cores share the last-level cache and memory bandwidth, in practice tasks may still interfere with each other despite executing on dedicated cores. Such interference due to concurrent resource accesses can be even more severe on embedded platforms or edge servers, where computing power and cache/memory space are limited. To tackle this issue, we present a holistic resource allocation framework for parallel real-time tasks under federated scheduling. Under the proposed framework, each parallel task is assigned dedicated cache and memory bandwidth resources in addition to dedicated cores. We further propose a holistic resource allocation algorithm that balances the allocation across the different resources to achieve good schedulability. Additionally, we provide a full implementation of our framework by extending the federated scheduling system with Intel's Cache Allocation Technology and MemGuard. Finally, we demonstrate the practicality of the proposed framework via extensive numerical evaluations and empirical experiments using real benchmark programs.
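
The abstract does not detail the allocation algorithm, but the classic federated-scheduling bound gives a feel for the core-assignment step: a task with total work C_i, critical-path length L_i, and deadline D_i needs ceil((C_i - L_i)/(D_i - L_i)) dedicated cores. Below is a minimal Python sketch under that assumption, with a hypothetical per-configuration search over cache ways and bandwidth budgets; it is an illustration, not the authors' algorithm.

```python
# Sketch only: classic federated bound plus a greedy pick over hypothetical
# (cache ways, bandwidth) configurations measured per task.
import math
from dataclasses import dataclass

@dataclass
class Config:
    cache_ways: int       # dedicated LLC ways (set via Intel CAT)
    bandwidth_mb: int     # dedicated memory bandwidth (set via MemGuard)
    work: float           # total work C_i measured under this configuration
    span: float           # critical-path length L_i under this configuration

def cores_needed(cfg: Config, deadline: float) -> float:
    """Classic federated bound: n_i = ceil((C_i - L_i) / (D_i - L_i))."""
    if cfg.span >= deadline:
        return math.inf   # infeasible even with unlimited cores
    return math.ceil((cfg.work - cfg.span) / (deadline - cfg.span))

def cheapest_feasible(configs, deadline, max_cores):
    """Pick the feasible configuration that uses the fewest resources."""
    feasible = [c for c in configs if cores_needed(c, deadline) <= max_cores]
    return min(feasible,
               key=lambda c: (cores_needed(c, deadline), c.cache_ways,
                              c.bandwidth_mb),
               default=None)
```

The point of the holistic view is visible even in this toy: giving a task more cache ways or bandwidth shrinks its measured work and span, which in turn reduces the cores it needs, so the three resources trade off against each other.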


2021, pp. 102346
Author(s): Ishfaq Hussain, Muhammad Ali Awan, Pedro F. Souto, Konstantinos Bletsas, Eduardo Tovar

Electronics, 2021, Vol 10 (22), pp. 2823
Author(s): Maarten Vandersteegen, Kristof Van Beeck, Toon Goedemé

Quantization of neural networks has been one of the most popular techniques for compressing models for embedded (IoT) hardware platforms with highly constrained latency, storage, memory-bandwidth, and energy budgets. Limiting the number of bits per weight and activation has been the main focus in the literature. To avoid major accuracy degradation, common quantization methods introduce additional scale factors that adapt the quantized values to the diverse data ranges present in full-precision (floating-point) neural networks. These scales are usually kept in high precision, requiring the target compute engine to support a few high-precision multiplications, which is undesirable due to the larger hardware cost. Little effort has yet been invested in avoiding high-precision multipliers altogether, especially in combination with 4-bit weights. This work proposes a new quantization scheme, based on power-of-two quantization scales, that performs on par with uniform per-channel quantization using full-precision 32-bit quantization scales while using only 4-bit weights. This is achieved through the addition of a low-precision lookup table that translates the stored 4-bit weights into non-uniformly distributed 8-bit weights for internal computation. All our quantized ImageNet CNNs achieved or even exceeded the Top-1 accuracy of their full-precision counterparts, with ResNet18 exceeding its full-precision model by 0.35%. Our MobileNetV2 model achieved state-of-the-art performance with only a slight accuracy drop of 0.51%.
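
To make the mechanism concrete, the sketch below shows stored 4-bit codes being expanded through a 16-entry lookup table into non-uniform 8-bit weights, with a power-of-two scale applied as a shift. The LUT values and the static scale exponent here are made up for illustration and will differ from the paper's trained values.

```python
# Illustrative mechanics of LUT-based 4-bit quantization with a
# power-of-two scale; not the paper's exact LUT or training procedure.
import numpy as np

# Hypothetical 16-entry LUT: non-uniformly spaced signed 8-bit levels.
LUT = np.array([-128, -96, -72, -52, -36, -24, -14, -6,
                   0,   6,  14,  24,  36,  52,  72,  96], dtype=np.int8)

def quantize(w: np.ndarray, scale_exp: int) -> np.ndarray:
    """Map float weights to the nearest LUT level; store 4-bit indices."""
    levels = LUT.astype(np.float32) * 2.0 ** scale_exp
    return np.abs(w[:, None] - levels[None, :]).argmin(axis=1).astype(np.uint8)

def dequantize_int(codes: np.ndarray) -> np.ndarray:
    """Expand 4-bit codes to 8-bit integers for the internal dot product."""
    return LUT[codes]  # a table lookup replaces a high-precision multiply

# The power-of-two scale is applied by a shift after the integer
# accumulation (shown here on the float side for clarity).
w = np.random.randn(8).astype(np.float32) * 0.05
codes = quantize(w, scale_exp=-10)                       # scale = 2**-10
approx = dequantize_int(codes).astype(np.float32) * 2.0 ** -10
```

Because the scale is a power of two, the rescaling step needs only a shifter rather than a high-precision multiplier, which is the hardware saving the abstract refers to.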


2021
Author(s): Mark Khait, Denis Voskov

Alternatives to CPU computing architectures, such as GPUs, continue to evolve, widening the gap in peak memory bandwidth achievable on a conventional workstation or laptop. Such architectures are attractive for reservoir simulation, whose performance is generally bounded by system memory bandwidth. However, to harvest the benefits of a new architecture, the source code inevitably has to be rewritten, sometimes almost completely. One of the biggest challenges here is refactoring the Jacobian assembly, which typically involves large volumes of code and complex data processing. We demonstrate an effective and general way to simplify the linearization stage by extracting complex physics-related computations from the main simulation loop and leaving only an algebraic multilinear interpolation kernel in their place. In this work, we provide a detailed description of the simulation performance benefits of executing the entire nonlinear loop on the GPU platform. We evaluate the computational performance of the Delft Advanced Research Terra Simulator (DARTS) for various subsurface applications of practical interest on both CPU and GPU platforms, comparing particular workflow phases, including Jacobian assembly and the solution of the linear system with both stages of the Constrained Pressure Residual (CPR) preconditioner.
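
The "multilinear interpolation kernel" refers to the operator-based linearization approach underlying DARTS: physics operators are tabulated on a grid of state variables offline, so the nonlinear loop only interpolates values and derivatives. A 1D illustrative sketch follows; the function names and the stand-in operator are hypothetical, not the DARTS API.

```python
# Sketch of operator-based linearization in 1D: tabulate an operator once,
# then evaluate value + derivative by piecewise-linear interpolation.
import numpy as np

def build_table(op, lo, hi, n):
    """Precompute operator values on a uniform grid of the state variable."""
    xs = np.linspace(lo, hi, n)
    return xs, np.array([op(x) for x in xs])

def interp_with_deriv(xs, vals, x):
    """Piecewise-linear value and slope; the slope feeds the Jacobian."""
    i = np.clip(np.searchsorted(xs, x) - 1, 0, len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    slope = (vals[i + 1] - vals[i]) / (xs[i + 1] - xs[i])
    return vals[i] + t * (vals[i + 1] - vals[i]), slope

# Example: a stand-in "mobility"-like operator tabulated once, then
# evaluated cheaply inside the nonlinear loop.
xs, vals = build_table(lambda p: p * p / (1.0 + p), 0.0, 1.0, 64)
value, d_value = interp_with_deriv(xs, vals, 0.37)
```

Because the loop body reduces to this uniform interpolation pattern regardless of the underlying physics, porting the assembly to a GPU no longer requires rewriting the physics code itself.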


Author(s): Imjae Hwang, Juwon Yun, Woonam Chung, Jaeshin Lee, Cheong-Ghil Kim, ...

In a computing environment, higher resolutions generally require more memory bandwidth, which inevitably leads to higher power consumption. This can become critical for the overall performance of mobile devices and graphics processing units as the amount of memory access and memory bandwidth grows. This paper proposes a lossless compression algorithm that combines multiple differential pulse-code modulation (DPCM) with variable sign-code Golomb-Rice coding to reduce the memory bandwidth requirement. The efficiency of the proposed multiple DPCM is enhanced by selecting the optimal DPCM mode. The experimental results show a compression ratio of 1.99 on high-efficiency video coding (HEVC) image sequences and confirm that the proposed lossless compression hardware can reduce the bus bandwidth requirement.
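
To make the two stages concrete, here is a toy Python sketch of DPCM prediction followed by Rice coding (Golomb-Rice with a power-of-two divisor) of sign-folded residuals. The paper's mode selection and hardware mapping are omitted, and the parameter k is fixed for brevity where a real codec would adapt it.

```python
# Toy DPCM + Rice coder; illustrative only, not the paper's design.
def dpcm_residuals(samples):
    """Left-neighbour prediction: residual = current - previous sample."""
    prev = 0                 # note: the first residual is the sample itself
    out = []
    for s in samples:
        out.append(s - prev)
        prev = s
    return out

def zigzag(r):
    """Fold signed residuals to unsigned: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return (r << 1) if r >= 0 else ((-r << 1) - 1)

def rice_encode(value, k):
    """Unary-coded quotient, then a k-bit binary remainder."""
    q, r = value >> k, value & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

pixels = [120, 121, 119, 119, 124, 130]
bits = "".join(rice_encode(zigzag(r), k=2) for r in dpcm_residuals(pixels))
```

Smooth image rows produce small residuals, which Rice coding turns into short codewords; that is the source of the roughly 2x compression ratio reported.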


2021, Vol 11 (20), pp. 9495
Author(s): Tadeusz Tomczak

The performance of lattice-Boltzmann solver implementations usually depends mainly on memory access patterns. Achieving high performance therefore requires complex code that carefully handles data placement and the ordering of memory transactions. In this work, we analyse the performance of an implementation based on a new approach, the data-oriented language, which allows complex memory access patterns to be combined with simple source code. As a use case, we present and provide the source code of a solver for the D2Q9 lattice and show its performance on a GTX Titan Xp GPU for dense and sparse geometries of up to 4096² nodes. The obtained results are promising: around 1000 lines of code allowed us to achieve performance in the range of 0.6 to 0.7 of the maximum theoretical memory bandwidth (over 2.5 and 5.0 GLUPS for double and single precision, respectively) for meshes larger than 1024² nodes, which is close to the current state of the art. However, we also observed relatively high and sometimes hard-to-predict overheads, especially for sparse data structures. A further issue was the rather long compilation time, which extended the duration of short simulations, as well as the lack of access to low-level optimisation mechanisms.
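
For readers unfamiliar with the method, a bare-bones D2Q9 BGK step in NumPy shows what the solver's inner loop computes. The paper's implementation instead uses a data-oriented language to control GPU data layout, which this sketch makes no attempt to reproduce.

```python
# Minimal D2Q9 BGK collide-and-stream step on a periodic grid; a sketch of
# the numerics only, with none of the paper's layout optimisations.
import numpy as np

# D2Q9 velocity set and lattice weights.
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def step(f, omega=1.0):
    rho = f.sum(axis=0)                           # density per node
    u = np.einsum('qi,qxy->ixy', c, f) / rho      # velocity per node
    cu = np.einsum('qi,ixy->qxy', c, u)           # c_q . u
    usq = (u * u).sum(axis=0)
    feq = w[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
    f = f + omega * (feq - f)                     # BGK collision
    for q in range(9):                            # periodic streaming
        f[q] = np.roll(f[q], shift=tuple(c[q]), axis=(0, 1))
    return f

f = np.tile(w[:, None, None], (1, 64, 64))        # uniform initial state
f = step(f)
```

Each node touches 9 populations twice per step, so the bytes moved per update dominate the runtime; this is why the paper reports performance as a fraction of peak memory bandwidth and in lattice updates per second (GLUPS).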


Electronics, 2021, Vol 10 (15), pp. 1750
Author(s): Manho Kim, Sung-Ho Kim, Hyuk-Jae Lee, Chae-Eun Rhee

Since the advent of computers, computing performance has been steadily increasing. Moreover, recent technologies are mostly based on massive data, and the development of artificial intelligence is accelerating this trend. Accordingly, various studies aim to increase the performance of computing and data access while reducing energy consumption. In-memory computing (IMC) and in-storage computing (ISC) are currently the most actively studied architectures for dealing with these challenges. Since IMC performs operations in memory, it has the potential to overcome the memory bandwidth limit. ISC can reduce energy by using a low-power processor inside the storage device, avoiding an expensive I/O interface. To integrate the host CPU, IMC, and ISC harmoniously, an appropriate workload allocation that reflects the characteristics of the target application is required. In this paper, energy and processing speed are evaluated as a function of the workload allocation and system conditions, and a proof-of-concept prototype of the integrated architecture is implemented. The simulation results show that IMC improves performance by 4.4 times and reduces total energy by 4.6 times over the baseline host CPU, and ISC is confirmed to contribute significantly to energy reduction.
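
The abstract does not give the allocation policy itself; as a purely hypothetical illustration of the trade-off involved, a toy model could place each kernel on the unit minimising energy, estimated as power times the sum of compute and data-transfer time. All numbers below are placeholders, not measurements from the paper.

```python
# Hypothetical back-of-the-envelope placement model for host CPU vs.
# in-memory (IMC) vs. in-storage (ISC) compute; illustrative values only.
from dataclasses import dataclass

@dataclass
class Target:
    name: str
    compute_time: float   # seconds to execute the kernel on this unit
    transfer_time: float  # seconds to move inputs/outputs to the unit
    power_w: float        # average power while active, in watts

def place(targets):
    """Pick the target minimising energy = power * (compute + transfer)."""
    return min(targets,
               key=lambda t: t.power_w * (t.compute_time + t.transfer_time))

choice = place([
    Target("host-CPU", compute_time=1.0, transfer_time=0.5, power_w=60),
    Target("IMC",      compute_time=0.3, transfer_time=0.0, power_w=20),
    Target("ISC",      compute_time=2.0, transfer_time=0.1, power_w=5),
])
```

The sketch captures why the answer is application-dependent: IMC wins when data already sits in memory, while ISC can win on energy for kernels that would otherwise drag large volumes of data across the I/O interface.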

