memory bandwidth
Recently Published Documents

TOTAL DOCUMENTS: 261 (last five years: 53)
H-INDEX: 20 (last five years: 3)

2022, Vol 21 (1), pp. 1-29
Author(s): Lanshun Nie, Chenghao Fan, Shuang Lin, Li Zhang, Yajuan Li, ...

With the hardware and workload consolidation trend in embedded systems and the rapid development of edge computing, there has been increasing interest in supporting parallel real-time tasks to better utilize multi-core platforms while meeting stringent real-time constraints. For parallel real-time tasks, the federated scheduling paradigm, which assigns each parallel task a set of dedicated cores, achieves good theoretical bounds by ensuring exclusive use of processing resources to reduce interference. However, because cores share the last-level cache and memory bandwidth, in practice tasks may still interfere with each other despite executing on dedicated cores. Such interference due to concurrent resource accesses can be even more severe on embedded platforms or edge servers, where computing power and cache/memory space are limited. To tackle this issue, we present a holistic resource allocation framework for parallel real-time tasks under federated scheduling. Under the proposed framework, each parallel task is assigned dedicated cache and memory bandwidth resources in addition to dedicated cores. We further propose a holistic resource allocation algorithm that balances the allocation across the different resources to achieve good schedulability. Additionally, we provide a full implementation of our framework by extending the federated scheduling system with Intel's Cache Allocation Technology and MemGuard. Finally, we demonstrate the practicality of the proposed framework via extensive numerical evaluations and empirical experiments using real benchmark programs.
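
The abstract does not detail the allocation algorithm, but the classic federated-scheduling bound gives a feel for the core-assignment step: a task with total work C_i, critical-path length L_i, and deadline D_i needs ceil((C_i - L_i)/(D_i - L_i)) dedicated cores. Below is a minimal Python sketch under that assumption, with a hypothetical per-configuration search over cache ways and bandwidth budgets; it is an illustration, not the authors' algorithm.

```python
# Sketch only: classic federated bound plus a greedy pick over hypothetical
# (cache ways, bandwidth) configurations measured per task.
import math
from dataclasses import dataclass

@dataclass
class Config:
    cache_ways: int       # dedicated LLC ways (set via Intel CAT)
    bandwidth_mb: int     # dedicated memory bandwidth (set via MemGuard)
    work: float           # total work C_i measured under this configuration
    span: float           # critical-path length L_i under this configuration

def cores_needed(cfg: Config, deadline: float) -> float:
    """Classic federated bound: n_i = ceil((C_i - L_i) / (D_i - L_i))."""
    if cfg.span >= deadline:
        return math.inf   # infeasible even with unlimited cores
    return math.ceil((cfg.work - cfg.span) / (deadline - cfg.span))

def cheapest_feasible(configs, deadline, max_cores):
    """Pick the feasible configuration that uses the fewest resources."""
    feasible = [c for c in configs if cores_needed(c, deadline) <= max_cores]
    return min(feasible,
               key=lambda c: (cores_needed(c, deadline), c.cache_ways,
                              c.bandwidth_mb),
               default=None)
```

The point of the holistic view is visible even in this toy: giving a task more cache ways or bandwidth shrinks its measured work and span, which in turn reduces the cores it needs, so the three resources trade off against each other.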


2021, pp. 102346
Author(s): Ishfaq Hussain, Muhammad Ali Awan, Pedro F. Souto, Konstantinos Bletsas, Eduardo Tovar

Electronics, 2021, Vol 10 (22), pp. 2823
Author(s): Maarten Vandersteegen, Kristof Van Beeck, Toon Goedemé

Quantization of neural networks has been one of the most popular techniques for compressing models for embedded (IoT) hardware platforms with highly constrained latency, storage, memory-bandwidth, and energy budgets. Limiting the number of bits per weight and activation has been the main focus in the literature. To avoid major accuracy degradation, common quantization methods introduce additional scale factors that adapt the quantized values to the diverse data ranges present in full-precision (floating-point) neural networks. These scales are usually kept in high precision, requiring the target compute engine to support a few high-precision multiplications, which is undesirable due to the larger hardware cost. Little effort has yet been invested in avoiding high-precision multipliers altogether, especially in combination with 4-bit weights. This work proposes a new quantization scheme, based on power-of-two quantization scales, that performs on par with uniform per-channel quantization using full-precision 32-bit quantization scales while using only 4-bit weights. This is achieved through the addition of a low-precision lookup table that translates the stored 4-bit weights into non-uniformly distributed 8-bit weights for internal computation. All our quantized ImageNet CNNs achieved or even exceeded the Top-1 accuracy of their full-precision counterparts, with ResNet18 exceeding its full-precision model by 0.35%. Our MobileNetV2 model achieved state-of-the-art performance with only a slight accuracy drop of 0.51%.
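
To make the mechanism concrete, the sketch below shows stored 4-bit codes being expanded through a 16-entry lookup table into non-uniform 8-bit weights, with a power-of-two scale applied as a shift. The LUT values and the static scale exponent here are made up for illustration and will differ from the paper's trained values.

```python
# Illustrative mechanics of LUT-based 4-bit quantization with a
# power-of-two scale; not the paper's exact LUT or training procedure.
import numpy as np

# Hypothetical 16-entry LUT: non-uniformly spaced signed 8-bit levels.
LUT = np.array([-128, -96, -72, -52, -36, -24, -14, -6,
                   0,   6,  14,  24,  36,  52,  72,  96], dtype=np.int8)

def quantize(w: np.ndarray, scale_exp: int) -> np.ndarray:
    """Map float weights to the nearest LUT level; store 4-bit indices."""
    levels = LUT.astype(np.float32) * 2.0 ** scale_exp
    return np.abs(w[:, None] - levels[None, :]).argmin(axis=1).astype(np.uint8)

def dequantize_int(codes: np.ndarray) -> np.ndarray:
    """Expand 4-bit codes to 8-bit integers for the internal dot product."""
    return LUT[codes]  # a table lookup replaces a high-precision multiply

# The power-of-two scale is applied by a shift after the integer
# accumulation (shown here on the float side for clarity).
w = np.random.randn(8).astype(np.float32) * 0.05
codes = quantize(w, scale_exp=-10)                       # scale = 2**-10
approx = dequantize_int(codes).astype(np.float32) * 2.0 ** -10
```

Because the scale is a power of two, the rescaling step needs only a shifter rather than a high-precision multiplier, which is the hardware saving the abstract refers to.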


2021
Author(s): Mark Khait, Denis Voskov

Alternatives to CPU computing architectures, such as GPUs, continue to evolve, widening the gap in peak memory bandwidth achievable on a conventional workstation or laptop. Such architectures are attractive for reservoir simulation, whose performance is generally bounded by system memory bandwidth. However, to harvest the benefits of a new architecture, the source code inevitably has to be rewritten, sometimes almost completely. One of the biggest challenges here is refactoring the Jacobian assembly, which typically involves large volumes of code and complex data processing. We demonstrate an effective and general way to simplify the linearization stage by extracting complex physics-related computations from the main simulation loop and leaving only an algebraic multilinear interpolation kernel in their place. In this work, we provide a detailed description of the simulation performance benefits of executing the entire nonlinear loop on the GPU platform. We evaluate the computational performance of the Delft Advanced Research Terra Simulator (DARTS) for various subsurface applications of practical interest on both CPU and GPU platforms, comparing particular workflow phases, including Jacobian assembly and the solution of the linear system with both stages of the Constrained Pressure Residual (CPR) preconditioner.
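
The "multilinear interpolation kernel" refers to the operator-based linearization approach underlying DARTS: physics operators are tabulated on a grid of state variables offline, so the nonlinear loop only interpolates values and derivatives. A 1D illustrative sketch follows; the function names and the stand-in operator are hypothetical, not the DARTS API.

```python
# Sketch of operator-based linearization in 1D: tabulate an operator once,
# then evaluate value + derivative by piecewise-linear interpolation.
import numpy as np

def build_table(op, lo, hi, n):
    """Precompute operator values on a uniform grid of the state variable."""
    xs = np.linspace(lo, hi, n)
    return xs, np.array([op(x) for x in xs])

def interp_with_deriv(xs, vals, x):
    """Piecewise-linear value and slope; the slope feeds the Jacobian."""
    i = np.clip(np.searchsorted(xs, x) - 1, 0, len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    slope = (vals[i + 1] - vals[i]) / (xs[i + 1] - xs[i])
    return vals[i] + t * (vals[i + 1] - vals[i]), slope

# Example: a stand-in "mobility"-like operator tabulated once, then
# evaluated cheaply inside the nonlinear loop.
xs, vals = build_table(lambda p: p * p / (1.0 + p), 0.0, 1.0, 64)
value, d_value = interp_with_deriv(xs, vals, 0.37)
```

Because the loop body reduces to this uniform interpolation pattern regardless of the underlying physics, porting the assembly to a GPU no longer requires rewriting the physics code itself.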


Author(s): Imjae Hwang, Juwon Yun, Woonam Chung, Jaeshin Lee, Cheong-Ghil Kim, ...

In a computing environment, higher resolutions generally require more memory bandwidth, which inevitably leads to higher power consumption. This can become critical for the overall performance of mobile devices and graphics processing units as the amount of memory access and memory bandwidth grows. This paper proposes a lossless compression algorithm that combines multiple differential pulse-code modulation (DPCM) with variable sign-code Golomb-Rice coding to reduce the memory bandwidth requirement. The efficiency of the proposed multiple DPCM is enhanced by selecting the optimal DPCM mode. The experimental results show a compression ratio of 1.99 on high-efficiency video coding (HEVC) image sequences and confirm that the proposed lossless compression hardware can reduce the bus bandwidth requirement.
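
To make the two stages concrete, here is a toy Python sketch of DPCM prediction followed by Rice coding (Golomb-Rice with a power-of-two divisor) of sign-folded residuals. The paper's mode selection and hardware mapping are omitted, and the parameter k is fixed for brevity where a real codec would adapt it.

```python
# Toy DPCM + Rice coder; illustrative only, not the paper's design.
def dpcm_residuals(samples):
    """Left-neighbour prediction: residual = current - previous sample."""
    prev = 0                 # note: the first residual is the sample itself
    out = []
    for s in samples:
        out.append(s - prev)
        prev = s
    return out

def zigzag(r):
    """Fold signed residuals to unsigned: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return (r << 1) if r >= 0 else ((-r << 1) - 1)

def rice_encode(value, k):
    """Unary-coded quotient, then a k-bit binary remainder."""
    q, r = value >> k, value & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

pixels = [120, 121, 119, 119, 124, 130]
bits = "".join(rice_encode(zigzag(r), k=2) for r in dpcm_residuals(pixels))
```

Smooth image rows produce small residuals, which Rice coding turns into short codewords; that is the source of the roughly 2x compression ratio reported.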


2021, Vol 11 (20), pp. 9495
Author(s): Tadeusz Tomczak

The performance of lattice-Boltzmann solver implementations usually depends mainly on memory access patterns. Achieving high performance therefore requires complex code that carefully handles data placement and the ordering of memory transactions. In this work, we analyse the performance of an implementation based on a new approach, the data-oriented language, which allows complex memory access patterns to be combined with simple source code. As a use case, we present and provide the source code of a solver for the D2Q9 lattice and show its performance on a GTX Titan Xp GPU for dense and sparse geometries of up to 4096² nodes. The obtained results are promising: around 1000 lines of code allowed us to achieve performance in the range of 0.6 to 0.7 of the maximum theoretical memory bandwidth (over 2.5 and 5.0 GLUPS for double and single precision, respectively) for meshes larger than 1024² nodes, which is close to the current state of the art. However, we also observed relatively high and sometimes hard-to-predict overheads, especially for sparse data structures. A further issue was the rather long compilation time, which extended the duration of short simulations, as well as the lack of access to low-level optimisation mechanisms.
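
For readers unfamiliar with the method, a bare-bones D2Q9 BGK step in NumPy shows what the solver's inner loop computes. The paper's implementation instead uses a data-oriented language to control GPU data layout, which this sketch makes no attempt to reproduce.

```python
# Minimal D2Q9 BGK collide-and-stream step on a periodic grid; a sketch of
# the numerics only, with none of the paper's layout optimisations.
import numpy as np

# D2Q9 velocity set and lattice weights.
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def step(f, omega=1.0):
    rho = f.sum(axis=0)                           # density per node
    u = np.einsum('qi,qxy->ixy', c, f) / rho      # velocity per node
    cu = np.einsum('qi,ixy->qxy', c, u)           # c_q . u
    usq = (u * u).sum(axis=0)
    feq = w[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
    f = f + omega * (feq - f)                     # BGK collision
    for q in range(9):                            # periodic streaming
        f[q] = np.roll(f[q], shift=tuple(c[q]), axis=(0, 1))
    return f

f = np.tile(w[:, None, None], (1, 64, 64))        # uniform initial state
f = step(f)
```

Each node touches 9 populations twice per step, so the bytes moved per update dominate the runtime; this is why the paper reports performance as a fraction of peak memory bandwidth and in lattice updates per second (GLUPS).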


Electronics, 2021, Vol 10 (15), pp. 1750
Author(s): Manho Kim, Sung-Ho Kim, Hyuk-Jae Lee, Chae-Eun Rhee

Since the advent of computers, computing performance has been steadily increasing. Moreover, recent technologies are mostly based on massive data, and the development of artificial intelligence is accelerating this trend. Accordingly, various studies aim to increase the performance of computing and data access while reducing energy consumption. In-memory computing (IMC) and in-storage computing (ISC) are currently the most actively studied architectures for dealing with these challenges. Since IMC performs operations in memory, it has the potential to overcome the memory bandwidth limit. ISC can reduce energy by using a low-power processor inside the storage device, avoiding an expensive I/O interface. To integrate the host CPU, IMC, and ISC harmoniously, an appropriate workload allocation that reflects the characteristics of the target application is required. In this paper, energy and processing speed are evaluated as a function of the workload allocation and system conditions, and a proof-of-concept prototype of the integrated architecture is implemented. The simulation results show that IMC improves performance by 4.4 times and reduces total energy by 4.6 times over the baseline host CPU, and ISC is confirmed to contribute significantly to energy reduction.
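
The abstract does not give the allocation policy itself; as a purely hypothetical illustration of the trade-off involved, a toy model could place each kernel on the unit minimising energy, estimated as power times the sum of compute and data-transfer time. All numbers below are placeholders, not measurements from the paper.

```python
# Hypothetical back-of-the-envelope placement model for host CPU vs.
# in-memory (IMC) vs. in-storage (ISC) compute; illustrative values only.
from dataclasses import dataclass

@dataclass
class Target:
    name: str
    compute_time: float   # seconds to execute the kernel on this unit
    transfer_time: float  # seconds to move inputs/outputs to the unit
    power_w: float        # average power while active, in watts

def place(targets):
    """Pick the target minimising energy = power * (compute + transfer)."""
    return min(targets,
               key=lambda t: t.power_w * (t.compute_time + t.transfer_time))

choice = place([
    Target("host-CPU", compute_time=1.0, transfer_time=0.5, power_w=60),
    Target("IMC",      compute_time=0.3, transfer_time=0.0, power_w=20),
    Target("ISC",      compute_time=2.0, transfer_time=0.1, power_w=5),
])
```

The sketch captures why the answer is application-dependent: IMC wins when data already sits in memory, while ISC can win on energy for kernels that would otherwise drag large volumes of data across the I/O interface.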

