MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling

Author(s):  
Akshay Venkatesh ◽  
Khaled Hamidouche ◽  
Sreeram Potluri ◽  
Davide Rossetti ◽  
Ching-Hsiang Chu ◽  
...  
2022 ◽  
Vol 15 (1) ◽  
pp. 1-32
Author(s):  
Lana Josipović ◽  
Shabnam Sheikhha ◽  
Andrea Guerrieri ◽  
Paolo Ienne ◽  
Jordi Cortadella

Commercial high-level synthesis tools typically produce statically scheduled circuits. Yet, effective C-to-circuit conversion of arbitrary software applications calls for dataflow circuits, as they can efficiently handle variable latencies (e.g., caches), unpredictable memory dependencies, and irregular control flow. Dataflow circuits exhibit an unconventional property: registers (usually referred to as “buffers”) can be placed anywhere in the circuit without changing its semantics, in strong contrast to what happens in traditional datapaths. Yet, although functionally irrelevant, this placement has a significant impact on the circuit’s timing and throughput. In this work, we show how to strategically place buffers into a dataflow circuit to optimize its performance. Our approach extracts a set of choice-free critical loops from arbitrary dataflow circuits and relies on the theory of marked graphs to optimize the buffer placement and sizing. Our performance optimization model supports important high-level synthesis features such as pipelined computational units, units with variable latency and throughput, and if-conversion. We demonstrate the performance benefits of our approach on a set of dataflow circuits obtained from imperative code.
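
The abstract above leans on the theory of marked graphs. As a rough illustration of why latency inside a loop matters for throughput, the sketch below applies the standard cycle-ratio bound for timed marked graphs: a loop can sustain at most (tokens in the loop) / (total latency of the loop) results per clock cycle, and the slowest loop limits the whole circuit. The edge format `(src, dst, latency, tokens)` and the names `simple_cycles` and `loop_throughput` are assumptions made for this example; this is not the authors' optimization model, which the abstract does not spell out.

```python
from fractions import Fraction


def simple_cycles(edges):
    """Enumerate simple cycles of a small directed graph as lists of edge indices.

    edges: list of (src, dst, latency, tokens) tuples (hypothetical format).
    Each cycle is reported once, rooted at its smallest node.
    """
    out = {}
    for i, (u, _v, _lat, _tok) in enumerate(edges):
        out.setdefault(u, []).append(i)
    nodes = sorted({n for e in edges for n in e[:2]})
    cycles = []

    def dfs(start, node, path, visited):
        for i in out.get(node, []):
            nxt = edges[i][1]
            if nxt == start:
                cycles.append(path + [i])
            elif nxt not in visited and nxt > start:
                dfs(start, nxt, path + [i], visited | {nxt})

    for s in nodes:
        dfs(s, s, [], {s})
    return cycles


def loop_throughput(edges):
    """Return the min over cycles of tokens / latency, or None if no cycle exists."""
    best = None
    for cyc in simple_cycles(edges):
        latency = sum(edges[i][2] for i in cyc)
        tokens = sum(edges[i][3] for i in cyc)
        if latency == 0:
            continue  # a loop with no registers is not a valid timed loop here
        ratio = Fraction(tokens, latency)
        best = ratio if best is None else min(best, ratio)
    return best


# Two loops sharing node 1: loop 0-1 has ratio 1/2, loop 1-2 has ratio 1/3.
example = [(0, 1, 1, 1), (1, 0, 1, 0), (1, 2, 1, 0), (2, 1, 2, 1)]
print(loop_throughput(example))  # 1/3
```

Under this simplified model, adding a register stage to an edge of the critical loop raises its latency and lowers the bound, which is one way to see why placement is functionally irrelevant but not performance-neutral.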


Author(s):  
Gabor E. Gevay ◽  
Tilmann Rabl ◽  
Sebastian Bres ◽  
Lorand Madai-Tahy ◽  
Jorge-Arnulfo Quiane-Ruiz ◽  
...  

1994 ◽  
Vol 04 (03) ◽  
pp. 351-364 ◽  
Author(s):  
Maher Rahmouni ◽  
Kevin O’Brien ◽  
Ahmed A. Jerraya

This paper presents Dynamic Loop Scheduling (DLS), a loop-based algorithm that can efficiently schedule large, control-flow dominated designs. It compares favourably with results produced for traditional path-based approaches and at the same time requires much less overhead to implement. The high performance of DLS is due mainly to the inclusion of loop feedback edges in the control-flow graph and the interruption of the path generation on the fly. The latter eliminates the generation of false paths, thereby avoiding the path explosion problem.


Author(s):  
Lena Oden ◽  
Holger Fröning

Due to their massive parallelism and high performance per Watt, GPUs have gained high popularity in high-performance computing and are a strong candidate for future exascale systems. But communication and data transfer in GPU-accelerated systems remain a challenging problem. Since the GPU normally is not able to control a network device, a hybrid-programming model is preferred whereby the GPU is used for calculation and the CPU handles the communication. As a result, communication between distributed GPUs suffers from unnecessary overhead, introduced by switching control flow from GPUs to CPUs and vice versa. Furthermore, a dedicated CPU thread is often required to control GPU-related communication. In this work, we modify the user space libraries and device drivers of GPUs and the InfiniBand network device so that the GPU can control an InfiniBand network device and independently source and sink communication requests without any involvement of the CPU. Our results show that complex networking protocols such as InfiniBand Verbs are better handled by CPUs, since the overhead of work request generation cannot be parallelized and is not suitable for the highly parallel programming model of GPUs. The massive number of instructions and accesses to host memory required to source and sink a communication request on the GPU degrades performance. Only through a massive reduction in the complexity of the InfiniBand protocol can some performance improvements be achieved.


2021 ◽  
Vol 14 (4) ◽  
pp. 1-32
Author(s):  
Sebastian Sabogal ◽  
Alan George ◽  
Gary Crum

Deep learning (DL) presents new opportunities for enabling spacecraft autonomy, onboard analysis, and intelligent applications for space missions. However, DL applications are computationally intensive and often infeasible to deploy on radiation-hardened (rad-hard) processors, which traditionally harness a fraction of the computational capability of their commercial-off-the-shelf counterparts. Commercial FPGAs and system-on-chips present numerous architectural advantages and provide the computation capabilities to enable onboard DL applications; however, these devices are highly susceptible to radiation-induced single-event effects (SEEs) that can degrade the dependability of DL applications. In this article, we propose Reconfigurable ConvNet (RECON), a reconfigurable acceleration framework for dependable, high-performance semantic segmentation for space applications. In RECON, we propose both selective and adaptive approaches to enable efficient SEE mitigation. In our selective approach, control-flow parts are selectively protected by triple-modular redundancy to minimize SEE-induced hangs, and in our adaptive approach, partial reconfiguration is used to adapt the mitigation of dataflow parts in response to a dynamic radiation environment. Combined, both approaches enable RECON to maximize system performability subject to mission availability constraints. We perform fault injection and neutron irradiation to observe the susceptibility of RECON and use dependability modeling to evaluate RECON in various orbital case studies to demonstrate a 1.5–3.0× performability improvement in both performance and energy efficiency compared to static approaches.


Author(s):  
Shikha Mehta ◽  
Parmeet Kaur

Workflows are a commonly used model to describe applications consisting of computational tasks with data or control flow dependencies. They are used in domains such as bioinformatics, astronomy, and physics for data-driven scientific applications. Executing data-intensive workflow applications in a reasonable amount of time demands a high-performance computing environment. Cloud computing is a way of purchasing computing resources on demand through virtualization technologies. It provides the infrastructure to build and run workflow applications, which is called ‘Infrastructure as a Service.’ However, it is necessary to schedule workflows on the cloud in a way that reduces the cost of leasing resources. Scheduling tasks on resources is an NP-hard problem, so meta-heuristic algorithms are a natural choice for it. This chapter presents the application of nature-inspired algorithms (particle swarm optimization, the shuffled frog leaping algorithm, and the grey wolf optimization algorithm) to the workflow scheduling problem on the cloud. Simulation results prove the efficacy of the suggested algorithms.


Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 406
Author(s):  
Daniel Granhão ◽  
João Canas Ferreira

Heterogeneous platforms with FPGAs have started to be employed in the High-Performance Computing (HPC) field to improve performance and overall efficiency. These platforms allow the use of specialized hardware to accelerate software applications, but require the software to be adapted in what can be a prolonged and complex process. The main goal of this work is to describe and evaluate mechanisms that can transparently transfer the control flow between CPU and FPGA within the scope of HPC. Combining such a mechanism with transparent software profiling and accelerator configuration could lead to an automatic way of accelerating regular applications. In this work, a mechanism based on the ptrace system call is proposed, and its performance on the Intel Xeon+FPGA platform is evaluated. The feasibility of the proposed approach is demonstrated by a working prototype that performs the transparent control flow transfer of any function call to a matching hardware accelerator. This approach is more general than shared library interposition at the cost of a small time overhead in each accelerator use (about 1.3 ms in the prototype implementation).


2020 ◽  
Vol 38 (3-4) ◽  
pp. 1-30
Author(s):  
Rakesh Kumar ◽  
Boris Grot

The front-end bottleneck is a well-established problem in server workloads owing to their deep software stacks and large instruction footprints. Despite years of research into effective L1-I and BTB prefetching, state-of-the-art techniques force a trade-off between metadata storage cost and performance. Temporal Stream prefetchers deliver high performance but require a prohibitive amount of metadata to accommodate the temporal history. Meanwhile, BTB-directed prefetchers incur low cost by using the existing in-core branch prediction structures but fall short on performance due to the BTB’s inability to capture the massive control flow working set of server applications. This work overcomes the fundamental limitation of BTB-directed prefetchers, which is capturing a large control flow working set within an affordable BTB storage budget. We re-envision the BTB organization to maximize its control flow coverage by observing that an application’s instruction footprint can be mapped as a combination of its unconditional branch working set and, for each unconditional branch, a spatial encoding of the cache blocks around the branch target. Effectively capturing a map of the application’s instruction footprint in the BTB enables highly effective BTB-directed prefetching that outperforms the state-of-the-art prefetchers by up to 10% for an equivalent storage budget.


1997 ◽  
Vol 6 (1) ◽  
pp. 73-94 ◽  
Author(s):  
Eduard Ayguadé ◽  
Jordi Garcia ◽  
Mercé Gironès ◽  
M. Luz Grande ◽  
Jesús Labarta

This article describes the main features and implementation of our automatic data distribution research tool. The tool (DDT) accepts programs written in Fortran 77 and generates High Performance Fortran (HPF) directives to map arrays onto the memories of the processors and parallelize loops, and executable statements to remap these arrays. DDT works by identifying a set of computational phases (procedures and loops). The algorithm builds a search space of candidate solutions for these phases which is explored looking for the combination that minimizes the overall cost; this cost includes data movement cost and computation cost. The movement cost reflects the cost of accessing remote data during the execution of a phase and the remapping costs that have to be paid in order to execute the phase with the selected mapping. The computation cost includes the cost of executing a phase in parallel according to the selected mapping and the owner computes rule. The tool supports interprocedural analysis and uses control flow information to identify how phases are sequenced during the execution of the application.
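
The search described above, which picks one mapping per phase so that computation cost plus remapping cost is minimal, can be illustrated with a small dynamic program. This is only a sketch under strong assumptions: phases run in one fixed linear order, every phase has the same candidate mappings, and the cost tables are made up. The name `best_mappings` and both table layouts are hypothetical. The abstract does not say how DDT explores its search space, so this should not be read as the tool's algorithm.

```python
def best_mappings(compute_cost, remap_cost):
    """Choose one mapping per phase to minimize total cost.

    compute_cost[p][m]: cost of running phase p under mapping m.
    remap_cost[a][b]:   cost of remapping arrays from mapping a to mapping b
                        between two consecutive phases (0 when a == b).
    Returns (total_cost, plan) where plan[p] is the mapping chosen for phase p.
    """
    n_maps = len(compute_cost[0])
    cost = list(compute_cost[0])  # best cost of ending phase 0 in each mapping
    back = []                     # back[p-1][m]: best predecessor mapping

    for p in range(1, len(compute_cost)):
        new_cost, choice = [], []
        for m in range(n_maps):
            prev = min(range(n_maps), key=lambda k: cost[k] + remap_cost[k][m])
            new_cost.append(cost[prev] + remap_cost[prev][m] + compute_cost[p][m])
            choice.append(prev)
        cost = new_cost
        back.append(choice)

    last = min(range(n_maps), key=lambda k: cost[k])
    total = cost[last]

    # Walk the predecessor table backwards to recover the chosen mappings.
    plan = [last]
    for choice in reversed(back):
        plan.append(choice[plan[-1]])
    plan.reverse()
    return total, plan


# Three phases, two candidate mappings (0 and 1), illustrative costs only.
compute = [[10, 30], [40, 10], [10, 30]]
remap = [[0, 5], [5, 0]]
print(best_mappings(compute, remap))  # (40, [0, 1, 0])
```

In this toy instance, remapping twice (total cost 40) beats keeping a single mapping throughout (60 or 70), which is the kind of trade-off between movement cost and computation cost that the abstract describes.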


2021 ◽  
Vol 2021 ◽  
pp. 1-19
Author(s):  
Yan Wang ◽  
Peng Jia ◽  
Cheng Huang ◽  
Jiayong Liu ◽  
Peisong He

Binary code similarity comparison is the technique that determines if two functions are similar by only considering their compiled form, which has many applications, including clone detection, malware classification, and vulnerability discovery. However, it is challenging to design a robust code similarity comparison engine, since different compilation settings make logically similar assembly functions appear to be very different. Moreover, existing approaches suffer from high performance overheads, lower robustness, or poor scalability. In this paper, a novel solution, HBinSim, is proposed that employs the multiview features of the function to address these challenges. It first extracts the syntactic and semantic features of each basic block by static analysis. HBinSim further analyzes the function and constructs a syntactic attribute control flow graph and a semantic attribute control flow graph for each function. Then, a hierarchical attention graph embedding network is designed for graph-structured data processing. The network model has a hierarchical structure that mirrors the hierarchical structure of the function. It has three levels of attention mechanisms applied at the instruction, basic block, and function level, enabling it to attend differentially to more and less critical content when constructing the function representation. We conduct extensive experiments to evaluate its effectiveness and efficiency. The results show that our tool outperforms the state-of-the-art binary code similarity comparison tools by a large margin in clone searching across diverse compilation settings. A real-world vulnerability search case further demonstrates the usefulness of our system.

