Iteratively solving sparse linear system based on PaRSEC task scheduling

Author(s):  
Tieqiang Mo ◽  
Renfa Li

With new architectures and new programming paradigms such as task-based scheduling emerging in parallel high-performance computing, it is important to use these features to tune monolithic computing codes. In this article, the classical conjugate gradient algorithms for solving the sparse linear system Ax = b in Krylov subspace are pipelined to execute as interdependent tasks on the Parallel Runtime Scheduling and Execution Controller (PaRSEC) runtime. First, the sparse matrix A is split by rows to expose more coarse-grained parallelism. Second, to eliminate communication overhead, the partitioned sub-vectors are not assembled into one full vector in RAM before running sparse matrix–vector product (SpMV) operations. Moreover, in the SpMV computation, if all elements of one column in a split sub-matrix are zero, the corresponding product operations can be removed by reorganizing the sub-vectors. Finally, the latency of migrating sub-vectors is partially overlapped with the SpMV operations by further splitting the sparse matrix by columns on GPUs. In experiments, a series of tests demonstrates that optimal speedup and higher pipelining efficiency have been achieved for the pipelined task scheduling on the PaRSEC runtime. Fusing SpMV concurrency and dot product pipelining can achieve higher speedup and efficiency.
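As a rough illustration of the row-splitting idea in this abstract, the following is a minimal sketch of conjugate gradient in which the matrix is cut into row blocks and every vector stays partitioned into sub-vectors. It is not the authors' implementation and does not use PaRSEC or GPUs. The names `spmv_block`, `needed_columns`, and `cg`, the list-of-tuples sparse row format, and the contiguous partitioning are all assumptions made for the example. Each call to `spmv_block` stands in for one independent task.

```python
from bisect import bisect_right

def spmv_block(block, parts, bounds):
    """Multiply one row block by a vector that is kept as sub-vectors.

    block  : list of sparse rows, each a list of (global_column, value)
    parts  : the vector, stored as a list of sub-vectors (never concatenated)
    bounds : partition boundaries, bounds[p] is the first index of part p
    """
    out = []
    for row in block:
        acc = 0.0
        for col, val in row:
            p = bisect_right(bounds, col) - 1      # sub-vector holding this column
            acc += val * parts[p][col - bounds[p]]
        out.append(acc)
    return out

def needed_columns(block):
    """Columns with at least one nonzero in the block; all-zero columns are skipped."""
    return sorted({col for row in block for col, _ in row})

def dot(u, v):
    """Dot product of two partitioned vectors."""
    return sum(a * b for xs, ys in zip(u, v) for a, b in zip(xs, ys))

def cg(rows, b, nparts=2, tol=1e-10, max_iter=1000):
    """Conjugate gradient for a symmetric positive definite matrix given as sparse rows."""
    n = len(b)
    bounds = [n * p // nparts for p in range(nparts + 1)]
    blocks = [rows[bounds[p]:bounds[p + 1]] for p in range(nparts)]
    x = [[0.0] * (bounds[p + 1] - bounds[p]) for p in range(nparts)]
    r = [list(b[bounds[p]:bounds[p + 1]]) for p in range(nparts)]
    d = [list(part) for part in r]
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        # One SpMV per row block; these calls do not depend on each other.
        q = [spmv_block(blk, d, bounds) for blk in blocks]
        alpha = rr / dot(d, q)
        for p in range(nparts):
            for i in range(len(x[p])):
                x[p][i] += alpha * d[p][i]
                r[p][i] -= alpha * q[p][i]
        rr_new = dot(r, r)
        beta = rr_new / rr
        for p in range(nparts):
            for i in range(len(d[p])):
                d[p][i] = r[p][i] + beta * d[p][i]
        rr = rr_new
    return [v for part in x for v in part]

# Example: A = [[4, 1], [1, 3]], b = [1, 2]
rows = [[(0, 4.0), (1, 1.0)], [(0, 1.0), (1, 3.0)]]
solution = cg(rows, [1.0, 2.0])
```

In this sketch the two dot products per iteration are the synchronization points, which is where the pipelining described in the abstract would apply. `needed_columns` only reports which sub-vector entries a block actually reads. Reorganizing the sub-vectors around that set is left out.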

Author(s):  
Aydın Buluç ◽  
John R Gilbert

This paper presents a scalable high-performance software library to be used for graph analysis and data mining. Large combinatorial graphs appear in many applications of high-performance computing, including computational biology, informatics, analytics, web search, dynamical systems, and sparse matrix methods. Graph computations are difficult to parallelize using traditional approaches due to their irregular nature and low operational intensity. Many graph computations, however, contain sufficient coarse-grained parallelism for thousands of processors, which can be uncovered by using the right primitives. We describe the parallel Combinatorial BLAS, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications. We provide an extensible library interface and some guiding principles for future development. The library is evaluated using two important graph algorithms, in terms of both performance and ease-of-use. The scalability and raw performance of the example applications, using the Combinatorial BLAS, are unprecedented on distributed memory clusters.
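To make the idea of linear algebra primitives for graph algorithms concrete, here is a minimal sketch of breadth-first search written as repeated sparse matrix–vector products over a Boolean (OR, AND) semiring. It does not use the Combinatorial BLAS interface. The function names `spmv_bool` and `bfs_levels` and the adjacency-list representation are assumptions chosen for the example.

```python
from typing import Dict, Iterable, List, Set

def spmv_bool(adj: Dict[int, List[int]], frontier: Iterable[int]) -> Set[int]:
    """One product y = A^T x over the (OR, AND) semiring.

    adj[u] lists the targets of edges leaving u, and the frontier is the
    set of indices where the sparse Boolean vector x is true.
    """
    reached: Set[int] = set()
    for u in frontier:
        reached.update(adj.get(u, ()))
    return reached

def bfs_levels(adj: Dict[int, List[int]], source: int) -> Dict[int, int]:
    """Return the BFS depth of every vertex reachable from source."""
    levels = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        # Mask out vertices that already have a level (the visited set).
        frontier = spmv_bool(adj, frontier).difference(levels)
        for v in frontier:
            levels[v] = depth
    return levels

graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
levels = bfs_levels(graph, 0)
```

Expressing the traversal this way moves the irregular work into a single primitive, the sparse product, which is the kind of operation a library can then parallelize.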


Author(s):  
Shanshan Yu ◽  
Jicheng Zhang ◽  
Ju Liu ◽  
Xiaoqing Zhang ◽  
Yafeng Li ◽  
...  

To solve the problem of distributed denial of service (DDoS) attack detection in software-defined networks, we propose a cooperative DDoS attack detection scheme based on entropy and ensemble learning. This method sets up a coarse-grained preliminary detection module based on entropy in the edge switch to monitor the network status in real time and report to the controller if any abnormality is found. In parallel, a fine-grained precise attack detection module is designed in the controller, and an ensemble learning-based algorithm is used to further identify abnormal traffic accurately. In this framework, the idle computing capability of edge switches is fully used, following the design idea of edge computing, to offload part of the detection task from the control plane to the data plane. Simulation results for two common DDoS attack methods, ICMP and SYN, show that the system can effectively detect DDoS attacks and greatly reduce the southbound communication overhead, the burden on the controller, and the detection delay of the attacks.
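The coarse-grained entropy check can be sketched as follows. The abstract does not say which traffic feature or threshold the authors use, so this example assumes Shannon entropy over the destination addresses seen in one time window and a caller-supplied threshold. The function names `shannon_entropy` and `is_suspicious` are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable

def shannon_entropy(items: Iterable[Hashable]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_suspicious(dst_addresses: Iterable[Hashable], threshold: float) -> bool:
    """Flag a window whose destination entropy falls below the threshold.

    Assumption: traffic concentrating on a few victims lowers the entropy
    of destination addresses, so a low value is treated as abnormal.
    """
    return shannon_entropy(dst_addresses) < threshold

normal_window = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
attack_window = ["10.0.0.1"] * 4
```

A switch-side check like this only decides whether to report a window. In the scheme described above, the finer classification is left to the ensemble model in the controller.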


2021 ◽  
Vol 20 (5s) ◽  
pp. 1-25
Author(s):  
Michael Canesche ◽  
Westerley Carvalho ◽  
Lucas Reis ◽  
Matheus Oliveira ◽  
Salles Magalhães ◽  
...  

Coarse-grained reconfigurable architecture (CGRA) mapping involves three main steps: placement, routing, and timing. The mapping is an NP-complete problem, and a common strategy is to decouple this process into its independent steps. This work focuses on the placement step, and its aim is to propose a technique that is both reasonably fast and leads to high-performance solutions. Furthermore, a near-optimal placement simplifies the following routing and timing steps. Exact solutions cannot find placements in a reasonable execution time as input designs increase in size. Heuristic solutions include meta-heuristics, such as Simulated Annealing (SA), and fast and straightforward greedy heuristics based on graph traversal. However, as these approaches are probabilistic and have a large design space, it is not easy to provide both run-time efficiency and good solution quality. We propose a graph traversal heuristic that provides the best of both: high-quality placements similar to SA and the execution time of graph traversal approaches. Our placement introduces novel ideas based on a “you only traverse twice” (YOTT) approach that performs a two-step graph traversal. The first traversal generates annotated data to guide the second step, which greedily performs the placement, node by node, aided by the annotated data and target architecture constraints. We introduce three new concepts to implement this technique: I/O and reconvergence annotation, degree matching, and look-ahead placement. Our analysis of this approach explores the placement execution time/quality trade-offs. We point out insights on how to analyze graph properties during dataflow mapping. Our results show that YOTT is 60.6×, 9.7×, and 2.3× faster than a high-quality SA, bounding box SA VPR, and multi-single traversal placements, respectively. Furthermore, YOTT reduces the average wire length and the maximal FIFO size (additional timing requirement on CGRAs) to avoid delay mismatches in fully pipelined architectures.
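The general shape of a two-pass, traversal-based greedy placement can be sketched as below. This is not YOTT: the annotations named in the abstract (I/O and reconvergence annotation, degree matching, look-ahead placement) are not modeled. The sketch assumes a plain breadth-first visit order as the only output of the first pass, a rectangular grid of cells, and Manhattan distance as the cost. All function names are hypothetical.

```python
from collections import deque

def traversal_order(edges, start):
    """Pass 1: breadth-first visit order over the undirected view of the graph."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, []).append(v)
        nbrs.setdefault(v, []).append(u)
    order, seen, queue = [], {start}, deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in nbrs.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order, nbrs

def greedy_place(edges, start, width, height):
    """Pass 2: put each node on the free cell closest to its placed neighbours.

    Only nodes reachable from start are placed, and min() raises ValueError
    if the grid has fewer cells than there are nodes to place.
    """
    order, nbrs = traversal_order(edges, start)
    free = [(x, y) for y in range(height) for x in range(width)]
    pos = {}
    for node in order:
        placed = [pos[n] for n in nbrs.get(node, []) if n in pos]
        best = min(
            free,
            key=lambda c: sum(abs(c[0] - px) + abs(c[1] - py) for px, py in placed),
        )
        free.remove(best)
        pos[node] = best
    return pos

def wire_length(edges, pos):
    """Total Manhattan length of all edges under a placement."""
    return sum(
        abs(pos[u][0] - pos[v][0]) + abs(pos[u][1] - pos[v][1]) for u, v in edges
    )

chain = [(0, 1), (1, 2), (2, 3)]
placement = greedy_place(chain, 0, 2, 2)
```

The trade-off discussed in the abstract shows up even here: each node is placed once with no backtracking, so the run time is low, and the quality depends entirely on how much guidance the first pass provides.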

