data parallelism Latest Research Papers

GraphAttack

Graph structures are a natural representation of important and pervasive data. While graph applications have significant parallelism, their characteristic pointer indirect loads to neighbor data hinder scalability to large datasets on multicore systems. A scalable and efficient system must tolerate latency while leveraging data parallelism across millions of vertices. Modern Out-of-Order (OoO) cores inherently tolerate a fraction of long latencies, but become clogged when running severely memory-bound applications. Combined with large power/area footprints, this limits their parallel scaling potential and, consequently, the gains that existing software frameworks can achieve. Conversely, accelerator and memory hierarchy designs provide performant hardware specializations, but cannot support diverse application demands. To address these shortcomings, we present GraphAttack, a hardware-software data supply approach that accelerates graph applications on in-order multicore architectures. GraphAttack proposes compiler passes to (1) identify idiomatic long-latency loads and (2) slice programs along these loads into data Producer/Consumer threads to map onto pairs of parallel cores. Each pair shares a communication queue; the Producer asynchronously issues long-latency loads, whose results are buffered in the queue and used by the Consumer. This scheme drastically increases memory-level parallelism (MLP) to mitigate latency bottlenecks. In equal-area comparisons, GraphAttack outperforms OoO cores, do-all parallelism, prefetching, and prior decoupling approaches, achieving a 2.87× speedup and 8.61× gain in energy efficiency across a range of graph applications. These improvements scale; GraphAttack achieves a 3× speedup over 64 parallel cores. Lastly, it has pragmatic design principles; it enhances in-order architectures that are gaining increasing open-source support.
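
The Producer/Consumer slicing described in this abstract can be illustrated with a small software analogy. The sketch below is not GraphAttack's compiler output or hardware queue; it only shows the shape of the split, assuming a hypothetical neighbor-sum kernel. The names `producer`, `consumer`, `decoupled_sum`, and the `depth` parameter are invented for illustration, and Python threads do not provide real memory-level parallelism.

```python
import queue
import threading


def producer(neighbors, values, q):
    """Issue the pointer-indirect loads and buffer their results in the queue."""
    for n in neighbors:
        q.put(values[n])  # stands in for the long-latency load values[n]
    q.put(None)           # sentinel: no more data will arrive


def consumer(q, out):
    """Use the buffered values; this slice never performs the indirect load."""
    total = 0
    while (item := q.get()) is not None:
        total += item
    out.append(total)


def decoupled_sum(neighbors, values, depth=8):
    """Run the two slices as a pair sharing one bounded queue (hypothetical depth)."""
    q = queue.Queue(maxsize=depth)
    out = []
    p = threading.Thread(target=producer, args=(neighbors, values, q))
    c = threading.Thread(target=consumer, args=(q, out))
    p.start()
    c.start()
    p.join()
    c.join()
    return out[0]
```

For example, `decoupled_sum([2, 0, 1], [10, 20, 30])` sums the values at indices 2, 0, and 1. The bounded queue is what lets the producer run ahead of the consumer by at most `depth` loads.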

Pointer-Based Divergence Analysis for OpenCL 2.0 Programs

ACM Transactions on Parallel Computing ◽ 10.1145/3470644 ◽ 2021 ◽ Vol 8 (4) ◽ pp. 1-23

Author(s): Shao-Chung Wang ◽ Lin-Ya Yu ◽ Li-An Her ◽ Yuan-Shin Hwang ◽ Jenq-Kuen Lee

Keyword(s): Special Functions ◽ Previous Analysis ◽ Data Parallelism ◽ Fixed Size ◽ Static Single Assignment ◽ Analysis Scheme ◽ And Performance ◽ Relation Graph ◽ High Degree ◽ GPU Architecture

A modern GPU is designed with many large thread groups to achieve high throughput and performance. Within these groups, the threads are grouped into fixed-size SIMD batches in which the same instruction is applied to vectors of data in lockstep. This GPU architecture is suitable for applications with a high degree of data parallelism, but its performance degrades seriously when divergence occurs. Many optimizations for divergence have been proposed, and they vary with the divergence information about variables and branches. A previous analysis scheme treated pointers and return values from functions as directly divergent, and only focused on OpenCL 1.x. In this article, we present a novel scheme that reports the divergence information for pointer-intensive OpenCL programs. The approach is based on extended static single assignment (SSA) and adds some special functions and annotations from memory SSA and gated SSA. The proposed scheme first constructs extended SSA, which is then used to build a divergence relation graph that includes all of the possible points-to relationships of the pointers and initialized divergence states. The divergence state of the pointers can be determined by propagating the divergence state of the divergence relation graph. The scheme is further extended for interprocedural cases by considering function-related statements. The proposed scheme was implemented in an LLVM compiler and can be applied to OpenCL programs. We analyzed 10 programs with 24 kernels, with a total analyzed program size of 1,306 instructions in an LLVM intermediate representation, with 885 variables, 108 branches, and 313 pointer-related statements. The total number of divergent pointers detected was 146 for the proposed scheme, 200 for the scheme in which the pointer was always divergent, and 155 for the current LLVM default scheme; the total numbers of divergent variables detected were 458, 519, and 482, respectively, with 31, 34, and 32 divergent branches. These experimental results indicate that the proposed scheme is more precise than both a scheme in which a pointer is always divergent and the current LLVM default scheme.
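
The idea of propagating divergence states through a relation graph can be sketched with a generic worklist pass. This is not the authors' extended-SSA scheme or LLVM's analysis; it is a minimal illustration that assumes a hypothetical `uses` map from each variable to the variables computed from it, and a set of seed variables (such as a thread ID) known to be divergent.

```python
from collections import deque


def propagate_divergence(uses, seeds):
    """Return every variable reachable from a divergent seed.

    uses:  dict mapping a variable name to the names computed from it
           (a hypothetical, simplified dependence graph).
    seeds: iterable of variables that are divergent by definition.
    """
    divergent = set(seeds)
    work = deque(divergent)
    while work:
        v = work.popleft()
        for u in uses.get(v, ()):
            if u not in divergent:  # mark once, then revisit its users
                divergent.add(u)
                work.append(u)
    return divergent
```

With `uses = {"tid": ["idx"], "idx": ["ptr"], "base": ["limit"]}` and seed `"tid"`, the pointer `ptr` is marked divergent while `base` and `limit` stay uniform. That distinction is the kind of precision lost by a scheme that treats every pointer as divergent.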

A Collaborative CPU Vector Offloader: Putting Idle Vector Resources to Work on Commodity Processors

Electronics ◽ 10.3390/electronics10232960 ◽ 2021 ◽ Vol 10 (23) ◽ pp. 2960

Author(s): Youngbin Son ◽ Seokwon Kang ◽ Hongjun Um ◽ Seokho Lee ◽ Jonghyun Ham ◽ ...

Keyword(s): Code Generation ◽ Optimization Technique ◽ Data Parallelism ◽ Processing Unit ◽ Fast Computation ◽ Performance Improvements ◽ Central Processing ◽ Field Programmable ◽ Operation Unit ◽ Better Than

Most modern processors contain a vector accelerator or internal vector units for the fast computation of large target workloads. However, accelerating applications using vector units is difficult because the underlying data parallelism should be uncovered explicitly using vector-specific instructions. Therefore, vector units are often underutilized or remain idle because of the challenges faced in vector code generation. To solve this underutilization problem of existing vector units, we propose the Vector Offloader for executing scalar programs, which considers the vector unit as a scalar operation unit. By using vector masking, an appropriate partition of the vector unit can be utilized to support scalar instructions. To efficiently utilize all execution units, including the vector unit, the Vector Offloader suggests running the target applications concurrently in both the central processing unit (CPU) and the decoupled vector units, by offloading some parts of the program to the vector unit. Furthermore, a profile-guided optimization technique is employed to determine the optimal offloading ratio for balancing the load between the CPU and the vector unit. We implemented the Vector Offloader on a RISC-V infrastructure with a Hwacha vector unit, and evaluated its performance using a Polybench benchmark set. Experimental results showed that the proposed technique achieved performance improvements up to 1.31× better than the simple, CPU-only execution on a field programmable gate array (FPGA)-level evaluation.
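
The abstract does not say how the profile-guided step computes its offloading ratio. As a rough illustration of load balancing only, the sketch below assumes a deliberately simple model in which profiling yields a constant per-item time for the CPU (`t_cpu`) and for the vector unit (`t_vec`), and both run concurrently. The function names and the model itself are assumptions, not the Vector Offloader's method.

```python
def balanced_offload_ratio(t_cpu, t_vec):
    """Fraction of work to offload so both units finish together.

    Solves ratio * t_vec == (1 - ratio) * t_cpu under the assumed
    constant per-item cost model.
    """
    if t_cpu <= 0 or t_vec <= 0:
        raise ValueError("per-item times must be positive")
    return t_cpu / (t_cpu + t_vec)


def makespan(n_items, ratio, t_cpu, t_vec):
    """Completion time when the two units process their shares concurrently."""
    offloaded = n_items * ratio
    return max(offloaded * t_vec, (n_items - offloaded) * t_cpu)
```

Under this model, if the vector unit used as a scalar unit is three times slower per item than the CPU, a quarter of the work is offloaded, and the CPU's share shrinks from all items to three quarters of them.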

High-Level Stream and Data Parallelism in C++ for Multi-Cores

10.1145/3475061.3475078 ◽ 2021

Author(s): Junior Loff ◽ Renato B. Hoffman ◽ Dalvan Griebler ◽ Luiz G. Fernandes

Keyword(s): Data Parallelism ◽ High Level

Fiber Clustering Acceleration With a Modified Kmeans++ Algorithm Using Data Parallelism

Frontiers in Neuroinformatics ◽ 10.3389/fninf.2021.727859 ◽ 2021 ◽ Vol 15

Author(s): Isaac Goicovich ◽ Paulo Olivares ◽ Claudio Román ◽ Andrea Vázquez ◽ Cyril Poupon ◽ ...

Keyword(s): White Matter ◽ Brain Research ◽ Data Parallelism ◽ Processing Unit ◽ Brain White Matter ◽ Fiber Clustering ◽ New Variant ◽ Clustering Quality ◽ Kmeans Algorithm ◽ The Impact

Fiber clustering methods are typically used in brain research to study the organization of white matter bundles from large diffusion MRI tractography datasets. These methods enable exploratory bundle inspection using visualization and other methods that require identifying brain white matter structures in individuals or a population. Some applications, such as real-time visualization and inter-subject clustering, need fast and high-quality intra-subject clustering algorithms. This work proposes a parallel algorithm using a General Purpose Graphics Processing Unit (GPGPU) for fiber clustering based on the FFClust algorithm. The proposed GPGPU implementation exploits data parallelism using both multicore and GPU fine-grained parallelism present in commodity architectures, including current laptops and desktop computers. Our approach implements all FFClust steps in parallel, improving execution times in all of them. In addition, our parallel approach includes a parallel Kmeans++ algorithm implementation and defines a new variant of Kmeans++ to reduce the impact of choosing outliers as initial centroids. The results show that our approach provides clustering quality results very similar to FFClust, and it requires an execution time of 3.5 s for processing about a million fibers, achieving a speedup of 11.5 times compared to FFClust.
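
For readers unfamiliar with Kmeans++, the sketch below shows the standard seeding step that it adds to k-means: each new centroid is drawn with probability proportional to the squared distance to the nearest centroid already chosen. It is the textbook procedure, not the modified variant or the GPU implementation proposed in the paper; the function names and the `seed` parameter are illustrative. The per-point distance computation is independent across points, which is the part a data-parallel implementation would spread across threads.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids with standard k-means++ seeding.

    Assumes `points` holds at least k distinct points; otherwise all
    weights can become zero and random.choices raises ValueError.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Independent per point: squared distance to the nearest centroid.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

Because far-away points receive large weights, an outlier is a likely pick under this standard rule, which is the sensitivity the paper's variant aims to reduce.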

An order-aware dataflow model for parallel Unix pipelines

Proceedings of the ACM on Programming Languages ◽ 10.1145/3473570 ◽ 2021 ◽ Vol 5 (ICFP) ◽ pp. 1-28

Author(s): Shivam Handa ◽ Konstantinos Kallas ◽ Nikos Vasilakis ◽ Martin C. Rinard

Keyword(s): Data Parallelism ◽ Dataflow Graph ◽ Shell Script ◽ Dataflow Model

We present a dataflow model for modelling parallel Unix shell pipelines. To accurately capture the semantics of complex Unix pipelines, the dataflow model is order-aware, i.e., the order in which a node in the dataflow graph consumes inputs from different edges plays a central role in the semantics of the computation and therefore in the resulting parallelization. We use this model to capture the semantics of transformations that exploit data parallelism available in Unix shell computations and prove their correctness. We additionally formalize the translations from the Unix shell to the dataflow model and from the dataflow model back to a parallel shell script. We implement our model and transformations as the compiler and optimization passes of a system parallelizing shell pipelines, and use it to evaluate the speedup achieved on 47 pipelines.
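
Why input order matters for such transformations can be seen in a small sketch. It is not the paper's model or system; it assumes a hypothetical stateless, line-wise stage (a grep-like filter) and shows one order-preserving data-parallel rewrite: split the input into contiguous chunks, run the stage on each chunk in parallel, and concatenate the outputs in the original chunk order.

```python
from concurrent.futures import ThreadPoolExecutor


def grep_stage(lines, needle):
    """Sequential stage: keep the lines containing `needle`, in input order."""
    return [ln for ln in lines if needle in ln]


def split(lines, parts):
    """Cut the input into at most `parts` contiguous chunks."""
    size = max(1, -(-len(lines) // parts))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parallel_grep(lines, needle, parts=4):
    """Data-parallel version: map over chunks, then merge in chunk order."""
    chunks = split(lines, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # Executor.map yields results in the order of its inputs.
        results = pool.map(lambda c: grep_stage(c, needle), chunks)
        return [ln for chunk in results for ln in chunk]
```

The rewrite is correct only because the merge consumes the chunk outputs in order; merging them as they complete could reorder lines and change the pipeline's result.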

Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining ◽ 10.1145/3447548.3467080 ◽ 2021

Author(s): Vipul Gupta ◽ Dhruv Choudhary ◽ Peter Tang ◽ Xiaohan Wei ◽ Xing Wang ◽ ...

Keyword(s): Recommender Systems ◽ Data Parallelism

On Data Parallelism Code Restructuring for HLS Targeting FPGAs

2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) ◽ 10.1109/ipdpsw52791.2021.00029 ◽ 2021

Author(s): Renato Campos ◽ Joao M.P. Cardoso

Keyword(s): Data Parallelism

Code Generation from Simulink Models with Task and Data Parallelism

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽ 10.24297/ijct.v21i.9004 ◽ 2021 ◽ Vol 21 ◽ pp. 1-13

Author(s): Pin Xu ◽ Masato Edahiro ◽ Kondo Masaki

Keyword(s): Hierarchical Clustering ◽ Code Generation ◽ Heterogeneous Computing ◽ Parallel Programs ◽ Data Parallelism ◽ Task Parallelism ◽ Clustering Method ◽ Computing Environment ◽ Sequential Programs ◽ Data Parallel

In this paper, we propose a method to automatically generate parallelized code from Simulink models, while exploiting both task and data parallelism. Building on previous research, we propose a model-based parallelizer (MBP) that exploits task parallelism and assigns tasks to CPU cores using a hierarchical clustering method. We also propose a method in which data-parallel SYCL code is generated from Simulink models; computations with data parallelism are expressed in the form of S-Function Builder blocks and are executed in a heterogeneous computing environment. Most parts of the procedure can be automated with scripts, and the two methods can be applied together. In the evaluation, the data-parallel programs generated using our proposed method achieved a maximum speedup of approximately 547 times, compared to sequential programs, without observable differences in the computed results. In addition, the programs generated while exploiting both task and data parallelism were confirmed to have achieved better performance than those exploiting either one of the two.

Location based Continuous Query Processing over Geo-streaming Data

Turkish Journal of Computer and Mathematics Education (TURCOMAT) ◽ 10.17762/turcomat.v12i1s.1583 ◽ 2021 ◽ Vol 12 (1S) ◽ pp. 106-114

Author(s): K. V. Metre

Keyword(s): Query Processing ◽ Database Systems ◽ Streaming Data ◽ Continuous Query ◽ Data Parallelism ◽ Stream Data ◽ Process Stream ◽ Data Intensive ◽ Index Maintenance ◽ Continuous Query Processing

In recent years, many data-intensive and location-based applications have emerged that need to process stream data, in domains such as network monitoring, telecommunications data management, and sensor networks. Unlike a regular query, a continuous query exists for a certain period of time and needs to be processed continuously during this time. The data-processing algorithms used in traditional database systems are not suited to complex and varied continuous queries over dynamic streaming data. Indexing finite queries is preferred to indexing infinite data because it avoids expensive index maintenance. Previous related work focused on moving queries over static objects or static queries over moving objects, but nowadays both queries and objects are dynamic. Hybrid indexing of queries therefore significantly reduces space costs and scales well with increasing data. To keep up with the speed of unbounded data, it is necessary to use data parallelism in query processing, which offers better performance, availability, and scalability.
