GraphAttack

2021 ◽  
Vol 18 (4) ◽  
pp. 1-26
Author(s):  
Aninda Manocha ◽  
Tyler Sorensen ◽  
Esin Tureci ◽  
Opeoluwa Matthews ◽  
Juan L. Aragón ◽  
...  

Graph structures are a natural representation of important and pervasive data. While graph applications have significant parallelism, their characteristic pointer indirect loads to neighbor data hinder scalability to large datasets on multicore systems. A scalable and efficient system must tolerate latency while leveraging data parallelism across millions of vertices. Modern Out-of-Order (OoO) cores inherently tolerate a fraction of long latencies, but become clogged when running severely memory-bound applications. Combined with large power/area footprints, this limits their parallel scaling potential and, consequently, the gains that existing software frameworks can achieve. Conversely, accelerator and memory hierarchy designs provide performant hardware specializations, but cannot support diverse application demands. To address these shortcomings, we present GraphAttack, a hardware-software data supply approach that accelerates graph applications on in-order multicore architectures. GraphAttack proposes compiler passes to (1) identify idiomatic long-latency loads and (2) slice programs along these loads into data Producer/Consumer threads to map onto pairs of parallel cores. Each pair shares a communication queue; the Producer asynchronously issues long-latency loads, whose results are buffered in the queue and used by the Consumer. This scheme drastically increases memory-level parallelism (MLP) to mitigate latency bottlenecks. In equal-area comparisons, GraphAttack outperforms OoO cores, do-all parallelism, prefetching, and prior decoupling approaches, achieving a 2.87× speedup and 8.61× gain in energy efficiency across a range of graph applications. These improvements scale; GraphAttack achieves a 3× speedup over 64 parallel cores. Lastly, it has pragmatic design principles; it enhances in-order architectures that are gaining increasing open-source support.
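As a minimal sketch of the Producer/Consumer decoupling idea described above (not GraphAttack's actual compiler-generated code, and with a hypothetical CSR-style graph made up for illustration), the Producer below performs the pointer-indirect neighbor loads and buffers results in a shared queue, while the Consumer computes on whatever data has already arrived:

```python
# Sketch of access/execute decoupling: the Producer issues the
# indirect "neighbor" loads and buffers them; the Consumer only
# consumes buffered values, so it never stalls on a load itself.
import threading
import queue

# Hypothetical graph in CSR-like form: offsets index into `neighbors`.
offsets = [0, 2, 5, 6]
neighbors = [1, 2, 0, 2, 3, 3]
vertex_data = [10, 20, 30, 40]

def producer(q):
    # Issue the long-latency indirect loads and buffer the results.
    for v in range(len(offsets) - 1):
        for n in neighbors[offsets[v]:offsets[v + 1]]:
            q.put(vertex_data[n])   # buffered load result
    q.put(None)                     # end-of-stream marker

def consumer(q, out):
    # Consume buffered values; here the "work" is a running sum.
    total = 0
    while (item := q.get()) is not None:
        total += item
    out.append(total)

q = queue.Queue(maxsize=64)   # the per-pair communication queue
result = []
t1 = threading.Thread(target=producer, args=(q,))
t2 = threading.Thread(target=consumer, args=(q, result))
t1.start(); t2.start(); t1.join(); t2.join()
print(result[0])  # sum of all loaded neighbor values -> 170
```

In hardware, the Producer can have many loads in flight at once, which is where the MLP gain comes from; the Python threads only model the division of labor.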

2019 ◽  
Vol 29 (2) ◽  
pp. 407-419
Author(s):  
Beata Bylina ◽  
Jarosław Bylina

Abstract The aim of this paper is to investigate dense linear algebra algorithms on shared memory multicore architectures. The design and implementation of a parallel tiled WZ factorization algorithm which can fully exploit such architectures are presented. Three parallel implementations of the algorithm are studied. The first relies only on multithreaded BLAS (basic linear algebra subprograms) operations. The second, in addition to BLAS operations, employs the OpenMP standard to exploit loop-level parallelism. The third, in addition to BLAS operations, employs the OpenMP task directive with the depend clause. We report the computational performance and speedup of the parallel tiled WZ factorization algorithm on shared memory multicore architectures for dense square diagonally dominant matrices. We then compare our parallel implementations with the respective LU factorization from a vendor-implemented LAPACK library, and we also analyze the numerical accuracy. Two of our implementations achieve near-maximal theoretical speedup, as implied by Amdahl's law.
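The third variant's idea, tasks that start as soon as the tiles they depend on are ready (as declared via OpenMP's depend clause), can be sketched in plain Python with futures. This is only an illustration of data-driven task scheduling; the tile kernel below is a placeholder, not the WZ update, and the DAG is invented for the example:

```python
# Sketch of data-driven task scheduling in the spirit of OpenMP's
# `task depend` clause: each tile update runs once the tiles it
# depends on are finished. The "kernel" is a stand-in, not the
# real WZ tile operation.
from concurrent.futures import ThreadPoolExecutor

def tile_kernel(name, *inputs):
    # Stand-in for a BLAS-level tile operation; records its inputs.
    return name, [i[0] for i in inputs]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {}
    # A tiny dependency DAG: diag -> (update1, update2) -> trailing
    futures["diag"] = pool.submit(tile_kernel, "diag")
    for t in ("update1", "update2"):
        d = futures["diag"]
        futures[t] = pool.submit(
            lambda t=t, d=d: tile_kernel(t, d.result()))
    u1, u2 = futures["update1"], futures["update2"]
    futures["trailing"] = pool.submit(
        lambda: tile_kernel("trailing", u1.result(), u2.result()))
    order = futures["trailing"].result()

print(order)  # ('trailing', ['update1', 'update2'])
```

The runtime (here the thread pool, in the paper OpenMP's task scheduler) is free to overlap independent tasks such as update1 and update2, which is what exposes parallelism beyond a single BLAS call.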


Author(s):  
Rupali Ahuja

The data generated today has outgrown the storage as well as the computing capabilities of traditional software frameworks. Large volumes of data, if aggregated and analyzed properly, may provide useful insights to predict human behavior, increase revenues, get or retain customers, improve operations, combat crime, cure diseases, and more. The results of effective Big Data analysis can thus be used to provide actionable intelligence for humans, as well as for machine consumption. New tools, techniques, technologies and methods are being developed to store, retrieve, manage, aggregate, correlate and analyze Big Data. Hadoop is a popular software framework for handling Big Data needs; it provides a distributed framework for the processing and storage of large datasets. This chapter discusses in detail the Hadoop framework, its features, applications and popular distributions, and its storage and visualization tools.
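Hadoop's core processing model is MapReduce. Purely as a toy illustration of that model (plain Python, not the Hadoop API), the following simulates the map, shuffle, and reduce phases of the classic word-count job:

```python
# Toy simulation of the MapReduce model that Hadoop implements
# (illustrative Python only; Hadoop jobs are written against its
# own Java/streaming APIs and run distributed over HDFS).
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # shuffle: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big insights", "big data tools"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["data"])  # 3 2
```

In a real Hadoop cluster each phase runs in parallel across machines, with the framework handling data placement, shuffling, and fault tolerance.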


Author(s):  
Máté Szabó

Machine learning has many challenges, and one of them is dealing with large datasets, because their size grows year after year. One solution to this problem is data parallelism. This paper investigates extending data parallelism to mobile devices, which have become the most popular platform. A special client-server architecture was created for this purpose. The software implementation measures the mobile devices' training capabilities and the efficiency of the whole system. The results show that distributed training on a mobile cluster is possible and safe, but its performance depends on the algorithm's implementation.
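The paper's specific client-server protocol and model are not given in the abstract, but the general shape of data-parallel training can be sketched as follows: each worker (here, a simulated mobile client) computes a gradient on its own shard of the data, and the server averages the gradients before updating the shared model. The model and data below are invented for illustration:

```python
# Generic sketch of synchronous data-parallel training: workers
# compute gradients on their data shards; the server averages them.
# One-parameter least-squares model y = w * x, illustrative only.

def shard_gradient(w, xs, ys):
    # Gradient of mean squared error for the model y = w * x.
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def training_round(w, shards, lr=0.05):
    # Each shard plays the role of one client device.
    grads = [shard_gradient(w, xs, ys) for xs, ys in shards]
    avg = sum(grads) / len(grads)   # server-side gradient averaging
    return w - lr * avg

# Two "devices", each holding a shard of points on the line y = 3x.
shards = [([1.0, 2.0], [3.0, 6.0]), ([3.0, 4.0], [9.0, 12.0])]
w = 0.0
for _ in range(200):
    w = training_round(w, shards)
print(round(w, 3))  # converges toward 3.0
```

On real devices the gradient computation runs on the client and only the gradients (or model deltas) cross the network, which is what makes the approach attractive for a mobile cluster.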


2012 ◽  
Vol 21 (02) ◽  
pp. 1240006 ◽  
Author(s):  
Ragavendra Natarajan ◽  
Vineeth Mekkat ◽  
Wei-Chung Hsu ◽  
Antonia Zhai

For today's increasingly power-constrained multicore systems, integrating simpler and more energy-efficient in-order cores becomes attractive. However, since in-order processors lack complex hardware support for tolerating long-latency memory accesses, developing compiler technologies to hide such latencies becomes critical. Compiler-directed prefetching has been demonstrated effective on some applications. On the application side, a large class of data-centric applications has emerged to explore the underlying properties of explosively growing data. These applications, in contrast to traditional benchmarks, are characterized by substantial thread-level parallelism, complex and unpredictable control flow, as well as intensive and irregular memory access patterns. These applications are expected to be the dominating workloads on future microprocessors. Thus, in this paper, we investigated the effectiveness of compiler-directed prefetching on data mining applications in in-order multicore systems. Our study reveals that although properly inserted prefetch instructions can often effectively reduce memory access latencies for data mining applications, the compiler is not always able to exploit this potential. Compiler-directed prefetching can become inefficient in the presence of complex control flow, irregular memory access patterns, and architecture-dependent behaviors. The integration of multithreaded execution onto a single die makes it even more difficult for the compiler to insert prefetch instructions, since optimizations that are effective for single-threaded execution may or may not be effective in multithreaded execution. Thus, compiler-directed prefetching must be judiciously deployed to avoid creating performance bottlenecks that otherwise do not exist. Our experience suggests that dynamic performance tuning techniques that adjust to the behaviors of a program can potentially facilitate the deployment of aggressive optimizations in data mining applications.


2010 ◽  
Vol 18 (1) ◽  
pp. 35-50 ◽  
Author(s):  
Hatem Ltaief ◽  
Jakub Kurzak ◽  
Jack Dongarra ◽  
Rosa M. Badia

The objective of this paper is to describe, in the context of multicore architectures, three different scheduler implementations for the two-sided linear algebra transformations, in particular the Hessenberg and Bidiagonal reductions, which are the first steps for the standard eigenvalue problems and the singular value decompositions respectively. State-of-the-art dense linear algebra software packages, such as the LAPACK and ScaLAPACK libraries, suffer performance losses on multicore processors due to their inability to fully exploit thread-level parallelism. At the same time the fine-grain dataflow model gains popularity as a paradigm for programming multicore architectures. Buttari et al. (Parallel Comput. Syst. Appl. 35 (2009), 38–53) introduced the concept of tile algorithms, in which parallelism is no longer hidden inside Basic Linear Algebra Subprograms but is brought to the fore to yield much better performance. Along with efficient scheduling mechanisms for data-driven execution, these tile two-sided reductions achieve high performance computing by reaching up to 75% of the DGEMM peak on a 12000×12000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile algorithms approach for two-sided transformations is that the full reduction cannot be obtained in one stage. Other methods have to be considered to further reduce the band matrices to the required forms.


1988 ◽  
Vol 135 (4) ◽  
pp. 299 ◽  
Author(s):  
K.L. Lo ◽  
M.M. Salem ◽  
R.D. McColl ◽  
A.M. Moffatt
