Gretch

Data-dependent memory accesses (DDAs) pose an important challenge for high-performance graph analytics (GA). This is because such memory accesses do not exhibit enough temporal and spatial locality, resulting in low cache performance. Prior efforts that focused on improving the performance of DDAs for GA are not applicable across various GA frameworks. This is because (1) they only focus on one particular graph representation, and (2) they require workload changes to communicate specific information to the hardware for their effective operation. In this work, we propose a hardware-only solution to improving the performance of DDAs for GA across multiple GA frameworks. We present a hardware prefetcher for GA, called Gretch, that addresses the above limitations. An important observation we make is that identifying certain DDAs without hardware-software communication is sensitive to the instruction scheduling. A key contribution of this work is a hardware mechanism that activates Gretch to identify DDAs when using either in-order or out-of-order instruction scheduling. Our evaluation shows that Gretch provides an average speedup of 38% over no prefetching and 25% over a conventional stride prefetcher, and outperforms prior DDA prefetchers by 22%, with only a 1% increase in power consumption, when executed on different GA workloads and frameworks.
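To make the term concrete, the following minimal sketch shows where a data-dependent access arises when traversing a graph. The compressed sparse row (CSR) layout, the function name `sum_neighbor_values`, and the sample graph are illustrative assumptions; the abstract does not name a specific representation, and this is not a model of Gretch itself.

```python
def sum_neighbor_values(offsets, neighbors, values, vertex):
    """Sum values[] over the out-neighbors of `vertex` in a CSR graph."""
    total = 0
    # Reads of neighbors[] walk consecutive indexes, so they are
    # sequential and friendly to a conventional stride prefetcher.
    for edge in range(offsets[vertex], offsets[vertex + 1]):
        target = neighbors[edge]
        # values[target] is the data-dependent access: its address is
        # known only after neighbors[edge] has been loaded, and
        # consecutive targets may be far apart in memory.
        total += values[target]
    return total


# Hypothetical 4-vertex graph: 0 -> {1, 3}, 1 -> {2}, 2 -> {0, 3}, 3 -> {}
offsets = [0, 2, 3, 5, 5]
neighbors = [1, 3, 2, 0, 3]
values = [10, 20, 30, 40]
```

The indirection `values[neighbors[edge]]` is the pattern that defeats locality-based caching: the second load cannot be issued until the first one returns.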


Automatic Sublining for Efficient Sparse Memory Accesses

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3452141 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1-23

Author(s):

Wim Heirman ◽

Stijn Eyerman ◽

Kristof Du Bois ◽

Ibrahim Hur

Keyword(s):

Dynamic Environment ◽

Large Data ◽

Main Memory ◽

Single Element ◽

Graph Analytics ◽

Available Bandwidth ◽

Processor Architectures ◽

Spatial Locality ◽

Potential Impact ◽

Memory Accesses

Sparse memory accesses, which are scattered accesses to single elements of a large data structure, are a challenge for current processor architectures. Their lack of spatial and temporal locality and their irregularity make caches and traditional stream prefetchers useless. Furthermore, performing standard caching and prefetching on sparse accesses wastes precious memory bandwidth and thrashes caches, deteriorating performance for regular accesses. Bypassing prefetchers and caches for sparse accesses, and fetching only a single element (e.g., 8 B) from main memory (subline access), can solve these issues. Deciding which accesses to handle as sparse accesses and which as regular cached accesses is a challenging task with a large potential impact on performance. Not only is performance reduced by treating sparse accesses as regular accesses; not caching accesses that do have locality also hurts performance by significantly increasing their latency and bandwidth consumption. Furthermore, this decision depends on the dynamic environment, such as input set characteristics and system load, making a static decision by the programmer or compiler suboptimal. We propose the Instruction Spatial Locality Estimator (ISLE), a hardware detector that finds instructions that access isolated words in a sea of unused data. These sparse accesses are dynamically converted into uncached subline accesses, while regular accesses remain cached. ISLE does not require modifying source code or binaries, and adapts automatically to a changing environment (input data, available bandwidth, etc.). We apply ISLE to a graph analytics processor running sparse graph workloads, and show that ISLE outperforms no subline accesses, manual sublining, and prior work on detecting sparse accesses.
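The idea of classifying instructions by how much of each cache line they use can be sketched in software. This is a toy model under stated assumptions, not the ISLE hardware design: the class name `LocalityEstimator`, the 64 B line size, the unbounded bookkeeping, and the `sparse_threshold` value are all hypothetical, chosen only to illustrate the "isolated words in a sea of unused data" criterion.

```python
from collections import defaultdict

LINE_BYTES = 64  # assumed cache line size
WORD_BYTES = 8   # element size; the abstract's example is 8 B


class LocalityEstimator:
    """Toy per-instruction estimate of words used per touched line."""

    def __init__(self, sparse_threshold=1.5):
        # pc -> {line number -> set of word indexes touched in that line}
        self.touched = defaultdict(lambda: defaultdict(set))
        # Hypothetical cut-off: below this many words per line, the
        # instruction is treated as performing sparse accesses.
        self.sparse_threshold = sparse_threshold

    def record(self, pc, address):
        """Note that instruction `pc` accessed byte `address`."""
        line = address // LINE_BYTES
        word = (address % LINE_BYTES) // WORD_BYTES
        self.touched[pc][line].add(word)

    def words_per_line(self, pc):
        """Average number of distinct words used per line by `pc`."""
        lines = self.touched.get(pc)
        if not lines:
            return 0.0
        return sum(len(words) for words in lines.values()) / len(lines)

    def is_sparse(self, pc):
        """True if `pc` has history and uses few words per line."""
        if not self.touched.get(pc):
            return False  # no history: keep the access cached
        return self.words_per_line(pc) < self.sparse_threshold
```

An instruction that streams through an array uses all eight words of every line it touches and stays cached, while one that touches a single word in each of many distinct lines is flagged as a candidate for subline access.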


Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

ACM Transactions on Mathematical Software ◽

10.1145/3441850 ◽

2021 ◽

Vol 47 (2) ◽

pp. 1-28

Author(s):

Goran Flegar ◽

Hartwig Anzt ◽

Terry Cojean ◽

Enrique S. Quintana-Ortí

Keyword(s):

Linear Algebra ◽

Graphics Processing Units ◽

High Performance ◽

Numerical Algorithms ◽

Mixed Precision ◽

Before And After ◽

Memory Accesses ◽

Specialized Hardware ◽

The Individual ◽

Graphics Processing

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up the computations. For algorithms whose performance is bound by the memory bandwidth, the idea of compressing their data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, ideally without impacting the algorithm output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats which optimize the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
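A per-block precision choice of this kind might look like the sketch below. The selection rule is an assumption made for illustration: it picks the narrowest IEEE format whose unit roundoff, scaled by an estimate of the block's condition number, stays under a tolerance. The abstract says only that the "numerical properties" of each block are considered, so the function name `select_storage_precision`, the `tolerance` parameter, and the rule itself are hypothetical, and range (overflow/underflow) checks are omitted.

```python
# Unit roundoff of the IEEE 754 binary16, binary32, and binary64
# formats, listed from narrowest to widest storage.
UNIT_ROUNDOFF = [
    ("fp16", 2.0 ** -11),
    ("fp32", 2.0 ** -24),
    ("fp64", 2.0 ** -53),
]


def select_storage_precision(condition_estimate, tolerance=1e-2):
    """Pick the narrowest format for one preconditioner block.

    Assumed rule (not taken from the source): storing a block in a
    format with unit roundoff u is acceptable when
    condition_estimate * u <= tolerance.
    """
    for name, unit_roundoff in UNIT_ROUNDOFF:
        if condition_estimate * unit_roundoff <= tolerance:
            return name
    # Fall back to full working precision for very ill-conditioned blocks.
    return "fp64"
```

Under this rule a well-conditioned block (estimate near 10) is stored in half precision, a moderately conditioned one (near 1e4) in single precision, and a badly conditioned one (near 1e9) stays in double precision.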


Request, Coalesce, Serve, and Forget: Miss-Optimized Memory Systems for Bandwidth-Bound Cache-Unfriendly Applications on FPGAs

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3466823 ◽

2022 ◽

Vol 15 (2) ◽

pp. 1-33

Author(s):

Mikhail Asiatici ◽

Paolo Ienne

Keyword(s):

Large Scale ◽

Sparse Matrix ◽

Memory Systems ◽

Graph Analytics ◽

Matrix Vector Multiplication ◽

Area Reduction ◽

Cache Line ◽

Speed Up ◽

Memory Accesses ◽

On Chip

Applications such as large-scale sparse linear algebra and graph analytics are challenging to accelerate on FPGAs due to their short, irregular memory accesses, which result in low cache hit rates. Nonblocking caches reduce the bandwidth required by misses by requesting each cache line only once, even when there are multiple misses corresponding to it. However, such a reuse mechanism is traditionally implemented using an associative lookup. This limits the number of misses that are considered for reuse to a few tens, at most. In this article, we present an efficient pipeline that can process and store thousands of outstanding misses in cuckoo hash tables in on-chip SRAM with minimal stalls. This brings the same bandwidth advantage as a larger cache for a fraction of the area budget, because outstanding misses do not need a data array, which can significantly speed up irregular, memory-bound, latency-insensitive applications. In addition, we extend nonblocking caches to generate variable-length bursts to memory, which increases the bandwidth delivered by DRAMs and their controllers. The resulting miss-optimized memory system provides up to 25% speedup with 24× area reduction on 15 large sparse matrix-vector multiplication benchmarks evaluated on an embedded and a datacenter FPGA system.
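The "request each cache line only once" behavior can be illustrated with a small software model. This is a sketch under assumptions, not the paper's pipeline: a Python dict stands in for the cuckoo hash tables, and the class name `MissCoalescer`, its methods, and the 64 B line size are hypothetical.

```python
LINE_BYTES = 64  # assumed cache line size


class MissCoalescer:
    """Toy model of miss coalescing in a nonblocking cache."""

    def __init__(self):
        # line number -> ids of requests waiting for that line
        # (a dict stands in for the hardware hash tables)
        self.pending = {}
        # line numbers actually sent to memory, in issue order
        self.memory_requests = []

    def miss(self, request_id, address):
        """Register a miss; return True if it triggers a memory request."""
        line = address // LINE_BYTES
        if line in self.pending:
            # The line is already in flight: queue the request and
            # spend no additional memory bandwidth.
            self.pending[line].append(request_id)
            return False
        self.pending[line] = [request_id]
        self.memory_requests.append(line)
        return True

    def fill(self, line):
        """Memory returned `line`; release and return all waiting ids."""
        return self.pending.pop(line, [])
```

Only the first miss to a line produces memory traffic; later misses to the same in-flight line are recorded and served together when the data arrives, which is why tracking many outstanding misses saves bandwidth without needing a data array.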


HPGA: A High-Performance Graph Analytics Framework on the GPU

2018 International Conference on Information Systems and Computer Aided Education (ICISCAE) ◽

10.1109/iciscae.2018.8666877 ◽

2018 ◽

Author(s):

Haoduo Yang ◽

Huayou Su ◽

Mei Wen ◽

Chunyuan Zhang

Keyword(s):

High Performance ◽

Graph Analytics


From Ephemeral Events to Multiple Legacies: An International Comparison of Festival Demarcations and Management Approaches

Event Management ◽

10.3727/152599519x15506259856192 ◽

2020 ◽

Vol 24 (5) ◽

pp. 579-596 ◽

Cited By ~ 1

Author(s):

Jasper Eshuis ◽

Bonno Pel ◽

J. Andres Coca-Stefaniak

Keyword(s):

Process Management ◽

Local Communities ◽

Socially Constructed ◽

Cultural Festivals ◽

Important Challenge ◽

Temporal And Spatial ◽

Management Approaches ◽

Constructivist Approaches ◽

Interpretive Flexibility ◽

Different Parts

Festivals have come to play an important role in tourism and managing their legacy has become an important challenge for governments and the events industry. Festivals typically take place over limited periods of time, but they also bring longer lasting legacies for the economy, local communities, and the environment. Festival legacies are characterized by interpretive flexibility; they are interpreted differently by various actors. This complicates attempts to adapt the management of festivals in such a way that aspired legacies are realized and unwanted (negative) legacies minimized. This article elicits the recursive relationship between the ways in which event legacies are socially constructed, and how events are managed. Building on constructivist approaches to governance and management and drawing on the empirical variety of six cultural festivals in different parts of Europe, this contribution shows how event legacy can be unpacked along actors' diverse cognitive, social, temporal, and spatial demarcations, and how these understandings relate to particular repertoires of management and governance. Highlighting how event legacies are pursued through combinations of control-oriented project management and more broadly scoped process management approaches, the study concludes with strategic reflections on the possibilities for elevating ephemeral events into vehicles for social change.


A Hybrid Scheme Based on Pipelining and Multitasking in Mobile Application Processors for Advanced Video Coding

Scientific Programming ◽

10.1155/2015/197843 ◽

2015 ◽

Vol 2015 ◽

pp. 1-16

Author(s):

Muhammad Asif ◽

Imtiaz A. Taj ◽

S. M. Ziauddin ◽

Maaz Bin Ahmad ◽

M. Tahir

Keyword(s):

Video Processing ◽

Mobile Application ◽

High Performance ◽

Optimization Techniques ◽

Processing Unit ◽

Software Modules ◽

Computationally Intensive ◽

Hardware Processing ◽

Memory Accesses ◽

Advanced Video Coding

One of the key requirements for mobile devices is to provide high-performance computing at lower power consumption. The processors used in these devices provide specific hardware resources to handle computationally intensive video processing and interactive graphical applications. Moreover, processors designed for low-power applications may introduce limitations on the availability and usage of resources, which present additional challenges to system designers. Owing to the specific design of the JZ47x series of mobile application processors, a hybrid software-hardware implementation scheme for an H.264/AVC encoder is proposed in this work. The proposed scheme distributes the encoding tasks among hardware and software modules. A series of optimization techniques is developed to speed up memory access and data transfer among memories. Moreover, an efficient data reuse design is proposed for the deblocking filter video processing unit to reduce memory accesses. Furthermore, fine-grained macroblock (MB) level parallelism is effectively exploited, and a pipelined approach is proposed for efficient utilization of the hardware processing cores. Finally, based on the parallelism in the proposed design, encoding tasks are distributed between two processing cores. Experiments show that the hybrid encoder is 12 times faster than a highly optimized sequential encoder due to the proposed techniques.


ELASTOMERIC NANOPARTICLES: EFFECTIVE ADDITIVE FOR HIGH PERFORMANCE RUBBER NANOCOMPOSITES

Rubber Chemistry and Technology ◽

10.5254/rct.20.80366 ◽

2020 ◽

pp. 000-000 ◽

Cited By ~ 1

Author(s):

Xiang Wang ◽

Jinliang Qiao ◽

Zhifeng Zhou ◽

Jianming Gao ◽

Guicun Qi ◽

...

Keyword(s):

Automobile Industry ◽

High Performance ◽

Rolling Resistance ◽

Skid Resistance ◽

Rubber Composites ◽

Wear Life ◽

Important Challenge ◽

The Relationship ◽

Wet Skid Resistance

The “magic triangle” is the most important challenge to rubber composites for the automobile industry. According to the magic triangle, it is difficult to improve the rolling resistance (energy saving), wet skid resistance (safety), and wear (life) of a tire simultaneously. However, a ∼5% decrease in rolling resistance, a >20% increase in wet skid resistance, and a 15% decrease in wear were achieved after adding a small amount of elastomeric nanoparticles (ENPs). The effect of ENPs on the performance of rubber composites was explained by characterizing the dispersion of the filler and the relationship between filler and rubber. The main difference between ENPs and other nanoparticles was that ENPs acted not only as part of the filler but also as part of the rubber in rubber composites.


A Transparent Runtime Data Distribution Engine for OpenMP

Scientific Programming ◽

10.1155/2000/417570 ◽

2000 ◽

Vol 8 (3) ◽

pp. 143-162 ◽

Cited By ~ 4

Author(s):

Dimitrios S. Nikolopoulos ◽

Theodore S. Papatheodorou ◽

Constantine D. Polychronopoulos ◽

Jesús Labarta ◽

Eduard Ayguadé

Keyword(s):

High Performance ◽

Programming Model ◽

Data Distribution ◽

Data Locality ◽

Remote Memory ◽

Main Body ◽

Performance Loss ◽

Page Migration ◽

Runtime Environment ◽

Memory Accesses

This paper makes two important contributions. First, the paper investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur modest performance losses. Second, the paper presents a transparent, user-level page migration engine with an ability to gain back any performance loss that stems from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration for implementing implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that it is not necessary to introduce data distribution directives in OpenMP and warrant the simplicity or the portability of the programming model.


High-Performance and Energy-Efficient Network-on-Chip Architectures for Graph Analytics

ACM Transactions on Embedded Computing Systems ◽

10.1145/2961027 ◽

2016 ◽

Vol 15 (4) ◽

pp. 1-26 ◽

Cited By ~ 11

Author(s):

Karthi Duraisamy ◽

Hao Lu ◽

Partha Pratim Pande ◽

Ananth Kalyanaraman

Keyword(s):

Energy Efficient ◽

High Performance ◽

Network On Chip ◽

Graph Analytics ◽

On Chip


Nonconventional Materials (NOCMAT) for Ecological and Sustainable Development

MRS Advances ◽

10.1557/adv.2016.613 ◽

2016 ◽

Vol 1 (53) ◽

pp. 3553-3564 ◽

Cited By ~ 1

Author(s):

Khosrow Ghavami ◽

Arash Azadeh

Keyword(s):

High Performance ◽

Human Life ◽

Natural Fibers ◽

Construction Materials ◽

Cost Effective ◽

Ongoing Research ◽

Vegetable Fibers ◽

The People ◽

Important Challenge

Four decades of advanced research on Non-Conventional Materials and Technologies (NOCMAT), such as bamboo and composites reinforced with natural fibers, have shown that it is now possible to produce and use high-performance NOCMAT. Bamboo and composites reinforced with vegetable fibers are capable of meeting most engineering demands in terms of strength, stiffness, toughness, and energy absorption capability. The greatest challenge of the 21st century is the need for cost-effective, durable, and eco-friendly construction materials that will meet the global needs of infrastructure regeneration and rehabilitation, which alone can enhance the quality of life for all the people of the world. This paper summarizes some results of the judicious combination of different matrices reinforced with vegetable fibers, especially bamboo. These sustainable ecological materials are strong, ductile, and capable of absorbing large amounts of energy. They could find extensive applications in engineering, particularly in developing countries. Specifically, the development of durable composites reinforced with vegetable fibers and bamboo poses an important challenge to the science and skills of engineering. This challenge could create the most useful, eco-friendly construction materials backed by an endless supply of renewable natural resources. In addition, the paper presents results of some ongoing research concerning bamboo, and discusses how vegetable fibers such as hemp were, before the invention of nylon, the most used materials in all aspects of human life around the globe, and why hemp was banned.
