A Theoretical Model for Global Optimization of Parallel Algorithms

Mathematics ◽  
2021 ◽  
Vol 9 (14) ◽  
pp. 1685
Author(s):  
Julian Miller ◽  
Lukas Trümper ◽  
Christian Terboven ◽  
Matthias S. Müller

With the quickly evolving hardware landscape of high-performance computing (HPC) and its increasing specialization, the implementation of efficient software applications becomes more challenging. This is especially prevalent for domain scientists and may hinder advances in large-scale simulation software. One way to overcome these challenges is through software abstraction. We present a model of parallel algorithms that allows for global optimization of their synchronization and dataflow and for optimal mapping to complex, heterogeneous architectures. The presented model strictly separates the structure of an algorithm from its executed functions. It utilizes a hierarchical decomposition of parallel design patterns as well-established building blocks for algorithmic structures and captures them in an abstract pattern tree (APT). A data-centric flow graph is constructed from the APT, which acts as an intermediate representation for rich, automated structural transformations. We demonstrate the applicability of this model to three representative algorithms and show runtime speedups between 1.83× and 2.45× on a typical heterogeneous CPU/GPU architecture.
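A minimal sketch (in Python, with illustrative names; this is not the authors' implementation) of how an abstract pattern tree might separate algorithmic structure from executed functions: inner nodes name parallel patterns, leaves carry the functions, and a transformation pass could rewrite the tree without touching the computation.

    # Illustrative sketch of an abstract pattern tree (APT): structure as pattern
    # nodes, executed functions attached only at the leaves.
    from dataclasses import dataclass, field
    from typing import Callable, List, Optional


    @dataclass
    class PatternNode:
        """A node of the APT: a named parallel pattern with child patterns."""
        pattern: str                                   # e.g. "map", "reduce", "stencil"
        children: List["PatternNode"] = field(default_factory=list)
        func: Optional[Callable] = None                # executed function, leaves only

        def walk(self, depth: int = 0):
            """Yield (depth, node) pairs; a structural transformation would rewrite here."""
            yield depth, self
            for child in self.children:
                yield from child.walk(depth + 1)


    # Example structure: a map over elements followed by a global reduction.
    apt = PatternNode("serial", [
        PatternNode("map", [PatternNode("task", func=lambda x: x * x)]),
        PatternNode("reduce", [PatternNode("task", func=lambda a, b: a + b)]),
    ])

    for depth, node in apt.walk():
        print("  " * depth + node.pattern)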

2012 ◽  
Vol 22 (02) ◽  
pp. 1240006 ◽  
Author(s):  
M. ALDINUCCI ◽  
M. DANELUTTO ◽  
P. KILPATRICK ◽  
M. TORQUATI

We propose a dataflow-based runtime system as an efficient tool for supporting the execution of parallel code on heterogeneous architectures hosting both multicore CPUs and GPUs. We discuss how the proposed runtime system can serve as the target both of structured parallel applications developed using algorithmic skeletons/parallel design patterns and of more "domain-specific" programming models. Experimental results demonstrating the feasibility of the approach are presented.
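As a rough illustration of the dataflow idea only (not the runtime described in the paper, whose scheduling policy and API are not reproduced here), the sketch below fires a task once all of its inputs have been produced and dispatches it to a nominal CPU or GPU device.

    # Tiny dataflow executor: a task becomes ready when all its inputs exist.
    class Task:
        def __init__(self, name, fn, inputs=(), device="cpu"):
            self.name, self.fn, self.inputs, self.device = name, fn, tuple(inputs), device

    def run_dataflow(tasks):
        """Very small topological executor over data dependencies."""
        values, pending = {}, list(tasks)
        while pending:
            ready = [t for t in pending if all(i in values for i in t.inputs)]
            if not ready:
                raise RuntimeError("cyclic or unsatisfiable dependencies")
            for t in ready:
                args = [values[i] for i in t.inputs]
                print(f"running {t.name} on {t.device}")   # a real runtime would offload here
                values[t.name] = t.fn(*args)
                pending.remove(t)
        return values

    result = run_dataflow([
        Task("a", lambda: [1, 2, 3]),
        Task("b", lambda xs: [x * x for x in xs], inputs=("a",), device="gpu"),
        Task("c", lambda xs: sum(xs), inputs=("b",)),
    ])
    print(result["c"])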


Author(s):  
D. E. Keyes ◽  
H. Ltaief ◽  
G. Turkiyyah

A traditional goal of algorithmic optimality, squeezing out flops, has been superseded by evolution in architecture. Flops no longer serve as a reasonable proxy for all aspects of complexity. Instead, algorithms must now squeeze memory, data transfers, and synchronizations, while extra flops on locally cached data represent only small costs in time and energy. Hierarchically low-rank matrices realize a rarely achieved combination of optimal storage complexity and high computational intensity for a wide class of formally dense linear operators that arise in applications for which exascale computers are being constructed. They may be regarded as algebraic generalizations of the fast multipole method. Methods based on these hierarchical data structures and their simpler cousins, tile low-rank matrices, are well proportioned for early exascale computer architectures, which are provisioned for high processing power relative to memory capacity and memory bandwidth. They are ushering in a renaissance of computational linear algebra. A challenge is that emerging hardware architectures possess hierarchies of their own that do not generally align with those of the algorithm. We describe modules of a software toolkit, Hierarchical Computations on Manycore Architectures, that illustrate these features and are intended as building blocks of applications, such as matrix-free higher-order methods in optimization and large-scale spatial statistics. Some modules of this open-source project have been adopted in the software libraries of major vendors. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
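The tile low-rank idea can be sketched in a few lines: replace each off-diagonal tile of a dense operator with a truncated SVD factor pair, trading a little accuracy for much lower storage. The snippet below is an illustrative NumPy sketch under made-up kernel and tolerance choices, not the API of the authors' toolkit.

    # Compress one tile of a dense operator into low-rank factors U @ V.T.
    import numpy as np

    def compress_tile(tile, tol=1e-8):
        """Return (U, V) with tile ≈ U @ V.T, keeping singular values above tol * s_max."""
        U, s, Vt = np.linalg.svd(tile, full_matrices=False)
        k = max(1, int(np.sum(s > tol * s[0])))
        return U[:, :k] * s[:k], Vt[:k, :].T

    rng = np.random.default_rng(0)
    # A smooth kernel evaluated on two well-separated point clusters yields a
    # numerically low-rank tile, the situation hierarchical formats exploit.
    x, y = rng.uniform(0, 1, 64), rng.uniform(10, 11, 64)
    tile = 1.0 / np.abs(x[:, None] - y[None, :])
    U, V = compress_tile(tile)
    print(tile.shape, "->", U.shape, V.shape,
          "max error", np.max(np.abs(tile - U @ V.T)))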


Author(s):  
Anish Varghese ◽  
Bob Edwards ◽  
Gaurav Mitra ◽  
Alistair P Rendell

Energy efficiency is the primary impediment on the path to exascale computing. Consequently, the high-performance computing community is increasingly interested in low-power, high-performance embedded systems as building blocks for large-scale high-performance systems. The Adapteva Epiphany architecture integrates low-power RISC cores on a 2D mesh network and promises up to 70 GFLOPS/Watt of theoretical performance. However, with just 32 KB of memory per eCore for storing both data and code, programming the Epiphany system presents significant challenges. In this paper we evaluate the performance of a 64-core Epiphany system with a variety of basic compute and communication micro-benchmarks. Further, we implement two well-known application kernels: a 5-point star-shaped heat stencil with a peak performance of 65.2 GFLOPS and a matrix multiplication with 65.3 GFLOPS, both in single precision across 64 Epiphany cores. We discuss strategies for implementing high-performance computing application kernels on such memory-constrained low-power devices and compare the Epiphany with competing low-power systems. With future Epiphany revisions expected to house thousands of cores on a single chip, understanding the merits of such an architecture is of prime importance to the exascale initiative.
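For reference, the 5-point star-shaped heat stencil benchmarked in the paper amounts to the following update; this is a plain NumPy sketch of the kernel's arithmetic, not the Epiphany eCore implementation, and the grid size and coefficient are illustrative.

    # One Jacobi sweep of the 5-point heat stencil over the interior points.
    import numpy as np

    def heat_step(u, alpha=0.1):
        """New grid where each interior point mixes in its four neighbours."""
        un = u.copy()
        un[1:-1, 1:-1] = u[1:-1, 1:-1] + alpha * (
            u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
            - 4.0 * u[1:-1, 1:-1]
        )
        return un

    u = np.zeros((64, 64), dtype=np.float32)
    u[32, 32] = 1.0              # point heat source in the middle of the grid
    for _ in range(100):
        u = heat_step(u)
    print(u.sum())               # total heat, minus what leaks out at the fixed boundary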


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Shichao Zhang ◽  
Hui Liu ◽  
Jianyong Yu ◽  
Bingyun Li ◽  
Bin Ding

Two-dimensional network-structured carbon nanoscale building blocks, going beyond graphene, are of fundamental importance, and creating such structures and developing their applications have broad implications for the environment, electronics and energy. Here, we report a facile route, based on electro-spraying/netting, to self-assemble two-dimensional carbon nanostructured networks on a large scale. Manipulating the dynamic ejection, deformation and assembly of charged droplets, by controlling Taylor cone instability and the micro-electric field, enables the creation of networks that combine the nanoscale diameter of one-dimensional carbon nanotubes with the lateral infinity of two-dimensional graphene. The macro-sized (metre-scale) carbon nanostructured networks show extraordinary nanostructural properties, remarkable flexibility (soft polymeric mechanics despite a hard inorganic matrix), nanoscale-level conductivity, and outstanding performance in distinctly different areas such as filters, separators, absorbents, and wearable electrodes, supercapacitors and cells. This work should make possible the innovative design of high-performance, multi-functional carbon nanomaterials for various applications.


2014 ◽  
Vol 11 (3) ◽  
pp. 88-98 ◽  
Author(s):  
Shima Soroushnia ◽  
Masoud Daneshtalab ◽  
Juha Plosila ◽  
Tapio Pahikkala ◽  
Pasi Liljeberg

Pattern discovery is one of the fundamental tasks in bioinformatics, and pattern recognition is a powerful technique for searching for sequence patterns in biological sequence databases. Fast, high-performance algorithms are in high demand in bioinformatics and computational molecular biology, since the significant increase in the number of DNA and protein sequences expands the need for faster pattern matching. For this purpose, heterogeneous architectures can be a good choice due to their potential for high performance and energy efficiency. In this paper we present an efficient implementation of the Aho-Corasick (AC) algorithm, a well-known exact pattern matching algorithm with linear complexity, and of the Parallel Failureless Aho-Corasick (PFAC) algorithm, a massively parallel version of AC without failure transitions, on a heterogeneous CPU/GPU architecture. We progressively redesigned the algorithms and data structures to fit the GPU architecture. Our results on different protein sequence data sets show that the new implementation runs 15 times faster than the original implementation of the PFAC algorithm.
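The failureless idea behind PFAC can be sketched compactly: build an ordinary trie over the patterns and start an independent traversal at every text position, so no failure transitions are needed and each traversal can become its own GPU thread. The Python below only illustrates that idea on the CPU; it is not the paper's GPU code.

    # Failureless Aho-Corasick sketch: plain trie plus one traversal per start position.
    def build_trie(patterns):
        children = [{}]            # children[state] maps a character to the next state
        accept = {}                # accepting state -> pattern it completes
        for pat in patterns:
            state = 0
            for ch in pat:
                if ch not in children[state]:
                    children.append({})
                    children[state][ch] = len(children) - 1
                state = children[state][ch]
            accept[state] = pat
        return children, accept

    def pfac_match(text, children, accept):
        """Each start position is an independent traversal (one GPU thread each in PFAC)."""
        hits = []
        for start in range(len(text)):
            state = 0
            for pos in range(start, len(text)):
                state = children[state].get(text[pos])
                if state is None:
                    break              # no failure transition: simply stop this traversal
                if state in accept:
                    hits.append((start, accept[state]))
        return hits

    children, accept = build_trie(["he", "she", "his", "hers"])
    print(pfac_match("ushers", children, accept))   # [(1, 'she'), (2, 'he'), (2, 'hers')]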


2021 ◽  
Author(s):  
Pau Andrio ◽  
Adam Hospital ◽  
Cristian Ramon-Cortes ◽  
Javier Conejero ◽  
Daniele Lezzi ◽  
...  

The usage of workflows has led to progress in many fields of science, where the need to process large amounts of data is coupled with difficulty in accessing and efficiently using high-performance computing (HPC) platforms. On the one hand, scientists are focused on their problem and concerned with how to process their data. On top of that, the applications typically have different parts and use different tools for each part, thus complicating the distribution and the reproducibility of the simulations. On the other hand, computer scientists concentrate on how to develop frameworks for the deployment of workflows on HPC or HTC resources, often providing separate solutions for the computational aspects and the data analytic ones.

In this paper we present an approach to support biomolecular researchers in the development of complex workflows that i) allow them to compose pipelines of individual simulations built from different tools and interconnected by data dependencies, ii) run them seamlessly on different computational platforms, and iii) scale them up to the large number of cores provided by modern supercomputing infrastructures. Our approach is based on the orchestration of computational building blocks for molecular dynamics simulations through an efficient workflow management system that has already been adopted in many scientific fields to run applications on a multitude of computing backends.

Results demonstrate the validity of the proposed solution through the execution of massively parallel runs in a supercomputer facility.
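A library-agnostic sketch of the pipeline idea (the paper orchestrates molecular-dynamics building blocks with a workflow management system; the function and file names below are purely illustrative, not that system's API): each building block consumes and produces files, and the pipeline is simply the chain of those data dependencies, which is what allows a workflow manager to distribute and scale the steps.

    # Illustrative pipeline of building blocks chained by file dependencies.
    from pathlib import Path

    def building_block(name, inputs, outputs):
        """Stand-in for a wrapped simulation tool: reads input files, writes output files."""
        for out in outputs:
            Path(out).write_text(f"{name} produced from {', '.join(inputs)}\n")
        return outputs

    def run_pipeline(structure):
        topology   = building_block("prepare_topology",    [structure], ["system.top"])
        minimized  = building_block("energy_minimization", topology,    ["min.gro"])
        trajectory = building_block("md_production",       minimized,   ["traj.xtc"])
        return trajectory

    print(run_pipeline("protein.pdb"))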


2016 ◽  
Vol 11 (1) ◽  
pp. 57-68
Author(s):  
Shruti Kalra ◽  
A. B. Bhattacharyya

Aggressive technology scaling continues to drive ultra-large-scale integrated chips to higher clock speeds. This causes large power consumption, leading to considerable heat generation and on-chip temperature gradients. Though much research has focused on low-power design, thermal issues still persist and need attention for enhanced integrated circuit reliability. The present paper outlines a methodology for a first-hand estimation of the effect of temperature on basic CMOS building blocks at ultra-deep-submicron technology nodes, utilizing a modified α-power-law-based MOSFET model. The generalized α-power model is further applied to calculate the zero temperature coefficient (ZTC) point, which provides temperature-independent operation of high-performance and low-power digital circuits without the use of conditioning circuits. The performance of basic digital circuits such as inverter, NAND, NOR and XOR gates is analyzed, and the results are compared with BSIM4 with respect to temperature at technology nodes down to 32 nm. The error lies within an acceptable range of 5-10%.
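A numeric sketch of the ZTC idea using a generic α-power-law drain current: mobility (and hence the current factor) falls with temperature while the threshold voltage also falls, and the two effects cancel at one gate bias. All parameter values below are illustrative placeholders, not the paper's extracted model for any node.

    # Locate the zero temperature coefficient (ZTC) bias as the gate voltage where
    # the drain current curves at two temperatures intersect.
    import numpy as np

    T0 = 300.0  # reference temperature in kelvin

    def drain_current(vgs, T, k0=5e-4, vth0=0.35, alpha=1.3, m=1.5, kappa=0.7e-3):
        """Alpha-power law I_D = k(T) * (V_GS - V_TH(T))^alpha above threshold."""
        k = k0 * (T / T0) ** (-m)            # mobility-driven degradation with temperature
        vth = vth0 - kappa * (T - T0)        # threshold voltage decreases with temperature
        vov = np.maximum(vgs - vth, 0.0)     # overdrive, clamped to zero below threshold
        return k * vov ** alpha

    vgs = np.linspace(0.3, 1.0, 2000)
    i_cold, i_hot = drain_current(vgs, 300.0), drain_current(vgs, 400.0)
    ztc = vgs[np.argmin(np.abs(i_cold - i_hot))]
    print(f"approximate ZTC gate voltage: {ztc:.3f} V")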


Author(s):  
C.K. Wu ◽  
P. Chang ◽  
N. Godinho

Recently, the use of refractory metal silicides as low-resistivity, high-temperature and high-oxidation-resistance gate materials in large-scale integrated circuits (LSI) has become an important approach in advanced MOS process development (1). This research is a systematic study of the structure and properties of molybdenum silicide thin films and their applicability to high-performance LSI fabrication.


Author(s):  
В.В. ГОРДЕЕВ ◽  
В.Е. ХАЗАНОВ

The choice of the proper type and size of milking installation needs to take into account the maximum planned number of dairy cows, the size of a technological group, the number of milkings per day, and the duration of one milking and of the operator's working shift. An analysis of the technical and economic indicators of the currently most common types of milking machines of the same technical level revealed that the Carousel installation had the best specific indicators, while the Herringbone installation featured higher labour inputs and cash costs; the Parallel installation falls in between. In terms of throughput and the required number of operators, Herringbone is recommended for farms with up to 600 dairy cows, Parallel for up to 1200 dairy cows, and Carousel for more than 1200 dairy cows. Carousel was found to be the most practical, high-performance, easily automated and, therefore, promising milking system for milking parlours, especially on large-scale dairy farms.

