Applying EMD/HHT analysis to power traces of applications executed on systems with Intel Xeon Phi

Power draw is a complex physical response to the workload of a given application on the hardware, which is difficult to model, in part, due to its variability. The empirical mode decomposition and Hilbert–Huang transform (EMD/HHT) is a method commonly applied to physical systems varying with time to analyze their complex behavior. In authors’ work, the EMD/HHT is considered for the first time to study power usage of high-performance applications. Here, this method is applied to the power measurement sequences (called here power traces) collected on three different computing platforms featuring two generations of Intel Xeon Phi, which are an attractive solution under the power budget constraints. The high-performance applications explored in this work are codesign molecular synamics and general atomic and molecular electronic structure system—which exhibit different power draw characteristics—to showcase strengths and limitations of the EMD/HHT analysis. Specifically, EMD/HHT measures intensity of an execution, which shows the concentration of power draw with respect to execution time and provides insights into performance bottlenecks. This article compares intensity among executions, noting on a relationship between intensity and execution characteristics, such as computation amount and data movement. In general, this article concludes that the EMD/HHT method is a viable tool to compare application power usage and performance over the entire execution and that it has much potential in selecting most appropriate execution configurations.

Download Full-text

The VOLNA-OP2 tsunami code (version 1.5)

Geoscientific Model Development ◽

10.5194/gmd-11-4621-2018 ◽

2018 ◽

Vol 11 (11) ◽

pp. 4621-4635 ◽

Cited By ~ 7

Author(s):

Istvan Z. Reguly ◽

Daniel Giles ◽

Devaraj Gopinathan ◽

Laure Quivy ◽

Joakim H. Beck ◽

...

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Shallow Water Equation ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Central Processing ◽

Domain Specific ◽

Computing Platforms ◽

Graphics Processing ◽

Intel Xeon

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and implementation; a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved in a way that the scientific code is kept separate from various parallel implementations, enabling easy maintainability. It has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 shows an ability to deliver productivity as well as performance and portability to its users across a number of platforms.

Download Full-text

Effective SIMD Vectorization for Intel Xeon Phi Coprocessors

Scientific Programming ◽

10.1155/2015/269764 ◽

2015 ◽

Vol 2015 ◽

pp. 1-14 ◽

Cited By ~ 8

Author(s):

Xinmin Tian ◽

Hideki Saito ◽

Serguei V. Preis ◽

Eric N. Garcia ◽

Sergey S. Kozhukhov ◽

...

Keyword(s):

High Performance ◽

Xeon Phi ◽

Performance Gain ◽

Intel Xeon Phi ◽

Performance Study ◽

Seamless Integration ◽

Small Matrix ◽

Performance Results ◽

Intel Mic ◽

Intel Xeon

Efficiently exploiting SIMD vector units is one of the most important aspects in achieving high performance of the application code running on Intel Xeon Phi coprocessors. In this paper, we present several effective SIMD vectorization techniques such as less-than-full-vector loop vectorization, Intel MIC specific alignment optimization, and small matrix transpose/multiplication 2D vectorization implemented in the Intel C/C++ and Fortran production compilers for Intel Xeon Phi coprocessors. A set of workloads from several application domains is employed to conduct the performance study of our SIMD vectorization techniques. The performance results show that we achieved up to 12.5x performance gain on the Intel Xeon Phi coprocessor. We also demonstrate a 2000x performance speedup from the seamless integration of SIMD vectorization and parallelization.

Download Full-text

Implementation of an Agent-Based Parallel Tissue Modelling Framework for the Intel MIC Architecture

Scientific Programming ◽

10.1155/2017/8721612 ◽

2017 ◽

Vol 2017 ◽

pp. 1-11 ◽

Cited By ~ 5

Author(s):

Maciej Cytowski ◽

Zuzanna Szymańska ◽

Piotr Umiński ◽

Grzegorz Andrejczuk ◽

Krzysztof Raszkowski

Keyword(s):

High Performance ◽

Large Scale ◽

Spatial Scales ◽

Biological Processes ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Variable Environment ◽

Modelling Framework ◽

Computational Performance ◽

Intel Xeon

Timothy is a novel large scale modelling framework that allows simulating of biological processes involving different cellular colonies growing and interacting with variable environment. Timothy was designed for execution on massively parallel High Performance Computing (HPC) systems. The high parallel scalability of the implementation allows for simulations of up to 109 individual cells (i.e., simulations at tissue spatial scales of up to 1 cm3 in size). With the recent advancements of the Timothy model, it has become critical to ensure appropriate performance level on emerging HPC architectures. For instance, the introduction of blood vessels supplying nutrients to the tissue is a very important step towards realistic simulations of complex biological processes, but it greatly increased the computational complexity of the model. In this paper, we describe the process of modernization of the application in order to achieve high computational performance on HPC hybrid systems based on modern Intel® MIC architecture. Experimental results on the Intel Xeon Phi™ coprocessor x100 and the Intel Xeon Phi processor x200 are presented.

Download Full-text

Intel Xeon Phi Coprocessor High Performance Programming

10.1016/c2011-0-06997-1 ◽

2013 ◽

Keyword(s):

High Performance ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Performance Programming ◽

Intel Xeon

Download Full-text

Cache Locality-Centric Parallel String Matching on Many-Core Accelerator Chips

Scientific Programming ◽

10.1155/2015/937694 ◽

2015 ◽

Vol 2015 ◽

pp. 1-20 ◽

Cited By ~ 1

Author(s):

Nhat-Phuong Tran ◽

Myungho Lee ◽

Dong Hoon Choi

Keyword(s):

High Performance ◽

Parallel Implementation ◽

String Matching ◽

Processing Unit ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Multiple Threads ◽

The Many ◽

Many Core ◽

Intel Xeon

Aho-Corasick (AC) algorithm is a multiple patterns string matching algorithm commonly used in computer and network security and bioinformatics, among many others. In order to meet the highly demanding computational requirements imposed on these applications, achieving high performance for the AC algorithm is crucial. In this paper, we present a high performance parallelization of the AC on the many-core accelerator chips such as the Graphic Processing Unit (GPU) from Nvidia and the Intel Xeon Phi. Our parallelization approach significantly improves the cache locality of the AC by partitioning a given set of string patterns into multiple smaller sets of patterns in a space-efficient way. Using the multiple pattern sets, intensive pattern matching operations are concurrently conducted with respect to the whole input text data. Compared with the previous approaches where the input data is partitioned amongst multiple threads instead of partitioning the pattern set, our approach significantly improves the performance. Experimental results show that our approach leads up to 2.73 times speedup on the Nvidia K20 GPU and 2.00 times speedup on the Intel Xeon Phi compared with the previous approach. Our parallel implementation delivers up to 693 Gbps throughput performance on the K20.

Download Full-text

Intel Xeon Phi Processor High Performance Programming

10.1016/c2015-0-00549-4 ◽

2016 ◽

Keyword(s):

High Performance ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Performance Programming ◽

Intel Xeon

Download Full-text

High-performance IP lookup using Intel Xeon Phi: a Bloom filters based approach

Journal of Internet Services and Applications ◽

10.1186/s13174-017-0075-y ◽

2018 ◽

Vol 9 (1) ◽

Cited By ~ 2

Author(s):

Alexandre Lucchesi ◽

André C. Drummond ◽

George Teodoro

Keyword(s):

High Performance ◽

Bloom Filters ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Ip Lookup ◽

Intel Xeon

Download Full-text

High-performance simulations of turbulent boundary layer flow using Intel Xeon Phi many-core processors

The Journal of Supercomputing ◽

10.1007/s11227-021-03642-6 ◽

2021 ◽

Author(s):

Ji-Hoon Kang ◽

Jinyul Hwang ◽

Hyung Jin Sung ◽

Hoon Ryu

Keyword(s):

Boundary Layer ◽

Turbulent Boundary Layer ◽

Boundary Layer Flow ◽

High Performance ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Turbulent Boundary Layer Flow ◽

Layer Flow ◽

Many Core ◽

Intel Xeon

Download Full-text

On the Use of Large Intel Xeon Phi Clusters for GEANT4-Based Simulations

Cybernetics and Information Technologies ◽

10.1515/cait-2017-0059 ◽

2017 ◽

Vol 17 (5) ◽

pp. 101-109

Author(s):

Nevena Ilieva ◽

Elena Lilkova ◽

Leandar Litov ◽

Borislav Pavlov ◽

Peicho Petkov

Keyword(s):

High Performance Computing ◽

High Performance ◽

Particle Interactions ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Supplementary Material ◽

Large Clusters ◽

Precise Simulation ◽

Performance Computing ◽

Intel Xeon

Abstract GEANT4 is the basic software for fast and precise simulation of particle interactions with matter. Along the way towards enabling the execution of GEANT4 based simulations on hybrid High Performance Computing (HPC) architectures with large clusters of Intel Xeon Phi co-processors, we study the performance of this software suit on the supercomputer system Avitohol@BAS, Some practical scripts are collected in the supplementary material shown in the appendix.

Download Full-text

Using Intel Xeon Phi coprocessors for execution of natural join on compressed data

Numerical Methods and Programming (Vychislitel'nye Metody i Programmirovanie) ◽

10.26089/nummet.v16r450 ◽

2015 ◽

pp. 534-542

Author(s):

Е.В. Иванова ◽

Л.Б. Соколинский

Keyword(s):

High Performance ◽

Data Exchange ◽

Cluster Computing ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Computing Systems ◽

Parallel Decomposition ◽

Many Core ◽

Compressed Data ◽

Intel Xeon

В статье описывается сопроцессор баз данных для высокопроизводительных кластерных вычислительных систем с многоядерными ускорителями, использующий распределенные колоночные индексы с интервальной фрагментацией. Работа сопроцессора рассматривается на примере выполнения операции естественного соединения. Параллельная декомпозиция естественного соединения выполняется на основе использования распределенных колоночных индексов. Предложенный подход позволяет выполнять реляционные операции на кластерных вычислительных системах без массовых обменов данными. Приводятся результаты вычислительных экспериментов с использованием сопроцессоров Intel Xeon Phi, подтверждающие эффективность разработанных методов и алгоритмов. A database coprocessor for high-performance cluster computing systems with many-core accelerators is described. This coprocessor uses distributed columnar indexes with interval fragmentation. The operation of the coprocessor engine is considered by an example of natural join processing. The parallel decomposition of natural join operator is performed using distributed columnar indexes. The proposed approach allow one to perform relational operators on computing clusters without massive data exchange. The results of computational experiments on Intel Xeon Phi confirm the efficiency of the developed methods and algorithms.

Download Full-text