MILC Code Performance on High End CPU and GPU Supercomputer Clusters

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, deeper memory hierarchies, and greater programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider the performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

A parallel algorithm of Euclidean distance matrix computation for the Intel Xeon Phi Knights Landing many-core processor

Bulletin of the South Ural State University Series Computational Mathematics and Software Engineering ◽

10.14529/cmse180305 ◽

2018 ◽

Vol 7 (3) ◽

Keyword(s):

Parallel Algorithm ◽

Euclidean Distance ◽

Distance Matrix ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Euclidean Distance Matrix ◽

Matrix Computation ◽

Knights Landing ◽

Many Core ◽

Intel Xeon

The use of MPI and OpenMP technologies for subsequence similarity search in very long time series on a computer cluster system with nodes based on the Intel Xeon Phi Knights Landing many-core processor

Numerical Methods and Programming (Vychislitel'nye Metody i Programmirovanie) ◽

10.26089/nummet.v20r104 ◽

2019 ◽

pp. 29-44

Author(s):

Я.А. Краева ◽

М.Л. Цымблер

Keyword(s):

Time Series ◽

Similarity Search ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Time Warping ◽

Knights Landing ◽

Dynamic Time ◽

Many Core ◽

Intel MIC ◽

Intel Xeon

Nowadays, subsequence similarity search is required in a wide range of time series mining applications: climate modeling, financial forecasting, medical research, and others. Most of these applications use the Dynamic Time Warping (DTW) similarity measure, since the research community currently regards DTW as one of the best measures for the majority of subject domains. Because DTW has quadratic computational complexity with respect to the length of the query subsequence, a number of parallel algorithms have been developed to compute it on FPGA devices and on many-core accelerators with the GPU and Intel MIC architectures. In this paper we propose a new parallel algorithm for subsequence similarity search in very large time series on computer cluster systems with nodes based on Intel Xeon Phi Knights Landing (KNL) many-core processors. Computations are parallelized on two levels: by MPI across all cluster nodes and by OpenMP within a single cluster node. The algorithm involves additional data structures and redundant computations, which make it possible to use the vectorization capabilities of Phi KNL efficiently. Experiments on synthetic and real-world datasets show that the proposed algorithm scales well.
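To make the quadratic cost mentioned in the abstract concrete, the following is a minimal serial sketch of the textbook DTW recurrence and a brute-force subsequence search built on it. It is not the algorithm proposed in the paper: the function names `dtw_distance` and `best_match`, the squared point cost, and the fixed window length are assumptions made for illustration, and none of the paper's MPI, OpenMP, or vectorization techniques are reproduced.

```python
def dtw_distance(a, b):
    """Textbook DTW between two sequences, O(len(a) * len(b)) time.

    Uses a squared difference as the point cost (an assumption for this sketch)
    and keeps only two rows of the dynamic-programming table.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix costs nothing
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible warping steps.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def best_match(series, query):
    """Brute-force search: slide a window of len(query) over the series
    and return (start index, DTW distance) of the closest window."""
    m = len(query)
    best_pos, best_dist = -1, float("inf")
    for start in range(len(series) - m + 1):
        d = dtw_distance(series[start:start + m], query)
        if d < best_dist:
            best_pos, best_dist = start, d
    return best_pos, best_dist


if __name__ == "__main__":
    print(best_match([0, 0, 1, 2, 3, 0, 0], [1, 2, 3]))
```

Each window in `best_match` is evaluated independently of the others, which is the kind of structure that lends itself to splitting work across nodes and threads; how the paper actually partitions the series is not stated in the abstract.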

Performance Evaluation of Scientific Applications on Intel Xeon Phi Knights Landing Clusters

2018 International Conference on High Performance Computing & Simulation (HPCS) ◽

10.1109/hpcs.2018.00063 ◽

2018 ◽

Cited By ~ 4

Author(s):

Ji-Hoon Kang ◽

Oh-Kyoung Kwon ◽

Hoon Ryu ◽

Jinwoo Jeong ◽

Kyunghun Lim

Keyword(s):

Performance Evaluation ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Scientific Applications ◽

Knights Landing ◽

Intel Xeon

Simulating Multiphase Flows in Porous Media Using OpenFOAM on Intel Xeon Phi Knights Landing Processors

Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact - PEARC17 ◽

10.1145/3093338.3093350 ◽

2017 ◽

Cited By ~ 1

Author(s):

Zhi Shang ◽

Honggao Liu

Keyword(s):

Porous Media ◽

Multiphase Flows ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Flows In Porous Media ◽

Knights Landing ◽

Intel Xeon

Long-time simulations with complex code using multiple nodes of Intel Xeon Phi Knights Landing

Journal of Computational and Applied Mathematics ◽

10.1016/j.cam.2017.12.050 ◽

2018 ◽

Vol 337 ◽

pp. 18-36 ◽

Cited By ~ 1

Author(s):

Jonathan S. Graf ◽

Matthias K. Gobbert ◽

Samuel Khuvis

Keyword(s):

Xeon Phi ◽

Intel Xeon Phi ◽

Long Time ◽

Knights Landing ◽

Intel Xeon

Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing

2017 Fifth International Symposium on Computing and Networking (CANDAR) ◽

10.1109/candar.2017.66 ◽

2017 ◽

Cited By ~ 1

Author(s):

Issaku Kanamori ◽

Hideo Matsufuru

Keyword(s):

Lattice QCD ◽

Practical Implementation ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Knights Landing ◽

Intel Xeon

Accelerating Seismic Simulations Using the Intel Xeon Phi Knights Landing Processor

Lecture Notes in Computer Science - High Performance Computing ◽

10.1007/978-3-319-58667-0_8 ◽

2017 ◽

pp. 139-157 ◽

Cited By ~ 5

Author(s):

Josh Tobin ◽

Alexander Breuer ◽

Alexander Heinecke ◽

Charles Yount ◽

Yifeng Cui

Keyword(s):

Xeon Phi ◽

Intel Xeon Phi ◽

Knights Landing ◽

Intel Xeon

Training Large Scale Deep Neural Networks on the Intel Xeon Phi Many-Core Coprocessor

2014 IEEE International Parallel & Distributed Processing Symposium Workshops ◽

10.1109/ipdpsw.2014.194 ◽

2014 ◽

Cited By ~ 13

Author(s):

Lei Jin ◽

Zhaokang Wang ◽

Rong Gu ◽

Chunfeng Yuan ◽

Yihua Huang

Keyword(s):

Neural Networks ◽

Large Scale ◽

Deep Neural Networks ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Many Core ◽

Intel Xeon

Parallelization of Molecular-Dynamics Simulations Using Tasks

MRS Proceedings ◽

10.1557/opl.2015.113 ◽

2015 ◽

Vol 1753 ◽

Cited By ~ 2

Author(s):

Ralf Meyer ◽

Chris M. Mangiardi

Keyword(s):

Molecular Dynamics ◽

Molecular Dynamics Simulations ◽

Shared Memory ◽

MD Simulations ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Novel Algorithms ◽

Dynamics Simulations ◽

Many Core ◽

Intel Xeon

This article discusses novel algorithms for molecular-dynamics (MD) simulations with short-ranged forces on modern multi- and many-core processors such as the Intel Xeon Phi. A task-based approach to the parallelization of MD on shared-memory computers and a tiling scheme to facilitate the SIMD vectorization of the force calculations are described. The algorithms have been tested with three different potentials, and the resulting speed-ups on Intel Xeon Phi coprocessors are shown.
