Accelerating time series subsequence matching on the Intel Xeon Phi many-core coprocessor

Subsequence similarity search is required in a wide range of time series mining applications, such as climate modeling, financial forecasting, and medical research. Most of these applications use the Dynamic Time Warping (DTW) similarity measure, since DTW has been empirically confirmed as one of the best similarity measures for the majority of subject domains. Because DTW has quadratic computational complexity with respect to the length of the query subsequence, a number of parallel algorithms have been developed for FPGA devices and for many-core accelerators with GPU and Intel MIC architectures. In this paper we propose a new parallel algorithm for subsequence similarity search in very large time series on computer cluster systems with nodes based on Intel Xeon Phi Knights Landing (KNL) many-core processors. Computations are parallelized on two levels: by MPI across all cluster nodes and by OpenMP within a single cluster node. The algorithm uses additional data structures and redundant computations, which make it possible to exploit vectorization efficiently on Phi KNL. Experimental evaluation on real-world and synthetic datasets shows that the proposed algorithm scales well.
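To make the quadratic cost mentioned in the abstract concrete, here is a minimal serial sketch of DTW-based subsequence search. It is not the paper's algorithm: the MPI/OpenMP parallelization, the additional data structures, and the vectorization are not reproduced. The names `dtw_distance`, `best_match`, and the optional `window` parameter (a Sakoe-Chiba band) are illustrative assumptions.

```python
import math


def dtw_distance(query, candidate, window=None):
    """Textbook dynamic-programming DTW with squared point-wise cost.

    Runs in O(len(query) * len(candidate)) time without a band, which is
    the quadratic cost the abstract refers to. `window` is an optional
    Sakoe-Chiba band half-width (an assumption, not taken from the paper).
    """
    n, m = len(query), len(candidate)
    if window is None:
        window = max(n, m)
    window = max(window, abs(n - m))  # band must reach the final cell

    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            cost = (query[i - 1] - candidate[j - 1]) ** 2
            # extend the cheapest of the three admissible warping steps
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def best_match(series, query, window=None):
    """Brute-force scan: DTW of the query against every subsequence."""
    n = len(query)
    best_pos, best_dist = -1, math.inf
    for start in range(len(series) - n + 1):
        dist = dtw_distance(query, series[start:start + n], window)
        if dist < best_dist:
            best_pos, best_dist = start, dist
    return best_pos, best_dist


if __name__ == "__main__":
    print(best_match([5, 5, 0, 0, 1, 5, 5], [0, 1, 1]))
```

Each candidate subsequence is scored independently in this sketch, so the outer loop in `best_match` is the natural place to split work across threads or nodes.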

MILC Code Performance on High End CPU and GPU Supercomputer Clusters

EPJ Web of Conferences, 2018, Vol 175, pp. 02009, doi:10.1051/epjconf/201817502009

Author(s): Carleton DeTar, Steven Gottlieb, Ruizi Li, Doug Toussaint

Keyword(s): Conjugate Gradient, Memory Hierarchy, Xeon Phi, Intel Xeon Phi, Code Performance, Recent Developments, Knights Landing, Many Core, Intel Xeon

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, deeper memory hierarchies, and greater programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider the performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

Accelerating time series motif discovery in the Intel Xeon Phi KNL processor

The Journal of Supercomputing, 2019, Vol 75 (11), pp. 7053-7075, doi:10.1007/s11227-019-02923-5, Cited By ~ 2

Author(s): Ivan Fernandez, Alejandro Villegas, Eladio Gutierrez, Oscar Plata

Keyword(s): Time Series, Motif Discovery, Xeon Phi, Intel Xeon Phi, Intel Xeon

Training Large Scale Deep Neural Networks on the Intel Xeon Phi Many-Core Coprocessor

2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014, doi:10.1109/ipdpsw.2014.194, Cited By ~ 13

Author(s): Lei Jin, Zhaokang Wang, Rong Gu, Chunfeng Yuan, Yihua Huang

Keyword(s): Neural Networks, Large Scale, Deep Neural Networks, Xeon Phi, Intel Xeon Phi, Many Core, Intel Xeon

Parallelization of Molecular-Dynamics Simulations Using Tasks

MRS Proceedings, 2015, Vol 1753, doi:10.1557/opl.2015.113, Cited By ~ 2

Author(s): Ralf Meyer, Chris M. Mangiardi

Keyword(s): Molecular Dynamics, Molecular Dynamics Simulations, Shared Memory, MD Simulations, Xeon Phi, Intel Xeon Phi, Novel Algorithms, Dynamics Simulations, Many Core, Intel Xeon

This article discusses novel algorithms for molecular-dynamics (MD) simulations with short-ranged forces on modern multi- and many-core processors like the Intel Xeon Phi. A task-based approach to the parallelization of MD on shared-memory computers and a tiling scheme to facilitate the SIMD vectorization of the force calculations are described. The algorithms have been tested with three different potentials, and the resulting speed-ups on Intel Xeon Phi coprocessors are shown.

Many-core needs fine-grained scheduling: A case study of query processing on Intel Xeon Phi processors

Journal of Parallel and Distributed Computing, 2018, Vol 120, pp. 395-404, doi:10.1016/j.jpdc.2017.09.005, Cited By ~ 3

Author(s): Xuntao Cheng, Bingsheng He, Mian Lu, Chiew Tong Lau

Keyword(s): Query Processing, Xeon Phi, Intel Xeon Phi, Fine Grained, Many Core, Intel Xeon

A parallel algorithm of Euclidean distance matrix computation for the Intel Xeon Phi Knights Landing many-core processor

Bulletin of the South Ural State University Series Computational Mathematics and Software Engineering, 2018, Vol 7 (3), doi:10.14529/cmse180305

Keyword(s): Parallel Algorithm, Euclidean Distance, Distance Matrix, Xeon Phi, Intel Xeon Phi, Euclidean Distance Matrix, Matrix Computation, Knights Landing, Many Core, Intel Xeon

Performance Evaluation of an OpenCL Implementation of the Lattice Boltzmann Method on the Intel Xeon Phi

Parallel Processing Letters, 2015, Vol 25 (03), pp. 1541001, doi:10.1142/s0129626415410017, Cited By ~ 1

Author(s): Christian Obrecht, Bernard Tourancheau, Frédéric Kuznik

Keyword(s): Lattice Boltzmann Method, Lattice Boltzmann, Xeon Phi, Intel Xeon Phi, Hardware Architectures, Nvidia GPU, Many Core, Hardware Platforms, Boltzmann Method, Intel Xeon

A portable OpenCL implementation of the lattice Boltzmann method targeting emerging many-core architectures is described. The main purpose of this work is to evaluate and compare the performance of this code on three mainstream hardware architectures available today, namely an Intel CPU, an Nvidia GPU, and the Intel Xeon Phi. Because of the similarities between OpenCL and CUDA, we chose to follow some of the strategies devised to implement efficient lattice Boltzmann solvers on Nvidia GPUs, while remaining as generic as possible. Being fairly configurable, this program makes it possible to ascertain the best options for each hardware platform. The achieved performance is quite satisfactory for both the CPU and the GPU. For the Xeon Phi, however, the results are below expectations. Nevertheless, comparison with data from the literature shows that on this architecture the code seems memory-bound.

Evaluating the Support of MTC Applications on Intel Xeon Phi Many-Core Accelerators

2015 IEEE International Conference on Cluster Computing, 2015, doi:10.1109/cluster.2015.87

Author(s): Poornima Nookala, Serapheim Dimitropoulos, Karl Stough, Ioan Raicu

Keyword(s): Xeon Phi, Intel Xeon Phi, Many Core, Intel Xeon

Cache Locality-Centric Parallel String Matching on Many-Core Accelerator Chips

Scientific Programming, 2015, Vol 2015, pp. 1-20, doi:10.1155/2015/937694, Cited By ~ 1

Author(s): Nhat-Phuong Tran, Myungho Lee, Dong Hoon Choi

Keyword(s): High Performance, Parallel Implementation, String Matching, Processing Unit, Xeon Phi, Intel Xeon Phi, Multiple Threads, The Many, Many Core, Intel Xeon

The Aho-Corasick (AC) algorithm is a multiple-pattern string matching algorithm commonly used in computer and network security and bioinformatics, among many other fields. To meet the highly demanding computational requirements of these applications, achieving high performance for the AC algorithm is crucial. In this paper, we present a high-performance parallelization of the AC algorithm on many-core accelerator chips such as the Graphics Processing Unit (GPU) from Nvidia and the Intel Xeon Phi. Our parallelization approach significantly improves the cache locality of the AC algorithm by partitioning a given set of string patterns into multiple smaller sets of patterns in a space-efficient way. Using the multiple pattern sets, intensive pattern matching operations are conducted concurrently over the whole input text. Compared with previous approaches, where the input data is partitioned among multiple threads instead of partitioning the pattern set, our approach significantly improves performance. Experimental results show that our approach achieves up to 2.73 times speedup on the Nvidia K20 GPU and up to 2.00 times speedup on the Intel Xeon Phi compared with the previous approach. Our parallel implementation delivers up to 693 Gbps throughput on the K20.
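The pattern-set partitioning idea in this abstract can be illustrated with a small serial sketch. It builds one textbook Aho-Corasick automaton per pattern subset and scans the whole text with each. The round-robin split and the sequential loop are assumptions made for illustration: the paper describes a space-efficient partitioning and concurrent execution on GPU and Xeon Phi, neither of which is reproduced here. The names `build_automaton`, `search`, and `partitioned_search` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for Aho-Corasick."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Breadth-first pass: failure links point to the longest proper suffix
    # of the current path that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every match in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


def partitioned_search(text, patterns, parts):
    """Split the pattern set (round-robin, an assumption), scan the whole
    text once per subset, and merge the hits. Each subset automaton is
    smaller than the full one, which is the cache-locality argument."""
    hits = []
    for k in range(parts):
        subset = patterns[k::parts]
        if subset:
            hits.extend(search(text, build_automaton(subset)))
    return sorted(hits)


if __name__ == "__main__":
    pats = ["he", "she", "his", "hers"]
    print(partitioned_search("ushers", pats, parts=2))
```

Because every subset scans the same full text, the per-subset searches are independent and could run on separate threads; the merged result matches a single automaton built over all patterns.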