Many-core needs fine-grained scheduling: A case study of query processing on Intel Xeon Phi processors

2018 ◽  
Vol 120 ◽  
pp. 395-404 ◽  
Author(s):  
Xuntao Cheng ◽  
Bingsheng He ◽  
Mian Lu ◽  
Chiew Tong Lau
2018 ◽  
Vol 175 ◽  
pp. 02009
Author(s):  
Carleton DeTar ◽  
Steven Gottlieb ◽  
Ruizi Li ◽  
Doug Toussaint

With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.


2015 ◽  
Vol 1753 ◽  
Author(s):  
Ralf Meyer ◽  
Chris M. Mangiardi

ABSTRACTThis article discusses novel algorithms for molecular-dynamics (MD) simulations with short-ranged forces on modern multi- and many-core processors like the Intel Xeon Phi. A task-based approach to the parallelization of MD on shared-memory computers and a tiling scheme to facilitate the SIMD vectorization of the force calculations is described. The algorithms have been tested with three different potentials and the resulting speed-ups on Intel Xeon Phi coprocessors are shown.


2015 ◽  
Vol 25 (03) ◽  
pp. 1541001 ◽  
Author(s):  
Christian Obrecht ◽  
Bernard Tourancheau ◽  
Frédéric Kuznik

A portable OpenCL implementation of the lattice Boltzmann method targeting emerging many-core architectures is described. The main purpose of this work is to evaluate and compare the performance of this code on three mainstream hardware architectures available today, namely an Intel CPU, an Nvidia GPU, and the Intel Xeon Phi. Because of the similarities between OpenCL and CUDA, we chose to follow some of the strategies devised to implement efficient lattice Boltzmann solvers on Nvidia GPU, while remaining as generic as possible. Being fairly configurable, this program makes possible to ascertain the best options for each hardware platforms. The achieved performance is quite satisfactory for both the CPU and the GPU. For the Xeon Phi however, the results are below expectations. Nevertheless, comparison with data from the literature shows that on this architecture the code seems memory-bound.


Author(s):  
Poornima Nookala ◽  
Serapheim Dimitropoulos ◽  
Karl Stough ◽  
Ioan Raicu
Keyword(s):  
Xeon Phi ◽  

2015 ◽  
Vol 2015 ◽  
pp. 1-20 ◽  
Author(s):  
Nhat-Phuong Tran ◽  
Myungho Lee ◽  
Dong Hoon Choi

Aho-Corasick (AC) algorithm is a multiple patterns string matching algorithm commonly used in computer and network security and bioinformatics, among many others. In order to meet the highly demanding computational requirements imposed on these applications, achieving high performance for the AC algorithm is crucial. In this paper, we present a high performance parallelization of the AC on the many-core accelerator chips such as the Graphic Processing Unit (GPU) from Nvidia and the Intel Xeon Phi. Our parallelization approach significantly improves the cache locality of the AC by partitioning a given set of string patterns into multiple smaller sets of patterns in a space-efficient way. Using the multiple pattern sets, intensive pattern matching operations are concurrently conducted with respect to the whole input text data. Compared with the previous approaches where the input data is partitioned amongst multiple threads instead of partitioning the pattern set, our approach significantly improves the performance. Experimental results show that our approach leads up to 2.73 times speedup on the Nvidia K20 GPU and 2.00 times speedup on the Intel Xeon Phi compared with the previous approach. Our parallel implementation delivers up to 693 Gbps throughput performance on the K20.


2017 ◽  
Author(s):  
David Jenkins ◽  
Alastair Basden ◽  
Richard Myers

Sign in / Sign up

Export Citation Format

Share Document