Accelerating IDCT Algorithm on Xeon Phi Coprocessor

2013, Vol. 756-759, pp. 3114-3120
Author(s): Jin Qi, Can Qun Yang, Cheng Chen, Qiang Wu, Tao Tang

The Inverse Discrete Cosine Transform (IDCT) is an important operation in image and video decompression, and how to accelerate it has been studied extensively. Recently, Intel introduced the Xeon Phi coprocessor, based on the Many Integrated Core (MIC) architecture. Xeon Phi integrates 61 cores, each with a 512-bit SIMD extension, and thus provides very high performance. In this paper, we employ Knights Corner (a beta version of Xeon Phi) to accelerate the IDCT algorithm. By employing the 512-bit SIMD instructions and a data pre-fetching optimization, our implementation achieves (1) an average speedup of 5.82x over the non-SIMD version, (2) an average 27.3% performance benefit from the data pre-fetching optimization, and (3) an average speedup of 1.53x on one Knights Corner coprocessor over the implementation on one eight-core Intel Xeon E5-2670 CPU.
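The abstract does not show code; as an illustration of the two optimizations it names, below is a minimal sketch of how an 8-point IDCT stage could be vectorized with 512-bit intrinsics and combined with software prefetching. The structure-of-arrays layout, the `idct_basis` table, and the function names are assumptions for illustration, not the authors' implementation, and the intrinsics are shown in their generic form; the instruction set on Knights Corner (IMCI) differs from later AVX-512 in details.

```c
#include <immintrin.h>

/* Hypothetical structure-of-arrays layout: 16 independent 8x8 blocks are
 * processed together, one block per SIMD lane, so each __m512 holds the
 * same coefficient position across 16 blocks. */

/* Precomputed IDCT basis: idct_basis[v][x] = alpha(v) * cos((2x+1)*v*pi/16). */
static float idct_basis[8][8];

/* One 1-D 8-point IDCT pass applied to 16 blocks at once.
 * in[v]  holds DCT coefficient v of the 16 blocks,
 * out[x] receives reconstructed sample x of the same 16 blocks. */
static void idct8_x16(const __m512 in[8], __m512 out[8])
{
    for (int x = 0; x < 8; ++x) {
        __m512 acc = _mm512_setzero_ps();
        for (int v = 0; v < 8; ++v) {
            /* acc += in[v] * basis[v][x], one fused multiply-add per lane */
            acc = _mm512_fmadd_ps(in[v], _mm512_set1_ps(idct_basis[v][x]), acc);
        }
        out[x] = acc;
    }
}

/* Software prefetch of the next group of coefficient blocks while the
 * current group is being transformed (the second optimization the paper
 * reports, worth about 27.3% on average in their measurements). */
static inline void prefetch_next(const float *next_coeffs)
{
    for (int line = 0; line < 8; ++line)   /* 8 cache lines of 16 floats */
        _mm_prefetch((const char *)(next_coeffs + line * 16), _MM_HINT_T0);
}
```

Vectorizing across independent blocks rather than within a single 8x8 block is one common way to fill 16-wide SIMD lanes; the paper's exact vectorization strategy is not described in the abstract.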

2015, Vol. 2015, pp. 1-20
Author(s): Nhat-Phuong Tran, Myungho Lee, Dong Hoon Choi

The Aho-Corasick (AC) algorithm is a multiple-pattern string-matching algorithm commonly used in computer and network security and in bioinformatics, among many other fields. In order to meet the demanding computational requirements these applications impose, achieving high performance for the AC algorithm is crucial. In this paper, we present a high-performance parallelization of the AC algorithm on many-core accelerator chips such as the Nvidia Graphics Processing Unit (GPU) and the Intel Xeon Phi. Our parallelization approach significantly improves the cache locality of the AC algorithm by partitioning a given set of string patterns into multiple smaller pattern sets in a space-efficient way. Using these multiple pattern sets, intensive pattern-matching operations are conducted concurrently over the whole input text. Compared with previous approaches, in which the input data is partitioned among multiple threads instead of partitioning the pattern set, our approach significantly improves performance. Experimental results show that our approach yields up to a 2.73x speedup on the Nvidia K20 GPU and a 2.00x speedup on the Intel Xeon Phi compared with the previous approach. Our parallel implementation delivers up to 693 Gbps of throughput on the K20.
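To make the partitioning idea concrete, here is a minimal sketch of the thread-level structure it implies: the pattern set is split into several smaller sets, one automaton is built per subset so each automaton fits in cache, and every automaton scans the entire input. The `ac_build`/`ac_search`/`ac_free` API and the round-robin split are placeholders, not the paper's actual data structures or its space-efficient partitioning heuristic.

```c
#include <stddef.h>
#include <stdlib.h>
#include <omp.h>

/* Placeholder Aho-Corasick API (not the paper's implementation):
 * ac_build constructs an automaton from a subset of patterns,
 * ac_search scans 'text' and returns the number of matches found. */
typedef struct ac_automaton ac_automaton;
ac_automaton *ac_build(const char **patterns, size_t num_patterns);
size_t ac_search(const ac_automaton *ac, const char *text, size_t len);
void ac_free(ac_automaton *ac);

/* Partition the patterns into 'num_sets' smaller sets (here a simple
 * round-robin split), build one small automaton per set, and let each
 * automaton scan the WHOLE input concurrently. */
size_t match_partitioned(const char **patterns, size_t num_patterns,
                         const char *text, size_t len, int num_sets)
{
    size_t total_matches = 0;

#pragma omp parallel for reduction(+ : total_matches)
    for (int s = 0; s < num_sets; ++s) {
        /* Collect every num_sets-th pattern into subset s. */
        const char **subset = malloc(num_patterns * sizeof *subset);
        size_t count = 0;
        for (size_t p = (size_t)s; p < num_patterns; p += (size_t)num_sets)
            subset[count++] = patterns[p];

        ac_automaton *ac = ac_build(subset, count);
        total_matches += ac_search(ac, text, len); /* full input, small automaton */
        ac_free(ac);
        free(subset);
    }
    return total_matches;
}
```

The design trade-off this illustrates is that each smaller automaton stays resident in cache while scanning, at the cost of reading the input text once per pattern subset.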


2018, Vol. 11 (11), pp. 4621-4635
Author(s): Istvan Z. Reguly, Daniel Giles, Devaraj Gopinathan, Laure Quivy, Joakim H. Beck, ...

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and implementation: a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved by keeping the scientific code separate from the various parallel implementations, enabling easy maintainability. The code has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 shows an ability to deliver productivity as well as performance and portability to its users across a number of platforms.
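As a rough illustration of the separation the abstract describes, the sketch below shows the OP2 style of writing the science as a small per-edge kernel and expressing the mesh traversal as a parallel loop. The kernel body, data names, and argument layout are simplified placeholders modeled on OP2's `op_par_loop`/`op_arg_dat` interface rather than VOLNA-OP2's actual source, and exact OP2 call signatures may differ between versions.

```c
/* Schematic OP2-style loop (modeled on the OP2 API; not VOLNA-OP2 source).
 * The "scientific" kernel below is plain C and knows nothing about how the
 * loop is parallelized; OP2 generates CPU, Xeon Phi, or GPU code around it. */
#include "op_seq.h"   /* OP2 developer/sequential header (assumed available) */

/* Per-edge kernel: read the state of the two cells sharing an edge and
 * write a placeholder flux estimate (not a real Riemann solver). */
static void compute_flux(const double *cell_left, const double *cell_right,
                         double *flux)
{
    for (int i = 0; i < 4; ++i)
        flux[i] = 0.5 * (cell_left[i] + cell_right[i]);
}

void flux_loop(op_set edges, op_map edge_to_cell,
               op_dat cell_state, op_dat edge_flux)
{
    /* OP2 decides how to execute this over the unstructured mesh:
     * OpenMP on CPUs or Xeon Phi, CUDA on GPUs, MPI across nodes. */
    op_par_loop(compute_flux, "compute_flux", edges,
                op_arg_dat(cell_state, 0, edge_to_cell, 4, "double", OP_READ),
                op_arg_dat(cell_state, 1, edge_to_cell, 4, "double", OP_READ),
                op_arg_dat(edge_flux, -1, OP_ID, 4, "double", OP_WRITE));
}
```

Because the kernel is declarative about what it reads and writes, the same source can be retargeted to the platforms listed in the abstract without touching the numerical code.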

