A High Performance FFT Processor Based on Conflict-Free Memory Access

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency, so designing a high-performance DGEMM for ARMv8-based SoCs is worthwhile. However, as ARMv8-based SoCs integrate more and more cores, modern CPUs adopt non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which makes developing a high-performance DGEMM on multi-NUMA architectures challenging. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows task independence and data localization within NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, which significantly enhances the scalability of DGEMM and increases its performance by 17.1% on average, with a maximum improvement of 21.9%.
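The two-level decomposition described in this abstract can be illustrated with a minimal sketch. It is not the OpenBLAS implementation: the helper names `split_range`, `two_level_partition`, and `dgemm_rows`, the choice of splitting the rows of C, and the node and thread counts are all assumptions made for illustration. The outer level gives each NUMA node a contiguous block of rows of C, and the inner level divides that block among the node's threads, so each thread writes only rows owned by its own node.

```python
def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(stop - start, parts)
    chunks = []
    lo = start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def two_level_partition(m, num_nodes, threads_per_node):
    """Map (node, thread) -> row range of C. Hypothetical helper, not an OpenBLAS API."""
    plan = {}
    # Level 1: one contiguous row block per NUMA node.
    for node, (n_lo, n_hi) in enumerate(split_range(0, m, num_nodes)):
        # Level 2: divide the node's block among that node's threads.
        for tid, rows in enumerate(split_range(n_lo, n_hi, threads_per_node)):
            plan[(node, tid)] = rows
    return plan


def dgemm_rows(a, b, c, rows):
    """Compute C[rows] += A[rows] * B for one thread's row block (pure Python, for clarity only)."""
    lo, hi = rows
    k, n = len(b), len(b[0])
    for i in range(lo, hi):
        for p in range(k):
            aip = a[i][p]
            for j in range(n):
                c[i][j] += aip * b[p][j]


# Example with assumed sizes: 10 rows of C over 2 NUMA nodes with 2 threads each.
plan = two_level_partition(10, 2, 2)
```

In a real implementation each row block of A and C would also be allocated on, and each thread pinned to, the owning node; the sketch shows only how the work is divided.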

A proposal for very high performance FFT processor architectures

10.1109/icassp.1985.1168168 ◽

2005 ◽

Cited By ~ 1

Author(s):

K. Siomalas ◽

B. Bowen

Keyword(s):

High Performance ◽

Processor Architectures ◽

Fft Processor ◽

Very High

Study and Optimization of T-Tree Index in Main Memory Database

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.427-429.2531 ◽

2013 ◽

Vol 427-429 ◽

pp. 2531-2535 ◽

Cited By ~ 1

Author(s):

Feng Dong Sun ◽

Quan Guo ◽

Lan Wang

Keyword(s):

High Performance ◽

Main Memory ◽

Memory Access ◽

Index Structure ◽

Index Structures ◽

Clock Speed ◽

Main Memory Database ◽

Tree Index ◽

Overall Performance

In a main memory database, the bottleneck is not disk I/O but the gap between the CPU clock speed and the slower memory speed. To achieve high performance in a main memory database, a good approach is to design new index structures that improve memory access speed. This chapter first presents the T-tree index structure and its algorithms in a main memory database, then presents two optimizations of the T-tree index: the T-tail tree and the TTB-tree. Our results indicate that the T-tree provides good overall performance in main memory.
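As commonly described in the literature, a T-tree node stores a sorted array of keys plus left and right child pointers. A lookup compares the key with the node's minimum and maximum, descends left or right if the key falls outside that range, and otherwise binary-searches inside the node. The sketch below illustrates only that lookup. The class and function names are hypothetical, nodes are assumed to be non-empty, and insertion, rebalancing, and the T-tail tree and TTB-tree variants named in the abstract are not covered.

```python
from bisect import bisect_left


class TTreeNode:
    """One T-tree node: a sorted, non-empty key array plus left/right children (illustrative layout)."""

    def __init__(self, keys, left=None, right=None):
        self.keys = sorted(keys)
        self.left = left
        self.right = right


def ttree_search(node, key):
    """Return True if `key` is stored in the tree rooted at `node`."""
    while node is not None:
        if key < node.keys[0]:
            # Key is below this node's range: continue in the left subtree.
            node = node.left
        elif key > node.keys[-1]:
            # Key is above this node's range: continue in the right subtree.
            node = node.right
        else:
            # Key falls inside this node's bounds: binary search within the node.
            i = bisect_left(node.keys, key)
            return i < len(node.keys) and node.keys[i] == key
    return False


# Small hand-built tree with assumed keys.
root = TTreeNode(
    [40, 45, 50],
    left=TTreeNode([10, 20, 30]),
    right=TTreeNode([60, 70, 80]),
)
```

Because each node covers a whole range of keys, most comparisons on the way down involve only a node's two boundary keys, and the full binary search runs at most once.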

Data structures access model for remote shared memory

E3S Web of Conferences ◽

10.1051/e3sconf/202124407001 ◽

2021 ◽

Vol 244 ◽

pp. 07001

Author(s):

Anatoliy Nyrkov ◽

Konstantin Ianiushkin ◽

Andrey Nyrkov ◽

Yulia Romanova ◽

Vagiz Gaskarov

Keyword(s):

Shared Memory ◽

Data Structures ◽

Data Model ◽

High Performance ◽

Direct Memory Access ◽

Performance Comparison ◽

Memory Access ◽

Memory Storage ◽

Race Conditions ◽

Performance Computing

Recent achievements in high-performance computing have significantly narrowed the performance gap between single-node and multi-node computing and opened up opportunities for systems with remote shared memory. The combination of in-memory storage, remote direct memory access, and remote calls requires rethinking how data is organized, protected, and queried in distributed systems. The reviewed models let us implement new interpretations of distributed algorithms, allowing us to validate different approaches to avoiding race conditions and decreasing resource acquisition or synchronization time. In this paper, we describe a data model for mixed memory access and analyze optimized data structures. We also report experimental results that compare the performance of data structures operating with different approaches, evaluate the limitations of these models, and show that the model does not always meet expectations. The purpose of this paper is to assist developers in designing data structures that help achieve architectural benefits or improve the design of existing distributed systems.

High-Performance Pipelined FFT Processor Based on Radix-2² for OFDM Applications

Intelligent Computing and Information and Communication - Advances in Intelligent Systems and Computing ◽

10.1007/978-981-10-7245-1_15 ◽

2018 ◽

pp. 143-151

Author(s):

Manish Bansal ◽

Sangeeta Nakhate

Keyword(s):

High Performance ◽

Fft Processor

Intelligent High Performance Memory Access Technique in Aspect of DDR3

IOSR Journal of VLSI and Signal processing ◽

10.9790/4200-0321722 ◽

2013 ◽

Vol 3 (2) ◽

pp. 17-22

Author(s):

Jahid Hasan

Keyword(s):

High Performance ◽

Memory Access ◽

Access Technique

High-performance VLSI architecture for three-dimensional instrumentation based on a new concurrent memory-access scheme

Proceedings of APCCAS'96 - Asia Pacific Conference on Circuits and Systems ◽

10.1109/apcas.1996.569323 ◽

2002 ◽

Cited By ~ 1

Author(s):

S. Lee ◽

M. Hariyama ◽

M. Kameyama

Keyword(s):

High Performance ◽

Three Dimensional ◽

Vlsi Architecture ◽

Memory Access ◽

Access Scheme

The Effect of Mesh Reordering on Laplacian Smoothing for Nonuniform Memory Access Architecture-based High Performance Computing Systems

Journal of the Institute of Electronics and Information Engineers ◽

10.5573/ieie.2014.51.3.082 ◽

2014 ◽

Vol 51 (3) ◽

pp. 82-88

Author(s):

Jibum Kim

Keyword(s):

High Performance Computing ◽

High Performance ◽

Memory Access ◽

Computing Systems ◽

Laplacian Smoothing ◽

Performance Computing ◽

Mesh Reordering

Communicating Efficiently on Cluster-Based Remote Direct Memory Access (RDMA) over InfiniBand Protocol

Applied Sciences ◽

10.3390/app8112034 ◽

2018 ◽

Vol 8 (11) ◽

pp. 2034

Author(s):

Masoud Hemmatpour ◽

Bartolomeo Montrucchio ◽

Maurizio Rebaudengo

Keyword(s):

Distributed Systems ◽

Real World ◽

High Performance ◽

Direct Memory Access ◽

Distributed Applications ◽

Memory Access ◽

Experimental Results ◽

Distributed Application ◽

Communication Paradigm

Distributed systems are commonly built under the assumption that the network is the primary bottleneck; however, this assumption no longer holds with the emergence of high-performance RDMA-enabled protocols in datacenters. Designing distributed applications over such protocols requires a fundamental rethinking of communication components in comparison with traditional protocols (i.e., TCP/IP). In this paper, communication paradigms in existing systems and new possible paradigms are investigated. The advantages and drawbacks of each paradigm are comprehensively analyzed and experimentally evaluated. The experimental results show that writing the requests to the server and reading the response yields up to 10 times better performance than other communication paradigms. To further expand the investigation, the proposed communication paradigm was substituted into a real-world distributed application, improving its performance by up to seven times.
