A High Performance FFT Processor Based on Conflict-Free Memory Access

Author(s):  
Long Pang ◽  
Xin Qi ◽  
Yue-dong Luo ◽  
Yi-zhuang Xie
Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency. It is therefore worthwhile to design a high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate more and more cores, modern CPUs use non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which makes it challenging to develop a high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which gives NUMA nodes independent tasks and localized data. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, which significantly enhances the scalability of DGEMM and increases its performance by 17.1% on average, with a maximum improvement of 21.9%.
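
The two levels of parallelism mentioned in the abstract can be illustrated with a small partitioning sketch. This is not the authors' OpenBLAS implementation: the function names, the choice to split by rows of the output matrix, and the node and thread counts are all assumptions made for illustration, and no actual NUMA binding is performed.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open row range [start, end)


def split_range(start: int, end: int, parts: int) -> List[Range]:
    """Split [start, end) into `parts` contiguous, near-equal chunks."""
    base, extra = divmod(end - start, parts)
    chunks = []
    lo = start
    for i in range(parts):
        # The first `extra` chunks receive one additional row.
        hi = lo + base + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def two_level_partition(m: int, num_nodes: int,
                        threads_per_node: int) -> List[List[Range]]:
    """Hypothetical two-level split of m output rows.

    Level 1: one contiguous block of rows per NUMA node.
    Level 2: each node's block is divided among that node's threads,
    so every thread only touches rows owned by its own node.
    """
    return [split_range(lo, hi, threads_per_node)
            for lo, hi in split_range(0, m, num_nodes)]


# Illustrative sizes only (assumed, not taken from the paper).
plan = two_level_partition(m=1000, num_nodes=4, threads_per_node=3)
for node_id, thread_ranges in enumerate(plan):
    print(node_id, thread_ranges)
```

In a real implementation, each node's block of the matrices would also need to be allocated in that node's local memory and its threads pinned to that node's cores; the sketch covers only the index arithmetic.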


2013 ◽  
Vol 427-429 ◽  
pp. 2531-2535 ◽  
Author(s):  
Feng Dong Sun ◽  
Quan Guo ◽  
Lan Wang

In a main memory database, the bottleneck is not disk I/O but the gap between CPU clock speed and the slower memory speed. To achieve high performance in a main memory database, a good approach is to design new index structures that improve memory access speed. This chapter first presents a T-tree index structure and its algorithms in a main memory database. It then presents two results on the optimization of the T-tree index, including the T-tail tree and the TTB-tree. Our results indicate that the T-tree provides good overall performance in main memory.
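
As a minimal sketch of the general T-tree idea (not the chapter's own algorithms, and not the T-tail tree or TTB-tree variants), each node stores a sorted array of keys plus left and right child pointers. A lookup compares the key with the node's minimum and maximum and only searches the array once the bounding node is found. The class and function names below are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TNode:
    keys: List[int]                  # sorted, non-empty key array
    left: Optional["TNode"] = None   # subtree with keys below keys[0]
    right: Optional["TNode"] = None  # subtree with keys above keys[-1]


def ttree_search(node: Optional[TNode], key: int) -> bool:
    """Return True if `key` is stored in the T-tree rooted at `node`."""
    while node is not None:
        if key < node.keys[0]:
            node = node.left          # key is below this node's range
        elif key > node.keys[-1]:
            node = node.right         # key is above this node's range
        else:
            # Bounding node found: binary search inside its array.
            i = bisect_left(node.keys, key)
            return i < len(node.keys) and node.keys[i] == key
    return False


# Small hand-built tree with assumed example keys.
root = TNode([40, 45, 50],
             left=TNode([10, 20, 30]),
             right=TNode([60, 70, 80]))

print(ttree_search(root, 45))
print(ttree_search(root, 35))
```

Because only the minimum and maximum of each node are compared during descent, most comparisons touch two array slots per node, which is one reason this layout suits in-memory indexes.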


2021 ◽  
Vol 244 ◽  
pp. 07001
Author(s):  
Anatoliy Nyrkov ◽  
Konstantin Ianiushkin ◽  
Andrey Nyrkov ◽  
Yulia Romanova ◽  
Vagiz Gaskarov

Recent achievements in high-performance computing have significantly narrowed the performance gap between single-node and multi-node computing and have opened up opportunities for systems with remote shared memory. The combination of in-memory storage, remote direct memory access, and remote calls requires rethinking how data is organized, protected, and queried in distributed systems. The reviewed models let us implement new interpretations of distributed algorithms and validate different approaches to avoiding race conditions and decreasing resource acquisition or synchronization time. In this paper, we describe a data model for mixed memory access and analyze optimized data structures. We also present experimental results that compare the performance of data structures operating under different approaches, evaluate the limitations of these models, and show that the model does not always meet expectations. The purpose of this paper is to assist developers in designing data structures that help achieve architectural benefits or improve the design of existing distributed systems.


2018 ◽  
Vol 8 (11) ◽  
pp. 2034
Author(s):  
Masoud Hemmatpour ◽  
Bartolomeo Montrucchio ◽  
Maurizio Rebaudengo

Distributed systems are commonly built under the assumption that the network is the primary bottleneck; however, this assumption no longer holds with the emergence of high-performance RDMA-enabled protocols in datacenters. Designing distributed applications over such protocols requires a fundamental rethinking of communication components in comparison with traditional protocols (i.e., TCP/IP). In this paper, communication paradigms in existing systems and possible new paradigms are investigated. The advantages and drawbacks of each paradigm are comprehensively analyzed and experimentally evaluated. The experimental results show that writing requests to the server and reading the response performs up to 10 times better than other communication paradigms. To further extend the investigation, the proposed communication paradigm was substituted into a real-world distributed application, and its performance improved by up to seven times.

