multicore processors
Recently Published Documents


TOTAL DOCUMENTS

660
(FIVE YEARS 35)

H-INDEX

31
(FIVE YEARS 0)

2021 ◽  
Author(s):  
Srijeeta Maity ◽  
Anirban Ghose ◽  
Soumyajit Dey ◽  
Sangyoung Park ◽  
Samarjit Chakrabarty




PLoS ONE ◽  
2021 ◽  
Vol 16 (9) ◽  
pp. e0257047
Author(s):  
Adrián Lamela ◽  
Óscar G. Ossorio ◽  
Guillermo Vinuesa ◽  
Benjamín Sahelices

Non-volatile memory technology is now available in commodity hardware. This technology can be used as a backup memory for an external dram cache memory without needing to modify the software. However, the higher read and write latencies of non-volatile memory may exacerbate the memory wall problem. In this work we present a novel off-chip prefetch technique based on a Hidden Markov Model that specifically deals with the latency problem caused by complexity of off-chip memory access patterns. Firstly, we present a thorough analysis of off-chip memory access patterns to identify its complexity in multicore processors. Based on this study, we propose a prefetching module located in the llc which uses two small tables, and where the computational complexity of which is linear with the number of computing threads. Our Markov-based technique is able to keep track and make clustering of several simultaneous groups of memory accesses coming from multiple simultaneous threads in a multicore processor. It can quickly identify complex address groups and trigger prefetch with very high accuracy. Our simulations show an improvement of up to 76% in the hit ratio of an off-chip dram cache for multicore architecture over the conventional prefetch technique (g/dc). Also, the overhead of prefetch requests (failed prefetches) is reduced by 48% in single core simulations and by 83% in multicore simulations.



Mathematics ◽  
2021 ◽  
Vol 9 (18) ◽  
pp. 2248
Author(s):  
Gaku Ishii ◽  
Yusaku Yamamoto ◽  
Takeshi Takaishi

We aim to accelerate the linear equation solver for crack growth simulation based on the phase field model. As a first step, we analyze the properties of the coefficient matrices and prove that they are symmetric positive definite. This justifies the use of the conjugate gradient method with the efficient incomplete Cholesky preconditioner. We then parallelize this preconditioner using so-called block multi-color ordering and evaluate its performance on multicore processors. The experimental results show that our solver scales well and achieves an acceleration of several times over the original solver based on the diagonally scaled CG method.





Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become the candidates for the next-generation HPC systems with their highly competitive performance and energy efficiency. Therefore, it is meaningful to design high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate increasing cores, modern CPU uses non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This poses a challenge to develop high-performance DGEMM on multi-NUMA architecture. We present a NUMA-aware method to reduce the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism between and within nodes in a purely threaded implementation, which allows the task independence and data localization of NUMA nodes. We have implemented NUMA-aware DGEMM in the OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM has effectively reduced the number of cross-die and cross-chip memory access, resulting in enhancing the scalability of DGEMM significantly and increasing the performance of DGEMM by 17.1% on average, with the most remarkable improvement being 21.9%.



Symmetry ◽  
2021 ◽  
Vol 13 (8) ◽  
pp. 1488
Author(s):  
Basharat Mahmood ◽  
Naveed Ahmad ◽  
Majid Iqbal Khan ◽  
Adnan Akhunzada

The use of real-time systems is growing at an increasing rate. This raises the power efficiency as the main challenge for system designers. Power asymmetric multicore processors provide a power-efficient platform for building complex real-time systems. The utilization of this efficient platform can be further enhanced by adopting proficient scheduling policies. Unfortunately, the research on real-time scheduling of power asymmetric multicore processors is in its infancy. In this research, we have addressed this problem and added new results. We have proposed a dynamic-priority semi-partitioned algorithm named: Earliest-Deadline First with C=D Task Splitting (EDFwC=D-TS) for scheduling real-time applications on power asymmetric multicore processors. EDFwC=D-TS outclasses its counterparts in terms of system utilization. The simulation results show that EDFwC=D-TS schedules up to 67% more tasks with heavy workloads. Furthermore, it improves the processor utilization up to 11% and on average uses 14% less cores to schedule the given workload.





Sign in / Sign up

Export Citation Format

Share Document