High Performance Parallelization of COMPSYN on a Cluster of Multicore Processors with GPUs

2012, Vol 9, pp. 966-975
Author(s): Ferdinando Alessi, Annalisa Massini, Roberto Basili
Electronics, 2021, Vol 10 (16), pp. 1984
Author(s): Wei Zhang, Zihao Jiang, Zhiguang Chen, Nong Xiao, Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency. It is therefore worthwhile to design a high-performance DGEMM for ARMv8-based SoCs. However, as these SoCs integrate more and more cores, modern CPUs adopt non-uniform memory access (NUMA), which restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory accesses. The critical enabler for NUMA-aware DGEMM is leveraging two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows task independence and data localization across NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng 920 architecture. The results show that NUMA-aware DGEMM effectively reduces cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the largest improvement being 21.9%.
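As a rough illustration of the data-placement idea, the C/OpenMP sketch below (not the authors' OpenBLAS implementation) partitions A and C by rows among threads and relies on the operating system's first-touch policy, so that each thread initializes, and therefore owns, the pages it will later compute on; the matrix size, the row partitioning, and the naive inner kernel are assumptions chosen for brevity.

/*
 * Minimal sketch of the NUMA-aware idea described above, not the authors'
 * OpenBLAS code: split A and C by rows, rely on first-touch placement so
 * each thread's pages land in its local NUMA domain, and keep writes local.
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 2048                        /* assumed square problem size */

static double *A, *B, *C;

/* Each thread touches the rows of A and C it will later work on, so the
 * first-touch policy places those pages in the thread's local domain. */
static void numa_aware_init(void)
{
    #pragma omp parallel
    {
        int t = omp_get_thread_num(), nt = omp_get_num_threads();
        int rows = N / nt, r0 = t * rows, r1 = (t == nt - 1) ? N : r0 + rows;
        for (int i = r0; i < r1; ++i)
            for (int j = 0; j < N; ++j) {
                A[i * (long)N + j] = 1.0;
                C[i * (long)N + j] = 0.0;
            }
        /* B is read by all threads; interleaving or replicating it per
         * node is where a real NUMA-aware DGEMM invests more effort. */
        #pragma omp single
        for (long k = 0; k < (long)N * N; ++k) B[k] = 1.0;
    }
}

/* Each thread multiplies only its local row block: no remote writes. */
static void numa_aware_dgemm(void)
{
    #pragma omp parallel
    {
        int t = omp_get_thread_num(), nt = omp_get_num_threads();
        int rows = N / nt, r0 = t * rows, r1 = (t == nt - 1) ? N : r0 + rows;
        for (int i = r0; i < r1; ++i)
            for (int k = 0; k < N; ++k) {
                double a = A[i * (long)N + k];
                for (int j = 0; j < N; ++j)
                    C[i * (long)N + j] += a * B[k * (long)N + j];
            }
    }
}

int main(void)
{
    A = malloc(sizeof(double) * N * N);
    B = malloc(sizeof(double) * N * N);
    C = malloc(sizeof(double) * N * N);
    if (!A || !B || !C) return 1;
    numa_aware_init();
    double t0 = omp_get_wtime();
    numa_aware_dgemm();
    printf("C[0] = %f, %.3f s\n", C[0], omp_get_wtime() - t0);
    free(A); free(B); free(C);
    return 0;
}

Compiled with gcc -O2 -fopenmp and run with threads pinned to cores (for example OMP_PROC_BIND=close OMP_PLACES=cores), each row block stays within one NUMA domain, which is the effect the paper exploits at a far more sophisticated level inside OpenBLAS.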


Author(s): Ram Prasad Mohanty, Ashok Kumar Turuk, Bibhudatta Sahoo

The growing number of cores increases the demand for a powerful memory subsystem, which leads to larger caches in multicore processors. Caches give the processing elements a faster, higher-bandwidth local memory to work with. In this chapter, an attempt has been made to analyze the impact of cache size on the performance of multicore processors by varying the L1 and L2 cache sizes in a multicore processor with an internal network (MPIN) referenced from the Niagara architecture. As the number of cores increases, traditional on-chip interconnects such as the bus and crossbar prove inefficient and suffer from poor scalability. To overcome the scalability and efficiency issues of these conventional interconnects, a ring-based design has been proposed. The effect of the interconnect on the performance of multicore processors has been analyzed, and a novel scalable on-chip interconnection mechanism (INoC) for multicore processors has been proposed. Benchmark results are presented using a full-system simulator. The results show that, with the proposed INoC, execution time is significantly reduced compared with the MPIN.
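To make the cache-size effect concrete, the short C sketch below (a toy model, not the chapter's MPIN/INoC full-system simulation) drives a direct-mapped cache with a repeated sweep over a synthetic 256 KB working set; the miss rate drops sharply once the simulated cache is large enough to hold the working set. Line size, working-set size, and the access pattern are assumptions chosen purely for illustration.

/*
 * Toy direct-mapped cache model: repeated sequential sweep over a fixed
 * working set, reporting the miss rate for several cache sizes.
 */
#include <stdio.h>

#define LINE 64                   /* cache line size in bytes (assumed) */
#define WSET (256 * 1024)         /* working set swept repeatedly, 256 KB */
#define ACCESSES 1000000

static double miss_rate(long cache_bytes)
{
    static long tags[1 << 20];    /* large enough for all sizes below */
    long lines = cache_bytes / LINE;
    for (long i = 0; i < lines; ++i) tags[i] = -1;

    long misses = 0;
    for (long i = 0; i < ACCESSES; ++i) {
        long addr = (i * 8) % WSET;          /* one 8-byte word per access */
        long line = addr / LINE;
        long set  = line % lines;            /* direct-mapped placement */
        if (tags[set] != line) { tags[set] = line; ++misses; }
    }
    return (double)misses / ACCESSES;
}

int main(void)
{
    for (long kb = 16; kb <= 1024; kb *= 2)
        printf("%5ld KB cache -> miss rate %.3f\n", kb, miss_rate(kb * 1024));
    return 0;
}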


2015, Vol 26 (4), pp. 18-43
Author(s): Markus Endres, Werner Kießling

The problem of Skyline computation has attracted considerable research attention in the last decade. A Skyline query selects those tuples from a dataset that are optimal with respect to a set of designated preference attributes. Since multicore processors have gone mainstream, it has become imperative to develop parallel algorithms that fully exploit the advantages of such modern hardware architectures. In this paper, the authors present high-performance parallel Skyline algorithms based on the lattice structure generated by a Skyline query. For this, they propose different evaluation strategies and compare several data structures for the parallel evaluation of Skyline queries. The authors present novel optimization techniques for lattice-based Skyline algorithms based on pruning and on removing one unrestricted attribute domain. They demonstrate through comprehensive experiments on synthetic and real datasets that their new algorithms outperform state-of-the-art multicore Skyline techniques for low-cardinality domains. The authors' algorithms have linear runtime complexity and fully exploit modern hardware architectures.
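The C sketch below is a simplified reading of the lattice idea for low-cardinality domains, not the authors' algorithm: every attribute-value combination is a lattice node, combinations occurring in the data are marked present, dominance is propagated level by level (a node depends only on its immediate predecessors, so each level can be processed in parallel with OpenMP), and the Skyline consists of the present, non-dominated combinations. Dimensionality, domain cardinality, and the example data are assumptions; smaller attribute values are taken to be preferred.

#include <stdio.h>

#define D 3                  /* number of preference attributes (assumed) */
#define C 4                  /* cardinality of each attribute domain (assumed) */
#define NODES (C * C * C)    /* size of the lattice */

static int present[NODES];   /* combination occurs in the dataset */
static int dominated[NODES]; /* combination is dominated by a present one */

static int id(const int v[D]) { return (v[0] * C + v[1]) * C + v[2]; }

int main(void)
{
    /* Small example dataset: rows of D low-cardinality attribute values. */
    int data[][D] = { {0,2,3}, {1,1,1}, {2,0,3}, {1,1,2}, {3,3,0}, {2,2,2} };
    int n = sizeof data / sizeof data[0];
    for (int i = 0; i < n; ++i) present[id(data[i])] = 1;

    /* Walk lattice levels (sum of components) in increasing order; nodes
     * within one level are independent, hence the parallel loop. */
    for (int level = 1; level <= D * (C - 1); ++level) {
        #pragma omp parallel for
        for (int node = 0; node < NODES; ++node) {
            int v[D] = { node / (C * C), (node / C) % C, node % C };
            if (v[0] + v[1] + v[2] != level) continue;
            for (int a = 0; a < D && !dominated[node]; ++a) {
                if (v[a] == 0) continue;
                int p[D] = { v[0], v[1], v[2] };
                p[a] -= 1;                       /* immediate predecessor */
                if (present[id(p)] || dominated[id(p)])
                    dominated[node] = 1;
            }
        }
    }

    /* Skyline = tuples whose combination is present and not dominated. */
    for (int i = 0; i < n; ++i)
        if (!dominated[id(data[i])])
            printf("skyline tuple: (%d,%d,%d)\n",
                   data[i][0], data[i][1], data[i][2]);
    return 0;
}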


Author(s): NAGASHYAM P, VIJAY KUMAR T

About 50 million people worldwide suffer from epilepsy, a neurological disorder characterized by seizures. The primary tool for diagnosing an epileptic seizure is electroencephalography (EEG), which records the brain's spontaneous electrical activity. This requires placing a minimum of 16 electrodes on the scalp, with each electrode interpreted as a channel. Seizure detection and analysis techniques mainly work in two stages: features are extracted from the raw EEG data in the first stage, and the obtained features are then used as input to the classification process in the second stage. Traditionally, seizure detection algorithms have been implemented on DSP processors or FPGAs, but these single-core platforms are constrained in speed of operation and power consumption. There is a pressing need to reduce the power consumption and increase the speed of EEG seizure detection systems. This problem can be addressed using multicore processors, which process data simultaneously. This project presents a high-performance multicore platform for EEG-based seizure detection and analysis. The platform performs continuous multichannel detection and analysis of seizures for epilepsy patients. The detection unit detects seizures through a feature extraction process; once a seizure is detected, it enables the analysis circuit, which processes the data with an Urdhva Tiryagbhyam-based 128-point FFT and transmits the energy and frequency content of the EEG data. All proposed blocks are simulated and synthesized using Xilinx ISE, and the coding is done in Verilog.
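As a plain-C software reference for the analysis stage (the actual design is Verilog hardware whose multipliers use the Urdhva Tiryagbhyam sutra, which is not reproduced here), the sketch below runs a 128-point radix-2 FFT over one synthetic single-channel window and reports the spectral energy and dominant frequency; the sampling rate and the test signal are assumptions made for illustration.

#include <complex.h>
#include <math.h>
#include <stdio.h>

#define NFFT 128        /* transform length used by the analysis stage */
#define FS 256.0        /* assumed sampling rate in Hz */

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Iterative radix-2 Cooley-Tukey FFT, computed in place. */
static void fft(double complex x[NFFT])
{
    for (int i = 1, j = 0; i < NFFT; ++i) {        /* bit-reversal permutation */
        int bit = NFFT >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
    }
    for (int len = 2; len <= NFFT; len <<= 1) {    /* butterfly stages */
        double complex wl = cexp(-2.0 * I * M_PI / len);
        for (int i = 0; i < NFFT; i += len) {
            double complex w = 1.0;
            for (int k = 0; k < len / 2; ++k) {
                double complex u = x[i + k], v = x[i + k + len / 2] * w;
                x[i + k] = u + v;
                x[i + k + len / 2] = u - v;
                w *= wl;
            }
        }
    }
}

int main(void)
{
    /* Synthetic single-channel window: a strong 6 Hz component plus a
     * weaker 20 Hz component, standing in for one EEG channel. */
    double complex x[NFFT];
    for (int n = 0; n < NFFT; ++n)
        x[n] = sin(2 * M_PI * 6.0 * n / FS) + 0.3 * sin(2 * M_PI * 20.0 * n / FS);

    fft(x);

    double energy = 0.0;
    int peak = 1;
    for (int k = 1; k < NFFT / 2; ++k) {           /* positive frequencies */
        double p = cabs(x[k]) * cabs(x[k]);
        energy += p;
        if (p > cabs(x[peak]) * cabs(x[peak])) peak = k;
    }
    printf("spectral energy: %.1f, dominant frequency: %.1f Hz\n",
           energy, peak * FS / NFFT);
    return 0;
}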


2011, Vol 16 (7), pp. 1-11
Author(s): Seung-Gu Kang, Hong-Jun Choi, Jin-Woo Ahn, Jae-Hyung Park, Jong-Myon Kim, ...

Author(s): Héctor Martínez, Sergio Barrachina, Maribel Castillo, Joaquín Tárraga, Ignacio Medina, ...

The advances in genomic sequencing during the past few years have motivated the development of fast and reliable software for DNA/RNA sequencing on current high-performance architectures. Most of these efforts target multicore processors, only a few can also exploit graphics processing units, and a much smaller set will run on clusters equipped with any of these multi-threaded architecture technologies. Furthermore, the examples that can be used on clusters today are all strongly coupled to a particular aligner. In this paper we introduce an alignment framework that can be leveraged to run any “single-node” aligner in a coordinated manner, taking advantage of the resources of a cluster without having to modify any portion of the original software. The key to this transparent migration lies in hiding the complexity associated with multi-node execution (such as coordinating the processes running on the cluster nodes) inside the generic-aligner framework. Moreover, following the design and operation of our Message Passing Interface (MPI) version of HPG Aligner RNA BWT, we organize the framework into two stages so that a different aligner can be executed in each of them. With this configuration, for example, the first stage can apply a fast aligner to accelerate the process, while the second can be tuned to act as a refinement stage that further improves the global alignment at little cost.
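A minimal MPI sketch of the coordination pattern, not the framework's actual code: every rank runs an unmodified "single-node" aligner on its own chunk in a first stage, all ranks synchronize, and a second stage runs a refinement aligner on whatever the first stage left unmapped. The chunk file names and the placeholder echo commands are illustrative assumptions; a real deployment would substitute the actual aligner command lines.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char cmd[512];

    /* Stage 1: each rank aligns its own chunk with a fast aligner. The
     * chunk files would come from an input splitter; the command here is
     * only a placeholder. */
    snprintf(cmd, sizeof cmd,
             "echo 'stage 1: fast aligner on chunk_%d.fastq (of %d chunks)'",
             rank, size);
    if (system(cmd) != 0)
        fprintf(stderr, "rank %d: stage 1 failed\n", rank);

    /* All chunks must finish before the refinement stage starts. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Stage 2: a slower, more sensitive aligner refines unmapped reads. */
    snprintf(cmd, sizeof cmd,
             "echo 'stage 2: refinement aligner on unmapped_%d.fastq'", rank);
    if (system(cmd) != 0)
        fprintf(stderr, "rank %d: stage 2 failed\n", rank);

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with mpirun -np <ranks>, each rank would occupy one cluster node; the point is that the aligners are invoked as black boxes, which is what lets a framework of this kind reuse existing single-node tools unchanged.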

