High Performance Parallelization of COMPSYN on a Cluster of Multicore Processors with GPUs

2012, Vol 9, pp. 966-975
Author(s): Ferdinando Alessi, Annalisa Massini, Roberto Basili
Electronics, 2021, Vol 10 (16), pp. 1984
Author(s): Wei Zhang, Zihao Jiang, Zhiguang Chen, Nong Xiao, Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency. It is therefore worthwhile to design a high-performance DGEMM for ARMv8-based SoCs. However, as these SoCs integrate more and more cores, modern CPUs adopt non-uniform memory access (NUMA), which restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory accesses. The critical enabler for NUMA-aware DGEMM is leveraging two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows task independence and data localization across NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng 920 architecture. The results show that NUMA-aware DGEMM effectively reduces cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the largest improvement being 21.9%.
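As a rough illustration of the data-placement idea, the C/OpenMP sketch below (not the authors' OpenBLAS implementation) partitions A and C by rows among threads and relies on the operating system's first-touch policy, so that each thread initializes, and therefore owns, the pages it will later compute on; the matrix size, the row partitioning, and the naive inner kernel are assumptions chosen for brevity.

/*
 * Minimal sketch of the NUMA-aware idea described above, not the authors'
 * OpenBLAS code: split A and C by rows, rely on first-touch placement so
 * each thread's pages land in its local NUMA domain, and keep writes local.
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 2048                        /* assumed square problem size */

static double *A, *B, *C;

/* Each thread touches the rows of A and C it will later work on, so the
 * first-touch policy places those pages in the thread's local domain. */
static void numa_aware_init(void)
{
    #pragma omp parallel
    {
        int t = omp_get_thread_num(), nt = omp_get_num_threads();
        int rows = N / nt, r0 = t * rows, r1 = (t == nt - 1) ? N : r0 + rows;
        for (int i = r0; i < r1; ++i)
            for (int j = 0; j < N; ++j) {
                A[i * (long)N + j] = 1.0;
                C[i * (long)N + j] = 0.0;
            }
        /* B is read by all threads; interleaving or replicating it per
         * node is where a real NUMA-aware DGEMM invests more effort. */
        #pragma omp single
        for (long k = 0; k < (long)N * N; ++k) B[k] = 1.0;
    }
}

/* Each thread multiplies only its local row block: no remote writes. */
static void numa_aware_dgemm(void)
{
    #pragma omp parallel
    {
        int t = omp_get_thread_num(), nt = omp_get_num_threads();
        int rows = N / nt, r0 = t * rows, r1 = (t == nt - 1) ? N : r0 + rows;
        for (int i = r0; i < r1; ++i)
            for (int k = 0; k < N; ++k) {
                double a = A[i * (long)N + k];
                for (int j = 0; j < N; ++j)
                    C[i * (long)N + j] += a * B[k * (long)N + j];
            }
    }
}

int main(void)
{
    A = malloc(sizeof(double) * N * N);
    B = malloc(sizeof(double) * N * N);
    C = malloc(sizeof(double) * N * N);
    if (!A || !B || !C) return 1;
    numa_aware_init();
    double t0 = omp_get_wtime();
    numa_aware_dgemm();
    printf("C[0] = %f, %.3f s\n", C[0], omp_get_wtime() - t0);
    free(A); free(B); free(C);
    return 0;
}

Compiled with gcc -O2 -fopenmp and run with threads pinned to cores (for example OMP_PROC_BIND=close OMP_PLACES=cores), each row block stays within one NUMA domain, which is the effect the paper exploits at a far more sophisticated level inside OpenBLAS.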


Author(s): Ram Prasad Mohanty, Ashok Kumar Turuk, Bibhudatta Sahoo

The growing number of cores increases the demand for a powerful memory subsystem, which leads to larger caches in multicore processors. Caches give the processing elements a faster, higher-bandwidth local memory to work with. In this chapter, an attempt has been made to analyze the impact of cache size on the performance of multicore processors by varying the L1 and L2 cache sizes in a multicore processor with an internal network (MPIN) referenced from the Niagara architecture. As the number of cores increases, traditional on-chip interconnects such as the bus and crossbar prove inefficient and suffer from poor scalability. To overcome the scalability and efficiency issues of these conventional interconnects, a ring-based design has been proposed. The effect of the interconnect on the performance of multicore processors has been analyzed, and a novel scalable on-chip interconnection mechanism (INoC) for multicore processors has been proposed. Benchmark results are presented using a full-system simulator. The results show that, with the proposed INoC, execution time is significantly reduced compared with the MPIN.
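To make the cache-size effect concrete, the short C sketch below (a toy model, not the chapter's MPIN/INoC full-system simulation) drives a direct-mapped cache with a repeated sweep over a synthetic 256 KB working set; the miss rate drops sharply once the simulated cache is large enough to hold the working set. Line size, working-set size, and the access pattern are assumptions chosen purely for illustration.

/*
 * Toy direct-mapped cache model: repeated sequential sweep over a fixed
 * working set, reporting the miss rate for several cache sizes.
 */
#include <stdio.h>

#define LINE 64                   /* cache line size in bytes (assumed) */
#define WSET (256 * 1024)         /* working set swept repeatedly, 256 KB */
#define ACCESSES 1000000

static double miss_rate(long cache_bytes)
{
    static long tags[1 << 20];    /* large enough for all sizes below */
    long lines = cache_bytes / LINE;
    for (long i = 0; i < lines; ++i) tags[i] = -1;

    long misses = 0;
    for (long i = 0; i < ACCESSES; ++i) {
        long addr = (i * 8) % WSET;          /* one 8-byte word per access */
        long line = addr / LINE;
        long set  = line % lines;            /* direct-mapped placement */
        if (tags[set] != line) { tags[set] = line; ++misses; }
    }
    return (double)misses / ACCESSES;
}

int main(void)
{
    for (long kb = 16; kb <= 1024; kb *= 2)
        printf("%5ld KB cache -> miss rate %.3f\n", kb, miss_rate(kb * 1024));
    return 0;
}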


2015, Vol 26 (4), pp. 18-43
Author(s): Markus Endres, Werner Kießling

The problem of Skyline computation has attracted considerable research attention in the last decade. A Skyline query selects those tuples from a dataset that are optimal with respect to a set of designated preference attributes. Since multicore processors have gone mainstream, it has become imperative to develop parallel algorithms that fully exploit the advantages of such modern hardware architectures. In this paper, the authors present high-performance parallel Skyline algorithms based on the lattice structure generated by a Skyline query. For this, they propose different evaluation strategies and compare several data structures for the parallel evaluation of Skyline queries. The authors present novel optimization techniques for lattice-based Skyline algorithms based on pruning and on removing one unrestricted attribute domain. They demonstrate through comprehensive experiments on synthetic and real datasets that their new algorithms outperform state-of-the-art multicore Skyline techniques for low-cardinality domains. The authors' algorithms have linear runtime complexity and fully exploit modern hardware architectures.
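The C sketch below is a simplified reading of the lattice idea for low-cardinality domains, not the authors' algorithm: every attribute-value combination is a lattice node, combinations occurring in the data are marked present, dominance is propagated level by level (a node depends only on its immediate predecessors, so each level can be processed in parallel with OpenMP), and the Skyline consists of the present, non-dominated combinations. Dimensionality, domain cardinality, and the example data are assumptions; smaller attribute values are taken to be preferred.

#include <stdio.h>

#define D 3                  /* number of preference attributes (assumed) */
#define C 4                  /* cardinality of each attribute domain (assumed) */
#define NODES (C * C * C)    /* size of the lattice */

static int present[NODES];   /* combination occurs in the dataset */
static int dominated[NODES]; /* combination is dominated by a present one */

static int id(const int v[D]) { return (v[0] * C + v[1]) * C + v[2]; }

int main(void)
{
    /* Small example dataset: rows of D low-cardinality attribute values. */
    int data[][D] = { {0,2,3}, {1,1,1}, {2,0,3}, {1,1,2}, {3,3,0}, {2,2,2} };
    int n = sizeof data / sizeof data[0];
    for (int i = 0; i < n; ++i) present[id(data[i])] = 1;

    /* Walk lattice levels (sum of components) in increasing order; nodes
     * within one level are independent, hence the parallel loop. */
    for (int level = 1; level <= D * (C - 1); ++level) {
        #pragma omp parallel for
        for (int node = 0; node < NODES; ++node) {
            int v[D] = { node / (C * C), (node / C) % C, node % C };
            if (v[0] + v[1] + v[2] != level) continue;
            for (int a = 0; a < D && !dominated[node]; ++a) {
                if (v[a] == 0) continue;
                int p[D] = { v[0], v[1], v[2] };
                p[a] -= 1;                       /* immediate predecessor */
                if (present[id(p)] || dominated[id(p)])
                    dominated[node] = 1;
            }
        }
    }

    /* Skyline = tuples whose combination is present and not dominated. */
    for (int i = 0; i < n; ++i)
        if (!dominated[id(data[i])])
            printf("skyline tuple: (%d,%d,%d)\n",
                   data[i][0], data[i][1], data[i][2]);
    return 0;
}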


Author(s): NAGASHYAM P, VIJAY KUMAR T

About 50 million people worldwide suffer from epilepsy, a neurological disorder characterized by seizures. The primary tool for diagnosing an epileptic seizure is electroencephalography (EEG), which records the brain's spontaneous electrical activity. This requires placing a minimum of 16 electrodes on the scalp, with each electrode interpreted as a channel. Seizure detection and analysis techniques mainly work in two stages: features are extracted from the raw EEG data in the first stage, and the obtained features are then used as input to the classification process in the second stage. Traditionally, seizure detection algorithms have been implemented on DSP processors or FPGAs, but these single-core platforms are constrained in speed of operation and power consumption. There is a pressing need to reduce the power consumption and increase the speed of EEG seizure detection systems. This problem can be addressed using multicore processors, which process data simultaneously. This project presents a high-performance multicore platform for EEG-based seizure detection and analysis. The platform performs continuous multichannel detection and analysis of seizures for epilepsy patients. The detection unit detects seizures through a feature extraction process; once a seizure is detected, it enables the analysis circuit, which processes the data with an Urdhva Tiryagbhyam-based 128-point FFT and transmits the energy and frequency content of the EEG data. All proposed blocks are simulated and synthesized using Xilinx ISE, and the coding is done in Verilog.
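As a plain-C software reference for the analysis stage (the actual design is Verilog hardware whose multipliers use the Urdhva Tiryagbhyam sutra, which is not reproduced here), the sketch below runs a 128-point radix-2 FFT over one synthetic single-channel window and reports the spectral energy and dominant frequency; the sampling rate and the test signal are assumptions made for illustration.

#include <complex.h>
#include <math.h>
#include <stdio.h>

#define NFFT 128        /* transform length used by the analysis stage */
#define FS 256.0        /* assumed sampling rate in Hz */

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Iterative radix-2 Cooley-Tukey FFT, computed in place. */
static void fft(double complex x[NFFT])
{
    for (int i = 1, j = 0; i < NFFT; ++i) {        /* bit-reversal permutation */
        int bit = NFFT >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
    }
    for (int len = 2; len <= NFFT; len <<= 1) {    /* butterfly stages */
        double complex wl = cexp(-2.0 * I * M_PI / len);
        for (int i = 0; i < NFFT; i += len) {
            double complex w = 1.0;
            for (int k = 0; k < len / 2; ++k) {
                double complex u = x[i + k], v = x[i + k + len / 2] * w;
                x[i + k] = u + v;
                x[i + k + len / 2] = u - v;
                w *= wl;
            }
        }
    }
}

int main(void)
{
    /* Synthetic single-channel window: a strong 6 Hz component plus a
     * weaker 20 Hz component, standing in for one EEG channel. */
    double complex x[NFFT];
    for (int n = 0; n < NFFT; ++n)
        x[n] = sin(2 * M_PI * 6.0 * n / FS) + 0.3 * sin(2 * M_PI * 20.0 * n / FS);

    fft(x);

    double energy = 0.0;
    int peak = 1;
    for (int k = 1; k < NFFT / 2; ++k) {           /* positive frequencies */
        double p = cabs(x[k]) * cabs(x[k]);
        energy += p;
        if (p > cabs(x[peak]) * cabs(x[peak])) peak = k;
    }
    printf("spectral energy: %.1f, dominant frequency: %.1f Hz\n",
           energy, peak * FS / NFFT);
    return 0;
}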


2011, Vol 16 (7), pp. 1-11
Author(s): Seung-Gu Kang, Hong-Jun Choi, Jin-Woo Ahn, Jae-Hyung Park, Jong-Myon Kim, ...

Author(s): Héctor Martínez, Sergio Barrachina, Maribel Castillo, Joaquín Tárraga, Ignacio Medina, ...

The advances in genomic sequencing during the past few years have motivated the development of fast and reliable software for DNA/RNA sequencing on current high-performance architectures. Most of these efforts target multicore processors, only a few can also exploit graphics processing units, and a much smaller set will run on clusters equipped with any of these multi-threaded architecture technologies. Furthermore, the examples that can be used on clusters today are all strongly coupled to a particular aligner. In this paper we introduce an alignment framework that can be leveraged to run any “single-node” aligner in a coordinated manner, taking advantage of the resources of a cluster without having to modify any portion of the original software. The key to this transparent migration lies in hiding the complexity associated with multi-node execution (such as coordinating the processes running on the cluster nodes) inside the generic-aligner framework. Moreover, following the design and operation of our Message Passing Interface (MPI) version of HPG Aligner RNA BWT, we organize the framework into two stages so that a different aligner can be executed in each of them. With this configuration, for example, the first stage can apply a fast aligner to accelerate the process, while the second can be tuned to act as a refinement stage that further improves the global alignment at little cost.
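A minimal MPI sketch of the coordination pattern, not the framework's actual code: every rank runs an unmodified "single-node" aligner on its own chunk in a first stage, all ranks synchronize, and a second stage runs a refinement aligner on whatever the first stage left unmapped. The chunk file names and the placeholder echo commands are illustrative assumptions; a real deployment would substitute the actual aligner command lines.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char cmd[512];

    /* Stage 1: each rank aligns its own chunk with a fast aligner. The
     * chunk files would come from an input splitter; the command here is
     * only a placeholder. */
    snprintf(cmd, sizeof cmd,
             "echo 'stage 1: fast aligner on chunk_%d.fastq (of %d chunks)'",
             rank, size);
    if (system(cmd) != 0)
        fprintf(stderr, "rank %d: stage 1 failed\n", rank);

    /* All chunks must finish before the refinement stage starts. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Stage 2: a slower, more sensitive aligner refines unmapped reads. */
    snprintf(cmd, sizeof cmd,
             "echo 'stage 2: refinement aligner on unmapped_%d.fastq'", rank);
    if (system(cmd) != 0)
        fprintf(stderr, "rank %d: stage 2 failed\n", rank);

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with mpirun -np <ranks>, each rank would occupy one cluster node; the point is that the aligners are invoked as black boxes, which is what lets a framework of this kind reuse existing single-node tools unchanged.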

