Optimization of throughput, fairness and energy efficiency on asymmetric multicore systems via OS scheduling

2018 ◽  
Vol 18 (01) ◽  
pp. e09
Author(s):  
Adrian Pousa

Most chip multiprocessors (CMPs) are symmetric, i.e. they are composed of identical cores. These CMPs may consist of complex cores (e.g., Intel Haswell or IBM Power8) or simple, lower-power cores (e.g., ARM Cortex A9 or Intel Xeon Phi). Cores of the former kind have advanced microarchitectural features, such as out-of-order superscalar pipelines, and are suitable for running sequential applications that can exploit these features efficiently. Cores of the latter kind have a simple microarchitecture and are a good fit for applications with high thread-level parallelism (TLP).
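The placement intuition behind asymmetry-aware scheduling can be sketched as a toy policy. This is purely illustrative, not the paper's scheduler; the core names and the thread-count heuristic are assumptions:

```python
# Toy sketch of an asymmetry-aware placement policy: low-TLP work is steered
# to complex "big" cores, highly parallel work to the simple low-power cores.
BIG_CORES = ["big0", "big1"]                      # e.g. out-of-order superscalar
SMALL_CORES = ["small0", "small1", "small2", "small3"]

def assign_cores(apps):
    """Map each (name, thread_count) pair to a core pool by its TLP."""
    placement = {}
    for name, threads in apps:
        # Single-threaded work benefits most from a big core; high-TLP
        # work scales better across the many simple cores.
        placement[name] = "big" if threads == 1 else "small"
    return placement

apps = [("compiler", 1), ("raytracer", 8), ("video_encode", 4)]
print(assign_cores(apps))
# {'compiler': 'big', 'raytracer': 'small', 'video_encode': 'small'}
```

A real scheduler would also weigh fairness and energy, as the title suggests; this sketch captures only the throughput-oriented mapping.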

Author(s):  
A.A. Fedorov ◽  
A.N. Bykov

A method of two-level parallelization of the Thomas algorithm for solving the tridiagonal linear systems that arise when modeling two-dimensional and three-dimensional physical processes is described (thread-level parallelism on shared memory using OpenMP and process-level parallelism on distributed memory using MPI). The features of its implementation both on machines with general-purpose processors and on hybrid machines with Intel Xeon Phi many-core coprocessors are analyzed. The arithmetic complexity of the implemented method is estimated. Numerical results from a study of its scalability are discussed.
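The kernel being parallelized here is the classic Thomas algorithm for tridiagonal systems. A minimal sequential Python sketch (variable names are assumptions, not the paper's notation):

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c, and right-hand side d (a[0] and c[-1] are unused)."""
    n = len(d)
    cp = [0.0] * n      # modified super-diagonal
    dp = [0.0] * n      # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination sweep
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    # Back substitution sweep
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The two sweeps are inherently sequential along one line, which is exactly why the paper's two-level scheme (OpenMP within a node, MPI across nodes) partitions the work across the many independent tridiagonal lines of a 2D/3D grid rather than within a single solve.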


Author(s):  
Ramon Amela ◽  
Cristian Ramon-Cortes ◽  
Jorge Ejarque ◽  
Javier Conejero ◽  
Rosa M. Badia

Python is a popular programming language due to the simplicity of its syntax, while still achieving good performance even though it is an interpreted language. Its adoption by multiple scientific communities has led to the emergence of a large number of libraries and modules, which has helped put Python at the top of the list of programming languages [1]. Task-based programming has been proposed in recent years as an alternative parallel programming model. PyCOMPSs follows this approach for Python, and this paper presents its extensions to combine task-based parallelism and thread-level parallelism. We also present how PyCOMPSs has been adapted to support heterogeneous architectures, including the Intel Xeon Phi and GPUs. Results obtained with linear algebra benchmarks demonstrate that significant performance can be obtained with just a few lines of Python.
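The task-based model that PyCOMPSs follows can be illustrated with the standard library alone. Note this is not the PyCOMPSs API: `ThreadPoolExecutor` stands in for its runtime, and `Future.result()` plays the role of the synchronization point:

```python
# Task-based sketch: independent work units are submitted as tasks and
# synchronized only when their results are needed.
from concurrent.futures import ThreadPoolExecutor

def block_multiply(a, b):
    """A 'task': multiply two small matrix blocks (lists of lists)."""
    n, m, k = len(a), len(b[0]), len(b)
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

with ThreadPoolExecutor() as pool:
    # Submit independent block products; the runtime may run them concurrently.
    f1 = pool.submit(block_multiply, [[1, 2], [3, 4]], [[5, 6], [7, 8]])
    f2 = pool.submit(block_multiply, [[0, 1], [1, 0]], [[5, 6], [7, 8]])
    r1, r2 = f1.result(), f2.result()   # synchronization point

print(r1)  # [[19, 22], [43, 50]]
```

In PyCOMPSs the analogous pieces are a decorator marking task functions and an explicit wait call; the key idea in both cases is that task dependencies, not the programmer, determine the execution order.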


2021 ◽  
Author(s):  
Anita Tino

As the multi-core computing era continues to progress, the need to increase single-thread performance and throughput, and to seamlessly adapt to thread-level parallelism (TLP), remains an important issue. Though the number of cores on each processor continues to increase, expected performance gains have lagged. Accordingly, computing systems often include Simultaneously Multi-Threaded (SMT) processors as a compromise between sequential and parallel performance on a single core. These processors effectively improve the throughput and utilization of a core, though often at the expense of single-thread performance as the number of threads per core scales. Applications that require higher single-thread performance must therefore often resort to single-threaded-core multi-processor systems, which incur additional area overhead and power dissipation. In an attempt to improve single- and multi-thread core efficiency, this work introduces the concept of a Configurable Simultaneously Single-Threaded (Multi-)Engine Processor (ConSSTEP). ConSSTEP is a nuanced approach to multi-threaded processors, achieving performance gains and energy efficiency by invoking low-overhead reconfigurable properties with full software compatibility. Experimental results demonstrate that ConSSTEP is able to increase single-thread Instructions Per Cycle (IPC) by up to 1.39x and 2.4x for 2-thread and 4-thread workloads, respectively, improving throughput and providing up to 2x energy efficiency compared to a conventional SMT processor.




2018 ◽  
Vol 175 ◽  
pp. 02009
Author(s):  
Carleton DeTar ◽  
Steven Gottlieb ◽  
Ruizi Li ◽  
Doug Toussaint

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider the performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and the gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
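The staggered conjugate gradient mentioned above is, at its core, the standard CG iteration applied to the staggered Dirac operator. A minimal unpreconditioned sketch, with a small symmetric positive-definite matrix standing in for that operator (all names here are illustrative):

```python
def cg(A, b, tol=1e-10, max_iter=100):
    """Unpreconditioned conjugate gradient for a small SPD matrix A
    (list of lists) and right-hand side b; returns the solution x."""
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    x = [0.0] * n
    r = b[:]                               # residual r = b - A x (x starts at 0)
    p = r[:]                               # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

In lattice QCD the matrix-vector product (`matvec` here) is the expensive stencil kernel, which is why porting work in QPhiX and QUDA concentrates on vectorizing and parallelizing exactly that operation.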

