Optimization of throughput, fairness and energy efficiency on asymmetric multicore systems via OS scheduling.

Приводится описание метода двухуровневого распараллеливания прогонки (на общей памяти средствами OpenMP и на распределенной памяти средствами MPI) для решения трехдиагональных линейных систем, возникающих при моделировании двумерных и трехмерных физических процессов. Анализируются особенности реализации метода как на ЭВМ с универсальными процессорами, так и на гибридных ЭВМ с многоядерными сопроцессорами Intel Xeon Phi. Оценивается арифметическая сложность реализованного метода. Обсуждаются результаты численных экспериментов по исследованию масштабируемости метода. A method of two-level parallelization of the Thomas algorithm for solving tridiagonal linear systems (the thread-level parallelism using OpenMP and the process-level parallelism using MPI) arising when modeling two-dimensional and three-dimensional physical processes is described. The features of its implementation for parallel multiprocessor systems and for hybrid multiprocessor systems with multicore coprocessors Intel Xeon Phi are analyzed. The arithmetic complexity of this method is estimated. Some numerical results obtained when studying its scalability are discussed.

Get full-text (via PubEx)

Energy Efficiency Evaluation of Workload Execution on Intel Xeon Phi Coprocessor

Trustworthy Computing and Services - Communications in Computer and Information Science ◽

10.1007/978-3-662-43908-1_34 ◽

2014 ◽

pp. 268-275

Author(s):

Qi Zhao ◽

Hailong Yang ◽

Guang Wei ◽

Zhongzhi Luan ◽

Depei Qian

Keyword(s):

Energy Efficiency ◽

Efficiency Evaluation ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Energy Efficiency Evaluation ◽

Intel Xeon

Get full-text (via PubEx)

Executing linear algebra kernels in heterogeneous distributed infrastructures with PyCOMPSs

Oil & Gas Science and Technology – Revue d’IFP Energies nouvelles ◽

10.2516/ogst/2018047 ◽

2018 ◽

Vol 73 ◽

pp. 47 ◽

Cited By ~ 3

Author(s):

Ramon Amela ◽

Cristian Ramon-Cortes ◽

Jorge Ejarque ◽

Javier Conejero ◽

Rosa M. Badia

Keyword(s):

Programming Languages ◽

Linear Algebra ◽

Programming Model ◽

Xeon Phi ◽

Scientific Communities ◽

Heterogeneous Architectures ◽

Parallel Programming Model ◽

Significant Performance ◽

Thread Level Parallelism ◽

Level Parallelism

Python is a popular programming language due to the simplicity of its syntax, while still achieving a good performance even being an interpreted language. The adoption from multiple scientific communities has evolved in the emergence of a large number of libraries and modules, which has helped to put Python on the top of the list of the programming languages [1]. Task-based programming has been proposed in the recent years as an alternative parallel programming model. PyCOMPSs follows such approach for Python, and this paper presents its extensions to combine task-based parallelism and thread-level parallelism. Also, we present how PyCOMPSs has been adapted to support heterogeneous architectures, including Xeon Phi and GPUs. Results obtained with linear algebra benchmarks demonstrate that significant performance can be obtained with a few lines of Python.

Get full-text (via PubEx)

Configurable simultaneously single-threaded (multi-)engine processor

10.32920/ryerson.14644953.v1 ◽

2021 ◽

Author(s):

Anita Tino

Keyword(s):

Energy Efficiency ◽

Power Dissipation ◽

Single Thread ◽

Parallel Performance ◽

Area Overhead ◽

Expected Performance ◽

Thread Level Parallelism ◽

Performance Gains ◽

Level Parallelism ◽

Additional Area

As the multi-core computing era continues to progress, the need to increase single- thread performance, throughput, and seemingly adapt to thread-level parallelism (TLP) remain important issues. Though the number of cores on each processor continues to increase, expected performance gains have lagged. Accordingly, com- puting systems often include Simultaneously Multi-Threaded (SMT) processors as a compromise between sequential and parallel performance on a single core. These processors effectively improve the throughput and utilization of a core, however often at the expense of single-thread performance as threads per core scale. Accordingly, applications which require higher single-thread performance must often resort to single-thread core multi-processor systems which incur additional area overhead and power dissipation. In attempts to improve single- and multi-thread core efficiency, this work introduces the concept of a Configurable Simultaneously Single-Threaded (Multi-)Engine Processor (ConSSTEP). ConSSTEP is a nuanced approach to multi- threaded processors, achieving performance gains and energy efficiency by invoking low overhead reconfigurable properties with full software compatibility. Experimen- tal results demonstrate that ConSSTEP is able to increase single-thread Instruc- tions Per Cycle (IPC) up to 1.39x and 2.4x for 2-thread and 4-thread workloads, respectively, improving throughput and providing up to 2x energy efficiency when compared to a conventional SMT processor.

Get full-text (via PubEx)

Balancing task- and data-level parallelism to improve performance and energy consumption of matrix computations on the Intel Xeon Phi

Computers & Electrical Engineering ◽

10.1016/j.compeleceng.2015.06.009 ◽

2015 ◽

Vol 46 ◽

pp. 95-111 ◽

Cited By ~ 6

Author(s):

Manuel F. Dolz ◽

Francisco D. Igual ◽

Thomas Ludwig ◽

Luis Piñuel ◽

Enrique S. Quintana-Ortí

Keyword(s):

Energy Consumption ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Improve Performance ◽

Matrix Computations ◽

Level Parallelism ◽

Data Level ◽

Intel Xeon

Get full-text (via PubEx)

Configurable simultaneously single-threaded (multi-)engine processor

10.32920/ryerson.14644953 ◽

2021 ◽

Author(s):

Anita Tino

Keyword(s):

Energy Efficiency ◽

Power Dissipation ◽

Single Thread ◽

Parallel Performance ◽

Area Overhead ◽

Expected Performance ◽

Thread Level Parallelism ◽

Performance Gains ◽

Level Parallelism ◽

Additional Area

As the multi-core computing era continues to progress, the need to increase single- thread performance, throughput, and seemingly adapt to thread-level parallelism (TLP) remain important issues. Though the number of cores on each processor continues to increase, expected performance gains have lagged. Accordingly, com- puting systems often include Simultaneously Multi-Threaded (SMT) processors as a compromise between sequential and parallel performance on a single core. These processors effectively improve the throughput and utilization of a core, however often at the expense of single-thread performance as threads per core scale. Accordingly, applications which require higher single-thread performance must often resort to single-thread core multi-processor systems which incur additional area overhead and power dissipation. In attempts to improve single- and multi-thread core efficiency, this work introduces the concept of a Configurable Simultaneously Single-Threaded (Multi-)Engine Processor (ConSSTEP). ConSSTEP is a nuanced approach to multi- threaded processors, achieving performance gains and energy efficiency by invoking low overhead reconfigurable properties with full software compatibility. Experimen- tal results demonstrate that ConSSTEP is able to increase single-thread Instruc- tions Per Cycle (IPC) up to 1.39x and 2.4x for 2-thread and 4-thread workloads, respectively, improving throughput and providing up to 2x energy efficiency when compared to a conventional SMT processor.

Get full-text (via PubEx)

Power-Performance Implications of Thread-level Parallelism on Chip Multiprocessors

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. ◽

10.1109/ispass.2005.1430567 ◽

2005 ◽

Cited By ~ 15

Author(s):

Jian Li ◽

J.F. Martinez

Keyword(s):

Chip Multiprocessors ◽

Power Performance ◽

Thread Level Parallelism ◽

On Chip ◽

Level Parallelism

Get full-text (via PubEx)

MILC Code Performance on High End CPU and GPU Supercomputer Clusters

EPJ Web of Conferences ◽

10.1051/epjconf/201817502009 ◽

2018 ◽

Vol 175 ◽

pp. 02009

Author(s):

Carleton DeTar ◽

Steven Gottlieb ◽

Ruizi Li ◽

Doug Toussaint

Keyword(s):

Conjugate Gradient ◽

Memory Hierarchy ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Code Performance ◽

Recent Developments ◽

Knights Landing ◽

Many Core ◽

Intel Xeon

With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

Get full-text (via PubEx)

Sequential Monte Carlo based parameter estimation for structural health monitoring with an Intel Xeon Phi optimized ultrasound kernel

10.1063/1.5099739 ◽

2019 ◽

Author(s):

William C. Schneck ◽

Heather Reed ◽

Elizabeth D. Gregory ◽

Cara A. C. Leckey

Keyword(s):

Monte Carlo ◽

Parameter Estimation ◽

Structural Health Monitoring ◽

Health Monitoring ◽

Sequential Monte Carlo ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Structural Health ◽

Intel Xeon

Get full-text (via PubEx)

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Proceedings of the 36th annual international symposium on Computer architecture - ISCA '09 ◽

10.1145/1555754.1555775 ◽

2009 ◽

Cited By ~ 256

Author(s):

Sunpyo Hong ◽

Hyesoon Kim

Keyword(s):

Analytical Model ◽

Thread Level Parallelism ◽

Level Parallelism ◽

Gpu Architecture ◽

With Memory

Get full-text (via PubEx)