Energy Profiling and Analysis of the HPC Challenge Benchmarks

Future high performance systems must use energy efficiently to achieve petaFLOPS computational speeds and beyond. To address this challenge, we must first understand the power and energy characteristics of high performance computing applications. In this paper, we use a power-performance profiling framework called PowerPack to study the power and energy profiles of the HPC Challenge benchmarks. We present detailed experimental results along with in-depth analysis of how each benchmark's workload characteristics affect power consumption and energy efficiency. This paper summarizes various findings using the HPC Challenge benchmarks, including but not limited to: 1) identifying application power profiles by function and component in a high performance cluster; 2) correlating applications' memory access patterns to power consumption for these benchmarks; and 3) exploring how energy consumption scales with system size and workload.
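To illustrate the per-component energy accounting this abstract describes, here is a minimal Python sketch that integrates sampled power over time with the trapezoidal rule. It is not the profiling framework's API: the function name `energy_joules`, the component names, and all sample values are hypothetical.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Average of two neighbouring samples times the interval between them.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical samples, one per second, for two components of a single node.
timestamps = [0.0, 1.0, 2.0, 3.0]
samples = {
    "cpu": [40.0, 85.0, 90.0, 45.0],
    "memory": [8.0, 12.0, 14.0, 9.0],
}

per_component = {
    name: energy_joules(timestamps, watts) for name, watts in samples.items()
}
for name, joules in per_component.items():
    print(f"{name}: {joules:.1f} J")
print(f"total: {sum(per_component.values()):.1f} J")
```

Keeping one power trace per component is what would allow energy to be attributed by component, and slicing the timestamps by program phase would give a per-function breakdown in the same way.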

Extending PowerPack for Profiling and Analysis of High-Performance Accelerator-Based Systems

Parallel Processing Letters ◽

10.1142/s0129626414420018 ◽

2014 ◽

Vol 24 (04) ◽

pp. 1442001

Author(s):

Bo Li ◽

Hung-Ching Chang ◽

Shuaiwen Song ◽

Chun-Yi Su ◽

Timmy Meyer ◽

...

Keyword(s):

High Performance ◽

Energy Use ◽

Individual Component ◽

Experimental Studies ◽

Xeon Phi ◽

Hardware Support ◽

Power Performance ◽

Host Processor ◽

High Performance Systems ◽

Power And Energy

Accelerators offer a substantial increase in efficiency for high-performance systems, providing speedups for computational applications that leverage hardware support for highly parallel codes. However, the power use of some accelerators exceeds 200 watts at idle, which means that using them at exascale brings a significant increase in power at a time when we face a power ceiling of about 20 megawatts. Despite the growing domination of accelerator-based systems in the Top500 and Green500 lists of fastest and most efficient supercomputers, there are few detailed studies comparing the power and energy use of common accelerators. In this work, we conduct detailed experimental studies of the power usage and distribution of Xeon-Phi-based systems in comparison to the NVIDIA Tesla and an Intel Sandy Bridge multicore host processor. In contrast to previous work, we focus on separating individual component power and correlating power use to code behavior. Our results help explain the causes of power-performance scalability for a set of HPC applications.
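The two figures quoted in this abstract (a ceiling of about 20 megawatts and more than 200 watts at idle) can be turned into a quick back-of-the-envelope check. The Python sketch below is only illustrative: the accelerator count is a hypothetical value, not a number from the paper, and 200 W is treated as a lower bound on idle draw.

```python
EXAFLOPS = 1e18          # one exaFLOPS, expressed in FLOPS
POWER_CEILING_W = 20e6   # roughly 20 MW, as quoted in the abstract
ACCEL_IDLE_W = 200.0     # idle draw that some accelerators exceed, per the abstract


def required_gflops_per_watt(flops, watts):
    """Sustained efficiency needed to reach `flops` within `watts`."""
    return flops / watts / 1e9


def idle_share(accelerator_count, idle_w, ceiling_w):
    """Fraction of the power ceiling consumed by idle accelerators alone."""
    return accelerator_count * idle_w / ceiling_w


# Hypothetical machine size, chosen only to make the arithmetic concrete.
HYPOTHETICAL_ACCELERATORS = 10_000

efficiency = required_gflops_per_watt(EXAFLOPS, POWER_CEILING_W)
share = idle_share(HYPOTHETICAL_ACCELERATORS, ACCEL_IDLE_W, POWER_CEILING_W)
print(f"required efficiency: {efficiency:.0f} GFLOPS/W")
print(f"idle share of ceiling: {share:.0%}")
```

Under these assumptions, an exaFLOPS machine would need to sustain 50 GFLOPS per watt, and ten thousand idle accelerators would already take a tenth of the budget before doing any work.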

Introduction to Digital I/O: Constraining I/O Power Consumption in High-Performance Systems

IEEE Solid-State Circuits Magazine ◽

10.1109/mssc.2015.2476016 ◽

2015 ◽

Vol 7 (4) ◽

pp. 14-22 ◽

Cited By ~ 10

Author(s):

Tony Chan Carusone

Keyword(s):

Power Consumption ◽

High Performance ◽

High Performance Systems

Power, Performance, and Thermal Management for High-Performance Systems

2007 IEEE International Parallel and Distributed Processing Symposium ◽

10.1109/ipdps.2007.370541 ◽

2007 ◽

Cited By ~ 5

Author(s):

Heather Hanson ◽

Stephen W. Keckler ◽

Karthick Rajamani ◽

Soraya Ghiasi ◽

Freeman Rawson ◽

...

Keyword(s):

Thermal Management ◽

High Performance ◽

Power Performance ◽

High Performance Systems

High-Performance Low-Power 5:2 Compressor With 30 CNTFETs Using 32 nm Technology

International Journal of Sensors Wireless Communications and Control ◽

10.2174/2210327909666190206144601 ◽

2019 ◽

Vol 9 (4) ◽

pp. 462-467

Author(s):

Jitendra Kumar Saini ◽

Avireni Srinivasulu ◽

Renu Kumawat

Keyword(s):

Power Consumption ◽

High Speed ◽

High Performance ◽

Cmos Technology ◽

Vital Role ◽

Arithmetic Circuit ◽

Voltage Supply ◽

Big Data Applications ◽

High Performance Systems ◽

Power Delay Product

Background: The advent of High Performance Computing (HPC) applications and big data applications has made it imperative to develop hardware that can match the computing demands. In such high performance systems, high speed multipliers are the most sought after components. A compressor is an important part of the multiplier; it plays a vital role in the performance of the multiplier and also contributes to the efficiency of an arithmetic circuit. The 5:2 compressor circuit design proposed here improves the overall performance and efficiency of arithmetic circuits in terms of power consumption, delay and power delay product. The proposed 5:2 compressor circuit was implemented using both CMOS and Carbon Nano Tube Field Effect Transistor (CNTFET) technologies, and it was observed that the proposed circuit yielded better results with CNTFETs than with MOSFETs. Methods/Results: The proposed 5:2 compressor circuit was designed with CMOS technology, simulated at 45 nm with a supply voltage of 1.0 V, and compared with the existing 5:2 compressor designs to validate the improvements. Thereafter, the proposed design was implemented with CNTFET technology at 32 nm and simulated with a supply voltage of 0.6 V. The results compare the proposed 5:2 compressor with existing designs implemented using CMOS, and also compare the proposed design on CMOS and CNTFET technologies for parameters such as power, delay, and power delay product. Conclusion: It can be concluded that the proposed 5:2 compressor gives better results than the existing 5:2 compressor designs implemented using CMOS. The improvement in power, delay and power delay product is approximately 30%, 15% and 40% respectively. The proposed 5:2 compressor circuit is also implemented using CNTFET technology and compared, which further enhances the results by 30% (power consumption and PDP). Hence, the proposed circuit implemented using CNTFET gives substantial improvements over the existing circuits.
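The three improvement figures in this abstract can be cross-checked against each other. The sketch below assumes the usual definition of power delay product as power multiplied by delay, and assumes that the quoted percentages are reductions against the same baseline design; the function name `pdp_reduction` is illustrative.

```python
def pdp_reduction(power_reduction, delay_reduction):
    """Fractional PDP reduction implied by fractional power and delay reductions.

    Assumes PDP = power * delay, so the remaining fractions multiply.
    """
    remaining = (1.0 - power_reduction) * (1.0 - delay_reduction)
    return 1.0 - remaining


# Approximate figures quoted in the abstract: 30% power, 15% delay.
implied = pdp_reduction(0.30, 0.15)
print(f"implied PDP reduction: {implied:.1%}")
```

With a 30% power reduction and a 15% delay reduction, the implied PDP reduction is 1 - 0.70 × 0.85 = 0.405, which agrees with the roughly 40% reported.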

Reducing Idle Power Consumption in High Performance Systems

2017 International Conference on Computational Science and Computational Intelligence (CSCI) ◽

10.1109/csci.2017.283 ◽

2017 ◽

Author(s):

Vaibhav Sundriyal ◽

Masha Sosonkina

Keyword(s):

Power Consumption ◽

High Performance ◽

Idle Power ◽

High Performance Systems

Efficient Instruction and Data Caching for High Performance Embedded Processors

Jornada de Jóvenes Investigadores del I3A ◽

10.26754/jji-i3a.201201788 ◽

1970 ◽

pp. 9

Author(s):

A. Ferrerón Labari ◽

D. Suárez Gracia ◽

V. Viñals Yúfera

Keyword(s):

Embedded Systems ◽

Power Consumption ◽

Low Power ◽

Interconnection Networks ◽

High Performance ◽

Critical Issue ◽

Content Management ◽

Structure Design ◽

Portable Devices ◽

On Chip

In recent years, embedded systems have evolved to offer capabilities previously found only in high performance systems. Portable devices already have on-chip multiprocessors (such as the PowerPC 476FP or ARM Cortex A9 MP), usually multi-threaded, and a powerful on-chip multi-level cache memory hierarchy. As most of these systems are battery-powered, power consumption becomes a critical issue. Achieving both high performance and low power consumption is a highly complex challenge for which some proposals have already been made. Suarez et al. proposed a new on-chip cache hierarchy, the LP-NUCA (Low Power NUCA), which reduces access latency by taking advantage of NUCA (Non-Uniform Cache Architecture) properties. The key points are decoupling the functionality and using three specialized on-chip networks. This structure has proved efficient for data hierarchies, achieving good performance and reducing energy consumption. On the other hand, instruction caches have different requirements and characteristics than data caches, which conflict with the requirements of low-power embedded systems, especially in SMT (simultaneous multi-threading) environments. We want to study the benefits of using small tiled caches for the instruction hierarchy, so we propose a new design, ID-LP-NUCAs. To do so, we need to completely re-evaluate our previous design in terms of structure design, interconnection networks (including topologies, flow control and routing), content management (with special interest in hardware/software content allocation policies), and structure sharing. In CMP (chip multiprocessor) environments with parallel workloads, coherence plays an important role and must be taken into consideration.

Low Power Wide Fan-in Domino OR Gate Using CN-MOSFETs

International Journal of Sensors Wireless Communications and Control ◽

10.2174/2210327909666190207163639 ◽

2020 ◽

Vol 10 (1) ◽

pp. 55-62

Author(s):

Deepika Bansal ◽

Bal Chand Nagar ◽

Brahamdeo Prasad Singh ◽

Ajay Kumar

Keyword(s):

Power Consumption ◽

High Performance ◽

Dynamic Logic ◽

Clock Frequency ◽

Charge Sharing ◽

Benchmark Circuit ◽

Domino Circuit ◽

Power Delay Product ◽

Domino Circuits ◽

Or Gate

Background & Objective: In this paper, a modified pseudo domino configuration is proposed to improve the leakage power consumption and Power Delay Product (PDP) of dynamic logic using Carbon Nanotube MOSFETs (CN-MOSFETs). Simulations of the proposed and previously published domino circuits are verified using the Synopsys HSPICE simulator with the 32 nm CN-MOSFET technology provided by Stanford. Methods: The simulation results of the proposed technique are validated using a wide fan-in domino OR gate as a benchmark circuit at a 500 MHz clock frequency. Results: The proposed configuration is suitable for cascading high performance wide fan-in circuits without any charge sharing. Conclusion: The performance analysis of an 8-input OR gate demonstrates that the proposed circuit reduces static and dynamic power consumption by up to 62% and 40% respectively, and improves PDP by 60% compared to the standard domino circuit.

Methods for Achieving Maximum Efficiency of a Prototyping Platform for High-Performance Systems on Chip in Artificial Intelligence Tasks

Nanoindustry Russia ◽

10.22184/1993-8578.2020.13.3s.585.588 ◽

2020 ◽

Vol 96 (3s) ◽

pp. 585-588

Author(s):

С.Е. Фролова ◽

Е.С. Янакова

Keyword(s):

Neural Network ◽

Artificial Intelligence ◽

Computer Vision ◽

High Performance ◽

Systems On Chip ◽

High Performance Systems ◽

On Chip ◽

Network Technologies ◽

Neural Network Technologies

Methods are proposed for building prototyping platforms for high-performance systems-on-chip for artificial intelligence tasks. The requirements for platforms of this class and the principles for changing the SoC design for implementation in the prototype are described, as well as methods of debugging projects on the prototyping platform. Results are presented for computer vision algorithms using neural network technologies on the FPGA prototype of the ELcore semantic cores.

Ultracompact and low-power-consumption silicon thermo-optic switch for high-speed data

Nanophotonics ◽

10.1515/nanoph-2020-0496 ◽

2020 ◽

Vol 10 (2) ◽

pp. 937-945

Author(s):

Ruihuan Zhang ◽

Yu He ◽

Yong Zhang ◽

Shaohua An ◽

Qingming Zhu ◽

...

Keyword(s):

Power Consumption ◽

Low Power ◽

High Speed ◽

High Performance ◽

Pulse Amplitude ◽

Telecommunication Networks ◽

Low Power Consumption ◽

Power Efficient ◽

High Speed Data ◽

On Chip

Ultracompact and low-power-consumption optical switches are desired for high-performance telecommunication networks and data centers. Here, we demonstrate an on-chip power-efficient 2 × 2 thermo-optic switch unit by using a suspended photonic crystal nanobeam structure. A submilliwatt switching power of 0.15 mW is obtained with a tuning efficiency of 7.71 nm/mW in a compact footprint of 60 μm × 16 μm. The bandwidth of the switch is properly designed for a four-level pulse amplitude modulation signal with a 124 Gb/s raw data rate. To the best of our knowledge, the proposed switch is the most power-efficient resonator-based thermo-optic switch unit with the highest tuning efficiency and data rate ever reported.
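The figures in this abstract can be combined into a few derived quantities. The Python sketch below is a rough worked example, not a result from the paper: it assumes the resonance shift scales linearly with heater power (shift = tuning efficiency × power) and that a four-level PAM signal carries log2(4) = 2 bits per symbol. The function names are illustrative.

```python
import math

SWITCHING_POWER_MW = 0.15           # from the abstract
TUNING_EFFICIENCY_NM_PER_MW = 7.71  # from the abstract
FOOTPRINT_UM = (60.0, 16.0)         # from the abstract
RAW_DATA_RATE_GBPS = 124.0          # from the abstract
PAM_LEVELS = 4                      # four-level pulse amplitude modulation


def resonance_shift_nm(power_mw, efficiency_nm_per_mw):
    """Wavelength shift at a given heater power, assuming linear tuning."""
    return power_mw * efficiency_nm_per_mw


def symbol_rate_gbaud(bit_rate_gbps, levels):
    """Symbol rate of a PAM signal carrying log2(levels) bits per symbol."""
    return bit_rate_gbps / math.log2(levels)


shift = resonance_shift_nm(SWITCHING_POWER_MW, TUNING_EFFICIENCY_NM_PER_MW)
baud = symbol_rate_gbaud(RAW_DATA_RATE_GBPS, PAM_LEVELS)
area = FOOTPRINT_UM[0] * FOOTPRINT_UM[1]
print(f"resonance shift at switching power: {shift:.4f} nm")
print(f"symbol rate: {baud:.0f} GBaud")
print(f"footprint area: {area:.0f} square micrometres")
```

Under the linear-tuning assumption, the switching power corresponds to a shift of a little over 1 nm, and the 124 Gb/s signal corresponds to 62 GBaud.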

High-performance chemical- and light-inducible recombinases in mammalian cells and mice

Nature Communications ◽

10.1038/s41467-019-12800-7 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 9

Author(s):

Benjamin H. Weinberg ◽

Jang Hwan Cho ◽

Yash Agarwal ◽

N. T. Hang Pham ◽

Leidy D. Caraballo ◽

...

Keyword(s):

Gene Expression ◽

Mammalian Cells ◽

High Performance ◽

Genome Engineering ◽

Genetic Circuits ◽

Control Of Gene Expression ◽

Expression Control ◽

Gene Expression Control ◽

High Performance Systems ◽

Spatiotemporal Control

Site-specific DNA recombinases are important genome engineering tools. Chemical- and light-inducible recombinases, in particular, enable spatiotemporal control of gene expression. However, inducible recombinases are scarce due to the challenge of engineering high performance systems, thus constraining the sophistication of genetic circuits and animal models that can be created. Here we present a library of >20 orthogonal inducible split recombinases that can be activated by small molecules, light and temperature in mammalian cells and mice. Furthermore, we engineer inducible split Cre systems with better performance than existing systems. Using our orthogonal inducible recombinases, we create a genetic switchboard that can independently regulate the expression of 3 different cytokines in the same cell, a tripartite inducible Flp, and a 4-input AND gate. We quantitatively characterize the inducible recombinases for benchmarking their performances, including computation of distinguishability of outputs. This library expands capabilities for multiplexed mammalian gene expression control.
