Implementation trade-offs in using a restricted data flow architecture in a high performance RISC microprocessor

We present a numerical optimization method to find highly efficient (sparse) approximations for convolutional image filters. Using a modified parallel tempering approach, we solve a constrained optimization that maximizes approximation quality while strictly staying within a user-prescribed performance budget. The results are multi-pass filters where each pass computes a weighted sum of bilinearly interpolated sparse image samples, exploiting hardware acceleration on the GPU. We systematically decompose the target filter into a series of sparse convolutions, trying to find good trade-offs between approximation quality and performance. Since our sparse filters are linear and translation-invariant, they do not exhibit the aliasing and temporal coherence issues that often appear in filters working on image pyramids. We show several applications, ranging from simple Gaussian or box blurs to the emulation of sophisticated Bokeh effects with user-provided masks. Our filters achieve high performance as well as high quality, often providing significant speed-up at acceptable quality even for separable filters. The optimized filters can be baked into shaders and used as a drop-in replacement for filtering tasks in image processing or rendering pipelines.

Download Full-text

You Only Traverse Twice: A YOTT Placement, Routing, and Timing Approach for CGRAs

ACM Transactions on Embedded Computing Systems ◽

10.1145/3477038 ◽

2021 ◽

Vol 20 (5s) ◽

pp. 1-25

Author(s):

Michael Canesche ◽

Westerley Carvalho ◽

Lucas Reis ◽

Matheus Oliveira ◽

Salles Magalhães ◽

...

Keyword(s):

Execution Time ◽

High Performance ◽

Coarse Grained ◽

Optimal Placement ◽

Greedy Heuristics ◽

High Quality ◽

Solution Quality ◽

Graph Traversal ◽

Trade Offs ◽

Graph Properties

Coarse-grained reconfigurable architecture (CGRA) mapping involves three main steps: placement, routing, and timing. The mapping is an NP-complete problem, and a common strategy is to decouple this process into its independent steps. This work focuses on the placement step, and its aim is to propose a technique that is both reasonably fast and leads to high-performance solutions. Furthermore, a near-optimal placement simplifies the following routing and timing steps. Exact solutions cannot find placements in a reasonable execution time as input designs increase in size. Heuristic solutions include meta-heuristics, such as Simulated Annealing (SA) and fast and straightforward greedy heuristics based on graph traversal. However, as these approaches are probabilistic and have a large design space, it is not easy to provide both run-time efficiency and good solution quality. We propose a graph traversal heuristic that provides the best of both: high-quality placements similar to SA and the execution time of graph traversal approaches. Our placement introduces novel ideas based on “you only traverse twice” (YOTT) approach that performs a two-step graph traversal. The first traversal generates annotated data to guide the second step, which greedily performs the placement, node per node, aided by the annotated data and target architecture constraints. We introduce three new concepts to implement this technique: I/O and reconvergence annotation, degree matching, and look-ahead placement. Our analysis of this approach explores the placement execution time/quality trade-offs. We point out insights on how to analyze graph properties during dataflow mapping. Our results show that YOTT is 60.6 , 9.7 , and 2.3 faster than a high-quality SA, bounding box SA VPR, and multi-single traversal placements, respectively. Furthermore, YOTT reduces the average wire length and the maximal FIFO size (additional timing requirement on CGRAs) to avoid delay mismatches in fully pipelined architectures.

Download Full-text

Persistent memory hash indexes

Proceedings of the VLDB Endowment ◽

10.14778/3446095.3446101 ◽

2021 ◽

Vol 14 (5) ◽

pp. 785-798

Author(s):

Daokun Hu ◽

Zhiwen Chen ◽

Jianbing Wu ◽

Jianhua Sun ◽

Hao Chen

Keyword(s):

Future Development ◽

High Performance ◽

Performance Metrics ◽

Comprehensive Evaluation ◽

State Of The Art ◽

Hash Tables ◽

Trade Offs ◽

Depth Analysis ◽

Persistent Memory ◽

Memory Modules

Persistent memory (PM) is increasingly being leveraged to build hash-based indexing structures featuring cheap persistence, high performance, and instant recovery, especially with the recent release of Intel Optane DC Persistent Memory Modules. However, most of them are evaluated on DRAM-based emulators with unreal assumptions, or focus on the evaluation of specific metrics with important properties sidestepped. Thus, it is essential to understand how well the proposed hash indexes perform on real PM and how they differentiate from each other if a wider range of performance metrics are considered. To this end, this paper provides a comprehensive evaluation of persistent hash tables. In particular, we focus on the evaluation of six state-of-the-art hash tables including Level hashing, CCEH, Dash, PCLHT, Clevel, and SOFT, with real PM hardware. Our evaluation was conducted using a unified benchmarking framework and representative workloads. Besides characterizing common performance properties, we also explore how hardware configurations (such as PM bandwidth, CPU instructions, and NUMA) affect the performance of PM-based hash tables. With our in-depth analysis, we identify design trade-offs and good paradigms in prior arts, and suggest desirable optimizations and directions for the future development of PM-based hash tables.

Download Full-text

Trade-Off Analysis for the Design of a High Performance Hydrostatic Actuation System

10.1115/imece2000-1981 ◽

2000 ◽

Author(s):

S. R. Habibi

Keyword(s):

Mathematical Modeling ◽

Quadratic Programming ◽

High Performance ◽

Design Parameters ◽

Mathematical Functions ◽

Trade Off ◽

Actuation System ◽

Trade Offs ◽

Expected Performance ◽

Dominant Design

Abstract This paper considers the design of a high performance hydrostatic actuation system referred to as the ElectroHydraulic Actuator (EHA). The expected performance of EHA and its dominant design parameters are identified by using mathematical modeling. The design parameters are classified into Direct and Indirect categories based on the measure of their accessibility to the designer. The Direct parameters are directly quantifiable and, can be linked to the performance of EHA through a set of mathematical functions. A prototype of EHA has been produced and described. The mathematical functions linking performance to design parameters are used to investigate design trade-offs. Design improvements to the prototype are suggested by using constrained quadratic programming.

Download Full-text

Crafting Efficient Neural Graph of Large Entropy

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/311 ◽

2019 ◽

Author(s):

Minjing Dong ◽

Hanting Chen ◽

Yunhe Wang ◽

Chang Xu

Keyword(s):

Information Flow ◽

Network Architecture ◽

High Performance ◽

Global Information ◽

High Quality ◽

Network Pruning ◽

Initial Network ◽

Trade Offs ◽

Graph Properties ◽

Deep Cnn

Network pruning is widely applied to deep CNN models due to their heavy computation costs and achieves high performance by keeping important weights while removing the redundancy. Pruning redundant weights directly may hurt global information flow, which suggests that an efficient sparse network should take graph properties into account. Thus, instead of paying more attention to preserving important weight, we focus on the pruned architecture itself. We propose to use graph entropy as the measurement, which shows useful properties to craft high-quality neural graphs and enables us to propose efficient algorithm to construct them as the initial network architecture. Our algorithm can be easily implemented and deployed to different popular CNN models and achieve better trade-offs.

Download Full-text

CMOS DEVICES ARCHITECTURES AND TECHNOLOGY INNOVATIONS FOR THE NANOELECTRONICS ERA

International Journal of High Speed Electronics and Systems ◽

10.1142/s0129156406003618 ◽

2006 ◽

Vol 16 (01) ◽

pp. 193-219 ◽

Cited By ~ 1

Author(s):

S. DELEONIBUS ◽

B. de SALVO ◽

T. ERNST ◽

O. FAYNOT ◽

T. POIROUX ◽

...

Keyword(s):

Low Power ◽

High Performance ◽

Gate Dielectric ◽

Low Voltage ◽

Short Channel Effects ◽

Short Channel ◽

Trade Offs ◽

Future System ◽

The Future ◽

Channel Effects

Innovations in electronics history have been possible because of the strong association of devices and materials research. The demand for low voltage, low power and high performance are the great challenges for engineering of sub 50nm gate length CMOS devices. Functional CMOS devices in the range of 5 nm channel length have been demonstrated. The alternative architectures allowing to increase devices drivability and reduce power are reviewed through the issues to address in gate/channel and substrate, gate dielectric as well as source and drain engineering. HiK gate dielectric and metal gate are among the most strategic options to consider for power consumption and low supply voltage management. It will be very difficult to compete with CMOS logic because of the low series resistance required to obtain high performance. By introducing new materials ( Ge , diamond/graphite Carbon, HiK, …), Si based CMOS will be scaled beyond the ITRS as the future System-on-Chip Platform integrating new disruptive devices. The association of C-diamond with HiK as a combination for new functionalized Buried Insulators, for example, will bring new ways of improving short channel effects and suppress self-heating. That will allow new optimization of Ion-Ioff trade offs. The control of low power dissipation and short channel effects together with high performance will be the major challenges in the future.

Download Full-text

Characterization of Exotic Material Heat Sinks for Laser Diode Arrays

MRS Proceedings ◽

10.1557/proc-883-ff3.2 ◽

2005 ◽

Vol 883 ◽

Author(s):

Edward.F. Stephens

Keyword(s):

Laser Diode ◽

High Performance ◽

Critical Role ◽

Cost Effective ◽

Heat Sinks ◽

High Peak Power ◽

Low Duty Cycle ◽

Trade Offs ◽

Duty Cycles ◽

Laser Diode Arrays

AbstractLow duty cycle, high peak power, conductively cooled laser diode arrays have been manufactured for several years by a number of different vendors. Typically these packages have been limited to a few percent duty cycles due to thermal problems that develop in tight bar pitch arrays at higher duty cycles. Traditionally these packages are made from some combination of copper and BeO or Tungsten/copper and BeO. Trade-offs between thermal conductivity and CTE matching are always made when manufacturing these devices. In addition, the manufacturability of the heat sinks plays a critical role in creating a cost effective, high performance solution. In this discussion we examine several different exotic materials that have been manufactured and tested as heat sinks for laser diode arrays.

Download Full-text

Eco-Efficient Resource Management in HPC Clusters through Computer Intelligence Techniques

Energies ◽

10.3390/en12112129 ◽

2019 ◽

Vol 12 (11) ◽

pp. 2129 ◽

Cited By ~ 1

Author(s):

Alberto Cocaña-Fernández ◽

Emilio San José Guiote ◽

Luciano Sánchez ◽

José Ranilla

Keyword(s):

Environmental Impact ◽

High Performance ◽

Service Providers ◽

It Service ◽

Genetic Fuzzy System ◽

Trade Offs ◽

Efficient Resource ◽

The University ◽

Performance Computing

High Performance Computing Clusters (HPCCs) are common platforms for solving both up-to-date challenges and high-dimensional problems faced by IT service providers. Nonetheless, the use of HPCCs carries a substantial and growing economic and environmental impact, owing to the large amount of energy they need to operate. In this paper, a two-stage holistic optimisation mechanism is proposed to manage HPCCs in an eco-efficiently manner. The first stage logically optimises the resources of the HPCC through reactive and proactive strategies, while the second stage optimises hardware allocation by leveraging a genetic fuzzy system tailored to the underlying equipment. The model finds optimal trade-offs among quality of service, direct/indirect operating costs, and environmental impact, through multiobjective evolutionary algorithms meeting the preferences of the administrator. Experimentation was done using both actual workloads from the Scientific Modelling Cluster of the University of Oviedo and synthetically-generated workloads, showing statistical evidence supporting the adoption of the new mechanism.

Download Full-text