The Performance Penalty of XML for Program Intermediate Representations

Author(s):  
P. Anderson
2021 ◽  
Vol 18 (3) ◽  
pp. 1-22
Author(s):  
Michael Stokes ◽  
David Whalley ◽  
Soner Onder

While data filter caches (DFCs) have been shown to be effective at reducing data access energy, they have not been adopted in processors due to the associated performance penalty caused by high DFC miss rates. In this article, we present a design that both decreases the DFC miss rate and completely eliminates the DFC performance penalty even for a level-one data cache (L1 DC) with a single cycle access time. First, we show that a DFC that lazily fills each word in a DFC line from an L1 DC only when the word is referenced is more energy-efficient than eagerly filling the entire DFC line. For a 512B DFC, we are able to eliminate loads of words into the DFC that are never referenced before being evicted, which occurred for about 75% of the words in 32B lines. Second, we demonstrate that a lazily word filled DFC line can effectively share and pack data words from multiple L1 DC lines to lower the DFC miss rate. For a 512B DFC, we completely avoid accessing the L1 DC for loads about 23% of the time and avoid a fully associative L1 DC access for loads 50% of the time, where the DFC only requires about 2.5% of the size of the L1 DC. Finally, we present a method that completely eliminates the DFC performance penalty by speculatively performing DFC tag checks early and only accessing DFC data when a hit is guaranteed. For a 512B DFC, we improve data access energy usage for the DTLB and L1 DC by 33% with no performance degradation.
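
The difference between eager and lazy filling can be illustrated with a small counting model. The sketch below is not the authors' design: it assumes a direct-mapped DFC, 4-byte words, and a hypothetical helper `words_filled`, and it omits the line sharing and packing described in the abstract. Only the 512B capacity and 32B line size come from the text.

```python
WORD_BYTES = 4                              # assumed word size
LINE_BYTES = 32                             # line size from the abstract
WORDS_PER_LINE = LINE_BYTES // WORD_BYTES   # 8 words per line
NUM_LINES = 512 // LINE_BYTES               # 512B DFC -> 16 lines


def words_filled(addresses, lazy):
    """Count words copied from the L1 DC into a direct-mapped DFC model.

    Eager mode copies the whole line on a line miss.
    Lazy mode copies a word only when that word is first referenced.
    """
    tags = [None] * NUM_LINES                   # line address held by each slot
    valid = [set() for _ in range(NUM_LINES)]   # words present in each slot
    filled = 0
    for addr in addresses:
        line_addr = addr // LINE_BYTES
        index = line_addr % NUM_LINES
        word = (addr % LINE_BYTES) // WORD_BYTES
        if tags[index] != line_addr:            # line miss: evict and retag
            tags[index] = line_addr
            valid[index] = set()
            if not lazy:                        # eager: bring in every word
                valid[index] = set(range(WORDS_PER_LINE))
                filled += WORDS_PER_LINE
        if word not in valid[index]:            # lazy: fill just this word
            valid[index].add(word)
            filled += 1
    return filled
```

For an access pattern that touches one word per line, the eager model copies eight words per miss while the lazy model copies one, which is the kind of wasted fill the abstract reports eliminating.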


2016 ◽  
Vol 9 (10) ◽  
pp. 3803-3815 ◽  
Author(s):  
Gheorghe-Teodor Bercea ◽  
Andrew T. T. McRae ◽  
David A. Ham ◽  
Lawrence Mitchell ◽  
Florian Rathgeber ◽  
...  

Abstract. We present a generic algorithm for numbering and then efficiently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of three-dimensional high aspect ratio domains employed by geophysical finite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake finite element system on a range of low compute intensity operations which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10–20 layers as long as the underlying mesh is well ordered. We characterize the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretizations. On meshes with realistic numbers of layers the performance achieved is between 70 and 90 % of a theoretical hardware-specific limit.
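
The amortization of indirect accesses along the extruded direction can be sketched as follows. This is an illustrative model only, not Firedrake's API: the names `iterate_columns`, `base_map`, and `offsets` are hypothetical, and the numbering assumes that the values in each vertical column are stored contiguously.

```python
def iterate_columns(base_map, offsets, num_layers):
    """Yield (column, layer, indices) for every cell of an extruded mesh.

    base_map[c] lists the data indices of the bottom cell of column c.
    offsets[i] is the stride added to the i-th index per layer.
    The indirection table is read once per column; every other layer is
    reached by direct arithmetic on the bottom cell's indices.
    """
    for column, bottom in enumerate(base_map):      # one indirect lookup
        for layer in range(num_layers):             # direct addressing
            yield column, layer, [d + layer * o for d, o in zip(bottom, offsets)]


# Hypothetical example: a 1D base mesh with 2 intervals and 3 vertices,
# extruded to 3 layers. Vertex v at level l is numbered v * 4 + l.
base_map = [[0, 1, 4, 5], [4, 5, 8, 9]]
offsets = [1, 1, 1, 1]
cells = list(iterate_columns(base_map, offsets, num_layers=3))
```

With this layout the cost of each lookup in `base_map` is shared by all layers of a column, so deeper columns make the indirection proportionally cheaper.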


Author(s):  
Ruben Müller ◽  
Henok Y. Gebretsadik ◽  
Niels Schütze

Abstract. The Kessem–Tendaho project was recently completed to bring about socioeconomic development and growth in the Awash River Basin, Ethiopia. To support the Koka reservoir, two new reservoirs were built together with extensive infrastructure for new irrigation projects. To achieve the best possible socioeconomic benefits under conflicting management goals, such as energy production at three hydropower stations and basin-wide water supply at various sites, integrated management of the reservoir system is required. To satisfy the multi-purpose nature of the reservoir system, a multi-objective parameterization-simulation-optimization model is applied. Different Pareto-optimal trade-off solutions between water supply and hydropower generation are provided for two scenarios: (i) recent conditions and (ii) future planned increases for the Tendaho and Upper Awash irrigation projects. Reservoir performance is further assessed under (i) rule curves with a high degree of freedom, which allow the best performance but may result in rule curves too variable for real-world operation, and (ii) smooth rule curves obtained by artificial neural networks. The results show no performance penalty for smooth rule curves under future conditions but a notable penalty under recent conditions.
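
The notion of Pareto-optimal trade-offs used above can be made concrete with a minimal dominance filter. The sketch assumes both objectives are to be maximized, and the sample values are invented for illustration; they are not results from the study.

```python
def pareto_front(points):
    """Return the points that no other point dominates (maximization)."""
    def dominates(a, b):
        # a dominates b if it is at least as good in every objective
        # and strictly better in at least one.
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (water supply score, hydropower score) pairs.
candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
front = pareto_front(candidates)
```

Each point kept in `front` represents a management option for which one objective cannot be improved without worsening the other.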


Author(s):  
T. L. Bowen

The feasibility of an isolated reverse turbine concept for marine propulsion was examined with emphasis on (1) the reverse turbine size needed to meet the stopping distance requirement of a particular ship during a crashback maneuver, and (2) the ahead turbine performance penalty due to reverse turbine windage losses. This particular reverse turbine system was made adaptable to the exhaust elbow and output shaft of an existing free-power-turbine gas turbine. The analysis was based on the application of this reverse turbine concept to a notional single-shaft frigate. The study ship's propulsion system includes two General Electric LM2500 gas turbines with reversing capability, a reduction gear, and a fixed-pitch propeller. A ship propulsion simulation was developed to calculate steady-state ahead and backing performance data, as well as the transient behavior of the ship during crashback maneuvers. The reverse turbine speed and torque required to stop the ship in five ship-lengths and in 3.5 ship-lengths were determined from these calculations. Four reverse turbine designs were generated using a computer program for the preliminary design of axial-flow turbines. The designs included a single-stage and a two-stage impulse turbine for each of the two stopping distances. The penalty on ahead performance due to reverse turbine windage was estimated for each design, using existing experimental data found in the literature. The results obtained thus far tend to support the feasibility of this reverse turbine concept.


2014 ◽  
Vol 23 (04) ◽  
pp. 1450046
Author(s):  
ENRIQUE SEDANO ◽  
SILVIO SEPULVEDA ◽  
FERNANDO CASTRO ◽  
DANIEL CHAVER ◽  
RODRIGO GONZALEZ-ALBERQUILLA ◽  
...  

Studying the behavior of blocks during their lifetime in the cache can provide useful information to reduce the miss rate and therefore improve processor performance. Following this rationale, the peLIFO replacement algorithm [M. Chaudhuri, Proc. Micro'09, New York, 12–16 December, 2009, pp. 401–412], which dynamically learns the number of cache ways required to satisfy short-term reuses while preserving the remaining ways for long-term reuses, has recently been proposed. In this paper, we propose several changes to the original peLIFO policy in order to reduce its implementation complexity, and we extend the algorithm to a shared-cache environment, considering dynamic information about thread behavior to improve cache efficiency. Experimental results confirm that our simplification techniques reduce the required hardware with a negligible performance penalty, while the best of our thread-aware extension proposals reduces CPI by 8.7% and 15.2% on average compared to the original peLIFO and LRU, respectively, for a set of 43 multi-programmed workloads on an 8 MB 16-way set-associative shared L2 cache.

