performance penalty
Recently Published Documents

Total documents: 97 (five years: 28)
H-index: 11 (five years: 2)

Author(s):  
Christoph Spang ◽  
Yannick Lavan ◽  
Marco Hartmann ◽  
Florian Meisel ◽  
Andreas Koch

Abstract: The Dynamic Execution Integrity Engine (DExIE) is a lightweight hardware monitor that can be flexibly attached to many IoT-class processor pipelines. It is guaranteed to catch both inter- and intra-function illegal control flows in time to prevent any illegal instruction from touching memory. The performance impact of attaching DExIE to a core depends on the concrete pipeline structure. In some especially suitable cases, extending a processor with DExIE incurs no performance penalty at all. DExIE is real-time capable, as it causes either no additional pipeline stalls or only up to 10.4% of additional, predictable stalls. Depending on the monitored processor’s size and structure, DExIE is faster than software-based monitoring and often smaller than a separate guard processor. We present not just the hardware architecture but also the automated programming flow, and discuss compact, adaptable storage formats for holding fine-grained control-flow information.
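Stripped to its essence, the enforcement idea is that every control transfer is checked against a precomputed table of legal edges before the instruction may proceed. A minimal Python model of that check (addresses and table contents are invented for illustration; DExIE itself is a hardware design):

```python
# Legal intra- and inter-function control-flow edges, as would be extracted
# from a binary's control-flow graph by an automated programming flow.
LEGAL_EDGES = {
    (0x1000, 0x1004),  # sequential fall-through
    (0x1004, 0x1020),  # direct call
    (0x1024, 0x1008),  # return to the instruction after the call site
}

def check_transfer(src: int, dst: int) -> bool:
    """Allow the transfer src -> dst only if it is a known-legal edge."""
    return (src, dst) in LEGAL_EDGES

assert check_transfer(0x1004, 0x1020)       # legitimate call
assert not check_transfer(0x1004, 0x2000)   # hijacked control flow is caught
```

In hardware this lookup runs in parallel with the pipeline, which is how an illegal target can be flagged before the offending instruction touches memory.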


2021 ◽  
Vol 11 (24) ◽  
pp. 12046
Author(s):  
Tibor Skala ◽  
Mirsad Todorovac ◽  
Miklós Kozlovszky ◽  
Marko Maričević

In this paper, we describe the challenge of developing a web front-end that gives an interactive and relatively immediate result without the overhead of complex grid scheduling, given the grid's lack of interactivity and its need for certificates that users simply do not own. In particular, the local system of issuing grid certificates is limited to a narrower community than the one we wanted to reach in order to popularize the grid, and our desired level of service availability exceeded what using the cluster for grid purposes could offer. We therefore developed an interactive, scalable web front-end and back-end animation-rendering frame dispatcher that access our cluster's rendering power with low latency, low overhead, and a low performance penalty added to the cost of Persistence of Vision Ray rendering. The system is designed to survive temporary or catastrophic failures such as temporary power loss, load shedding, or malfunction of the rendering-server cluster or client hardware, whether through an automatic or a manual restart, as long as the hardware that holds the previous work and the periodically dumped state of the automata is preserved.
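The restart behaviour described here hinges on periodically dumping the dispatcher's state to stable storage. A toy Python model of that idea (the file format, class, and method names are ours, not the authors'):

```python
import json

class FrameDispatcher:
    """Toy frame dispatcher with periodic state dumps, so that after a power
    loss a restart resumes from the last checkpoint (an illustrative model,
    not the authors' implementation)."""

    def __init__(self, total_frames, state_path):
        self.state_path = state_path
        self.pending = list(range(total_frames))
        self.done = []

    def next_frame(self):
        """Hand the next pending frame number to a render node."""
        return self.pending[0]

    def mark_done(self, frame):
        """Record a finished frame and checkpoint the dispatcher's state."""
        self.pending.remove(frame)
        self.done.append(frame)
        self._dump()

    def _dump(self):
        with open(self.state_path, "w") as f:
            json.dump({"pending": self.pending, "done": self.done}, f)

    @classmethod
    def restore(cls, state_path):
        """Rebuild the dispatcher from the last dumped state after a restart."""
        d = cls.__new__(cls)
        with open(state_path) as f:
            state = json.load(f)
        d.state_path = state_path
        d.pending, d.done = state["pending"], state["done"]
        return d
```

After a crash, `restore` picks up from the last dump, so at most the frames rendered since that dump are lost and re-dispatched.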


2021 ◽  
Vol 11 (20) ◽  
pp. 9408
Author(s):  
Gábor Németh

Oblivious routing is a static algorithm for routing arbitrary user demands such that the competitive ratio, the ratio of the maximum congestion to the best possible congestion, is minimal. Oblivious routing turns out to be surprisingly efficient in this worst-case sense: in undirected graphs, we pay only a logarithmic performance penalty, and the penalty is usually smaller than 2 in directed graphs as well. However, compared to an optimal adaptive algorithm, which never causes congestion when given a routable demand, oblivious routing will still sometimes congest the network. The open question is how often the network is in a congested state. In this paper, we study two performance measures that arise naturally in this context: the probability of congestion and the expected value of congestion. Our main result is the finding that, in certain directed graphs on n nodes, the probability of congestion approaches 1 despite the competitive ratio being O(1).
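The first of these two measures lends itself to empirical estimation. A Monte Carlo sketch of what "probability of congestion" means for a fixed routing (the topology, demand distribution, and link assignment below are invented purely to illustrate the quantity):

```python
import random

def congestion_probability(trials=2000, links=4, capacity=1.0, seed=1):
    """Estimate P[congestion] for an oblivious (demand-independent) routing:
    each trial draws random demands, places them on their preassigned links,
    and checks whether any link load exceeds capacity. Illustrative only."""
    rng = random.Random(seed)
    congested = 0
    for _ in range(trials):
        load = [0.0] * links
        for _ in range(3):                      # three random demands per trial
            load[rng.randrange(links)] += rng.uniform(0.0, 0.6)
        if max(load) > capacity:
            congested += 1
    return congested / trials
```

An adaptive algorithm could reroute each drawn demand set to avoid overload entirely; the oblivious routing's paths are fixed in advance, which is exactly why a nonzero congestion probability remains.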


2021 ◽  
Vol 5 (ICFP) ◽  
pp. 1-29
Author(s):  
Adam Paszke ◽  
Daniel D. Johnson ◽  
David Duvenaud ◽  
Dimitrios Vytiniotis ◽  
Alexey Radul ◽  
...  

We present a novel programming language design that attempts to combine the clarity and safety of high-level functional languages with the efficiency and parallelism of low-level numerical languages. We treat arrays as eagerly-memoized functions on typed index sets, allowing abstract function manipulations, such as currying, to work on arrays. In contrast to composing primitive bulk-array operations, we argue for an explicit nested indexing style that mirrors the application of functions to arguments. We also introduce a fine-grained typed effects system that affords concise and automatically parallelized in-place updates. Specifically, an associative accumulation effect allows reverse-mode automatic differentiation of in-place updates in a way that preserves parallelism. Empirically, we benchmark against the Futhark array programming language and demonstrate that aggressive inlining and type-driven compilation allow array programs to be written in an expressive, "pointful" style with little performance penalty.
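The array-as-function view can be approximated in plain Python, which makes the currying point concrete (this is a rough analogue only; the language's typed index sets and effect system are not modelled):

```python
def as_array(index_set, f):
    """Model an array as an eagerly-memoized function on an index set:
    the table is built once, then the array *is* the lookup function."""
    table = {i: f(i) for i in index_set}   # eager memoization
    return table.__getitem__

n, m = 3, 4
a = as_array([(i, j) for i in range(n) for j in range(m)],
             lambda ij: ij[0] * m + ij[1])

# Currying: turning the (i, j)-indexed array into a function from a row
# index to a 1-D array is pure function manipulation; no data is moved.
def row(i):
    return as_array(range(m), lambda j: a((i, j)))

assert row(1)(2) == a((1, 2))   # nested indexing mirrors function application
```

The explicit nested indexing style argued for in the abstract is visible here: `row(1)(2)` reads as applying a function to arguments, rather than as a composition of bulk-array primitives.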


2021 ◽  
Vol 18 (4) ◽  
pp. 1-26
Author(s):  
Candace Walden ◽  
Devesh Singh ◽  
Meenatchi Jagasivamani ◽  
Shang Li ◽  
Luyi Kang ◽  
...  

Many emerging non-volatile memories are compatible with CMOS logic, potentially enabling their integration into a CPU’s die. This article investigates such monolithically integrated CPU–main memory chips. We exploit non-volatile memories employing 3D crosspoint subarrays, such as resistive RAM (ReRAM), and integrate them over the CPU’s last-level cache (LLC). The regular structure of cache arrays enables co-design of the LLC and ReRAM main memory for area efficiency. We also develop a streamlined LLC/main memory interface that employs a single shared internal interconnect for both the cache and main memory arrays, and uses a unified controller to service both LLC and main memory requests. We apply our monolithic design ideas to a many-core CPU by integrating 3D ReRAM over each core’s LLC slice. We find that co-design of the LLC and ReRAM saves 27% of the total LLC–main memory area at the expense of slight increases in delay and energy. The streamlined LLC/main memory interface saves an additional 12% in area. Our simulation results show monolithic integration of CPU and main memory improves performance by 5.3× and 1.7× over HBM2 DRAM for several graph and streaming kernels, respectively. It also reduces the memory system’s energy by 6.0× and 1.7×, respectively. Moreover, we show that the area savings of co-design permits the CPU to have 23% more cores and main memory, and that streamlining the LLC/main memory interface incurs a small 4% performance penalty.
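Taking the two reported figures at face value, and assuming (as the phrase "an additional 12%" suggests) that both savings are quoted against the original LLC-plus-main-memory area so they add rather than compound, the combined effect is:

```python
def remaining_area(co_design_saving=0.27, interface_saving=0.12):
    """Fraction of the original LLC + main-memory area left after co-design
    (27%) and interface streamlining (an additional 12%), under the
    assumption that the two savings are additive against the original area."""
    return 1.0 - (co_design_saving + interface_saving)

assert abs(remaining_area() - 0.61) < 1e-9   # 39% of the area is saved
```

Per the abstract, it is this freed area that can instead be spent on 23% more cores and main memory.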


2021 ◽  
Vol 18 (3) ◽  
pp. 1-22
Author(s):  
Michael Stokes ◽  
David Whalley ◽  
Soner Onder

While data filter caches (DFCs) have been shown to be effective at reducing data-access energy, they have not been adopted in processors due to the associated performance penalty caused by high DFC miss rates. In this article, we present a design that both decreases the DFC miss rate and completely eliminates the DFC performance penalty, even for a level-one data cache (L1 DC) with a single-cycle access time. First, we show that a DFC that lazily fills each word in a DFC line from the L1 DC only when the word is referenced is more energy-efficient than eagerly filling the entire DFC line. For a 512B DFC, we are able to eliminate loads of words into the DFC that are never referenced before being evicted, which occurred for about 75% of the words in 32B lines. Second, we demonstrate that a lazily word-filled DFC line can effectively share and pack data words from multiple L1 DC lines to lower the DFC miss rate. For a 512B DFC, we completely avoid accessing the L1 DC for about 23% of loads and avoid a fully associative L1 DC access for 50% of loads, while the DFC requires only about 2.5% of the size of the L1 DC. Finally, we present a method that completely eliminates the DFC performance penalty by speculatively performing DFC tag checks early and accessing DFC data only when a hit is guaranteed. For a 512B DFC, we improve data-access energy usage for the DTLB and L1 DC by 33% with no performance degradation.
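The lazy per-word fill policy is easy to model in software. An illustrative Python sketch (the structure and names are ours; the article describes a hardware design):

```python
class LazyFilterCache:
    """Minimal model of a lazily word-filled data filter cache: a line's tag
    is allocated on a miss, but each word is copied from the backing L1 data
    cache only when that word is actually referenced."""
    WORDS_PER_LINE = 8

    def __init__(self, l1):
        self.l1 = l1                # backing "L1 DC": word address -> value
        self.lines = {}             # line tag -> {word index: value}

    def load(self, addr):
        tag, word = addr // self.WORDS_PER_LINE, addr % self.WORDS_PER_LINE
        line = self.lines.setdefault(tag, {})   # allocate tag, no eager fill
        if word not in line:
            line[word] = self.l1[addr]          # lazy per-word fill from L1
        return line[word]
```

An eager design would copy all eight words on the miss; the lazy one fetches only the referenced word, which is what avoids loading the roughly 75% of words that would be evicted unreferenced.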


Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 2192
Author(s):  
Timothy Claeys ◽  
Mališa Vučinić ◽  
Thomas Watteyne ◽  
Franck Rousseau ◽  
Bernard Tourancheau

This paper presents a thorough comparison of the Transport Layer Security (TLS) v1.2 and Datagram TLS (DTLS) v1.2 handshakes in 6TiSCH networks. TLS and DTLS play a crucial role in protecting daily Internet traffic, while 6TiSCH is a major low-power link-layer technology for the IoT. In recent years, DTLS has been the de facto security protocol for protecting IoT application traffic, mainly because it runs over a lightweight, unreliable transport protocol, i.e., UDP. However, unlike the DTLS record layer, the handshake requires reliable message delivery. It therefore incorporates sequence numbers, a retransmission timer, and a fragmentation algorithm. Our goal is to study how well these mechanisms perform, in the constrained setting of 6TiSCH, compared to TCP's reliability algorithms, which TLS relies upon. We port the mbedTLS library to OpenWSN, a 6TiSCH reference implementation, and deploy the code on the state-of-the-art OpenMote platform. We show that, when the peers use an ideal channel, the DTLS handshake uses up to 800 fewer bytes and completes 0.6 s faster. Nonetheless, over an unreliable communication link, the DTLS handshake duration suffers a performance penalty of roughly 45%, while TLS's handshake duration degrades by merely 15%. Similarly, the number of exchanged bytes doubles for DTLS, while for TLS the increase is limited to 15%. The results indicate that IoT product developers should account for network characteristics when selecting a security protocol. Neglecting to do so can negatively impact the battery lifetime of the entire constrained network.
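The retransmission timer mentioned above is a doubling timeout: each lost handshake flight roughly doubles the wait before the next attempt, which is one source of DTLS's larger duration penalty on lossy links. A sketch of the schedule (DTLS 1.2 suggests an initial timeout of 1 s that doubles per retry; the retry cap here is an illustrative choice):

```python
def retransmission_schedule(initial_s=1.0, max_retries=4):
    """Doubling retransmission timeouts of the kind the DTLS handshake layer
    uses to recover lost flights. Parameters are illustrative."""
    timeout, schedule = initial_s, []
    for _ in range(max_retries):
        schedule.append(timeout)
        timeout *= 2          # exponential backoff on each loss
    return schedule

assert retransmission_schedule() == [1.0, 2.0, 4.0, 8.0]
```

Under TCP, by contrast, loss recovery happens below TLS with retransmission timers tuned by measured round-trip times, which helps explain the milder 15% degradation reported for TLS.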


2021 ◽  
Vol 2021 (2) ◽  
pp. 214-234
Author(s):  
Túlio Pascoal ◽  
Jérémie Decouchant ◽  
Antoine Boutet ◽  
Paulo Esteves-Verissimo

Abstract Genome-Wide Association Studies (GWAS) identify the genomic variations that are statistically associated with a particular phenotype (e.g., a disease). The confidence in GWAS results increases with the number of genomes analyzed, which encourages federated computations where biocenters would periodically share the genomes they have sequenced. However, for economical and legal reasons, this collaboration will only happen if biocenters cannot learn each others’ data. In addition, GWAS releases should not jeopardize the privacy of the individuals whose genomes are used. We introduce DyPS, a novel framework to conduct dynamic privacy-preserving federated GWAS. DyPS leverages a Trusted Execution Environment to secure dynamic GWAS computations. Moreover, DyPS uses a scaling mechanism to speed up the releases of GWAS results according to the evolving number of genomes used in the study, even if individuals retract their participation consent. Lastly, DyPS also tolerates up to all-but-one colluding biocenters without privacy leaks. We implemented and extensively evaluated DyPS through several scenarios involving more than 6 million simulated genomes and up to 35,000 real genomes. Our evaluation shows that DyPS updates test statistics with a reasonable additional request processing delay (11% longer) compared to an approach that would update them with minimal delay but would lead to 8% of the genomes not being protected. In addition, DyPS can result in the same amount of aggregate statistics as a static release (i.e., at the end of the study), but can produce up to 2.6 times more statistics information during earlier dynamic releases. Besides, we show that DyPS can support a larger number of genomes and SNP positions without any significant performance penalty.
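For context, the per-SNP test statistics that such GWAS releases expose are typically simple contingency-table tests over allele counts. A standard Pearson chi-square on a 2x2 table (the textbook formula, shown for illustration; not the DyPS codebase):

```python
def chi2_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic for a 2x2 table of alternate/reference
    allele counts in cases vs. controls at one SNP position."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    num = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    den = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
           * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    return num / den

assert chi2_association(10, 10, 10, 10) == 0   # identical groups: no signal
```

Because each statistic aggregates counts across all contributing genomes, updating it as genomes are added or retracted is cheap; the hard part, which DyPS addresses, is deciding when a release of the updated value is safe for the individuals involved.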


Deep Neural Networks (DNNs) used in safety-critical systems cannot compromise their performance in order to address reliability issues; among those issues, soft errors are the most severe. Selective software-based protection is among the best techniques for improving the reliability of DNNs efficiently. However, its most significant challenge is precisely choosing which portions of the DNN model to harden so as to avoid performance degradation. In this work, we propose a comprehensive methodology to analyze the reliability of object detection and classification algorithms run on GPUs, starting from the lowest (instruction) evaluation level. The ultimate goal is to avoid the performance penalty of full instruction duplication by confidently identifying the vulnerable instructions. For this purpose, we propose a metric, the Instruction Vulnerability Factor (IVF). By applying our methodology to the ResNet and YOLO models, we demonstrate that both models' most vulnerable instructions can be precisely determined. Moreover, we show that YOLO is more sensitive to the changes caused by soft errors than ResNet. Also, ResNet's reliability depends on the input image, whereas YOLO's tends to be input-independent.
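The abstract does not give IVF's exact formulation; one plausible reading, sketched here purely as an assumption, is the fraction of faults injected into a given instruction that end in silent data corruption of the model's output:

```python
def ivf(fault_outcomes):
    """Assumed sketch of an Instruction Vulnerability Factor: the fraction of
    injected faults on one instruction whose outcome is silent data
    corruption ("SDC") rather than being masked. The paper's exact
    definition may differ."""
    corrupting = sum(1 for outcome in fault_outcomes if outcome == "SDC")
    return corrupting / len(fault_outcomes)

assert ivf(["SDC", "masked", "masked", "SDC"]) == 0.5
```

Selective hardening would then duplicate only the instructions whose IVF exceeds a chosen threshold, trading a small residual risk for most of the duplication cost.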


Electronics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 1581
Author(s):  
Ronnie Mativenga ◽  
Prince Hamandawana ◽  
Tae-Sun Chung ◽  
Jongik Kim

Flash memory has become highly prevalent thanks to its performance and compactness, which lets it be easily adopted as the storage medium in various portable devices, including smart watches, cell phones, drones, and in-vehicle infotainment systems, to mention but a few. To support large flash storage in such portable devices, existing flash translation layers (FTLs) employ a cache mapping table (CMT), which contains a small portion of the logical-page-number to physical-page-number (LPN-PPN) mappings. For robustness, it is important to consider CMT reconstruction mechanisms during system recovery. Existing approaches currently cannot overcome the performance penalty incurred after an unexpected power failure, because they disregard the delay caused by inconsistencies between the cached page-mapping entries in RAM and their corresponding mapping pages in flash storage. Furthermore, how to select the proper pages for reconstructing the CMT when rebooting a device needs to be revisited. In this study we address these problems and propose a fault-tolerant power-failure recovery mechanism (FTRM) for flash memory storage systems. Our empirical study shows that FTRM is an efficient and robust recovery protocol.
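The CMT itself is a small in-RAM cache over authoritative mapping pages in flash. A toy Python model of the translation path (names and eviction policy are ours, for illustration only):

```python
class CachedMappingTable:
    """Toy cache mapping table (CMT): a bounded LPN -> PPN cache in RAM,
    backed by full mapping pages in flash. After a power failure only the
    flash copy survives, so any RAM entries newer than their flash pages are
    what a recovery scheme such as FTRM must reconcile."""

    def __init__(self, flash_map, capacity=4):
        self.flash_map = flash_map      # authoritative LPN -> PPN in flash
        self.capacity = capacity
        self.cmt = {}                   # small cached subset held in RAM

    def translate(self, lpn):
        """Resolve a logical page number, fetching from flash on a CMT miss."""
        if lpn not in self.cmt:
            if len(self.cmt) >= self.capacity:
                self.cmt.pop(next(iter(self.cmt)))   # evict oldest entry
            self.cmt[lpn] = self.flash_map[lpn]      # read mapping page
        return self.cmt[lpn]
```

On reboot, a recovery mechanism must decide which mapping pages to read back first to rebuild the CMT; choosing poorly is one source of the post-failure performance penalty discussed above.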

