Approaches to Optimize Memory Footprint for Elephant Flows

Unikernel specializes a minimalistic LibOS and a target application into a standalone single-purpose virtual machine (VM) running on a hypervisor, which is referred to as (virtual) appliance . Compared to traditional VMs, Unikernel appliances have smaller memory footprint and lower overhead while guaranteeing the same level of isolation. On the downside, Unikernel strips off the process abstraction from its monolithic appliance and thus sacrifices flexibility, efficiency, and applicability. In this article, we examine whether there is a balance embracing the best of both Unikernel appliances (strong isolation) and processes (high flexibility/efficiency). We present KylinX, a dynamic library operating system for simplified and efficient cloud virtualization by providing the pVM (process-like VM) abstraction. A pVM takes the hypervisor as an OS and the Unikernel appliance as a process allowing both page-level and library-level dynamic mapping. At the page level, KylinX supports pVM fork plus a set of API for inter-pVM communication (IpC, which is compatible with conventional UNIX IPC). At the library level, KylinX supports shared libraries to be linked to a Unikernel appliance at runtime. KylinX enforces mapping restrictions against potential threats. We implement a prototype of KylinX by modifying MiniOS and Xen tools. Extensive experimental results show that KylinX achieves similar performance both in micro benchmarks (fork, IpC, library update, etc.) and in applications (Redis, web server, and DNS server) compared to conventional processes, while retaining the strong isolation benefit of VMs/Unikernels.

Download Full-text

Proof-of-Prestige: A Useful Work Reward System for Unverifiable Tasks

ACM Transactions on Internet Technology ◽

10.1145/3419483 ◽

2021 ◽

Vol 21 (2) ◽

pp. 1-27

Author(s):

Michał Król ◽

Alberto Sonnino ◽

Mustafa Al-Bassam ◽

Argyrios G. Tasiopoulos ◽

Etienne Rivière ◽

...

Keyword(s):

Mobile Application ◽

Reward System ◽

Fair Exchange ◽

Content Dissemination ◽

Collusion Attacks ◽

Computational Overhead ◽

Smart Contract ◽

Memory Footprint ◽

Vast Range ◽

Over Time

As cryptographic tokens and altcoins are increasingly being built to serve as utility tokens, the notion of useful work consensus protocols is becoming ever more important. With useful work consensus protocols, users get rewards after they have carried out some specific tasks useful for the network. While in some cases the proof of some utility or service can be provided, the majority of tasks are impossible to verify reliably. To deal with such cases, we design “Proof-of-Prestige” (PoP)—a reward system that can run directly on Proof-of-Stake (PoS) blockchains or as a smart contract on top of Proof-of-Work (PoW) blockchains. PoP introduces “prestige,” which is a volatile resource that, in contrast to coins, regenerates over time. Prestige can be gained by performing useful work, spent when benefiting from services, and directly translates to users minting power. Our scheme allows us to reliably reward decentralized workers while keeping the system free for the end-users. PoP is resistant against Sybil and collusion attacks and can be used with a vast range of unverifiable tasks. We build a simulator to assess the cryptoeconomic behavior of the system and deploy a full prototype of a content dissemination platform rewarding its participants. We implement the blockchain component on both Ethereum (PoW) and Cosmos (PoS), provide a mobile application, and connect it with our scheme with a negligible memory footprint. Finally, we adapt a fair exchange protocol allowing us to atomically exchange files for rewards also in scenarios where not all the parties have Internet connectivity. Our evaluation shows that even for large Ethereum traces, PoP introduces sub-millisecond computational overhead for miners in Cosmos and less than 0.013$ smart contract invocation cost for users in Ethereum.

Download Full-text

Classification of Cattle Behaviours Using Neck-Mounted Accelerometer-Equipped Collars and Convolutional Neural Networks

Sensors ◽

10.3390/s21124050 ◽

2021 ◽

Vol 21 (12) ◽

pp. 4050

Author(s):

Dejan Pavlovic ◽

Christopher Davison ◽

Andrew Hamilton ◽

Oskar Marko ◽

Robert Atkinson ◽

...

Keyword(s):

Neural Network ◽

Model Performance ◽

Ground Truth ◽

Practical Implementation ◽

Ground Truth Data ◽

Battery Lifetime ◽

Implementation Challenges ◽

Memory Footprint ◽

Commercial Farms ◽

Using Data

Monitoring cattle behaviour is core to the early detection of health and welfare issues and to optimise the fertility of large herds. Accelerometer-based sensor systems that provide activity profiles are now used extensively on commercial farms and have evolved to identify behaviours such as the time spent ruminating and eating at an individual animal level. Acquiring this information at scale is central to informing on-farm management decisions. The paper presents the development of a Convolutional Neural Network (CNN) that classifies cattle behavioural states (`rumination’, `eating’ and `other’) using data generated from neck-mounted accelerometer collars. During three farm trials in the United Kingdom (Easter Howgate Farm, Edinburgh, UK), 18 steers were monitored to provide raw acceleration measurements, with ground truth data provided by muzzle-mounted pressure sensor halters. A range of neural network architectures are explored and rigorous hyper-parameter searches are performed to optimise the network. The computational complexity and memory footprint of CNN models are not readily compatible with deployment on low-power processors which are both memory and energy constrained. Thus, progressive reductions of the CNN were executed with minimal loss of performance in order to address the practical implementation challenges, defining the trade-off between model performance versus computation complexity and memory footprint to permit deployment on micro-controller architectures. The proposed methodology achieves a compression of 14.30 compared to the unpruned architecture but is nevertheless able to accurately classify cattle behaviours with an overall F1 score of 0.82 for both FP32 and FP16 precision while achieving a reasonable battery lifetime in excess of 5.7 years.

Download Full-text

Memory footprint reduction for embedded systems

10.1145/1361096.1361102 ◽

2008 ◽

Cited By ~ 1

Author(s):

Koen De Bosschere

Keyword(s):

Embedded Systems ◽

Memory Footprint

Download Full-text

EZIOTracer

ACM SIGOPS Operating Systems Review ◽

10.1145/3469379.3469391 ◽

2021 ◽

Vol 55 (1) ◽

pp. 88-98

Author(s):

Mohammed Islam Naas ◽

François Trahay ◽

Alexis Colin ◽

Pierre Olivier ◽

Stéphane Rubini ◽

...

Keyword(s):

Analysis Framework ◽

Comprehensive Understanding ◽

Kernel Space ◽

Data Intensive ◽

Storage Performance ◽

Performance Requirements ◽

Memory Footprint ◽

Extreme Performance ◽

Data Intensive Applications ◽

Kernel Level

Tracing is a popular method for evaluating, investigating, and modeling the performance of today's storage systems. Tracing has become crucial with the increase in complexity of modern storage applications/systems, that are manipulating an ever-increasing amount of data and are subject to extreme performance requirements. There exists many tracing tools focusing either on the user-level or the kernel-level, however we observe the lack of a unified tracer targeting both levels: this prevents a comprehensive understanding of modern applications' storage performance profiles. In this paper, we present EZIOTracer, a unified I/O tracer for both (Linux) kernel and user spaces, targeting data intensive applications. EZIOTracer is composed of a userland as well as a kernel space tracer, complemented with a trace analysis framework able to merge the output of the two tracers, and in particular to relate user-level events to kernel-level ones, and vice-versa. On the kernel side, EZIOTracer relies on eBPF to offer safe, low-overhead, low memory footprint, and flexible tracing capabilities. We demonstrate using FIO benchmark the ability of EZIOTracer to track down I/O performance issues by relating events recorded at both the kernel and user levels. We show that this can be achieved with a relatively low overhead that ranges from 2% to 26% depending on the I/O intensity.

Download Full-text

JAMPI: Efficient Matrix Multiplication in Spark Using Barrier Execution Mode

10.20944/preprints202007.0450.v1 ◽

2020 ◽

Author(s):

Tamas Foldi ◽

Chris von Csefalvay ◽

Nicolas A. Perez

Keyword(s):

Neural Network ◽

Machine Learning ◽

Message Passing ◽

Matrix Multiplication ◽

Map Reduce ◽

Distributed Training ◽

Learning Tasks ◽

Memory Footprint ◽

Asynchronous Network ◽

Execution Mode

The new barrier mode in Apache Spark allows embedding distributed deep learning training as a Spark stage to simplify the distributed training workflow. In Spark, a task in a stage doesn’t depend on any other tasks in the same stage, and hence it can be scheduled independently. However, several algorithms require more sophisticated inter-task communications, similar to the MPI paradigm. By combining distributed message passing (using asynchronous network IO), OpenJDK’s new auto-vectorization and Spark’s barrier execution mode, we can add non-map/reduce based algorithms, such as Cannon’s distributed matrix multiplication to Spark. We document an efficient distributed matrix multiplication using Cannon’s algorithm, which improves significantly on the performance of the existing MLlib implementation. Used within a barrier task, the algorithm described herein results in an up to 24% performance increase on a 10,000x10,000 square matrix with a significantly lower memory footprint. Applications of efficient matrix multiplication include, among others, accelerating the training and implementation of deep convolutional neural network based workloads, and thus such efficient algorithms can play a ground-breaking role in faster, more efficient execution of even the most complicated machine learning tasks

Download Full-text

Towards Fully 8-bit Integer Inference for the Transformer Model

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/520 ◽

2020 ◽

Author(s):

Ye Lin ◽

Yanyang Li ◽

Tengbo Liu ◽

Tong Xiao ◽

Tongran Liu ◽

...

Keyword(s):

Deep Neural Networks ◽

Floating Point ◽

Memory Footprint ◽

Great Progress ◽

Comparable Performance ◽

Complex Models ◽

Language Modelling ◽

Transformer Model ◽

And Storage ◽

Modelling Task

8-bit integer inference, as a promising direction in reducing both the latency and storage of deep neural networks, has made great progress recently. On the other hand, previous systems still rely on 32-bit floating point for certain functions in complex models (e.g., Softmax in Transformer), and make heavy use of quantization and de-quantization. In this work, we show that after a principled modification on the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm Scale Propagation could be derived. De-quantization is adopted when necessary, which makes the network more efficient. Our experiments on WMT16 En<->Ro, WMT14 En<->De and En->Fr translation tasks as well as the WikiText-103 language modelling task show that the fully 8-bit Transformer system achieves comparable performance with the floating point baseline but requires nearly 4x less memory footprint.

Download Full-text

Iterative Algorithm for Parameterization of Two-Region Piecewise Uniform Quantizer for the Laplacian Source

Mathematics ◽

10.3390/math9233091 ◽

2021 ◽

Vol 9 (23) ◽

pp. 3091

Author(s):

Jelena Nikolić ◽

Danijela Aleksić ◽

Zoran Perić ◽

Milan Dinčić

Keyword(s):

Neural Networks ◽

Performance Assessment ◽

Probability Density ◽

Probability Density Functions ◽

Density Functions ◽

Uniform Quantization ◽

Memory Footprint ◽

High Bit Rates ◽

And Performance ◽

Laplacian Distribution

Motivated by the fact that uniform quantization is not suitable for signals having non-uniform probability density functions (pdfs), as the Laplacian pdf is, in this paper we have divided the support region of the quantizer into two disjunctive regions and utilized the simplest uniform quantization with equal bit-rates within both regions. In particular, we assumed a narrow central granular region (CGR) covering the peak of the Laplacian pdf and a wider peripheral granular region (PGR) where the pdf is predominantly tailed. We performed optimization of the widths of CGR and PGR via distortion optimization per border–clipping threshold scaling ratio which resulted in an iterative formula enabling the parametrization of our piecewise uniform quantizer (PWUQ). For medium and high bit-rates, we demonstrated the convenience of our PWUQ over the uniform quantizer, paying special attention to the case where 99.99% of the signal amplitudes belong to the support region or clipping region. We believe that the resulting formulas for PWUQ design and performance assessment are greatly beneficial in neural networks where weights and activations are typically modelled by the Laplacian distribution, and where uniform quantization is commonly used to decrease memory footprint.

Download Full-text