A Joint Model Provisioning and Request Dispatch Solution for Low-Latency Inference Services on Edge

With the advancement of machine learning, a growing number of mobile users rely on machine learning inference for making time-sensitive and safety-critical decisions. Therefore, the demand for high-quality and low-latency inference services at the network edge has become the key to modern intelligent society. This paper proposes a novel solution that jointly provisions machine learning models and dispatches inference requests to reduce inference latency on edge nodes. Existing solutions either direct inference requests to the nearest edge node to save network latency or balance edge nodes’ workload by reducing queuing and computing time. The proposed solution provisions each edge node with the optimal number and type of inference instances under a holistic consideration of networking, computing, and memory resources. Mobile users can thus be directed to utilize inference services on the edge nodes that offer minimal serving latency. The proposed solution has been implemented using TensorFlow Serving and Kubernetes on an edge cluster. Through simulation and testbed experiments under various system settings, the evaluation results showed that the joint strategy could consistently achieve lower latency than simply searching for the best edge node to serve inference requests.

Download Full-text

You Only Look Once, But Compute Twice: Service Function Chaining for Low-Latency Object Detection in Softwarized Networks

Applied Sciences ◽

10.3390/app11052177 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2177

Author(s):

Zuo Xiang ◽

Patrick Seeling ◽

Frank H. P. Fitzek

Keyword(s):

Machine Learning ◽

Object Detection ◽

Low Latency ◽

Connected Vehicles ◽

Network Nodes ◽

Machine Learning Model ◽

Fold Reduction ◽

Broad Variety ◽

Computational Resources ◽

Service Latency

With increasing numbers of computer vision and object detection application scenarios, those requiring ultra-low service latency times have become increasingly prominent; e.g., those for autonomous and connected vehicles or smart city applications. The incorporation of machine learning through the applications of trained models in these scenarios can pose a computational challenge. The softwarization of networks provides opportunities to incorporate computing into the network, increasing flexibility by distributing workloads through offloading from client and edge nodes over in-network nodes to servers. In this article, we present an example for splitting the inference component of the YOLOv2 trained machine learning model between client, network, and service side processing to reduce the overall service latency. Assuming a client has 20% of the server computational resources, we observe a more than 12-fold reduction of service latency when incorporating our service split compared to on-client processing and and an increase in speed of more than 25% compared to performing everything on the server. Our approach is not only applicable to object detection, but can also be applied in a broad variety of machine learning-based applications and services.

Download Full-text

Machine Learning Oriented Resource Allocation to Achieve Ultra Low Power, Low Latency and High Reliability Vehicular Communication Networks

2020 IEEE 17th India Council International Conference (INDICON) ◽

10.1109/indicon49873.2020.9342584 ◽

2020 ◽

Author(s):

Sasweth C Rajanarayanan ◽

Rohit Misra ◽

Rahul Jashvantbhai Pandya

Keyword(s):

Machine Learning ◽

Resource Allocation ◽

Low Power ◽

Communication Networks ◽

High Reliability ◽

Low Latency ◽

Vehicular Communication ◽

Ultra Low Power ◽

Vehicular Communication Networks

Download Full-text

On-Device Deep Learning Inference for System-on-Chip (SoC) Architectures

Electronics ◽

10.3390/electronics10060689 ◽

2021 ◽

Vol 10 (6) ◽

pp. 689

Author(s):

Tom Springer ◽

Elia Eiroa-Lledo ◽

Elizabeth Stevens ◽

Erik Linstead

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Real Time ◽

Operating Systems ◽

System On Chip ◽

Low Latency ◽

Management Framework ◽

On Chip ◽

Specialized Hardware ◽

Deterministic Behavior

As machine learning becomes ubiquitous, the need to deploy models on real-time, embedded systems will become increasingly critical. This is especially true for deep learning solutions, whose large models pose interesting challenges for target architectures at the “edge” that are resource-constrained. The realization of machine learning, and deep learning, is being driven by the availability of specialized hardware, such as system-on-chip solutions, which provide some alleviation of constraints. Equally important, however, are the operating systems that run on this hardware, and specifically the ability to leverage commercial real-time operating systems which, unlike general purpose operating systems such as Linux, can provide the low-latency, deterministic execution required for embedded, and potentially safety-critical, applications at the edge. Despite this, studies considering the integration of real-time operating systems, specialized hardware, and machine learning/deep learning algorithms remain limited. In particular, better mechanisms for real-time scheduling in the context of machine learning applications will prove to be critical as these technologies move to the edge. In order to address some of these challenges, we present a resource management framework designed to provide a dynamic on-device approach to the allocation and scheduling of limited resources in a real-time processing environment. These types of mechanisms are necessary to support the deterministic behavior required by the control components contained in the edge nodes. To validate the effectiveness of our approach, we applied rigorous schedulability analysis to a large set of randomly generated simulated task sets and then verified the most time critical applications, such as the control tasks which maintained low-latency deterministic behavior even during off-nominal conditions. The practicality of our scheduling framework was demonstrated by integrating it into a commercial real-time operating system (VxWorks) then running a typical deep learning image processing application to perform simple object detection. The results indicate that our proposed resource management framework can be leveraged to facilitate integration of machine learning algorithms with real-time operating systems and embedded platforms, including widely-used, industry-standard real-time operating systems.

Download Full-text

A Comprehensive Machine Learning Approach for Quantitatively Analyzing Development Performance and Optimization for a Heterogeneous Carbonate Reservoir in Middle East

10.2118/208529-ms ◽

2021 ◽

Author(s):

Ruijie Huang ◽

Chenji Wei ◽

Baohua Wang ◽

Baozhu Li ◽

Jian Yang ◽

...

Keyword(s):

Machine Learning ◽

Principal Component ◽

Optimal Number ◽

Gas Production ◽

Carbonate Reservoir ◽

Multidimensional Data ◽

Well Performance ◽

Reference Case ◽

Evaluation Approach ◽

Water Cut

Abstract Compared with conventional reservoir, the development efficiency of the carbonate reservoir is lower, because of the strong heterogeneity and complicated reservoir structure. How to accurately and quantitatively analyze development performance is critical to understand challenges faced, and to propose optimization plans to improve recovery. In the study, we develop a workflow to evaluate similarities and difference of well performance based on Machine Learning methods. A comprehensive Machine Learning evaluation approach for well performance is established by utilizing Principal Component Analysis (PCA) in combination with K-Means clustering. The multidimensional dataset used for analysis consists of over 15 years dynamic surveillance data of producers and static geology parameters of formation, such as oil/water/gas production, GOR, water cut (WC), porosity, permeability, thickness, and depth. This approach divides multidimensional data into several clusters by PCA and K-Means, and quantitatively evaluate the well performance based on clustering results. The approach is successfully developed to visualize (dis)similarities among dynamic and static data of heterogeneous carbonate reservoir, the optimal number of clusters of 27-dimension data is 4. This method provides a systematic framework for visually and quantitatively analyzing and evaluating the development performance of production wells. Reservoir engineers can efficiently propose targeted optimization measures based on the analysis results. This paper offers a reference case for well performance clustering and quantitative analysis and proposing optimization plans that will help engineers make better decision in similar situation.

Download Full-text

A 5.1ms Low-Latency Face Detection Imager with In-Memory Charge-Domain Computing of Machine-Learning Classifiers

2021 Symposium on VLSI Circuits ◽

10.23919/vlsicircuits52068.2021.9492432 ◽

2021 ◽

Author(s):

Hyunsoo Song ◽

Sungjin Oh ◽

Juan Salinas ◽

Sung-Yun Park ◽

Euisik Yoon

Keyword(s):

Machine Learning ◽

Face Detection ◽

Low Latency ◽

Machine Learning Classifiers ◽

Learning Classifiers

Download Full-text

Comparison of synchronous and asynchronous parallelization of extreme surrogate-assisted multi-objective evolutionary algorithm

Natural Computing ◽

10.1007/s11047-020-09806-2 ◽

2020 ◽

Author(s):

Tomohiro Harada ◽

Misaki Kaidan ◽

Ruck Thawonmas

Keyword(s):

Machine Learning ◽

Evolutionary Algorithm ◽

Optimization Problems ◽

Computing Time ◽

The Other ◽

Multi Objective Optimization ◽

Multi Objective ◽

Evaluation Time ◽

Surrogate Function

Abstract This paper investigates the integration of a surrogate-assisted multi-objective evolutionary algorithm (MOEA) and a parallel computation scheme to reduce the computing time until obtaining the optimal solutions in evolutionary algorithms (EAs). A surrogate-assisted MOEA solves multi-objective optimization problems while estimating the evaluation of solutions with a surrogate function. A surrogate function is produced by a machine learning model. This paper uses an extreme learning surrogate-assisted MOEA/D (ELMOEA/D), which utilizes one of the well-known MOEA algorithms, MOEA/D, and a machine learning technique, extreme learning machine (ELM). A parallelization of MOEA, on the other hand, evaluates solutions in parallel on multiple computing nodes to accelerate the optimization process. We consider a synchronous and an asynchronous parallel MOEA as a master-slave parallelization scheme for ELMOEA/D. We carry out an experiment with multi-objective optimization problems to compare the synchronous parallel ELMOEA/D with the asynchronous parallel ELMOEA/D. In the experiment, we simulate two settings of the evaluation time of solutions. One determines the evaluation time of solutions by the normal distribution with different variances. On the other hand, another evaluation time correlates to the objective function value. We compare the quality of solutions obtained by the parallel ELMOEA/D variants within a particular computing time. The experimental results show that the parallelization of ELMOEA/D significantly reduces the computational time. In addition, the integration of ELMOEA/D with the asynchronous parallelization scheme obtains higher quality of solutions quicker than the synchronous parallel ELMOEA/D.

Download Full-text