Communication-Efficient Randomized Algorithm for Multi-Kernel Online Federated Learning

2021 ◽  
Author(s):  
Songnam Hong ◽  
Jeongmin Chae

Online federated learning (OFL) is a promising framework for learning a sequence of global functions from distributed sequential data at local devices. Within this framework, we first introduce a single-kernel OFL method (termed S-KOFL) by suitably combining the random-feature (RF) approximation, online gradient descent (OGD), and federated averaging (FedAvg). However, it is nontrivial to develop a communication-efficient method with multiple kernels. One can construct a multi-kernel method (termed vM-KOFL) by following the extension principle of the centralized counterpart, but this vanilla method is impractical because its communication overhead grows linearly with the size of the kernel dictionary. Moreover, the problem is not resolved by existing communication-efficient techniques in federated learning such as quantization or sparsification. Our major contribution is a novel randomized algorithm (named eM-KOFL) that enjoys the advantages of multiple kernels while incurring a communication overhead similar to that of S-KOFL. We prove that eM-KOFL yields the same asymptotic performance as vM-KOFL, i.e., both methods achieve an optimal sublinear regret bound. By efficiently mimicking the key principle of eM-KOFL, pM-KOFL is then presented. Via numerical tests with real datasets, we demonstrate that pM-KOFL yields the same performance as vM-KOFL and eM-KOFL on various online learning tasks while having the same communication overhead as S-KOFL. These results suggest the practicality of the proposed pM-KOFL.
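To make the ingredients concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the single-kernel pipeline: a random Fourier feature map approximating a Gaussian kernel, local online gradient descent on the RF weights, and FedAvg aggregation at the server. All names, data, and hyperparameters are illustrative.

```python
import numpy as np

def rf_map(x, W, b):
    """Random Fourier features approximating a Gaussian (RBF) kernel."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

def local_ogd_step(theta, x, y, W, b, lr=0.1):
    """One online gradient descent step on the squared loss at a client."""
    z = rf_map(x, W, b)
    grad = 2.0 * (z @ theta - y) * z      # gradient of (z^T theta - y)^2
    return theta - lr * grad

def fedavg(thetas):
    """Server-side federated averaging of client RF weights."""
    return np.mean(thetas, axis=0)

# Toy run: K clients, d input dims, D random features, shared RF parameters.
rng = np.random.default_rng(0)
K, d, D, sigma = 5, 3, 100, 1.0
W = rng.normal(scale=1.0 / sigma, size=(D, d))   # spectral samples of the RBF kernel
b = rng.uniform(0, 2 * np.pi, size=D)
theta_global = np.zeros(D)

for t in range(200):
    local_models = []
    for k in range(K):                            # each client sees one new sample
        x, y = rng.normal(size=d), rng.normal()
        local_models.append(local_ogd_step(theta_global.copy(), x, y, W, b))
    theta_global = fedavg(local_models)           # aggregate and broadcast
```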


2021 ◽  
Author(s):  
Hideyuki Miyahara ◽  
Vwani Roychowdhury

Abstract The paradigm of variational quantum classifiers (VQCs) encodes classical information as quantum states, followed by quantum processing and then measurements to generate classical predictions. VQCs are promising candidates for efficient utilization of noisy intermediate-scale quantum (NISQ) devices: classifiers involving M-dimensional datasets can be implemented with only ⌈log₂ M⌉ qubits by using an amplitude encoding. A general framework for designing and training VQCs, however, is lacking. An encouraging specific embodiment of VQCs, quantum circuit learning (QCL), utilizes an ansatz: a circuit with a predetermined circuit geometry and parametrized gates expressing a time-evolution unitary operator; training involves learning the gate parameters through a gradient-descent algorithm in which the gradients themselves can be efficiently estimated by the quantum circuit. The representational power of QCL, however, depends strongly on the choice of the ansatz, as it limits the range of unitary operators that the VQC can search over. Equally importantly, the landscape of the optimization problem may have challenging properties such as barren plateaus, and the associated gradient-descent algorithm may not find good local minima. Thus, it is critically important to estimate (i) the price of the ansatz, that is, the gap between the performance of QCL and the performance of ansatz-independent VQCs, and (ii) the price of using quantum circuits as classical classifiers, that is, the performance gap between VQCs and equivalent classical classifiers. This paper develops a computational framework to address both of these open problems. First, it shows that VQCs, including QCL, fit inside the well-known kernel method. Next, it introduces a framework for efficiently designing ansatz-independent VQCs, which we call the unitary kernel method (UKM). The UKM framework enables one to estimate the first known bounds on both the price of ansatz and the price of any speedup advantages of VQCs: numerical results with datasets of various dimensions, ranging from 4 to 256, show that the ansatz-induced gap can vary between 10% and 20%, while the VQC-induced gap (between VQCs and the kernel method) can vary between 10% and 16%. To further understand the role of the ansatz in VQCs, we also propose a method for decomposing a given unitary operator into a quantum circuit, which we call the variational circuit realization (VCR): given any parameterized circuit block (as, for example, used in QCL), it finds optimal parameters and the number of layers of the circuit block required to approximate any target unitary operator with a given precision.
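As a concrete illustration of the qubit count quoted above, here is a minimal, hypothetical sketch of amplitude encoding: an M-dimensional feature vector is padded to the nearest power of two and normalized, so it can serve as the amplitude vector of a state on ⌈log₂ M⌉ qubits. This is an assumption-laden toy, not the paper's circuit construction.

```python
import numpy as np

def amplitude_encode(x):
    """Map an M-dimensional real vector to the amplitudes of an n-qubit state,
    with n = ceil(log2(M)). Returns (amplitudes, n_qubits)."""
    x = np.asarray(x, dtype=float)
    M = x.size
    n_qubits = int(np.ceil(np.log2(M)))
    padded = np.zeros(2 ** n_qubits)
    padded[:M] = x
    norm = np.linalg.norm(padded)
    if norm == 0:
        raise ValueError("cannot encode the zero vector")
    return padded / norm, n_qubits

amps, n = amplitude_encode([0.3, 1.2, -0.5])   # M = 3  ->  n = 2 qubits
assert np.isclose(np.linalg.norm(amps), 1.0)   # valid quantum amplitudes
```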


Author(s):  
Yi Xu ◽  
Zhuoning Yuan ◽  
Sen Yang ◽  
Rong Jin ◽  
Tianbao Yang

Extrapolation is a well-known technique for solving convex optimization and variational inequality problems, and it has recently attracted attention for non-convex optimization. Several recent works have empirically demonstrated its success in machine learning tasks. However, it has not been analyzed for non-convex minimization, and there remains a gap between theory and practice. In this paper, we analyze gradient descent and stochastic gradient descent with extrapolation for finding an approximate first-order stationary point in smooth non-convex optimization problems. Our convergence upper bounds show that the algorithms with extrapolation converge faster than their counterparts without extrapolation.
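For reference, below is a minimal sketch of one common extrapolation scheme for gradient descent, where the gradient is evaluated at a point extrapolated along the previous update direction. The exact variant analyzed in the paper may differ; the step sizes and the test function are illustrative only.

```python
import numpy as np

def gd_with_extrapolation(grad, x0, eta=0.1, gamma=0.5, iters=100):
    """Gradient descent with extrapolation: evaluate the gradient at the
    extrapolated point y_t = x_t + gamma * (x_t - x_{t-1})."""
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(iters):
        y = x + gamma * (x - x_prev)       # extrapolation step
        x_prev, x = x, x - eta * grad(y)   # gradient step at the extrapolated point
    return x

# Toy smooth objective: f(x) = sum(x^2 + 0.5 * sin(x)), gradient below.
grad_f = lambda x: 2 * x + 0.5 * np.cos(x)
x_star = gd_with_extrapolation(grad_f, np.array([3.0, -2.0]))
print(np.linalg.norm(grad_f(x_star)))      # gradient norm should be small
```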


2021 ◽  
Author(s):  
Songnam Hong ◽  
Jeongmin Chae

Random feature-based online multi-kernel learning (RF-OMKL) is a promising framework for functional learning tasks. Owing to its low complexity and scalability, it is well suited to online learning with continuously streaming data. Within the RF-OMKL framework, numerous algorithms can be constructed, depending on the underlying online learning and optimization techniques. The best-known algorithm (termed Raker) was proposed through the lens of the celebrated problem of online learning with expert advice, in which each kernel in a kernel dictionary is viewed as an expert. Harnessing this relation, it was proved that Raker yields a sublinear expert regret bound, where, as the name implies, the best comparator function is restricted to the expert-based framework. That is, it is not an actual sublinear regret bound under the RF-OMKL framework. In this paper, we propose a novel algorithm (named BestOMKL) for the RF-OMKL framework and prove that it achieves a sublinear regret bound under a certain condition. Beyond this theoretical contribution, we demonstrate the superiority of our algorithm via numerical tests with real datasets. Notably, BestOMKL outperforms state-of-the-art kernel-based algorithms (including Raker) on various online learning tasks, while having a complexity no higher than Raker's. These results suggest the practicality of BestOMKL.
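To illustrate the expert-advice view described above, the following is a minimal, hypothetical sketch of online multi-kernel learning with random features: each kernel keeps its own RF-based predictor, and a set of exponential weights over kernels plays the role of weights over experts. This is an illustrative toy with made-up data, not the authors' Raker or BestOMKL implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, T = 3, 50, 300
sigmas = [0.5, 1.0, 2.0]                       # kernel dictionary (RBF bandwidths)
Ws = [rng.normal(scale=1.0 / s, size=(D, d)) for s in sigmas]
bs = [rng.uniform(0, 2 * np.pi, size=D) for _ in sigmas]
thetas = [np.zeros(D) for _ in sigmas]         # one RF predictor per kernel
weights = np.ones(len(sigmas))                 # exponential weights over kernels
lr, eta = 0.1, 0.5

def feat(x, W, b):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

for t in range(T):
    x, y = rng.normal(size=d), rng.normal()    # one streaming sample (toy data)
    p = weights / weights.sum()
    preds = [feat(x, W, b) @ th for W, b, th in zip(Ws, bs, thetas)]
    y_hat = float(np.dot(p, preds))            # weighted combination of kernel predictions
    losses = np.array([(pr - y) ** 2 for pr in preds])
    for i in range(len(sigmas)):               # per-kernel OGD update
        z = feat(x, Ws[i], bs[i])
        thetas[i] -= lr * 2.0 * (preds[i] - y) * z
    weights *= np.exp(-eta * losses)           # exponential-weight (expert) update
```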


Quantum ◽  
2020 ◽  
Vol 4 ◽  
pp. 314 ◽  
Author(s):  
Ryan Sweke ◽  
Frederik Wilde ◽  
Johannes Jakob Meyer ◽  
Maria Schuld ◽  
Paul K. Fährmann ◽  
...  

Within the context of hybrid quantum-classical optimization, gradient-descent-based optimizers typically require the evaluation of expectation values with respect to the outcome of parameterized quantum circuits. In this work, we explore the consequences of the prior observation that estimating these quantities on quantum hardware results in a form of stochastic gradient descent. We formalize this notion, which allows us to show that in many relevant cases, including VQE, QAOA, and certain quantum classifiers, estimating expectation values with k measurement outcomes results in optimization algorithms whose convergence properties can be rigorously well understood, for any value of k. In fact, even using single measurement outcomes for the estimation of expectation values is sufficient. Moreover, in many settings the required gradients can be expressed as linear combinations of expectation values (originating, e.g., from a sum over local terms of a Hamiltonian, a parameter-shift rule, or a sum over data-set instances), and we show that in these cases k-shot expectation-value estimation can be combined with sampling over terms of the linear combination to obtain "doubly stochastic" gradient descent optimizers. For all algorithms, we prove convergence guarantees, providing a framework for the derivation of rigorous optimization results in the context of near-term quantum devices. Additionally, we numerically explore these methods on benchmark VQE, QAOA, and quantum-enhanced machine learning tasks and show that treating the stochastic settings as hyper-parameters allows for state-of-the-art results with significantly fewer circuit executions and measurements.
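To make the k-shot idea tangible, here is a minimal toy sketch for a single-qubit circuit RY(θ)|0⟩ with observable Z, where ⟨Z⟩ = cos θ exactly. The expectation is estimated from k simulated measurement shots and plugged into the standard parameter-shift rule. This is an illustrative classical simulation, not the paper's code, and the shot count and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def k_shot_expectation(theta, k):
    """Estimate <Z> for the state RY(theta)|0> from k measurement shots.
    P(outcome +1) = cos^2(theta / 2), so the exact expectation is cos(theta)."""
    p_plus = np.cos(theta / 2.0) ** 2
    shots = rng.choice([1.0, -1.0], size=k, p=[p_plus, 1.0 - p_plus])
    return shots.mean()

def parameter_shift_grad(theta, k):
    """Parameter-shift rule: d<Z>/dtheta = (<Z>(theta+pi/2) - <Z>(theta-pi/2)) / 2,
    with each expectation estimated from only k shots (stochastic gradient)."""
    return 0.5 * (k_shot_expectation(theta + np.pi / 2, k)
                  - k_shot_expectation(theta - np.pi / 2, k))

# Minimize <Z> (minimum -1 at theta = pi) with k-shot stochastic gradient descent.
theta, lr, k = 0.3, 0.2, 10
for _ in range(500):
    theta -= lr * parameter_shift_grad(theta, k)
print(theta, np.cos(theta))   # theta should approach pi, <Z> should approach -1
```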


Author(s):  
Shaohuai Shi ◽  
Kaiyong Zhao ◽  
Qiang Wang ◽  
Zhenheng Tang ◽  
Xiaowen Chu

Gradient sparsification is a promising technique to significantly reduce the communication overhead in decentralized synchronous stochastic gradient descent (S-SGD) algorithms. Yet, many existing gradient sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of gradients selected by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme has been proposed to reduce the communication complexity from O(kP) to O(k log P), which significantly boosts the system scalability. However, it remains unclear whether the gTop-k sparsification scheme converges in theory. In this paper, we first provide theoretical proofs of the convergence of the gTop-k scheme for non-convex objective functions under certain analytic assumptions. We then derive the convergence rate of gTop-k S-SGD, which is of the same order as that of vanilla mini-batch SGD. Finally, we conduct extensive experiments on different machine learning models and data sets to verify the soundness of the assumptions and theoretical results, and we discuss the impact of the compression ratio on the convergence performance.
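The following is a minimal, hypothetical sketch of the sparsification idea: each worker keeps only its local top-k gradient entries, and a tree-structured reduction repeatedly merges pairs of sparse vectors and re-selects the top-k entries, so the communication per worker scales with k log P rather than kP. It is a conceptual single-process simulation, not the authors' distributed implementation.

```python
import numpy as np

def top_k_sparsify(g, k):
    """Zero out all but the k largest-magnitude entries of g."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def gtopk_aggregate(local_grads, k):
    """Tree-structured gTop-k-style reduction: merge sparse vectors pairwise and
    re-apply top-k, taking about log2(P) rounds for P workers."""
    sparse = [top_k_sparsify(g, k) for g in local_grads]   # local Top-k
    while len(sparse) > 1:
        merged = []
        for i in range(0, len(sparse) - 1, 2):
            merged.append(top_k_sparsify(sparse[i] + sparse[i + 1], k))
        if len(sparse) % 2 == 1:                            # carry the odd one over
            merged.append(sparse[-1])
        sparse = merged
    return sparse[0]

rng = np.random.default_rng(0)
P, dim, k = 8, 1000, 10
grads = [rng.normal(size=dim) for _ in range(P)]
global_update = gtopk_aggregate(grads, k)
print(np.count_nonzero(global_update))   # at most k non-zero entries
```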


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Cuixia Xu ◽  
Junlong Zhu ◽  
Youlin Shang ◽  
Qingtao Wu

In a distributed online optimization problem with a convex constraint set over an undirected multi-agent network, the local objective functions are convex and vary over time. Most of the existing methods used to solve this problem are based on gradient descent, but their convergence speed decreases as the number of iterations increases. To accelerate convergence, we present a distributed online conjugate gradient algorithm. Unlike a gradient method, its search directions are a set of vectors that are conjugate to each other, and the step sizes are obtained through an exact line search. We analyze the convergence of the algorithm theoretically and obtain a regret bound that is sublinear in the number of iterations T. Finally, numerical experiments conducted on a sensor network demonstrate the performance of the proposed algorithm.
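As a reminder of how conjugate search directions differ from plain gradient steps, here is a minimal, self-contained sketch of the classical conjugate gradient method for a quadratic objective, using an exact line search and the Fletcher-Reeves direction update. It illustrates the building block only; the distributed online variant in the paper additionally handles time-varying losses, a constraint set, and network communication.

```python
import numpy as np

def conjugate_gradient(A, b, x0, iters=50, tol=1e-10):
    """Minimize f(x) = 0.5 x^T A x - b^T x (A symmetric positive definite)
    with conjugate search directions and exact line searches."""
    x = x0.copy()
    r = b - A @ x                     # negative gradient (residual)
    d = r.copy()                      # first search direction
    for _ in range(iters):
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        alpha = rr / (d @ A @ d)      # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        beta = (r_new @ r_new) / rr   # Fletcher-Reeves coefficient
        d = r_new + beta * d          # new direction, conjugate to the previous ones
        r = r_new
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(20, 20))
A = M.T @ M + 20 * np.eye(20)         # symmetric positive definite
b = rng.normal(size=20)
x = conjugate_gradient(A, b, np.zeros(20))
print(np.linalg.norm(A @ x - b))      # should be close to zero
```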


Entropy ◽  
2020 ◽  
Vol 22 (3) ◽  
pp. 314 ◽  
Author(s):  
Peng Xiao ◽  
Samuel Cheng ◽  
Vladimir Stankovic ◽  
Dejan Vukobratovic

Federated learning is a decentralized form of deep learning that trains a shared model over data distributed across clients (such as mobile phones and wearable devices), ensuring data privacy by keeping raw data away from the data center (server). After each client computes new model parameters by stochastic gradient descent (SGD) on its own local data, these locally computed parameters are aggregated to generate an updated global model. Many current state-of-the-art studies aggregate the different client-computed parameters by averaging them, but none theoretically explains why averaging parameters is a good approach. In this paper, we treat each client-computed parameter as a random vector because of the stochastic properties of SGD, and we estimate the mutual information between two client-computed parameters at different training phases using two methods in two learning tasks. The results confirm the correlation between different clients and show an increasing trend of mutual information with training iterations. However, when we further compute the distance between client-computed parameters, we find that the parameters become more correlated without getting closer. This phenomenon suggests that averaging parameters may not be the optimum way of aggregating trained parameters.
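To illustrate the kind of diagnostic the paper performs, here is a minimal, hypothetical sketch that averages client parameters FedAvg-style and, separately, compares two clients' parameter vectors via Pearson correlation and Euclidean distance (a simple stand-in for the mutual-information estimators used in the paper). Client updates are simulated with noisy SGD-like steps on a toy quadratic loss; all quantities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lr, steps = 100, 0.05, 200
w_true = rng.normal(size=dim)             # toy "ground truth" parameters

def client_update(w_global, noise=0.3):
    """Simulated local SGD: noisy gradient steps toward w_true on a quadratic loss."""
    w = w_global.copy()
    for _ in range(steps):
        grad = (w - w_true) + noise * rng.normal(size=dim)
        w -= lr * grad
    return w

w_global = np.zeros(dim)
w_a, w_b = client_update(w_global), client_update(w_global)   # two clients
w_global = 0.5 * (w_a + w_b)              # FedAvg-style parameter averaging

corr = np.corrcoef(w_a, w_b)[0, 1]        # correlation between client parameters
dist = np.linalg.norm(w_a - w_b)          # distance between client parameters
print(f"correlation={corr:.3f}, distance={dist:.3f}")
```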


