Meta-Descent for Online, Continual Prediction

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013943 ◽

2019 ◽

Vol 33 ◽

pp. 3943-3950

Author(s):

Andrew Jacobsen ◽

Matthew Schlegel ◽

Cameron Linke ◽

Thomas Degris ◽

Adam White ◽

...

Keyword(s):

Gradient Descent ◽

Large Scale ◽

Time Series Prediction ◽

Real Data ◽

Second Order ◽

Stochastic Gradient Descent ◽

Step Size ◽

Vector Approximation ◽

Prediction Problems ◽

Stationary Problems

This paper investigates different vector step-size adaptation approaches for non-stationary online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second order update—a vector approximation of the inverse Hessian. Another family of approaches use meta-gradient descent to adapt the stepsize parameters to minimize prediction error. These metadescent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second order methods. We first derive a general, incremental metadescent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even those with accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale, time-series prediction problem on real data from a mobile robot.

Download Full-text

Delay-Adaptive Distributed Stochastic Optimization

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6001 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5503-5510

Author(s):

Zhaolin Ren ◽

Zhengyuan Zhou ◽

Linhai Qiu ◽

Ajay Deshpande ◽

Jayant Kalagnanam

Keyword(s):

Gradient Descent ◽

Large Scale ◽

Optimization Problems ◽

Stochastic Gradient Descent ◽

Large Scale Optimization ◽

Step Size ◽

Gradient Information ◽

Convergence Results ◽

Scale Optimization ◽

Class Of Functions

In large-scale optimization problems, distributed asynchronous stochastic gradient descent (DASGD) is a commonly used algorithm. In most applications, there are often a large number of computing nodes asynchronously computing gradient information. As such, the gradient information received at a given iteration is often stale. In the presence of such delays, which can be unbounded, the convergence of DASGD is uncertain. The contribution of this paper is twofold. First, we propose a delay-adaptive variant of DASGD where we adjust each iteration's step-size based on the size of the delay, and prove asymptotic convergence of the algorithm on variationally coherent stochastic problems, a class of functions which properly includes convex, quasi-convex and star-convex functions. Second, we extend the convergence results of standard DASGD, used usually for problems with bounded domains, to problems with unbounded domains. In this way, we extend the frontier of theoretical guarantees for distributed asynchronous optimization, and provide new insights for practitioners working on large-scale optimization problems.

Download Full-text

Variance Counterbalancing for Stochastic Large-scale Learning

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213020500104 ◽

2020 ◽

Vol 29 (05) ◽

pp. 2050010

Author(s):

Pola Lydia Lagari ◽

Lefteri H. Tsoukalas ◽

Isaac E. Lagaris

Keyword(s):

Gradient Descent ◽

Large Scale ◽

Performance Enhancement ◽

Mean Squared Error ◽

Large Data ◽

Random Sets ◽

Stochastic Gradient Descent ◽

Step Size ◽

Data Set ◽

Acceleration Techniques

Stochastic Gradient Descent (SGD) is perhaps the most frequently used method for large scale training. A common example is training a neural network over a large data set, which amounts to minimizing the corresponding mean squared error (MSE). Since the convergence of SGD is rather slow, acceleration techniques based on the notion of “Mini-Batches” have been developed. All of them however, mimicking SGD, impose diminishing step-sizes as a means to inhibit large variations in the MSE objective. In this article, we introduce random sets of mini-batches instead of individual mini-batches. We employ an objective function that minimizes the average MSE and its variance over these sets, eliminating so the need for the systematic step size reduction. This approach permits the use of state-of-the-art optimization methods, far more efficient than the gradient descent, and yields a significant performance enhancement.

Download Full-text

Solving large scale linear prediction problems using stochastic gradient descent algorithms

Twenty-first international conference on Machine learning - ICML '04 ◽

10.1145/1015330.1015332 ◽

2004 ◽

Cited By ~ 313

Author(s):

Tong Zhang

Keyword(s):

Gradient Descent ◽

Large Scale ◽

Linear Prediction ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Descent Algorithms ◽

Prediction Problems

Download Full-text

Distributed Stochastic Optimization with Large Delays

Mathematics of Operations Research ◽

10.1287/moor.2021.1200 ◽

2021 ◽

Author(s):

Zhengyuan Zhou ◽

Panayotis Mertikopoulos ◽

Nicholas Bambos ◽

Peter Glynn ◽

Yinyu Ye

Keyword(s):

Stochastic Optimization ◽

Gradient Descent ◽

Large Scale ◽

Optimization Problems ◽

Broad Class ◽

Algorithm Design ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Step Size ◽

Critical Set

The recent surge of breakthroughs in machine learning and artificial intelligence has sparked renewed interest in large-scale stochastic optimization problems that are universally considered hard. One of the most widely used methods for solving such problems is distributed asynchronous stochastic gradient descent (DASGD), a family of algorithms that result from parallelizing stochastic gradient descent on distributed computing architectures (possibly) asychronously. However, a key obstacle in the efficient implementation of DASGD is the issue of delays: when a computing node contributes a gradient update, the global model parameter may have already been updated by other nodes several times over, thereby rendering this gradient information stale. These delays can quickly add up if the computational throughput of a node is saturated, so the convergence of DASGD may be compromised in the presence of large delays. Our first contribution is that, by carefully tuning the algorithm’s step size, convergence to the critical set is still achieved in mean square, even if the delays grow unbounded at a polynomial rate. We also establish finer results in a broad class of structured optimization problems (called variationally coherent), where we show that DASGD converges to a global optimum with a probability of one under the same delay assumptions. Together, these results contribute to the broad landscape of large-scale nonconvex stochastic optimization by offering state-of-the-art theoretical guarantees and providing insights for algorithm design.

Download Full-text

Large-Scale Machine Learning with Stochastic Gradient Descent Léon Bottou

Statistical Learning and Data Science ◽

10.1201/b11429-6 ◽

2011 ◽

pp. 33-42 ◽

Cited By ~ 1

Keyword(s):

Machine Learning ◽

Gradient Descent ◽

Large Scale ◽

Stochastic Gradient ◽

Stochastic Gradient Descent

Download Full-text

A new stochastic gradient descent possibilistic clustering algorithm

AI Communications ◽

10.3233/aic-210125 ◽

2021 ◽

pp. 1-18

Author(s):

Angeliki Koutsimpela ◽

Konstantinos D. Koutroumbas

Keyword(s):

Cost Function ◽

Gradient Descent ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Data ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Data Sets ◽

Convergence Results ◽

Possibilistic Clustering

Several well known clustering algorithms have their own online counterparts, in order to deal effectively with the big data issue, as well as with the case where the data become available in a streaming fashion. However, very few of them follow the stochastic gradient descent philosophy, despite the fact that the latter enjoys certain practical advantages (such as the possibility of (a) running faster than their batch processing counterparts and (b) escaping from local minima of the associated cost function), while, in addition, strong theoretical convergence results have been established for it. In this paper a novel stochastic gradient descent possibilistic clustering algorithm, called O- PCM 2 is introduced. The algorithm is presented in detail and it is rigorously proved that the gradient of the associated cost function tends to zero in the L 2 sense, based on general convergence results established for the family of the stochastic gradient descent algorithms. Furthermore, an additional discussion is provided on the nature of the points where the algorithm may converge. Finally, the performance of the proposed algorithm is tested against other related algorithms, on the basis of both synthetic and real data sets.

Download Full-text

Periodic step-size adaptation in second-order gradient descent for single-pass on-line structured learning

Machine Learning ◽

10.1007/s10994-009-5142-6 ◽

2009 ◽

Vol 77 (2-3) ◽

pp. 195-224 ◽

Cited By ~ 5

Author(s):

Chun-Nan Hsu ◽

Han-Shen Huang ◽

Yu-Ming Chang ◽

Yuh-Jye Lee

Keyword(s):

Gradient Descent ◽

Second Order ◽

Structured Learning ◽

Step Size ◽

Single Pass ◽

On Line

Download Full-text

Large-Scale Machine Learning with Stochastic Gradient Descent

Proceedings of COMPSTAT'2010 ◽

10.1007/978-3-7908-2604-3_16 ◽

2010 ◽

pp. 177-186 ◽

Cited By ~ 1247

Author(s):

Léon Bottou

Keyword(s):

Machine Learning ◽

Gradient Descent ◽

Large Scale ◽

Stochastic Gradient ◽

Stochastic Gradient Descent

Download Full-text

Stochastic Subgradient for Large-Scale Support Vector Machine Using the Generalized Pinball Loss Function

Symmetry ◽

10.3390/sym13091652 ◽

2021 ◽

Vol 13 (9) ◽

pp. 1652

Author(s):

Wanida Panup ◽

Rabian Wangkeeree

Keyword(s):

Support Vector Machine ◽

Loss Function ◽

Gradient Descent ◽

Large Scale ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Support Vector ◽

Hinge Loss ◽

Gradient Descent Algorithm ◽

Pinball Loss

In this paper, we propose a stochastic gradient descent algorithm, called stochastic gradient descent method-based generalized pinball support vector machine (SG-GPSVM), to solve data classification problems. This approach was developed by replacing the hinge loss function in the conventional support vector machine (SVM) with a generalized pinball loss function. We show that SG-GPSVM is convergent and that it approximates the conventional generalized pinball support vector machine (GPSVM). Further, the symmetric kernel method was adopted to evaluate the performance of SG-GPSVM as a nonlinear classifier. Our suggested algorithm surpasses existing methods in terms of noise insensitivity, resampling stability, and accuracy for large-scale data scenarios, according to the experimental results.

Download Full-text

SSGD: A Safe and Efficient Method of Gradient Descent

Security and Communication Networks ◽

10.1155/2021/5404061 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Jinhuan Duan ◽

Xianxian Li ◽

Shiqi Gao ◽

Zili Zhong ◽

Jinyan Wang

Keyword(s):

Gradient Descent ◽

Large Scale ◽

Optimization Problems ◽

Unit Vector ◽

Descent Method ◽

Stochastic Gradient ◽

Learning System ◽

Training Data ◽

Stochastic Gradient Descent ◽

Gradient Descent Method

With the vigorous development of artificial intelligence technology, various engineering technology applications have been implemented one after another. The gradient descent method plays an important role in solving various optimization problems, due to its simple structure, good stability, and easy implementation. However, in multinode machine learning system, the gradients usually need to be shared, which will cause privacy leakage, because attackers can infer training data with the gradient information. In this paper, to prevent gradient leakage while keeping the accuracy of the model, we propose the super stochastic gradient descent approach to update parameters by concealing the modulus length of gradient vectors and converting it or them into a unit vector. Furthermore, we analyze the security of super stochastic gradient descent approach and demonstrate that our algorithm can defend against the attacks on the gradient. Experiment results show that our approach is obviously superior to prevalent gradient descent approaches in terms of accuracy, robustness, and adaptability to large-scale batches. Interestingly, our algorithm can also resist model poisoning attacks to a certain extent.

Download Full-text