Submodular Batch Selection for Training Deep Neural Networks

Author(s): K J Joseph, Vamshi Teja R, Krishnakant Singh, Vineeth N Balasubramanian

Mini-batch gradient descent based methods are the de facto algorithms for training neural network architectures today. We introduce a mini-batch selection strategy based on submodular function maximization. Our novel submodular formulation captures the informativeness of each sample and the diversity of the whole subset. We design an efficient greedy algorithm that gives high-quality solutions to this NP-hard combinatorial optimization problem. Our extensive experiments on standard datasets show that deep models trained using the proposed batch selection strategy generalize better than those trained with Stochastic Gradient Descent, as well as a popular baseline sampling strategy, across different learning rates, batch sizes, and distance metrics.
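As a rough illustration of the greedy selection idea (not the authors' exact submodular objective: using the per-sample loss as the informativeness score, a max-min distance term for diversity, and a trade-off weight `lam` are all assumptions here), a batch could be grown one sample at a time by picking the largest marginal gain:

```python
import numpy as np

def greedy_submodular_batch(features, losses, batch_size, lam=1.0):
    """Greedily build a mini-batch that trades off informativeness (per-sample
    loss) against diversity (distance to the already-selected samples)."""
    n = features.shape[0]
    selected = []
    available = np.ones(n, dtype=bool)
    min_dist = np.full(n, np.inf)          # distance to the closest selected sample
    for _ in range(batch_size):
        diversity = np.where(np.isinf(min_dist), 0.0, min_dist)
        gain = losses + lam * diversity    # assumed marginal-gain form
        gain[~available] = -np.inf         # never re-pick a sample
        best = int(np.argmax(gain))
        selected.append(best)
        available[best] = False
        # Distances to the selection only shrink, so one vector of minima suffices.
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[best], axis=1))
    return selected
```

Because the diversity term can only shrink as samples are added, each greedy step needs just the current minimum distances, which keeps the selection cheap.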

2021, Vol. 2021, pp. 1-10
Author(s): Stephanie S. W. Su, Sie Long Kek

In this paper, the adaptive moment estimation (Adam) approach, a widely used variant of stochastic gradient descent (SGD), is improved by adding the standard error to the updating rule. The aim is to speed up the convergence of the Adam algorithm. This improvement is termed the Adam with standard error (AdamSE) algorithm. In addition, a mean-variance portfolio optimization model is formulated from historical data on the rates of return of the S&P 500 stock, a 10-year Treasury bond, and the money market. The application of the SGD, Adam, adaptive moment estimation with maximum (AdaMax), Nesterov-accelerated adaptive moment estimation (Nadam), AMSGrad, and AdamSE algorithms to solving this mean-variance portfolio optimization problem is then investigated. During the calculation procedure, the iterative solution converges to the optimal portfolio solution, and the AdamSE algorithm requires the fewest iterations. The results show that the convergence rate of the Adam algorithm is significantly enhanced by the AdamSE modification. In conclusion, the efficiency of the improved Adam algorithm using the standard error has been demonstrated, and the applicability of the SGD, Adam, AdaMax, Nadam, AMSGrad, and AdamSE algorithms to the mean-variance portfolio optimization problem has been validated.
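The abstract does not spell out the exact updating rule, so the following is only a guessed sketch of an Adam-style step with a standard-error term; the specific form `sqrt(v_hat) / sqrt(batch_size)` and the way it enters the update are assumptions, not the authors' published formula.

```python
import numpy as np

def adam_se_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, batch_size=64):
    """One Adam-style step; the standard-error correction below is hypothetical."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)                  # bias-corrected second moment
    # Hypothetical: treat sqrt(v_hat)/sqrt(batch_size) as the standard error of the
    # mini-batch gradient estimate and fold it into the step.
    se = np.sqrt(v_hat) / np.sqrt(batch_size)
    w = w - lr * (m_hat + se) / (np.sqrt(v_hat) + eps)
    return w, m, v
```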


Author(s): JUAN HUANG, HONG CHEN, LUOQING LI

We propose a stochastic gradient descent algorithm for least squares regression with coefficient regularization. An explicit expression for the solution is derived via the sampling operator and the empirical integral operator. Learning rates are given for suitable choices of the step sizes and regularization parameters.
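A minimal sketch of this kind of scheme, assuming a squared loss on a kernel expansion f(x) = Σ_i c_i K(x_i, x) with an ℓ2 penalty on the coefficients and a decaying step size; the kernel, step-size schedule, and penalty form here are illustrative choices, not the paper's exact ones.

```python
import numpy as np

def sgd_coefficient_regularized(X, y, kernel, reg=1e-2, step0=0.5, epochs=5):
    """SGD for least squares regression with an l2 penalty on the expansion
    coefficients c in f(x) = sum_i c_i * kernel(x_i, x)."""
    n = X.shape[0]
    c = np.zeros(n)
    K = np.array([[kernel(a, b) for b in X] for a in X])   # Gram matrix
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            step = step0 / np.sqrt(t)            # assumed decaying step size
            residual = K[i] @ c - y[i]           # f(x_i) - y_i
            grad = residual * K[i] + reg * c     # gradient of squared loss + penalty
            c -= step * grad
    return c

# Example usage with an assumed Gaussian kernel:
# c = sgd_coefficient_regularized(X, y, lambda a, b: np.exp(-np.sum((a - b) ** 2)))
```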


2020, Vol. 34 (04), pp. 5053-5060
Author(s): Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, ...

There have been several recent works claiming record times for ImageNet training. These are achieved by using large batch sizes during training to leverage parallel resources and produce faster wall-clock times per training epoch. However, such solutions often require massive hyper-parameter tuning, an important cost that is frequently ignored. In this work, we perform an extensive analysis of large batch size training for two popular methods: Stochastic Gradient Descent (SGD) and the Kronecker-Factored Approximate Curvature (K-FAC) method. We evaluate the performance of these methods in terms of both wall-clock time and aggregate computational cost, and study their hyper-parameter sensitivity by performing more than 512 experiments per batch size for each method. We perform experiments on multiple models on two datasets, CIFAR-10 and SVHN. The results show that beyond a critical batch size both K-FAC and SGD significantly deviate from ideal strong scaling behaviour, and that, despite common belief, K-FAC does not exhibit improved large-batch scalability compared to SGD.
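A minimal sketch of the kind of strong-scaling measurement described, where `train_one_epoch` is a hypothetical helper that runs one epoch of SGD or K-FAC training at a given batch size; under ideal strong scaling, doubling the batch size (with proportionally more parallel resources) would halve the per-epoch wall-clock time.

```python
import time

def measure_epoch_times(train_one_epoch, batch_sizes):
    """Record per-epoch wall-clock time for each batch size.

    `train_one_epoch(batch_size)` is a placeholder for one epoch of training
    with the optimizer under study (e.g. SGD or K-FAC)."""
    times = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        train_one_epoch(bs)
        times[bs] = time.perf_counter() - start   # deviation from halving = scaling loss
    return times
```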


Author(s): Manisha Padala, Sujit Gujar

In classification models, fairness can be ensured by solving a constrained optimization problem. We focus on fairness constraints such as Disparate Impact, Demographic Parity, and Equalized Odds, which are non-decomposable and non-convex. Researchers typically define convex surrogates of these constraints and then apply convex optimization frameworks to obtain fair classifiers. However, the surrogates only upper-bound the actual constraints, and convexifying fairness constraints is challenging. We propose a neural network-based framework, FNNC, to achieve fairness while maintaining high accuracy in classification. The above fairness constraints are included in the loss using Lagrange multipliers. We prove bounds on the generalization errors for the constrained losses, which asymptotically go to zero. The network is optimized using two-step mini-batch stochastic gradient descent. Our experiments show that FNNC performs as well as the state of the art, if not better. The experimental evidence supplements our theoretical guarantees. In summary, we provide an automated solution for achieving fairness in classification, which is easily extendable to many fairness constraints.
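A minimal sketch of the two-step Lagrangian update described above, using PyTorch, with a demographic-parity gap as a stand-in for the paper's constraint surrogates; the model, the binary group attribute, the tolerance `eps`, and the multiplier step size `lam_lr` are assumptions.

```python
import torch

def fnnc_style_step(model, x, y, group, lam, opt, eps=0.05, lam_lr=0.01):
    """One two-step mini-batch update: descend on the Lagrangian in the network
    weights, then ascend in the multiplier. The demographic-parity surrogate
    (gap in mean positive scores between groups) is an illustrative choice and
    assumes both groups appear in the batch."""
    scores = model(x).squeeze(-1)
    cls_loss = torch.nn.functional.binary_cross_entropy_with_logits(scores, y.float())
    probs = torch.sigmoid(scores)
    dp_gap = (probs[group == 1].mean() - probs[group == 0].mean()).abs()
    constraint = dp_gap - eps                      # want constraint <= 0
    lagrangian = cls_loss + lam * constraint
    opt.zero_grad()
    lagrangian.backward()
    opt.step()                                     # step 1: minimize over weights
    with torch.no_grad():                          # step 2: maximize over lambda
        lam = torch.clamp(lam + lam_lr * constraint.detach(), min=0.0)
    return lam
```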


Author(s): Xifeng Guo, Long Gao, Xinwang Liu, Jianping Yin

Deep clustering learns deep feature representations that favor the clustering task using neural networks. Some pioneering works propose to simultaneously learn embedded features and perform clustering by explicitly defining a clustering-oriented loss. Though promising performance has been demonstrated in various applications, we observe that a vital ingredient has been overlooked by these works: the defined clustering loss may corrupt the feature space, which leads to non-representative, meaningless features and in turn hurts clustering performance. To address this issue, we propose the Improved Deep Embedded Clustering (IDEC) algorithm, which takes care of data structure preservation. Specifically, we manipulate the feature space to scatter data points using a clustering loss as guidance. To constrain the manipulation and maintain the local structure of the data-generating distribution, an under-complete autoencoder is applied. By integrating the clustering loss and the autoencoder's reconstruction loss, IDEC can jointly optimize cluster label assignment and learn features that are suitable for clustering with local structure preservation. The resulting optimization problem can be effectively solved by mini-batch stochastic gradient descent and backpropagation. Experiments on image and text datasets empirically validate the importance of local structure preservation and the effectiveness of our algorithm.
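A minimal sketch of the joint objective described above, in PyTorch: reconstruction loss plus a weighted KL-based clustering loss on DEC-style Student's-t soft assignments. The encoder/decoder, the cluster centers, the target distribution `target_p`, and the weight `gamma` are placeholders rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def idec_loss(x, encoder, decoder, centers, target_p, gamma=0.1, alpha=1.0):
    """Joint IDEC-style objective: autoencoder reconstruction loss plus a
    KL-based clustering loss on Student's-t soft assignments.
    `target_p` is the (periodically recomputed) auxiliary target distribution."""
    z = encoder(x)                                # embedded features
    x_rec = decoder(z)
    rec_loss = F.mse_loss(x_rec, x)               # preserves local data structure
    # Soft assignment q_ij of sample i to cluster j (Student's t kernel).
    dist_sq = torch.cdist(z, centers) ** 2
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    q = q / q.sum(dim=1, keepdim=True)
    clu_loss = F.kl_div(q.log(), target_p, reduction='batchmean')  # KL(P || Q)
    return rec_loss + gamma * clu_loss            # gamma balances the two terms
```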


Author(s): Yikai Zhang, Hui Qu, Chao Chen, Dimitris Metaxas

Deep learning architectures are usually proposed with millions of parameters, resulting in memory issues when training deep neural networks with stochastic gradient descent type methods using large batch sizes. However, training with small batch sizes tends to produce low-quality solutions due to the large variance of the stochastic gradients. In this paper, we tackle this problem by proposing a new framework for training deep neural networks with small batches and noisy gradients. During optimization, our method iteratively applies a proximal-type regularizer to make the loss function strongly convex. Such a regularizer stabilizes the gradient, leading to better training performance. We prove that our algorithm achieves a convergence rate comparable to vanilla SGD even with small batch sizes. Our framework is simple to implement and can potentially be combined with many existing optimization algorithms. Empirical results show that our method outperforms SGD and Adam when the batch size is small. Our implementation is available at https://github.com/huiqu18/TRAlgorithm.
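A minimal sketch of a proximal-point style training loop along these lines, in PyTorch; the anchor-refresh schedule, the strength `prox_lambda`, and the plain SGD inner solver are assumptions, not the authors' exact algorithm.

```python
import torch

def proximal_sgd_epoch(model, loader, loss_fn, lr=0.01, prox_lambda=0.1, inner_steps=5):
    """Repeatedly (approximately) minimize loss(w) + (prox_lambda/2) * ||w - w_anchor||^2
    with a few small-batch SGD steps, then move the anchor. The proximal term makes
    each subproblem strongly convex near the anchor and damps gradient noise."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    anchor = [p.detach().clone() for p in model.parameters()]
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y)
        prox = sum(((p - a) ** 2).sum() for p, a in zip(model.parameters(), anchor))
        (loss + 0.5 * prox_lambda * prox).backward()
        opt.step()
        opt.zero_grad()
        if (step + 1) % inner_steps == 0:          # refresh the proximal anchor
            anchor = [p.detach().clone() for p in model.parameters()]
```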

