Do Subsampled Newton Methods Work for High-Dimensional Data?

2020 ◽  
Vol 34 (04) ◽  
pp. 4723-4730
Author(s):  
Xiang Li ◽  
Shusen Wang ◽  
Zhihua Zhang

Subsampled Newton methods approximate Hessian matrices through subsampling techniques to alleviate the per-iteration cost. Previous results require Ω(d) samples to approximate the Hessian, where d is the dimension of the data points, making them less practical for high-dimensional data. The situation deteriorates when d is comparable to the number of data points n, since subsampling must then take nearly the whole dataset into account, rendering it useless. This paper theoretically justifies the effectiveness of subsampled Newton methods for strongly convex empirical risk minimization with high-dimensional data. Specifically, we provably require only Θ̃(d_eff^γ) samples for approximating the Hessian matrices, where d_eff^γ is the γ-ridge leverage and can be much smaller than d as long as nγ ≫ 1. Our theory covers three types of Newton methods: subsampled Newton, distributed Newton, and proximal Newton.
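As a concrete illustration of the idea above (a minimal sketch, not the authors' implementation; the regularized logistic-regression setup, the sample size `s`, and all names here are assumptions made for the example), each iteration computes the full gradient but estimates the Hessian from only `s` subsampled rows:

```python
import numpy as np

def subsampled_newton(X, y, gamma=1e-2, s=100, iters=15, rng=None):
    """Sketch of a subsampled Newton method for gamma-regularized
    logistic regression: full gradient, Hessian from s sampled rows."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))    # sigmoid, clipped for stability
        g = X.T @ (p - y) / n + gamma * w                     # full gradient over all n points
        idx = rng.choice(n, size=s, replace=False)            # subsample s << n rows
        Xs, D = X[idx], p[idx] * (1 - p[idx])
        H = (Xs * D[:, None]).T @ Xs / s + gamma * np.eye(d)  # subsampled Hessian estimate
        w -= np.linalg.solve(H, g)                            # Newton step with approximate Hessian
    return w
```

The γ-ridge term keeps the subsampled Hessian positive definite even when s is far smaller than d would otherwise require.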

Author(s):  
Jiaqi Zhang ◽  
Kai Zheng ◽  
Wenlong Mou ◽  
Liwei Wang

In this paper, we consider efficient differentially private empirical risk minimization from the viewpoint of optimization algorithms. For strongly convex and smooth objectives, we prove that gradient descent with output perturbation not only achieves nearly optimal utility but also significantly improves the running time of previous state-of-the-art private optimization algorithms, for both $\epsilon$-DP and $(\epsilon, \delta)$-DP. For non-convex but smooth objectives, we propose an RRPSGD (Random Round Private Stochastic Gradient Descent) algorithm, which provably converges to a stationary point with a privacy guarantee. Besides the expected utility bounds, we also provide guarantees in high-probability form. Experiments demonstrate that our algorithm consistently outperforms existing methods in both utility and running time.
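A minimal sketch of the output-perturbation idea for a strongly convex objective (illustrative assumptions throughout: an L2-regularized logistic loss with rows normalized so the per-example Lipschitz constant is `L = 1`, and noise calibrated by the standard Gaussian mechanism; this is not the paper's exact algorithm or its constants):

```python
import numpy as np

def dp_erm_output_perturbation(X, y, lam=0.1, epsilon=1.0, delta=1e-5,
                               L=1.0, iters=200, lr=0.1, rng=None):
    """Gradient descent on an L2-regularized logistic loss, followed by
    Gaussian output perturbation.  For a lam-strongly-convex objective
    with an L-Lipschitz loss (here ||x_i|| <= 1 gives L = 1), the exact
    minimizer has L2 sensitivity at most 2L/(n*lam)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        m = np.clip(y * (X @ w), -30, 30)  # margins, clipped for stability
        g = -(X * (y / (1 + np.exp(m)))[:, None]).mean(axis=0) + lam * w
        w -= lr * g
    sensitivity = 2 * L / (n * lam)
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return w + rng.normal(0.0, sigma, size=d)  # (epsilon, delta)-DP output
```

The appeal of output perturbation is visible here: the optimization loop is plain non-private gradient descent, and privacy enters only through one noise draw at the end.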


Author(s):  
Thomas Steinke ◽  
Jonathan Ullman

We show a new lower bound on the sample complexity of (ε,δ)-differentially private algorithms that accurately answer statistical queries on high-dimensional databases. The novelty of our bound is that it depends optimally on the parameter δ, which loosely corresponds to the probability that the algorithm fails to be private, and it is the first to smoothly interpolate between approximate differential privacy (δ > 0) and pure differential privacy (δ = 0). Specifically, we consider a database D ∈ {±1}^{n×d} and its one-way marginals, which are the d queries of the form “What fraction of individual records have the i-th bit set to +1?” We show that in order to answer all of these queries to within error ±α (on average) while satisfying (ε,δ)-differential privacy for some function δ such that δ ≥ 2^{−o(n)} and δ ≤ 1/n^{1+Ω(1)}, it is necessary that \[ n \ge \Omega\left(\frac{\sqrt{d\,\log(1/\delta)}}{\alpha\varepsilon}\right). \] This bound is optimal up to constant factors. This lower bound implies similar new bounds for problems like private empirical risk minimization and private PCA. To prove our lower bound, we build on the connection between fingerprinting codes and lower bounds in differential privacy (Bun, Ullman, and Vadhan, STOC ’14). In addition to our lower bound, we give new purely and approximately differentially private algorithms for answering arbitrary statistical queries that improve on the sample complexity of the standard Laplace and Gaussian mechanisms for achieving worst-case accuracy guarantees by a logarithmic factor.
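For comparison with the lower bound, the standard Gaussian mechanism for one-way marginals can be sketched as follows (this illustrates the baseline the paper improves on, not the paper's new algorithms):

```python
import numpy as np

def private_one_way_marginals(D, epsilon=1.0, delta=1e-6, rng=None):
    """Gaussian mechanism for the d one-way marginals of D in {-1,+1}^{n x d}.
    Changing one record moves each marginal by at most 2/n, so the vector of
    marginals has L2 sensitivity 2*sqrt(d)/n."""
    rng = np.random.default_rng(rng)
    n, d = D.shape
    marginals = D.mean(axis=0)                     # the d true one-way marginals
    l2_sensitivity = 2 * np.sqrt(d) / n
    sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return marginals + rng.normal(0.0, sigma, size=d)
```

Setting the per-query noise scale σ equal to the target error α and solving for n recovers n ≈ √(d log(1/δ))/(αε) up to constants, which is why the lower bound above is tight for this baseline.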


2019 ◽  
Vol 109 (4) ◽  
pp. 813-852
Author(s):  
Ching-pei Lee ◽  
Kai-Wei Chang

In recent years, there has been a growing need to train machine learning models on a huge volume of data. Therefore, designing efficient distributed optimization algorithms for empirical risk minimization (ERM) has become an active and challenging research topic. In this paper, we propose a flexible framework for distributed ERM training through solving the dual problem, which provides a unified description and comparison of existing methods. Our approach requires only approximate solutions of the sub-problems involved in the optimization process, and is versatile enough to be applied to many large-scale machine learning problems including classification, regression, and structured prediction. We show that our framework enjoys global linear convergence for a broad class of non-strongly-convex problems, and some specific choices of the sub-problems can even achieve much faster convergence than existing approaches by a refined analysis. This improved convergence rate is also reflected in the superior empirical performance of our method.
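To make the dual viewpoint concrete, here is a minimal block-wise dual coordinate ascent sketch for the L2-regularized hinge loss, with the data partition simulated serially (the partitioning scheme, the one-pass "approximate local solve", and all names are illustrative assumptions, not the paper's framework):

```python
import numpy as np

def distributed_dual_svm(X, y, lam=0.01, num_blocks=4, epochs=10, rng=None):
    """Block-wise dual coordinate ascent for the SVM dual.  Each block stands
    in for one worker's sub-problem, solved approximately by a single pass of
    closed-form coordinate updates; w tracks (1/(lam*n)) * sum_i alpha_i y_i x_i."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    alpha = np.zeros(n)                  # dual variables, each in [0, 1]
    w = np.zeros(d)
    blocks = np.array_split(rng.permutation(n), num_blocks)
    for _ in range(epochs):
        for block in blocks:             # one "worker" sub-problem at a time
            for i in block:
                margin = y[i] * (X[i] @ w)
                step = lam * n * (1 - margin) / (X[i] @ X[i] + 1e-12)
                new_alpha = np.clip(alpha[i] + step, 0.0, 1.0)
                w += (new_alpha - alpha[i]) * y[i] * X[i] / (lam * n)
                alpha[i] = new_alpha
    return w
```

Only the dual increments for a block need to be communicated in a genuinely distributed run, which is the practical motivation for solving the sub-problems approximately rather than exactly.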


2015 ◽  
Vol 2015 ◽  
pp. 1-8
Author(s):  
Mingchen Yao ◽  
Chao Zhang ◽  
Wei Wu

Many generalization results in learning theory are established under the assumption that samples are independent and identically distributed (i.i.d.). However, numerous learning tasks in practical applications involve time-dependent data. In this paper, we propose a theoretical framework to analyze the generalization performance of the empirical risk minimization (ERM) principle for sequences of time-dependent samples (TDS). In particular, we first present a generalization bound of the ERM principle for TDS. By introducing some auxiliary quantities, we then give a further analysis of the generalization properties and the asymptotic behavior of the ERM principle for TDS.


2021 ◽  
Author(s):  
Puyu Wang ◽  
Zhenhuan Yang ◽  
Yunwen Lei ◽  
Yiming Ying ◽  
Hai Zhang

Author(s):  
Zhengling Qi ◽  
Ying Cui ◽  
Yufeng Liu ◽  
Jong-Shi Pang

This paper has two main goals: (a) establish several statistical properties—consistency, asymptotic distributions, and convergence rates—of stationary solutions and values of a class of coupled nonconvex and nonsmooth empirical risk-minimization problems and (b) validate these properties by a noisy amplitude-based phase-retrieval problem, the latter being of much topical interest. Derived from available data via sampling, these empirical risk-minimization problems are the computational workhorse of a population risk model that involves the minimization of an expected value of a random functional. When these minimization problems are nonconvex, the computation of their globally optimal solutions is elusive. Together with the fact that the expectation operator cannot be evaluated for general probability distributions, it becomes necessary to justify whether the stationary solutions of the empirical problems are practical approximations of the stationary solution of the population problem. When these two features, general distribution and nonconvexity, are coupled with nondifferentiability that often renders the problems “non-Clarke regular,” the task of the justification becomes challenging. Our work aims to address such a challenge within an algorithm-free setting. The resulting analysis is, therefore, different from much of the analysis in the recent literature that is based on local search algorithms. Furthermore, supplementing the classical global minimizer-centric analysis, our results offer a promising step to close the gap between computational optimization and asymptotic analysis of coupled, nonconvex, nonsmooth statistical estimation problems, expanding the former with statistical properties of the practically obtained solution and providing the latter with a more practical focus pertaining to computational tractability.
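The noisy amplitude-based phase-retrieval ERM mentioned above minimizes f(x) = (1/n) Σᵢ (|aᵢᵀx| − bᵢ)², which is both nonconvex and nonsmooth. A minimal local-search sketch of that objective follows (the paper's analysis is algorithm-free, so this subgradient solver and its parameters are purely illustrative assumptions):

```python
import numpy as np

def amplitude_phase_retrieval(A, b, x0=None, iters=500, lr=0.05, rng=None):
    """Subgradient descent on f(x) = (1/n) * sum_i (|a_i^T x| - b_i)^2,
    the noisy amplitude-based phase-retrieval ERM (nonconvex, nonsmooth)."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    x = rng.standard_normal(d) if x0 is None else np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        z = A @ x
        # sign(z_i) selects a valid subgradient of |z_i| (sign(0) -> 0 also works)
        x -= lr * (2.0 / n) * (A.T @ ((np.abs(z) - b) * np.sign(z)))
    return x
```

Such a solver can only be expected to reach a stationary point; whether that stationary point approximates the population one is exactly the kind of question the paper's analysis addresses.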


2016 ◽  
Vol 28 (12) ◽  
pp. 2853-2889 ◽  
Author(s):  
Hanyuan Hang ◽  
Yunlong Feng ◽  
Ingo Steinwart ◽  
Johan A. K. Suykens

This letter investigates the supervised learning problem with observations drawn from certain general stationary stochastic processes. Here, by general we mean that many stationary stochastic processes can be included. We show that when the stochastic processes satisfy a generalized Bernstein-type inequality, a unified treatment for analyzing learning schemes with various mixing processes can be conducted and a sharp oracle inequality for generic regularized empirical risk minimization schemes can be established. The obtained oracle inequality is then applied to derive convergence rates for several learning schemes, such as empirical risk minimization (ERM), least squares support vector machines (LS-SVMs) using given generic kernels, and SVMs using Gaussian kernels for both least squares and quantile regression. It turns out that for independent and identically distributed (i.i.d.) processes, our learning rates for ERM recover the optimal rates. For non-i.i.d. processes, including geometrically α-mixing Markov processes, geometrically α-mixing processes with restricted decay, φ-mixing processes, and (time-reversed) geometrically C-mixing processes, our learning rates for SVMs with Gaussian kernels match, up to some arbitrarily small extra term in the exponent, the optimal rates. For the remaining cases, our rates are at least close to the optimal rates. As a by-product, the assumed generalized Bernstein-type inequality also provides an interpretation of the so-called effective number of observations for various mixing processes.
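As a small worked example of one of the learning schemes analyzed, an LS-SVM regressor with a Gaussian kernel can be sketched as follows (the bias term is dropped for brevity, so this reduces to kernel ridge regression; all names and parameter values are illustrative):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel matrix k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def lssvm_fit_predict(X, y, X_test, lam=1e-3, sigma=1.0):
    """LS-SVM regression without bias: solve the linear system
    (K + n*lam*I) alpha = y, then predict with k(x_test, X) @ alpha."""
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)  # least-squares dual solution
    return gaussian_kernel(X_test, X, sigma) @ alpha
```

The regularization parameter λ and the kernel width σ are exactly the quantities whose schedules drive the convergence rates discussed in the abstract.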

