generalization bounds
Recently Published Documents


TOTAL DOCUMENTS

95
(FIVE YEARS 27)

H-INDEX

15
(FIVE YEARS 2)

2021 ◽  
Vol 2021 (12) ◽  
pp. 124014
Author(s):  
Umut Şimşekli ◽  
Ozan Sener ◽  
George Deligiannidis ◽  
Murat A Erdogdu

Abstract Despite its success in a wide range of applications, characterizing the generalization properties of stochastic gradient descent (SGD) in non-convex deep learning problems is still an important challenge. While modeling the trajectories of SGD via stochastic differential equations (SDE) under heavy-tailed gradient noise has recently shed light over several peculiar characteristics of SGD, a rigorous treatment of the generalization properties of such SDEs in a learning theoretical framework is still missing. Aiming to bridge this gap, in this paper, we prove generalization bounds for SGD under the assumption that its trajectories can be well-approximated by a Feller process, which defines a rich class of Markov processes that include several recent SDE representations (both Brownian or heavy-tailed) as its special case. We show that the generalization error can be controlled by the Hausdorff dimension of the trajectories, which is intimately linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of ‘capacity metric’. We support our theory with experiments on deep neural networks illustrating that the proposed capacity metric accurately estimates the generalization error, and it does not necessarily grow with the number of parameters unlike the existing capacity metrics in the literature.


Quantum ◽  
2021 ◽  
Vol 5 ◽  
pp. 582
Author(s):  
Matthias C. Caro ◽  
Elies Gil-Fuster ◽  
Johannes Jakob Meyer ◽  
Jens Eisert ◽  
Ryan Sweke

A large body of recent work has begun to explore the potential of parametrized quantum circuits (PQCs) as machine learning models, within the framework of hybrid quantum-classical optimization. In particular, theoretical guarantees on the out-of-sample performance of such models, in terms of generalization bounds, have emerged. However, none of these generalization bounds depend explicitly on how the classical input data is encoded into the PQC. We derive generalization bounds for PQC-based models that depend explicitly on the strategy used for data-encoding. These imply bounds on the performance of trained PQC-based models on unseen data. Moreover, our results facilitate the selection of optimal data-encoding strategies via structural risk minimization, a mathematically rigorous framework for model selection. We obtain our generalization bounds by bounding the complexity of PQC-based models as measured by the Rademacher complexity and the metric entropy, two complexity measures from statistical learning theory. To achieve this, we rely on a representation of PQC-based models via trigonometric functions. Our generalization bounds emphasize the importance of well-considered data-encoding strategies for PQC-based models.


Author(s):  
Puyu Wang ◽  
Liang Wu ◽  
Yunwen Lei

Randomized coordinate descent (RCD) is a popular optimization algorithm with wide applications in various machine learning problems, which motivates a lot of theoretical analysis on its convergence behavior. As a comparison, there is no work studying how the models trained by RCD would generalize to test examples. In this paper, we initialize the generalization analysis of RCD by leveraging the powerful tool of algorithmic stability. We establish argument stability bounds of RCD for both convex and strongly convex objectives, from which we develop optimal generalization bounds by showing how to early-stop the algorithm to tradeoff the estimation and optimization. Our analysis shows that RCD enjoys better stability as compared to stochastic gradient descent.


Author(s):  
Waleed Mustafa ◽  
Yunwen Lei ◽  
Antoine Ledent ◽  
Marius Kloft

In machine learning we often encounter structured output prediction problems (SOPPs), i.e. problems where the output space admits a rich internal structure. Application domains where SOPPs naturally occur include natural language processing, speech recognition, and computer vision. Typical SOPPs have an extremely large label set, which grows exponentially as a function of the size of the output. Existing generalization analysis implies generalization bounds with at least a square-root dependency on the cardinality d of the label set, which can be vacuous in practice. In this paper, we significantly improve the state of the art by developing novel high-probability bounds with a logarithmic dependency on d. Furthermore, we leverage the lens of algorithmic stability to develop generalization bounds in expectation without any dependency on d. Our results therefore build a solid theoretical foundation for learning in large-scale SOPPs. Furthermore, we extend our results to learning with weakly dependent data.


2021 ◽  
pp. 115399
Author(s):  
Yule Vaz ◽  
Rodrigo Fernandes de Mello ◽  
Carlos Henrique Grossi Ferreira

2020 ◽  
Author(s):  
Jonathan Warrell ◽  
Hussein Mohsen ◽  
Mark Gerstein

AbstractDeep learning methods have achieved state-of-the-art performance in many domains of artificial intelligence, but are typically hard to interpret. Network interpretation is important for multiple reasons, including knowledge discovery, hypothesis generation, fairness and establishing trust. Model transformations provide a general approach to interpreting a trained network post-hoc: the network is approximated by a model, which is typically compressed, whose structure can be more easily interpreted in some way (we call such approaches interpretability schemes). However, the relationship between compression and interpretation has not been fully explored: How much should a network be compressed for optimal extraction of interpretable information? Should compression be combined with other criteria when selecting model transformations? We investigate these issues using two different compression-based schemes, which aim to extract orthogonal kinds of information, pertaining to feature and data instance-based groupings respectively. The first (rank projection trees) uses a structured sparsification method such that nested groups of features can be extracted having potential joint interactions. The second (cascaded network decomposition) splits a network into a cascade of simpler networks, allowing groups of training instances with similar characteristics to be extracted at each stage of the cascade. We use predictive tasks in cancer and psychiatric genomics to assess the ability of these approaches to extract informative feature and data-point groupings from trained networks. We show that the generalization error of a network provides an indicator of the quality of the information extracted; further we derive PAC-Bayes generalization bounds for both schemes, which we show can be used as proxy indicators, and can thus provide a criterion for selecting the optimal compression. Finally, we show that the PAC-Bayes framework can be naturally modified to incorporate additional criteria alongside compression, such as prior knowledge based on previous models, which can enhance interpretable model selection.


Sign in / Sign up

Export Citation Format

Share Document