When do neural networks outperform kernel methods?*

Behrooz Ghorbani; Song Mei; Theodor Misiakiewicz; Andrea Montanari

doi:10.1088/1742-5468/ac3a81

When do neural networks outperform kernel methods?*

Journal of Statistical Mechanics Theory and Experiment ◽

10.1088/1742-5468/ac3a81 ◽

2021 ◽

Vol 2021 (12) ◽

pp. 124009

Author(s):

Behrooz Ghorbani ◽

Song Mei ◽

Theodor Misiakiewicz ◽

Andrea Montanari

Keyword(s):

Neural Networks ◽

Reproducing Kernel ◽

Reproducing Kernel Hilbert Space ◽

Empirical Work ◽

Target Function ◽

Curse Of Dimensionality ◽

Stochastic Gradient Descent ◽

Dimensional Structure ◽

Unified Framework ◽

Low Dimensional

Abstract For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NN) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layers NNs are known to encode richer smoothness classes than RKHS and we know of special examples for which SGD-trained NN provably outperform RKHS. This is true even in the wide network limit, for a different scaling of the initialization. How can we reconcile the above claims? For which tasks do NNs outperform RKHS? If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present the spiked covariates model that can capture in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test numerically this hypothesis by showing that specific perturbations of the training distribution degrade the performances of RKHS methods much more significantly than NNs.

Get full-text (via PubEx)

Dimensionality Reduction of Complex Metastable Systems via Kernel Embeddings of Transition Manifolds

Journal of Nonlinear Science ◽

10.1007/s00332-020-09668-z ◽

2020 ◽

Vol 31 (1) ◽

Author(s):

Andreas Bittracher ◽

Stefan Klus ◽

Boumediene Hamzi ◽

Péter Koltai ◽

Christof Schütte

Keyword(s):

Stochastic Systems ◽

Reproducing Kernel ◽

Learning Algorithm ◽

Reproducing Kernel Hilbert Space ◽

Mathematical Framework ◽

Reaction Coordinates ◽

Effective Dynamics ◽

Distortion Bounds ◽

Low Dimensional ◽

Metastable Systems

AbstractWe present a novel kernel-based machine learning algorithm for identifying the low-dimensional geometry of the effective dynamics of high-dimensional multiscale stochastic systems. Recently, the authors developed a mathematical framework for the computation of optimal reaction coordinates of such systems that is based on learning a parameterization of a low-dimensional transition manifold in a certain function space. In this article, we enhance this approach by embedding and learning this transition manifold in a reproducing kernel Hilbert space, exploiting the favorable properties of kernel embeddings. Under mild assumptions on the kernel, the manifold structure is shown to be preserved under the embedding, and distortion bounds can be derived. This leads to a more robust and more efficient algorithm compared to the previous parameterization approaches.

Get full-text (via PubEx)

An Interpretable Denoising Layer for Neural Networks Based on Reproducing Kernel Hilbert Space and its Application in Machine Fault Diagnosis

Chinese Journal of Mechanical Engineering ◽

10.1186/s10033-021-00564-5 ◽

2021 ◽

Vol 34 (1) ◽

Author(s):

Baoxuan Zhao ◽

Changming Cheng ◽

Guowei Tu ◽

Zhike Peng ◽

Qingbo He ◽

...

Keyword(s):

Neural Networks ◽

Hilbert Space ◽

Fault Diagnosis ◽

Reproducing Kernel ◽

Design Method ◽

Reproducing Kernel Hilbert Space ◽

Computational Cost ◽

Machine Fault Diagnosis ◽

Machine Fault ◽

Modeling Strategy

AbstractDeep learning algorithms based on neural networks make remarkable achievements in machine fault diagnosis, while the noise mixed in measured signals harms the prediction accuracy of networks. Existing denoising methods in neural networks, such as using complex network architectures and introducing sparse techniques, always suffer from the difficulty of estimating hyperparameters and the lack of physical interpretability. To address this issue, this paper proposes a novel interpretable denoising layer based on reproducing kernel Hilbert space (RKHS) as the first layer for standard neural networks, with the aim to combine the advantages of both traditional signal processing technology with physical interpretation and network modeling strategy with parameter adaption. By investigating the influencing mechanism of parameters on the regularization procedure in RKHS, the key parameter that dynamically controls the signal smoothness with low computational cost is selected as the only trainable parameter of the proposed layer. Besides, the forward and backward propagation algorithms of the designed layer are formulated to ensure that the selected parameter can be automatically updated together with other parameters in the neural network. Moreover, exponential and piecewise functions are introduced in the weight updating process to keep the trainable weight within a reasonable range and avoid the ill-conditioned problem. Experiment studies verify the effectiveness and compatibility of the proposed layer design method in intelligent fault diagnosis of machinery in noisy environments.

Get full-text (via PubEx)

Balanced joint maximum mean discrepancy for deep transfer learning

Analysis and Applications ◽

10.1142/s0219530520400035 ◽

2020 ◽

pp. 1-18

Author(s):

Chuangji Meng ◽

Cunlu Xu ◽

Qin Lei ◽

Wei Su ◽

Jinzhao Wu

Keyword(s):

Network Architecture ◽

Reproducing Kernel ◽

Domain Adaptation ◽

Reproducing Kernel Hilbert Space ◽

Stochastic Gradient Descent ◽

Backpropagation Algorithm ◽

Maximum Mean Discrepancy ◽

Feature Representations ◽

Joint Distributions ◽

Balanced Distribution

Recent studies have revealed that deep networks can learn transferable features that generalize well to novel tasks with little or unavailable labeled data for domain adaptation. However, justifying which components of the feature representations can reason about original joint distributions using JMMD within the regime of deep architecture remains unclear. We present a new backpropagation algorithm for JMMD called the Balanced Joint Maximum Mean Discrepancy (B-JMMD) to further reduce the domain discrepancy. B-JMMD achieves the effect of balanced distribution adaptation for deep network architecture, and can be treated as an improved version of JMMD’s backpropagation algorithm. The proposed method leverages the importance of marginal and conditional distributions behind multiple domain-specific layers across domains adaptively to get a good match for the joint distributions in a second-order reproducing kernel Hilbert space. The learning of the proposed method can be performed technically by a special form of stochastic gradient descent, in which the gradient is computed by backpropagation with a strategy of balanced distribution adaptation. Theoretical analysis shows that the proposed B-JMMD is superior to JMMD method. Experiments confirm that our method yields state-of-the-art results with standard datasets.

Get full-text (via PubEx)

Modeling and Analyzing Neural Networks Using Reproducing Kernel Hilbert Space Algorithm

Applied Mathematics & Information Sciences ◽

10.18576/amis/120108 ◽

2018 ◽

Vol 12 (1) ◽

pp. 89-99 ◽

Cited By ~ 3

Author(s):

Zainah Momani ◽

Mohammad Al Shridah ◽

Omar Abu Arqub ◽

Mohammad Al-Momani ◽

Shaher Momani

Keyword(s):

Neural Networks ◽

Hilbert Space ◽

Reproducing Kernel ◽

Reproducing Kernel Hilbert Space

Get full-text (via PubEx)

Two-sample statistics based on anisotropic kernels

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iaz018 ◽

2019 ◽

Vol 9 (3) ◽

pp. 677-719 ◽

Cited By ~ 1

Author(s):

Xiuyuan Cheng ◽

Alexander Cloninger ◽

Ronald R Coifman

Keyword(s):

Reproducing Kernel ◽

Reproducing Kernel Hilbert Space ◽

Reference Points ◽

Finite Sample ◽

Maximum Mean Discrepancy ◽

Data Points ◽

Consistency Of The Test ◽

Low Dimensional ◽

Anisotropic Kernel ◽

Anisotropic Kernels

Abstract The paper introduces a new kernel-based Maximum Mean Discrepancy (MMD) statistic for measuring the distance between two distributions given finitely many multivariate samples. When the distributions are locally low-dimensional, the proposed test can be made more powerful to distinguish certain alternatives by incorporating local covariance matrices and constructing an anisotropic kernel. The kernel matrix is asymmetric; it computes the affinity between $n$ data points and a set of $n_R$ reference points, where $n_R$ can be drastically smaller than $n$. While the proposed statistic can be viewed as a special class of Reproducing Kernel Hilbert Space MMD, the consistency of the test is proved, under mild assumptions of the kernel, as long as $\|p-q\| \sqrt{n} \to \infty $, and a finite-sample lower bound of the testing power is obtained. Applications to flow cytometry and diffusion MRI datasets are demonstrated, which motivate the proposed approach to compare distributions.

Get full-text (via PubEx)

Mathematical analysis on out-of-sample extensions

International Journal of Wavelets Multiresolution and Information Processing ◽

10.1142/s021969131850042x ◽

2018 ◽

Vol 16 (05) ◽

pp. 1850042 ◽

Cited By ~ 1

Author(s):

Jianzhong Wang

Keyword(s):

Mathematical Analysis ◽

Reproducing Kernel ◽

Reproducing Kernel Hilbert Space ◽

Representation Formula ◽

Extension Operator ◽

Feature Representation ◽

Training Set ◽

Data Set ◽

Out Of Sample ◽

Low Dimensional

Let [Formula: see text] be a data set in [Formula: see text], where [Formula: see text] is the training set and [Formula: see text] is the test one. Many unsupervised learning algorithms based on kernel methods have been developed to provide dimensionality reduction (DR) embedding for a given training set [Formula: see text] ([Formula: see text]) that maps the high-dimensional data [Formula: see text] to its low-dimensional feature representation [Formula: see text]. However, these algorithms do not straightforwardly produce DR of the test set [Formula: see text]. An out-of-sample extension method provides DR of [Formula: see text] using an extension of the existent embedding [Formula: see text], instead of re-computing the DR embedding for the whole set [Formula: see text]. Among various out-of-sample DR extension methods, those based on Nyström approximation are very attractive. Many papers have developed such out-of-extension algorithms and shown their validity by numerical experiments. However, the mathematical theory for the DR extension still need further consideration. Utilizing the reproducing kernel Hilbert space (RKHS) theory, this paper develops a preliminary mathematical analysis on the out-of-sample DR extension operators. It treats an out-of-sample DR extension operator as an extension of the identity on the RKHS defined on [Formula: see text]. Then the Nyström-type DR extension turns out to be an orthogonal projection. In the paper, we also present the conditions for the exact DR extension and give the estimate for the error of the extension.

Get full-text (via PubEx)

A Unifying View of Wiener and Volterra Theory and Polynomial Kernel Regression

Neural Computation ◽

10.1162/neco.2006.18.12.3097 ◽

2006 ◽

Vol 18 (12) ◽

pp. 3097-3118 ◽

Cited By ~ 75

Author(s):

Matthias O. Franz ◽

Bernhard Schölkopf

Keyword(s):

Hilbert Space ◽

Reproducing Kernel ◽

Reproducing Kernel Hilbert Space ◽

Polynomial Kernel ◽

Implicit Representation ◽

Weakly Nonlinear ◽

The Past ◽

Volterra Theory ◽

Low Dimensional ◽

Polynomial Kernels

Volterra and Wiener series are perhaps the best-understood nonlinear system representations in signal processing. Although both approaches have enjoyed a certain popularity in the past, their application has been limited to rather low-dimensional and weakly nonlinear systems due to the exponential growth of the number of terms that have to be estimated. We show that Volterra and Wiener series can be represented implicitly as elements of a reproducing kernel Hilbert space by using polynomial kernels. The estimation complexity of the implicit representation is linear in the input dimensionality and independent of the degree of nonlinearity. Experiments show performance advantages in terms of convergence, interpretability, and system sizes that can be handled.

Get full-text (via PubEx)

Domain Adaptation via Tree Kernel Based Maximum Mean Discrepancy for User Consumption Intention Identification

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/560 ◽

2018 ◽

Cited By ~ 4

Author(s):

Xiao Ding ◽

Bibo Cai ◽

Ting Liu ◽

Qiankun Shi

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Reproducing Kernel ◽

Kernel Method ◽

Reproducing Kernel Hilbert Space ◽

Target Domain ◽

Source Domain ◽

Domain Specific ◽

Tree Kernel ◽

The Mean

Identifying user consumption intention from social media is of great interests to downstream applications. Since such task is domain-dependent, deep neural networks have been applied to learn transferable features for adapting models from a source domain to a target domain. A basic idea to solve this problem is reducing the distribution difference between the source domain and the target domain such that the transfer error can be bounded. However, the feature transferability drops dramatically in higher layers of deep neural networks with increasing domain discrepancy. Hence, previous work has to use a few target domain annotated data to train domain-specific layers. In this paper, we propose a deep transfer learning framework for consumption intention identification, to reduce the data bias and enhance the transferability in domain-specific layers. In our framework, the representation of the domain-specific layer is mapped to a reproducing kernel Hilbert space, where the mean embeddings of different domain distributions can be explicitly matched. By using an optimal tree kernel method for measuring the mean embedding matching, the domain discrepancy can be effectively reduced. The framework can learn transferable features in a completely unsupervised manner with statistical guarantees. Experimental results on five different domain datasets show that our approach dramatically outperforms state-of-the-art baselines, and it is general enough to be applied to more scenarios. The source code and datasets can be found at http://ir.hit.edu.cn/$\scriptsize{\sim}$xding/index\_english.htm.

Get full-text (via PubEx)

Distributed least squares prediction for functional linear regression

Inverse Problems ◽

10.1088/1361-6420/ac4153 ◽

2021 ◽

Author(s):

Hongzhi Tong

Keyword(s):

Least Squares ◽

Reproducing Kernel ◽

Learning Strategy ◽

Reproducing Kernel Hilbert Space ◽

Target Function ◽

Regularity Conditions ◽

Least Squares Regression ◽

Functional Linear Model ◽

Data Set ◽

Functional Linear Regression

Abstract To cope with the challenges of memory bottleneck and algorithmic scalability when massive data sets are involved, we propose a distributed least squares procedure in the framework of functional linear model and reproducing kernel Hilbert space. This approach divides the big data set into multiple subsets, applies regularized least squares regression on each of them, and then averages the individual outputs as a ﬁnal prediction. We establish the non-asymptotic prediction error bounds for the proposed learning strategy under some regularity conditions. When the target function only has weak regularity, we also introduce some unlabelled data to construct a semi-supervised approach to enlarge the number of the partitioned subsets. Results in present paper provide a theoretical guarantee that the distributed algorithm can achieve the optimal rate of convergence while allowing the whole data set to be partitioned into a large number of subsets for parallel processing.

Get full-text (via PubEx)

Analysis of regularized Nyström subsampling for regression functions of low smoothness

Analysis and Applications ◽

10.1142/s0219530519500039 ◽

2019 ◽

Vol 17 (06) ◽

pp. 931-946 ◽

Cited By ~ 1

Author(s):

Shuai Lu ◽

Peter Mathé ◽

Sergiy Pereverzyev

Keyword(s):

Reproducing Kernel ◽

Regularization Parameter ◽

Reproducing Kernel Hilbert Space ◽

Practical Importance ◽

Target Function ◽

Mild Conditions ◽

Learning Rates ◽

Regression Functions ◽

Large Kernel ◽

General Source

This paper studies a Nyström-type subsampling approach to large kernel learning methods in the misspecified case, where the target function is not assumed to belong to the reproducing kernel Hilbert space generated by the underlying kernel. This case is less understood in spite of its practical importance. To model such a case, the smoothness of target functions is described in terms of general source conditions. It is surprising that almost for the whole range of the source conditions, describing the misspecified case, the corresponding learning rate bounds can be achieved with just one value of the regularization parameter. This observation allows a formulation of mild conditions under which the plain Nyström subsampling can be realized with subquadratic cost maintaining the guaranteed learning rates.

Get full-text (via PubEx)