Mathematical analysis on out-of-sample extensions

Author(s):  
Jianzhong Wang

Let [Formula: see text] be a data set in [Formula: see text], where [Formula: see text] is the training set and [Formula: see text] is the test set. Many unsupervised learning algorithms based on kernel methods have been developed to provide a dimensionality reduction (DR) embedding for a given training set [Formula: see text] ([Formula: see text]) that maps the high-dimensional data [Formula: see text] to its low-dimensional feature representation [Formula: see text]. However, these algorithms do not straightforwardly produce a DR of the test set [Formula: see text]. An out-of-sample extension method provides a DR of [Formula: see text] by extending the existing embedding [Formula: see text] instead of re-computing the DR embedding for the whole set [Formula: see text]. Among the various out-of-sample DR extension methods, those based on the Nyström approximation are particularly attractive. Many papers have developed such out-of-sample extension algorithms and shown their validity by numerical experiments, but the mathematical theory of DR extension still needs further consideration. Utilizing reproducing kernel Hilbert space (RKHS) theory, this paper develops a preliminary mathematical analysis of out-of-sample DR extension operators. It treats an out-of-sample DR extension operator as an extension of the identity on the RKHS defined on [Formula: see text]; the Nyström-type DR extension then turns out to be an orthogonal projection. We also present conditions for exact DR extension and give an estimate of the extension error.
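
As a concrete illustration of the Nyström-type extension discussed in the abstract, here is a minimal sketch (not taken from the paper; the Gaussian kernel, the helper names `rbf_kernel`, `train_embedding`, `nystrom_extend`, and the bandwidth `gamma` are illustrative assumptions). It extends a kernel eigenmap embedding of a training set to unseen test points via the classical Nyström formula $f_j(z) = \lambda_j^{-1}\sum_i k(z, x_i)\,\phi_j(x_i)$.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def train_embedding(X, d, gamma=1.0):
    """Kernel eigenmap DR of the training set X: top-d eigenpairs of the kernel matrix."""
    K = rbf_kernel(X, X, gamma)
    evals, evecs = np.linalg.eigh(K)
    idx = np.argsort(evals)[::-1][:d]
    return evals[idx], evecs[:, idx]          # lambda_j and phi_j(x_i)

def nystrom_extend(Z, X, evals, evecs, gamma=1.0):
    """Out-of-sample extension: f_j(z) = (1/lambda_j) * sum_i k(z, x_i) * phi_j(x_i)."""
    return rbf_kernel(Z, X, gamma) @ evecs / evals
```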

2020 ◽  
Vol 31 (1) ◽  
Author(s):  
Andreas Bittracher ◽  
Stefan Klus ◽  
Boumediene Hamzi ◽  
Péter Koltai ◽  
Christof Schütte

Abstract We present a novel kernel-based machine learning algorithm for identifying the low-dimensional geometry of the effective dynamics of high-dimensional multiscale stochastic systems. Recently, the authors developed a mathematical framework for the computation of optimal reaction coordinates of such systems that is based on learning a parameterization of a low-dimensional transition manifold in a certain function space. In this article, we enhance this approach by embedding and learning this transition manifold in a reproducing kernel Hilbert space, exploiting the favorable properties of kernel embeddings. Under mild assumptions on the kernel, the manifold structure is shown to be preserved under the embedding, and distortion bounds can be derived. This leads to a more robust and more efficient algorithm compared to the previous parameterization approaches.
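
A rough sketch of the embedding step, under the assumption that each sampled state is represented by short simulation bursts and that a Gaussian kernel is used (the function names and the kernel-PCA parameterization below are one reading of the approach, not the authors' implementation): each state is mapped to the empirical kernel mean embedding of its burst endpoints, and kernel PCA on the resulting Gram matrix parameterizes the embedded transition manifold.

```python
import numpy as np

def gram_of_mean_embeddings(bursts, gamma=1.0):
    """bursts: array (n_points, n_replicas, dim) of trajectory endpoints started
    from each of n_points states.  Returns G with G[i, j] = <mu_i, mu_j> in the
    RKHS, estimated as the average kernel value over all endpoint pairs."""
    n, m, d = bursts.shape
    flat = bursts.reshape(n * m, d)
    sq = np.sum(flat**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T))
    return K.reshape(n, m, n, m).mean(axis=(1, 3))

def embedded_coordinates(G, k):
    """Kernel PCA on the centred Gram matrix of mean embeddings: the leading k
    components give a parameterization of the embedded transition manifold."""
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    evals, evecs = np.linalg.eigh(H @ G @ H)
    idx = np.argsort(evals)[::-1][:k]
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))
```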


2021 ◽  
Author(s):  
Noor Ahmad ◽  
Mohd Hafiz Mohd

The extrapolated kernel least mean square algorithm (extrap-KLMS) with memory is proposed for forecasting future trends of COVID-19. The extrap-KLMS is derived in a data-driven modelling framework that attempts to describe the dynamics of the infectious disease by reconstructing the phase space of the state variables in a reproducing kernel Hilbert space (RKHS). Short-time forecasting is enabled via an extrapolation of the trained KLMS model using a forward Euler step along the direction of a memory-dependent gradient estimate. A user-defined memory averaging window allows users to incorporate prior knowledge of the history of the pandemic into the gradient estimate, thus providing a spectrum of scenario-based estimates of future trends. The performance of the extrap-KLMS method is validated using data sets for Malaysia, Saudi Arabia, and Italy, in which we highlight the flexibility of the method in capturing persistent trends of the pandemic. A situational analysis of the Malaysian third wave further demonstrates the capabilities of our method.
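
A loose numerical sketch of the idea (not the authors' code; the delay-embedding length `embed`, the memory `window`, the step size `h`, and the class/function names are illustrative assumptions): a KLMS filter is trained on delay vectors to predict one-step increments, and forecasts are produced by forward Euler steps along a memory-averaged increment estimate.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

class KLMS:
    """Kernel least mean square filter: the model is a growing kernel expansion."""
    def __init__(self, eta=0.2, gamma=0.5):
        self.eta, self.gamma, self.centers, self.coeffs = eta, gamma, [], []

    def predict(self, u):
        u = np.atleast_1d(u)
        return sum(c * rbf(u, x, self.gamma) for c, x in zip(self.coeffs, self.centers))

    def update(self, u, d):
        err = d - self.predict(u)
        self.centers.append(np.atleast_1d(u).astype(float))
        self.coeffs.append(self.eta * err)
        return err

def fit_and_forecast(series, embed=3, horizon=14, window=7, h=1.0):
    """Train KLMS on delay vectors -> next increment, then forecast by forward
    Euler steps along the average of the last `window` predicted increments."""
    model, x = KLMS(), np.asarray(series, float)
    for t in range(embed, len(x) - 1):
        model.update(x[t - embed:t], x[t + 1] - x[t])     # learn the local slope
    out = list(x)
    for _ in range(horizon):
        grads = [model.predict(np.array(out[t - embed:t]))
                 for t in range(len(out) - window, len(out))]
        out.append(out[-1] + h * np.mean(grads))          # memory-averaged Euler step
    return out[len(x):]
```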


2019 ◽  
Vol 9 (3) ◽  
pp. 677-719 ◽  
Author(s):  
Xiuyuan Cheng ◽  
Alexander Cloninger ◽  
Ronald R Coifman

Abstract The paper introduces a new kernel-based Maximum Mean Discrepancy (MMD) statistic for measuring the distance between two distributions given finitely many multivariate samples. When the distributions are locally low-dimensional, the proposed test can be made more powerful to distinguish certain alternatives by incorporating local covariance matrices and constructing an anisotropic kernel. The kernel matrix is asymmetric; it computes the affinity between $n$ data points and a set of $n_R$ reference points, where $n_R$ can be drastically smaller than $n$. While the proposed statistic can be viewed as a special class of Reproducing Kernel Hilbert Space MMD, the consistency of the test is proved under mild assumptions on the kernel, as long as $\|p-q\| \sqrt{n} \to \infty $, and a finite-sample lower bound on the testing power is obtained. Applications to flow cytometry and diffusion MRI datasets are demonstrated, which motivate the proposed approach to comparing distributions.
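
A hedged sketch of the asymmetric, anisotropic affinity and a resulting two-sample statistic (the exact form of the statistic, the regularization `1e-6`, and the function names are assumptions made for illustration, not the paper's definitions): the affinity between each sample and each reference point is shaped by the local covariance at that reference, and the two samples are compared through their mean affinity profiles over the reference set.

```python
import numpy as np

def anisotropic_affinity(X, refs, covs, sigma2=1.0):
    """Asymmetric affinity between data points X (n, d) and reference points
    refs (n_R, d), shaped by the local covariance covs[r] (d, d) at each reference."""
    n, nR = X.shape[0], refs.shape[0]
    A = np.empty((n, nR))
    for r in range(nR):
        diff = X - refs[r]
        Cinv = np.linalg.inv(covs[r] + 1e-6 * np.eye(refs.shape[1]))
        A[:, r] = np.exp(-np.einsum('ij,jk,ik->i', diff, Cinv, diff) / (2.0 * sigma2))
    return A

def mmd_like_statistic(X, Y, refs, covs, sigma2=1.0):
    """Compare two samples by the squared distance between their mean affinity
    profiles over the reference set (a sketch of the anisotropic-kernel MMD idea)."""
    wx = anisotropic_affinity(X, refs, covs, sigma2).mean(axis=0)
    wy = anisotropic_affinity(Y, refs, covs, sigma2).mean(axis=0)
    return np.sum((wx - wy) ** 2)
```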


2006 ◽  
Vol 18 (12) ◽  
pp. 3097-3118 ◽  
Author(s):  
Matthias O. Franz ◽  
Bernhard Schölkopf

Volterra and Wiener series are perhaps the best-understood nonlinear system representations in signal processing. Although both approaches have enjoyed a certain popularity in the past, their application has been limited to rather low-dimensional and weakly nonlinear systems due to the exponential growth of the number of terms that have to be estimated. We show that Volterra and Wiener series can be represented implicitly as elements of a reproducing kernel Hilbert space by using polynomial kernels. The estimation complexity of the implicit representation is linear in the input dimensionality and independent of the degree of nonlinearity. Experiments show performance advantages in terms of convergence, interpretability, and system sizes that can be handled.
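
A minimal sketch of the implicit representation, assuming an inhomogeneous polynomial kernel and a kernel ridge regression estimator (the function names and the regularization parameter `lam` are illustrative; the paper's estimators may differ): the feature space of $(1 + \langle u, u'\rangle)^p$ contains every monomial of the input window up to degree $p$, i.e. every Volterra term up to order $p$, while the computation only ever touches the $n \times n$ kernel matrix.

```python
import numpy as np

def poly_kernel(A, B, degree=3):
    """Inhomogeneous polynomial kernel (1 + <a, b>)^p: its feature space spans
    all monomials of the input up to degree p, i.e. all Volterra terms up to order p."""
    return (1.0 + A @ B.T) ** degree

def fit_implicit_volterra(U, y, degree=3, lam=1e-3):
    """Kernel ridge regression on input windows U (n, m) with targets y (n,):
    an implicit Volterra-series estimate whose cost is independent of the
    number of expansion terms."""
    K = poly_kernel(U, U, degree)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return lambda U_new: poly_kernel(U_new, U, degree) @ alpha
```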


2021 ◽  
Vol 2021 (12) ◽  
pp. 124009
Author(s):  
Behrooz Ghorbani ◽  
Song Mei ◽  
Theodor Misiakiewicz ◽  
Andrea Montanari

Abstract For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NNs) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layer NNs are known to encode richer smoothness classes than RKHS methods, and we know of special examples for which SGD-trained NNs provably outperform RKHS methods. This is true even in the wide-network limit, for a different scaling of the initialization. How can we reconcile the above claims? For which tasks do NNs outperform RKHS methods? If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present the spiked covariates model, which can capture in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test this hypothesis numerically by showing that specific perturbations of the training distribution degrade the performance of RKHS methods much more significantly than that of NNs.
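
One plausible reading of the spiked covariates model, sketched numerically (the variance ratio `spike`, the sine target, and the Gaussian-kernel ridge baseline are assumptions for illustration, not the paper's exact setup): a few covariate directions carry most of the variance and the target depends only on those directions, so one can probe how an RKHS method fares when the spiked structure is or is not aligned with the target.

```python
import numpy as np

def spiked_covariates(n, d, d_spike=2, spike=10.0, rng=None):
    """Covariates whose first d_spike coordinates have variance `spike`
    (the 'spiked' directions); the remaining d - d_spike are isotropic."""
    rng = np.random.default_rng(rng)
    scales = np.r_[np.full(d_spike, np.sqrt(spike)), np.ones(d - d_spike)]
    return rng.standard_normal((n, d)) * scales

def target(X, d_spike=2):
    """Target depends only on the low-dimensional (spiked) projection."""
    return np.sin(X[:, :d_spike].sum(axis=1))

def krr_test_error(X_tr, y_tr, X_te, y_te, gamma=0.05, lam=1e-2):
    """Gaussian-kernel ridge regression test error (RKHS baseline)."""
    def K(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * sq)
    alpha = np.linalg.solve(K(X_tr, X_tr) + lam * np.eye(len(y_tr)), y_tr)
    return np.mean((K(X_te, X_tr) @ alpha - y_te) ** 2)
```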


2016 ◽  
Vol 14 (06) ◽  
pp. 795-808 ◽  
Author(s):  
Andreas Christmann ◽  
Florian Dumpert ◽  
Dao-Hong Xiang

Statistical machine learning plays an important role in modern statistics and computer science. One main goal of statistical machine learning is to provide universally consistent algorithms, i.e., the estimator converges in probability, or in some stronger sense, to the Bayes risk or to the Bayes decision function. Kernel methods based on minimizing the regularized risk over a reproducing kernel Hilbert space (RKHS) belong to these statistical machine learning methods. It is in general unknown which kernel yields optimal results for a particular data set or for the unknown probability measure. Hence, various kernel learning methods have been proposed to choose the kernel, and therefore also its RKHS, in a data-adaptive manner. Nevertheless, many practitioners often use the classical Gaussian RBF kernel or certain Sobolev kernels with good success. The goal of this paper is to offer one possible theoretical explanation for this empirical fact.
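
For reference, the regularized risk minimization these kernel methods perform over an RKHS $H$ with kernel $k$ can be written, for a generic loss $L$ and with the Gaussian RBF kernel as the standard example, as

$$f_{D,\lambda} \;=\; \operatorname*{arg\,min}_{f \in H} \; \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) \;+\; \lambda \lVert f\rVert_H^2, \qquad k_\gamma(x, x') = \exp\!\bigl(-\gamma^{-2}\lVert x - x'\rVert^2\bigr).$$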


2021 ◽  
Author(s):  
Hongzhi Tong

Abstract To cope with the challenges of memory bottlenecks and algorithmic scalability when massive data sets are involved, we propose a distributed least squares procedure in the framework of the functional linear model and reproducing kernel Hilbert space. This approach divides the big data set into multiple subsets, applies regularized least squares regression to each of them, and then averages the individual outputs as a final prediction. We establish non-asymptotic prediction error bounds for the proposed learning strategy under some regularity conditions. When the target function has only weak regularity, we also introduce some unlabelled data to construct a semi-supervised approach that enlarges the number of partitioned subsets. The results in the present paper provide a theoretical guarantee that the distributed algorithm can achieve the optimal rate of convergence while allowing the whole data set to be partitioned into a large number of subsets for parallel processing.
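
A minimal sketch of the divide-and-conquer step, using a generic Gaussian-kernel ridge regression as a stand-in for the paper's functional linear / RKHS estimator (the random partitioning scheme, bandwidth `gamma`, and regularization `lam` below are illustrative assumptions): fit a local estimator on each subset and average the local predictions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def fit_local(X, y, gamma=1.0, lam=1e-2):
    """Regularized least squares on one partition; returns a local predictor."""
    alpha = np.linalg.solve(rbf_kernel(X, X, gamma) + lam * len(y) * np.eye(len(y)), y)
    return lambda Z: rbf_kernel(Z, X, gamma) @ alpha

def distributed_krr(X, y, n_parts=10, gamma=1.0, lam=1e-2):
    """Divide-and-conquer estimator: fit kernel ridge regression on each subset
    and average the individual outputs as the final prediction."""
    parts = np.array_split(np.random.permutation(len(y)), n_parts)
    local_models = [fit_local(X[p], y[p], gamma, lam) for p in parts]
    return lambda Z: np.mean([f(Z) for f in local_models], axis=0)
```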


2017 ◽  
Vol 43 (3) ◽  
pp. 567-592 ◽  
Author(s):  
Dong Nguyen ◽  
Jacob Eisenstein

Quantifying the degree of spatial dependence for linguistic variables is a key task for analyzing dialectal variation. However, existing approaches have important drawbacks. First, they are based on parametric models of dependence, which limits their power in cases where the underlying parametric assumptions are violated. Second, they are not applicable to all types of linguistic data: some approaches apply only to frequencies, others only to Boolean indicators of whether a linguistic variable is present. We present a new method for measuring geographical language variation, which solves both of these problems. Our approach builds on Reproducing Kernel Hilbert Space (RKHS) representations for nonparametric statistics, and takes the form of a test statistic that is computed from pairs of individual geotagged observations without aggregation into predefined geographical bins. We compare this test with prior work using synthetic data as well as a diverse set of real data sets: a corpus of Dutch tweets, a Dutch syntactic atlas, and a data set of letters to the editor in North American newspapers. Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.
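
In that spirit, here is a hedged sketch of an HSIC-style kernel dependence statistic between individual geotagged observations and their linguistic features, computed without spatial binning (this is a generic kernel dependence measure for illustration, not the authors' exact test statistic; the numeric feature encoding and the bandwidths are assumptions).

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gaussian Gram matrix on rows of X (n, p)."""
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def hsic_statistic(locations, features, gamma_loc=1.0, gamma_feat=1.0):
    """Kernel dependence (HSIC-style) between geotagged locations (n, 2) and
    linguistic features (n, p) encoded numerically, e.g. frequencies or 0/1
    indicators, computed from individual observations without binning."""
    n = len(locations)
    K = rbf_gram(np.asarray(locations, float), gamma_loc)
    L = rbf_gram(np.asarray(features, float), gamma_feat)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```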


2022 ◽  
Vol 12 ◽  
Author(s):  
David Bonnett ◽  
Yongle Li ◽  
Jose Crossa ◽  
Susanne Dreisigacker ◽  
Bhoja Basnet ◽  
...  

We investigated increasing genetic gain for grain yield using early-generation genomic selection (GS). A training set of 1,334 elite wheat breeding lines tested over three field seasons was used to generate Genomic Estimated Breeding Values (GEBVs) for grain yield under irrigated conditions, applying markers and three different prediction methods: (1) Genomic Best Linear Unbiased Predictor (GBLUP), (2) GBLUP with imputation of missing genotypic data by Ridge Regression BLUP (rrGBLUP_imp), and (3) Reproducing Kernel Hilbert Space (RKHS), a.k.a. Gaussian Kernel (GK). F2 GEBVs were generated for 1,924 individuals from 38 biparental cross populations between 21 parents selected from the training set. Results showed that the F2 GEBVs from the different methods were not correlated. Experiment 1 consisted of selecting the F2s with the highest average GEBVs and advancing them to form genomically selected bulks and make intercross populations aiming to combine favorable alleles for yield. F4:6 lines were derived from genomically selected bulks, intercrosses, and conventional breeding methods, in similar numbers from each. Field testing for Experiment 1 did not find any difference in yield between genomic and conventional selection. Experiment 2 compared the predictive ability of the different GEBV calculation methods in F2 using a set of single-plant-derived F2:4 lines from randomly selected F2 plants. Grain yield results from Experiment 2 showed a significant positive correlation between observed yields of F2:4 lines and predicted yield GEBVs of F2 single plants from GK (predictive ability of 0.248, P < 0.001) and GBLUP (0.195, P < 0.01), but no correlation with rrGBLUP_imp. The results demonstrate the potential for applying GS in early generations of wheat breeding and the importance of using the appropriate statistical model for GEBV calculation, which may not be the same as the best model for inbreds.
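
A minimal sketch of the RKHS/Gaussian-kernel prediction step on marker data (a plain kernel-ridge solve is used here as a simplification; in practice RKHS genomic prediction is typically fit as a Bayesian mixed model, and the bandwidth `h`, regularization `lam`, and function names are illustrative assumptions): train on the phenotyped lines' marker profiles and predict GEBV-like values for new genotypes such as F2 individuals.

```python
import numpy as np

def gaussian_kernel(M1, M2, h=1.0):
    """Gaussian kernel on marker matrices (rows = lines, columns = markers)."""
    sq = np.sum(M1**2, 1)[:, None] + np.sum(M2**2, 1)[None, :] - 2.0 * M1 @ M2.T
    return np.exp(-sq / h)

def rkhs_gebv(M_train, y_train, M_new, h=1.0, lam=1.0):
    """RKHS (Gaussian kernel) regression: fit on the training lines' phenotypes
    and return GEBV-like predictions for new genotypes."""
    K = gaussian_kernel(M_train, M_train, h)
    alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
    return gaussian_kernel(M_new, M_train, h) @ alpha
```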

