Two Further Gradient BYY Learning Rules for Gaussian Mixture with Automated Model Selection

Author(s):  
Jinwen Ma ◽  
Bin Gao ◽  
Yang Wang ◽  
Qiansheng Cheng


2006 ◽
Vol 18 (5) ◽  
pp. 1007-1065 ◽  
Author(s):  
Shun-ichi Amari ◽  
Hyeyoung Park ◽  
Tomoko Ozeki

The parameter spaces of hierarchical systems such as multilayer perceptrons include singularities due to the symmetry and degeneration of hidden units. A parameter space forms a geometrical manifold, called the neuromanifold in the case of neural networks. Such a model is identified with a statistical model, and a Riemannian metric is given by the Fisher information matrix. However, the matrix degenerates at singularities. Such a singular structure is ubiquitous not only in multilayer perceptrons but also in gaussian mixture probability densities, ARMA time-series models, and many other cases. The standard statistical paradigm of the Cramér-Rao theorem does not hold, and the singularity gives rise to strange behaviors in parameter estimation, hypothesis testing, Bayesian inference, model selection, and in particular, the dynamics of learning from examples. Prevailing theories so far have not paid much attention to the problems caused by singularity, relying only on ordinary statistical theories developed for regular (nonsingular) models. Only recently have researchers remarked on the effects of singularity, and theories are now being developed. This article gives an overview of the phenomena caused by the singularities of statistical manifolds related to multilayer perceptrons and gaussian mixtures. We demonstrate our recent results on these problems. Simple toy models are also used to show explicit solutions. We explain that the maximum likelihood estimator is no longer subject to the gaussian distribution even asymptotically, because the Fisher information matrix degenerates, that model selection criteria such as AIC, BIC, and MDL fail to hold in these models, that a smooth Bayesian prior becomes singular in such models, and that the trajectories of dynamics of learning are strongly affected by the singularity, causing plateaus or slow manifolds in the parameter space. The natural gradient method is shown to perform well because it takes the singular geometrical structure into account. The generalization error and the training error are studied in some examples.
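To make the degenerate Fisher metric concrete, a toy gaussian-mixture model of the kind alluded to above can be written down explicitly; the following display is an illustrative sketch in the spirit of the article, not an equation quoted from it:

\[
p(x \mid w, \mu) = (1 - w)\,\varphi(x) + w\,\varphi(x - \mu),
\qquad
\varphi(x) = \tfrac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}.
\]

Every parameter pair on the set \( \{w = 0\} \cup \{\mu = 0\} \) yields the same density \( \varphi(x) \), so the parameters are not identifiable there; the score with respect to one of the coordinates vanishes identically, the Fisher information matrix \( G(w,\mu) = \mathbb{E}\!\left[\nabla_{\theta} \log p \,\nabla_{\theta} \log p^{\top}\right] \) loses rank, and the usual Cramér-Rao and asymptotic-normality arguments break down on exactly this singular set.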


2001 ◽  
Vol 11 (01) ◽  
pp. 43-69 ◽  
Author(s):  
Lei Xu

After introducing the fundamentals of the BYY system and harmony learning, which have been developed over the past several years as a unified statistical framework for parameter learning, regularization and model selection, we systematically discuss this BYY harmony learning on systems with discrete inner representations. First, we show that one special case leads to unsupervised learning on Gaussian mixtures. We show how harmony learning not only leads us to the EM algorithm for maximum likelihood (ML) learning and the corresponding extended k-means algorithms for Mahalanobis clustering with criteria for selecting the number of Gaussians or clusters, but also provides two new regularization techniques and a unified scheme that includes the previous rival penalized competitive learning (RPCL) as well as its various variants and extensions and that performs model selection automatically during parameter learning. Moreover, as a by-product, we also obtain a new approach for determining a set of 'support vectors' for Parzen window density estimation. Second, we show that other special cases lead to three typical supervised learning models with several new results. For the three-layer net, we obtain (i) a new regularized ML learning, (ii) a new criterion for selecting the number of hidden units, and (iii) a family of EM-like algorithms that combine harmony learning with new regularization techniques. For the original and alternative models of the mixture-of-experts (ME) as well as radial basis function (RBF) nets, we obtain not only a new type of criterion for selecting the number of experts or basis functions but also a new type of EM-like algorithm that combines regularization techniques and RPCL learning for parameter learning, with either a least-complexity nature on the original ME model or automated model selection on the alternative ME model and RBF nets. Moreover, all the results for the alternative ME model also apply to two other popular nonparametric statistical approaches, namely kernel regression and the support vector machine. In particular, we obtain not only an easily implemented approach for determining the smoothing parameter in kernel regression, but also an alternative approach for deciding the set of support vectors in the support vector machine.
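Since the rival penalized competitive learning (RPCL) mechanism is central to the automatic model selection described above, a minimal sketch of its basic update may help. The learning rates, the pruning rule, and the omission of the conscience (frequency-sensitive) factor below are simplifying assumptions of this illustration, not details taken from the paper.

```python
# Minimal RPCL sketch: each sample attracts its nearest center (the winner)
# and slightly repels the second-nearest center (the rival); superfluous
# centers drift away from the data and can be pruned afterwards, which is
# how the number of clusters is selected automatically.
import numpy as np

def rpcl(X, k_init=10, alpha_win=0.05, alpha_rival=0.002,
         n_epochs=50, prune_dist=None, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k_init, replace=False)].copy()

    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            d = np.linalg.norm(centers - x, axis=1)
            winner, rival = np.argsort(d)[:2]
            centers[winner] += alpha_win * (x - centers[winner])     # attract winner
            centers[rival] -= alpha_rival * (x - centers[rival])     # repel rival

    if prune_dist is not None:
        # Keep only centers that ended up close to at least one sample.
        keep = np.array([np.min(np.linalg.norm(X - c, axis=1)) < prune_dist
                         for c in centers])
        centers = centers[keep]
    return centers
```

With the rival rate much smaller than the winner rate, the repulsion is gentle enough not to disturb well-placed centers but strong enough to expel extra ones, so an overestimated initial number of clusters is corrected during learning.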


Author(s):  
Jinwen Ma ◽  
Bin Gao ◽  
Yang Wang ◽  
Qiansheng Cheng

Under the Bayesian Ying–Yang (BYY) harmony learning theory, a harmony function has been developed on a BI-directional architecture of the BYY system for Gaussian mixtures, with the important feature that, when it is maximized through a general gradient rule, model selection can be made automatically during parameter learning on a set of sample data drawn from a Gaussian mixture. This paper further proposes conjugate and natural gradient rules to implement the maximization of the harmony function, i.e., the BYY harmony learning, on Gaussian mixtures more efficiently. Simulation experiments demonstrate that these two new gradient rules not only work well but also converge more quickly than the general gradient rule.
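For context, the harmony function referred to above takes, in the related BYY harmony learning literature on Gaussian mixtures, essentially the following form; it is reproduced here from that broader literature as background, and the notation may differ in detail from the paper itself:

\[
J(\Theta_k) = \frac{1}{N} \sum_{t=1}^{N} \sum_{j=1}^{k} p(j \mid x_t)\,
\ln\!\bigl[\alpha_j\, q(x_t \mid m_j, \Sigma_j)\bigr],
\qquad
p(j \mid x_t) = \frac{\alpha_j\, q(x_t \mid m_j, \Sigma_j)}
{\sum_{i=1}^{k} \alpha_i\, q(x_t \mid m_i, \Sigma_i)},
\]

where \(q(\cdot \mid m_j, \Sigma_j)\) is a Gaussian density with mean \(m_j\) and covariance \(\Sigma_j\), and the \(\alpha_j\) are the mixing proportions. Gradient ascent on \(J\) drives the mixing proportions of superfluous components toward zero, which is what makes the model selection automatic; a conjugate gradient rule replaces the raw ascent direction with directions conjugate to the previous ones, while a natural gradient rule premultiplies the ordinary gradient by the inverse of the Riemannian (Fisher) metric of the parameter manifold, and, as the abstract notes, both reach the maximum in fewer iterations than plain gradient ascent.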

