Multiple scaled contaminated normal distribution and its application in clustering

2019 ◽  
pp. 1471082X1989093 ◽  
Author(s):  
Antonio Punzo ◽  
Cristina Tortora

The multivariate contaminated normal (MCN) distribution represents a simple heavy-tailed generalization of the multivariate normal (MN) distribution, used to model elliptically contoured scatters in the presence of mild outliers (also referred to as ‘bad’ points herein) and to automatically detect bad points. The price of these advantages is two additional parameters: the proportion of good observations and the degree of contamination. However, in a multivariate setting, a single proportion of good observations and a single degree of contamination may be limiting. To overcome this limitation, we propose a multiple scaled contaminated normal (MSCN) distribution. Among its parameters, we have an orthogonal matrix Γ. In the space spanned by the vectors (principal components) of Γ, there is a proportion of good observations and a degree of contamination for each component. Moreover, each observation has a posterior probability of being good with respect to each principal component. Thanks to this probability, the method provides directional robust estimates of the parameters of the nested MN and automatic directional detection of bad points. The term ‘directional’ specifies that the method works separately for each principal component. Mixtures of MSCN distributions are also proposed, and an expectation-maximization algorithm is used for parameter estimation. Real and simulated data are considered to show the usefulness of our mixture with respect to well-established mixtures of symmetric distributions with heavy tails.
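The "posterior probability of being good" that drives the outlier detection can be illustrated in one dimension. A minimal sketch (univariate, not the authors' multivariate MSCN; the parameter values are hypothetical):

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_good(x, mu=0.0, var=1.0, alpha=0.9, eta=10.0):
    """P(point is 'good' | x) under a contaminated normal
    alpha*N(mu, var) + (1-alpha)*N(mu, eta*var):
    the 'bad' component is the same normal with inflated variance."""
    good = alpha * normal_pdf(x, mu, var)
    bad = (1 - alpha) * normal_pdf(x, mu, eta * var)
    return good / (good + bad)
```

Points near the center get a posterior close to one, while distant points are flagged as bad; in the MSCN this computation is performed separately along each principal component.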

2014 ◽  
Vol 2014 ◽  
pp. 1-16 ◽  
Author(s):  
Kanokmon Rujirakul ◽  
Chakchai So-In ◽  
Banchar Arnonkijpanich

Principal component analysis (PCA) has traditionally been used as one of the feature extraction techniques in face recognition systems, yielding high accuracy while requiring only a small number of features. However, the covariance matrix and eigenvalue decomposition stages cause high computational complexity, especially for a large database. Thus, this research presents an alternative approach that utilizes an Expectation-Maximization algorithm to reduce the determinant matrix manipulation, lowering the complexity of these stages. To improve the computational time, a novel parallel architecture was employed to exploit the benefits of parallelized matrix computation during the feature extraction and classification stages, including parallel preprocessing and their combinations, in a so-called Parallel Expectation-Maximization PCA architecture. Compared to traditional PCA and its derivatives, the results indicate lower complexity with an insignificant difference in recognition precision, leading to high-speed face recognition systems with speed-ups of more than nine and three times over PCA and Parallel PCA, respectively.
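The core idea of replacing the explicit covariance eigendecomposition with EM iterations can be sketched as follows (a serial, noise-free EM-for-PCA recursion in plain NumPy; this is a generic illustration, not the paper's parallel architecture):

```python
import numpy as np

def em_pca(Y, k, iters=200, seed=0):
    """EM iterations for PCA that avoid forming and decomposing the
    full covariance matrix. Y: (d, n) centered data; returns an
    orthonormal (d, k) basis for the principal subspace."""
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    W = rng.standard_normal((d, k))  # random initial basis
    for _ in range(iters):
        # E-step: latent coordinates of the data given the current basis
        X = np.linalg.solve(W.T @ W, W.T @ Y)
        # M-step: new basis that best reconstructs Y from X
        W = Y @ X.T @ np.linalg.inv(X @ X.T)
    # Orthonormalize the converged basis
    Q, _ = np.linalg.qr(W)
    return Q
```

Each iteration costs O(dnk) instead of the O(d^3) of a full eigendecomposition, which is the complexity reduction the abstract refers to; the matrix products here are also the natural targets for parallelization.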


2014 ◽  
Vol 46 (4) ◽  
pp. 731-741 ◽  
Author(s):  
David Doyle ◽  
Robert Elgie

This article aims to maximize the reliability of presidential power scores for a larger number of countries and time periods than currently exists for any single measure, and in a way that is replicable and easy to update. It begins by identifying all of the studies that have estimated the effect of a presidential power variable, clarifying what scholars have attempted to capture when they have operationalized the concept of presidential power. It then identifies all the measures of presidential power that have been proposed over the years, noting the problems associated with each. To generate the new set of presidential power scores, the study draws upon the comparative and local knowledge embedded in existing measures of presidential power. Employing principal component analysis, together with the expectation maximization algorithm and maximum likelihood estimation, a set of presidential power scores is generated for a larger set of countries and country time periods than currently exists, reporting 95 per cent confidence intervals and standard errors for the scores. Finally, the implications of the new set of scores for future studies of presidential power are discussed.
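Combining existing measures via PCA when many country-years are missing from individual measures requires an EM-style treatment of the missing entries. A hypothetical sketch (iterative low-rank imputation; not the authors' exact estimation procedure):

```python
import numpy as np

def pca_em_impute(X, k=1, iters=200):
    """Iteratively fill missing entries (NaN) of a matrix with a
    rank-k PCA reconstruction, an EM-style imputation sketch.
    Rows could be country-years, columns competing power measures."""
    X = X.copy()
    miss = np.isnan(X)
    # Initialize missing cells with column means
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.nonzero(miss)[1])
    for _ in range(iters):
        mu = X.mean(axis=0)
        Xc = X - mu
        # Rank-k reconstruction of the centered matrix
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        recon = (U[:, :k] * s[:k]) @ Vt[:k] + mu
        # Replace only the originally missing cells
        X[miss] = recon[miss]
    return X
```

Observed entries are never altered; only the missing cells are iteratively refined toward values consistent with the leading principal components.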


2014 ◽  
Vol 2014 ◽  
pp. 1-7
Author(s):  
Xuedong Chen ◽  
Qianying Zeng ◽  
Qiankun Song

An extension of some standard likelihood-based and variable selection criteria for linear regression models under the skew-normal distribution or the skew-t distribution is developed. This novel class of models provides a useful generalization of symmetrical linear regression models, since the random term distributions cover both symmetric as well as asymmetric and heavy-tailed distributions. A generalized expectation-maximization algorithm is developed for computing the l1-penalized estimator. Efficacy of the proposed methodology and algorithm is demonstrated on simulated data.
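An l1-penalized M-step typically reduces to coordinate-wise soft-thresholding. A generic sketch (plain lasso coordinate descent under an assumed standardized design; the paper's GEM algorithm operates under skew-normal/skew-t errors, which this illustration omits):

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator, the building block of l1-penalized
    updates: shrinks z toward zero by lam, setting small values to 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, iters=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1,
    a stand-in for a penalized M-step. Assumes each column of X
    satisfies x_j.T @ x_j / n == 1 (standardized design)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            # Partial residual excluding coordinate j
            r = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r / n, lam)
    return b
```

The thresholding is what performs variable selection: coefficients whose partial correlation falls below lam are set exactly to zero.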


2020 ◽  
Vol 18 (2) ◽  
Author(s):  
Ružica Škurla Babić ◽  
Maja Ozmec-Ban ◽  
Jasmin Bajić

Airline revenue management systems are used to calculate booking limits on each fare class to maximize expected revenue for all future flight departures. Their performance depends critically on the forecasting module, which uses historical data to project future demand. Those data are censored (constrained) by the imposed booking limits and do not represent true demand, since rejected requests are not recorded. Eight unconstraining methods that transform the censored data into more accurate estimates of actual historical demand, ranging from naive approaches such as discarding all censored observations to complex ones such as the Expectation Maximization algorithm and the Projection Detruncation algorithm, are analyzed and their accuracy is compared. The methods are evaluated and tested on simulated data sets generated by the ICE V2.0 software: first, data sets representing true demand were produced; then the aircraft capacity was reduced and EMSRb booking limits were calculated for every booking class. These limits constrained the original demand data at various points of the booking process, yielding the corresponding censored data sets. The unconstraining methods were applied to the censored observations, and the resulting unconstrained data were compared to the actual demand data to evaluate their performance.
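The Expectation Maximization unconstraining method can be sketched for normally distributed demand that is right-censored at the booking limit (an illustrative stand-in; the demand values and iteration count are assumptions):

```python
import math

def _phi(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def _Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def em_unconstrain(observed, censored_at, iters=100):
    """EM sketch for estimating normal demand (mu, sigma) when some
    observations are right-censored at a booking limit.
    observed: fully observed demands; censored_at: censoring thresholds
    (one per censored flight)."""
    data = observed + censored_at
    mu = sum(data) / len(data)
    sigma = max(1e-6, (sum((x - mu) ** 2 for x in data) / len(data)) ** 0.5)
    for _ in range(iters):
        s1, s2 = sum(observed), sum(x * x for x in observed)
        for c in censored_at:
            a = (c - mu) / sigma
            lam = _phi(a) / max(1e-12, 1 - _Phi(a))  # inverse Mills ratio
            m1 = mu + sigma * lam                    # E[X | X > c]
            v = sigma ** 2 * (1 + a * lam - lam ** 2)
            s1 += m1
            s2 += v + m1 ** 2                        # E[X^2 | X > c]
        n = len(observed) + len(censored_at)
        mu = s1 / n
        sigma = max(1e-6, (s2 / n - mu ** 2) ** 0.5)
    return mu, sigma
```

The E-step replaces each censored observation with its conditional mean above the booking limit, so the estimated mean demand ends up above the naive average of the recorded bookings, which is exactly the bias the unconstraining methods aim to remove.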


2019 ◽  
Author(s):  
Wilson McKerrow ◽  
David Fenyö

Motivation: LINE-1 elements are retrotransposons that are capable of copying their sequence to new genomic loci. LINE-1 derepression is associated with a number of disease states, and has the potential to cause significant cellular damage. Because LINE-1 elements are repetitive, it is difficult to quantify RNA at specific LINE-1 loci and to separate transcripts with protein coding capability from other sources of LINE-1 RNA.
Results: We provide a tool, L1-EM, that uses the expectation maximization algorithm to quantify LINE-1 RNA at each genomic locus, separating transcripts that are capable of generating retrotransposition from those that are not. We show the accuracy of L1-EM on simulated data and against long read sequencing from HEK cells.
Availability: L1-EM is written in python. The source code along with the necessary annotations is available at https://github.com/FenyoLab/L1EM.
Contact: [email protected], [email protected]
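The central EM computation, apportioning ambiguously mapped reads among candidate loci, can be sketched generically (a simplified stand-in: L1-EM's actual model distinguishes transcript categories and uses alignment likelihoods, which are omitted here):

```python
def em_read_assignment(alignments, n_loci, iters=50):
    """EM sketch for apportioning multi-mapped reads among loci.
    alignments: list of lists of candidate locus indices per read.
    Returns estimated relative abundance per locus."""
    theta = [1.0 / n_loci] * n_loci  # uniform initial abundances
    for _ in range(iters):
        counts = [0.0] * n_loci
        for loci in alignments:
            # E-step: posterior that the read came from each candidate locus
            w = [theta[l] for l in loci]
            tot = sum(w)
            for l, wl in zip(loci, w):
                counts[l] += wl / tot
        # M-step: abundances proportional to expected read counts
        total = sum(counts)
        theta = [c / total for c in counts]
    return theta
```

Uniquely mapping reads anchor the abundance estimates, and the EM iterations then distribute the repetitive (multi-mapped) reads in proportion to those estimates, which is what makes locus-specific quantification of repetitive elements possible.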


2001 ◽  
Vol 13 (11) ◽  
pp. 2517-2532 ◽  
Author(s):  
Mark Girolami

An expectation-maximization algorithm for learning sparse and overcomplete data representations is presented. The proposed algorithm exploits a variational approximation to a range of heavy-tailed distributions whose limit is the Laplacian. A rigorous lower bound on the sparse prior distribution is derived, which enables the analytic marginalization of a lower bound on the data likelihood. This lower bound enables the development of an expectation-maximization algorithm for learning the overcomplete basis vectors and inferring the most probable basis coefficients.
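The kind of variational bound involved can be illustrated for the Laplacian itself: by the AM-GM inequality, x^2/(2ξ) + ξ/2 >= |x|, so a Gaussian-form function lower-bounds exp(-|x|), with equality at ξ = |x|. A small numerical check (an illustration of this type of bound, not necessarily the paper's exact construction):

```python
import numpy as np

def laplace_gauss_bound(x, xi):
    """Gaussian-form lower bound on exp(-|x|):
    exp(-x^2/(2*xi) - xi/2) <= exp(-|x|), tight at xi = |x|.
    xi > 0 is the variational parameter."""
    return np.exp(-x ** 2 / (2 * xi) - xi / 2)
```

Because the bound is Gaussian in x, it can be marginalized analytically, which is what enables the closed-form E-step alluded to in the abstract.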


Symmetry ◽  
2019 ◽  
Vol 11 (9) ◽  
pp. 1150 ◽  
Author(s):  
Neveka M. Olmos ◽  
Osvaldo Venegas ◽  
Yolanda M. Gómez ◽  
Yuri A. Iriarte

In this paper we introduce a new distribution constructed on the basis of the quotient of two independent random variables whose distributions are the half-normal distribution and a power of the exponential distribution with parameter 2, respectively. The result is a distribution with greater kurtosis than the well-known half-normal and slashed half-normal distributions. We study the general density function of this distribution, some of its properties, its moments, and its coefficients of asymmetry and kurtosis. We develop the expectation-maximization algorithm and present a simulation study. We calculate the moment and maximum likelihood estimators and present three illustrations on real data sets to show the flexibility of the new model.
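The heavier tail produced by dividing a half-normal variable by a power of an exponential can be checked by Monte Carlo. A sketch under assumed notation (Z = |X| / E^(1/q) with X ~ N(0,1) and E exponential with rate 2; the paper's exact construction may differ, and q = 5 is a hypothetical choice):

```python
import numpy as np

def sample_halfnormal_quotient(n, q=5.0, rate=2.0, seed=0):
    """Monte Carlo sketch of a heavy-tailed quotient variable:
    Z = |X| / E**(1/q), X ~ N(0,1), E ~ Exp(rate).
    Returns (half-normal samples, quotient samples) for comparison."""
    rng = np.random.default_rng(seed)
    h = np.abs(rng.standard_normal(n))             # half-normal numerator
    e = rng.exponential(scale=1.0 / rate, size=n)  # exponential denominator base
    return h, h / e ** (1.0 / q)
```

Small values of E inflate the quotient, so Z places far more mass in the right tail than the plain half-normal, which is the kurtosis gain the abstract describes.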


1999 ◽  
Vol 11 (2) ◽  
pp. 443-482 ◽  
Author(s):  
Michael E. Tipping ◽  
Christopher M. Bishop

Principal component analysis (PCA) is one of the most popular techniques for processing, compressing, and visualizing data, although its effectiveness is limited by its global linearity. While nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data complexity by a combination of local linear PCA projections. However, conventional PCA does not correspond to a probability density, and so there is no unique way to combine PCA models. Therefore, previous attempts to formulate mixture models for PCA have been ad hoc to some extent. In this article, PCA is formulated within a maximum likelihood framework, based on a specific form of Gaussian latent variable model. This leads to a well-defined mixture model for probabilistic principal component analyzers, whose parameters can be determined using an expectation-maximization algorithm. We discuss the advantages of this model in the context of clustering, density modeling, and local dimensionality reduction, and we demonstrate its application to image compression and handwritten digit recognition.
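The EM updates for a single probabilistic PCA model can be sketched as follows (one analyzer only; the article's main contribution, the mixture of such analyzers, adds responsibilities on top of this recursion):

```python
import numpy as np

def ppca_em(Y, k, iters=100, seed=0):
    """EM for a single probabilistic PCA model: Y = W z + noise,
    z ~ N(0, I_k), noise ~ N(0, s2 * I_d).
    Y: (n, d) data; returns loading matrix W (d, k) and noise variance s2."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / n                     # sample covariance
    W = rng.standard_normal((d, k))
    s2 = 1.0
    for _ in range(iters):
        M = W.T @ W + s2 * np.eye(k)      # posterior precision of z
        Minv = np.linalg.inv(M)
        SW = S @ W
        # Combined E/M update for the loadings
        W_new = SW @ np.linalg.inv(s2 * np.eye(k) + Minv @ W.T @ SW)
        # Update the isotropic noise variance
        s2 = np.trace(S - SW @ Minv @ W_new.T) / d
        W = W_new
    return W, s2
```

Because the model defines a proper density, the per-component likelihoods can be weighted and summed, which is precisely what makes the well-defined mixture of analyzers possible.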


2017 ◽  
Vol 5 (1) ◽  
pp. 1
Author(s):  
Rika Yuliwulandari ◽  
Katsushi Tokunaga

Human leukocyte antigens (HLAs) play important roles in the immune system's response to various pathogens and diseases among individuals. The aim of this study was to analyze the HLA allele and haplotype frequencies of Southern East Asian populations that show a high incidence of nasopharyngeal carcinoma (NPC), to evaluate the contribution of shared HLA haplotypes to NPC susceptibility among these populations, and to analyze the genetic affinities between the populations. We collected HLA haplotype information from our previous study, other published papers, and HLA databases for 19 populations over the period 2005 to 2015. Haplotype frequencies were estimated using the maximum likelihood method based on an expectation maximization algorithm with ARLEQUIN v.2.0 software. We also calculated the genetic distance among the 19 Southern East Asian populations based on HLA allele frequencies using the modified Cavalli-Sforza (DA) distance method. A phylogenetic tree was then constructed with the neighbor-joining (NJ) method using DISPAN software, and principal component analysis (PCA) was performed using XLSTAT-PRO software. The A33-B58-DR3 haplotype, tightly linked to NPC, was commonly observed in all populations, consistent with the high incidence of NPC in these populations. In addition, the A2-B46 haplotype, also associated with NPC, was commonly found in several populations and may likewise play a role in disease development. We conclude that HLA haplotype sharing plays a more important role than HLA allele sharing. The A33-B58-DR3 and A2-B46-DR9 haplotypes identified in this study could be related to NPC in the Southern East Asian populations. The observed haplotypes need to be tested in real patients to confirm this conclusion.
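The expectation-maximization step used to estimate haplotype frequencies from phase-ambiguous genotypes can be sketched for the simplest two-locus case (a generic illustration with hypothetical counts, not ARLEQUIN's implementation):

```python
def haplotype_freq_em(counts, n_double_het, iters=200):
    """EM sketch for two-locus haplotype frequencies.
    counts: chromosome counts for haplotypes 'AB','Ab','aB','ab'
    resolved from unambiguous genotypes; n_double_het: number of
    double-heterozygote individuals, each carrying either AB/ab or
    Ab/aB with unknown phase."""
    haps = ['AB', 'Ab', 'aB', 'ab']
    total = sum(counts.values()) + 2 * n_double_het
    p = {h: 0.25 for h in haps}  # uniform start
    for _ in range(iters):
        # E-step: probability a double het carries AB/ab rather than Ab/aB
        w_cis = p['AB'] * p['ab']
        w_trans = p['Ab'] * p['aB']
        r = w_cis / (w_cis + w_trans)
        # M-step: expected haplotype counts -> frequencies
        e = dict(counts)
        e['AB'] += r * n_double_het
        e['ab'] += r * n_double_het
        e['Ab'] += (1 - r) * n_double_het
        e['aB'] += (1 - r) * n_double_het
        p = {h: e[h] / total for h in haps}
    return p
```

The unambiguous genotypes anchor the estimates, and the EM iterations resolve the double heterozygotes toward the phase that the current frequencies make more likely, which is how haplotype frequencies such as A33-B58-DR3 are estimated from unphased HLA typings.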

