Data Clustering Using a Model Granular Magnet

1997 ◽  
Vol 9 (8) ◽  
pp. 1805-1842 ◽  
Author(s):  
Marcelo Blatt ◽  
Shai Wiseman ◽  
Eytan Domany

We present a new approach to clustering, based on the physical properties of an inhomogeneous ferromagnet. No assumption is made regarding the underlying distribution of the data. We assign a Potts spin to each data point and introduce an interaction between neighboring points, whose strength is a decreasing function of the distance between the neighbors. This magnetic system exhibits three phases. At very low temperatures, it is completely ordered; all spins are aligned. At very high temperatures, the system does not exhibit any ordering, and in an intermediate regime, clusters of relatively strongly coupled spins become ordered, whereas different clusters remain uncorrelated. This intermediate phase is identified by a jump in the order parameters. The spin-spin correlation function is used to partition the spins and the corresponding data points into clusters. We demonstrate on three synthetic and three real data sets how the method works. Detailed comparison to the performance of other techniques clearly indicates the relative success of our method.
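For readers who want to experiment with the idea, below is a minimal Python sketch: Potts spins on a k-nearest-neighbour graph with distance-decaying couplings, plain Metropolis updates at a fixed temperature, and correlation-thresholded linking. The paper itself uses Swendsen–Wang cluster updates and a procedure for locating the superparamagnetic temperature regime; the update rule, the parameter values, and the helper name `superparamagnetic_clusters` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def superparamagnetic_clusters(X, q=20, k=10, T=0.1, sweeps=200, corr_thresh=0.5, seed=0):
    """Toy sketch: Potts-spin clustering via Metropolis updates on a k-NN graph."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist, idx = cKDTree(X).query(X, k=k + 1)
    dist, idx = dist[:, 1:], idx[:, 1:]              # drop the self-neighbour column
    a = dist.mean()                                   # characteristic neighbour distance
    J = np.exp(-dist**2 / (2 * a**2))                 # coupling decays with distance
    spins = rng.integers(0, q, size=n)
    corr = np.zeros_like(J)                           # running <delta(s_i, s_j)> per edge
    for sweep in range(sweeps):
        for i in rng.permutation(n):
            old, new = spins[i], rng.integers(0, q)
            # energy change of flipping spin i, H = -sum_ij J_ij delta(s_i, s_j)
            dE = np.sum(J[i] * ((spins[idx[i]] == old).astype(float)
                                - (spins[idx[i]] == new)))
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                spins[i] = new
        if sweep >= sweeps // 2:                      # accumulate correlations after burn-in
            corr += (spins[:, None] == spins[idx])
    corr /= sweeps - sweeps // 2
    # link points whose spin-spin correlation exceeds the threshold, then take components
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for jj, j in enumerate(idx[i]):
            if corr[i, jj] > corr_thresh:
                parent[find(i)] = find(j)
    return np.array([find(i) for i in range(n)])      # cluster label = component root
```

The temperature T must sit in the intermediate (superparamagnetic) regime for the correlations to separate clusters; in this sketch it is simply a user-supplied parameter.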

2019 ◽  
Vol 8 (2) ◽  
pp. 159
Author(s):  
Morteza Marzjarani

Heteroscedasticity plays an important role in data analysis. In this article, this issue, along with a few different approaches for handling heteroscedasticity, is presented. First, an iterative weighted least squares (IRLS) procedure and an iterative feasible generalized least squares (IFGLS) procedure are deployed, and proper weights for reducing heteroscedasticity are determined. Next, a new approach for handling heteroscedasticity is introduced. In this approach, after fitting a multiple linear regression (MLR) model or a general linear model (GLM) to a sufficiently large data set, the data are divided into two parts through inspection of the residuals, based on the results of testing for heteroscedasticity or via simulations. The first part contains the records where the absolute values of the residuals can be assumed small enough that heteroscedasticity is ignorable. Under this assumption, the error variances are small and close to those of their neighboring points; such error variances can be assumed known (but not necessarily equal). The second, remaining portion of the data is categorized as heteroscedastic. Using real data sets, it is shown that this approach reduces the number of unusual (e.g., influential) data points suggested for further inspection and, more importantly, lowers the root mean square error (RMSE), resulting in a more robust set of parameter estimates.
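A rough sketch of the splitting idea, assuming numeric NumPy arrays and using statsmodels for the regression and the Breusch–Pagan test; the quantile-based cutoff, the re-weighting of the heteroscedastic part, and the function name `split_by_residuals` are illustrative choices, not the article's exact procedure.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

def split_by_residuals(X, y, quantile=0.5):
    """Fit an MLR model, test for heteroscedasticity, and split the records into a
    near-homoscedastic part (small |residual|) and a heteroscedastic part."""
    Xc = sm.add_constant(X)
    ols = sm.OLS(y, Xc).fit()
    lm_stat, lm_pvalue, _, _ = het_breuschpagan(ols.resid, Xc)  # heteroscedasticity test
    abs_res = np.abs(ols.resid)
    cutoff = np.quantile(abs_res, quantile)        # hypothetical split rule
    small = abs_res <= cutoff                      # variances treated as (approximately) known
    large = ~small                                 # remaining records treated as heteroscedastic
    # Example second pass: weighted least squares on the heteroscedastic part,
    # with weights inversely proportional to the squared first-pass residuals.
    w = 1.0 / np.maximum(ols.resid[large] ** 2, 1e-8)
    wls = sm.WLS(y[large], Xc[large], weights=w).fit()
    return ols, wls, small, large, lm_pvalue
```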


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to a categorical data set. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to obtain the support of every row. Further, the data object having the largest support is chosen as the initial center, followed by finding the other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
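A small sketch of such support-based seeding, following the description above; the handling of ties, the use of Hamming distance for "greatest distance", and the function name `support_based_centers` are assumptions for illustration rather than the published algorithm.

```python
import numpy as np

def support_based_centers(data, k):
    """Score each row by the summed frequency (support) of its attribute values,
    pick the highest-support row first, then greedily pick the rows farthest
    (Hamming distance) from the already chosen centers."""
    n, m = data.shape
    support = np.zeros((n, m))
    for j in range(m):
        values, counts = np.unique(data[:, j], return_counts=True)
        freq = dict(zip(values, counts / n))          # support of each category value
        support[:, j] = [freq[v] for v in data[:, j]]
    row_support = support.sum(axis=1)                 # support of every row
    centers = [int(np.argmax(row_support))]
    while len(centers) < k:
        # Hamming distance of every row to its nearest chosen center
        d = np.min([np.sum(data != data[c], axis=1) for c in centers], axis=0)
        centers.append(int(np.argmax(d)))
    return data[centers]
```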


2018 ◽  
Vol 8 (2) ◽  
pp. 377-406
Author(s):  
Almog Lahav ◽  
Ronen Talmon ◽  
Yuval Kluger

Abstract A fundamental question in data analysis, machine learning, and signal processing is how to compare data points. The choice of distance metric is especially challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g., the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example, where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real gene-expression data for lung adenocarcinomas (lung cancer). Using the proposed metric, we found a partition of subjects into risk groups with good separation between their Kaplan–Meier survival plots.
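As a rough illustration of letting coordinate clusters inform a Mahalanobis metric, the sketch below clusters coordinates by correlation and keeps only within-cluster covariance blocks before inverting; this block-masking regularization and the parameter choices are assumptions for illustration, not the authors' construction.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_informed_mahalanobis(X, n_coord_clusters=10):
    """X: samples x coordinates. Cluster coordinates by |correlation|, zero out
    between-cluster covariance entries, and return a Mahalanobis-style metric."""
    C = np.corrcoef(X, rowvar=False)
    Z = linkage(squareform(1 - np.abs(C), checks=False), method='average')
    labels = fcluster(Z, t=n_coord_clusters, criterion='maxclust')
    cov = np.cov(X, rowvar=False)
    mask = labels[:, None] == labels[None, :]      # keep only within-cluster blocks
    inv_cov = np.linalg.pinv(cov * mask)           # pseudo-inverse of the masked covariance
    def mahalanobis(a, b):
        d = a - b
        return float(np.sqrt(d @ inv_cov @ d))
    return mahalanobis
```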


Author(s):  
B. Piltz ◽  
S. Bayer ◽  
A. M. Poznanska

In this paper we propose a new algorithm for digital terrain model (DTM) reconstruction from very high spatial resolution digital surface models (DSMs). It combines multi-directional filtering with a new metric, which we call normalized volume above ground, to create an above-ground mask containing buildings and elevated vegetation. This mask can be used to interpolate a ground-only DTM. The presented algorithm works fully automatically, requiring only the processing parameters minimum height and maximum width in metric units. Since slope and breaklines are not decisive criteria, low, smooth, and even very extensive flat objects are recognized and masked. The algorithm was developed with the goal of generating the normalized DSM for automatic 3D building reconstruction, and it works reliably even in environments with distinct hillsides or terrace-shaped terrain where conventional methods would fail. A quantitative comparison with the ISPRS data sets Potsdam and Vaihingen shows that 98-99% of all building data points are identified and can be removed, while enough ground data points (~66%) are kept to be able to reconstruct the ground surface. Additionally, we discuss the concept of size-dependent height thresholds and present an efficient scheme for pyramidal processing of data sets, reducing the time complexity to linear in the number of pixels, O(WH).
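To illustrate the role of the minimum height and maximum width parameters, here is a hedged sketch using directional grey openings from SciPy as a stand-in for the paper's multi-directional filtering; the normalized volume above ground metric is not reproduced, and `max_width_px` assumes the metric maximum width has already been converted to pixels using the DSM's ground sampling distance.

```python
import numpy as np
from scipy.ndimage import grey_opening

def above_ground_mask(dsm, max_width_px, min_height):
    """Directional grey openings with linear structuring elements approximate a ground
    surface under objects narrower than max_width_px; pixels rising more than
    min_height above that surface are flagged as above-ground (buildings, vegetation)."""
    w = max_width_px
    footprints = [
        np.ones((1, w), bool),                          # horizontal line
        np.ones((w, 1), bool),                          # vertical line
        np.eye(w, dtype=bool),                          # diagonal line
        np.fliplr(np.eye(w, dtype=bool)),               # anti-diagonal line
    ]
    # per pixel, keep the highest of the directional ground estimates so that
    # elongated terrain features preserved in at least one direction survive
    ground = np.max([grey_opening(dsm, footprint=f) for f in footprints], axis=0)
    mask = (dsm - ground) > min_height
    return mask, ground
```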


Geophysics ◽  
2020 ◽  
Vol 85 (2) ◽  
pp. V223-V232 ◽  
Author(s):  
Zhicheng Geng ◽  
Xinming Wu ◽  
Sergey Fomel ◽  
Yangkang Chen

The seislet transform uses the wavelet-lifting scheme and local slopes to analyze seismic data. In its definition, the design of prediction operators specifically for seismic images and data is an important issue. We have developed a new formulation of the seislet transform based on the relative time (RT) attribute. This method uses the RT volume to construct multiscale prediction operators. With the new prediction operators, the seislet transform is accelerated because distant traces can be predicted directly. We apply our method to synthetic and real data to demonstrate that the new approach reduces computational cost and obtains an excellent sparse representation on the test data sets.
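A toy sketch of the two ingredients, prediction along a relative-time attribute and a single lifting step; it assumes the RT attribute increases monotonically along each trace and uses a simple averaging update operator, so it illustrates the mechanism only, not the seislet transform's actual operators.

```python
import numpy as np

def rt_predict(trace, rt_src, rt_dst, t_axis):
    """Predict a target trace from a source trace by mapping amplitudes along
    constant relative-time (RT) surfaces: the amplitude found where the source RT
    equals tau is placed at the sample where the target RT equals tau."""
    # times at which the source trace attains the target's RT values
    src_time_of_tau = np.interp(rt_dst, rt_src, t_axis)   # assumes RT monotonic in time
    return np.interp(src_time_of_tau, t_axis, trace)

def lifting_step(even, odd, predict):
    """One wavelet-lifting step: detail = odd - P(even); coarse = even + U(detail)."""
    detail = odd - predict(even)
    coarse = even + 0.5 * detail          # simplest possible update operator
    return coarse, detail
```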


Author(s):  
Carlos A. P. Bengaly ◽  
Uendert Andrade ◽  
Jailson S. Alcaniz

Abstract We address the $\simeq 4.4\sigma$ tension between local and CMB measurements of the Hubble Constant using simulated Type Ia Supernova (SN) data sets. We probe its directional dependence by means of a hemispherical comparison across the entire celestial sphere as an estimator of the $H_0$ cosmic variance. We perform Monte Carlo simulations assuming isotropic and non-uniform distributions of data points, the latter coinciding with the real data. This allows us to incorporate observational features, such as sample incompleteness, in our estimation. We find that this tension can be alleviated to $3.4\sigma$ for isotropic realizations, and $2.7\sigma$ for non-uniform ones. We also find that the $H_0$ variance is largely reduced if the data sets are augmented to 4 and 10 times the current size. Future surveys will be able to tell whether the Hubble Constant tension happens due to unaccounted cosmic variance, or whether it is an actual indication of physics beyond the standard cosmological model.
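A toy sketch of a hemispherical comparison on a catalogue of supernova positions and distances (angles in radians, luminosity distances in Mpc); the per-object low-redshift estimator $H_0 \approx cz/d_L$ and the random-pole sampling are simplifying assumptions, not the analysis pipeline used in the paper.

```python
import numpy as np

def hemispherical_h0_scatter(ra, dec, z, dl, n_dirs=500, seed=1):
    """For random sky directions, estimate H0 separately from the SNe in each
    hemisphere (crude low-z Hubble law, H0 ~ c z / d_L) and record the
    north-south difference as a measure of directional scatter."""
    c = 299792.458                                     # speed of light, km/s
    rng = np.random.default_rng(seed)
    # unit vectors of the supernovae on the sky
    sn = np.column_stack([np.cos(dec) * np.cos(ra),
                          np.cos(dec) * np.sin(ra),
                          np.sin(dec)])
    h0_point = c * z / dl                              # per-SN H0 estimate, valid at low z
    deltas = []
    for _ in range(n_dirs):
        pole = rng.normal(size=3)
        pole /= np.linalg.norm(pole)                   # random hemisphere axis
        north = sn @ pole > 0
        deltas.append(h0_point[north].mean() - h0_point[~north].mean())
    deltas = np.asarray(deltas)
    return deltas.std(), np.max(np.abs(deltas))
```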


Author(s):  
Ibrahim Sule ◽  
Sani Ibrahim Doguwa ◽  
Audu Isah ◽  
Haruna Muhammad Jibril

Background: In the last few years, statisticians have introduced new generated families of univariate distributions. These new generators are obtained by adding one or more extra shape parameters to the underlying distribution to gain more flexibility in fitting data in different areas such as medical sciences, economics, finance and environmental sciences. The addition of parameter(s) has proven useful for exploring tail properties and for improving the goodness-of-fit of the family of distributions under study. Methods: A new three-parameter family of distributions was introduced using the idea of the T-X methodology. Some statistical properties of the new family were derived and studied. Results: A new Topp Leone Kumaraswamy-G family of distributions was introduced. Two special sub-models, namely the Topp Leone Kumaraswamy exponential distribution and the Topp Leone Kumaraswamy log-logistic distribution, were investigated. Two real data sets were used to assess the flexibility of the sub-models. Conclusion: The results suggest that the two sub-models perform better than their competitors.
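For context, the T-X methodology referenced above builds a new family from a generator density $r(t)$ and a link function $W(\cdot)$ applied to a baseline CDF $G(x)$. The display below records the general construction together with the Topp–Leone and Kumaraswamy CDFs that serve as ingredients; the paper's specific composition of these pieces is not reproduced here.

```latex
% General T-X construction: T has pdf r(t) on [a,b], and W maps [0,1] into [a,b]
F_{\mathrm{new}}(x) \;=\; \int_{a}^{W\!\left(G(x)\right)} r(t)\,\mathrm{d}t .

% Ingredient CDFs on (0,1):
F_{\mathrm{TL}}(t) \;=\; t^{\alpha}\,(2-t)^{\alpha}, \qquad
F_{\mathrm{Kum}}(t) \;=\; 1-\left(1-t^{a}\right)^{b}.
```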


2006 ◽  
Vol 18 (6) ◽  
pp. 765-771 ◽  
Author(s):  
Haruhisa Okuda ◽  
◽  
Yasuo Kitaaki ◽  
Manabu Hashimoto ◽  
Shun’ichi Kaneko ◽  
...  

This paper presents a novel, fast, and highly accurate 3-D registration algorithm. The ICP (Iterative Closest Point) algorithm converges all the 3-D data points of two data sets to the best-matching points with minimum evaluation values. The algorithm is in widespread use because it works well for many applications, but it incurs a heavy computational cost and is very sensitive to error, because it uses all the data points of the two data sets and least-mean-square optimization. We previously proposed the M-ICP algorithm, which uses M-estimation to add robustness against outlying gross noise to the original ICP algorithm. In this paper, we propose a novel algorithm called HM-ICP (Hierarchical M-ICP), an extension of M-ICP that selects regions for matching and searches the selected regions hierarchically. The method selects regions by evaluating the variance of distance values in the target region and by homogeneous topological mapping. Fundamental experiments using real data sets of 3-D measurements demonstrate the effectiveness of the proposed method, achieving a reduction in computational cost by a factor of more than ten thousand. We also confirmed an error of less than 0.1% of the measurement distance.
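The following is a compact sketch of an ICP loop with M-estimation (here Huber) weights, to make the robustification concrete; the hierarchical region selection and topological mapping of HM-ICP are not reproduced, and the Huber threshold and iteration count are placeholder assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def m_icp(source, target, iters=30, huber_delta=1.0):
    """ICP with Huber (M-estimation) weights: nearest-neighbour matching followed by
    a weighted rigid alignment, so gross outliers are down-weighted instead of
    dominating the least-squares fit."""
    src = source.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    tree = cKDTree(target)
    for _ in range(iters):
        dist, idx = tree.query(src)                  # closest-point correspondences
        matched = target[idx]
        # Huber weights: 1 for small residuals, delta/|r| for large ones
        w = np.where(dist <= huber_delta, 1.0, huber_delta / np.maximum(dist, 1e-12))
        mu_s = np.average(src, axis=0, weights=w)
        mu_t = np.average(matched, axis=0, weights=w)
        H = (w[:, None] * (src - mu_s)).T @ (matched - mu_t)   # weighted cross-covariance
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                     # guard against a reflection
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_t - R @ mu_s
        src = src @ R.T + t                          # apply the incremental transform
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```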


Geophysics ◽  
2016 ◽  
Vol 81 (6) ◽  
pp. D625-D641 ◽  
Author(s):  
Dario Grana

The estimation of rock and fluid properties from seismic attributes is an inverse problem. Rock-physics modeling provides physical relations that link elastic and petrophysical variables. Most of these models are nonlinear; therefore, the inversion generally requires complex iterative optimization algorithms to estimate the reservoir model of petrophysical properties. We have developed a new approach based on linearization of the rock-physics forward model using first-order Taylor series approximations. The mathematical method adopted for the inversion is the Bayesian approach previously applied successfully to linearized amplitude-variation-with-offset inversion. We developed the analytical formulation of the linearized rock-physics relations for three different models, namely empirical, granular-media, and inclusion models, and we derived the formulation of the Bayesian rock-physics inversion under Gaussian assumptions for the prior distribution of the model. The application of the inversion to real data sets delivers accurate results. The main advantage of this method is its small computational cost, due to the analytical solution afforded by the linearization and the Bayesian Gaussian approach.
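The analytical Gaussian update behind such linearized Bayesian inversions can be written in a few lines. In the sketch below, `G` stands for the Jacobian of a linearized rock-physics model (the first-order Taylor expansion itself is not derived here), and the prior and noise covariances are assumed to be given.

```python
import numpy as np

def bayesian_linear_inversion(d, G, m_prior, C_m, C_d):
    """Analytical Gaussian posterior for a linearized forward model d = G m + e,
    with prior m ~ N(m_prior, C_m) and noise e ~ N(0, C_d)."""
    K = C_m @ G.T @ np.linalg.inv(G @ C_m @ G.T + C_d)   # Kalman-style gain
    m_post = m_prior + K @ (d - G @ m_prior)             # posterior mean
    C_post = C_m - K @ G @ C_m                           # posterior covariance
    return m_post, C_post
```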


2021 ◽  
Vol 3 (1) ◽  
pp. 1-7
Author(s):  
Yadgar Sirwan Abdulrahman

Clustering is one of the essential strategies in data analysis. In classical solutions, all features are assumed to contribute equally to the data clustering. Of course, in real data sets some features are more important than others; as a result, essential features have a more significant impact on identifying optimal clusters than other features. In this article, a fuzzy clustering algorithm with local automatic feature weighting is presented. The proposed algorithm has several advantages: 1) the weights act on features locally, meaning that each cluster's weights differ from those of the other clusters; 2) the distance between samples is calculated using a non-Euclidean similarity criterion to reduce the effect of noise; 3) the feature weights are obtained adaptively during the learning process. In this study, mathematical analyses were carried out to obtain the cluster centers and the feature weights. Experiments were conducted on a range of data sets to demonstrate the proposed algorithm's efficiency compared with other algorithms that use global and local feature weighting.
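As a point of comparison, here is a minimal sketch of fuzzy c-means with per-cluster feature weights driven by an entropy-style update; the weighting rule, the Euclidean base distance, and the parameter `gamma` are illustrative assumptions and differ from the algorithm proposed in the article (which also uses a non-Euclidean similarity criterion).

```python
import numpy as np

def weighted_fcm(X, k, m=2.0, gamma=1.0, iters=100, seed=0):
    """Fuzzy c-means where each cluster keeps its own feature weights, updated from
    per-cluster feature dispersions via an exponential (entropy-regularized) rule."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    centers = X[rng.choice(n, k, replace=False)]
    weights = np.full((k, p), 1.0 / p)                      # start with uniform local weights
    for _ in range(iters):
        diff2 = (X[:, None, :] - centers[None]) ** 2        # (n, k, p) squared differences
        d2 = np.einsum('kj,ikj->ik', weights, diff2) + 1e-12 # weighted distances to centers
        u = 1.0 / np.sum((d2[:, :, None] / d2[:, None, :]) ** (1 / (m - 1)), axis=2)
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        # per-cluster feature dispersions drive the local weights; subtracting the
        # row minimum only prevents underflow and cancels after normalization
        disp = np.einsum('ik,ikj->kj', um, (X[:, None, :] - centers[None]) ** 2)
        weights = np.exp(-(disp - disp.min(axis=1, keepdims=True)) / gamma)
        weights /= weights.sum(axis=1, keepdims=True)
    return u, centers, weights
```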

