Fuzzy c-Means Clustering for Uncertain Data Using Quadratic Penalty-Vector Regularization

Author(s):
Yasunori Endo,
Yasushi Hasegawa,
Yukihiro Hamasuna,
Yuchi Kanzawa,
...

Clustering, an unsupervised data-analysis technique that transforms real-space information into points in a pattern space and analyzes them, may require that data be represented by sets rather than by points because of data uncertainty, e.g., measurement error margins, a range of data regarded as one point, or missing values. Such uncertainties have commonly been represented as interval ranges, for which many clustering algorithms have been constructed; however, the lack of guidelines for selecting among the available distances in individual cases has made the choice difficult and has raised the need for ways to calculate dissimilarity between uncertain data without introducing a nearest-neighbor or other particular distance. The tolerance concept we propose represents an uncertain datum as a point with a tolerance vector rather than as an interval. While this is convenient for handling uncertain data, the constraints on the tolerance vectors make the mathematical development difficult. We therefore remove the tolerance-vector constraints by using quadratic penalty-vector regularization, in which the penalty vector plays a role similar to that of the tolerance vector. Based on this formulation, we propose clustering algorithms for uncertain data that are derived as optimization problems and yield optimal solutions, so that uncertainty is handled appropriately.
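For concreteness, below is a minimal NumPy sketch of the alternating optimization that a fuzzy c-means objective with a quadratic penalty-vector term leads to. The assumed objective J = sum_k sum_i u_ik^m ||x_k + d_k - v_i||^2 + w sum_k ||d_k||^2, the scalar penalty weight w, and all names (fcm_penalty_vector, D for the matrix of penalty vectors) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fcm_penalty_vector(X, c, m=2.0, w=1.0, n_iter=100, seed=0):
    """Sketch of fuzzy c-means with a quadratic penalty-vector term (assumed form)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = X[rng.choice(n, c, replace=False)]      # initial cluster centers
    D = np.zeros((n, p))                        # penalty vectors, start at zero
    for _ in range(n_iter):
        Y = X + D                               # "corrected" data
        dist = np.linalg.norm(Y[:, None, :] - V[None, :, :], axis=2) + 1e-12
        # membership update: standard FCM formula applied to the corrected data
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)
        Um = U ** m
        # center update: weighted mean of the corrected data
        V = (Um.T @ Y) / Um.sum(axis=0)[:, None]
        # penalty-vector update: closed form from setting dJ/dd_k = 0
        s = Um.sum(axis=1, keepdims=True)
        D = (Um @ V - s * X) / (s + w)
    return U, V, D
```

With an objective of this form, each update (memberships, centers, penalty vectors) has a closed form, which is precisely the convenience that removing the tolerance-vector constraints is meant to provide.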

Author(s):
Yasunori Endo,
Arisa Taniguchi,
Yukihiro Hamasuna,
...

Clustering is an unsupervised classification technique for data analysis. In general, each datum in real space is transformed into a point in a pattern space before clustering methods are applied. Often, however, data cannot be represented by a point because of uncertainty, e.g., measurement error margins and missing values. In this paper, we introduce quadratic penalty-vector regularization to handle such uncertain data within Hard c-Means (HCM), one of the most typical clustering algorithms. We first propose a new clustering algorithm called hard c-means using quadratic penalty-vector regularization for uncertain data (HCMP). Second, we propose sequential extraction hard c-means using quadratic penalty-vector regularization (SHCMP) to handle datasets whose number of clusters is unknown. Finally, we verify the effectiveness of the proposed algorithms through numerical examples.
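A hard-assignment variant is even simpler. The sketch below illustrates what an HCMP-style alternating update could look like under the assumed objective sum_k ||x_k + d_k - v_a(k)||^2 + w sum_k ||d_k||^2, where a(k) is the hard assignment of datum k; the function name, the weight w, and the update rules are our assumptions, not the paper's exact algorithm. SHCMP would then apply such a step repeatedly, extracting one cluster at a time and removing its points, so the number of clusters need not be fixed in advance.

```python
import numpy as np

def hcmp(X, c, w=1.0, n_iter=50, seed=0):
    """Sketch of hard c-means with a quadratic penalty-vector term (assumed form)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = X[rng.choice(n, c, replace=False)]      # initial cluster centers
    D = np.zeros((n, p))                        # penalty vectors
    for _ in range(n_iter):
        Y = X + D                               # corrected data
        # hard assignment of each corrected datum to its nearest center
        a = np.argmin(((Y[:, None, :] - V[None, :, :]) ** 2).sum(axis=2), axis=1)
        # center update: mean of the corrected data assigned to each cluster
        V = np.array([Y[a == i].mean(axis=0) if np.any(a == i) else V[i]
                      for i in range(c)])
        # penalty-vector update: closed form, pulls x_k toward its assigned center
        D = (V[a] - X) / (1.0 + w)
    return a, V, D
```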


2021
Vol 10 (47)
pp. 81-92
Author(s):
Andrey V. Silin,
Olga N. Grinyuk,
Tatyana A. Lartseva,
Olga V. Aleksashina,
Tatiana S. Sukhova

This article discusses an approach to building a software package that implements cluster analysis methods. A number of cluster analysis tools for processing an initial data set are analyzed together with their software implementation, as well as the difficulties of applying cluster analysis in practice. Data are treated generally as factual material that supplies information for the problem under study and serves as the basis for discussion, analysis, and decision-making. Cluster analysis is a procedure that combines objects or variables into groups according to a given rule. The work groups multivariate data using proximity measures such as the sample correlation coefficient and its absolute value, the cosine of the angle between vectors, and the Euclidean distance. The authors propose methods for grouping by centers, by the nearest neighbor, and by selected standards. The results can be used by analysts when designing a data analysis structure and should improve the efficiency of clustering algorithms. The practical significance of the developed algorithms is demonstrated by a software package written in C++ in the VS environment.
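The package itself is written in C++; purely to make the listed proximity measures concrete, a short Python sketch of them follows (function names are ours, not the package's API).

```python
import numpy as np

def correlation(x, y):
    """Sample correlation coefficient between two feature vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def abs_correlation(x, y):
    """Absolute value (modulus) of the sample correlation coefficient."""
    return abs(correlation(x, y))

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def euclidean(x, y):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))
```

Grouping by the nearest neighbor then amounts to linking each object to the object (or selected standard) for which the chosen proximity measure is most favorable.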


Test
2021
Author(s):
Jan Pablo Burgard,
Joscha Krause,
Domingo Morales

The assessment of prevalence at regional levels is an important element of public health reporting. Since regional prevalence is rarely collected in registers, corresponding figures are often estimated via small area estimation using suitable health data. However, such data are frequently subject to uncertainty because the values have themselves been estimated from surveys. In that case, the method for prevalence estimation must explicitly account for data uncertainty to allow for reliable results. This can be achieved via measurement error models that place distributional assumptions on the noisy data. However, these methods usually require the errors in the target and explanatory variables to be independent, which does not hold when both have been estimated from the same survey, as is sometimes the case in official statistics. If this dependence is not accounted for, prevalence estimates can be severely biased. We propose a new measurement error model for regional prevalence estimation that is suitable for settings where the target and explanatory variable errors are dependent. We derive empirical best predictors and demonstrate mean-squared error estimation. A maximum likelihood approach for model parameter estimation is presented. Simulation experiments demonstrate the effectiveness of the method. An application to regional hypertension prevalence estimation in Germany is provided.
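The abstract does not reproduce the model itself; one generic way to write an area-level measurement error model in which the target and explanatory variable errors are dependent is sketched below, in our own notation (p_d is the true prevalence of region d, x_d its true covariates, Sigma_d the joint sampling-error covariance), not the paper's exact specification.

```latex
% Generic area-level sketch with dependent sampling errors (assumed notation):
\begin{align*}
  \hat{p}_d &= p_d + e_d            && \text{(survey estimate of the prevalence)}\\
  \hat{x}_d &= x_d + u_d            && \text{(survey estimate of the covariates)}\\
  p_d       &= x_d^{\top}\beta + v_d, \qquad v_d \sim \mathcal{N}(0,\sigma_v^2)\\
  \begin{pmatrix} e_d \\ u_d \end{pmatrix} &\sim \mathcal{N}(0,\ \Sigma_d)
\end{align*}
```

The point of departure from standard measurement error models is that Sigma_d is allowed to have non-zero off-diagonal blocks, so e_d and u_d are dependent, which is exactly what happens when target and covariates are estimated from the same survey.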


2015
pp. 125-138
Author(s):
I. V. Goncharenko

In this article we propose a new method of non-hierarchical cluster analysis based on the k-nearest-neighbor graph and discuss it with respect to vegetation classification. The k-nearest-neighbor (k-NN) classification method was originally developed in 1951 (Fix, Hodges, 1951); the term "k-NN graph" and several k-NN clustering algorithms appeared later (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an "excessive" graph, a so-called hypergraph, and then truncate it into subgraphs by partitioning and coarsening the hypergraph. We develop a different, "upward" strategy that assembles clusters sequentially, one after another. To date, graph-based cluster analysis has not been considered for the classification of vegetation datasets.
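As a rough illustration of the graph construction such methods start from, the sketch below builds a mutual k-NN graph with scikit-learn and SciPy and reads groups off its connected components; the article's "upward" one-cluster-at-a-time assembly is more elaborate and is not reproduced here.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

def knn_graph_clusters(X, k=5):
    """Build the directed k-NN graph, keep only mutual edges, and take
    connected components as a crude grouping (illustration only)."""
    A = kneighbors_graph(X, n_neighbors=k, mode='connectivity')
    mutual = A.multiply(A.T)           # edge i-j survives only if each is a k-NN of the other
    n_clusters, labels = connected_components(mutual, directed=False)
    return n_clusters, labels
```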


Mathematics
2021
Vol 9 (7)
pp. 786
Author(s):
Yenny Villuendas-Rey,
Eley Barroso-Cubas,
Oscar Camacho-Nieto,
Cornelio Yáñez-Márquez

Swarm intelligence has emerged as an active field for solving numerous machine-learning tasks. In this paper, we address the problem of clustering data with missing values, where the patterns are described by mixed (or hybrid) features. We introduce a generic modification to three swarm intelligence algorithms (Artificial Bee Colony, Firefly Algorithm, and Novel Bat Algorithm). We experimentally determine adequate parameter values for these three modified algorithms, with the purpose of applying them to the clustering task. We also provide an unbiased comparison among several metaheuristic-based clustering algorithms, concluding that the clusters obtained by our proposals are highly representative of the "natural structure" of the data.
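The abstract does not spell out the dissimilarity used for mixed features with missing values; a common choice in this literature is a HEOM-style (Heterogeneous Euclidean-Overlap Metric) measure, sketched below purely for illustration (the function heom and its arguments are ours, not the paper's).

```python
def heom(x, y, is_numeric, ranges):
    """HEOM-style dissimilarity for mixed features with missing values:
    missing entries contribute the maximal per-feature distance of 1,
    numeric features use range-normalized absolute difference, and
    categorical features use the 0/1 overlap distance."""
    d = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:                       # missing value
            dj = 1.0
        elif is_numeric[j]:                              # numeric feature
            dj = abs(a - b) / ranges[j] if ranges[j] > 0 else 0.0
        else:                                            # categorical feature
            dj = 0.0 if a == b else 1.0
        d += dj ** 2
    return d ** 0.5
```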


1997
Vol 08 (03)
pp. 301-315
Author(s):
Marcel J. Nijman,
Hilbert J. Kappen

A Radial Basis Boltzmann Machine (RBBM) is a specialized Boltzmann Machine architecture that combines a feed-forward mapping with probability estimation in the input space, and for which very efficient learning rules exist. The hidden representation of the network displays symmetry breaking as a function of the noise in the dynamics, so generalization can be studied as a function of the noise in the neuron dynamics rather than as a function of the number of hidden units. We show that the RBBM can be seen as an elegant alternative to k-nearest-neighbor classification, achieving comparable performance without the need to store all data, and that it has good classification performance compared to a multilayer perceptron (MLP). The main advantage of the RBBM is that, simultaneously with the input-output mapping, a model of the input space is obtained that can be used for learning with missing values. We derive learning rules for the case of incomplete data and show that they perform better on incomplete data than the traditional learning rules applied to a 'repaired' data set.
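As a loose illustration of the idea of combining a feed-forward mapping with a model of the input space, the sketch below uses an RBF hidden layer whose normalized activations behave like mixture responsibilities, plus a least-squares output layer. It is not the RBBM or its Boltzmann-machine learning rules; the class and parameter names are ours.

```python
import numpy as np

class RBFDensityClassifier:
    """Minimal sketch: RBF hidden units double as soft responsibilities over the
    input space, while a linear readout performs the input-output mapping."""

    def __init__(self, centers, beta=1.0):
        self.centers = np.asarray(centers)   # hidden-unit centers
        self.beta = beta                     # plays the role of an inverse noise level
        self.W = None                        # hidden-to-output weights

    def _hidden(self, X):
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        a = np.exp(-self.beta * d2)
        return a / a.sum(axis=1, keepdims=True)   # normalized responsibilities

    def fit(self, X, Y):
        H = self._hidden(X)
        self.W, *_ = np.linalg.lstsq(H, Y, rcond=None)  # least-squares output layer
        return self

    def predict(self, X):
        return self._hidden(X) @ self.W
```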


2021
Vol 25 (6)
pp. 1453-1471
Author(s):
Chunhua Tang,
Han Wang,
Zhiwen Wang,
Xiangkun Zeng,
Huaran Yan,
...

Most density-based clustering algorithms suffer from difficult parameter setting, high time complexity, poor noise recognition, and weak performance on datasets with uneven density. To address these problems, this paper proposes the FOP-OPTICS algorithm (Finding of the Ordering Peaks Based on OPTICS), a substantial improvement of OPTICS (Ordering Points To Identify the Clustering Structure). The proposed algorithm finds the demarcation point (DP) in the augmented cluster ordering generated by OPTICS and uses the reachability-distance of the DP as the neighborhood radius eps of its corresponding cluster, which overcomes the weakness of most algorithms on datasets with uneven densities. By computing the distance to the k-nearest neighbor of each point, it reduces the time complexity of OPTICS; by detecting density-mutation points within clusters, it can efficiently recognize noise. The experimental results show that FOP-OPTICS has the lowest time complexity among the compared algorithms and outperforms them in parameter setting and noise recognition.
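A minimal sketch of the reachability ordering that FOP-OPTICS operates on, using scikit-learn's OPTICS implementation, is given below; the demarcation-point detection shown is only a naive placeholder for the paper's actual rule, and the function name is ours.

```python
import numpy as np
from sklearn.cluster import OPTICS

def reachability_plot(X, min_samples=5):
    """Compute the OPTICS cluster ordering and its reachability values, the
    structure from which FOP-OPTICS derives its demarcation points."""
    opt = OPTICS(min_samples=min_samples).fit(X)
    order = opt.ordering_
    reach = opt.reachability_[order]
    # naive placeholder: points whose reachability exceeds both neighbors are
    # candidate peaks separating clusters in the ordering
    peaks = [i for i in range(1, len(reach) - 1)
             if reach[i] > reach[i - 1] and reach[i] > reach[i + 1]]
    return order, reach, peaks
```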

