Order Selection in Unsupervised Learning and Clustering for Arbitrary and Non-Arbitrary Shaped Data

2021
Author(s):
Mahdi Shahbaba

This thesis focuses on clustering for unsupervised learning. The first topic of interest is estimating the correct number of clusters (CNC). In conventional clustering approaches such as X-means, G-means, PG-means, and Dip-means, estimating the CNC is a preprocessing step performed before finding the centers and clusters. In other words, the first step estimates the CNC and the second step finds the clusters, with each step minimizing a different objective function. Here, we propose minimum averaged central error (MACE)-means clustering, which uses a single objective function to simultaneously estimate the CNC and provide the cluster centers. We show the superiority of MACE-means over the conventional methods in estimating the CNC, at comparable complexity. In addition, MACE-means yields better average values of the adjusted Rand index (ARI) and variation of information (VI). The second topic of interest is the order selection step of the conventional methods, which is usually a statistical test such as the Kolmogorov-Smirnov test, the Anderson-Darling test, or Hartigan's dip test. We propose a new statistical test, denoted Sigtest (signature testing). The conventional statistical tests rely on a particular assumption about the probability distribution of each cluster; Sigtest, on the other hand, can be applied under any prior distributional assumption on the clusters. By replacing the statistical tests of the aforementioned conventional approaches with Sigtest, we show that the clustering methods are improved in terms of CNC accuracy as well as ARI and VI. Conventional clustering approaches fail on arbitrarily shaped clusters, and the last contribution of the thesis addresses arbitrary-shaped clustering. The proposed method, denoted minimum Pathways in Arbitrary Shaped (minPAS) clustering, is based on a unique minimum spanning tree structure of the data.
Our simulation results show the advantage of minPAS over state-of-the-art arbitrary-shaped clustering methods such as DBSCAN and Affinity Propagation in terms of accuracy and the ARI and VI indexes.
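MACE-means itself is specified in the thesis; as a rough sketch of the single-objective idea it describes (scoring every candidate order K in one loop and keeping the centers from the winning run), the code below uses scikit-learn's k-means with silhouette width as a stand-in criterion. The data, the candidate range, and the silhouette criterion are illustrative assumptions, not the thesis's actual objective.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known number of clusters (3).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=0.5, random_state=0)

# Score every candidate order K with one internal criterion and keep
# the centers from the winning run, so the CNC and the centers come
# out of a single pass.
scores, centers = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    centers[k] = km.cluster_centers_

best_k = max(scores, key=scores.get)  # estimated CNC
best_centers = centers[best_k]        # centers from the same pass
```

The same loop structure accommodates any internal criterion; only the scoring line changes.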



2015
Vol 2015
pp. 1-8
Author(s):
Rupert Faltermeier,
Martin A. Proescholdt,
Sylvia Bele,
Alexander Brawanski

Although multimodal monitoring is the standard in daily neurocritical care practice, problem-oriented analysis tools to interpret the huge amount of data are lacking. Recently, a mathematical model was presented that simulates cerebral perfusion and oxygen supply after severe head trauma and predicts the appearance of distinct correlations between arterial blood pressure and intracranial pressure. In this study we present a set of mathematical tools that reliably detect the predicted correlations in data recorded at a neurocritical care unit. Time-resolved correlations are identified by a windowing technique combined with Fourier-based coherence calculations, and the phasing of the data is detected by means of the Hilbert phase difference within the same windows. A statistical testing method is introduced that allows the parameters of the windowing method to be tuned so that a predefined accuracy is reached. With this method the data of fifteen patients were examined, and the predicted correlation was found in every patient. Additionally, the occurrence of a distinct correlation parameter, called scp, proved to be a high-quality predictor of patient outcome.
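The windowing idea described above can be sketched with SciPy: Fourier-based coherence and Hilbert phase differences are computed window by window on synthetic stand-ins for the arterial blood pressure (ABP) and intracranial pressure (ICP) signals. The sampling rate, window length, and frequency band are illustrative assumptions, not the study's actual settings.

```python
import numpy as np
from scipy.signal import coherence, hilbert

fs = 100.0                        # sampling rate in Hz (assumed)
t = np.arange(0, 60, 1 / fs)      # one minute of synthetic data
rng = np.random.default_rng(0)
# Correlated sinusoids standing in for ABP and ICP slow waves.
abp = np.sin(2 * np.pi * 0.1 * t)
icp = np.sin(2 * np.pi * 0.1 * t + 0.5) + 0.1 * rng.normal(size=t.size)

win = int(20 * fs)                # 20 s analysis window (assumed)
results = []
for start in range(0, t.size - win + 1, win):
    a, b = abp[start:start + win], icp[start:start + win]
    # Fourier-based coherence inside the window.
    f, coh = coherence(a, b, fs=fs, nperseg=win // 2)
    band = (f >= 0.05) & (f <= 0.15)   # slow-wave band (assumed)
    # Hilbert phase difference inside the same window.
    dphi = np.angle(hilbert(a)) - np.angle(hilbert(b))
    results.append((coh[band].mean(), np.median(dphi)))
```

Each window yields one (coherence, phase) pair; a statistical test over such pairs would then tune the window parameters, as the study describes.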


2019
Vol 42 (4)
pp. 772-777
Author(s):
Steven L Senior

Background: The English Indices of Multiple Deprivation (IMD) are widely used as a measure of deprivation. However, similarly ranked areas can differ substantially in the underlying domains of deprivation. These domains contain a richer set of data that might be useful for classifying local authorities. Clustering methods offer a set of techniques to identify groups of areas with similar patterns of deprivation.

Methods: Hierarchical agglomerative (i.e. bottom-up) clustering methods were applied to domain scores for 152 upper-tier local authorities. Advances in statistical testing allow clusters to be identified that are unlikely to have arisen from random partitioning of a homogeneous group. The resulting clusters are described in terms of their subdomain scores and basic geographic and demographic characteristics.

Results: Five statistically significant clusters of local authorities were identified. These clusters only partially reflect different levels of overall deprivation. In particular, two clusters share similar overall IMD scores but have contrasting patterns of deprivation.

Conclusion: Hierarchical clustering methods identify five distinct clusters that do not correspond closely to quintiles of deprivation. This approach may help to distinguish between places that face similar underlying challenges, and places that appear similar in terms of overall deprivation scores but face different challenges.
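A minimal sketch of the bottom-up clustering step (without the significance testing) on synthetic stand-ins for the domain scores; the Ward linkage, the planted three-profile data, and the cut at three clusters are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore

rng = np.random.default_rng(1)
# Stand-in for local authorities x 7 IMD domain scores,
# with three planted deprivation "profiles" of 50 areas each.
profiles = rng.normal(size=(3, 7))
scores = np.vstack([p + 0.2 * rng.normal(size=(50, 7)) for p in profiles])

# Standardise domains, then agglomerate bottom-up (Ward linkage here,
# as one common choice; the paper's linkage may differ).
Z = linkage(zscore(scores, axis=0), method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
```

Cutting the dendrogram at different heights would yield the candidate partitions to which a significance test can then be applied.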


2020
Vol 2 (4)
pp. 513-528
Author(s):
Rossella Aversa,
Piero Coronica,
Cristiano De Nobili,
Stefano Cozzini

In this paper, we report upon our recent work aimed at improving and adapting machine learning algorithms to automatically classify nanoscience images acquired by the Scanning Electron Microscope (SEM). This is done by coupling supervised and unsupervised learning approaches. We first investigate supervised learning on a ten-category data set of images and compare the performance of the different models in terms of training accuracy. Then, we reduce the dimensionality of the features through autoencoders to perform unsupervised learning on a subset of images in a selected range of scales (from 1 μm to 2 μm). Finally, we compare different clustering methods to uncover intrinsic structures in the images.
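A hedged sketch of the reduce-then-cluster pipeline: PCA stands in for the paper's autoencoder, scikit-learn's bundled digit images stand in for the ten-category SEM data, and two clustering methods are compared on the reduced features with silhouette width.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Small labelled image set standing in for the SEM data.
X = load_digits().data

# Reduce dimensionality (the paper trains autoencoders; PCA here).
Z = PCA(n_components=16, random_state=0).fit_transform(X)

# Compare clustering methods on the reduced features.
results = {}
for name, model in [
    ("k-means", KMeans(n_clusters=10, n_init=10, random_state=0)),
    ("agglomerative", AgglomerativeClustering(n_clusters=10)),
]:
    labels = model.fit_predict(Z)
    results[name] = silhouette_score(Z, labels)
```

Swapping PCA for an autoencoder's bottleneck features leaves the comparison loop unchanged, which is the structure the paper exploits.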


Author(s):
Tadafumi Kondo,
Yuchi Kanzawa

This paper presents two fuzzy clustering algorithms for categorical multivariate data based on the q-divergence. First, this study shows that one conventional method for vectorial data can be explained as regularizing another conventional method using the q-divergence. Second, building on the facts that the q-divergence generalizes the Kullback-Leibler (KL) divergence and that two conventional fuzzy clustering methods for categorical multivariate data adopt the KL-divergence, two fuzzy clustering algorithms for categorical multivariate data based on the q-divergence are derived from optimization problems built by extending the KL-divergence in these conventional methods to the q-divergence. In numerical experiments on real datasets, the proposed methods outperform the conventional methods in terms of clustering accuracy.
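The generalization the paper builds on can be stated in one common form of the q-divergence (the Tsallis relative entropy); the paper's exact normalization may differ:

```latex
% q-divergence (Tsallis relative entropy), one standard form:
D_q(p \,\|\, r) = \frac{1}{q-1}\left( \sum_i p_i^{\,q}\, r_i^{\,1-q} - 1 \right),
\qquad q > 0,\ q \neq 1,
% which recovers the KL-divergence in the limit q -> 1:
\lim_{q \to 1} D_q(p \,\|\, r) = \sum_i p_i \log \frac{p_i}{r_i}.
```

Setting q to 1 in the derived optimization problems therefore recovers the conventional KL-divergence-based methods as a special case.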


10.12737/7483
2014
Vol 8 (7)
pp. 0-0
Author(s):
Oleg Sdvizhkov

Cluster analysis [3] is a relatively young branch of mathematics that studies methods for partitioning a set of objects, described by a finite set of attributes, into homogeneous groups (clusters). Cluster analysis is widely used in psychology, sociology, economics (market segmentation), and many other areas that face the problem of classifying objects according to their characteristics. Clustering methods are implemented in the packages STATISTICA [1] and SPSS [2]; they return the partition into clusters, clustering and dispersion statistics, and dendrograms for hierarchical clustering algorithms. MS Excel macros for the main clustering methods, with application examples, are given in the monograph [5]. One of the central problems of cluster analysis is defining a criterion for the number of clusters, denoted K, into which a given set of objects should be separated. There are several dozen approaches [4] to determining K. In particular, according to [6], K is the minimum number satisfying a condition that relates the minimum value of the total dispersion for a partition into K clusters to N, the number of objects. The number of clusters can also be obtained automatically by the sequential application of anomalous-cluster extraction [4]. In 2010, a method for obtaining K by applying a density function was proposed and experimentally validated [4]. This article offers two simple approaches to determining K for the case where each cluster has at least two objects: in the first, K is determined through shortest Hamiltonian cycles; in the second, through the minimum spanning tree. Clustering examples with detailed step-by-step solutions and graphic illustrations are provided, and the use of a VBA Excel macro that returns the minimum spanning tree is demonstrated on clustering problems. The article contains the macro code, with commentary on its main unit.
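The minimum-spanning-tree route to K mentioned above can be sketched in Python rather than VBA: build the MST over pairwise distances, flag edges much longer than the typical tree edge, and take K as one more than the number of flagged edges. The three-blob data and the mean-plus-three-sigma edge rule are illustrative assumptions, not the article's exact criterion.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import make_blobs

# Three planted clusters of 30 objects each.
X, _ = make_blobs(n_samples=90, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=0.4, random_state=2)

# Minimum spanning tree over pairwise Euclidean distances.
D = squareform(pdist(X))
mst = minimum_spanning_tree(D).toarray()

# Flag edges much longer than the typical tree edge; each such edge
# separates two groups, so K = flagged edges + 1.
edges = mst[mst > 0]
cut = mst > edges.mean() + 3 * edges.std()
K = int(cut.sum()) + 1

# Removing the flagged edges yields the clusters themselves.
pruned = np.where(cut, 0.0, mst)
n_comp, labels = connected_components(pruned, directed=False)
```

The same tree also gives the partition for free: the connected components left after cutting the long edges are the clusters.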


2018
Vol 10 (02)
pp. 1840007
Author(s):
Alice Rueda,
Sridhar Krishnan

This study focuses on the possibility of remote monitoring and screening of Parkinson's and age-related voice impairment for the general public, using data self-recorded on readily available or emerging technologies such as smartphones and IoT devices. While most studies use professionally recorded voice in a controlled environment, this study uses sustained vowel /a/ recordings self-recorded on an iPhone. The healthy control (HC) and people with Parkinson's (PWP) groups each contain 57 age-matched, mixed-gender subjects; the control subjects may have age-related voice impairment. Without severity labels, features extracted from the recordings were grouped by voice similarity using unsupervised learning with various clustering methods. The optimal number of clusters, K, was estimated using direct and statistical methods; the estimated K does not agree with the defined Unified Parkinson's Disease Rating Scale-Speech (UPDRS-3.1) scales. Using this K, five hierarchical methods and one partition-based method were applied for comparison and cross-checking. The hierarchical methods are Hierarchical Clustering (HCluster), Hierarchical K-Means (HKMeans), Agglomerative Nesting (AGNES), Divisive Analysis (DIANA), and the neural-network-based Self-Organizing Tree Algorithm (SOTA); the partition-based method is Clustering Large Applications (CLARA). Three internal validation indices (connectivity, the Dunn index, and silhouette width) were used to measure the compactness of the clusters and their separation. Ordered from best to worst, the validation results are AGNES, HCluster, DIANA, HKMeans, CLARA, and SOTA. A majority vote was applied to the results from AGNES, HCluster, and DIANA to obtain the final grouping. Five groups were defined, representing outliers, severely impaired voice, minor impairment, healthier voice, and subjects that could not be grouped. All methods except SOTA identified the same two outliers.
The clustering and voting successfully identified the 2 outliers, 5 more severely impaired voices, 82 voices with minor impairment, and 22 healthier voices; only 3 subjects could not be grouped. Feature extraction reduced the data size by a factor of 518, so it is feasible to reduce the data size for transmission and perform the unsupervised learning at the receiving end for remote monitoring and screening.
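A small sketch of the internal-validation step: agglomerative clustering with different linkages stands in for the AGNES/HCluster/DIANA variants, and silhouette width (one of the study's three indices) ranks them. The synthetic 114-sample feature matrix is an assumption standing in for the extracted voice features.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# 114 synthetic subjects standing in for the extracted voice features.
X, _ = make_blobs(n_samples=114, centers=5, cluster_std=1.0,
                  random_state=3)

# Different linkages stand in for the hierarchical variants; silhouette
# width is one of the three internal indices used for validation.
widths = {}
for linkage in ("ward", "average", "complete"):
    labels = AgglomerativeClustering(n_clusters=5,
                                     linkage=linkage).fit_predict(X)
    widths[linkage] = silhouette_score(X, labels)

best = max(widths, key=widths.get)  # best-validated method
```

A majority vote over the labelings of the top-ranked methods would then produce the final grouping, as in the study.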


2014
Vol 971-973
pp. 1565-1568
Author(s):
Zhi Yong Wang

Facing the limitations of current clustering methods and the particularities of spatial clustering, this paper starts from the objective function of concept clustering and relies on GIS spatial data management and spatial analysis for technical support, considering both the directly computed distance between samples and the indirect cost of reaching one sample from another. K samples are randomly selected as cluster centers, and the samples are partitioned according to their reaching distance to each cluster center; the sum of the costs for the samples to reach their cluster centers serves as the clustering objective function. A genetic algorithm is introduced to optimize this objective, yielding a GIS-based spatial clustering algorithm. Finally, the algorithm is tested on examples.
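The reach-cost objective optimized by evolutionary search can be sketched as follows, with plain Euclidean distance standing in for the GIS reach cost and a mutation-only evolutionary loop standing in for the paper's genetic algorithm; all data and parameters are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(4)
# Synthetic spatial samples: three groups of 40 points each.
X = np.vstack([rng.normal(c, 0.3, size=(40, 2))
               for c in [(0, 0), (4, 0), (2, 4)]])
D = cdist(X, X)                    # Euclidean stand-in for reach cost
K, pop_size, gens = 3, 30, 60

def cost(medoids):
    # Sum of each sample's cost to reach its nearest cluster centre.
    return D[:, medoids].min(axis=1).sum()

# Population of candidate centre (medoid) index sets.
pop = [rng.choice(len(X), K, replace=False) for _ in range(pop_size)]
for _ in range(gens):
    pop.sort(key=cost)                       # elitist selection
    survivors = pop[: pop_size // 2]
    children = []
    for s in survivors:
        child = s.copy()
        child[rng.integers(K)] = rng.integers(len(X))  # point mutation
        children.append(child if len(set(child)) == K else s.copy())
    pop = survivors + children

best = min(pop, key=cost)
labels = D[:, best].argmin(axis=1)   # partition by nearest centre
```

A full genetic algorithm would add crossover between candidate index sets, but the objective and the partition rule are the same.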


2018
Vol 7 (4.6)
pp. 214
Author(s):
K. Nikhila,
P. Manvitha

Clustering in unsupervised learning deals with instances that are not yet classified and carry no class attribute. The aim of applying clustering algorithms is to find useful patterns among items of unknown classes: instances are automatically organized into meaningful groups based on their similarity. In this paper we study basic clustering methods for unsupervised learning in data mining, such as distributed clustering ensembles and their algorithms.
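One concrete ensemble (distributed) clustering scheme of the kind surveyed here is evidence accumulation: several k-means runs vote through a co-association matrix, and a consensus partition is read off with hierarchical clustering. The data, run count, and average linkage below are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 0], [2.5, 4]],
                  cluster_std=0.6, random_state=5, shuffle=False)
n = len(X)

# Evidence accumulation: run k-means several times and count how often
# each pair of instances lands in the same cluster.
runs = 20
co = np.zeros((n, n))
for seed in range(runs):
    labels = KMeans(n_clusters=3, n_init=1,
                    random_state=seed).fit_predict(X)
    co += labels[:, None] == labels[None, :]
co /= runs

# Consensus partition: hierarchical clustering on the co-association
# distances (1 - co-association frequency).
Z = linkage(squareform(1.0 - co, checks=False), method="average")
consensus = fcluster(Z, t=3, criterion="maxclust")
```

In a distributed setting, each co-association matrix can be computed on a different node and the matrices summed before the consensus step.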

