The Impact of Random Models on Clustering Similarity

2017 ◽  
Author(s):  
Alexander J. Gates ◽  
Yong-Yeol Ahn

Abstract: Clustering is a central approach for unsupervised learning. After clustering is applied, the most fundamental analysis is to quantitatively compare clusterings. Such comparisons are crucial for the evaluation of clustering methods as well as other tasks such as consensus clustering. It is often argued that, in order to establish a baseline, clustering similarity should be assessed in the context of a random ensemble of clusterings. The prevailing assumption for the random clustering ensemble is the permutation model, in which the number and sizes of clusters are fixed. However, this assumption does not necessarily hold in practice; for example, multiple runs of K-means clustering return clusterings with a fixed number of clusters, while the cluster size distribution varies greatly. Here, we derive corrected variants of two clustering similarity measures (the Rand index and Mutual Information) in the context of two random clustering ensembles in which the number and sizes of clusters vary. In addition, we study the impact of one-sided comparisons in the scenario with a reference clustering. The consequences of different random models are illustrated using synthetic examples, handwriting recognition, and gene expression data. We demonstrate that the choice of random model can have a drastic impact on the ranking of similar clustering pairs and on the evaluation of a clustering method with respect to a random baseline; thus, the choice of random clustering model should be carefully justified.
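As a rough illustration of what a chance correction looks like, the sketch below computes the plain Rand index and its standard adjustment under the permutation model (fixed number and sizes of clusters), which is the baseline the abstract calls the prevailing assumption. It does not reproduce the corrected variants the authors derive for the other random models; the function names are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import comb


def _pair_counts(a, b):
    """Return (total pairs, pairs co-clustered in both, in a, in b)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two label lists of equal length >= 2")
    pairs = comb(len(a), 2)
    # Contingency table: how many elements carry each (label_a, label_b) pair.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    return pairs, sum_ij, sum_a, sum_b


def rand_index(a, b):
    """Fraction of element pairs on which the two clusterings agree."""
    pairs, sum_ij, sum_a, sum_b = _pair_counts(a, b)
    # Agreements = pairs together in both + pairs apart in both.
    return (pairs + 2 * sum_ij - sum_a - sum_b) / pairs


def adjusted_rand_index_perm(a, b):
    """Rand index corrected for chance under the permutation model."""
    pairs, sum_ij, sum_a, sum_b = _pair_counts(a, b)
    expected = sum_a * sum_b / pairs   # expected co-clustered pairs
    maximum = (sum_a + sum_b) / 2      # upper bound used for normalization
    if maximum == expected:            # degenerate case, e.g. all singletons
        return 1.0
    return (sum_ij - expected) / (maximum - expected)
```

Under this correction a score near 0 means the agreement is about what random clusterings with the same cluster sizes would produce. A different random model changes the `expected` term, which is why the choice of model can change the ranking of clustering pairs.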


2018 ◽  
Author(s):  
Alexander J. Gates ◽  
Yong-Yeol Ahn

Summary: Quantifying the similarity of clusterings is a fundamental step in data analysis. Clustering similarity is the basis for method evaluation, consensus clustering, and tracking the temporal evolution of clusters, among many other tasks. Here we provide CluSim, a comprehensive Python package for the comparison of partitions, overlapping clusterings, and hierarchical clusterings (dendrograms) with more than 20 similarity measures. The CluSim package provides both analytic and empirical methods for assessing the similarity of clusterings in the context of a random model, and provides the novel element-centric approaches to clustering similarity that we introduced recently. We illustrate the use of the package through two examples: an evaluation of the clustering of gene expression data in the context of different random models, and a detailed analysis of model incongruence using element-centric comparisons between a set of phylogenetic trees (dendrograms).
Availability and implementation: The CluSim Python package and accompanying Jupyter notebook are available at https://github.com/Hoosier-Clusters/clusim under the MIT open source license.



2021 ◽  
Vol 7 ◽  
pp. e450
Author(s):  
Wenna Huang ◽  
Yong Peng ◽  
Yuan Ge ◽  
Wanzeng Kong

Kmeans clustering and spectral clustering are two popular methods for grouping data points according to their similarities. However, the performance of Kmeans clustering can be quite unstable due to the random initialization of the cluster centroids. Spectral clustering methods generally employ a two-step strategy of spectral embedding followed by discretization postprocessing to obtain the cluster assignment, and the postprocessing step can easily lead to a large deviation from the true discrete solution. In this paper, based on the connection between Kmeans clustering and spectral clustering, we propose a new Kmeans formulation, termed KMSR, that jointly performs spectral embedding and spectral rotation, an effective postprocessing approach for discretization. Further, instead of directly using the dot-product data similarity measure, we generalize KMSR by incorporating more advanced data similarity measures and call the generalized model KMSR-G. An efficient optimization method is derived to solve the KMSR (KMSR-G) model objective, and its complexity and convergence are analyzed. We conduct experiments on extensive benchmark datasets to validate the performance of the proposed models, and the experimental results demonstrate that our models perform better than the related methods in most cases.



PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0245264
Author(s):  
Ali Sabah ◽  
Sabrina Tiun ◽  
Nor Samsiah Sani ◽  
Masri Ayob ◽  
Adil Yaseen Taha

Existing text clustering methods utilize only one representation at a time (single view), whereas multiple views can represent documents. The multiview multirepresentation method enhances clustering quality. Moreover, existing clustering methods that utilize more than one representation at a time (multiview) use representations of the same nature. Hence, using multiple views that represent data in different representations with clustering methods is a reasonable way to create a diverse set of candidate clustering solutions. On this basis, an effective dynamic clustering method must consider combining multiple views of data, including the semantic view, the lexical view (word weighting), and the topic view, as well as the number of clusters. The main goal of this study is to develop a new method that can improve the performance of web search result clustering (WSRC). An enhanced multiview multirepresentation consensus clustering ensemble (MMCC) method is proposed to create a set of diverse candidate solutions and select a high-quality overlapping cluster. The overlapping clusters are obtained from the candidate solutions created by different clustering methods. The framework to develop the proposed MMCC includes several stages: (1) acquiring the standard datasets (MORESQUE and Open Directory Project-239), which are used to validate search result clustering algorithms, (2) preprocessing the dataset, (3) applying multiview multirepresentation clustering models, (4) using the radius-based cluster number estimation algorithm, and (5) employing the consensus clustering ensemble method. Results show an improvement in clustering methods when multiview multirepresentation is used. More importantly, the proposed MMCC model improves the overall performance of WSRC compared with all single-view clustering models.



2021 ◽  
Author(s):  
Antonios Makris ◽  
Camila Leite da Silva ◽  
Vania Bogorny ◽  
Luis Otavio Alvares ◽  
Jose Antonio Macedo ◽  
...  

Abstract: During the last few years the volumes of the data that synthesize trajectories have expanded to unparalleled quantities. This growth is challenging traditional trajectory analysis approaches, and solutions are sought in other domains. In this work, we focus on data compression techniques with the intention of minimizing the size of trajectory data while, at the same time, minimizing the impact on trajectory analysis methods. To this end, we evaluate five lossy compression algorithms: Douglas-Peucker (DP), Time Ratio (TR), Speed Based (SP), Time Ratio Speed Based (TR_SP) and Speed Based Time Ratio (SP_TR). The comparison is performed using four distinct real-world datasets against six different dynamically assigned thresholds. The effectiveness of the compression is evaluated using classification techniques and similarity measures. The results show that there is a trade-off between the compression rate and the achieved quality. There is no “best algorithm” for every case, and the choice of the proper compression algorithm is an application-dependent process.
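To make the first of the listed algorithms concrete, here is a minimal sketch of Douglas-Peucker simplification for a 2D trajectory given as a list of `(x, y)` tuples. It is a generic textbook version, not the authors' implementation: it ignores timestamps, uses distance to the chord segment rather than to the infinite line, and the names `douglas_peucker` and `epsilon` are assumptions made for this example.

```python
from math import hypot


def _point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:            # degenerate segment: a == b
        return hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Drop points that lie within epsilon of the simplified polyline."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point farthest from the chord first-last.
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= epsilon:                # everything is close enough: keep ends
        return [first, last]
    # Otherwise keep the farthest point and recurse on both halves.
    left = douglas_peucker(points[: idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right           # avoid duplicating the split point
```

A larger `epsilon` removes more points, which raises the compression rate at the cost of fidelity; this is the trade-off the abstract reports.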



Author(s):  
Laura Macia

In this article I discuss cluster analysis as an exploratory tool to support the identification of associations within qualitative data. While not appropriate for all qualitative projects, cluster analysis can be particularly helpful in identifying patterns where numerous cases are studied. I use as illustration a research project on Latino grievances to offer a detailed explanation of the main steps in cluster analysis, providing specific considerations for its use with qualitative data. I specifically describe the issues of data transformation, the choice of clustering methods and similarity measures, the identification of a cluster solution, and the interpretation of the data in a qualitative context.



2013 ◽  
Vol 12 (5) ◽  
pp. 3443-3451
Author(s):  
Rajesh Pasupuleti ◽  
Narsimha Gugulothu

Clustering analysis initiates a new direction in data mining that has a major impact in various domains, including machine learning, pattern recognition, image processing, information retrieval and bioinformatics. Current clustering techniques do not adequately address some of the requirements and have failed to standardize clustering algorithms to support all real applications. Many clustering methods depend on user-specified parameters, and the initial seeds of the clusters are randomly selected by the user. In this paper, we propose a new clustering method based on the linear approximation of a function: we obtain an overall idea of the behavior of the clustering function, then pick the initial seeds of the clusters as points on the linear approximation line and perform the clustering operations, unlike traditional clustering methods, which group data objects into clusters using distance measures, similarity measures and statistical distributions. Experimental results, illustrated with an example of business data, show that clustering based on linear approximation yields good results in practice. The paper also explains privacy-preserving clusters of sensitive data objects.



Author(s):  
Soobia Saeed ◽  
N. Z. Jhanjhi ◽  
Mehmood Naqvi ◽  
Mamoona Humayun ◽  
Vasaki Ponnusamy

Human beings have a knack for errors. Corrective actions that identify and rectify such errors in a minimum period of time are required when effectiveness and swift advancement depend on the ability to acknowledge faults and errors and repair them quickly. This commentary reviews software used as an audit module for IT complaints, another significant instrument from the field of data analysis that quickly and successfully assesses the inaccuracies or grievances identified by the users in a given company. The aim of this study is to evaluate the statistical significance of the relationship between client reporting attitude and client reliability, to evaluate the impact of strong responsiveness on client reliability, to measure the statistically noteworthy effect of client grievance conduct on service quality, and to test the impact of service quality on client dedication.



Author(s):  
Ananya Nandy ◽  
Andy Dong ◽  
Kosa Goucher-Lambert

Abstract: In order to retrieve analogous designs for design-by-analogy, computational systems require the calculation of similarity between the target design and a repository of source designs. Representing designs as functional abstractions can support designers in practicing design-by-analogy by minimizing fixation on surface-level similarities. In addition, when a design is represented by a functional model using a function-flow format, many measures are available to determine functional similarity. In most current function-based design-by-analogy systems, the functions are represented as vectors and measures like cosine similarity are used to retrieve analogous designs. However, it is hypothesized that changing the similarity measure can significantly change the examples that are retrieved. In this paper, several similarity measures are empirically tested across a set of functional models of energy harvesting products. In addition, the paper explores representing the functional models as networks to find functionally similar designs using graph similarity measures. Surprisingly, the types of designs that are considered similar by vector-based and one of the graph similarity measures are found to vary significantly. Even among a set of functional models that share known similar technology, the different measures find inconsistent degrees of similarity: some measures find the set of models to be very similar and some find them to be very dissimilar. The findings have implications for the choice of similarity metric and its effect on finding analogous designs that, in this case, have similar pairs of functions and flows in their functional models. Since literature has shown that the types of designs presented can impact their effectiveness in aiding the design process, this work intends to spur further consideration of the impact of using different similarity measures when assessing design similarity computationally.
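The vector-based retrieval step mentioned above can be sketched as follows, assuming each design is encoded as a vector of function counts. The encoding, the repository contents, and the names `cosine_similarity` and `rank_by_similarity` are hypothetical and only illustrate the general idea, not the system studied in the paper.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:     # define similarity with a zero vector as 0
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(target, repository):
    """Return repository keys ordered from most to least similar to target."""
    return sorted(
        repository,
        key=lambda name: cosine_similarity(target, repository[name]),
        reverse=True,
    )


# Hypothetical function-count vectors for three source designs.
repository = {
    "design_a": [2, 0, 1],
    "design_b": [0, 3, 0],
    "design_c": [1, 1, 1],
}
ranking = rank_by_similarity([2, 0, 1], repository)
```

Swapping `cosine_similarity` for another measure in the `key` function is enough to change the ranking, which is the sensitivity the abstract investigates.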



2021 ◽  
Vol 40 (1) ◽  
pp. 477-490
Author(s):  
Yanping Xu ◽  
Tingcong Ye ◽  
Xin Wang ◽  
Yuping Lai ◽  
Jian Qiu ◽  
...  

In the field of security, data labels are often unknown or too expensive to obtain, so clustering methods are used to detect the threat behavior contained in big data. The most widely used probabilistic clustering model is the Gaussian Mixture Model (GMM), which is flexible and can incorporate prior knowledge to model the uncertainty of the data. Therefore, in this paper, we use GMM to build the threat behavior detection model. Commonly, Expectation Maximization (EM) and Variational Inference (VI) are used to estimate the optimal parameters of GMM. However, both EM and VI are quite sensitive to the initial values of the parameters. Therefore, we propose to use Singular Value Decomposition (SVD) to initialize the parameters. First, SVD is used to factorize the data set matrix to get the singular value matrix and singular matrices. Then we calculate the number of components of GMM from the first two singular values in the singular value matrix and the dimension of the data. Next, the other parameters of GMM, such as the mixing coefficients, the mean and the covariance, are calculated based on the number of components. After that, the initial values of the parameters are input into EM and VI to estimate the optimal parameters of GMM. The experimental results indicate that our proposed method performs well for parameter initialization of GMM clustering when EM and VI are used to estimate the parameters.



2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Mercedeh Movassagh ◽  
Lisa M. Bebell ◽  
Kathy Burgoine ◽  
Christine Hehnly ◽  
Lijun Zhang ◽  
...  

Abstract: The composition of the maternal vaginal microbiome influences the duration of pregnancy, onset of labor, and even neonatal outcomes. Maternal microbiome research in sub-Saharan Africa has focused on non-pregnant and postpartum composition of the vaginal microbiome. Here we aimed to illustrate the relationship between the vaginal microbiome of 99 laboring Ugandan women and intrapartum fever using routine microbiology and 16S ribosomal RNA gene sequencing from two hypervariable regions (V1–V2 and V3–V4). To describe the vaginal microbes associated with vaginal microbial communities, we pursued two approaches: hierarchical clustering methods and a novel Grades of Membership (GoM) modeling approach for vaginal microbiome characterization. Leveraging GoM models, we created a basis composed of a preassigned number of microbial topics whose linear combination optimally represents each patient yielding more comprehensive associations and characterization between maternal clinical features and the microbial communities. Using a random forest model, we showed that by including microbial topic models we improved upon clinical variables to predict maternal fever. Overall, we found a higher prevalence of Granulicatella, Streptococcus, Fusobacterium, Anaerococcus, Sneathia, Clostridium, Gemella, Mobiluncus, and Veillonella genera in febrile mothers, and higher prevalence of Lactobacillus genera (in particular L. crispatus and L. jensenii), Acinobacter, Aerococcus, and Prevotella species in afebrile mothers. By including clinical variables with microbial topics in this model, we observed young maternal age, fever reported earlier in the pregnancy, longer labor duration, and microbial communities with reduced Lactobacillus diversity were associated with intrapartum fever.
These results better defined relationships between the presence or absence of intrapartum fever, demographics, peripartum course, and vaginal microbial topics, and expanded our understanding of the impact of the microbiome on maternal and potentially neonatal outcome risk.


