constrained clustering
Recently Published Documents

An approach to solving the constrained clustering problem has been developed, based on the aggregation of data obtained as a result of evaluating the characteristics of clustered objects by several independent experts, and the analysis of alternative variants of clustering by constraint programming methods using original heuristics. Objects clusterized are represented as multisets, which makes it possible to use appropriate methods of aggregation of expert opinions. It is proposed to solve the constrained clustering problem as a constraint satisfaction problem. The main attention is paid to the issue of reducing the number and simplifying the constraints of the constraint satisfaction problem at the stage of its formalization. Within the framework of the approach, we have created: a) a method for estimating the optimal value of the objective function by hierarchical clustering of multisets, taking into account a priori constraints of the subject domain, and b) a method for generating additional constraints on the desired solution in the form of “smart tables”, based on the obtained estimate. The approach allows us to find the best partition in the problems of the class under consideration, which are characterized by a high dimension.

Download Full-text

Применение smart-таблиц и оценок целевой функции для снижения размерности и ускорения решения задач Constrained Clustering

Вестник ВГУ. Серия: Системный анализ и информационные технологии ◽

10.17308/sait.2021.3/3738 ◽

2021 ◽

pp. 81-94

Author(s):

Александр Анатольевич Зуенко ◽

Ольга Владимировна Фридман ◽

Ольга Николаевна Зуенко

Keyword(s):

Constraint Programming ◽

Constrained Clustering

В статье представлен комплексный подход к точному решению задач Constrained Clustering, то есть задач кластеризации, предполагающих анализ, помимо матрицы расстояний, фоновых знаний о необходимости/недопустимости вхождения некоторых объектов в те или иные кластеры. Подход реализован в рамках парадигмы программирования в ограничениях (Constraint Programming), ориентированной на построение процедур систематического поиска (процедур обхода дерева поиска) для решения сложных комбинаторных задач. При этом, вся исходная информация о задаче выражается с помощью ограничений, то есть качественных и количественных зависимостей. Существенная сложность заключается в том, что в современных средах и библиотеках программирования в ограничениях обработка качественных ограничений, которыми, в частности, являются правила отнесения объектов к одному или различным кластерам, производится недостаточно эффективно. Таким образом, представляется актуальной разработка способов ускорения обработки подобных ограничений. В статье предлагается представлять и обрабатывать качественные ограничения в форме табличных ограничений нового типа, а именно smart-таблиц D-типа. Для smart-таблиц D-типа разработаны высокоэффективные процедуры вывода на ограничениях, осуществляющие раннее отсечение неперспективных ветвей дерева поиска. Другое направление работ, которое активно развивается в настоящих исследованиях, связано с уменьшением количества ограничений, используемых для представления задачи, и с упрощением их вида. Предлагается генерировать ограничения лишь для некоторых пар объектов, основываясь на интервальной оценке для оптимального значения критерия кластеризации. Для получения данной оценки используется ранее предложенный авторами метод иерархической кластеризации, который позволяет анализировать ограничения на комбинации пар объектов внутри кластера. Предложенный подход позволяет находить все варианты разбиений, обеспечивающие глобальный оптимум целевой функции для рассматриваемых задач Constrained Clustering высокой размерности. Разработанный подход проиллюстрирован на примере задачи выявления зон участка горного массива с различной степенью сейсмической активности.

Download Full-text

3SHACC: Three Stages Hybrid Agglomerative Constrained Clustering

Neurocomputing ◽

10.1016/j.neucom.2021.12.018 ◽

2021 ◽

Author(s):

Germán González-Almagro ◽

Juan Luis Suarez ◽

Julián Luengo ◽

José-Ramón Cano ◽

Salvador García

Keyword(s):

Constrained Clustering ◽

Three Stages

Download Full-text

Classification of Markov Encrypted Traffic on Gaussian Mixture Model Constrained Clustering

Wireless Communications and Mobile Computing ◽

10.1155/2021/4935108 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Junkai Yi ◽

Guanglin Gong ◽

Zeyu Liu ◽

Yacong Zhang

Keyword(s):

Gaussian Mixture Model ◽

Mixture Model ◽

Gaussian Mixture ◽

Communication Process ◽

Traffic Classification ◽

Constrained Clustering ◽

Clustering Approach ◽

Network Application ◽

Encrypted Traffic

In order to solve the problem that traditional analysis approaches of encrypted traffic in encryption transmission of network application only consider the traffic classification in the complete communication process with ignoring traffic classification in the simplified communication process, and there are a lot of duplication problems in application fingerprints during state transition, a new classification approach of encrypted traffic is proposed. The article applies the Gaussian mixture model (GMM) to analyze the length of the message, and the model is established to solve the problem of application fingerprint duplication. The fingerprints with similar lengths of the same application are divided into as few clusters as possible by constrained clustering approach, which speeds up convergence speed and improves the clustering effect. The experimental results show that compared with the other encryption traffic classification approaches, the proposed approach has 11.7%, 19.8%, 6.86%, and 5.36% improvement in TPR, FPR, Precision, and Recall, respectively, and the classification effect of encrypted traffic is significantly improved.

Download Full-text

Improving structural variant clustering to reduce the negative effect of the breakpoint uncertainty problem

BMC Bioinformatics ◽

10.1186/s12859-021-04374-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jan Geryk ◽

Alzbeta Zinkova ◽

Iveta Zedníková ◽

Halina Simková ◽

Vlastimil Stenzl ◽

...

Keyword(s):

Mendelian Inheritance ◽

Dissimilarity Measure ◽

Constrained Clustering ◽

Structural Variants ◽

Short Read Sequencing ◽

Population Analyses ◽

Heuristic Strategy ◽

Negative Effect ◽

Critical Problems ◽

Hardy Weinberg Equilibrium

Abstract Background Structural variants (SVs) represent an important source of genetic variation. One of the most critical problems in their detection is breakpoint uncertainty associated with the inability to determine their exact genomic position. Breakpoint uncertainty is a characteristic issue of structural variants detected via short-read sequencing methods and complicates subsequent population analyses. The commonly used heuristic strategy reduces this issue by clustering/merging nearby structural variants of the same type before the data from individual samples are merged. Results We compared the two most used dissimilarity measures for SV clustering in terms of Mendelian inheritance errors (MIE), kinship prediction, and deviation from Hardy–Weinberg equilibrium. We analyzed the occurrence of Mendelian-inconsistent SV clusters that can be collapsed into one Mendelian-consistent SV as a new measure of dataset consistency. We also developed a new method based on constrained clustering that explicitly identifies these types of clusters. Conclusions We found that the dissimilarity measure based on the distance between SVs breakpoints produces slightly better results than the measure based on SVs overlap. This difference is evident in trivial and corrected clustering strategy, but not in constrained clustering strategy. However, constrained clustering strategy provided the best results in all aspects, regardless of the dissimilarity measure used.

Download Full-text

CDEC: a constrained deep embedded clustering

International Journal of Intelligent Computing and Cybernetics ◽

10.1108/ijicc-03-2021-0053 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Elham Amirizadeh ◽

Reza Boostani

Keyword(s):

Side Information ◽

Semisupervised Learning ◽

Small Subset ◽

Constrained Clustering ◽

Content Type ◽

Network Training ◽

Leibler Divergence ◽

Interesting Approach ◽

Class Labels ◽

Low Dimensional

PurposeThe aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets; also, the authors show that applying this information improves the performance of clustering and also increase the speed of the network training convergence.Design/methodology/approachIn data mining, semisupervised learning is an interesting approach because good performance can be achieved with a small subset of labeled data; one reason is that the data labeling is expensive, and semisupervised learning does not need all labels. One type of semisupervised learning is constrained clustering; this type of learning does not use class labels for clustering. Instead, it uses information of some pairs of instances (side information), and these instances maybe are in the same cluster (must-link [ML]) or in different clusters (cannot-link [CL]). Constrained clustering was studied extensively; however, little works have focused on constrained clustering for big datasets. In this paper, the authors have presented a constrained clustering for big datasets, and the method uses a DNN. The authors inject the constraints (ML and CL) to this DNN to promote the clustering performance and call it constrained deep embedded clustering (CDEC). In this manner, an autoencoder was implemented to elicit informative low dimensional features in the latent space and then retrain the encoder network using a proposed Kullback–Leibler divergence objective function, which captures the constraints in order to cluster the projected samples. The proposed CDEC has been compared with the adversarial autoencoder, constrained 1-spectral clustering and autoencoder + k-means was applied to the known MNIST, Reuters-10k and USPS datasets, and their performance were assessed in terms of clustering accuracy. Empirical results confirmed the statistical superiority of CDEC in terms of clustering accuracy to the counterparts.FindingsFirst of all, this is the first DNN-constrained clustering that uses side information to improve the performance of clustering without using labels in big datasets with high dimension. Second, the author defined a formula to inject side information to the DNN. Third, the proposed method improves clustering performance and network convergence speed.Originality/valueLittle works have focused on constrained clustering for big datasets; also, the studies in DNNs for clustering, with specific loss function that simultaneously extract features and clustering the data, are rare. The method improves the performance of big data clustering without using labels, and it is important because the data labeling is expensive and time-consuming, especially for big datasets.

Download Full-text