Order preserving hierarchical agglomerative clustering

2021 ◽  
Author(s):  
Daniel Bakkelund

Abstract: Partial orders and directed acyclic graphs are commonly recurring data structures that arise naturally in numerous domains and applications and are used to represent ordered relations between entities in the domains. Examples are task dependencies in a project plan, transaction order in distributed ledgers and execution sequences of tasks in computer programs, to mention a few. We study the problem of order preserving hierarchical clustering of this kind of ordered data. That is, if we have $a<b$ in the original data and denote their respective clusters by $[a]$ and $[b]$, then we shall have $[a]<[b]$ in the produced clustering. The clustering is similarity based and uses standard linkage functions, such as single- and complete linkage, and is an extension of classical hierarchical clustering. To achieve this, we develop a novel theory that extends classical hierarchical clustering to strictly partially ordered sets. We define the output from running classical hierarchical clustering on strictly ordered data to be partial dendrograms; sub-trees of classical dendrograms with several connected components. We then construct an embedding of partial dendrograms over a set into the family of ultrametrics over the same set. An optimal hierarchical clustering is defined as the partial dendrogram corresponding to the ultrametric closest to the original dissimilarity measure, measured in the p-norm. Thus, the method is a combination of classical hierarchical clustering and ultrametric fitting. A reference implementation is employed for experiments on both synthetic random data and real world data from a database of machine parts. When compared to existing methods, the experiments show that our method excels both in cluster quality and order preservation.
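
The order-preservation requirement described in this abstract can be illustrated with a small check. The sketch below is not the paper's reference implementation. It assumes one plausible reading of the condition: the relation induced on clusters must contain no cycle, and no two comparable elements may share a cluster. The function name `is_order_preserving` and the input encoding (pairs `(a, b)` meaning a < b, plus a dict from element to cluster label) are hypothetical.

```python
from collections import defaultdict


def is_order_preserving(relation, cluster_of):
    """Check whether a flat clustering respects a strict partial order.

    relation   -- iterable of (a, b) pairs, each meaning a < b (assumed encoding)
    cluster_of -- dict mapping every element to its cluster label
    """
    # Build the relation induced on clusters: [a] -> [b] whenever a < b.
    graph = defaultdict(set)
    for a, b in relation:
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            # a < b inside one cluster would force [a] < [a].
            return False
        graph[ca].add(cb)

    # Depth-first search for a cycle among clusters.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every cluster starts WHITE

    def has_cycle(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and has_cycle(nxt):
                return True
        colour[node] = BLACK
        return False

    return not any(colour[c] == WHITE and has_cycle(c) for c in list(graph))


# Two chains a < b and c < d.
order = [("a", "b"), ("c", "d")]
print(is_order_preserving(order, {"a": 0, "c": 0, "b": 1, "d": 1}))
print(is_order_preserving(order, {"a": 0, "d": 0, "b": 1, "c": 1}))
```

In the first call both minimal elements share a cluster and both maximal elements share another, so the induced relation is a single edge. In the second call the clusters point at each other, which is the kind of merge an order preserving method has to rule out.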

2020 ◽  
Vol 29 ◽  
pp. 1999-2012
Author(s):  
Federico Bolelli ◽  
Stefano Allegretti ◽  
Lorenzo Baraldi ◽  
Costantino Grana

2010 ◽  
Vol 439-440 ◽  
pp. 1306-1311
Author(s):  
Fang Li ◽  
Qun Xiong Zhu

An LSI-based hierarchical agglomerative clustering algorithm is studied. To address the problems of the LSI-based hierarchical agglomerative clustering method, an NMF-based hierarchical clustering method is proposed and analyzed. Two ways of implementing the NMF-based method are introduced. Finally, the results of two groups of experiments on the TanCorp document corpus show that the proposed method is effective.


2016 ◽  
pp. 359-374 ◽  
Author(s):  
Ritu Chauhan ◽  
Harleen Kaur

High-dimensional databases are a major concern for researchers seeking to extract relevant information for future decision making. Real-world data is high dimensional in nature and contains irrelevant features, missing values, and redundancy, all of which require serious attention. Using all such features can produce misleading prediction results, so such databases are difficult to analyze when seeking optimal solutions. To deal with these issues, the authors have developed and implemented the Cluster Analysis Study Behavior of School Children from Large Databases (CABS) framework to retrieve effective and efficient clusters from high-dimensional human behavior datasets for school children in the US. They applied a feature selection technique and a hierarchical agglomerative clustering technique to discover clusters of varying shape and size and to retrieve knowledge from large databases. The study was conducted on Health Behavior in School-Aged Children (HBSC) data, using the Correlation-Based Feature Selection (CFS) technique to reduce inconsistent data records and select relevant features, so that similar data can be merged and clusters retrieved. Predictive analytics can, moreover, support a more thorough extraction of knowledge and so enable better and faster decisions. The authors implemented the framework in the R language, with the clustering performed using the pvclust package. The proposed framework is highly efficient in discovering hidden and implicit knowledge from large databases because of its ability to handle and discover clusters of variant shapes.



Electronics ◽  
2022 ◽  
Vol 11 (2) ◽  
pp. 267
Author(s):  
Félix Morales ◽  
Miguel García-Torres ◽  
Gustavo Velázquez ◽  
Federico Daumas-Ladouce ◽  
Pedro E. Gardel-Sotomayor ◽  
...  

Correctly defining and grouping electrical feeders is of great importance for electrical system operators. In this paper, we compare two different clustering techniques, K-means and hierarchical agglomerative clustering, applied to real data from the east region of Paraguay. The raw data were pre-processed, resulting in four data sets, namely, (i) a weekly feeder demand, (ii) a monthly feeder demand, (iii) a statistical feature set extracted from the original data and (iv) a seasonal and daily consumption feature set obtained considering the characteristics of the Paraguayan load curve. Considering the four data sets, two clustering algorithms, two distance metrics and five linkage criteria, a total of 36 models were assessed with the Silhouette, Davies–Bouldin and Calinski–Harabasz index scores. The K-means algorithm with the seasonal feature data set showed the best performance considering the Silhouette, Calinski–Harabasz and Davies–Bouldin validation index scores, with a configuration of six clusters.
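
To make the model comparison in this abstract concrete, the following sketch computes the mean Silhouette score for a labeled point set using only the Python standard library. It is an illustration, not the authors' pipeline. The function name `silhouette_score`, the Euclidean distance, and the toy points are assumptions, and the function expects at least two clusters.

```python
import math


def silhouette_score(points, labels):
    """Mean Silhouette coefficient with Euclidean distance (assumed metric).

    points -- list of equal-length numeric tuples
    labels -- list of cluster labels, one per point (at least two clusters)
    """
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: two well-separated pairs of points.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(pts, [0, 0, 1, 1]))  # grouping that matches the geometry
print(silhouette_score(pts, [0, 1, 0, 1]))  # grouping that cuts across it
```

A score near 1 indicates compact, well-separated clusters and a negative score indicates points assigned to the wrong group, which is why the index can rank models produced by different algorithms and linkage criteria.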


2018 ◽  
Vol 7 (1) ◽  
pp. 49-56
Author(s):  
Firdaus Firdaus

This paper presents a method to improve the data integrity of an individual-based bibliographic repository. Integrity is improved by comparing individual-based raw publication data with individual-based clustered publication data. Hierarchical Agglomerative Clustering is used to cluster the publication data with similar author names. Clustering is done in two steps: the first is based on the co-author relationship, and the second on title similarity and year difference. The two-step hierarchical clustering technique for name disambiguation has been applied to the Universitas Sriwijaya Publication Data Center with good accuracy.


2019 ◽  
Vol 8 (12) ◽  
pp. 1013-1025 ◽  
Author(s):  
Felicitas Kuehne ◽  
Beate Jahn ◽  
Annette Conrads-Frank ◽  
Marvin Bundo ◽  
Marjan Arvandi ◽  
...  

Aim: The aim of this project is to describe a causal (counterfactual) approach for analyzing when to start statin treatment to prevent cardiovascular disease using real-world evidence. Methods: We use directed acyclic graphs to operationalize and visualize the causal research question considering selection bias, potential time-independent and time-dependent confounding. We provide a study protocol following the ‘target trial’ approach and describe the data structure needed for the causal assessment. Conclusion: The study protocol can be applied to real-world data, in general. However, the structure and quality of the database play an essential role for the validity of the results, and database-specific potential for bias needs to be explicitly considered.


2021 ◽  
Vol 11 (23) ◽  
pp. 11122
Author(s):  
Thomas Märzinger ◽  
Jan Kotík ◽  
Christoph Pfeifer

This paper is the result of the first-phase, inter-disciplinary work of a multi-disciplinary research project (“Urban pop-up housing environments and their potential as local innovation systems”) consisting of energy engineers and waste managers, landscape architects and spatial planners, innovation researchers and technology assessors. The project is aiming at globally analyzing and describing existing pop-up housings (PUH), developing modeling and assessment tools for sustainable, energy-efficient and socially innovative temporary housing solutions (THS), especially for sustainable and resilient urban structures. The present paper presents an effective application of hierarchical agglomerative clustering (HAC) for analyses of large datasets typically derived from field studies. As can be shown, the method, although well-known and successfully established in (soft) computing science, can also be used very constructively as a potential urban planning tool. The main aim of the underlying multi-disciplinary research project was to deeply analyze and structure THS and PUE. Multiple aspects are to be considered when it comes to the characterization and classification of such environments. A thorough (global) web survey of PUH and analysis of scientific literature concerning descriptive work of PUH and THS has been performed. Moreover, out of several tested different approaches and methods for classifying PUH, hierarchical clustering algorithms functioned well when properly selected metrics and cut-off criteria were applied. To be specific, the ‘Minkowski’-metric and the ‘Calinski-Harabasz’-criteria, as clustering indices, have shown the best overall results in clustering the inhomogeneous data concerning PUH. Several additional algorithms/functions derived from the field of hierarchical clustering have also been tested to exploit their potential in interpreting and graphically analyzing particular structures and dependencies in the resulting clusters. Hereby, (math.) the significance ‘S’ and (math.) proportion ‘P’ have been concluded to yield the best interpretable and comprehensible results when it comes to analyzing the given set (objects n = 85) of researched PUH-objects together with their properties (n > 190). The resulting easily readable graphs clearly demonstrate the applicability and usability of hierarchical clustering- and their derivative algorithms for scientifically profound building classification tasks in Urban Planning by effectively managing huge inhomogeneous building datasets.
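
As a rough illustration of hierarchical agglomerative clustering with a Minkowski metric, the sketch below merges the two closest clusters until a requested number remains. It is not the authors' code. The abstract does not state the linkage or the Minkowski order, so average linkage and p = 3 are assumptions, and the names `minkowski` and `agglomerate` are hypothetical.

```python
from itertools import combinations


def minkowski(u, v, p=3):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(u, v)) ** (1.0 / p)


def agglomerate(points, k, p=3):
    """Naive agglomerative clustering down to k clusters.

    Uses average linkage (an assumption) and returns lists of point indices.
    Every pair of clusters is rescanned at each step, so it only suits small inputs.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(minkowski(points[i], points[j], p) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters; i < j always holds here.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data: two tight pairs and one outlier.
data = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerate(data, k=3))
```

A cut-off criterion such as the Calinski-Harabasz index would then be evaluated for several values of `k` to choose where to stop merging; that step is omitted here.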


Collaborative filtering is one of the techniques used to deliver personalized recommendations in e-commerce. However, conventional techniques work only with customers' ratings and do not take into account changes in customers' preferences or the reliability of the associated ratings. The large growth in the number of customers and items has created critical difficulties, so new recommendation strategies are required. The Slope One algorithm can perform well because it reduces the effect of insufficient ratings and improves recommendation accuracy. On the other hand, its running time increases as the number of customers grows. To address this, clustering strategies are used to extend the neighborhood space. The main motivation of the paper is to investigate the possible influence of trust measures on the quality of recommendations, and the paper highlights the significance of trust in producing them. The Slope One algorithm combined with a hierarchical agglomerative clustering technique performed better when evaluated with trust metrics and addressed the problem of the large amount of information associated with trust-aware data.
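
For readers unfamiliar with Slope One, the following is a minimal sketch of the standard weighted Slope One predictor. It does not include the clustering or trust components the abstract describes. The function name `slope_one_predict`, the nested-dict rating format, and the sample ratings are hypothetical.

```python
from collections import defaultdict


def slope_one_predict(ratings, user, target):
    """Predict `user`'s rating of `target` with weighted Slope One.

    ratings -- dict: user -> dict: item -> numeric rating (assumed format)
    Returns None when no co-rated item supports a prediction.
    """
    diff_sum = defaultdict(float)  # summed (target - item) rating differences
    count = defaultdict(int)       # number of users who rated both items

    # Collect deviations from every user who rated the target item.
    for user_ratings in ratings.values():
        if target not in user_ratings:
            continue
        for item, value in user_ratings.items():
            if item != target:
                diff_sum[item] += user_ratings[target] - value
                count[item] += 1

    # Combine the user's own ratings with the average deviations,
    # weighting each item by how many users co-rated it with the target.
    numerator = 0.0
    denominator = 0
    for item, value in ratings[user].items():
        if item != target and count[item] > 0:
            deviation = diff_sum[item] / count[item]
            numerator += (value + deviation) * count[item]
            denominator += count[item]
    return numerator / denominator if denominator else None


# Hypothetical ratings for three users and three items.
sample = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
print(slope_one_predict(sample, "lucy", "A"))
```

Because the deviation table is built from every pair of co-rated items, its cost grows with the number of users and items, which is the scaling problem that grouping users by clustering is meant to reduce.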

