NBC: An Efficient Hierarchical Clustering Algorithm for Large Datasets

Nearest neighbor search is a key technique in hierarchical clustering, and its computational complexity determines the performance of the hierarchical clustering algorithm. The time complexity of standard agglomerative hierarchical clustering is O(n^3), while that of more advanced hierarchical clustering algorithms (such as nearest neighbor chain, SLINK and CLINK) is O(n^2). This paper presents a new nearest neighbor search method called nearest neighbor boundary (NNB), which first divides a large dataset into independent subsets and then finds the nearest neighbor of each point within its subset. When NNB is used, the time complexity of hierarchical clustering can be reduced to O(n log 2n). Based on NNB, we propose a fast hierarchical clustering algorithm called nearest-neighbor boundary clustering (NBC), which can be adapted to parallel and distributed computing frameworks. The experimental results demonstrate that our algorithm is practical for large datasets.
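
For context, the cubic baseline mentioned in this abstract can be sketched as follows. This is only the naive agglomerative procedure with single linkage, not NNB or NBC, whose details are not given here; the function name and the sample points are illustrative assumptions.

```python
from math import dist

def naive_single_linkage(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None  # (distance, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(naive_single_linkage(points, 3))  # prints [[0, 1], [2, 3], [4]]
```

Each merge step rescans all point pairs, and there are up to n - 1 merges, which is where the cubic cost comes from. Faster methods avoid this full rescan by restricting or caching the nearest neighbor search.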

NNB: An efficient nearest neighbor search method for hierarchical clustering on large datasets

Proceedings of the 2015 IEEE 9th International Conference on Semantic Computing (IEEE ICSC 2015) ◽

10.1109/icosc.2015.7050840 ◽

2015 ◽

Cited By ~ 3

Author(s):

Wei Zhang ◽

Gongxuan Zhang ◽

Yongli Wang ◽

Zhaomeng Zhu ◽

Tao Li

Keyword(s):

Hierarchical Clustering ◽

Nearest Neighbor ◽

Large Datasets ◽

Search Method ◽

Nearest Neighbor Search ◽

Neighbor Search

Symmetry Based Automatic Evolution of Clusters: A New Approach to Data Clustering

Computational Intelligence and Neuroscience ◽

10.1155/2015/796276 ◽

2015 ◽

Vol 2015 ◽

pp. 1-21 ◽

Cited By ~ 2

Author(s):

Singh Vijendra ◽

Sahoo Laxman

Keyword(s):

Clustering Algorithm ◽

Nearest Neighbor ◽

A Priori ◽

Clustering Algorithms ◽

Real Data ◽

Nearest Neighbor Search ◽

Objective Functions ◽

Neighbor Search ◽

Genetic Clustering ◽

Symmetric Points

We present a multiobjective genetic clustering approach in which data points are assigned to clusters based on a new line symmetry distance. The proposed algorithm is called multiobjective line symmetry based genetic clustering (MOLGC). Two objective functions are used: the Davies-Bouldin (DB) index and an objective function based on the line symmetry distance. The proposed algorithm evolves near-optimal clustering solutions using multiple clustering criteria, without a priori knowledge of the actual number of clusters. A nearest neighbor search based on multiple randomized K-dimensional (Kd) trees is used to reduce the complexity of finding the closest symmetric points. Experimental results on several artificial and real data sets show that the proposed clustering algorithm can obtain optimal clustering solutions in terms of different cluster quality measures in comparison to the existing SBKM and MOCK clustering algorithms.
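
Of the two objectives named in this abstract, the Davies-Bouldin index has a standard definition that can be sketched directly. The code below is a minimal illustration of that index only; the line symmetry objective and the genetic search of MOLGC are not reproduced, and the function names and sample clusters are assumptions. It expects at least two clusters with distinct centroids.

```python
from math import dist

def centroid(cluster):
    """Coordinate-wise mean of the points in one cluster."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))

def davies_bouldin(clusters):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / separation_ij."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each member to its cluster centroid.
    scatter = [sum(dist(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k

compact = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
stretched = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(davies_bouldin(compact), davies_bouldin(stretched))
```

Lower values indicate compact, well-separated clusters, so the first partition of the same four points scores far better than the second.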

A Novel Local Density Hierarchical Clustering Algorithm Based on Reverse Nearest Neighbors

Mathematical Problems in Engineering ◽

10.1155/2019/2959017 ◽

2019 ◽

Vol 2019 ◽

pp. 1-10

Author(s):

Yaohui Liu ◽

Dong Liu ◽

Fang Yu ◽

Zhengming Ma

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Local Density ◽

Clustering Algorithms ◽

Real Data ◽

Nearest Neighbors ◽

Clustering Methods ◽

Density Peak ◽

Hierarchical Clustering Algorithm

Clustering is widely used in data analysis, and density-based methods have developed rapidly over the past 10 years. Although the state-of-the-art density peak clustering algorithms are efficient and can detect clusters of arbitrary shape, they are essentially centroid-based methods of a nonspherical type. In this paper, a novel local density hierarchical clustering algorithm based on reverse nearest neighbors, RNN-LDH, is proposed. By constructing and using a reverse nearest neighbor graph, the extended core regions are identified as initial clusters. Then, a new local density metric is defined to calculate the density of each object; meanwhile, the density hierarchical relationships among the objects are built according to their densities and neighbor relations. Finally, each unclustered object is assigned to one of the initial clusters or classified as noise. Experimental results on synthetic and real data sets show that RNN-LDH outperforms current clustering methods based on density peaks or reverse nearest neighbors.
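
The reverse nearest neighbor relation that this abstract builds on can be sketched with a brute-force search. This illustrates only the general notion (point i is a reverse neighbor of j when j is among the k nearest neighbors of i); the graph construction, density metric and cluster assignment of RNN-LDH are not reproduced, and all names and data below are assumptions.

```python
from math import dist

def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: dist(points[i], points[j]))
    return others[:k]

def reverse_nearest_neighbors(points, k):
    """Map each index j to the indices that count j among their k nearest neighbors."""
    reverse = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in k_nearest(points, i, k):
            reverse[j].append(i)
    return reverse

points = [(0.0,), (1.0,), (2.5,), (10.0,)]
print(reverse_nearest_neighbors(points, 1))
# prints {0: [1], 1: [0, 2], 2: [3], 3: []}
```

Points with many reverse neighbors tend to sit in dense regions, while the isolated point at 10.0 has none.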

Handling WSD using Hierarchical Clustering Algorithm with sentences

International Journal of Scientific Research in Science Engineering and Technology ◽

10.32628/ijsrset1841120 ◽

2018 ◽

pp. 83-88

Author(s):

Mohana Priya K ◽

Pooja Ragavi S ◽

Krishna Priya G

Keyword(s):

Hierarchical Clustering ◽

Similarity Measure ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Cosine Similarity Measure ◽

Hierarchical Clustering Algorithm ◽

Multiple Levels ◽

Pos Tagger ◽

Sentence Clustering ◽

The Right

Clustering is the process of grouping objects into subsets that have meaning in the context of a particular problem. It does not rely on predefined classes. It is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used in different applications. Sentence clustering is one of the best clustering techniques. A hierarchical clustering algorithm is applied at multiple levels to improve accuracy. A POS tagger is used for tagging, along with the Porter stemmer. The WordNet dictionary is used to determine similarity by invoking the Jiang-Conrath and cosine similarity measures. Grouping is performed with respect to the highest similarity value, with a mean threshold. This paper incorporates many parameters for finding the similarity between words. To identify the disambiguated words, sense identification is performed for the adjectives and a comparison is made. The SemCor and machine learning datasets are employed. Compared with previous results for WSD, our work shows a considerable improvement, reaching 91.2%.
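
The cosine similarity measure named in this abstract can be illustrated on plain word counts. This is a minimal sketch assuming whitespace tokenization and raw term frequencies; it does not use WordNet, the Jiang-Conrath measure, POS tagging or stemming, and the example sentences are made up.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(sentence_a, sentence_b):
    """Cosine of the angle between the word-count vectors of two sentences."""
    a = Counter(sentence_a.lower().split())
    b = Counter(sentence_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity("the bank approved the loan", "the bank of the river"))
```

Here the two sentences share "the" (twice each) and "bank", giving 5/7, even though "bank" carries a different sense in each. That gap between surface overlap and meaning is what word sense disambiguation has to close.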

Hesitant Fuzzy Linguistic Agglomerative Hierarchical Clustering Algorithm and Its Application in Judicial Practice

Mathematics ◽

10.3390/math9040370 ◽

2021 ◽

Vol 9 (4) ◽

pp. 370

Author(s):

Shuangsheng Wu ◽

Jie Lin ◽

Zhenyu Zhang ◽

Yushu Yang

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Agglomerative Hierarchical Clustering ◽

Research Gaps ◽

Judicial Practice ◽

Linguistic Term ◽

Clustering Effect ◽

Hierarchical Clustering Algorithm ◽

Fuzzy Linguistic

The fuzzy clustering algorithm has become a research hotspot in many fields because of its better clustering effect and data expression ability. However, little research focuses on the clustering of hesitant fuzzy linguistic term sets (HFLTSs). To fill in the research gaps, we extend the data type of clustering to hesitant fuzzy linguistic information. A kind of hesitant fuzzy linguistic agglomerative hierarchical clustering algorithm is proposed. Furthermore, we propose a hesitant fuzzy linguistic Boole matrix clustering algorithm and compare the two clustering algorithms. The proposed clustering algorithms are applied in the field of judicial execution, which provides decision support for the executive judge to determine the focus of the investigation and the control. A clustering example verifies the clustering algorithm’s effectiveness in the context of hesitant fuzzy linguistic decision information.

An improved OPTICS clustering algorithm for discovering clusters with uneven densities

Intelligent Data Analysis ◽

10.3233/ida-205497 ◽

2021 ◽

Vol 25 (6) ◽

pp. 1453-1471

Author(s):

Chunhua Tang ◽

Han Wang ◽

Zhiwen Wang ◽

Xiangkun Zeng ◽

Huaran Yan ◽

...

Keyword(s):

Time Complexity ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Substantial Improvement ◽

Experimental Results ◽

High Time ◽

Parameter Setting ◽

K Nearest Neighbor ◽

Density Based Clustering

Most density-based clustering algorithms suffer from difficult parameter setting, high time complexity, poor noise recognition, and weak clustering for datasets with uneven density. To solve these problems, this paper proposes the FOP-OPTICS algorithm (Finding of the Ordering Peaks Based on OPTICS), which is a substantial improvement of OPTICS (Ordering Points To Identify the Clustering Structure). The proposed algorithm finds the demarcation point (DP) from the Augmented Cluster-Ordering generated by OPTICS and uses the reachability-distance of DP as the neighborhood radius eps of its corresponding cluster. It overcomes the weakness of most algorithms in clustering datasets with uneven densities. By computing the distance of the k-nearest neighbor of each point, it reduces the time complexity of OPTICS; by calculating density-mutation points within the clusters, it can efficiently recognize noise. The experimental results show that FOP-OPTICS has the lowest time complexity, and outperforms other algorithms in parameter setting and noise recognition.
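
The k-nearest-neighbor distance mentioned in this abstract is easy to compute by brute force, which may help make the quantity concrete. The sketch below is not FOP-OPTICS and says nothing about how that algorithm uses the values; the function name and sample points are assumptions, and the quadratic scan is for illustration only.

```python
from math import dist

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (brute force)."""
    result = []
    for i, p in enumerate(points):
        ordered = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(ordered[k - 1])
    return result

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
print(k_distances(points, 2))
```

The four corner points each have a 2-distance of 1.0, while the far point at (8, 8) has a much larger one. A jump like this in the sorted k-distances is a common hint for separating dense regions from noise.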

Complexity of Error-Correcting Code Based on Nearest Neighbor Search Algorithm

Mathematical Problems of Computer Science ◽

10.51408/1963-0030 ◽

2019 ◽

pp. 7-20

Author(s):

Levon Arsalanyan ◽

Hayk Danoyan

Keyword(s):

Time Complexity ◽

Nearest Neighbor ◽

Search Algorithm ◽

Error Correcting Code ◽

Error Correcting Codes ◽

Code Word ◽

Nearest Neighbor Search ◽

Perfect Codes ◽

Neighbor Search ◽

Weight Structures

The Nearest Neighbor search algorithm considered in this paper is well known (Elias algorithm). It uses error-correcting codes and constructs appropriate hash-coding schemas. These schemas preprocess the data in the form of lists. Each list is contained in some sphere, centered at a code-word. The algorithm is considered for the cases of perfect codes, so the spheres and, consequently, the lists do not intersect. As such codes exist only for a limited set of parameters, the algorithm is also considered for some generalizations of perfect codes, in which case the same data point may be contained in different lists. A formula for the time complexity of the algorithm is obtained for these cases, using coset weight structures of the mentioned codes.
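
The non-intersecting spheres described here can be checked on the smallest nontrivial perfect code. The sketch below uses the binary Hamming(7,4) code, chosen as an assumption for illustration since the abstract names no specific code, and shows only the preprocessing into lists, not the search procedure or the complexity analysis. The generator layout and function names are also assumptions.

```python
from itertools import product

# Parity part of a systematic Hamming(7,4) generator matrix [I | P].
PARITY_ROWS = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]

def encode(message):
    """Map 4 message bits to a 7-bit Hamming codeword."""
    parity = tuple(
        sum(bit * row[col] for bit, row in zip(message, PARITY_ROWS)) % 2
        for col in range(3)
    )
    return tuple(message) + parity

CODEWORDS = [encode(m) for m in product((0, 1), repeat=4)]

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def sphere_center(vector):
    """Return the unique codeword within Hamming distance 1 of vector."""
    centers = [c for c in CODEWORDS if hamming_distance(c, vector) <= 1]
    if len(centers) != 1:
        raise ValueError("spheres overlap or leave a gap; code is not perfect")
    return centers[0]

def build_lists(data):
    """Preprocess data into one list per codeword-centered sphere."""
    lists = {}
    for vector in data:
        lists.setdefault(sphere_center(vector), []).append(vector)
    return lists

all_vectors = list(product((0, 1), repeat=7))
lists = build_lists(all_vectors)
print(len(lists), {len(v) for v in lists.values()})  # prints 16 {8}
```

All 128 binary vectors of length 7 fall into exactly one of 16 radius-1 spheres of 8 vectors each, so every data point lands in a single list. For non-perfect codes this uniqueness fails, which is the case where a point may appear in several lists.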

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al.

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

A Multi-Relational Hierarchical Clustering Algorithm Based on Shared Nearest Neighbor Similarity

2007 International Conference on Machine Learning and Cybernetics ◽

10.1109/icmlc.2007.4370836 ◽

2007 ◽

Cited By ~ 1

Author(s):

Jing-Feng Guo ◽

Yu-Yan Zhao ◽

Jing Li

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Hierarchical Clustering Algorithm ◽

Shared Nearest Neighbor

Fast document summarization using locality sensitive hashing and memory access efficient node ranking

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v6i3.9030 ◽

2016 ◽

Vol 6 (3) ◽

pp. 945

Author(s):

Ercan Canhasi

Keyword(s):

Time Complexity ◽

Nearest Neighbor ◽

Linear Time ◽

Nearest Neighbor Search ◽

Memory Access ◽

Locality Sensitive Hashing ◽

Document Summarization ◽

Neighbor Search ◽

Node Ranking ◽

Similarity Graph

Text modeling and sentence selection are the fundamental steps of a typical extractive document summarization algorithm. The common text modeling method connects a pair of sentences based on their similarities. Even though it can effectively represent the sentence similarity graph of the given document(s), its big drawback is a large time complexity of $O(n^2)$, where n represents the number of sentences. The quadratic time complexity makes it impractical for large documents. In this paper we propose fast approximation algorithms for text modeling and sentence selection. Our text modeling algorithm reduces the time complexity to near-linear time by rapidly finding the most similar sentences to form the sentence similarity graph. In doing so we utilize Locality-Sensitive Hashing, a fast algorithm for approximate nearest neighbor search. For the sentence selection step we propose a simple memory-access-efficient node ranking method based on the idea of sequentially scanning only the neighborhood arrays. Experimentally, we show that sacrificing a rather small percentage of recall and precision in the quality of the produced summary can reduce the quadratic to sub-linear time complexity. We see big potential for the proposed method in text summarization for mobile devices and in big text data summarization for the internet of things on the cloud. In our experiments, besides evaluating the presented method on the standard general and query multi-document summarization tasks, we also tested it on a few alternative summarization tasks including general and query, timeline, and comparative summarization.
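
The idea of using Locality-Sensitive Hashing to avoid comparing every pair of sentences can be sketched with random hyperplanes, a common LSH family for cosine similarity. Whether the paper uses this particular family is not stated in the abstract, so treat the scheme, the function names and the toy vectors as assumptions.

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random Gaussian normal vectors, one per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def build_buckets(vectors, hyperplanes):
    """Group vector indices by signature; only bucket mates are compared later."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(signature(vec, hyperplanes), []).append(idx)
    return buckets

# Toy sentence vectors (hypothetical term weights over a 4-word vocabulary).
vectors = [[1.0, 2.0, 0.0, 0.0], [2.0, 4.0, 0.0, 0.0], [0.0, 0.0, 3.0, 1.0]]
planes = make_hyperplanes(num_planes=8, dim=4)
for sig, members in build_buckets(vectors, planes).items():
    print(sig, members)
```

Vectors pointing in the same direction, such as the first two above, always receive the same signature, while vectors at a wide angle are likely to be separated. Candidate pairs are then drawn only from within a bucket, trading a little recall for far fewer comparisons.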
