A SEQUENCE-ELEMENT-BASED HIERARCHICAL CLUSTERING ALGORITHM FOR CATEGORICAL SEQUENCE DATA

Recently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, few existing clustering algorithms consider sequentiality. In this paper, we study how to cluster these sequence datasets. We propose a new similarity measure to compute the similarity between two sequences. In the proposed measure, subsets of a sequence are considered, and the more identical subsets there are, the more similar the two sequences. In addition, we propose a hierarchical clustering algorithm and an efficient method for measuring similarity. Using a splice dataset and synthetic datasets, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional clustering algorithms.
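
The abstract does not give the similarity formula, so the following is only a minimal sketch of the general idea that two sequences are more similar the more identical sub-parts they share. It assumes that "subsets" means contiguous subsequences of up to `max_len` elements and that the shared count is normalized as a Jaccard ratio; both choices, and the names `subsequences` and `shared_subset_similarity`, are illustrative assumptions rather than the authors' measure.

```python
def subsequences(seq, max_len=3):
    """Return the set of contiguous subsequences of seq with 1..max_len elements.

    Assumption: "subsets of a sequence" is read here as contiguous runs,
    which keeps the order of the elements (the sequential nature of the data).
    """
    return {
        tuple(seq[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(seq) - n + 1)
    }


def shared_subset_similarity(a, b, max_len=3):
    """Jaccard ratio of the two subsequence sets (an assumed normalization)."""
    sa, sb = subsequences(a, max_len), subsequences(b, max_len)
    if not sa and not sb:
        return 1.0  # two empty sequences are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    # Same elements, different order: the reversed sequence scores lower.
    print(shared_subset_similarity("ACGT", "ACGA"))
    print(shared_subset_similarity("ACGT", "TGCA"))
```

Because the runs are order-preserving, two sequences built from the same symbols in a different order share only their single-element runs, which is one way a measure can take sequentiality into account.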

Handling WSD using Hierarchical Clustering Algorithm with sentences

International Journal of Scientific Research in Science Engineering and Technology ◽

10.32628/ijsrset1841120 ◽

2018 ◽

pp. 83-88

Author(s):

Mohana Priya K ◽

Pooja Ragavi S ◽

Krishna Priya G

Keyword(s):

Hierarchical Clustering ◽

Similarity Measure ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Cosine Similarity Measure ◽

Hierarchical Clustering Algorithm ◽

Multiple Levels ◽

Pos Tagger ◽

Sentence Clustering

Clustering is the process of grouping objects into subsets that have meaning in the context of a particular problem. It does not rely on predefined classes. It is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used in different applications. Sentence clustering is one of the best clustering techniques. A hierarchical clustering algorithm is applied at multiple levels for accuracy. A POS tagger and the Porter stemmer are used for tagging. The WordNet dictionary is used to determine similarity by invoking the Jiang-Conrath and cosine similarity measures. Grouping is performed with respect to the highest similarity value, with a mean threshold. This paper incorporates many parameters for finding similarity between words. To identify the disambiguated words, sense identification is performed for the adjectives and a comparison is made. SemCor and machine learning datasets are employed. Compared with previous results for WSD, our work shows a considerable improvement, reaching 91.2%.
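
As a rough illustration of the cosine measure and the mean threshold mentioned in this abstract, the sketch below scores sentence pairs on plain bag-of-words counts and keeps the pairs whose score exceeds the mean over all pairs. It is an assumption-laden simplification: it leaves out POS tagging, stemming, WordNet and the Jiang-Conrath measure, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(s1, s2):
    """Cosine of the angle between two bag-of-words count vectors."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(
        sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0


def pairs_above_mean(sentences):
    """Return index pairs whose similarity exceeds the mean pairwise similarity.

    Assumption: the "mean threshold" is the average score over all pairs.
    """
    scores = {
        (i, j): cosine_similarity(sentences[i], sentences[j])
        for i, j in combinations(range(len(sentences)), 2)
    }
    if not scores:
        return []
    threshold = sum(scores.values()) / len(scores)
    return sorted(pair for pair, score in scores.items() if score > threshold)


if __name__ == "__main__":
    sentences = [
        "the bank approved the loan",
        "the bank rejected the loan",
        "the river bank was muddy",
    ]
    print(pairs_above_mean(sentences))
```

In this toy input the two financial sentences are grouped, while the sentence using the other sense of "bank" falls below the threshold.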

Hesitant Fuzzy Linguistic Agglomerative Hierarchical Clustering Algorithm and Its Application in Judicial Practice

Mathematics ◽

10.3390/math9040370 ◽

2021 ◽

Vol 9 (4) ◽

pp. 370

Author(s):

Shuangsheng Wu ◽

Jie Lin ◽

Zhenyu Zhang ◽

Yushu Yang

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Agglomerative Hierarchical Clustering ◽

Research Gaps ◽

Judicial Practice ◽

Linguistic Term ◽

Clustering Effect ◽

Hierarchical Clustering Algorithm ◽

Fuzzy Linguistic

The fuzzy clustering algorithm has become a research hotspot in many fields because of its better clustering effect and data expression ability. However, little research focuses on the clustering of hesitant fuzzy linguistic term sets (HFLTSs). To fill in the research gaps, we extend the data type of clustering to hesitant fuzzy linguistic information. A kind of hesitant fuzzy linguistic agglomerative hierarchical clustering algorithm is proposed. Furthermore, we propose a hesitant fuzzy linguistic Boole matrix clustering algorithm and compare the two clustering algorithms. The proposed clustering algorithms are applied in the field of judicial execution, which provides decision support for the executive judge to determine the focus of the investigation and the control. A clustering example verifies the clustering algorithm’s effectiveness in the context of hesitant fuzzy linguistic decision information.

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness (MMR) algorithm, which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome this shortcoming, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

A Kind of Hierarchical K-Means Web Log Clustering Algorithm

Key Engineering Materials ◽

10.4028/www.scientific.net/kem.439-440.481 ◽

2010 ◽

Vol 439-440 ◽

pp. 481-485

Author(s):

Li Xia Liu ◽

Yi Qi Zhuang

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Web Pages ◽

Web Log Mining ◽

Log Data ◽

Web Log ◽

Advantages And Disadvantages ◽

Hierarchical Clustering Algorithm ◽

Result Analysis

Clustering techniques are often used in Web log mining to analyze users' interest in web pages. Based on an analysis of the advantages and disadvantages of applying classic clustering algorithms to Web log data mining, the paper proposes a hierarchical K-means Web log clustering algorithm that integrates the K-means clustering algorithm with a cohesion-based hierarchical clustering algorithm and overcomes the high time complexity of hierarchical clustering. The clustering effect of the algorithm is better than that of K-means clustering, and the algorithm is suited to clustering large amounts of data. An analysis of the results of clustering practical Web log data also demonstrates the validity of the algorithm.
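
The abstract does not describe how the two algorithms are integrated. One common way to combine them, sketched below purely as an assumption, is to run K-means with more clusters than needed and then merge the resulting groups agglomeratively, so that the expensive pairwise step works on a handful of groups rather than on every object. The centroid-distance merge rule, the deterministic initialization and all function names are illustrative choices, not the paper's method.

```python
import math


def centroid(group):
    """Mean of a non-empty list of equal-length numeric tuples."""
    return tuple(sum(axis) / len(group) for axis in zip(*group))


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic start (k evenly spaced input points)."""
    step = len(points) / k
    centers = [points[int(i * step)] for i in range(k)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    return [g for g in groups if g]


def merge_closest(groups, target):
    """Agglomerative step: repeatedly merge the two groups with the nearest centroids."""
    groups = [list(g) for g in groups]
    while len(groups) > target:
        centers = [centroid(g) for g in groups]
        pairs = [(a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))]
        i, j = min(pairs, key=lambda ab: math.dist(centers[ab[0]], centers[ab[1]]))
        groups[i].extend(groups[j])
        del groups[j]
    return groups


def hierarchical_kmeans(points, k_fine, k_final):
    """Over-segment with k-means, then merge down to k_final groups."""
    return merge_closest(kmeans(points, k_fine), k_final)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    for group in hierarchical_kmeans(pts, k_fine=4, k_final=2):
        print(sorted(group))
```

With this arrangement the quadratic pairwise search runs over `k_fine` groups only, which is where the saving over a purely hierarchical method would come from.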

A SCALABLE CLUSTERING METHOD FOR CATEGORICAL SEQUENCE DATA

International Journal of Computational Methods ◽

10.1142/s0219876205000417 ◽

2005 ◽

Vol 02 (02) ◽

pp. 167-180

Author(s):

SEUNG-JOON OH ◽

JAE-YEARN KIM

Keyword(s):

Nearest Neighbor ◽

Sequence Data ◽

Clustering Algorithms ◽

K Nearest Neighbor ◽

Clustering Method ◽

Scalable Clustering ◽

Log Files ◽

Web Access

Clustering of sequences is relatively less explored but it is becoming increasingly important in data mining applications such as web usage mining and bioinformatics. The web user segmentation problem uses web access log files to partition a set of users into clusters such that users within one cluster are more similar to one another than to the users in other clusters. Similarly, grouping protein sequences that share a similar structure can help to identify sequences with similar functions. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. Due to the high computational complexity of hierarchical clustering algorithms for clustering large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a splice dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.
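
A minimal sketch of the sampling plus k-nearest-neighbor idea, under stated assumptions: a small sample is assumed to have been clustered already by some (expensive) method, and every remaining sequence takes the majority label of its `k` most similar sample sequences. The similarity used here is the standard-library `difflib.SequenceMatcher` ratio, chosen only for convenience; the abstract does not say which measure or voting rule the authors use, and `assign_by_knn` is a hypothetical name.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in sequence similarity in [0, 1] (an assumption, not the paper's measure)."""
    return SequenceMatcher(None, a, b).ratio()


def assign_by_knn(sequence, labelled_sample, k=3):
    """Label a sequence by majority vote among its k most similar sample sequences.

    labelled_sample maps each sampled sequence to the cluster label it received
    when the sample was clustered.
    """
    ranked = sorted(
        labelled_sample, key=lambda s: similarity(sequence, s), reverse=True
    )
    votes = Counter(labelled_sample[s] for s in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Assumed output of clustering a small sample of the full dataset.
    sample = {"AAAA": 0, "AAAT": 0, "CCCC": 1, "CCCG": 1}
    for seq in ["AATA", "CCGC"]:
        print(seq, assign_by_knn(seq, sample))
```

Only the sample goes through the costly clustering step; each remaining sequence then needs one pass over the sample instead of a comparison with the whole dataset.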

MR-BIRCH: A scalable MapReduce-based birch clustering algorithm

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202079 ◽

2020 ◽

pp. 1-11

Author(s):

Yufeng Li ◽

HaiTian Jiang ◽

Jiyong Lu ◽

Xiaozhong Li ◽

Zhiwei Sun ◽

...

Keyword(s):

Big Data ◽

Real World ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Statistical Information ◽

Main Memory ◽

Acceptable Result ◽

Clustering Quality ◽

Synthetic Datasets

Many classical clustering algorithms have been fitted into MapReduce, which provides a novel solution for clustering big data. However, most of these algorithms require several iterations to reach an acceptable result. For each iteration, a new MapReduce job must be executed to load the dataset into main memory, which results in high I/O overhead and poor efficiency. The BIRCH algorithm stores only the statistical information of objects, using CF entries and a CF tree, to cluster big data, but as the number of tree nodes increases, main memory becomes insufficient to hold more objects. Hence, BIRCH has to reduce the tree, which degrades clustering quality and lowers overall execution efficiency. To deal with this problem, this paper fits BIRCH into MapReduce; the result is called MR-BIRCH. In contrast to many MapReduce-based algorithms, MR-BIRCH loads the dataset only once, and the dataset is processed in parallel on several machines. The complexity and scalability were analyzed to evaluate the quality of MR-BIRCH, and MR-BIRCH was compared with Python sklearn BIRCH and Apache Mahout k-means on real-world and synthetic datasets. Experimental results show that, most of the time, MR-BIRCH was better than or equal to sklearn BIRCH and was competitive with Mahout k-means.
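
The "statistical information" held in a CF entry is, in the standard BIRCH formulation, the triple (N, LS, SS): the number of points, their linear sum, and the sum of their squared norms. The sketch below shows that triple and its additivity, which lets two summaries be merged without revisiting the points. It illustrates standard BIRCH only; how MR-BIRCH distributes or combines entries is not stated in the abstract, and the class name `CF` is simply a label for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    """Clustering feature: point count, linear sum, and sum of squared norms."""
    n: int
    ls: tuple
    ss: float

    @classmethod
    def from_point(cls, point):
        return cls(1, tuple(point), float(sum(x * x for x in point)))

    def __add__(self, other):
        # Additivity: the CF of a union is the component-wise sum of the CFs.
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Root mean squared distance of the summarized points from the centroid.
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))


if __name__ == "__main__":
    merged = CF.from_point((0.0, 0.0)) + CF.from_point((2.0, 0.0))
    print(merged, merged.centroid(), merged.radius())
```

Since centroid and radius are computed from the triple alone, summaries built separately over different parts of a dataset can be added together, which is the property that makes the structure attractive for partitioned processing.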

A fast hierarchical clustering algorithm for large-scale protein sequence data sets

Computers in Biology and Medicine ◽

10.1016/j.compbiomed.2014.02.016 ◽

2014 ◽

Vol 48 ◽

pp. 94-101 ◽

Cited By ~ 10

Author(s):

Sándor M. Szilágyi ◽

László Szilágyi

Keyword(s):

Hierarchical Clustering ◽

Protein Sequence ◽

Large Scale ◽

Clustering Algorithm ◽

Sequence Data ◽

Data Sets ◽

Protein Sequence Data ◽

Hierarchical Clustering Algorithm

A hierarchical clustering algorithm for categorical sequence data

Information Processing Letters ◽

10.1016/j.ipl.2004.04.002 ◽

2004 ◽

Vol 91 (3) ◽

pp. 135-140 ◽

Cited By ~ 15

Author(s):

Seung-Joon Oh ◽

Jae-Yearn Kim

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Sequence Data ◽

Hierarchical Clustering Algorithm

A Novel Local Density Hierarchical Clustering Algorithm Based on Reverse Nearest Neighbors

Mathematical Problems in Engineering ◽

10.1155/2019/2959017 ◽

2019 ◽

Vol 2019 ◽

pp. 1-10

Author(s):

Yaohui Liu ◽

Dong Liu ◽

Fang Yu ◽

Zhengming Ma

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Local Density ◽

Clustering Algorithms ◽

Real Data ◽

Nearest Neighbors ◽

Clustering Methods ◽

Density Peak ◽

Hierarchical Clustering Algorithm

Clustering is widely used in data analysis, and density-based methods have developed rapidly over the past 10 years. Although state-of-the-art density peak clustering algorithms are efficient and can detect clusters of arbitrary shape, they are essentially non-spherical centroid-based methods. In this paper, a novel local density hierarchical clustering algorithm based on reverse nearest neighbors, RNN-LDH, is proposed. By constructing and using a reverse nearest neighbor graph, the extended core regions are identified as initial clusters. Then, a new local density metric is defined to calculate the density of each object; meanwhile, the density hierarchical relationships among the objects are built according to their densities and neighbor relations. Finally, each unclustered object is assigned to one of the initial clusters or labeled as noise. Results of experiments on synthetic and real data sets show that RNN-LDH outperforms current clustering methods based on density peaks or reverse nearest neighbors.
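
For readers unfamiliar with reverse nearest neighbors, the sketch below builds the basic relation: the reverse k-nearest neighbors of a point are the points that list it among their own k nearest neighbors. It covers only this building block under a Euclidean-distance assumption; the extended core regions, the local density metric and the later assignment steps of RNN-LDH are not specified in the abstract and are not attempted here.

```python
import math


def knn(points, k):
    """Map each point index to the indices of its k nearest other points."""
    result = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        result[i] = others[:k]
    return result


def reverse_knn(points, k):
    """Map each point index to the set of points that count it among their k-NN."""
    reverse = {i: set() for i in range(len(points))}
    for i, neighbours in knn(points, k).items():
        for j in neighbours:
            reverse[j].add(i)
    return reverse


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.5,), (10.0,)]
    print(reverse_knn(pts, k=1))
```

In the toy data, the isolated point at 10.0 has a nearest neighbor but appears in no other point's neighbor list, so its reverse neighbor set is empty; counts of this kind are what make the relation useful as a local density signal.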

A P system for hierarchical clustering

International Journal of Modern Physics C ◽

10.1142/s0129183119500621 ◽

2019 ◽

Vol 30 (08) ◽

pp. 1950062

Author(s):

Ping Guo ◽

Wenjie Jiang ◽

Yuchi Liu

Keyword(s):

Parallel Computation ◽

Hierarchical Clustering ◽

Clustering Algorithm ◽

Membrane Computing ◽

Clustering Algorithms ◽

P System ◽

Hierarchical Clustering Algorithm

Membrane computing, also known as the P system, is a distributed and parallel computational framework. Hierarchical clustering is one of the most basic and widely applied clustering algorithms. In this paper, the combination of membrane computing and the hierarchical clustering algorithm is studied. A cell-like hierarchical clustering P system with priority evolution rules and promoters is designed by using the maximum parallelism of membrane computing. The feasibility and effectiveness of the designed P system are verified by examples.