A SEQUENCE-ELEMENT-BASED HIERARCHICAL CLUSTERING ALGORITHM FOR CATEGORICAL SEQUENCE DATA

Author(s):  
SEUNG-JOON OH ◽  
JAE-YEARN KIM

Recently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, few existing clustering algorithms consider sequentiality. In this paper, we study how to cluster these sequence datasets. We propose a new similarity measure to compute the similarity between two sequences. In the proposed measure, subsets of a sequence are considered, and the more identical subsets there are, the more similar the two sequences. In addition, we propose a hierarchical clustering algorithm and an efficient method for measuring similarity. Using a splice dataset and synthetic datasets, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional clustering algorithms.

Author(s):  
Mohana Priya K ◽  
Pooja Ragavi S ◽  
Krishna Priya G

Clustering is the process of grouping objects into subsets that have meaning in the context of a particular problem. It does not rely on predefined classes. It is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used based on different applications. Sentence clustering is one of best clustering technique. Hierarchical Clustering Algorithm is applied for multiple levels for accuracy. For tagging purpose POS tagger, porter stemmer is used. WordNet dictionary is utilized for determining the similarity by invoking the Jiang Conrath and Cosine similarity measure. Grouping is performed with respect to the highest similarity measure value with a mean threshold. This paper incorporates many parameters for finding similarity between words. In order to identify the disambiguated words, the sense identification is performed for the adjectives and comparison is performed. semcor and machine learning datasets are employed. On comparing with previous results for WSD, our work has improvised a lot which gives a percentage of 91.2%


Mathematics ◽  
2021 ◽  
Vol 9 (4) ◽  
pp. 370
Author(s):  
Shuangsheng Wu ◽  
Jie Lin ◽  
Zhenyu Zhang ◽  
Yushu Yang

The fuzzy clustering algorithm has become a research hotspot in many fields because of its better clustering effect and data expression ability. However, little research focuses on the clustering of hesitant fuzzy linguistic term sets (HFLTSs). To fill in the research gaps, we extend the data type of clustering to hesitant fuzzy linguistic information. A kind of hesitant fuzzy linguistic agglomerative hierarchical clustering algorithm is proposed. Furthermore, we propose a hesitant fuzzy linguistic Boole matrix clustering algorithm and compare the two clustering algorithms. The proposed clustering algorithms are applied in the field of judicial execution, which provides decision support for the executive judge to determine the focus of the investigation and the control. A clustering example verifies the clustering algorithm’s effectiveness in the context of hesitant fuzzy linguistic decision information.


2021 ◽  
Vol 8 (10) ◽  
pp. 43-50
Author(s):  
Truong et al. ◽  

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers are interested in the problem of clustering categorical data and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR) which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.


2010 ◽  
Vol 439-440 ◽  
pp. 481-485
Author(s):  
Li Xia Liu ◽  
Yi Qi Zhuang

Clustering techniques are often used in Web log mining to analyze user’s interest on the web pages. Based on the analysis of advantages and disadvantages of the application of classic clustering algorithm in Web log data mining, the paper brought out a kind of hierarchical K-means Web log clustering algorithm, which integrated K-means clustering algorithm and cohesion-based hierarchical clustering algorithm and overcame shortcoming of high time complexity of hierarchical clustering algorithm. The clustering effect of the algorithm is better than K-means clustering and fit for clustering process of large amount data. The result analysis of practical Web log data clustering also proves the validity of the algorithm.


2005 ◽  
Vol 02 (02) ◽  
pp. 167-180
Author(s):  
SEUNG-JOON OH ◽  
JAE-YEARN KIM

Clustering of sequences is relatively less explored but it is becoming increasingly important in data mining applications such as web usage mining and bioinformatics. The web user segmentation problem uses web access log files to partition a set of users into clusters such that users within one cluster are more similar to one another than to the users in other clusters. Similarly, grouping protein sequences that share a similar structure can help to identify sequences with similar functions. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. Due to the high computational complexity of hierarchical clustering algorithms for clustering large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a splice dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.


2020 ◽  
pp. 1-11
Author(s):  
Yufeng Li ◽  
HaiTian Jiang ◽  
Jiyong Lu ◽  
Xiaozhong Li ◽  
Zhiwei Sun ◽  
...  

Many classical clustering algorithms have been fitted into MapReduce, which provides a novel solution for clustering big data. However, several iterations are required to reach an acceptable result in most of the algorithms. For each iteration, a new MapReduce job must be executed to load the dataset into main memory, which results in high I/O overhead and poor efficiency. BIRCH algorithm stores only the statistical information of objects with CF entries and CF tree to cluster big data, but with the increase of the tree nodes, the main memory will be insufficient to contain more objects. Hence, BIRCH has to reduce the tree, which will degrade the clustering quality and decelerate the whole execution efficiency. To deal with the problem, BIRCH was fitted into MapReduce called MR-BIRCH in this paper. In contrast to a great number of MapReduce-based algorithms, MR-BIRCH loads dataset only once, and the dataset is processed parallel in several machines. The complexity and scalability were analyzed to evaluate the quality of MR-BIRCH, and MR-BIRCH was compared with Python sklearn BIRCH and Apache Mahout k-means on real-world and synthetic datasets. Experimental results show, most of the time, MR-BIRCH was better or equal to sklearn BIRCH, and it was competitive to Mahout k-means.


2019 ◽  
Vol 2019 ◽  
pp. 1-10
Author(s):  
Yaohui Liu ◽  
Dong Liu ◽  
Fang Yu ◽  
Zhengming Ma

Clustering is widely used in data analysis, and density-based methods are developed rapidly in the recent 10 years. Although the state-of-art density peak clustering algorithms are efficient and can detect arbitrary shape clusters, they are nonsphere type of centroid-based methods essentially. In this paper, a novel local density hierarchical clustering algorithm based on reverse nearest neighbors, RNN-LDH, is proposed. By constructing and using a reverse nearest neighbor graph, the extended core regions are found out as initial clusters. Then, a new local density metric is defined to calculate the density of each object; meanwhile, the density hierarchical relationships among the objects are built according to their densities and neighbor relations. Finally, each unclustered object is classified to one of the initial clusters or noise. Results of experiments on synthetic and real data sets show that RNN-LDH outperforms the current clustering methods based on density peak or reverse nearest neighbors.


2019 ◽  
Vol 30 (08) ◽  
pp. 1950062
Author(s):  
Ping Guo ◽  
Wenjie Jiang ◽  
Yuchi Liu

Membrane computing, also known as P system, is a distributed and parallel computation framework models. Hierarchical clustering is one of the most basic and widely applied clustering algorithms among all clustering algorithms. In this paper, the combination of membrane computing and hierarchical clustering algorithm is studied. A cell-like hierarchical clustering P system with priority evolution rules and promoters is designed by using the maximum parallelism of membrane computing. The feasibility and effectiveness of the designed P system are verified by the examples.


Sign in / Sign up

Export Citation Format

Share Document