A SEQUENCE-ELEMENT-BASED HIERARCHICAL CLUSTERING ALGORITHM FOR CATEGORICAL SEQUENCE DATA
Recently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, few existing clustering algorithms consider sequentiality. In this paper, we study how to cluster these sequence datasets. We propose a new similarity measure to compute the similarity between two sequences. In the proposed measure, subsets of a sequence are considered, and the more identical subsets there are, the more similar the two sequences. In addition, we propose a hierarchical clustering algorithm and an efficient method for measuring similarity. Using a splice dataset and synthetic datasets, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional clustering algorithms.