Time Sequence Clustering Based on Edit Distance

This paper proposes a time sequence clustering algorithm based on edit distance. It addresses the inefficiency of existing clustering algorithms for time sequence data, which ignore the differing time spans of the sequences. First, the algorithm calculates the distances between time sequences and builds a distance matrix from them. Second, for a given set of time sequences, it establishes a forest of n binary trees according to the distance matrix and then merges the trees. Finally, a clustering algorithm is called to adjust the clustering results dynamically, yielding a real-time clustering structure. Experimental results demonstrate that the algorithm has higher efficiency and clustering quality.
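
The first two steps described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the time sequences are already discretized into symbols, uses plain Levenshtein distance as the edit distance, and merges trees with a single-linkage rule. The function names and the stopping parameter `k` are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance; the two sequences may differ in length."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of pairwise edit distances."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = edit_distance(seqs[i], seqs[j])
    return dist


def merge_forest(dist, k):
    """Start with one tree per sequence and merge the closest pair
    (single linkage, an assumption) until k trees remain; requires k >= 1."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(dist[i][j]
                              for i in clusters[p[0]]
                              for j in clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


if __name__ == "__main__":
    seqs = ["abcd", "abce", "xyz", "xyzw"]
    print(merge_forest(distance_matrix(seqs), k=2))
```

Because edit distance is defined for sequences of unequal length, no resampling to a common time span is needed before the matrix is built.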

Metagenome sequence clustering with hash-based canopies

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720017400066 ◽

2017 ◽

Vol 15 (06) ◽

pp. 1740006 ◽

Cited By ~ 6

Author(s):

Mohammad Arifur Rahman ◽

Nathan LaPierre ◽

Huzefa Rangwala ◽

Daniel Barbara

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

State Of The Art ◽

Sequence Data ◽

Clustering Algorithms ◽

Clustering Methods ◽

Operational Taxonomic Units ◽

Sequence Clustering ◽

Scalable Clustering ◽

Metagenome Sequence

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Due to the large volume of short, random sequences (reads) obtained from community sequencing, analyzing the diversity, abundance and functions of the different organisms within these communities is challenging. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTU) and observe significant speedup with regard to run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.
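
The canopy idea can be illustrated with a small sketch. The abstract does not name its three hashing schemes, so the one below is an assumption: each read is keyed by the smallest hash over its k-mers (a min-hash style signature), and reads sharing a key form a canopy that a more expensive clustering method could then refine. The names `canopy_key` and `build_canopies` are hypothetical, and reads are assumed to be at least `k` characters long.

```python
import hashlib
from collections import defaultdict


def kmers(read, k):
    """Set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def stable_hash(text):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def canopy_key(read, k=4):
    """Smallest k-mer hash; reads with the same k-mer set get the same key."""
    return min(stable_hash(km) for km in kmers(read, k))


def build_canopies(reads, k=4):
    """Partition reads into canopies in a single pass over the data."""
    canopies = defaultdict(list)
    for read in reads:
        canopies[canopy_key(read, k)].append(read)
    return list(canopies.values())


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACGTA", "TTTTGGGG"]
    for canopy in build_canopies(reads):
        print(canopy)
```

Each read is hashed once, so the partitioning cost grows linearly with the number of reads; the pairwise comparisons of the downstream clustering method are then confined to each canopy.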

Comparison of dimensionality reduction and clustering methods for SARS-CoV-2 genome

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i4.2803 ◽

2021 ◽

Vol 10 (4) ◽

pp. 2170-2180

Author(s):

Untari N. Wisesty ◽

Tati Rajab Mengko

Keyword(s):

Dimensionality Reduction ◽

Dimensional Reduction ◽

Clustering Algorithm ◽

Sequence Data ◽

Clustering Algorithms ◽

Gaussian Mixture Models ◽

Reduction Process ◽

Principal Component ◽

Gaussian Mixture ◽

Clustering Methods

This paper analyzes SARS-CoV-2 genome variation by comparing the results of genome clustering across several clustering algorithms and the distribution of sequences in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, clustering algorithms are weak at grouping very high-dimensional data such as genome data, so a dimensionality reduction process is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and an autoencoder method, with three models that produce 2, 10, and 50 features. The main contributions are the dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data and the performance analysis of each experiment for each scheme and for the hyperparameters of each method. Based on the results of the experiments, PCA combined with the DBSCAN algorithm achieves the highest silhouette score, 0.8770, with three clusters when using two features. However, dimensionality reduction using the autoencoder needs more iterations to converge. In the testing process with Indonesian sequence data, more than half of the sequences fall into one cluster and the rest are distributed across the other two clusters.
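
For readers unfamiliar with the silhouette score reported above, a minimal pure-Python sketch of the standard definition follows. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the point's score is (b - a) / max(a, b). The function below is illustrative, not the paper's code: it assumes Euclidean distance, at least two clusters, and no special treatment of DBSCAN noise labels.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points (Euclidean distance, >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(silhouette_score(pts, [0, 0, 1, 1]))  # well separated: close to 1
    print(silhouette_score(pts, [0, 1, 0, 1]))  # poor grouping: negative
```

Scores range from -1 to 1, so a value near 0.88 indicates compact, well-separated clusters in the reduced feature space.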

EdClust: A heuristic sequence clustering method with higher sensitivity

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021500360 ◽

2021 ◽

Author(s):

Ming Cao ◽

Qinke Peng ◽

Ze-Gang Wei ◽

Fei Liu ◽

Yi-Fan Hou

Keyword(s):

Large Scale ◽

Sequence Data ◽

Clustering Algorithms ◽

Clustering Methods ◽

Sequencing Data ◽

Clustering Method ◽

Cluster Number ◽

Sequence Clustering ◽

Downstream Analysis ◽

Heuristic Clustering

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment, to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust produces fewer clusters than the other methods and that its SS is higher than theirs. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
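
The general shape of a greedy heuristic clustering method can be sketched as below. This is a generic seed-based illustration, not edClust itself: a pure-Python Levenshtein distance stands in for Edlib's alignment, sequences are processed longest first, and each sequence joins the first seed within a hypothetical `max_dist` threshold or otherwise becomes a new seed.

```python
def edit_distance(a, b):
    """Levenshtein distance (a simple stand-in for an Edlib alignment)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, max_dist):
    """Assign each sequence to the first seed within max_dist edits."""
    seeds = []     # one representative sequence per cluster
    clusters = []  # members of each cluster, parallel to seeds
    for seq in sorted(seqs, key=len, reverse=True):  # longest first
        for idx, seed in enumerate(seeds):
            if edit_distance(seq, seed) <= max_dist:
                clusters[idx].append(seq)
                break
        else:
            # no seed was close enough: this sequence starts a new cluster
            seeds.append(seq)
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    data = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "TTTTTTTA"]
    print(greedy_cluster(data, max_dist=1))
```

Each sequence is compared only against seeds rather than against every other sequence, which is where the low computational cost of such heuristics comes from. Assigning to the first matching seed rather than the best one is also one way such methods can end up with more clusters than necessary.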

Study on the Time Sequence Data Stream Clustering Algorithm Based on Property Information Contribution

International Journal of Advancements in Computing Technology ◽

10.4156/ijact.vol5.issue2.20 ◽

2013 ◽

Vol 5 (2) ◽

pp. 148-154

Author(s):

ZHANG Hongqi ◽

WANG Chunguang

Keyword(s):

Data Stream ◽

Clustering Algorithm ◽

Sequence Data ◽

Time Sequence ◽

Stream Clustering ◽

Data Stream Clustering

On Expanded and Improved Affinity Propagation Clustering Algorithm

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.48-49.753 ◽

2011 ◽

Vol 48-49 ◽

pp. 753-756

Author(s):

Xin Quan Chen

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Grid Cell ◽

Space Complexity ◽

Affinity Propagation ◽

Data Sets ◽

Time And Space ◽

Affinity Propagation Clustering ◽

Clustering Quality ◽

Time And Space Complexity

To address the shortcomings of the Affinity Propagation (AP) algorithm, we present two extended and improved AP algorithms. The AP algorithm based on Grid Cell (APGC) is an effective extension of the AP algorithm to the level of grid cells, and the AP clustering algorithm based on Near neighbour Sampling (APNS) aims to improve time and space complexity. Simulated comparison experiments on the three algorithms show that APGC and APNS clearly improve on the AP algorithm in time and space complexity. They not only achieve good clustering quality on massive data sets but also filter out noise and isolated points well. They are therefore two effective clustering algorithms with good prospects for application. Finally, several research directions are presented.

MR-BIRCH: A scalable MapReduce-based birch clustering algorithm

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202079 ◽

2020 ◽

pp. 1-11

Author(s):

Yufeng Li ◽

HaiTian Jiang ◽

Jiyong Lu ◽

Xiaozhong Li ◽

Zhiwei Sun ◽

...

Keyword(s):

Big Data ◽

Real World ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Statistical Information ◽

Main Memory ◽

Acceptable Result ◽

Clustering Quality ◽

Synthetic Datasets

Many classical clustering algorithms have been adapted to MapReduce, which provides a novel solution for clustering big data. However, most of these algorithms require several iterations to reach an acceptable result. For each iteration, a new MapReduce job must be executed to load the dataset into main memory, which results in high I/O overhead and poor efficiency. The BIRCH algorithm stores only the statistical information of objects, using CF entries and a CF tree, to cluster big data, but as the number of tree nodes increases, main memory becomes insufficient to hold more objects. Hence, BIRCH has to shrink the tree, which degrades the clustering quality and slows the overall execution. To deal with this problem, this paper adapts BIRCH to MapReduce as MR-BIRCH. In contrast to a great number of MapReduce-based algorithms, MR-BIRCH loads the dataset only once, and the dataset is processed in parallel on several machines. The complexity and scalability were analyzed to evaluate MR-BIRCH, and MR-BIRCH was compared with Python sklearn BIRCH and Apache Mahout k-means on real-world and synthetic datasets. Experimental results show that, most of the time, MR-BIRCH was better than or equal to sklearn BIRCH, and it was competitive with Mahout k-means.
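
The "statistical information" that BIRCH keeps is the clustering feature (CF): the point count N, the per-dimension linear sum LS, and the sum of squared norms SS. A minimal sketch of a CF entry follows; the class and method names are illustrative, and the abstract does not describe how MR-BIRCH combines partial results, so the merge shown here only demonstrates the additivity property that makes parallel processing plausible.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    """BIRCH clustering feature: (N, LS, SS) summary of a set of points."""
    n: int       # number of points summarized
    ls: tuple    # linear sum of the points, per dimension
    ss: float    # sum of squared norms of the points

    @classmethod
    def from_point(cls, point):
        return cls(1, tuple(point), sum(x * x for x in point))

    def merge(self, other):
        """CFs are additive: the CF of a union is the component-wise sum."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root-mean-square distance of the points from the centroid."""
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


if __name__ == "__main__":
    part_a = CF.from_point((0.0, 0.0))  # e.g. summarized on one machine
    part_b = CF.from_point((2.0, 0.0))  # e.g. summarized on another
    total = part_a.merge(part_b)
    print(total.centroid(), total.radius())
```

Since a CF has fixed size regardless of how many points it summarizes, and partial CFs can be merged without revisiting the raw points, each data partition only needs to be read once.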

An improved ant algorithm with LDA-based representation for text document clustering

Journal of Information Science ◽

10.1177/0165551516638784 ◽

2016 ◽

Vol 43 (2) ◽

pp. 275-292 ◽

Cited By ~ 24

Author(s):

Aytug Onan ◽

Hasan Bulut ◽

Serdar Korukoglu

Keyword(s):

Clustering Algorithm ◽

Latent Dirichlet Allocation ◽

Clustering Algorithms ◽

Document Clustering ◽

Clustering Methods ◽

Initial Value ◽

Text Document ◽

Clustering Quality ◽

Text Features

Document clustering can be applied in document organisation and browsing, document summarisation and classification. The identification of an appropriate representation for textual documents is extremely important for the performance of clustering or classification algorithms. Textual documents suffer from the high dimensionality and irrelevancy of text features. Besides, conventional clustering algorithms suffer from several shortcomings, such as slow convergence and sensitivity to the initial value. To tackle the problems of conventional clustering algorithms, metaheuristic algorithms are frequently applied to clustering. In this paper, an improved ant clustering algorithm is presented, where two novel heuristic methods are proposed to enhance the clustering quality of ant-based clustering. In addition, the latent Dirichlet allocation (LDA) is used to represent textual documents in a compact and efficient way. The clustering quality of the proposed ant clustering algorithm is compared to the conventional clustering algorithms using 25 text benchmarks in terms of F-measure values. The experimental results indicate that the proposed clustering scheme outperforms the compared conventional and metaheuristic clustering methods for textual documents.

Research on Improved Clustering Algorithm on Web Usage Mining Based on Scientific Analysis of Web Materials

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.63-64.863 ◽

2011 ◽

Vol 63-64 ◽

pp. 863-867 ◽

Cited By ~ 1

Author(s):

Bin Li ◽

Jin Yang ◽

Cai Ming Liu ◽

Jian Dong Zhang ◽

Yan Zhang

Keyword(s):

Clustering Algorithm ◽

Hamming Distance ◽

Clustering Algorithms ◽

Distance Matrix ◽

Threshold Value ◽

Web Usage Mining ◽

Web Usage ◽

User Clustering ◽

Similar Index ◽

Browsing Behavior

Clustering analysis is an important method for studying Web users’ browsing behavior and identifying potential customers in Web usage mining. Traditional user clustering algorithms are not very accurate. In this paper, we give two improved user clustering algorithms, which are based on the associated matrix of users’ hits while browsing a website. From this matrix, an improved Hamming distance matrix is generated by defining the minimum norm or the generalized relative Hamming distance between any two vectors. Then, clusters of similar users are obtained by setting a threshold value. In the last step of our algorithm, the clustering results are confirmed by defining the clustering’s Similar Index and setting a sub-algorithm. Finally, test examples show that the new algorithms are more accurate than the old one, and real log data show that the improved algorithms are practical.
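
A simplified sketch of threshold-based user clustering on hit vectors is given below. The abstract does not define its minimum norm or generalized relative Hamming distance, so this example assumes the ordinary normalized Hamming distance and groups users transitively (connected components) whenever two users fall within the threshold. All names are hypothetical.

```python
def hamming(u, v):
    """Fraction of positions where two equal-length hit vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def threshold_clusters(vectors, threshold):
    """Group users whose distance is <= threshold, transitively (union-find)."""
    n = len(vectors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if hamming(vectors[i], vectors[j]) <= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # rows: users; columns: pages; 1 means the user visited the page
    hits = [
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 0],
    ]
    print(threshold_clusters(hits, threshold=0.25))
```

Raising the threshold merges more users into each group, so its value directly controls how coarse the resulting user clusters are.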

Genetically-Modified K-Medoid Clustering Algorithm for Heterogeneous Data Set

Handbook of Research on Applications and Implementations of Machine Learning Techniques - Advances in Computational Intelligence and Robotics ◽

10.4018/978-1-5225-9902-9.ch004 ◽

2020 ◽

pp. 63-76

Author(s):

Dhayanithi Jaganathan ◽

Akilandeswari Jeyapal

Keyword(s):

Clustering Algorithm ◽

Genetically Modified ◽

Clustering Algorithms ◽

Distance Matrix ◽

Heterogeneous Data ◽

Distance Measures ◽

Experimental Result ◽

Data Set ◽

Individual Distance ◽

Modified Algorithm

Clustering of data that are heterogeneous in nature is a subject of recent research. The data generated in many real-world applications, such as data from IoT environments and big data domains, are heterogeneous in nature. Most available clustering algorithms deal with homogeneous data, and few algorithms discussed in the literature deal with data of both numeric and categorical nature. Applying a clustering algorithm designed for homogeneous data to heterogeneous data leads to information loss. This chapter proposes a new genetically-modified k-medoid clustering algorithm (GMODKMD), which takes as input a fused distance matrix obtained by applying an individual distance measure to each attribute based on its characteristics. GMODKMD is a modified algorithm in which the Davies-Bouldin index is applied in the iteration phase. The proposed algorithm is compared with existing techniques based on accuracy. The experimental results show that the modified algorithm with the fused distance matrix outperforms the existing clustering techniques.
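
One way to picture a fused distance matrix for mixed data is the Gower-style sketch below. The chapter's actual per-attribute measures and fusion rule are not given in the abstract, so everything here is an assumption: numeric attributes use a range-normalized absolute difference, categorical attributes use a 0/1 mismatch, and the per-attribute distances are averaged. The function name and the `numeric_cols` parameter are hypothetical.

```python
def fused_distance_matrix(rows, numeric_cols):
    """Pairwise distances for records with numeric and categorical fields."""
    n_cols = len(rows[0])
    # Range of each numeric column, used to scale differences into [0, 1]
    ranges = {}
    for c in numeric_cols:
        values = [row[c] for row in rows]
        ranges[c] = (max(values) - min(values)) or 1.0  # avoid division by 0

    def distance(r, s):
        total = 0.0
        for c in range(n_cols):
            if c in ranges:
                total += abs(r[c] - s[c]) / ranges[c]   # numeric attribute
            else:
                total += 0.0 if r[c] == s[c] else 1.0   # categorical attribute
        return total / n_cols

    return [[distance(r, s) for s in rows] for r in rows]


if __name__ == "__main__":
    records = [(1.0, "red"), (3.0, "red"), (5.0, "blue")]
    for line in fused_distance_matrix(records, numeric_cols={0}):
        print(line)
```

A matrix like this can be passed to any medoid-based method, since k-medoid clustering needs only pairwise distances and never computes means of the raw attributes.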

A SEQUENCE-ELEMENT-BASED HIERARCHICAL CLUSTERING ALGORITHM FOR CATEGORICAL SEQUENCE DATA

International Journal of Information Technology & Decision Making ◽

10.1142/s0219622005001398 ◽

2005 ◽

Vol 04 (01) ◽

pp. 81-96 ◽

Cited By ~ 5

Author(s):

SEUNG-JOON OH ◽

JAE-YEARN KIM

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Sequence Data ◽

Clustering Algorithms ◽

Scientific Data ◽

Sequence Element ◽

Hierarchical Clustering Algorithm ◽

Synthetic Datasets ◽

Better Than

Recently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, few existing clustering algorithms consider sequentiality. In this paper, we study how to cluster these sequence datasets. We propose a new similarity measure to compute the similarity between two sequences. In the proposed measure, subsets of a sequence are considered, and the more identical subsets there are, the more similar the two sequences. In addition, we propose a hierarchical clustering algorithm and an efficient method for measuring similarity. Using a splice dataset and synthetic datasets, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional clustering algorithms.
