Metagenome sequence clustering with hash-based canopies

2017 ◽  
Vol 15 (06) ◽  
pp. 1740006 ◽  
Author(s):  
Mohammad Arifur Rahman ◽  
Nathan LaPierre ◽  
Huzefa Rangwala ◽  
Daniel Barbara

Metagenomics is the collective sequencing of co-existing microbial communities which are ubiquitous across various clinical and ecological environments. Due to the large volume and random short sequences (reads) obtained from community sequences, analysis of diversity, abundance and functions of different organisms within these communities are challenging tasks. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTU) and observe significant speedup with regards to run time when compared to different clustering algorithms. We also make our source code publicly available on Github. a

Author(s):  
Ming Cao ◽  
Qinke Peng ◽  
Ze-Gang Wei ◽  
Fei Liu ◽  
Yi-Fan Hou

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C[Formula: see text] library for fast, exact semi-global sequence alignment to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust can produce fewer clusters than other methods and that its SS is higher than that of other methods. The source codes of edClust are available from https://github.com/zhang134/EdClust.git under the GNU GPL license.


2021 ◽  
Vol 10 (4) ◽  
pp. 2170-2180
Author(s):  
Untari N. Wisesty ◽  
Tati Rajab Mengko

This paper aims to conduct an analysis of the SARS-CoV-2 genome variation was carried out by comparing the results of genome clustering using several clustering algorithms and distribution of sequence in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, the clustering algorithm has a weakness in grouping data that has very high dimensions such as genome data, so that a dimensional reduction process is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and autoencoder method with three models that produce 2, 10, and 50 features. The main contributions achieved were the dimensional reduction and clustering scheme of SARS-CoV-2 sequence data and the performance analysis of each experiment on each scheme and hyper parameters for each method. Based on the results of experiments conducted, PCA and DBSCAN algorithm achieve the highest silhouette score of 0.8770 with three clusters when using two features. However, dimensionality reduction using autoencoder need more iterations to converge. On the testing process with Indonesian sequence data, more than half of them enter one cluster and the rest are distributed in the other two clusters.


2021 ◽  
Author(s):  
Manuel Fritz ◽  
Michael Behringer ◽  
Dennis Tschechlov ◽  
Holger Schwarz

AbstractClustering is a fundamental primitive in manifold applications. In order to achieve valuable results in exploratory clustering analyses, parameters of the clustering algorithm have to be set appropriately, which is a tremendous pitfall. We observe multiple challenges for large-scale exploration processes. On the one hand, they require specific methods to efficiently explore large parameter search spaces. On the other hand, they often exhibit large runtimes, in particular when large datasets are analyzed using clustering algorithms with super-polynomial runtimes, which repeatedly need to be executed within exploratory clustering analyses. We address these challenges as follows: First, we present LOG-Means and show that it provides estimates for the number of clusters in sublinear time regarding the defined search space, i.e., provably requiring less executions of a clustering algorithm than existing methods. Second, we demonstrate how to exploit fundamental characteristics of exploratory clustering analyses in order to significantly accelerate the (repetitive) execution of clustering algorithms on large datasets. Third, we show how these challenges can be tackled at the same time. To the best of our knowledge, this is the first work which simultaneously addresses the above-mentioned challenges. In our comprehensive evaluation, we unveil that our proposed methods significantly outperform state-of-the-art methods, thus especially supporting novice analysts for exploratory clustering analyses in large-scale exploration processes.


mSystems ◽  
2016 ◽  
Vol 1 (1) ◽  
Author(s):  
Evguenia Kopylova ◽  
Jose A. Navas-Molina ◽  
Céline Mercier ◽  
Zhenjiang Zech Xu ◽  
Frédéric Mahé ◽  
...  

ABSTRACT Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1 ). Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release. IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1 ).


2013 ◽  
Vol 401-403 ◽  
pp. 1428-1431 ◽  
Author(s):  
Hai Yan Zhou ◽  
Jing Song Shan

A time sequence clustering algorithm based on edit distance is proposed in the paper, which solves the problem that the existing clustering algorithms for time sequence data is inefficient because of ignorance of different time span of time sequence data. Firstly, the algorithm calculates the distance between time sequences on which a distance matrix is determined. In the second place, for a given time sequence set, a forest with n binary trees is established in terms of the distance matrix and then merge the trees. Finally, a cluster clustering algorithm is called to dynamically adjust the clustering results, and then real-time clustering structure is obtained. Experimental results demonstrated that the algorithm has higher efficiency and clustering quality.


2020 ◽  
Vol 34 (04) ◽  
pp. 4412-4419 ◽  
Author(s):  
Zhao Kang ◽  
Wangtao Zhou ◽  
Zhitong Zhao ◽  
Junming Shao ◽  
Meng Han ◽  
...  

A plethora of multi-view subspace clustering (MVSC) methods have been proposed over the past few years. Researchers manage to boost clustering accuracy from different points of view. However, many state-of-the-art MVSC algorithms, typically have a quadratic or even cubic complexity, are inefficient and inherently difficult to apply at large scales. In the era of big data, the computational issue becomes critical. To fill this gap, we propose a large-scale MVSC (LMVSC) algorithm with linear order complexity. Inspired by the idea of anchor graph, we first learn a smaller graph for each view. Then, a novel approach is designed to integrate those graphs so that we can implement spectral clustering on a smaller graph. Interestingly, it turns out that our model also applies to single-view scenario. Extensive experiments on various large-scale benchmark data sets validate the effectiveness and efficiency of our approach with respect to state-of-the-art clustering methods.


2015 ◽  
Vol 11 (2) ◽  
pp. 23-39 ◽  
Author(s):  
B. Senthilnayaki ◽  
K. Venkatalakshmi ◽  
A. Kannan

E-Learning is a fast, just-in-time, and non-linear learning process, which is now widely applied in distributed and dynamic environments such as the World Wide Web. Ontology plays an important role in capturing and disseminating the real world knowledge for effective human computer interactions. However, engineering of domain ontologies is very labor intensive and time consuming. Some machine learning methods have been explored for automatic or semi-automatic discovery of domain ontologies. Nevertheless, both the accuracy and the computational efficiency of these methods need to be improved. While constructing large scale ontology for real-world applications such as e-learning, the ability to monitor the progress of students' learning performance is a critical issue. In this paper, a system is proposed for analyzing students' knowledge level obtained using Kolb's classification based on the students level of understanding and their learning style using cluster analysis. This system uses fuzzy logic and clustering algorithms to arrange their documents according to the level of their performance. Moreover, a new domain ontology discovery method is proposed uses contextual information of the knowledge sources from the e-Learning domain. This proposed system constructs ontology to provide an effective assistance in e-Learning. The proposed ontology discovery method has been empirically tested in an e-Learning environment for teaching the subject Database Management Systems. The salient contributions of this paper are the use of Jaccard Similarity measure and K-Means clustering algorithm for clustering of learners and the use of ontology for concept understanding and learning style identification. This helps in adaptive e-learning by providing suitable suggestions for decision making and it uses decision rules for providing intelligent e-Learning.


2011 ◽  
Vol 301-303 ◽  
pp. 1133-1138 ◽  
Author(s):  
Yan Xiang Fu ◽  
Wei Zhong Zhao ◽  
Hui Fang Ma

Data clustering has been received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The enlarging volumes of information emerging by the progress of technology, makes clustering of very large scale of data a challenging task. In order to deal with the problem, more researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel DBSCAN clustering algorithm based on Hadoop, which is a simple yet powerful parallel programming platform. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.


2005 ◽  
Vol 02 (02) ◽  
pp. 167-180
Author(s):  
SEUNG-JOON OH ◽  
JAE-YEARN KIM

Clustering of sequences is relatively less explored but it is becoming increasingly important in data mining applications such as web usage mining and bioinformatics. The web user segmentation problem uses web access log files to partition a set of users into clusters such that users within one cluster are more similar to one another than to the users in other clusters. Similarly, grouping protein sequences that share a similar structure can help to identify sequences with similar functions. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. Due to the high computational complexity of hierarchical clustering algorithms for clustering large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a splice dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.


Sign in / Sign up

Export Citation Format

Share Document