manta: a Clustering Algorithm for Weighted Ecological Networks

mSystems ◽  
2020 ◽  
Vol 5 (1) ◽  
Author(s):  
Lisa Röttjers ◽  
Karoline Faust

ABSTRACT Microbial network inference and analysis have become successful approaches to extract biological hypotheses from microbial sequencing data. Network clustering is a crucial step in this analysis. Here, we present a novel heuristic network clustering algorithm, manta, which clusters nodes in weighted networks. In contrast to existing algorithms, manta exploits negative edges while differentiating between weak and strong cluster assignments. For this reason, manta can tackle gradients and is able to avoid clustering problematic nodes. In addition, manta assesses the robustness of cluster assignment, which makes it more robust to noisy data than most existing tools. On noise-free synthetic data, manta equals or outperforms existing algorithms, while it identifies biologically relevant subcompositions in real-world data sets. On a cheese rind data set, manta identifies groups of taxa that correspond to intermediate moisture content in the rinds, while on an ocean data set, the algorithm identifies a cluster of organisms that were reduced in abundance during a transition period but did not correlate strongly to biochemical parameters that changed during the transition period. These case studies demonstrate the power of manta as a tool that identifies biologically informative groups within microbial networks.

IMPORTANCE manta comes with unique strengths, such as the abilities to identify nodes that represent an intermediate between clusters, to exploit negative edges, and to assess the robustness of cluster membership. manta does not require parameter tuning, is straightforward to install and run, and can be easily combined with existing microbial network inference tools.
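
manta operates on weighted networks in which edges can be positive or negative, such as microbial co-occurrence networks produced by network inference tools. Below is a minimal sketch, assuming only networkx, of how such a signed, weighted network could be built and exported to GraphML as input for a tool like manta; the node names and weights are invented, and the exact command-line flags for running manta should be taken from its own documentation.

import networkx as nx

# Build a small signed, weighted association network (all names and weights invented).
G = nx.Graph()
G.add_edge("OTU_1", "OTU_2", weight=0.8)    # positive association
G.add_edge("OTU_2", "OTU_3", weight=0.6)
G.add_edge("OTU_3", "OTU_4", weight=-0.7)   # negative association
G.add_edge("OTU_4", "OTU_5", weight=0.9)

# Export to GraphML, a format that weighted-network clustering tools can read.
nx.write_graphml(G, "cooccurrence_network.graphml")
# The file would then be passed to manta on the command line; consult manta's
# documentation for the exact input/output flags.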

2019 ◽  
Author(s):  
Lisa Röttjers ◽  
Karoline Faust

Abstract Microbial network inference and analysis have become a successful approach to generate biological hypotheses from microbial sequencing data. Network clustering is a crucial step in this analysis. Here, we present a novel heuristic flow-based network clustering algorithm, manta, which equals or outperforms existing algorithms on noise-free synthetic data. manta comes with unique strengths, such as the ability to identify nodes that represent an intermediate between clusters, to exploit negative edges, and to assess the robustness of cluster membership. manta does not require parameter tuning, is straightforward to install and run, and can easily be combined with existing microbial network inference tools.


2018 ◽  
Vol 15 (3) ◽  
pp. 18-37 ◽  
Author(s):  
Weifeng Pan ◽  
Jilei Dong ◽  
Kun Liu ◽  
Jing Wang

This article describes how the sheer number and variety of services make accurately discovering a desired service difficult. Service clustering is an effective way to facilitate service discovery. However, existing approaches are usually designed for a single type of service document and neglect to fully use the topic and topological information in service profiles and usage histories. To avoid these limitations, this article presents a novel service clustering approach. It adopts a bipartite network to describe the topological structure of service usage histories and uses the SimRank algorithm to measure the topological similarity of services; it applies Latent Dirichlet Allocation to extract topics from service profiles and further quantifies the topic similarity of services; it quantifies the overall similarity of services by integrating the topological and topic similarities; and it uses the Chameleon clustering algorithm to cluster the services. An empirical evaluation on a real-world data set highlights the benefits provided by the combination of topological and topic similarities.
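
As an illustration of how the two similarities could be combined, the sketch below uses scikit-learn's LatentDirichletAllocation for topic similarity, networkx's simrank_similarity on a small bipartite usage graph for topological similarity, and a weighted sum of the two. It is not the authors' code: the toy services, users, and the mixing parameter alpha are assumptions, and agglomerative clustering stands in for Chameleon, which is not available in standard libraries.

import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

# Toy service profiles and a toy usage history (both invented).
services = ["weather forecast api", "rain and temperature data", "flight booking service"]

# Topic similarity: LDA topics from the service profiles, compared by cosine similarity.
X = CountVectorizer().fit_transform(services)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
topic_sim = cosine_similarity(topics)

# Topological similarity: SimRank on a bipartite user-service usage graph.
B = nx.Graph()
B.add_edges_from([("u1", 0), ("u1", 1), ("u2", 1), ("u2", 2)])   # users u1, u2
sim = nx.simrank_similarity(B)
topo_sim = np.array([[sim[i][j] for j in range(len(services))]
                     for i in range(len(services))])

# Weighted combination of the two similarities (alpha is an assumed parameter).
alpha = 0.5
combined = alpha * topic_sim + (1 - alpha) * topo_sim

# Chameleon is not in standard libraries; agglomerative clustering on the
# combined dissimilarity stands in ("metric=" requires scikit-learn >= 1.2).
labels = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                 linkage="average").fit_predict(1 - combined)
print(labels)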


2011 ◽  
Vol 2011 ◽  
pp. 1-14 ◽  
Author(s):  
Chunzhong Li ◽  
Zongben Xu

The structure of a data set is of critical importance in identifying clusters, especially differences in density. In this paper, we present a clustering algorithm based on density consistency: a filtering process that identifies points sharing the same structural feature and assigns them to the same cluster. The method is not restricted by cluster shape or by high-dimensional data sets, and it is robust to noise and outliers. Extensive experiments on synthetic and real-world data sets validate the proposed clustering algorithm.
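
The abstract does not detail the algorithm, but the sketch below illustrates the general idea of density consistency under stated assumptions: estimate each point's local density from its k nearest neighbours and link neighbouring points only when their densities are similar, so that regions sharing the same density structure end up in the same cluster. The data, the value of k, and the consistency threshold are all invented.

import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 1.0, (50, 2))])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, idx = nn.kneighbors(X)
density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)   # inverse mean k-NN distance

G = nx.Graph()
G.add_nodes_from(range(len(X)))
for i in range(len(X)):
    for j in idx[i, 1:]:
        # Link neighbours only if their local densities differ by less than a factor of 2.
        if max(density[i], density[j]) / min(density[i], density[j]) < 2.0:
            G.add_edge(i, int(j))

clusters = list(nx.connected_components(G))
print(len(clusters), "clusters found")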


2021 ◽  
pp. 1-12
Author(s):  
Anjana Gosain ◽  
Sonika Dahiya

DKIFCM (Density Based Kernelized Intuitionistic Fuzzy C Means) is a newly proposed clustering algorithm based on outlier identification, kernel functions, and the intuitionistic fuzzy approach. DKIFCM is inspired by the Kernelized Intuitionistic Fuzzy C Means (KIFCM) algorithm and addresses its performance issues in the presence of outliers. It first identifies outliers based on the density of the data, and then computes clusters accurately by mapping the data to a high-dimensional feature space. The performance and effectiveness of various algorithms are evaluated on synthetic 2D data sets, such as the Diamond data sets (D10, D12, and D15) and the noisy Dunn data set, as well as on high-dimensional real-world data sets such as Fisher-Iris, Wine, and Wisconsin Breast Cancer. The results of DKIFCM are compared with those of other algorithms, including Fuzzy C Means (FCM), Intuitionistic FCM (IFCM), Kernelized Intuitionistic FCM (KIFCM), and density-oriented FCM (DOFCM), and the performance of the proposed algorithm is found to be superior even in the presence of outliers and noise. Key advantages of DKIFCM are outlier identification, robustness to noise, and accurate centroid computation.
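
The sketch below illustrates, under stated assumptions, two ingredients named in the abstract: density-based outlier identification via k-nearest-neighbour distances, and the kernel-induced distance used when data are mapped to a high-dimensional feature space with a Gaussian kernel. It is a simplification for illustration, not the DKIFCM algorithm itself; the data, k, and sigma are invented.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (60, 2)),
               rng.normal(5, 0.5, (60, 2)),
               rng.uniform(-4, 9, (5, 2))])          # a few scattered outliers

# (1) Density-based outlier identification: points whose mean distance to their
# k nearest neighbours is unusually large are treated as outliers.
k = 6
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
mean_knn = dist[:, 1:].mean(axis=1)
outliers = mean_knn > mean_knn.mean() + 2 * mean_knn.std()
X_clean = X[~outliers]

# (2) Kernel-induced distance for a Gaussian (RBF) kernel:
# ||phi(x) - phi(v)||^2 = K(x,x) - 2K(x,v) + K(v,v) = 2 * (1 - K(x,v)).
def kernel_distance_sq(x, v, sigma=1.0):
    K_xv = np.exp(-np.sum((x - v) ** 2) / (2 * sigma ** 2))
    return 2.0 * (1.0 - K_xv)

print(outliers.sum(), "points flagged as outliers")
print(kernel_distance_sq(X_clean[0], X_clean[1]))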


2020 ◽  
Author(s):  
Renato Cordeiro de Amorim

In a real-world data set there is always the possibility, rather high in our opinion, that different features have different degrees of relevance. Most machine learning algorithms deal with this fact by either selecting or deselecting features in the data preprocessing phase. However, we maintain that even among relevant features there may be different degrees of relevance, and this should be taken into account during the clustering process. With over 50 years of history, K-Means is arguably the most popular partitional clustering algorithm there is. The first K-Means-based clustering algorithm to compute feature weights was designed just over 30 years ago. Various such algorithms have been designed since, but there has not been, to our knowledge, a survey integrating empirical evidence of cluster recovery ability, common flaws, and possible directions for future research. This paper elaborates on the concept of feature weighting and addresses these issues by critically analysing some of the most popular, or innovative, feature weighting mechanisms based on K-Means.
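
A small sketch of the feature-weighting idea, in the spirit of W-k-means-style algorithms: distances are computed with per-feature weights, and after each iteration the weights are re-estimated so that features with small within-cluster dispersion receive larger weights. The update rule and the parameter beta follow the common formulation in this family of methods; this is illustrative rather than a reproduction of any specific surveyed algorithm.

import numpy as np

def weighted_kmeans(X, k, beta=2.0, n_iter=20, seed=0):
    """Toy feature-weighted k-means; weights follow a W-k-means-style update."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]
    w = np.full(d, 1.0 / d)                       # start from equal feature weights
    for _ in range(n_iter):
        # Assignment: weighted squared Euclidean distance to each centre.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2 * w ** beta).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update centres.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        # Re-estimate weights: small within-cluster dispersion -> large weight.
        D = np.array([((X[:, f] - centers[labels, f]) ** 2).sum()
                      for f in range(d)]) + 1e-12
        w = 1.0 / ((D[:, None] / D[None, :]) ** (1.0 / (beta - 1.0))).sum(axis=1)
    return labels, centers, w

# Two informative features plus one pure-noise feature; the noise feature
# should receive a small learned weight.
rng = np.random.default_rng(1)
X = np.hstack([np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))]),
               rng.uniform(-10, 10, (100, 1))])
labels, centers, w = weighted_kmeans(X, k=2)
print(np.round(w, 3))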


2009 ◽  
Vol 2009 ◽  
pp. 1-16 ◽  
Author(s):  
David J. Miller ◽  
Carl A. Nelson ◽  
Molly Boeka Cannon ◽  
Kenneth P. Cannon

Fuzzy clustering algorithms are helpful when a data set contains subgroupings of points with indistinct boundaries and overlap between the clusters. Traditional methods have been extensively studied and used on real-world data, but require users to have some knowledge of the outcome a priori in order to determine how many clusters to look for. Additionally, iterative algorithms choose the optimal number of clusters based on one of several performance measures. In this study, the authors compare the performance of three algorithms (fuzzy c-means, Gustafson-Kessel, and an iterative version of Gustafson-Kessel) when clustering a traditional data set as well as real-world geophysics data collected from an archaeological site in Wyoming. Areas of interest in the data were identified using a crisp cutoff value as well as a fuzzy α-cut to determine which provided better elimination of noise and non-relevant points. Results indicate that the α-cut method eliminates more noise than the crisp cutoff values and that the iterative version of the fuzzy clustering algorithm is able to select an optimum number of subclusters within a point set (in both the traditional and real-world data), leading to proper indication of regions of interest for further expert analysis.
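
The sketch below, which assumes the scikit-fuzzy package and synthetic data, shows the α-cut idea described above: run fuzzy c-means, then keep only points whose maximum membership exceeds a chosen α, so that noisy or ambiguous points are excluded from the regions of interest. It is not the authors' code and uses plain fuzzy c-means only, not Gustafson-Kessel.

import numpy as np
import skfuzzy as fuzz

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (60, 2)),
               rng.normal(4, 0.5, (60, 2)),
               rng.uniform(-3, 7, (10, 2))])      # noise points

# scikit-fuzzy expects data with shape (n_features, n_samples).
cntr, u, _, _, _, _, fpc = fuzz.cluster.cmeans(X.T, c=2, m=2.0, error=1e-4, maxiter=300)

labels = u.argmax(axis=0)
alpha = 0.8
confident = u.max(axis=0) >= alpha     # alpha-cut: drop ambiguous / noisy points
print("kept", confident.sum(), "of", len(X), "points; FPC =", round(fpc, 3))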


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effect of the first phase (dense-region detection) on the results of subspace clustering; our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover clusters in all subspaces with high quality, and it is also more efficient than PROCLUS.
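
As a rough illustration of the first phase (subspace relevance analysis), the sketch below flags a dimension as carrying dense structure when some of its histogram bins are much denser than a uniform spread would predict. The data, bin count, and density factor are assumptions, and the actual MOSCL procedure is more involved than this.

import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (80, 4)), rng.normal(5, 0.3, (80, 4))])
X[:, 2:] = rng.uniform(-10, 10, (160, 2))       # dimensions 2 and 3 carry no cluster structure

n_bins, density_factor = 10, 2.0
relevant_dims = []
for d in range(X.shape[1]):
    counts, _ = np.histogram(X[:, d], bins=n_bins)
    expected = len(X) / n_bins
    # A dimension is considered relevant if some bins are much denser than uniform.
    if counts.max() > density_factor * expected:
        relevant_dims.append(d)

print("dense structure detected in dimensions:", relevant_dims)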


2018 ◽  
Vol 30 (6) ◽  
pp. 1624-1646 ◽  
Author(s):  
Qidong Liu ◽  
Ruisheng Zhang ◽  
Zhili Zhao ◽  
Zhenghai Wang ◽  
Mengyao Jiao ◽  
...  

Minimax similarity stresses the connectedness of points via mediating elements rather than favoring high mutual similarity. This grouping principle yields superior clustering results when mining arbitrarily shaped clusters in data. However, it is not robust against noise and outliers in the data. There are two main problems with the grouping principle: first, a single object that is far away from all other objects defines a separate cluster, and second, two connected clusters would be regarded as two parts of one cluster. In order to solve these problems, we propose a robust minimum spanning tree (MST)-based clustering algorithm in this letter. First, we separate the connected objects by applying a density-based coarsening phase, resulting in a low-rank matrix in which each element denotes a supernode formed by combining a set of nodes. Then a greedy method is presented to partition those supernodes by working on the low-rank matrix. Instead of removing the longest edges from the MST, our algorithm groups the data set based on the minimax similarity. Finally, the assignment of all data points is achieved through their corresponding supernodes. Experimental results on many synthetic and real-world data sets show that our algorithm consistently outperforms the compared clustering algorithms.
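
The minimax similarity mentioned above has a convenient characterisation: the minimax distance between two points equals the largest edge weight on the path connecting them in a minimum spanning tree. The sketch below demonstrates this on invented data with networkx (>= 2.6); it illustrates the grouping principle only, not the authors' supernode-based algorithm.

import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

D = squareform(pdist(X))
G = nx.from_numpy_array(D)                     # complete weighted graph
T = nx.minimum_spanning_tree(G)

def minimax_distance(tree, i, j):
    path = nx.shortest_path(tree, i, j)        # the path in a tree is unique
    return max(tree[u][v]["weight"] for u, v in zip(path, path[1:]))

print(minimax_distance(T, 0, 5))     # small: same cluster
print(minimax_distance(T, 0, 25))    # large: must cross the gap between clusters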


2021 ◽  
pp. 016555152110184
Author(s):  
Gunjan Chandwani ◽  
Anil Ahlawat ◽  
Gaurav Dubey

Document retrieval plays an important role in knowledge management because it helps us discover relevant information in existing data. This article proposes a cluster-based inverted indexing algorithm for document retrieval. First, pre-processing is performed to remove unnecessary and redundant words from the documents. Then, the documents are indexed by the cluster-based inverted indexing algorithm, which is developed by integrating the piecewise fuzzy C-means (piFCM) clustering algorithm with inverted indexing. After the documents are indexed, query matching is performed for user queries using the Bhattacharyya distance. Finally, query optimisation is done with the Pearson correlation coefficient, and the relevant documents are retrieved. The performance of the proposed algorithm is analysed on the WebKB and Twenty Newsgroups data sets. The analysis shows that the proposed algorithm offers high performance, with a precision of 1, recall of 0.70 and F-measure of 0.8235. The proposed document retrieval system retrieves the most relevant documents and speeds up the storage and retrieval of information.
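
For the query-matching step, the sketch below shows how a Bhattacharyya distance between a query's term distribution and each cluster's term distribution could be computed; the vocabulary and counts are invented, and the piFCM clustering and Pearson-correlation optimisation steps are not reproduced here.

import numpy as np

def bhattacharyya_distance(p, q):
    p = np.asarray(p, dtype=float); p /= p.sum()
    q = np.asarray(q, dtype=float); q /= q.sum()
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    return -np.log(bc + 1e-12)

# Term-frequency vectors over a shared vocabulary (query vs. two document clusters).
query     = [3, 0, 1, 0]
cluster_a = [40, 2, 10, 1]
cluster_b = [1, 30, 0, 25]

scores = {name: bhattacharyya_distance(query, dist)
          for name, dist in [("cluster_a", cluster_a), ("cluster_b", cluster_b)]}
best = min(scores, key=scores.get)       # smaller distance = better match
print(scores, "->", best)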


Database ◽  
2021 ◽  
Vol 2021 ◽  
Author(s):  
Shaikh Farhad Hossain ◽  
Ming Huang ◽  
Naoaki Ono ◽  
Aki Morita ◽  
Shigehiko Kanaya ◽  
...  

Abstract A biomarker is a measurable indicator of a disease or abnormal state of the body that plays an important role in disease diagnosis, prognosis and treatment. Biomarkers have become a significant topic due to their versatile usage in the medical field and in rapid detection of the presence or severity of some diseases. The volume of biomarker data is rapidly increasing, and the identified data are scattered. To provide comprehensive information, the explosively growing data need to be recorded in a single platform, yet there is no open-source, freely available, comprehensive online biomarker database. To fulfill this purpose, we have developed a human biomarker database as part of the KNApSAcK family databases, which contains a vast quantity of information on the relationships between biomarkers and diseases. We have classified the diseases into 18 disease classes, mostly according to the National Center for Biotechnology Information definitions. Apart from this database development, we have also performed disease classification by separately using protein and metabolite biomarkers, based on the network clustering algorithm DPClusO and hierarchical clustering. Finally, we reached conclusions about the relationships among the disease classes. The human biomarker database can be accessed online, and the inter-disease relationships may be helpful in understanding the molecular mechanisms of diseases. To our knowledge, this is one of the first approaches to classify diseases based on biomarkers. Database URL: http://www.knapsackfamily.com/Biomarker/top.php

