Topology and Topic-Aware Service Clustering

2018 ◽  
Vol 15 (3) ◽  
pp. 18-37 ◽  
Author(s):  
Weifeng Pan ◽  
Jilei Dong ◽  
Kun Liu ◽  
Jing Wang

With so many services of so many types now available, accurately discovering the desired services has become a problem. Service clustering is an effective way to facilitate service discovery. However, existing approaches are usually designed for a single type of service document and do not fully use the topic and topological information in service profiles and usage histories. To overcome these limitations, this article presents a novel service clustering approach. It adopts a bipartite network to describe the topological structure of service usage histories and uses the SimRank algorithm to measure the topological similarity of services; it applies Latent Dirichlet Allocation to extract topics from service profiles and then quantifies the topic similarity of services; it quantifies the overall similarity of services by integrating the topological and topic similarities; and it uses the Chameleon clustering algorithm to cluster the services. An empirical evaluation on a real-world data set highlights the benefits of combining topological and topic similarities.
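
The topological step can be illustrated with a minimal sketch of SimRank on a bipartite usage network. This is not the authors' implementation: the `usage` mapping, the consumer and service names, and the `decay` and `iterations` defaults are all assumptions chosen for illustration.

```python
from itertools import product


def bipartite_simrank(usage, decay=0.8, iterations=10):
    """Return SimRank similarities between services.

    usage maps each consumer (hypothetical: a mashup or user) to the set of
    services it invoked. decay and iterations are illustrative defaults.
    """
    consumers = sorted(usage)
    services = sorted({s for used in usage.values() for s in used})
    # Neighbour lists for both sides of the bipartite network.
    used_by = {s: [c for c in consumers if s in usage[c]] for s in services}
    uses = {c: sorted(usage[c]) for c in consumers}

    def identity(nodes):
        # Every node starts out fully similar to itself and to nothing else.
        return {(a, b): float(a == b) for a, b in product(nodes, repeat=2)}

    def step(nodes, neighbours, other_sim):
        # Two nodes are similar when their neighbours on the other side are.
        new = {}
        for a, b in product(nodes, repeat=2):
            na, nb = neighbours[a], neighbours[b]
            if a == b:
                new[a, b] = 1.0
            elif not na or not nb:
                new[a, b] = 0.0
            else:
                total = sum(other_sim[x, y] for x in na for y in nb)
                new[a, b] = decay * total / (len(na) * len(nb))
        return new

    sim_s, sim_c = identity(services), identity(consumers)
    for _ in range(iterations):
        # Both sides are updated from the previous iteration's values.
        sim_s, sim_c = step(services, used_by, sim_c), step(consumers, uses, sim_s)
    return sim_s


# Hypothetical usage histories: two mashups share services, one does not.
usage = {"m1": {"maps", "geocode"}, "m2": {"maps", "geocode"}, "m3": {"payments"}}
sim = bipartite_simrank(usage)
```

In this toy network, `maps` and `geocode` come out as similar because the same consumers use them, while `payments` shares no consumers with either and stays at zero.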

Genetics ◽  
2001 ◽  
Vol 159 (2) ◽  
pp. 699-713
Author(s):  
Noah A Rosenberg ◽  
Terry Burke ◽  
Kari Elo ◽  
Marcus W Feldman ◽  
Paul J Freidlin ◽  
...  

Abstract: We tested the utility of genetic cluster analysis in ascertaining population structure of a large data set for which population structure was previously known. Each of 600 individuals representing 20 distinct chicken breeds was genotyped for 27 microsatellite loci, and individual multilocus genotypes were used to infer genetic clusters. Individuals from each breed were inferred to belong mostly to the same cluster. The clustering success rate, measuring the fraction of individuals that were properly inferred to belong to their correct breeds, was consistently ~98%. When markers of highest expected heterozygosity were used, genotypes that included at least 8–10 highly variable markers from among the 27 markers genotyped also achieved >95% clustering success. When 12–15 highly variable markers and only 15–20 of the 30 individuals per breed were used, clustering success was at least 90%. We suggest that in species for which population structure is of interest, databases of multilocus genotypes at highly variable markers should be compiled. These genotypes could then be used as training samples for genetic cluster analysis and to facilitate assignments of individuals of unknown origin to populations. The clustering algorithm has potential applications in defining the within-species genetic units that are useful in problems of conservation.
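
A clustering success rate of this kind can be sketched as follows. The abstract does not say how clusters are matched to breeds, so this sketch assumes a simple majority rule (each cluster is labelled with its most common breed). The function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def clustering_success_rate(breeds, clusters):
    """Fraction of individuals whose cluster's majority breed is their own.

    Assumption: each inferred cluster is labelled with its most common breed.
    Several clusters may end up with the same label under this rule.
    """
    members = defaultdict(list)
    for breed, cluster in zip(breeds, clusters):
        members[cluster].append(breed)
    # In each cluster, the individuals of the majority breed count as correct.
    correct = sum(Counter(b).most_common(1)[0][1] for b in members.values())
    return correct / len(breeds)


# Toy example: six individuals, two breeds, one individual misplaced.
breeds = ["A", "A", "A", "B", "B", "B"]
clusters = [0, 0, 1, 1, 1, 1]
rate = clustering_success_rate(breeds, clusters)  # 5 of 6 individuals
```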


2021 ◽  
Vol 18 (1) ◽  
pp. 34-57
Author(s):  
Weifeng Pan ◽  
Xinxin Xu ◽  
Hua Ming ◽  
Carl K. Chang

Mashup technology has become a promising way to develop and deliver applications on the web. Automatically organizing Mashups into functionally similar clusters helps improve the performance of Mashup discovery. Although many approaches aim to cluster Mashups, they focus solely on semantic similarities to guide the clustering process and are unable to use both the structural and the semantic information in Mashup profiles. In this paper, a novel approach to clustering Mashups into groups is proposed, which integrates structural similarity and semantic similarity using fuzzy AHP (fuzzy analytic hierarchy process). The structural similarity is computed from usage histories between Mashups and Web APIs using the SimRank algorithm. The semantic similarity is computed from the descriptions and tags of Mashups using LDA (latent Dirichlet allocation). A clustering algorithm based on the genetic algorithm is employed to cluster Mashups. Comprehensive experiments are performed on a real data set collected from ProgrammableWeb. The results show the effectiveness of the approach when compared with two kinds of conventional approaches.
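
The semantic and integration steps can be sketched under stated assumptions. The abstract does not say how topic distributions are compared or how fuzzy AHP turns into weights, so this sketch assumes a Jensen-Shannon based similarity between LDA topic distributions and a plain weighted sum. The weight `w_structural` stands in for whatever fuzzy AHP would produce, and all names are hypothetical.

```python
import math


def js_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (base 2); lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))


def integrate(structural, semantic, w_structural=0.5):
    """Weighted sum of two similarities (assumed form of the integration)."""
    return w_structural * structural + (1.0 - w_structural) * semantic


# Hypothetical topic distributions of two Mashups over three LDA topics.
topics_a = [0.7, 0.2, 0.1]
topics_b = [0.6, 0.3, 0.1]
semantic = js_similarity(topics_a, topics_b)
combined = integrate(structural=0.4, semantic=semantic, w_structural=0.25)
```

Identical distributions give a similarity of 1 and distributions with disjoint support give 0, which keeps the semantic score on the same scale as a SimRank score before the two are combined.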


mSystems ◽  
2020 ◽  
Vol 5 (1) ◽  
Author(s):  
Lisa Röttjers ◽  
Karoline Faust

ABSTRACT: Microbial network inference and analysis have become successful approaches to extract biological hypotheses from microbial sequencing data. Network clustering is a crucial step in this analysis. Here, we present a novel heuristic network clustering algorithm, manta, which clusters nodes in weighted networks. In contrast to existing algorithms, manta exploits negative edges while differentiating between weak and strong cluster assignments. For this reason, manta can tackle gradients and is able to avoid clustering problematic nodes. In addition, manta assesses the robustness of cluster assignment, which makes it more robust to noisy data than most existing tools. On noise-free synthetic data, manta equals or outperforms existing algorithms, while it identifies biologically relevant subcompositions in real-world data sets. On a cheese rind data set, manta identifies groups of taxa that correspond to intermediate moisture content in the rinds, while on an ocean data set, the algorithm identifies a cluster of organisms that were reduced in abundance during a transition period but did not correlate strongly to biochemical parameters that changed during the transition period. These case studies demonstrate the power of manta as a tool that identifies biologically informative groups within microbial networks.

IMPORTANCE: manta comes with unique strengths, such as the abilities to identify nodes that represent an intermediate between clusters, to exploit negative edges, and to assess the robustness of cluster membership. manta does not require parameter tuning, is straightforward to install and run, and can be easily combined with existing microbial network inference tools.


2011 ◽  
Vol 2011 ◽  
pp. 1-14 ◽  
Author(s):  
Chunzhong Li ◽  
Zongben Xu

The structure of a data set is of critical importance in identifying clusters, especially differences in density. In this paper, we present a clustering algorithm based on density consistency, a filtering process that identifies points sharing the same structural feature and assigns them to the same cluster. The method is not restricted by cluster shape or by high-dimensional data, and it is robust to noise and outliers. Extensive experiments on synthetic and real-world data sets validate the proposed clustering algorithm.


2021 ◽  
pp. 1-12
Author(s):  
Anjana Gosain ◽  
Sonika Dahiya

DKIFCM (Density Based Kernelized Intuitionistic Fuzzy C Means) is a newly proposed clustering algorithm based on outlier identification, kernel functions, and the intuitionistic fuzzy approach. DKIFCM is inspired by the Kernelized Intuitionistic Fuzzy C Means (KIFCM) algorithm and addresses its performance issue in the presence of outliers. It first identifies outliers based on the density of the data and then computes clusters accurately by mapping the data to a high-dimensional feature space. The performance and effectiveness of various algorithms are evaluated on synthetic 2D data sets, such as the Diamond data sets (D10, D12, and D15) and the noisy Dunn data set, as well as on high-dimensional real-world data sets such as Fisher-Iris, Wine, and the Wisconsin Breast Cancer data set. The results of DKIFCM are compared with those of other algorithms such as Fuzzy-C-Means (FCM), Intuitionistic FCM (IFCM), Kernel-Intuitionistic FCM (KIFCM), and density-oriented FCM (DOFCM), and the proposed algorithm is found to perform better even in the presence of outliers and noise. The key advantages of DKIFCM are outlier identification, robustness to noise, and accurate centroid computation.
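
The first stage, density-based outlier identification, can be sketched as below. The abstract does not give the density criterion, so this sketch assumes a simple neighbour count within a fixed radius. The function name and the `radius` and `min_neighbors` parameters are hypothetical, and the kernelized fuzzy clustering stage is not shown.

```python
import math


def flag_outliers(points, radius, min_neighbors):
    """Flag points that have fewer than min_neighbors other points within radius.

    Assumed density criterion for illustration only.
    """
    flags = []
    for i, p in enumerate(points):
        # Count the other points lying inside the neighbourhood of p.
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        flags.append(neighbors < min_neighbors)
    return flags


# Four points forming a dense square and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
flags = flag_outliers(points, radius=1.5, min_neighbors=2)
```

Points flagged this way would be set aside before centroids are computed, so that they cannot pull the centroids away from the dense regions.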


2020 ◽  
Author(s):  
Renato Cordeiro de Amorim

In a real-world data set there is always the possibility, rather high in our opinion, that different features may have different degrees of relevance. Most machine learning algorithms deal with this fact by either selecting or deselecting features in the data preprocessing phase. However, we maintain that even among relevant features there may be different degrees of relevance, and this should be taken into account during the clustering process. With over 50 years of history, K-Means is arguably the most popular partitional clustering algorithm there is. The first K-Means based clustering algorithm to compute feature weights was designed just over 30 years ago. Various such algorithms have been designed since, but there has not been, to our knowledge, a survey integrating empirical evidence of cluster recovery ability, common flaws, and possible directions for future research. This paper elaborates on the concept of feature weighting and addresses these issues by critically analysing some of the most popular, or innovative, feature weighting mechanisms based on K-Means.
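
One common form of feature weighting in K-Means can be sketched as follows. It is modelled on the weight update usually attributed to Weighted K-Means (WK-Means), where a feature with low within-cluster dispersion receives a high weight. Whether the survey uses exactly this form is an assumption, and the function names and the exponent `beta` are illustrative.

```python
def weights_from_dispersions(dispersions, beta=2.0):
    """Feature weights from within-cluster dispersions (all assumed > 0).

    Uses w_v = 1 / sum_u (D_v / D_u) ** (1 / (beta - 1)); the weights sum to 1.
    """
    exponent = 1.0 / (beta - 1.0)
    return [
        1.0 / sum((d_v / d_u) ** exponent for d_u in dispersions)
        for d_v in dispersions
    ]


def weighted_sq_distance(x, centroid, weights, beta=2.0):
    """Squared Euclidean distance with each feature scaled by weight ** beta."""
    return sum(w ** beta * (a - c) ** 2 for a, c, w in zip(x, centroid, weights))


# Feature 0 is three times less dispersed than feature 1, so it weighs more.
weights = weights_from_dispersions([1.0, 3.0], beta=2.0)
distance = weighted_sq_distance((0.0, 0.0), (1.0, 2.0), weights, beta=2.0)
```

In a full algorithm these two functions would alternate with the usual assignment and centroid updates. Here they only show how relevance enters the distance.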


Author(s):  
Kai Liu ◽  
Hua Wang

Unlike traditional clustering methods that deal with a single type of data, High-Order Co-Clustering (HOCC) aims to cluster multiple types of data simultaneously by utilizing the inter- and/or intra-type relationships across different data types. In existing HOCC methods, data points routinely enter the objective functions with squared residual errors. As a result, outlying data samples can dominate the objective functions, which may lead to incorrect clustering results. Moreover, existing methods usually suffer from soft clustering, where the membership probabilities for different groups can be very close. In this paper, we propose an L1-norm symmetric nonnegative matrix tri-factorization method to solve the HOCC problem. Due to the orthogonal constraints and the symmetric L1-norm formulation in our new objective, the conventional auxiliary function approach no longer works. Thus we derive the solution algorithm using the alternating direction method of multipliers. Extensive experiments have been conducted on a real-world data set, in which promising empirical results, including less time consumption, a strictly orthogonal membership matrix, and lower local minima, have demonstrated the effectiveness of our proposed method.
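
The claim that outliers dominate a squared-error objective can be checked with a small numeric sketch. It treats the objective as a plain sum of residual terms, which is a simplifying assumption and not the tri-factorization objective itself. The residual values and the function name are hypothetical.

```python
def outlier_share(residuals, power):
    """Fraction of sum(|r| ** power) contributed by the largest residual."""
    terms = [abs(r) ** power for r in residuals]
    return max(terms) / sum(terms)


# Three ordinary residuals and one outlying sample.
residuals = [1.0, -1.0, 1.0, 10.0]
squared_share = outlier_share(residuals, 2)   # 100 / 103
absolute_share = outlier_share(residuals, 1)  # 10 / 13
```

With squared errors the single outlier accounts for almost the whole objective, while with absolute errors its share is noticeably smaller. This is the motivation for moving to an L1-norm formulation.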


2019 ◽  
Vol 2019 ◽  
pp. 1-11 ◽  
Author(s):  
Shiyuan Zhou ◽  
Yinglin Wang

Service-oriented computing has become a promising way to develop software by composing existing services on the Internet. However, with the increasing number of services on the Internet, matching requirements to services has become a difficult problem. Service clustering is regarded as one of the effective ways to improve service matching. Related work shows that structure-related similarity metrics perform better than semantic-related similarity metrics in clustering services. It is therefore important to propose more useful structure-related similarity metrics to improve the performance of service clustering approaches. However, such work is rare in the existing literature. In this paper, we propose SCAS (a service clustering approach using structural metrics) to group services into different clusters. SCAS introduces a novel metric, A2S (atomic service similarity), to characterize the similarity of atomic services as a whole; it is a linear combination of C2S (composite-sharing similarity) and A3S (atomic-service-sharing similarity). SCAS then applies a guided community detection algorithm to group atomic services into clusters. Experimental results on a real-world data set show that SCAS performs better than the existing approaches and that the A2S metric is promising for improving the performance of service clustering approaches.

