A Text Hybrid Clustering Algorithm Based on HowNet Semantics

2011 ◽  
Vol 474-476 ◽  
pp. 2071-2078 ◽  
Author(s):  
Zheng Yu Zhu ◽  
Shu Jia Dong ◽  
Chun Lei Yu ◽  
Jie He

Many existing text clustering algorithms overlook the semantic information between words and so they possess a lower accuracy of text similarity computation. A new text hybrid clustering algorithm (HCA) based on HowNet semantics has been proposed in this paper. It calculates the semantic similarity of words by using the words’ semantic concept description in HowNet and then combines it with the method of maximum weight matching of bipartite graph to calculate a semantic-based text similarity. Based on the new text similarity and by combining an improved genetic algorithm with k-medoids algorithm, HCA has been designed. The comparative experiments show that: 1) compared with two existing traditional clustering algorithms, HCA can get better quality and 2) when their text cosine similarity is replaced with the new semantic-based text similarity, all the qualities of the three clustering algorithms can be improved significantly.

Author(s):  
Hongtao Huang ◽  
Cunliang Liang ◽  
Haizhi Ye

Probability information content-based FCA concepts similarity computation method relies on the frequency of concepts in corpus, it takes only the occurrence probability as information content metric to compute FCA concept similarity, which leads to lower accuracy. This article introduces a semantic information content-based method for FCA concept similarity evaluation, in addition to the occurrence probability, it takes the superordinate and subordinate semantic relationship of concepts to measure information content, which makes the generic and specific degree of concepts more accurate. Then the semantic information content similarity can be calculated with the help of an ISA hierarchy which is derived from the domain ontology. The difference between this method and probability information content is that the evaluation of semantic information content is independent of corpus. Furthermore, semantic information content can be used for FCA concept similarity evaluation, and the weighted bipartite graph is also utilized to help improve the efficiency of the similarity evaluation. The experimental results show that this semantic information content based FCA concept similarity computation method improves the accuracy of probabilistic information content based method effectively without loss of time performance.


2017 ◽  
Vol 2017 ◽  
pp. 1-8 ◽  
Author(s):  
Yongchang Cai

Rotary kiln is important equipment in heavy industries and its calcination process is the key impact to the product quality. Due to the difficulty in obtaining the accurate algebraic model of the calcination process, an intelligent modeling method based on ANFIS and clustering algorithms is studied. In the model, ANFIS is employed as the core structure, and aiming to improve both its performance in reduced computation and accuracy, a novel hybrid clustering algorithm is proposed by combining FCM and Subtractive methods. A quasi-random data set is then hired to test the new hybrid clustering algorithm and results indicate its superiority to FCM and Subtractive methods. Further, a set of data from the successful control activity of sophisticated workers in manufacturing field is used to train the model, and the model demonstrates its advantages in both fast convergence and more accuracy approaching.


Author(s):  
Na Zheng ◽  
Jie Yu Wu

A clustering method based on the Latent Dirichlet Allocation and the VSM model to compute the text similarity is presented. The Latent Dirichlet Allocation subject models and the VSM vector space model weights strategy are used respectively to calculate the text similarity. The linear combination of the two results is used to get the text similarity. Then the k-means clustering algorithm is chosen for cluster analysis. It can not only solve the deep semantic information leakage problems of traditional text clustering, but also solve the problem of the LDA that could not distinguish the texts because of too much dimension reduction. So the deep semantic information is mined from the text, and the clustering efficiency is improved. Through the comparisons with the traditional methods, the result shows that this algorithm can improve the performance of text clustering.


Author(s):  
Mohana Priya K ◽  
Pooja Ragavi S ◽  
Krishna Priya G

Clustering is the process of grouping objects into subsets that have meaning in the context of a particular problem. It does not rely on predefined classes. It is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used based on different applications. Sentence clustering is one of best clustering technique. Hierarchical Clustering Algorithm is applied for multiple levels for accuracy. For tagging purpose POS tagger, porter stemmer is used. WordNet dictionary is utilized for determining the similarity by invoking the Jiang Conrath and Cosine similarity measure. Grouping is performed with respect to the highest similarity measure value with a mean threshold. This paper incorporates many parameters for finding similarity between words. In order to identify the disambiguated words, the sense identification is performed for the adjectives and comparison is performed. semcor and machine learning datasets are employed. On comparing with previous results for WSD, our work has improvised a lot which gives a percentage of 91.2%


2015 ◽  
pp. 125-138 ◽  
Author(s):  
I. V. Goncharenko

In this article we proposed a new method of non-hierarchical cluster analysis using k-nearest-neighbor graph and discussed it with respect to vegetation classification. The method of k-nearest neighbor (k-NN) classification was originally developed in 1951 (Fix, Hodges, 1951). Later a term “k-NN graph” and a few algorithms of k-NN clustering appeared (Cover, Hart, 1967; Brito et al., 1997). In biology k-NN is used in analysis of protein structures and genome sequences. Most of k-NN clustering algorithms build «excessive» graph firstly, so called hypergraph, and then truncate it to subgraphs, just partitioning and coarsening hypergraph. We developed other strategy, the “upward” clustering in forming (assembling consequentially) one cluster after the other. Until today graph-based cluster analysis has not been considered concerning classification of vegetation datasets.


Author(s):  
Yuancheng Li ◽  
Yaqi Cui ◽  
Xiaolong Zhang

Background: Advanced Metering Infrastructure (AMI) for the smart grid is growing rapidly which results in the exponential growth of data collected and transmitted in the device. By clustering this data, it can give the electricity company a better understanding of the personalized and differentiated needs of the user. Objective: The existing clustering algorithms for processing data generally have some problems, such as insufficient data utilization, high computational complexity and low accuracy of behavior recognition. Methods: In order to improve the clustering accuracy, this paper proposes a new clustering method based on the electrical behavior of the user. Starting with the analysis of user load characteristics, the user electricity data samples were constructed. The daily load characteristic curve was extracted through improved extreme learning machine clustering algorithm and effective index criteria. Moreover, clustering analysis was carried out for different users from industrial areas, commercial areas and residential areas. The improved extreme learning machine algorithm, also called Unsupervised Extreme Learning Machine (US-ELM), is an extension and improvement of the original Extreme Learning Machine (ELM), which realizes the unsupervised clustering task on the basis of the original ELM. Results: Four different data sets have been experimented and compared with other commonly used clustering algorithms by MATLAB programming. The experimental results show that the US-ELM algorithm has higher accuracy in processing power data. Conclusion: The unsupervised ELM algorithm can greatly reduce the time consumption and improve the effectiveness of clustering.


Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 160
Author(s):  
Qiaoji Xu ◽  
Lingling Jin ◽  
James H. Leebens-Mack ◽  
David Sankoff

The RACCROCHE pipeline reconstructs ancestral gene orders and chromosomal contents of the ancestral genomes at all internal vertices of a phylogenetic tree. The strategy is to accumulate a very large number of generalized adjacencies, phylogenetically justified for each ancestor, to produce long ancestral contigs through maximum weight matching. It constructs chromosomes by counting the frequencies of ancestral contig co-occurrences on the extant genomes, clustering these for each ancestor and ordering them. The main objective of this paper is to closely simulate the evolutionary process giving rise to the gene content and order of a set of extant genomes (six distantly related monocots), and to assess to what extent an updated version of RACCROCHE can recover the artificial ancestral genome at the root of the phylogenetic tree relating to the simulated genomes.


Author(s):  
M. Tanveer ◽  
Tarun Gupta ◽  
Miten Shah ◽  

Twin Support Vector Clustering (TWSVC) is a clustering algorithm inspired by the principles of Twin Support Vector Machine (TWSVM). TWSVC has already outperformed other traditional plane based clustering algorithms. However, TWSVC uses hinge loss, which maximizes shortest distance between clusters and hence suffers from noise-sensitivity and low re-sampling stability. In this article, we propose Pinball loss Twin Support Vector Clustering (pinTSVC) as a clustering algorithm. The proposed pinTSVC model incorporates the pinball loss function in the plane clustering formulation. Pinball loss function introduces favorable properties such as noise-insensitivity and re-sampling stability. The time complexity of the proposed pinTSVC remains equivalent to that of TWSVC. Extensive numerical experiments on noise-corrupted benchmark UCI and artificial datasets have been provided. Results of the proposed pinTSVC model are compared with TWSVC, Twin Bounded Support Vector Clustering (TBSVC) and Fuzzy c-means clustering (FCM). Detailed and exhaustive comparisons demonstrate the better performance and generalization of the proposed pinTSVC for noise-corrupted datasets. Further experiments and analysis on the performance of the above-mentioned clustering algorithms on structural MRI (sMRI) images taken from the ADNI database, face clustering, and facial expression clustering have been done to demonstrate the effectiveness and feasibility of the proposed pinTSVC model.


Symmetry ◽  
2021 ◽  
Vol 13 (4) ◽  
pp. 596
Author(s):  
Krishna Kumar Sharma ◽  
Ayan Seal ◽  
Enrique Herrera-Viedma ◽  
Ondrej Krejcar

Calculating and monitoring customer churn metrics is important for companies to retain customers and earn more profit in business. In this study, a churn prediction framework is developed by modified spectral clustering (SC). However, the similarity measure plays an imperative role in clustering for predicting churn with better accuracy by analyzing industrial data. The linear Euclidean distance in the traditional SC is replaced by the non-linear S-distance (Sd). The Sd is deduced from the concept of S-divergence (SD). Several characteristics of Sd are discussed in this work. Assays are conducted to endorse the proposed clustering algorithm on four synthetics, eight UCI, two industrial databases and one telecommunications database related to customer churn. Three existing clustering algorithms—k-means, density-based spatial clustering of applications with noise and conventional SC—are also implemented on the above-mentioned 15 databases. The empirical outcomes show that the proposed clustering algorithm beats three existing clustering algorithms in terms of its Jaccard index, f-score, recall, precision and accuracy. Finally, we also test the significance of the clustering results by the Wilcoxon’s signed-rank test, Wilcoxon’s rank-sum test, and sign tests. The relative study shows that the outcomes of the proposed algorithm are interesting, especially in the case of clusters of arbitrary shape.


Mathematics ◽  
2021 ◽  
Vol 9 (4) ◽  
pp. 370
Author(s):  
Shuangsheng Wu ◽  
Jie Lin ◽  
Zhenyu Zhang ◽  
Yushu Yang

The fuzzy clustering algorithm has become a research hotspot in many fields because of its better clustering effect and data expression ability. However, little research focuses on the clustering of hesitant fuzzy linguistic term sets (HFLTSs). To fill in the research gaps, we extend the data type of clustering to hesitant fuzzy linguistic information. A kind of hesitant fuzzy linguistic agglomerative hierarchical clustering algorithm is proposed. Furthermore, we propose a hesitant fuzzy linguistic Boole matrix clustering algorithm and compare the two clustering algorithms. The proposed clustering algorithms are applied in the field of judicial execution, which provides decision support for the executive judge to determine the focus of the investigation and the control. A clustering example verifies the clustering algorithm’s effectiveness in the context of hesitant fuzzy linguistic decision information.


Sign in / Sign up

Export Citation Format

Share Document