Improving Clustering Methods By Exploiting Richness Of Text Data

2021 ◽  
Author(s):  
Abdul Wahid

<p>Clustering is an unsupervised machine learning technique that discovers clusters (groups) of similar objects in unlabeled data, and it is generally considered to be an NP-hard problem. Clustering methods are widely used in a variety of disciplines for analyzing different types of data, and a small improvement in a clustering method can have a ripple effect, advancing research in multiple fields. Clustering any type of data is challenging, and many research questions remain open. The problem is exacerbated for text data by additional challenges: capturing the semantics of a document, handling the rich features of text data, and dealing with the well-known curse of dimensionality. In this thesis, we investigate the limitations of existing text clustering methods and address them with five new text clustering methods: Query Sense Clustering (QSC), Dirichlet Weighted K-means (DWKM), Multi-View Multi-Objective Evolutionary Algorithm (MMOEA), Multi-objective Document Clustering (MDC) and Multi-Objective Multi-View Ensemble Clustering (MOMVEC). These five methods show that text clustering methods that use rich features can outperform the existing state-of-the-art text clustering methods. The first method, QSC, exploits user queries (one of the rich features in text data) to generate better-quality clusters and cluster labels. The second, DWKM, uses a probability-based weighting scheme to formulate a semantically weighted distance measure that improves the clustering results. The third, MMOEA, is based on a multi-objective evolutionary algorithm; it exploits rich features to generate a diverse set of candidate clustering solutions and forms a better clustering solution using a cluster-oriented approach. The fourth and fifth methods, MDC and MOMVEC, address the limitations of MMOEA and differ in how they implement their multi-objective evolutionary approaches. All five methods are compared with existing state-of-the-art methods. The comparisons show that the newly developed text clustering methods outperform existing methods, achieving up to 16% improvement in some comparisons. In general, almost all of the newly developed clustering algorithms showed statistically significant improvements over existing methods. The key ideas of the thesis are that exploiting user queries improves Search Result Clustering (SRC); utilizing rich features in weighting schemes and distance measures improves soft subspace clustering; utilizing multiple views and a multi-objective cluster-oriented method improves clustering ensemble methods; and better evolutionary operators and objective functions improve multi-objective evolutionary clustering ensemble methods. The new text clustering methods introduced in this thesis can be widely applied in domains that involve analysis of text data. The contributions of this thesis, which include five new text clustering methods, will help not only researchers in the data mining field but also a wide range of researchers in other fields.</p>
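
The idea of a weighted distance measure in a k-means-style method can be illustrated with a minimal sketch. This is not the thesis's DWKM: the abstract does not give the weighting scheme, so the per-term weights below are hand-picked placeholders, and the function names and toy vectors are assumptions made for illustration only.

```python
import math

def weighted_distance(doc, centroid, weights):
    """Weighted Euclidean distance: each term's squared difference is scaled by its weight."""
    return math.sqrt(sum(w * (d - c) ** 2 for d, c, w in zip(doc, centroid, weights)))

def assign(docs, centroids, weights):
    """Return, for every document vector, the index of its closest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda k: weighted_distance(doc, centroids[k], weights))
        for doc in docs
    ]

# Hypothetical two-term document vector and two centroids.
docs = [[1.0, 4.0]]
centroids = [[1.0, 0.0], [3.0, 4.0]]

uniform = [1.0, 1.0]        # both terms count equally
down_weighted = [1.0, 0.1]  # second term treated as less informative

print(assign(docs, centroids, uniform))
print(assign(docs, centroids, down_weighted))
```

With uniform weights the document is closer to the second centroid; once the second term is down-weighted, the first centroid wins. A learned weighting scheme would replace the hand-picked values.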


Author(s):  
Katti Faceli ◽  
Andre C.P.L.F. de Carvalho ◽  
Marcilio C.P. de Souto

Clustering is an important tool for data exploration. Several clustering algorithms exist, and new algorithms are frequently proposed in the literature. These algorithms have been very successful in a large number of real-world problems. However, there is no clustering algorithm, optimizing only a single criterion, able to reveal all types of structures (homogeneous or heterogeneous) present in a dataset. In order to deal with this problem, several multi-objective clustering and cluster ensemble methods have been proposed in the literature, including our multi-objective clustering ensemble algorithm. In this chapter, we present an overview of these methods, which, to a great extent, are based on the combination of various aspects of traditional clustering algorithms.


Author(s):  
Nurshazwani Muhamad Mahfuz ◽  
Marina Yusoff ◽  
Zakiah Ahmad

<div style="text-align: justify;">Clustering plays an important role as an unsupervised learning method in data analytics, supporting many real-world problems such as image segmentation, object recognition and information retrieval. Traditional clustering techniques often struggle, producing non-optimal results because of outliers and noisy data. This paper reviews single clustering methods that have been applied in various domains. The aim is to identify potentially suitable applications and aspects of the methods that could be improved. Three categories of single clustering methods are suggested, which should help researchers understand the clustering aspects involved and determine the requirements for a clustering method to employ, based on the state of the art in previous research findings.</div>


2003 ◽  
Vol 11 (2) ◽  
pp. 151-167 ◽  
Author(s):  
Andrea Toffolo ◽  
Ernesto Benini

A key feature of an efficient and reliable multi-objective evolutionary algorithm is the ability to maintain genetic diversity within a population of solutions. In this paper, we present a new diversity-preserving mechanism, the Genetic Diversity Evaluation Method (GeDEM), which considers a distance-based measure of genetic diversity as a real objective in fitness assignment. This provides a dual selection pressure towards the exploitation of current non-dominated solutions and the exploration of the search space. We also introduce a new multi-objective evolutionary algorithm, the Genetic Diversity Evolutionary Algorithm (GDEA), strictly designed around GeDEM and then we compare it with other state-of-the-art algorithms on a well-established suite of test problems. Experimental results clearly indicate that the performance of GDEA is top-level.
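
Treating diversity as an objective in its own right can be sketched as follows. This is an illustrative reading of the abstract, not the published GeDEM procedure: the nearest-neighbor distance used as the diversity measure, the function names and the toy population are all assumptions.

```python
import math

def nearest_neighbor_distance(index, population):
    """Distance from one individual to its closest neighbor in decision space."""
    return min(
        math.dist(population[index], other)
        for j, other in enumerate(population)
        if j != index
    )

def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in at least one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(population, quality):
    """Indices of individuals that are non-dominated in (quality, diversity) space."""
    scores = [
        (quality(ind), nearest_neighbor_distance(i, population))
        for i, ind in enumerate(population)
    ]
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s) for t in scores)]

# Hypothetical one-dimensional population; quality is best at x = 0.
population = [(0,), (1,), (10,)]
quality = lambda ind: -abs(ind[0])

print(non_dominated(population, quality))
```

The individual at 10 has poor quality but survives because it is far from everything else, while the individual at 1 is dominated by its better neighbor at 0. This is the dual pressure toward exploitation and exploration that the abstract describes.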


2018 ◽  
Author(s):  
Biao Zhang ◽  
Quan-ke Pan ◽  
Liang Gao ◽  
Yao-bang Zhao

In this paper, a multi-objective hybrid flowshop rescheduling problem (HFRP) is addressed in a dynamic shop environment where two types of real-time events, namely machine breakdown and job cancellation, happen simultaneously. Two objectives are considered for this problem. The first, concerning production efficiency, is to minimize the maximum completion time, or makespan. The second, concerning instability, is the total number of jobs assigned to different machines in the revised schedule compared with the original schedule. A multi-objective evolutionary algorithm based on decomposition (MOEA/D) is applied to solve this problem, with the weighted sum approach used as the decomposition strategy. The algorithm is then rigorously compared with three state-of-the-art evolutionary multi-objective optimizers, and the computational results demonstrate its effectiveness and efficiency.
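
The weighted sum decomposition mentioned above turns one multi-objective problem into several scalar subproblems, each defined by a weight vector. The sketch below shows only that scalarization step, not the full MOEA/D loop or the paper's rescheduling model; the candidate (makespan, instability) values and the function names are made up for illustration.

```python
def weighted_sum(objectives, weights):
    """Scalarize an objective vector: sum of weight_i * objective_i (minimization)."""
    return sum(w * f for w, f in zip(weights, objectives))

def weight_vectors(n_subproblems):
    """Evenly spread two-objective weight vectors (w, 1 - w)."""
    step = 1 / (n_subproblems - 1)
    return [(i * step, 1 - i * step) for i in range(n_subproblems)]

def best_per_subproblem(candidates, weights_list):
    """For each weight vector, pick the candidate with the smallest scalarized value."""
    return [min(candidates, key=lambda c: weighted_sum(c, w)) for w in weights_list]

# Hypothetical schedules scored as (makespan, instability); both are minimized.
candidates = [(100, 40), (110, 10), (130, 0)]

for w, best in zip(weight_vectors(3), best_per_subproblem(candidates, weight_vectors(3))):
    print(w, best)
```

Each weight vector favors a different trade-off, so solving the subproblems together yields a spread of schedules rather than a single compromise. In practice the objectives would usually be normalized first, since makespan and job counts are on different scales.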


Entropy ◽  
2021 ◽  
Vol 23 (5) ◽  
pp. 522
Author(s):  
Minhui Hu ◽  
Kaiwei Zeng ◽  
Yaohua Wang ◽  
Yang Guo

Unsupervised domain adaptation is a challenging task in person re-identification (re-ID). Recently, cluster-based methods have achieved good performance; clustering and training are the two important phases in these methods. For clustering, one major issue of existing methods is that they do not fully exploit the information in outliers, either discarding outliers in clusters or simply merging them. For training, existing methods only use source features for pretraining and target features for fine-tuning, and do not make full use of all valuable information in the source and target datasets. To solve these problems, we propose a Threshold-based Hierarchical clustering method with Contrastive loss (THC). THC has two features: (1) it regards outliers as single-sample clusters that participate in training, which preserves the information in outliers without setting a cluster number and combines the advantages of existing clustering methods; (2) it uses contrastive loss to make full use of all valuable information, including source-class centroids, target-cluster centroids and single-sample clusters, thus achieving better performance. We conduct extensive experiments on Market-1501, DukeMTMC-reID and MSMT17. Results show that our method achieves state-of-the-art performance.
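
The clustering idea in this abstract, merging by a distance threshold rather than a preset cluster count and keeping outliers as single-sample clusters, can be sketched with a small union-find routine. The abstract does not state THC's linkage criterion or distance, so single linkage over Euclidean distance is an assumption here, as are the function name and the toy points.

```python
import math

def threshold_clusters(points, threshold):
    """Merge points whose distance is within the threshold (single linkage);
    points with no close neighbor remain single-sample clusters."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical 2-D feature vectors: two tight pairs and one isolated sample.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(threshold_clusters(points, threshold=1.5))
```

The isolated sample is neither discarded nor forced into a neighboring group; it comes back as its own cluster, so it can still take part in a later training step.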


2018 ◽  
Author(s):  
Hyunki Woo ◽  
Kyunga Kim ◽  
KyeongMin Cha ◽  
Jin-Young Lee ◽  
Hansong Mun ◽  
...  

BACKGROUND Since medical research based on big data has become more common, the community’s interest in and effort to analyze large amounts of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, these large-scale text data are often not readily usable for analysis owing to typographical errors, inconsistencies, or data entry problems. Therefore, an efficient data cleaning process is required to ensure the veracity of such data. OBJECTIVE In this paper, we proposed an efficient data cleaning process for large-scale medical text data, which employs text clustering methods and a value-converting technique, and evaluated its performance with medical examination text data. METHODS The proposed data cleaning process consists of text clustering and value-merging. In the text clustering step, we suggested the use of key collision and nearest neighbor methods in a complementary manner. Words (called values) in the same cluster are expected to be a correct value and its incorrect representations. In the value-converting step, the wrong values in each identified cluster are converted into their correct value. We applied this data cleaning process to 574,266 stool examination reports produced for parasite analysis at Samsung Medical Center from 1995 to 2015. The performance of the proposed process was examined and compared with data cleaning processes based on a single clustering method. We used OpenRefine 2.7, an open source application that provides various text clustering methods and an efficient user interface for value-converting with common-value suggestion. RESULTS A total of 1,167,104 words in stool examination reports were surveyed. In the data cleaning process, we discovered 30 correct words and 45 patterns of typographical errors and duplicates. We observed high correction rates for words with typographical errors (98.61%) and typographical error patterns (97.78%). The resulting data accuracy was nearly 100% based on the number of total words. CONCLUSIONS Our data cleaning process based on the combinatorial use of key collision and nearest neighbor methods provides efficient cleaning of large-scale text data and hence improves data accuracy.
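
The two complementary clustering families named in this abstract can be sketched in a few lines. This is a simplified illustration, not OpenRefine's implementation: the fingerprint key below covers only trimming, lowercasing, punctuation removal and token sorting, the nearest neighbor step uses plain Levenshtein distance, and the sample values are invented.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Key for key-collision clustering: normalized, de-duplicated, sorted tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.strip().lower()).split()
    return " ".join(sorted(set(tokens)))

def key_collision_clusters(values):
    """Group distinct values that share the same fingerprint key."""
    buckets = defaultdict(set)
    for v in values:
        buckets[fingerprint(v)].add(v)
    return [sorted(group) for group in buckets.values() if len(group) > 1]

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def near_pairs(values, max_distance):
    """Nearest-neighbor step: pairs of distinct values within the edit-distance bound."""
    unique = sorted(set(values))
    return [(a, b) for i, a in enumerate(unique) for b in unique[i + 1:]
            if levenshtein(a, b) <= max_distance]

# Hypothetical report values.
print(key_collision_clusters(["negative", "Negative.", "negatvie"]))
print(near_pairs(["negative", "negatvie", "positive"], max_distance=2))
```

Key collision catches case and punctuation variants cheaply but misses the transposed letters in "negatvie"; the edit-distance pass catches that one. This is why the two methods are useful together.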


2012 ◽  
Vol 2012 ◽  
pp. 1-11 ◽  
Author(s):  
Yufang Qin ◽  
Junzhong Ji ◽  
Chunnian Liu

The multi-objective optimization problem (MOP) is an important and challenging topic in industrial design and scientific research. The multi-objective evolutionary algorithm (MOEA) has proved to be one of the most efficient approaches to multi-objective optimization. In this paper, we propose an entropy-based multi-objective evolutionary algorithm with an enhanced elite mechanism (E-MOEA), which effectively improves the convergence and diversity of the solution set in MOPs. In this algorithm, an enhanced elite mechanism guides the direction of the population's evolution. Specifically, it accelerates the population's approach to the true Pareto front in the early stage of the evolution process. An entropy-based strategy maintains the diversity of the population when it is near the Pareto front. The proposed algorithm is run on widely used test problems, and the simulation results show that it has better or comparable performance in convergence and diversity of solutions compared with state-of-the-art evolutionary algorithms: NSGA-II, SPEA2 and MOSADE.
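
One common way to turn "diversity of the population" into a number is the Shannon entropy of how solutions are distributed over grid cells in objective space. The abstract does not say how E-MOEA defines its entropy, so the grid-based measure below, along with the function name, cell size and sample points, is an assumption for illustration.

```python
import math
from collections import Counter

def population_entropy(objective_vectors, cell_size):
    """Shannon entropy of the population's distribution over grid cells.
    Higher values mean the solutions are spread over more cells, more evenly."""
    cells = Counter(
        tuple(math.floor(v / cell_size) for v in vec)
        for vec in objective_vectors
    )
    total = len(objective_vectors)
    return -sum((n / total) * math.log(n / total) for n in cells.values())

# Hypothetical objective vectors: one crowded population, one spread along a front.
crowded = [(0.1, 0.1), (0.2, 0.1), (0.15, 0.12), (0.11, 0.19)]
spread = [(0.5, 3.5), (1.5, 2.5), (2.5, 1.5), (3.5, 0.5)]

print(population_entropy(crowded, cell_size=1.0))
print(population_entropy(spread, cell_size=1.0))
```

A selection step could prefer candidates whose inclusion raises this value, which pushes the population to cover the front instead of collapsing into one region.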


Author(s):  
Yonghua Zhu ◽  
Xiaofeng Zhu ◽  
Wei Zheng

Although multi-view clustering is able to use more information than single-view clustering, existing multi-view clustering methods still have issues to be addressed, such as initialization sensitivity, the specification of the number of clusters, and the influence of outliers. In this paper, we propose a robust multi-view clustering method to address these issues. Specifically, we first propose a multi-view based sum-of-square error estimation to make the initialization easy and simple, and use a sum-of-norm regularization to automatically learn the number of clusters according to the data distribution. We further employ robust estimators constructed by the half-quadratic theory to avoid the influence of outliers, conducting robust estimations of both the sum-of-square error and the number of clusters. Experimental results on both synthetic and real datasets demonstrate that our method outperforms the state-of-the-art methods.

