A Soft Subspace Clustering Method for Text Data Using a Probability Based Feature Weighting Scheme

Subspace Clustering of Text Documents with Feature Weighting K-Means Algorithm

Advances in Knowledge Discovery and Data Mining - Lecture Notes in Computer Science ◽

10.1007/11430919_94 ◽

2005 ◽

pp. 802-812 ◽

Cited By ~ 38

Author(s):

Liping Jing ◽

Michael K. Ng ◽

Jun Xu ◽

Joshua Zhexue Huang

Keyword(s):

Subspace Clustering ◽

Feature Weighting ◽

Text Documents

Download Full-text

Integrating a differential evolution feature weighting scheme into prototype generation

Neurocomputing ◽

10.1016/j.neucom.2012.06.009 ◽

2012 ◽

Vol 97 ◽

pp. 332-343 ◽

Cited By ~ 15

Author(s):

Isaac Triguero ◽

Joaquín Derrac ◽

Salvador García ◽

Francisco Herrera

Keyword(s):

Differential Evolution ◽

Weighting Scheme ◽

Feature Weighting ◽

Evolution Feature

Download Full-text

Feature Weighting Scheme for Text Categorization Based on Rough Set

2010 First International Conference on Networking and Distributed Computing ◽

10.1109/icndc.2010.46 ◽

2010 ◽

Author(s):

Huang Lican ◽

Xu Xin ◽

Zhao Yuhong ◽

Gao Junzhou

Keyword(s):

Rough Set ◽

Text Categorization ◽

Weighting Scheme ◽

Feature Weighting

Download Full-text

Improving Clustering Methods By Exploiting Richness Of Text Data

10.26686/wgtn.17019287.v1 ◽

2021 ◽

Author(s):

◽

Abdul Wahid

Keyword(s):

Evolutionary Algorithm ◽

State Of The Art ◽

Ensemble Methods ◽

Text Clustering ◽

Clustering Methods ◽

Clustering Method ◽

Clustering Ensemble ◽

Text Data ◽

Multi Objective ◽

User Queries

<p>Clustering is an unsupervised machine learning technique, which involves discovering different clusters (groups) of similar objects in unlabeled data and is generally considered to be a NP hard problem. Clustering methods are widely used in a verity of disciplines for analyzing different types of data, and a small improvement in clustering method can cause a ripple effect in advancing research of multiple fields. Clustering any type of data is challenging and there are many open research questions. The clustering problem is exacerbated in the case of text data because of the additional challenges such as issues in capturing semantics of a document, handling rich features of text data and dealing with the well known problem of the curse of dimensionality. In this thesis, we investigate the limitations of existing text clustering methods and address these limitations by providing five new text clustering methods--Query Sense Clustering (QSC), Dirichlet Weighted K-means (DWKM), Multi-View Multi-Objective Evolutionary Algorithm (MMOEA), Multi-objective Document Clustering (MDC) and Multi-Objective Multi-View Ensemble Clustering (MOMVEC). These five new clustering methods showed that the use of rich features in text clustering methods could outperform the existing state-of-the-art text clustering methods. The first new text clustering method QSC exploits user queries (one of the rich features in text data) to generate better quality clusters and cluster labels. The second text clustering method DWKM uses probability based weighting scheme to formulate a semantically weighted distance measure to improve the clustering results. The third text clustering method MMOEA is based on a multi-objective evolutionary algorithm. MMOEA exploits rich features to generate a diverse set of candidate clustering solutions, and forms a better clustering solution using a cluster-oriented approach. The fourth and the fifth text clustering method MDC and MOMVEC address the limitations of MMOEA. MDC and MOMVEC differ in terms of the implementation of their multi-objective evolutionary approaches. All five methods are compared with existing state-of-the-art methods. The results of the comparisons show that the newly developed text clustering methods out-perform existing methods by achieving up to 16\% improvement for some comparisons. In general, almost all newly developed clustering algorithms showed statistically significant improvements over other existing methods. The key ideas of the thesis highlight that exploiting user queries improves Search Result Clustering(SRC); utilizing rich features in weighting schemes and distance measures improves soft subspace clustering; utilizing multiple views and a multi-objective cluster oriented method improves clustering ensemble methods; and better evolutionary operators and objective functions improve multi-objective evolutionary clustering ensemble methods. The new text clustering methods introduced in this thesis can be widely applied in various domains that involve analysis of text data. The contributions of this thesis which include five new text clustering methods, will not only help researchers in the data mining field but also to help a wide range of researchers in other fields.</p>

Download Full-text

Soft Subspace Clustering for High-Dimensional Data

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch276 ◽

2011 ◽

pp. 1810-1814

Author(s):

Liping Jing ◽

Michael K. Ng ◽

Joshua Zhexue Huang

Keyword(s):

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional ◽

Special Treatment ◽

Clustering Methods ◽

Real World Data ◽

Text Data ◽

Data Set ◽

Dna Microarray Data ◽

Text Document

High dimensional data is a phenomenon in real-world data mining applications. Text data is a typical example. In text mining, a text document is viewed as a vector of terms whose dimension is equal to the total number of unique terms in a data set, which is usually in thousands. High dimensional data occurs in business as well. In retails, for example, to effectively manage supplier relationship, suppliers are often categorized according to their business behaviors (Zhang, Huang, Qian, Xu, & Jing, 2006). The supplier’s behavior data is high dimensional, which contains thousands of attributes to describe the supplier’s behaviors, including product items, ordered amounts, order frequencies, product quality and so forth. One more example is DNA microarray data. Clustering high-dimensional data requires special treatment (Swanson, 1990; Jain, Murty, & Flynn, 1999; Cai, He, & Han, 2005; Kontaki, Papadopoulos & Manolopoulos., 2007), although various methods for clustering are available (Jain & Dubes, 1988). One type of clustering methods for high dimensional data is referred to as subspace clustering, aiming at finding clusters from subspaces instead of the entire data space. In a subspace clustering, each cluster is a set of objects identified by a subset of dimensions and different clusters are represented in different subsets of dimensions. Soft subspace clustering considers that different dimensions make different contributions to the identification of objects in a cluster. It represents the importance of a dimension as a weight that can be treated as the degree of the dimension in contribution to the cluster. Soft subspace clustering can find the cluster memberships of objects and identify the subspace of each cluster in the same clustering process.

Download Full-text

A subspace clustering method for satisfying stoimetric constraints in scRNA -seq

10.1109/bibe52308.2021.9635324 ◽

2021 ◽

Author(s):

Angela Huang ◽

Junhyong Kim

Keyword(s):

Subspace Clustering ◽

Clustering Method

Download Full-text

A comparative analysis of euphemistic sentences in news using feature weight scheme and intelligent techniques

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-211295 ◽

2021 ◽

pp. 1-12

Author(s):

K. Seethappan ◽

K. Premalatha

Keyword(s):

Neural Network ◽

Deep Learning ◽

Convolutional Neural Network ◽

Short Term Memory ◽

Automatic Classification ◽

Weighting Scheme ◽

Feature Weighting ◽

Classification Algorithms ◽

Weighting Schemes

Although there have been various researches in the detection of different figurative language, there is no single work in the automatic classification of euphemisms. Our primary work is to present a system for the automatic classification of euphemistic phrases in a document. In this research, a large dataset consisting of 100,000 sentences is collected from different resources for identifying euphemism or non-euphemism utterances. In this work, several approaches are focused to improve the euphemism classification: 1. A Combination of lexical n-gram features 2.Three Feature-weighting schemes 3.Deep learning classification algorithms. In this paper, four machine learning (J48, Random Forest, Multinomial Naïve Bayes, and SVM) and three deep learning algorithms (Multilayer Perceptron, Convolutional Neural Network, and Long Short-Term Memory) are investigated with various combinations of features and feature weighting schemes to classify the sentences. According to our experiments, Convolutional Neural Network (CNN) achieves precision 95.43%, recall 95.06%, F-Score 95.25%, accuracy 95.26%, and Kappa 0.905 by using a combination of unigram and bigram features with TF-IDF feature weighting scheme in the classification of euphemism. These results of experiments show CNN with a strong combination of unigram and bigram features set with TF-IDF feature weighting scheme outperforms another six classification algorithms in detecting the euphemisms in our dataset.

Download Full-text

A Real-Time Categorization and Clustering Method for Text Data of Laws and Regulations

2010 International Conference on Computational Intelligence and Software Engineering ◽

10.1109/wicom.2010.5600178 ◽

2010 ◽

Author(s):

Bianping Su ◽

Rong Wang ◽

Yiping Wang

Keyword(s):

Real Time ◽

Clustering Method ◽

Text Data ◽

Laws And Regulations

Download Full-text

Enhanced Similarity Measure for Sparse Subspace Clustering Method

Advances in Computational Intelligence - Lecture Notes in Computer Science ◽

10.1007/978-3-319-59153-7_26 ◽

2017 ◽

pp. 291-301

Author(s):

Sabra Hechmi ◽

Abir Gallas ◽

Ezzeddine Zagrouba

Keyword(s):

Similarity Measure ◽

Subspace Clustering ◽

Clustering Method ◽

Sparse Subspace Clustering

Download Full-text

A LDA Feature Grouping Method for Subspace Clustering of Text Data

Intelligence and Security Informatics - Lecture Notes in Computer Science ◽

10.1007/978-3-319-06677-6_7 ◽

2014 ◽

pp. 78-90 ◽

Cited By ~ 2

Author(s):

Yeshou Cai ◽

Xiaojun Chen ◽

Patrick Xiaogang Peng ◽

Joshua Zhexue Huang

Keyword(s):

Subspace Clustering ◽

Text Data ◽

Feature Grouping ◽

Grouping Method

Download Full-text