A LDA Feature Grouping Method for Subspace Clustering of Text Data

High-dimensional data is common in real-world data mining applications. Text data is a typical example: in text mining, a document is viewed as a vector of terms whose dimension equals the total number of unique terms in the data set, usually in the thousands. High-dimensional data occurs in business as well. In retail, for example, suppliers are often categorized according to their business behaviors in order to manage supplier relationships effectively (Zhang, Huang, Qian, Xu, & Jing, 2006). Supplier behavior data is high-dimensional, containing thousands of attributes that describe each supplier's behavior, including product items, ordered amounts, order frequencies, product quality, and so forth. DNA microarray data is another example. Although various clustering methods are available (Jain & Dubes, 1988), clustering high-dimensional data requires special treatment (Swanson, 1990; Jain, Murty, & Flynn, 1999; Cai, He, & Han, 2005; Kontaki, Papadopoulos, & Manolopoulos, 2007). One family of clustering methods for high-dimensional data is subspace clustering, which aims to find clusters in subspaces rather than in the entire data space. In subspace clustering, each cluster is a set of objects identified by a subset of dimensions, and different clusters are represented in different subsets of dimensions. Soft subspace clustering assumes that different dimensions contribute differently to identifying the objects in a cluster. It represents the importance of a dimension as a weight, which can be read as the degree to which that dimension contributes to the cluster. Soft subspace clustering can find the cluster memberships of objects and identify the subspace of each cluster in the same clustering process.
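The idea of per-cluster dimension weights can be illustrated with a minimal sketch. The abstract does not state how the weights are computed, so the exponential (entropy-style) update below, the `gamma` parameter, and all function names are assumptions chosen for illustration, not the authors' method. The sketch shows one assignment step using a weighted distance and one weight update in which dimensions with low within-cluster dispersion receive larger weights.

```python
import math

def weighted_sq_dist(x, center, weights):
    """Squared distance in which each dimension is scaled by its cluster weight."""
    return sum(w * (a - c) ** 2 for a, c, w in zip(x, center, weights))

def assign(points, centers, weights):
    """Assign each point to the cluster with the smallest weighted distance."""
    labels = []
    for x in points:
        dists = [weighted_sq_dist(x, c, w) for c, w in zip(centers, weights)]
        labels.append(dists.index(min(dists)))
    return labels

def update_weights(points, labels, centers, gamma=1.0):
    """Assumed update rule: weight is proportional to exp(-dispersion / gamma).

    Dimensions along which a cluster is tight get weights near 1;
    dimensions along which it is spread out get weights near 0.
    Weights within each cluster sum to 1.
    """
    n_dims = len(centers[0])
    new_weights = []
    for k, center in enumerate(centers):
        members = [x for x, label in zip(points, labels) if label == k]
        dispersion = [sum((x[j] - center[j]) ** 2 for x in members)
                      for j in range(n_dims)]
        scores = [math.exp(-d / gamma) for d in dispersion]
        total = sum(scores)
        new_weights.append([s / total for s in scores])
    return new_weights

# Toy data: cluster 0 is tight in dimension 0, cluster 1 is tight in dimension 1.
points = [(0.0, 0.0), (0.1, 5.0), (-0.1, -5.0),
          (10.0, 3.0), (15.0, 3.1), (5.0, 2.9)]
centers = [(0.0, 0.0), (10.0, 3.0)]
labels = [0, 0, 0, 1, 1, 1]

weights = update_weights(points, labels, centers)
new_labels = assign(points, centers, weights)
```

In a full algorithm these two steps would alternate with a center update until the assignments stop changing. Here the weights alone are enough to show that each cluster ends up described by its own subset of dimensions.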

Research on Community Detection of Online Social Network Members Based on the Sparse Subspace Clustering Approach

Future Internet ◽

10.3390/fi11120254 ◽

2019 ◽

Vol 11 (12) ◽

pp. 254

Author(s):

Zihe Zhou ◽

Bo Tian

Keyword(s):

Social Network ◽

Community Detection ◽

Clustering Algorithm ◽

Detection Method ◽

Online Social Network ◽

Subspace Clustering ◽

High Dimensional ◽

Data Sets ◽

Text Data ◽

New Community

Text data on social network platforms takes the form of short texts, and the massive volume of text is high-dimensional and sparse, which causes traditional clustering algorithms to perform poorly. In this paper, a new community detection method based on the sparse subspace clustering (SSC) algorithm is proposed to deal with the sparsity and high dimensionality of short texts in online social networks. The main idea is as follows. First, structured data, including user attributes and user behavior, and unstructured data, such as user reviews, are used to construct the vector space for the network. The similarity of feature words is calculated from their positional relation in the synonym word forest. Then, the dimensionality of the data is reduced using principal component analysis in order to improve clustering accuracy. Further, a new community detection method for social network members based on SSC is proposed. Finally, experiments on several data sets are performed, and the results are compared with the K-means clustering algorithm. The experimental results show that proper dimension reduction for high-dimensional data can improve the clustering accuracy and efficiency of the SSC approach. The proposed method achieves a suitable community partition on online social network data sets.
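The dimension-reduction step can be sketched as follows. The abstract does not say how the principal components are computed or how many are kept, so this is only an illustration under assumptions: it uses power iteration with deflation, the function names are hypothetical, and it covers the PCA projection only, not the SSC algorithm itself.

```python
import math

def center_rows(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def leading_eigenpair(cov, iters=500):
    """Power iteration from a uniform start vector.

    Limitation: if the start vector happens to be orthogonal to the
    leading eigenvector, the iteration will not find it.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return eigenvalue, v

def pca_project(rows, k):
    """Project rows onto the first k principal components."""
    centered = center_rows(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(k):
        lam, v = leading_eigenpair(cov)
        components.append(v)
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(a * b for a, b in zip(row, v)) for v in components]
            for row in centered]

# Toy data that varies almost entirely along the direction (1, 1, 0).
rows = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.1], [3.0, 3.0, 0.0], [4.0, 4.0, -0.1]]
reduced = pca_project(rows, 1)  # each 3-dimensional row becomes 1-dimensional
```

The reduced rows would then be passed to the clustering step in place of the original high-dimensional vectors.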

A feature grouping method for ensemble clustering of high-dimensional genomic big data

2016 Future Technologies Conference (FTC) ◽

10.1109/ftc.2016.7821620 ◽

2016 ◽

Cited By ~ 5

Author(s):

Dewan Md. Farid ◽

Ann Nowe ◽

Bernard Manderick

Keyword(s):

Big Data ◽

High Dimensional ◽

Ensemble Clustering ◽

Feature Grouping ◽

Grouping Method

Ensemble subspace clustering of text data using two-level features

International Journal of Machine Learning and Cybernetics ◽

10.1007/s13042-016-0556-5 ◽

2016 ◽

Vol 8 (6) ◽

pp. 1751-1766 ◽

Cited By ~ 1

Author(s):

He Zhao ◽

Salman Salloum ◽

Yeshou Cai ◽

Joshua Zhexue Huang

Keyword(s):

Subspace Clustering ◽

Text Data

A Soft Subspace Clustering Method for Text Data Using a Probability Based Feature Weighting Scheme

Lecture Notes in Computer Science - Web Information Systems Engineering – WISE 2015 ◽

10.1007/978-3-319-26187-4_9 ◽

2015 ◽

pp. 124-138

Author(s):

Abdul Wahid ◽

Xiaoying Gao ◽

Peter Andreae

Keyword(s):

Subspace Clustering ◽

Weighting Scheme ◽

Feature Weighting ◽

Clustering Method ◽

Text Data

Using Correlation Based Subspace Clustering for Multi-label Text Data Classification

2010 22nd IEEE International Conference on Tools with Artificial Intelligence ◽

10.1109/ictai.2010.115 ◽

2010 ◽

Cited By ~ 3

Author(s):

Mohammad Salim Ahmed ◽

Latifur Khan ◽

Mandava Rajeswari

Keyword(s):

Subspace Clustering ◽

Data Classification ◽

Text Data

Application of High Dimensional Feature Grouping Method in Near-Infrared Spectra of Identification of Tobacco Growing Areas

2016 3rd International Conference on Information Science and Control Engineering (ICISCE) ◽

10.1109/icisce.2016.58 ◽

2016 ◽

Author(s):

Cheng Zhu ◽

Huili Gong ◽

Zhongren Li ◽

Chunxia Yu

Keyword(s):

Infrared Spectra ◽

Near Infrared ◽

High Dimensional ◽

Near Infrared Spectra ◽

Feature Grouping ◽

Grouping Method

Subspace clustering with automatic feature grouping

Pattern Recognition ◽

10.1016/j.patcog.2015.05.016 ◽

2015 ◽

Vol 48 (11) ◽

pp. 3703-3713 ◽

Cited By ~ 28

Author(s):

Guojun Gan ◽

Michael Kwok-Po Ng

Keyword(s):

Subspace Clustering ◽

Feature Grouping

A NOVEL FEATURE GROUPING METHOD FOR ENSEMBLE NEURAL NETWORK USING LOCALIZED GENERALIZATION ERROR MODEL

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001408006041 ◽

2008 ◽

Vol 22 (01) ◽

pp. 137-151 ◽

Cited By ~ 2

Author(s):

AKI P. F. CHAN ◽

PATRICK P. K. CHAN ◽

WING W. Y. NG ◽

ERIC C. C. TSANG ◽

DANIEL S. YEUNG

Keyword(s):

Error Model ◽

Generalization Error ◽

Multiple Classifier System ◽

Feature Grouping ◽

Multiple Classifier ◽

Key Issues ◽

Grouping Method ◽

New Feature ◽

Better Than ◽

Localized Generalization Error

The Multiple Classifier System (MCS) has been a very popular research topic in recent years. It has been shown theoretically and empirically to outperform single classifiers in many scenarios. Creating diverse sets of classifiers is one of the key issues in building MCSs. Feature grouping is one method for creating diverse classifiers, and it has been shown to improve the accuracy of an MCS. In this paper, we propose a new feature grouping method based on a Genetic Algorithm (GA) with the Localized Generalization Error Model as the evaluation criterion. Individual classifiers combined using the weighted sum are examined. Moreover, several feature grouping methods are compared with the proposed method. The experimental results on benchmark data show that the MCS trained by the proposed method is promising.
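The weighted-sum combination mentioned above can be sketched in a few lines. The abstract does not give the grouping, the base classifiers, or the combination weights, so every name and number below is hypothetical. The sketch only shows how a feature vector is split into groups and how per-classifier class scores are fused. It does not implement the GA search or the Localized Generalization Error Model.

```python
def split_into_groups(features, groups):
    """Split one feature vector into sub-vectors, one per feature group.

    `groups` is a list of index lists; each base classifier would be
    trained on, and fed, only its own sub-vector.
    """
    return [[features[i] for i in group] for group in groups]

def weighted_sum(score_lists, weights):
    """Fuse per-classifier class scores with a weighted sum."""
    n_classes = len(score_lists[0])
    combined = [0.0] * n_classes
    for scores, w in zip(score_lists, weights):
        for c, s in enumerate(scores):
            combined[c] += w * s
    return combined

def predict(score_lists, weights):
    """Return the index of the class with the highest fused score."""
    combined = weighted_sum(score_lists, weights)
    return max(range(len(combined)), key=combined.__getitem__)

# Hypothetical grouping of four features into two groups.
sub_vectors = split_into_groups([5, 6, 7, 8], [[0, 2], [1, 3]])

# Hypothetical class scores from three base classifiers for one sample.
scores = [[0.6, 0.4], [0.3, 0.7], [0.45, 0.55]]
weights = [0.5, 0.3, 0.2]
label = predict(scores, weights)
```

A search procedure such as a GA would vary the `groups` argument and keep the grouping whose fused classifier scores best under the chosen evaluation criterion.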

A Numerically Coded File of Operative Procedures Derived from a Free Text Data Collection System: A Measure of the Accuracy

Methods of Information in Medicine ◽

10.1055/s-0038-1635717 ◽

1976 ◽

Vol 15 (01) ◽

pp. 21-28 ◽

Cited By ~ 3

Author(s):

Carmen A. Scudiero ◽

Ruth L. Wong

Keyword(s):

Data Collection ◽

Pap Smear ◽

Operative Procedures ◽

Free Text ◽

Collection System ◽

Process Data ◽

Text Data ◽

Data Collection System ◽

History Of ◽

Correlation System

A free text data collection system has been developed at the University of Illinois utilizing single-word, syntax-free dictionary lookup to process data for retrieval. The source document for the system is the Surgical Pathology Request and Report form. To date, 12,653 documents have been entered into the system. The free text data was used to create an IRS (Information Retrieval System) database. A program to interrogate this database has been developed to numerically code operative procedures. A total of 16,519 procedure records were generated. Of these, 1.9% could not be fitted into any procedure category, 6.1% could not be specifically coded, and 92% were coded into specific categories. A system of PL/1 programs has been developed to facilitate manual editing of these records, which can be performed in a reasonable length of time (1 week). This manual check reveals that the 92% were coded with precision = 0.931 and recall = 0.924. Correcting the readily correctable errors could improve these figures to precision = 0.977 and recall = 0.987. Syntax errors were relatively unimportant in the overall coding process but did introduce significant error in some categories, such as when a right-left-bilateral distinction was attempted. The coded file that has been constructed will be used as an input file to a gynecological disease/PAP smear correlation system. The outputs of this system will include retrospective information on the natural history of selected diseases and a patient log providing information to the clinician on patient follow-up. Thus a free text data collection system can be utilized to produce numerically coded files of reasonable accuracy. Further, these files can be used as a source of useful information both for the clinician and for the medical researcher.
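For readers unfamiliar with the two accuracy measures quoted above, the following sketch shows how precision and recall are conventionally computed from counts of correct and incorrect codes. The abstract does not report the underlying counts, so the numbers in the example are hypothetical and are not taken from the study.

```python
def precision(true_pos, false_pos):
    """Fraction of assigned codes that were correct."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    """Fraction of codes that should have been assigned and actually were."""
    return true_pos / (true_pos + false_neg)

# Hypothetical counts for one procedure category.
tp, fp, fn = 90, 10, 5
p = precision(tp, fp)  # 90 correct out of 100 codes assigned
r = recall(tp, fn)     # 90 found out of 95 that should have been coded
```

Precision falls when records are given a code they should not have, while recall falls when records that belong in a category are missed.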
