A LDA Feature Grouping Method for Subspace Clustering of Text Data

Author(s):  
Yeshou Cai ◽  
Xiaojun Chen ◽  
Patrick Xiaogang Peng ◽  
Joshua Zhexue Huang
Author(s):  
Liping Jing ◽  
Michael K. Ng ◽  
Joshua Zhexue Huang

High-dimensional data is common in real-world data mining applications. Text data is a typical example. In text mining, a text document is viewed as a vector of terms whose dimension equals the total number of unique terms in the data set, which is usually in the thousands. High-dimensional data occurs in business as well. In retail, for example, suppliers are often categorized according to their business behaviors in order to manage supplier relationships effectively (Zhang, Huang, Qian, Xu, & Jing, 2006). The supplier behavior data is high dimensional: it contains thousands of attributes describing supplier behavior, including product items, ordered amounts, order frequencies, product quality, and so forth. DNA microarray data is another example. Clustering high-dimensional data requires special treatment (Swanson, 1990; Jain, Murty, & Flynn, 1999; Cai, He, & Han, 2005; Kontaki, Papadopoulos, & Manolopoulos, 2007), although various methods for clustering are available (Jain & Dubes, 1988). One type of clustering method for high-dimensional data is subspace clustering, which aims to find clusters in subspaces instead of the entire data space. In subspace clustering, each cluster is a set of objects identified by a subset of dimensions, and different clusters are represented in different subsets of dimensions. Soft subspace clustering assumes that different dimensions contribute differently to identifying the objects in a cluster. It represents the importance of a dimension as a weight that measures the degree to which the dimension contributes to the cluster. Soft subspace clustering can thus find the cluster memberships of objects and identify the subspace of each cluster in the same clustering process.
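
The idea of per-cluster dimension weights can be illustrated with a small sketch. This is not the authors' algorithm: it is a toy feature-weighted k-means in which each cluster's weights are a softmax of the negative within-cluster dispersion per dimension. The function name `soft_subspace_kmeans` and the smoothing parameter `gamma` are assumptions made for illustration.

```python
import math
import random

def soft_subspace_kmeans(X, k, gamma=1.0, iters=20, seed=0):
    """Toy feature-weighted k-means: every cluster keeps one weight per dimension."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    centers = [list(x) for x in rng.sample(X, k)]   # random initial centers
    weights = [[1.0 / d] * d for _ in range(k)]     # start with uniform weights
    labels = [0] * n
    for _ in range(iters):
        # Assignment step: weighted squared distance to each center.
        for i, x in enumerate(X):
            dists = [
                sum(weights[c][j] * (x[j] - centers[c][j]) ** 2 for j in range(d))
                for c in range(k)
            ]
            labels[i] = dists.index(min(dists))
        # Update step: recompute centers, then per-dimension weights.
        for c in range(k):
            members = [X[i] for i in range(n) if labels[i] == c]
            if not members:
                continue  # keep the old center and weights for an empty cluster
            centers[c] = [sum(m[j] for m in members) / len(members) for j in range(d)]
            disp = [sum((m[j] - centers[c][j]) ** 2 for m in members) for j in range(d)]
            # Softmax of negative dispersion: compact dimensions get larger weights.
            lo = min(disp)
            raw = [math.exp(-(v - lo) / gamma) for v in disp]
            total = sum(raw)
            weights[c] = [r / total for r in raw]
    return labels, centers, weights
```

Each row of the returned `weights` sums to 1, and the dimensions with the largest weights can be read as the subspace of that cluster. A larger `gamma` spreads the weight more evenly across dimensions.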


2019 ◽  
Vol 11 (12) ◽  
pp. 254
Author(s):  
Zihe Zhou ◽  
Bo Tian

Text data on social network platforms takes the form of short texts, and this massive text data is high-dimensional and sparse, so traditional clustering algorithms do not perform well on it. In this paper, a new community detection method based on the sparse subspace clustering (SSC) algorithm is proposed to deal with the sparsity and high dimensionality of short texts in online social networks. The main idea is as follows. First, structured data, including user attributes and user behavior, and unstructured data, such as user reviews, are used to construct the vector space for the network. The similarity of feature words is calculated from the positional relation of the feature words in the synonym word forest. Then, the dimensionality of the data is reduced using principal component analysis in order to improve the clustering accuracy. Further, a new community detection method for social network members based on SSC is proposed. Finally, experiments on several data sets are performed and the results are compared with the K-means clustering algorithm. Experimental results show that proper dimension reduction for high-dimensional data can improve the clustering accuracy and efficiency of the SSC approach. The proposed method can achieve a suitable community partition on online social network data sets.
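
The dimension-reduction step can be sketched as follows. This is a minimal illustration of principal component analysis via the singular value decomposition, assuming NumPy is available; it is not the paper's implementation, it does not include the SSC step, and the function name `pca_reduce` is hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```

The reduced matrix would then be passed to the clustering step (SSC or K-means) in place of the original high-dimensional vectors.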


2016 ◽  
Vol 8 (6) ◽  
pp. 1751-1766 ◽  
Author(s):  
He Zhao ◽  
Salman Salloum ◽  
Yeshou Cai ◽  
Joshua Zhexue Huang

2015 ◽  
Vol 48 (11) ◽  
pp. 3703-3713 ◽  
Author(s):  
Guojun Gan ◽  
Michael Kwok-Po Ng

Author(s):  
AKI P. F. CHAN ◽  
PATRICK P. K. CHAN ◽  
WING W. Y. NG ◽  
ERIC C. C. TSANG ◽  
DANIEL S. YEUNG

The Multiple Classifier System (MCS) has been a very popular research topic in recent years. It has been shown theoretically and empirically to outperform single classifiers in many scenarios. Creating diverse sets of classifiers is one of the key issues in building MCSs. Feature grouping is one of the methods for creating diverse classifiers, and it has been shown to improve the accuracy of an MCS. In this paper, we propose a new feature grouping method based on a Genetic Algorithm (GA) with the Localized Generalization Error Model as the evaluation criterion. The individual classifiers are combined using a weighted sum. Moreover, several feature grouping methods are compared with the proposed method. The experimental results on benchmark data show that the MCS trained by the proposed method is promising.
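
The weighted-sum combination mentioned above can be sketched in a few lines. The sketch assumes each individual classifier (for example, one per feature group) outputs one score per class; the function name `weighted_sum_fusion` and the example scores and weights are hypothetical, and the GA-based search for feature groups is not shown.

```python
def weighted_sum_fusion(scores, weights):
    """Fuse per-classifier class scores with a weighted sum.

    scores[m][c] is the score classifier m assigns to class c;
    weights[m] is the weight of classifier m.
    Returns (index of the predicted class, fused score per class).
    """
    n_classes = len(scores[0])
    fused = [
        sum(w * s[c] for w, s in zip(weights, scores))
        for c in range(n_classes)
    ]
    return fused.index(max(fused)), fused
```

For instance, with scores `[[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]`, the weights `[0.5, 0.3, 0.2]` favor class 0, while `[0.2, 0.4, 0.4]` favor class 1, which shows how the weights control each classifier's influence.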


1976 ◽  
Vol 15 (01) ◽  
pp. 21-28 ◽  
Author(s):  
Carmen A. Scudiero ◽  
Ruth L. Wong

A free text data collection system has been developed at the University of Illinois utilizing single-word, syntax-free dictionary lookup to process data for retrieval. The source document for the system is the Surgical Pathology Request and Report form. To date, 12,653 documents have been entered into the system. The free text data was used to create an IRS (Information Retrieval System) database. A program to interrogate this database has been developed to numerically code operative procedures. A total of 16,519 procedure records were generated. Of these, 1.9% could not be fitted into any procedure category and 6.1% could not be specifically coded, while 92% were coded into specific categories. A system of PL/1 programs has been developed to facilitate manual editing of these records, which can be performed in a reasonable length of time (1 week). This manual check reveals that these 92% were coded with precision = 0.931 and recall = 0.924. Correction of the readily correctable errors could improve these figures to precision = 0.977 and recall = 0.987. Syntax errors were relatively unimportant in the overall coding process, but did introduce significant error in some categories, such as when a right-left-bilateral distinction was attempted. The coded file that has been constructed will be used as an input file to a gynecological disease/PAP smear correlation system. The outputs of this system will include retrospective information on the natural history of selected diseases and a patient log providing information to the clinician on patient follow-up. Thus a free text data collection system can be utilized to produce numerically coded files of reasonable accuracy. Further, these files can be used as a source of useful information both for the clinician and for the medical researcher.
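
Single-word, syntax-free dictionary lookup can be illustrated with a short sketch. The original system was written in PL/1 and its vocabulary and code values are not given here, so the dictionary `PROCEDURE_CODES`, its codes, and the function name `code_free_text` are all hypothetical.

```python
import re

# Hypothetical term-to-code dictionary; the real vocabulary and codes are not given.
PROCEDURE_CODES = {"biopsy": 101, "hysterectomy": 205, "excision": 310}

def code_free_text(text, dictionary=PROCEDURE_CODES):
    """Return the sorted codes of all dictionary words found in the text.

    Each word is looked up on its own, so word order and syntax are ignored.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({dictionary[w] for w in words if w in dictionary})
```

Because every word is matched in isolation, a phrase such as "excision and biopsy of left ovary" yields the codes for "excision" and "biopsy" but carries no information about which side was involved, which is consistent with the errors reported for right-left-bilateral distinctions.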

