CLUSTER ANALYSIS OF MEDICAL TEXT DOCUMENTS BY USING SEMI-CLUSTERING APPROACH BASED ON GRAPH REPRESENTATION

2018 ◽  
Vol 7 (3) ◽  
pp. 213-224
Author(s):  
Rafał Woźniak ◽  
Piotr Ożdżyński ◽  
Danuta Zakrzewska

The development of the Internet has resulted in an increasing number of online text repositories. In many cases, documents are assigned to more than one class, and automatic multi-label classification needs to be used. When the number of labels exceeds the number of documents, effective label space dimension reduction may significantly improve classification accuracy, which is a major priority in the medical field. In this paper, we propose document clustering for label selection. We use a semi-clustering method based on a graph representation, in which documents are represented by vertices and edge weights are calculated according to their mutual similarity. Assigning documents to semi-clusters helps reduce the number of labels used in the subsequent multi-label classification process. The performance of the method is examined by experiments conducted on real medical datasets.
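The graph construction described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the whitespace tokenization, the similarity threshold, and the greedy overlap-allowing merge rule are all assumptions made for the sketch.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_graph(docs, threshold=0.1):
    """Vertices are document indices; edges carry cosine-similarity weights."""
    bags = [Counter(d.split()) for d in docs]
    edges = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            w = cosine(bags[i], bags[j])
            if w >= threshold:
                edges[(i, j)] = w
    return edges

def semi_cluster(n_docs, edges):
    """Greedy semi-clustering sketch: grow clusters along the heaviest
    edges first. Unlike hard clustering, a vertex may end up in more
    than one semi-cluster, since every cluster touching an edge absorbs
    both endpoints."""
    clusters, assigned = [], set()
    for (i, j), w in sorted(edges.items(), key=lambda kv: -kv[1]):
        placed = False
        for c in clusters:
            if i in c or j in c:
                c.update((i, j))
                placed = True
        if not placed:
            clusters.append({i, j})
        assigned.update((i, j))
    for v in range(n_docs):  # isolated documents form singletons
        if v not in assigned:
            clusters.append({v})
    return clusters
```

The cluster memberships then stand in for the original label set in the downstream multi-label classifier.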

2017 ◽  
Vol 2017 ◽  
pp. 1-13 ◽  
Author(s):  
Junkai Yi ◽  
Yacong Zhang ◽  
Xianghui Zhao ◽  
Jing Wan

Text clustering is an effective approach to collect and organize text documents into meaningful groups for mining valuable information on the Internet. However, there are some issues to tackle, such as feature extraction and data dimension reduction. To overcome these problems, we present a novel approach named the deep-learning vocabulary network. The vocabulary network is constructed based on a related-word set, which contains the “co-occurrence” relations of words or terms. We replace term frequency in feature vectors with the “importance” of words in terms of the vocabulary network and PageRank, which generates more precise feature vectors to represent the meaning of texts for clustering. Furthermore, a sparse-group deep belief network is proposed to reduce the dimensionality of feature vectors, and we introduce a coverage rate as the similarity measure in Single-Pass clustering. To verify the effectiveness of our work, we compare the approach with representative algorithms; the experimental results show that feature vectors based on the deep-learning vocabulary network yield better clustering performance.
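The core substitution, PageRank importance in place of term frequency, can be sketched as follows. This is a minimal power-iteration PageRank over a word co-occurrence graph, assuming an undirected graph with no dangling nodes; the paper's related-word set construction is not reproduced here.

```python
def pagerank(graph, damping=0.85, iters=50):
    """Power-iteration PageRank over a co-occurrence graph given as
    {word: set(neighbour words)}. Assumes no dangling nodes."""
    words = list(graph)
    n = len(words)
    rank = {w: 1.0 / n for w in words}
    for _ in range(iters):
        new = {}
        for w in words:
            # each neighbour u donates rank[u] / degree(u) to w
            share = sum(rank[u] / len(graph[u]) for u in graph[w] if graph[u])
            new[w] = (1 - damping) / n + damping * share
        rank = new
    return rank

def importance_vector(doc_words, rank):
    """Feature vector: PageRank importance instead of raw term frequency."""
    return [rank.get(w, 0.0) for w in doc_words]
```

Words that sit at the hub of many co-occurrence relations receive higher scores than words the raw term frequency would weight identically.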


2017 ◽  
Vol 79 (5) ◽  
Author(s):  
Athraa Jasim Mohammed ◽  
Yuhanis Yusof ◽  
Husniza Husni

Text clustering is one of the text mining tasks employed in search engines. Discovering the optimal number of clusters for a dataset or repository is a challenging problem. Various clustering algorithms have been reported in the literature, but most of them rely on a pre-defined number of clusters k. In this study, a variant of the Firefly algorithm, termed FireflyClust, is proposed to automatically cluster text documents in a hierarchical manner. The proposed clustering method operates in five phases: data pre-processing, clustering, item re-location, cluster selection and cluster refinement. Experiments are undertaken with different selections of the threshold value. Results on the TREC collections TR11, TR12, TR23 and TR45 showed that FireflyClust is a better approach than Bisect K-means, hybrid Bisect K-means and the Practical General Stochastic Clustering Method. Such results would enlighten the directions for developing a better information retrieval engine in this dynamic and fast-growing big data era.
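The firefly mechanism behind such methods can be sketched on 1-D data: fireflies encode candidate cluster centres, brightness is the negative clustering error, and dimmer fireflies are attracted toward brighter ones. The brightness definition, step sizes beta and alpha, and the 1-D setting are all simplifying assumptions; FireflyClust's hierarchical phases are not reproduced.

```python
import random

def brightness(centroid, points):
    """Brighter firefly = lower clustering error (1-centre sum of squares)."""
    return -sum((p - centroid) ** 2 for p in points)

def firefly_step(fireflies, points, beta=0.5, alpha=0.05, rng=None):
    """One attraction step: every firefly moves toward each brighter
    firefly (step size beta), plus a small random walk (alpha)."""
    rng = rng or random.Random(0)
    updated = []
    for xi in fireflies:
        x = xi
        for xj in fireflies:
            if brightness(xj, points) > brightness(x, points):
                x += beta * (xj - x) + alpha * (rng.random() - 0.5)
        updated.append(x)
    return updated
```

Iterating the step concentrates the swarm near low-error centres without fixing k in advance, which is the property the abstract exploits.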


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Hamideh Soltani ◽  
Zahra Einalou ◽  
Mehrdad Dadgostar ◽  
Keivan Maghooli

Brain computer interface (BCI) systems have been regarded as a new way of communication for humans. In this research, common methods such as the wavelet transform are applied to extract features, and a genetic algorithm (GA), as an evolutionary method, is used to select features. Finally, classification is performed using two approaches: the support vector machine (SVM) and the Bayesian method. Five features were selected, and the accuracy of the Bayesian classification was measured to be 80% with dimension reduction. Ultimately, the classification accuracy reached 90.4% using the SVM classifier. The results indicate better feature selection, effective dimension reduction of these features, and a higher classification accuracy in comparison with other studies.


2014 ◽  
Vol 14 (3) ◽  
pp. 25-36
Author(s):  
Bohdan Pavlyshenko

This paper describes the analysis of possible differentiation of the author's idiolect in the space of semantic fields; it also analyzes the clustering of text documents in the vector space of semantic fields and in a semantic space with an orthogonal basis. The analysis showed that using a vector space model based on semantic fields is efficient in cluster analysis algorithms for authors' texts in English fiction. The study of the distribution of authors' texts in the cluster structure showed the presence of areas of the semantic space that represent the idiolects of individual authors. Such areas are described by clusters in which only one author dominates. Clusters in which the texts of several authors dominate can be considered areas of semantic similarity of authors' styles. SVD factorization of the semantic fields matrix makes it possible to significantly reduce the dimension of the semantic space in the cluster analysis of authors' texts. Clustering the semantic field vector space can be efficient in a comparative analysis of authors' styles and idiolects. The clusters of some authors' idiolects are semantically invariant and do not depend on changes in the basis of the semantic space or the clustering method.
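The SVD-based dimension reduction mentioned above can be sketched with plain power iteration: iterating v ← AᵀAv converges to the dominant right singular vector, i.e. the best one-dimensional subspace for projecting the document rows. A full SVD would yield further components; one suffices to illustrate the reduction step, and the toy matrix below is an assumption, not the paper's data.

```python
import math

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_singular_direction(A, iters=100):
    """Power iteration on A^T A: returns the dominant right singular
    vector of A (rows = documents, columns = semantic fields)."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))          # v <- A^T A v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(A, v):
    """Reduce each document vector to a single coordinate along v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in A]
```

Keeping the top few such directions instead of one gives the truncated SVD used for the cluster analysis in the abstract.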


Author(s):  
Muhamad Alias Md. Jedi ◽  
Robiah Adnan

TCLUST is a statistical clustering method based on a modification of the trimmed k-means clustering algorithm. It is called a “crisp” clustering approach because each observation is either eliminated or assigned to a group. TCLUST strengthens the group assignment by putting constraints on the cluster scatter matrices. The emphasis in this paper is on restricting the eigenvalues λ of the scatter matrix. The idea of imposing constraints is to maximize the log-likelihood function of the spurious-outlier model. A review of different robust clustering approaches is presented as a comparison to the TCLUST method. This paper discusses the nature of the TCLUST algorithm, how to determine the number of clusters or groups properly, and how to measure the strength of group assignment. Finally, the R package for TCLUST implements the types of scatter restriction, making the algorithm more flexible in choosing the number of clusters and the trimming proportion.
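The trimming idea at the heart of TCLUST can be sketched on 1-D data: at each iteration the ceil(alpha·n) points farthest from their nearest centre are discarded before the centres are re-estimated, so gross outliers cannot drag the clusters. TCLUST additionally constrains the eigenvalue ratios of the cluster scatter matrices; that refinement, and proper initialization, are omitted from this sketch.

```python
import math

def trimmed_kmeans(points, k=2, alpha=0.1, iters=20):
    """Trimmed k-means on 1-D data. alpha is the trimming proportion."""
    centres = list(points[:k])      # naive initialization for the sketch
    n_trim = math.ceil(alpha * len(points))
    kept = list(points)
    for _ in range(iters):
        # trim the n_trim points farthest from their nearest centre
        d = sorted((min(abs(p - c) for c in centres), p) for p in points)
        kept = [p for _, p in d[: len(points) - n_trim]]
        # assign the kept points and re-estimate the centres
        groups = [[] for _ in range(k)]
        for p in kept:
            groups[min(range(k), key=lambda i: abs(p - centres[i]))].append(p)
        centres = [sum(g) / len(g) if g else centres[i]
                   for i, g in enumerate(groups)]
    return sorted(centres), kept
```

With a suitable trimming proportion, the outliers end up among the discarded points rather than distorting any group mean.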


Author(s):  
A I Maksimov ◽  
M V Gashnikov

We propose a new adaptive multidimensional signal interpolator for differential compression tasks. To increase the efficiency of interpolation, we optimize its parameter space by the minimum absolute interpolation error criterion. To reduce the complexity of interpolation optimization, we reduce the dimension of its parameter space. The correspondence between signal samples in a local neighbourhood is parameterized, and we compare several methods for such parameterization. The developed adaptive interpolator is embedded in the differential compression method. Computational experiments on real multidimensional signals confirm that the proposed interpolator can increase the compression ratio.
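The differential-compression role of such an interpolator can be sketched in one dimension: a predictor parameter is chosen by the minimum absolute error criterion, and the signal is stored as prediction residuals. The two-tap predictor xhat[i] = w·x[i-1] + (1-w)·x[i-2] and the candidate grid for w are assumptions for the sketch, not the paper's multidimensional interpolator.

```python
def best_weight(signal, candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the predictor parameter w minimizing the total absolute
    interpolation error (minimum-absolute-error criterion)."""
    def err(w):
        return sum(abs(signal[i] - (w * signal[i-1] + (1-w) * signal[i-2]))
                   for i in range(2, len(signal)))
    return min(candidates, key=err)

def encode(signal, w):
    """Differential representation: first two samples verbatim, then
    prediction residuals (small residuals compress well)."""
    res = [signal[0], signal[1]]
    res += [signal[i] - (w * signal[i-1] + (1-w) * signal[i-2])
            for i in range(2, len(signal))]
    return res

def decode(res, w):
    """Invert encode() by re-running the predictor on decoded samples."""
    out = [res[0], res[1]]
    for r in res[2:]:
        out.append(r + w * out[-1] + (1 - w) * out[-2])
    return out
```

An entropy coder applied to the residual stream would complete the compression pipeline.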


Author(s):  
Rong-Hua Li ◽  
Shuang Liang ◽  
George Baciu ◽  
Eddie Chan

Singularity problems of scatter matrices in Linear Discriminant Analysis (LDA) are challenging and have received attention during the last decade. Linear Discriminant Analysis via QR decomposition (LDA/QR) and Direct Linear Discriminant Analysis (DLDA) are two popular algorithms for solving the singularity problem. This paper establishes an equivalence relationship between LDA/QR and DLDA: both can be regarded as special cases of pseudo-inverse LDA. Similar to the LDA/QR algorithm, DLDA can also be considered a two-stage LDA method; interestingly, the first stage of DLDA can act as a dimension reduction algorithm. The experiments compare the LDA/QR and DLDA algorithms in terms of classification accuracy and computational complexity on several benchmark datasets, and compare their first stages. The results confirm the established equivalence relationship and verify their capabilities in dimension reduction.
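The singularity problem itself is easy to demonstrate: when the number of samples does not exceed the data dimension, the scatter matrix has deficient rank and cannot be inverted, which is exactly what LDA/QR and DLDA work around. The sketch below computes a total scatter matrix directly from its definition; the 3-D toy data are assumptions for illustration.

```python
def scatter_matrix(X):
    """Total scatter S_t = sum over samples of (x - mean)(x - mean)^T,
    for row-vector samples X."""
    d = len(X[0])
    mean = [sum(x[j] for x in X) / len(X) for j in range(d)]
    S = [[0.0] * d for _ in range(d)]
    for x in X:
        c = [x[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                S[a][b] += c[a] * c[b]
    return S

def det3(M):
    """Determinant of a 3x3 matrix (zero means the matrix is singular)."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
          - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
          + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
```

With n samples in d dimensions the scatter has rank at most n-1, so two samples in 3-D are already singular, while four generic samples are not.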


2020 ◽  
Vol 12 (2) ◽  
pp. 321
Author(s):  
Jiao Guo ◽  
Henghui Li ◽  
Jifeng Ning ◽  
Wenting Han ◽  
Weitao Zhang ◽  
...  

Crop classification in agriculture is one of the important applications of polarimetric synthetic aperture radar (PolSAR) data. For agricultural crop discrimination, multi-temporal data can dramatically increase crop classification accuracies compared with single-temporal data, since the same crop shows different external phenomena as it grows. In practice, the utilization of multi-temporal data encounters a serious problem known as the “dimension disaster”. Aiming to solve this problem and raise the classification accuracy, this study developed a feature dimension reduction method using stacked sparse auto-encoders (S-SAEs) for crop classification. First, various incoherent scattering decomposition algorithms were employed to extract a variety of detailed and quantitative parameters from multi-temporal PolSAR data. Second, based on an analysis of the configuration and main parameters for constructing an S-SAE, a three-hidden-layer S-SAE network was built to reduce the dimensionality and extract effective features to manage the “dimension disaster” caused by excessive scattering parameters, especially for multi-temporal, quad-pol SAR images. Third, a convolutional neural network (CNN) was constructed and employed to further enhance the crop classification performance. Finally, the performance of the proposed strategy was assessed with simulated multi-temporal Sentinel-1 data for two experimental sites established by the European Space Agency (ESA). The experimental results showed that the overall accuracy with the proposed method was raised by at least 17% compared with the long short-term memory (LSTM) method in the case of a 1% training ratio. Meanwhile, for a CNN classifier, the overall accuracy was almost 4% higher than those of the principal component analysis (PCA) and locally linear embedding (LLE) methods. The comparison studies clearly demonstrated the advantage of the proposed multi-temporal crop classification methodology in terms of classification accuracy, even with small training ratios.
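The auto-encoder core of the S-SAE can be sketched as a single tiny linear layer with tied weights (encoder W, decoder Wᵀ) trained by plain SGD to reconstruct its input through a narrower hidden code. A real S-SAE stacks several such layers, uses nonlinearities, and adds a sparsity penalty; all of that, plus the toy data and learning rate, are simplifications assumed for this sketch.

```python
import random

def train_autoencoder(X, hidden=1, lr=0.05, epochs=200, seed=0):
    """Tiny linear autoencoder with tied weights, trained by SGD to
    map d-dimensional rows of X through a `hidden`-dimensional code
    and back. Returns the weights and an error-evaluation closure."""
    rng = random.Random(seed)
    d = len(X[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(hidden)]

    def reconstruction_error():
        e = 0.0
        for x in X:
            h = [sum(W[k][j] * x[j] for j in range(d)) for k in range(hidden)]
            r = [sum(W[k][j] * h[k] for k in range(hidden)) for j in range(d)]
            e += sum((x[j] - r[j]) ** 2 for j in range(d))
        return e

    for _ in range(epochs):
        for x in X:
            h = [sum(W[k][j] * x[j] for j in range(d)) for k in range(hidden)]
            r = [sum(W[k][j] * h[k] for k in range(hidden)) for j in range(d)]
            err = [r[j] - x[j] for j in range(d)]
            for k in range(hidden):
                back = sum(err[m] * W[k][m] for m in range(d))
                for j in range(d):
                    # gradient of ||r - x||^2 w.r.t. the tied weight W[k][j]
                    W[k][j] -= lr * (err[j] * h[k] + back * x[j])
    return W, reconstruction_error
```

Stacking means training one such layer, fixing it, and training the next layer on the hidden codes, which is how the three-hidden-layer S-SAE compresses the excessive scattering parameters before the CNN classifier.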

