APPLYING ROBUST DIRECTIONAL SIMILARITY BASED CLUSTERING APPROACH RDSC TO CLASSIFICATION OF GENE EXPRESSION DATA

Although the classification of gene expression data from cDNA microarrays has been studied extensively, a robust clustering method that can estimate an appropriate number of clusters and is insensitive to its initialization has not yet been developed. In this work, a novel robust clustering approach, RDSC, based on a new Directional Similarity measure, is presented. RDSC integrates the Directional Similarity based Clustering algorithm (DSC) with the Agglomerative Hierarchical Clustering algorithm (AHC). It is robust to initialization and can determine an appropriate number of clusters. RDSC has been applied successfully to both artificial and benchmark gene expression datasets. Our experimental results demonstrate its superiority over the conventional method Kmeans and the two typical directional clustering algorithms SPKmeans and moVMF.
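
The abstract does not define the Directional Similarity measure, so the sketch below substitutes cosine similarity (the measure behind SPKmeans, one of the baselines named above) to illustrate what clustering by direction rather than magnitude means. It is a minimal spherical k-means under that assumption, not the authors' DSC or RDSC. The function names and the toy profiles are hypothetical.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length so that only its direction matters."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(u, v):
    """Cosine similarity of two unit vectors, which is just their dot product."""
    return sum(a * b for a, b in zip(u, v))

def spherical_kmeans(rows, k, n_iter=20, seed=0):
    """Assign each row to the most similar centroid, then re-average and re-normalize."""
    rng = random.Random(seed)
    data = [normalize(r) for r in rows]
    centroids = rng.sample(data, k)  # random initialization (assumption)
    labels = [0] * len(data)
    for _ in range(n_iter):
        labels = [max(range(k), key=lambda j: cosine(x, centroids[j])) for x in data]
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster becomes empty
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return labels

# Toy expression profiles: two rising and two falling patterns at different scales.
profiles = [[1, 2, 3], [10, 20, 31], [3, 2, 1], [30, 21, 10]]
labels = spherical_kmeans(profiles, k=2)
```

Because every profile is normalized first, the two rising profiles share a label even though their magnitudes differ by a factor of ten, which is the behavior a Euclidean method such as Kmeans would not give on raw values.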

ENTROPY-BASED CLUSTER VALIDATION AND ESTIMATION OF THE NUMBER OF CLUSTERS IN GENE EXPRESSION DATA

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720012500114 ◽

2012 ◽

Vol 10 (05) ◽

pp. 1250011

Author(s):

NATALIA NOVOSELOVA ◽

IGOR TOM

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Clustering Algorithm ◽

Selection Procedure ◽

Biological Knowledge ◽

Consensus Clustering ◽

Expression Data ◽

Cluster Validation ◽

Number Of Clusters ◽

Validity Measure

Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not analyze the stability of the groupings produced by a clustering algorithm. Building on the approach that assesses the predictive power or stability of a partitioning, we propose a new cluster validation measure and a selection procedure for determining a suitable number of clusters. The validity measure is based on estimating the "clearness" of the consensus matrix, which results from a resampling clustering scheme (consensus clustering). In the proposed selection procedure, a stable clustering result is identified by comparing its validity measure with that obtained under the null hypothesis that the data contain no clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to evaluate the clustering results on several datasets. The proposed procedure produced an accurate and robust estimate of the number of clusters, which is in agreement with biological knowledge and gold standards of cluster quality.
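
The consensus matrix described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure. The base clusterer is a plain k-means, the resampling draws 80% of the observations without replacement, and `mean_entropy` is a hypothetical stand-in for the "clearness" score (the abstract does not give its formula). All function names and the toy data are invented for the example.

```python
import math
import random

def kmeans_labels(points, k, rng, n_iter=10):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def consensus_matrix(points, k, n_resamples=30, fraction=0.8, seed=0):
    """Entry (i, j) is the share of resamples containing both i and j that co-cluster them."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, int(fraction * n))
    for _ in range(n_resamples):
        idx = rng.sample(range(n), size)
        labels = kmeans_labels([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += int(labels[a] == labels[b])
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]

def mean_entropy(matrix):
    """Average binary entropy of off-diagonal entries; 0 means every pair is unambiguous."""
    total, count = 0.0, 0
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            p = matrix[i][j]
            if 0.0 < p < 1.0:
                total -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
            count += 1
    return total / count if count else 0.0

# Two well-separated one-dimensional groups.
points = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
M = consensus_matrix(points, k=2)
score = mean_entropy(M)
```

Entries near 0 or 1 indicate pairs that are consistently separated or grouped across resamples. Repeating the computation for several values of `k`, and for permuted data as the null reference, would follow the selection procedure outlined in the abstract.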

Clustering Genes Using Heterogeneous Data Sources

Computational Knowledge Discovery for Bioinformatics Research ◽

10.4018/978-1-4666-1785-8.ch005 ◽

2013 ◽

pp. 67-83

Author(s):

Erliang Zeng ◽

Chengyong Yang ◽

Tao Li ◽

Giri Narasimhan

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Incomplete Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Exploratory Analysis ◽

Data Sources ◽

Constrained Clustering ◽

Expression Data ◽

Multiple Sources

Clustering of gene expression data is a standard exploratory technique used to identify closely related genes. Many other sources of data are also likely to be of great assistance in the analysis of gene expression data. Such data provide a means to begin elucidating the large-scale modular organization of the cell. The authors consider the challenging task of developing exploratory analytical techniques to deal with multiple complete and incomplete information sources. The Multi-Source Clustering (MSC) algorithm they developed performs clustering with multiple, but complete, sources of data. To deal with incomplete data sources, the authors adopted the MPCK-means clustering algorithm to perform exploratory analysis on one complete source, with other potentially incomplete sources provided in the form of constraints. This paper presents a new clustering algorithm, MSC, for exploratory analysis using two or more diverse but complete data sources; studies the effectiveness of constraint sets and the robustness of the constrained clustering algorithm when using multiple sources of incomplete biological data; and incorporates such incomplete data into the constrained clustering algorithm in the form of constraint sets.

A class imbalance-aware Relief algorithm for the classification of tumors using microarray gene expression data

Computational Biology and Chemistry ◽

10.1016/j.compbiolchem.2019.03.017 ◽

2019 ◽

Vol 80 ◽

pp. 121-127 ◽

Cited By ~ 3

Author(s):

Yuanyu He ◽

Junhai Zhou ◽

Yaping Lin ◽

Tuanfei Zhu

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Class Imbalance ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Relief Algorithm ◽

Classification Of Tumors ◽

Microarray Gene

Improving the Performance of Principal Components for Classification of Gene Expression Data Through Feature Selection

Studies in Classification, Data Analysis, and Knowledge Organization - Data Science and Classification ◽

10.1007/3-540-34416-0_35 ◽

2006 ◽

pp. 325-332

Author(s):

Edgar Acuña ◽

Jaime Porras

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Principal Components ◽

Expression Data

Classification of micro-array gene expression data using neural networks

The 2010 International Joint Conference on Neural Networks (IJCNN) ◽

10.1109/ijcnn.2010.5596568 ◽

2010 ◽

Author(s):

David Tian ◽

Keith Burley

Keyword(s):

Gene Expression ◽

Neural Networks ◽

Gene Expression Data ◽

Expression Data ◽

Micro Array

Cox Survival Analysis of Microarray Gene Expression Data Using Correlation Principal Component Regression

Statistical Applications in Genetics and Molecular Biology ◽

10.2202/1544-6115.1153 ◽

2007 ◽

Vol 6 (1) ◽

Cited By ~ 4

Author(s):

Qiang Zhao ◽

Jianguo Sun

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Principal Component Regression ◽

Predictive Ability ◽

Principal Component ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

New Approach ◽

Microarray Gene

Statistical analysis of microarray gene expression data has recently attracted a great deal of attention. One problem of interest is to relate genes to the survival outcomes of patients in order to build regression models that predict future patients' survival from their gene expression data. For this, several authors have discussed the use of the proportional hazards (Cox) model after reducing the dimension of the gene expression data. This paper presents a new approach to Cox survival analysis of microarray gene expression data that focuses on the models' predictive ability. The method modifies the correlation principal component regression (Sun, 1995) to handle the censoring problem of survival data. Results based on simulated data and a publicly available data set on diffuse large B-cell lymphoma show that the proposed method works well in terms of robustness and predictive ability in comparison with some existing partial least squares approaches. The new approach is also simpler and easier to implement.

Impact of Partition Based Clustering Algorithms to Cluster Samples in Microarray Gene Expression Data

Learning and Analytics in Intelligent Systems - Intelligent Techniques and Applications in Science and Technology ◽

10.1007/978-3-030-42363-6_77 ◽

2020 ◽

pp. 659-668

Author(s):

Chandra Das ◽

Shilpi Bose ◽

Debanjana Karmakar ◽

Agniswar Roy ◽

Natasha Ghosh ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Clustering Algorithms ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

Classification of Microarray Gene Expression Data by MultiBlock Dimension Reduction

Communications for Statistical Applications and Methods ◽

10.5351/ckss.2006.13.3.567 ◽

2006 ◽

Vol 13 (3) ◽

pp. 567-576

Author(s):

Mi-Ra Oh ◽

Seo-Young Kim ◽

Kyung-Sook Kim ◽

Jang-Sun Baek ◽

Young-Sook Son

Keyword(s):

Gene Expression ◽

Dimension Reduction ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

TCLUST: Trimming Approach of Robust Clustering Method

Malaysian Journal of Fundamental and Applied Sciences ◽

10.11113/mjfas.v8n4.154 ◽

2014 ◽

Vol 8 (4) ◽

Author(s):

Muhamad Alias Md. Jedi ◽

Robiah Adnan

Keyword(s):

Clustering Algorithm ◽

Likelihood Function ◽

R Package ◽

Clustering Method ◽

Number Of Clusters ◽

Robust Clustering ◽

Scatter Matrix ◽

Group Assignment ◽

Log Likelihood ◽

Clustering Approach

TCLUST is a statistical clustering method based on a modification of the trimmed k-means clustering algorithm. It is called a "crisp" clustering approach because each observation is either eliminated or assigned to a group. TCLUST strengthens the group assignment by placing constraints on the cluster scatter matrices. The emphasis in this paper is on restricting the eigenvalues, λ, of the scatter matrix. The idea of imposing constraints is to maximize the log-likelihood function of the spurious-outlier model. A review of different robust clustering approaches is presented as a comparison to the TCLUST method. This paper discusses the nature of the TCLUST algorithm, how to determine the number of clusters or groups properly, and how to measure the strength of group assignment. At the end of the paper, the R package for TCLUST is shown to implement the different types of scatter restriction, making the algorithm more flexible in the choice of the number of clusters and the trimming proportion.
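
The trimming idea that TCLUST builds on can be sketched as follows. This is a minimal trimmed k-means only, assuming squared Euclidean distance and random restarts. It does not implement the eigenvalue restriction on the scatter matrices or the spurious-outlier likelihood, and it is not the code of the R package. The function names, parameters, and toy data are hypothetical.

```python
import random

def _sq_dist(p, c):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

def trimmed_kmeans(points, k, alpha=0.1, n_starts=10, n_iter=20, seed=0):
    """Return one label per point; -1 marks observations that were trimmed."""
    rng = random.Random(seed)
    n = len(points)
    n_trim = int(alpha * n)  # number of observations discarded in each iteration
    best_cost, best_labels = float("inf"), None
    for _start in range(n_starts):
        centroids = rng.sample(points, k)
        for _step in range(n_iter):
            # Distance to, and index of, the nearest centroid for every point.
            nearest = [min((_sq_dist(p, c), j) for j, c in enumerate(centroids)) for p in points]
            # Trim the n_trim points that lie farthest from their nearest centroid.
            order = sorted(range(n), key=lambda i: nearest[i][0], reverse=True)
            trimmed = set(order[:n_trim])
            labels = [-1 if i in trimmed else nearest[i][1] for i in range(n)]
            # Update each centroid from its untrimmed members only.
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:
                    centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Keep the restart with the smallest within-cluster cost over untrimmed points.
        cost = sum(d for i, (d, _j) in enumerate(nearest) if i not in trimmed)
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels

# Two tight groups plus one gross outlier; alpha is chosen so that one point is trimmed.
points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11],
          [100, 100]]
labels = trimmed_kmeans(points, k=2, alpha=0.12)
```

Several random restarts are used because a single run can settle on a poor solution in which the outlier keeps a centroid to itself. Choosing `alpha` and `k` is the part the abstract says the TCLUST tooling is meant to help with.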
