Clustering Genes Using Heterogeneous Data Sources

Clustering of gene expression data is a standard exploratory technique used to identify closely related genes. Many other sources of data are also likely to be of great assistance in the analysis of gene expression data. Together, these data provide a means to begin elucidating the large-scale modular organization of the cell. The authors consider the challenging task of developing exploratory analytical techniques to deal with multiple complete and incomplete information sources. The Multi-Source Clustering (MSC) algorithm they developed performs clustering with multiple, but complete, sources of data. To deal with incomplete data sources, the authors adopted the MPCK-means clustering algorithm to perform exploratory analysis on one complete source together with other, potentially incomplete, sources provided in the form of constraints. This paper presents a new clustering algorithm, MSC, that performs exploratory analysis using two or more diverse but complete data sources; studies the effectiveness of constraint sets and the robustness of the constrained clustering algorithm using multiple sources of incomplete biological data; and incorporates such incomplete data into the constrained clustering algorithm in the form of constraint sets.
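To make the idea of supplying an incomplete source "in the form of constraints" concrete, the following is a minimal sketch of a constraint-penalized assignment step. It is not the MSC or MPCK-means algorithm from the paper (MPCK-means also learns a metric, which is omitted here); the function names, the greedy point order, and the single `penalty` weight are illustrative assumptions. A must-link pair could, for example, encode two genes that share an annotation in a partial data source.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Point = Sequence[float]
Pair = Tuple[int, int]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partner(pair: Pair, i: int) -> Optional[int]:
    """Return the other index of the pair if i belongs to it, else None."""
    a, b = pair
    if a == i:
        return b
    if b == i:
        return a
    return None


def assign_with_constraints(
    points: Sequence[Point],
    centroids: Sequence[Point],
    must_link: Sequence[Pair] = (),
    cannot_link: Sequence[Pair] = (),
    penalty: float = 1.0,
) -> List[int]:
    """Greedy assignment: distance to centroid plus a penalty per violated constraint.

    Hypothetical simplification: points are visited in index order and only
    constraints whose partner is already labelled are checked.
    """
    labels: Dict[int, int] = {}
    for i, p in enumerate(points):
        best_k, best_cost = 0, float("inf")
        for k, c in enumerate(centroids):
            cost = sq_dist(p, c)
            for pair in must_link:
                j = partner(pair, i)
                if j is not None and j in labels and labels[j] != k:
                    cost += penalty  # must-link partner sits in another cluster
            for pair in cannot_link:
                j = partner(pair, i)
                if j is not None and j in labels and labels[j] == k:
                    cost += penalty  # cannot-link partner sits in this cluster
            if cost < best_cost:
                best_k, best_cost = k, cost
        labels[i] = best_k
    return [labels[i] for i in range(len(points))]


points = [(0.0,), (4.0,), (10.0,)]
centroids = [(0.0,), (10.0,)]
print(assign_with_constraints(points, centroids))  # [0, 0, 1]
print(assign_with_constraints(points, centroids,
                              cannot_link=[(0, 1)], penalty=100.0))  # [0, 1, 1]
```

Because constraints only mention the pairs for which side information exists, genes absent from the incomplete source are simply clustered by distance alone.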

APPLYING ROBUST DIRECTIONAL SIMILARITY BASED CLUSTERING APPROACH RDSC TO CLASSIFICATION OF GENE EXPRESSION DATA

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720006002144 ◽

2006 ◽

Vol 04 (03) ◽

pp. 745-768

Author(s):

H. X LI ◽

SHITONG WANG ◽

YU XIU

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Expression Data ◽

Number Of Clusters ◽

Robust Clustering ◽

New Approach ◽

Clustering Approach

Although the classification of gene expression data from cDNA microarrays has been extensively studied, a robust clustering method that can estimate an appropriate number of clusters and is insensitive to its initialization has not yet been developed. In this work, a novel robust clustering approach, RDSC, based on a new directional similarity measure is presented. RDSC integrates the Directional Similarity based Clustering algorithm (DSC) with the Agglomerative Hierarchical Clustering algorithm (AHC); it is robust to initialization and can determine an appropriate number of clusters. RDSC has been successfully applied to both artificial and benchmark gene expression datasets. Our experimental results demonstrate its superiority over the conventional K-means method and two typical directional clustering algorithms, SPKmeans and moVMF.
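The abstract does not define the new directional similarity measure, so the sketch below illustrates only the general idea of directional clustering using plain cosine similarity, as in spherical k-means (the SPKmeans baseline named above). It is not RDSC or DSC; the function names and the fixed iteration count are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so that only its direction matters."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: 1 for the same direction, 0 for orthogonal vectors."""
    return dot(normalize(u), normalize(v))


def spherical_kmeans(profiles, init_centroids, iterations: int = 10) -> List[int]:
    """Assign each profile to the centroid with the highest cosine similarity."""
    units = [normalize(p) for p in profiles]
    centroids = [normalize(c) for c in init_centroids]
    labels = [0] * len(units)
    for _ in range(iterations):
        labels = [
            max(range(len(centroids)), key=lambda k: dot(u, centroids[k]))
            for u in units
        ]
        for k in range(len(centroids)):
            members = [u for u, lab in zip(units, labels) if lab == k]
            if not members:
                continue  # keep the old centroid for an empty cluster
            total = [sum(col) for col in zip(*members)]
            if any(abs(x) > 0.0 for x in total):
                centroids[k] = normalize(total)
    return labels


# Profiles 0 and 1 rise over time, 2 and 3 fall; magnitudes differ tenfold.
profiles = [[1, 2, 3], [10, 20, 30], [3, 2, 1], [30, 20, 10]]
print(spherical_kmeans(profiles, init_centroids=[[1, 2, 3], [3, 2, 1]]))  # [0, 0, 1, 1]
```

The example groups expression profiles by shape rather than magnitude, which is the usual motivation for directional measures. Note that this sketch still depends on its initial centroids, which is the sensitivity the paper sets out to reduce.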

Visualising Inconsistency and Incompleteness in RDF Gene Expression Data using FCA

International Journal of Conceptual Structures and Smart Applications ◽

10.4018/ijcssa.2014010105 ◽

2014 ◽

Vol 2 (1) ◽

pp. 68-82 ◽

Cited By ~ 1

Author(s):

Honour Chika Nwagwu

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Incomplete Data ◽

Formal Concept Analysis ◽

Concept Analysis ◽

Data Sources ◽

Formal Concept ◽

Expression Data ◽

Data Formats

The integration of data from different data sources can result in inconsistent or incomplete data (IID). IID can undermine the validity of information retrieved from an integrated dataset, so there is a need to identify these anomalies. This work presents SPARQL queries that retrieve inconsistent or incomplete information from an EMAGE dataset. It also shows how Formal Concept Analysis (FCA) tools, notably FcaBedrock and Concept Explorer, can be applied to identify and visualise IID in the retrieved information. Although instances of IID can exist in most data formats, the investigation focuses on RDF datasets.
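As a rough illustration of what "inconsistent" and "incomplete" can mean for RDF-style data, the sketch below scans subject-predicate-object triples in plain Python. It does not reproduce the paper's SPARQL queries or its FCA workflow; the triples, the predicate names (`strength`, `stage`), and the rule that a predicate should have a single value per subject are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def find_iid(
    triples: Iterable[Triple],
    required_predicates: Set[str],
    single_valued_predicates: Set[str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (inconsistent, incomplete) lists of (subject, predicate) pairs.

    Inconsistent: a single-valued predicate carries more than one distinct value.
    Incomplete: a subject lacks a required predicate altogether.
    """
    values = defaultdict(set)
    subjects = set()
    for s, p, o in triples:
        subjects.add(s)
        values[(s, p)].add(o)

    inconsistent = sorted(
        (s, p)
        for (s, p), objs in values.items()
        if p in single_valued_predicates and len(objs) > 1
    )
    incomplete = sorted(
        (s, p)
        for s in subjects
        for p in required_predicates
        if (s, p) not in values
    )
    return inconsistent, incomplete


# Hypothetical integrated records: e1 has conflicting values, e2 lacks a stage.
triples = [
    ("e1", "strength", "detected"),
    ("e1", "strength", "not detected"),
    ("e1", "stage", "TS17"),
    ("e2", "strength", "detected"),
]
inconsistent, incomplete = find_iid(triples, {"strength", "stage"}, {"strength"})
print(inconsistent)  # [('e1', 'strength')]
print(incomplete)    # [('e2', 'stage')]
```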

Microarray-MD: A system for exploratory analysis of microarray gene expression data

Computer Methods and Programs in Biomedicine ◽

10.1016/j.cmpb.2006.06.008 ◽

2006 ◽

Vol 83 (2) ◽

pp. 157-167 ◽

Cited By ~ 1

Author(s):

D.E. Maroulis ◽

I.N. Flaounas ◽

D.K. Iakovidis ◽

S.A. Karkanis

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Exploratory Analysis ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

ENTROPY-BASED CLUSTER VALIDATION AND ESTIMATION OF THE NUMBER OF CLUSTERS IN GENE EXPRESSION DATA

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720012500114 ◽

2012 ◽

Vol 10 (05) ◽

pp. 1250011

Author(s):

NATALIA NOVOSELOVA ◽

IGOR TOM

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Clustering Algorithm ◽

Selection Procedure ◽

Biological Knowledge ◽

Consensus Clustering ◽

Expression Data ◽

Cluster Validation ◽

Number Of Clusters ◽

Validity Measure

Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not analyze the stability of the groupings produced by a clustering algorithm. Building on an approach that assesses the predictive power or stability of a partitioning, we propose a new measure of cluster validation and a selection procedure to determine a suitable number of clusters. The validity measure is based on estimating the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme, or consensus clustering. In the proposed selection procedure, the stable clustering result is determined with reference to the validity measure under the null hypothesis that encodes the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to estimate the clustering results on several datasets. The proposed procedure produced an accurate and robust estimate of the number of clusters, in agreement with biological knowledge and gold standards of cluster quality.
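The consensus matrix mentioned above can be sketched as follows: cluster many random subsamples and record, for each pair of items, how often they land in the same cluster out of the times they were sampled together. This is a generic consensus-clustering sketch, not the authors' procedure; their "clearness" measure and null-hypothesis comparison are not specified in the abstract and are not implemented. The function names, the `fraction` parameter, and the toy threshold clusterer are assumptions.

```python
import random
from typing import Callable, List, Sequence


def consensus_matrix(
    data: Sequence,
    cluster_fn: Callable[[List], List[int]],
    n_resamples: int = 50,
    fraction: float = 0.8,
    seed: int = 0,
) -> List[List[float]]:
    """Entry [i][j] is the share of co-sampled runs in which i and j co-cluster."""
    n = len(data)
    rng = random.Random(seed)
    size = min(n, max(2, round(fraction * n)))
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        idx = rng.sample(range(n), size)
        labels = cluster_fn([data[i] for i in idx])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def threshold_clusterer(values: List[float]) -> List[int]:
    """Toy stand-in for a real clustering algorithm: split 1-D values at 5."""
    return [0 if v < 5.0 else 1 for v in values]


data = [0.1, 0.3, 0.2, 9.8, 9.9, 10.1]
matrix = consensus_matrix(data, threshold_clusterer, n_resamples=20)
for row in matrix:
    print([round(v, 2) for v in row])
```

For well-separated groups the entries sit at 0 or 1; intermediate values indicate pairs whose grouping changes between resamples, which is the kind of instability a stability-based validity measure is designed to detect.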

Impact of Partition Based Clustering Algorithms to Cluster Samples in Microarray Gene Expression Data

Learning and Analytics in Intelligent Systems - Intelligent Techniques and Applications in Science and Technology ◽

10.1007/978-3-030-42363-6_77 ◽

2020 ◽

pp. 659-668

Author(s):

Chandra Das ◽

Shilpi Bose ◽

Debanjana Karmakar ◽

Agniswar Roy ◽

Natasha Ghosh ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Clustering Algorithms ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

Effectiveness of Different Partition Based Clustering Algorithms for Estimation of Missing Values in Microarray Gene Expression Data

Advances in Computing and Information Technology - Advances in Intelligent Systems and Computing ◽

10.1007/978-3-642-31552-7_5 ◽

2013 ◽

pp. 37-47 ◽

Cited By ~ 2

Author(s):

Shilpi Bose ◽

Chandra Das ◽

Abirlal Chakraborty ◽

Samiran Chattopadhyay

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Missing Values ◽

Clustering Algorithms ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

Clustering Algorithms in Gene Expression: Data Analysis

10.1109/icrito51393.2021.9596549 ◽

2021 ◽

Author(s):

Karuna Ghai ◽

Jaspreet Singh

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Gene Expression Data ◽

Clustering Algorithms ◽

Expression Data ◽

Gene Expression Data Analysis

Inference of Gene Regulatory Networks by Topological Prior Information and Data Integration

Biotechnology ◽

10.4018/978-1-5225-8903-7.ch010 ◽

2019 ◽

pp. 265-304

Author(s):

David Correa Martins Jr. ◽

Fabricio Martins Lopes ◽

Shubhra Sankar Ray

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Regulatory Networks ◽

Regulatory Networks ◽

Prior Information ◽

Heterogeneous Data ◽

Data Sources ◽

Expression Data ◽

Heterogeneous Data Sources ◽

Gene Regulatory

The inference of Gene Regulatory Networks (GRNs) is a very challenging problem that has attracted increasing attention since the development of high-throughput sequencing and gene expression measurement technologies. Many models and algorithms have been developed to identify GRNs using mainly gene expression profiles as the data source. Because gene expression data usually have a limited number of samples and inherent noise, integrating gene expression with several other sources of information can be vital for accurately inferring GRNs. For instance, prior information about the overall topological structure of the GRN can guide inference techniques toward better results. In addition to gene expression data, biological information from heterogeneous data sources has recently been integrated by GRN inference methods as well. The objective of this chapter is to present an overview of GRN inference models and techniques, with a focus on the incorporation of prior information, such as global and local topological features, and on the integration of several heterogeneous data sources.

Multi-cancer samples clustering via graph regularized low-rank representation method under sparse and symmetric constraints

BMC Bioinformatics ◽

10.1186/s12859-019-3231-5 ◽

2019 ◽

Vol 20 (S22) ◽

Author(s):

Juan Wang ◽

Cong-Hai Lu ◽

Jin-Xing Liu ◽

Ling-Yun Dai ◽

Xiang-Zhen Kong

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Clustering Algorithm ◽

Low Rank ◽

Expression Data ◽

Geometrical Structures ◽

Graph Regularization ◽

Raw Data ◽

Clustering Quality ◽

Low Rank Representation

Background: Identifying different types of cancer based on gene expression data has become a hotspot in bioinformatics research. Clustering gene expression data from multiple cancers into their respective classes is a significant solution. However, the high dimensionality and small sample sizes of gene expression data, together with noise in the data, make data mining and research difficult. Although there are many effective and feasible methods to deal with this problem, the possibility remains that these methods are flawed. Results: In this paper, we propose the graph regularized low-rank representation under symmetric and sparse constraints (sgLRR) method, in which we introduce graph regularization based on manifold learning and symmetric sparse constraints into the traditional low-rank representation (LRR). In the sgLRR method, the symmetric and sparse constraints alleviate the effect of raw data noise on the low-rank representation. Further, the sgLRR method preserves the important intrinsic local geometrical structures of the raw data by introducing graph regularization. We apply this method to cluster multi-cancer samples based on gene expression data, which improves the clustering quality. First, the gene expression data are decomposed by the sgLRR method, yielding a lowest-rank representation matrix that is symmetric and sparse. Then, an affinity matrix is constructed and used to cluster the multi-cancer samples with a spectral clustering algorithm, namely normalized cuts (Ncuts). Conclusions: A series of comparative experiments demonstrates that the sgLRR method based on low-rank representation has a great advantage and remarkable performance in the clustering of multi-cancer samples.
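The step from a representation matrix to an affinity matrix can be sketched as below. The abstract does not say how the affinity matrix is built, so the symmetrization W = (|Z| + |Z^T|) / 2, a common choice in LRR-based clustering, is an assumption here, as are the function names. The sgLRR optimization itself and the eigen-decomposition used by normalized cuts are omitted; only the degree-normalized affinity that spectral clustering would decompose is computed.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def affinity_from_representation(Z: Sequence[Sequence[float]]) -> Matrix:
    """Assumed construction: W[i][j] = (|Z[i][j]| + |Z[j][i]|) / 2."""
    n = len(Z)
    return [
        [(abs(Z[i][j]) + abs(Z[j][i])) / 2.0 for j in range(n)]
        for i in range(n)
    ]


def normalized_affinity(W: Sequence[Sequence[float]]) -> Matrix:
    """Compute D^(-1/2) W D^(-1/2), where D holds the row sums (degrees) of W."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [
        [
            W[i][j] / math.sqrt(degrees[i] * degrees[j])
            if degrees[i] > 0.0 and degrees[j] > 0.0
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical 3-sample representation matrix: samples 0 and 1 represent
# each other, sample 2 is represented only by itself.
Z = [
    [0.0, 0.8, 0.0],
    [-0.4, 0.0, 0.0],
    [0.0, 0.0, 0.5],
]
W = affinity_from_representation(Z)
N = normalized_affinity(W)
for row in N:
    print([round(v, 3) for v in row])
```

A spectral method would then take the leading eigenvectors of the normalized matrix and group samples by their coordinates in that space.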
