How to Group Genes according to Expression Profiles?

2011 ◽  
Vol 2011 ◽  
pp. 1-10
Author(s):  
Julio A. Di Rienzo ◽  
Silvia G. Valdano ◽  
Paula Fernández

The most commonly applied strategies for identifying genes with a common response profile are based on clustering algorithms. These methods have no explicit rules for defining the appropriate number of groups of genes. Usually, the number of clusters is decided on heuristic criteria or through one of the many methods proposed to assess the number of clusters in a data set. The purpose of this paper is to compare the performance of seven of these techniques, including both traditional and recently proposed ones. All of them underestimate the true number of clusters. However, within this limitation, the gDGC algorithm appears to be the best. It is the only one that states an explicit rule for cutting a dendrogram within a hypothesis-testing framework, allowing the user to calibrate its sensitivity by adjusting the significance level.
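As a rough illustration of the idea of cutting a dendrogram to obtain groups: the sketch below cuts at a fixed height, whereas gDGC derives the cut-off from a hypothesis test at a chosen significance level. The synthetic "expression profiles" and the threshold are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated groups of synthetic 4-condition expression profiles
X = np.vstack([rng.normal(0, 0.1, (10, 4)), rng.normal(3, 0.1, (10, 4))])

Z = linkage(X, method="average")
# Cut the dendrogram at a fixed height; gDGC instead derives this
# cut-off from a hypothesis test, so the user tunes a significance
# level rather than a raw distance threshold.
labels = fcluster(Z, t=1.0, criterion="distance")
n_clusters = len(np.unique(labels))
```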


2011 ◽  
Vol 2 (4) ◽  
pp. 1-13 ◽  
Author(s):  
Derrick S. Boone

The accuracy of “stopping rules” for determining the number of clusters in a data set is examined as a function of the underlying clustering algorithm being used. Using a Monte Carlo study, various stopping rules, used in conjunction with six clustering algorithms, are compared to determine which rule/algorithm combinations best recover the true number of clusters. The rules and algorithms are tested using disparately sized, artificially generated data sets that contained multiple numbers and levels of clusters, variables, noise, outliers, and elongated and unequally sized clusters. The results indicate that stopping rule accuracy depends on the underlying clustering algorithm being used. The cubic clustering criterion (CCC), when used in conjunction with mixture models or Ward’s method, recovers the true number of clusters more accurately than other rules and algorithms. However, the CCC was more likely than other stopping rules to report more clusters than are actually present. Implications are discussed.
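A minimal sketch of a stopping rule used with Ward's method: fit the hierarchy once, extract partitions of increasing size, and pick the one that maximizes an internal criterion. The CCC itself is not in common Python libraries, so the Calinski-Harabasz index stands in here; the data are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(1)
# Three well-separated synthetic clusters
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in (0, 4, 8)])

Z = linkage(X, method="ward")
scores = {}
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = calinski_harabasz_score(X, labels)
# The stopping rule reports the k with the best criterion value
best_k = max(scores, key=scores.get)
```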


2020 ◽  
Vol 11 (3) ◽  
pp. 42-67
Author(s):  
Soumeya Zerabi ◽  
Souham Meshoul ◽  
Samia Chikhi Boucherkha

Cluster validation aims both to evaluate the results of clustering algorithms and to predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based on computing pairwise distances, which results in quadratic complexity of the related algorithms. The existing CVIs cannot handle large data sets properly and need to be revisited to take into account the ever-increasing volume of data sets. Therefore, the design of parallel and distributed solutions to implement these indexes is required. To cope with this issue, the authors propose two parallel and distributed models for internal CVIs, namely for the Silhouette and Dunn indexes, using the MapReduce framework under Hadoop. The proposed models, termed MR_Silhouette and MR_Dunn, have been tested both on evaluating clustering results and on identifying the optimal number of clusters. The results of the experimental study are very promising and show that the proposed parallel and distributed models achieve the expected tasks successfully.
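For reference, a serial Dunn index looks like the sketch below; the full pairwise-distance matrix is exactly the O(n²) bottleneck that the MapReduce formulation distributes. The data are synthetic.

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index: minimum inter-cluster distance divided by the
    maximum intra-cluster diameter (higher is better).  Building the
    full pairwise-distance matrix makes this O(n^2) in both time and
    memory, which is what limits serial CVIs on large data sets."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    ids = np.unique(labels)
    min_inter = min(D[np.ix_(labels == a, labels == b)].min()
                    for a in ids for b in ids if a < b)
    max_diam = max(D[np.ix_(labels == c, labels == c)].max() for c in ids)
    return min_inter / max_diam

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (15, 2)), rng.normal(5, 0.1, (15, 2))])
labels = np.array([0] * 15 + [1] * 15)
score = dunn_index(X, labels)
```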


2013 ◽  
Vol 3 (4) ◽  
pp. 1-14 ◽  
Author(s):  
S. Sampath ◽  
B. Ramya

Cluster analysis is a branch of data mining that plays a vital role in bringing out hidden information in databases. Clustering algorithms help medical researchers identify natural subgroups in a data set. Different types of clustering algorithms are available in the literature, the most popular among them being k-means. Even though k-means clustering is widely used, its application requires knowledge of the number of clusters present in the given data set. Several solutions are available in the literature to overcome this limitation. The k-means method creates a disjoint and exhaustive partition of the data set. However, in some situations one can come across objects that belong to more than one cluster. In this paper, a clustering algorithm is proposed that produces rough clusters automatically, without requiring the user to specify the number of clusters to be produced. The efficiency of the algorithm in detecting the number of clusters present in the data set has been studied with the help of some real-life data sets. Further, a nonparametric statistical analysis of the results of the experimental study has been carried out to assess the efficiency of the proposed algorithm in automatically detecting the number of clusters, using a rough version of the Davies-Bouldin index.
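The standard (non-rough) Davies-Bouldin index can itself be used to pick the number of clusters, as in the sketch below; the paper's rough-set variant is not implemented in common libraries, so this only illustrates the selection principle on made-up data. Lower is better.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(3)
# Two synthetic clusters; the "right" k should minimize the index
X = np.vstack([rng.normal(c, 0.3, (25, 2)) for c in (0, 5)])

scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)
best_k = min(scores, key=scores.get)   # Davies-Bouldin: smaller is better
```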


2016 ◽  
Vol 25 (06) ◽  
pp. 1650031 ◽  
Author(s):  
Georgios Drakopoulos ◽  
Panagiotis Gourgaris ◽  
Andreas Kanavos ◽  
Christos Makris

k-Means is among the most significant clustering algorithms for vectors chosen from an underlying space S. Its applications span a broad range of fields including machine learning, image and signal processing, and Web mining. Since the introduction of k-Means, two of its major design parameters have remained open to research. The first is the number of clusters to be formed and the second is the set of initial vectors. The latter is also inherently related to selecting a density measure for S. This article presents a two-step framework for estimating both parameters. First, the underlying vector space is represented as a fuzzy graph. Afterwards, two algorithms for partitioning a fuzzy graph into non-overlapping communities, namely Fuzzy Walktrap and Fuzzy Newman-Girvan, are executed. The former is a low-complexity evolving heuristic, whereas the latter is deterministic and combines a graph communication metric with an exhaustive search principle. Once communities are discovered, their number is taken as an estimate of the true number of clusters. The initial centroids or seeds are subsequently selected based on the density of S. The proposed framework is modular, thus allowing more initialization schemes to be derived. The secondary contributions of this article are HI, a similarity metric for vectors with numerical and categorical entries together with an assessment of its stochastic behavior, and TD, a metric for assessing cluster confusion. The aforementioned framework has been implemented mainly in C# and partially in C++, and its performance in terms of efficiency, accuracy, and cluster confusion was experimentally assessed. Post-processing results conducted with MATLAB indicate that the evolving community discovery algorithm approaches the performance of its deterministic counterpart with considerably less complexity.
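The overall pipeline, community detection on a similarity graph to estimate k, then k-means seeded with that estimate, can be sketched as below. Fuzzy Walktrap and Fuzzy Newman-Girvan are not available off the shelf, so NetworkX's greedy modularity communities stand in, and the similarity threshold is an arbitrary illustrative choice.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Two well-separated synthetic clusters of vectors
X = np.vstack([rng.normal(c, 0.2, (20, 2)) for c in (0, 4)])

# Step 1: represent the vectors as a graph (the paper uses fuzzy edge
# weights; here a hard distance threshold stands in) and detect
# communities; their count estimates the number of clusters.
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
G = nx.Graph((i, j) for i in range(len(X)) for j in range(i + 1, len(X))
             if D[i, j] < 2.0)
communities = greedy_modularity_communities(G)
k = len(communities)

# Step 2: run k-means with the estimated k
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
```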


2010 ◽  
Vol 7 (3) ◽  
Author(s):  
Wim De Mulder ◽  
Martin Kuiper ◽  
René Boel

Summary Clustering is an important approach in the analysis of biological data, and often a first step to identify interesting patterns of coexpression in gene expression data. Because of the high complexity and diversity of gene expression data, many genes cannot be easily assigned to a cluster, but even if the dissimilarity of these genes with all other gene groups is large, they will eventually be forced to become a member of a cluster. In this paper we show how to detect such elements, called unstable elements. We have developed an approach for iterative clustering algorithms in which unstable elements are deleted, making the iterative algorithm less dependent on the initial centers. Although the approach is unsupervised, it is less likely that the clusters into which the reduced data set is subdivided contain false positives. This yields a more differentiated approach for biological data, since the cluster analysis is divided into two parts: the pruned data set is divided into highly consistent clusters in an unsupervised way, while the removed, unstable elements, for which no meaningful cluster exists in unsupervised terms, can be assigned a cluster using biological knowledge and information about the likelihood of cluster membership. We illustrate our framework on both an artificial and a real biological data set.
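One simple proxy for such "unstable elements" is membership ambiguity: a point whose distances to its nearest and second-nearest centroids are almost equal sits between clusters. The paper's definition is tied to the behaviour of the iterative algorithm itself, so the sketch below (with a made-up 0.8 cutoff) only illustrates the intuition.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Two tight clusters plus one point halfway between them
X = np.vstack([rng.normal(0, 0.1, (15, 2)),
               rng.normal(4, 0.1, (15, 2)),
               [[2.0, 2.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
d = km.transform(X)          # distance from each point to each centroid
d.sort(axis=1)
# Ratio near 1 means the point is nearly equidistant from two
# centroids, i.e. its cluster membership is not trustworthy.
ambiguity = d[:, 0] / d[:, 1]
unstable = np.where(ambiguity > 0.8)[0]
```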


2019 ◽  
Vol 8 (3) ◽  
pp. 8070-8074 ◽  

Data quality is an important aspect of any data mining or statistical task. The presence of missing values in a dataset affects its quality. A missing value indicates that an event did not happen or that a value does not exist. Data mining algorithms are not robust to incomplete data. Imputation of missing values is necessary to improve data quality for data mining and statistical analysis. Existing methods such as Expectation Maximization Imputation (EMI) and A Framework for Imputing Missing values Using co-appearance, correlation and Similarity analysis (FIMUS) use the whole dataset to impute missing values. In such cases, the accuracy of imputation may be affected by the influence of irrelevant records. This can be controlled by considering only locally similar records when imputing missing values. Local similarity imputation can be done through clustering algorithms such as k-means. The efficiency of k-means clustering depends on the number of clusters, which must be defined by the user. To increase clustering efficiency, a distinctive value is first imputed in place of the missing ones, and this imputed dataset is given to a stacked autoencoder for dimensionality reduction, which also improves clustering efficiency. The initial number of clusters for the k-means algorithm is determined using fast clustering. Due to the initial imputation, some irrelevant records may be partitioned into a cluster. When these records are used for imputing missing values, the accuracy of imputation decreases. In the proposed algorithm, local similarity imputation uses only the top k nearest neighbours within the cluster to impute missing values. The performance of the proposed algorithm is evaluated based on Root-Mean-Squared Error (RMSE) and Index of Agreement (d2). University of California Irvine datasets have been used for analyzing the performance of the proposed algorithm.
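The core k-nearest-neighbour imputation step can be sketched as follows. The proposed method restricts neighbours to the record's k-means cluster after autoencoder reduction; in this toy version the whole (tiny, made-up) array stands in for one such cluster.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill each NaN with the mean of that column over the k nearest
    complete rows (Euclidean distance over the observed columns).
    In the proposed algorithm the neighbour search is confined to the
    record's own cluster, which is what suppresses irrelevant records."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if miss.any():
            d = np.linalg.norm(complete[:, ~miss] - row[~miss], axis=1)
            nbrs = complete[np.argsort(d)[:k]]
            X[i, miss] = nbrs[:, miss].mean(axis=0)
    return X

X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, np.nan], [5.0, 6.0]])
filled = knn_impute(X, k=3)
```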


Author(s):  
Amber Srivastava ◽  
Mayank Baranwal ◽  
Srinivasa Salapaka

Typically, clustering algorithms provide solutions with a prespecified number of clusters. The lack of a priori knowledge of the true number of underlying clusters in a dataset makes it important to have a metric for comparing clustering solutions with different numbers of clusters. This article quantifies a notion of persistence of clustering solutions that enables such comparisons. Persistence relates to the range of data-resolution scales over which a clustering solution persists; it is quantified in terms of the maximum over the two-norms of all the associated cluster-covariance matrices. Thus, we associate a persistence value with each element in a set of clustering solutions with different numbers of clusters. We show that for datasets where natural clusters are known a priori, the clustering solutions that identify the natural clusters are the most persistent; in this way, the notion can be used to identify solutions with the true number of clusters. Detailed experiments on a variety of standard and synthetic datasets demonstrate that the proposed persistence-based indicator outperforms existing approaches, such as the gap-statistic method, X-means, G-means, PG-means, dip-means and the information-theoretic method, in accurately identifying clustering solutions with the true number of clusters. Interestingly, our method can be explained in terms of the phase-transition phenomenon in the deterministic annealing algorithm, where the number of distinct cluster centers changes (bifurcates) with respect to an annealing parameter.
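The building block of the indicator, the maximum two-norm over cluster-covariance matrices, is straightforward to compute, as sketched below on made-up data; this is only that quantity, not the full resolution-scale analysis of the article. Merging two natural clusters inflates the largest covariance norm, which is what makes under-clustered solutions stand out.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
# Three synthetic natural clusters
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 3, 6)])

def max_cov_two_norm(X, labels):
    """Maximum over clusters of the covariance-matrix two-norm
    (its largest eigenvalue), the quantity the persistence
    indicator is built from."""
    return max(np.linalg.norm(np.cov(X[labels == c].T), 2)
               for c in np.unique(labels))

vals = {k: max_cov_two_norm(
            X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
        for k in (2, 3, 4)}
```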


Genetics ◽  
1996 ◽  
Vol 143 (1) ◽  
pp. 589-602 ◽  
Author(s):  
Peter J E Goss ◽  
R C Lewontin

Abstract Regions of differing constraint, mutation rate or recombination along a sequence of DNA or amino acids lead to a nonuniform distribution of polymorphism within species or fixed differences between species. The power of five tests to reject the null hypothesis of a uniform distribution is studied for four classes of alternate hypothesis. The tests explored are the variance of interval lengths; a modified variance test, which includes covariance between neighboring intervals; the length of the longest interval; the length of the shortest third-order interval; and a composite test. Although there is no uniformly most powerful test over the range of alternate hypotheses tested, the variance and modified variance tests usually have the highest power. Therefore, we recommend that one of these two tests be used to test departure from uniformity in all circumstances. Tables of critical values for the variance and modified variance tests are given. The critical values depend both on the number of events and the number of positions in the sequence. A computer program is available on request that calculates both the critical values for a specified number of events and number of positions as well as the significance level of a given data set.
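The first of these statistics, the variance of interval lengths, can be sketched as follows; significance is then judged against the paper's tables of critical values, which depend on the number of events and of positions. The event positions below are made up.

```python
import numpy as np

def interval_variance(positions, length):
    """Variance of the lengths of the intervals between consecutive
    events along a sequence, including the two boundary intervals.
    Under the null hypothesis of a uniform distribution this variance
    is small; clumped events inflate it."""
    pts = np.sort(positions)
    intervals = np.diff(np.concatenate(([0], pts, [length])))
    return intervals.var()

# Evenly spaced events give zero variance of interval lengths
even = interval_variance([20, 40, 60, 80], 100)
# The same number of clumped events gives a much larger variance
clumped = interval_variance([1, 2, 3, 4], 100)
```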


Genes ◽  
2020 ◽  
Vol 12 (1) ◽  
pp. 25
Author(s):  
He-Gang Chen ◽  
Xiong-Hui Zhou

Drug repurposing/repositioning, which aims to find novel indications for existing drugs, helps reduce the time and cost of drug development. Over the past decade, gene expression profiles of drug-stimulated samples have been successfully used in drug repurposing. However, most existing methods neglect gene modules and the interactions among them, although cross-talk among pathways is common in drug response. It is essential to develop a method that utilizes this cross-talk information to predict reliable candidate associations. In this study, we developed MNBDR (Module Network Based Drug Repositioning), a novel module-network-based method for screening drugs. It integrates protein–protein interactions and human gene expression profiles to predict drug candidates for diseases. Specifically, MNBDR mines dense modules from a protein–protein interaction (PPI) network and constructs a module network to reveal cross-talk among modules. Then, using this module network together with existing gene expression data sets of drug-stimulated and disease samples, we apply random walk algorithms to capture modules essential in disease development and propose a new indicator to screen potential drugs for a given disease. Results showed that MNBDR provides better performance than popular methods. Moreover, functional analysis of the essential modules in the network indicates that our method can reveal biological mechanisms in drug response.
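A random walk over a module network can be sketched with personalized PageRank, a standard random-walk-with-restart formulation; MNBDR's exact scoring may differ. The toy module network and the choice of seed module below are entirely hypothetical.

```python
import networkx as nx

# Toy module network: modules A-E with cross-talk edges; a random walk
# restarting from a disease-associated module ranks the other modules
# by how reachable they are from it.
G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "C")])
seeds = {"A": 1.0}   # hypothetical disease-essential module

# Personalized PageRank = random walk with restart to the seed modules
scores = nx.pagerank(G, alpha=0.85, personalization=seeds)
ranked = sorted(scores, key=scores.get, reverse=True)
```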

