Clustering Methods for Microarray Data Sets

Author(s):  
Giuseppe Agapito ◽  
Giuseppe Fedele
2015 ◽  
Vol 76 (1) ◽  
Author(s):  
Ang Jun Chin ◽  
Andri Mirzal ◽  
Habibollah Haron

Gene expression profile is eminent for its broad applications and achievements in disease discovery and analysis, especially in cancer research. Spectral clustering is robust to irrelevant features which are appropriated for gene expression analysis. However, previous works show that performance comparison with other clustering methods is limited and only a few microarray data sets were analyzed in each study. In this study, we demonstrate the use of spectral clustering in identifying cancer types or subtypes from microarray gene expression profiling. Spectral clustering was applied to eleven microarray data sets and its clustering performances were compared with the results in the literature. Based on the result, overall the spectral clustering slightly outperformed the corresponding results in the literature. The spectral clustering can also offer more stable clustering performances as it has smaller standard deviation value. Moreover, out of eleven data sets the spectral clustering outperformed the corresponding methods in the literature for six data sets. So, it can be stated that the spectral clustering is a promising method in identifying the cancer types or subtypes for microarray gene expression data sets.


2020 ◽  
Vol 23 (8) ◽  
pp. 805-813
Author(s):  
Ai Jiang ◽  
Peng Xu ◽  
Zhenda Zhao ◽  
Qizhao Tan ◽  
Shang Sun ◽  
...  

Background: Osteoarthritis (OA) is a joint disease that leads to a high disability rate and a low quality of life. With the development of modern molecular biology techniques, some key genes and diagnostic markers have been reported. However, the etiology and pathogenesis of OA are still unknown. Objective: To develop a gene signature in OA. Method: In this study, five microarray data sets were integrated to conduct a comprehensive network and pathway analysis of the biological functions of OA related genes, which can provide valuable information and further explore the etiology and pathogenesis of OA. Results and Discussion: Differential expression analysis identified 180 genes with significantly expressed expression in OA. Functional enrichment analysis showed that the up-regulated genes were associated with rheumatoid arthritis (p < 0.01). Down-regulated genes regulate the biological processes of negative regulation of kinase activity and some signaling pathways such as MAPK signaling pathway (p < 0.001) and IL-17 signaling pathway (p < 0.001). In addition, the OA specific protein-protein interaction (PPI) network was constructed based on the differentially expressed genes. The analysis of network topological attributes showed that differentially upregulated VEGFA, MYC, ATF3 and JUN genes were hub genes of the network, which may influence the occurrence and development of OA through regulating cell cycle or apoptosis, and were potential biomarkers of OA. Finally, the support vector machine (SVM) method was used to establish the diagnosis model of OA, which not only had excellent predictive power in internal and external data sets (AUC > 0.9), but also had high predictive performance in different chip platforms (AUC > 0.9) and also had effective ability in blood samples (AUC > 0.8). Conclusion: The 4-genes diagnostic model may be of great help to the early diagnosis and prediction of OA.


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1573
Author(s):  
Loris Nanni ◽  
Giovanni Minchio ◽  
Sheryl Brahnam ◽  
Gianluca Maguolo ◽  
Alessandra Lumini

Traditionally, classifiers are trained to predict patterns within a feature space. The image classification system presented here trains classifiers to predict patterns within a vector space by combining the dissimilarity spaces generated by a large set of Siamese Neural Networks (SNNs). A set of centroids from the patterns in the training data sets is calculated with supervised k-means clustering. The centroids are used to generate the dissimilarity space via the Siamese networks. The vector space descriptors are extracted by projecting patterns onto the similarity spaces, and SVMs classify an image by its dissimilarity vector. The versatility of the proposed approach in image classification is demonstrated by evaluating the system on different types of images across two domains: two medical data sets and two animal audio data sets with vocalizations represented as images (spectrograms). Results show that the proposed system’s performance competes competitively against the best-performing methods in the literature, obtaining state-of-the-art performance on one of the medical data sets, and does so without ad-hoc optimization of the clustering methods on the tested data sets.


2008 ◽  
Vol 06 (02) ◽  
pp. 261-282 ◽  
Author(s):  
AO YUAN ◽  
WENQING HE

Clustering is a major tool for microarray gene expression data analysis. The existing clustering methods fall mainly into two categories: parametric and nonparametric. The parametric methods generally assume a mixture of parametric subdistributions. When the mixture distribution approximately fits the true data generating mechanism, the parametric methods perform well, but not so when there is nonnegligible deviation between them. On the other hand, the nonparametric methods, which usually do not make distributional assumptions, are robust but pay the price for efficiency loss. In an attempt to utilize the known mixture form to increase efficiency, and to free assumptions about the unknown subdistributions to enhance robustness, we propose a semiparametric method for clustering. The proposed approach possesses the form of parametric mixture, with no assumptions to the subdistributions. The subdistributions are estimated nonparametrically, with constraints just being imposed on the modes. An expectation-maximization (EM) algorithm along with a classification step is invoked to cluster the data, and a modified Bayesian information criterion (BIC) is employed to guide the determination of the optimal number of clusters. Simulation studies are conducted to assess the performance and the robustness of the proposed method. The results show that the proposed method yields reasonable partition of the data. As an illustration, the proposed method is applied to a real microarray data set to cluster genes.


2015 ◽  
Vol 17 (5) ◽  
pp. 719-732
Author(s):  
Dulakshi Santhusitha Kumari Karunasingha ◽  
Shie-Yui Liong

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset of data is to use it to build prediction models (of the form of approximating functional relationships) instead of using the entire large data set. Such smaller subsets of data are often required in exploratory analysis stages of studies that involve resource consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method, which requires only a single parameter to be specified, yet it is shown to be as effective as the SCM. A method to find suitable values for the parameter is also proposed. Due to having only a single parameter, using the proposed clustering method is shown to be orders of magnitudes more efficient than using SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied for data extraction are identified, and the proposed method is shown to be a solution for them.


2018 ◽  
Vol 30 (4) ◽  
pp. 1080-1103 ◽  
Author(s):  
Kun Zhan ◽  
Jinhui Shi ◽  
Jing Wang ◽  
Haibo Wang ◽  
Yuange Xie

Most existing multiview clustering methods require that graph matrices in different views are computed beforehand and that each graph is obtained independently. However, this requirement ignores the correlation between multiple views. In this letter, we tackle the problem of multiview clustering by jointly optimizing the graph matrix to make full use of the data correlation between views. With the interview correlation, a concept factorization–based multiview clustering method is developed for data integration, and the adaptive method correlates the affinity weights of all views. This method differs from nonnegative matrix factorization–based clustering methods in that it can be applicable to data sets containing negative values. Experiments are conducted to demonstrate the effectiveness of the proposed method in comparison with state-of-the-art approaches in terms of accuracy, normalized mutual information, and purity.


2015 ◽  
Vol 135 (10) ◽  
pp. 2455-2463 ◽  
Author(s):  
Lanlan Yin ◽  
Sergio G. Coelho ◽  
Julio C. Valencia ◽  
Dominik Ebsen ◽  
Andre Mahns ◽  
...  

2021 ◽  
Author(s):  
Sebastiaan Valkiers ◽  
Max Van Houcke ◽  
Kris Laukens ◽  
Pieter Meysman

The T-cell receptor (TCR) determines the specificity of a T-cell towards an epitope. As of yet, the rules for antigen recognition remain largely undetermined. Current methods for grouping TCRs according to their epitope specificity remain limited in performance and scalability. Multiple methodologies have been developed, but all of them fail to efficiently cluster large data sets exceeding 1 million sequences. To account for this limitation, we developed clusTCR, a rapid TCR clustering alternative that efficiently scales up to millions of CDR3 amino acid sequences. Benchmarking comparisons revealed similar accuracy of clusTCR with other TCR clustering methods. clusTCR offers a drastic improvement in clustering speed, which allows clustering of millions of TCR sequences in just a few minutes through efficient similarity searching and sequence hashing.clusTCR was written in Python 3. It is available as an anaconda package (https://anaconda.org/svalkiers/clustcr) and on github (https://github.com/svalkiers/clusTCR).


Sign in / Sign up

Export Citation Format

Share Document