Block-Constraint Laplacian-Regularized Low-Rank Representation and Its Application for Cancer Sample Clustering Based on Integrated TCGA Data

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Juan Wang ◽  
Jin-Xing Liu ◽  
Chun-Hou Zheng ◽  
Cong-Hai Lu ◽  
Ling-Yun Dai ◽  
...  

Low-Rank Representation (LRR) is a powerful subspace clustering method because it successfully learns the low-dimensional subspace structure of data. With breakthroughs in omics technology, many LRR-based methods have been proposed and applied to cancer clustering based on gene expression data. Moreover, studies have shown that, besides gene expression data, other genomic data in TCGA also contain important information for cancer research. These genomic data can therefore be integrated as a comprehensive feature source for cancer clustering, and establishing an effective clustering model for the comprehensive analysis of integrated TCGA data has become a key issue. In this paper, we extend the traditional LRR method and propose a novel method named Block-constraint Laplacian-Regularized Low-Rank Representation (BLLRR) to model multigenome data for cancer sample clustering. The proposed method aims to extract richer subspace structure information from multiple genomic data types to improve the accuracy of cancer sample clustering. To account for the heterogeneity of different genomic data, we introduce the block-constraint idea into our method: in the BLLRR decomposition, we treat each genomic data type as a data block and impose different constraints on different data blocks. In addition, a graph Laplacian is introduced to better learn the topological structure of the data by preserving local geometric information. The experiments demonstrate that BLLRR can effectively analyze integrated TCGA data and extract more subspace structure information from multigenome data. It is a reliable and efficient clustering algorithm for cancer sample clustering.
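
The abstract above does not state the BLLRR objective, so the following is only a sketch of the generic graph-Laplacian smoothness term that Laplacian-regularized models commonly use. It assumes a binary k-nearest-neighbor graph and the unnormalized Laplacian L = D - W; the function names and the choice of k are illustrative, not taken from the paper.

```python
import numpy as np

def knn_graph_laplacian(X, k=2):
    """Unnormalized Laplacian L = D - W of a symmetric k-nearest-neighbor graph.

    X has shape (n_features, n_samples); each column is one sample.
    The binary kNN weighting is an assumption made for this sketch.
    """
    n = X.shape[1]
    # Pairwise squared Euclidean distances between columns.
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(dist[i])[:k]
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)  # make the graph undirected
    D = np.diag(W.sum(axis=1))
    return D - W

def laplacian_penalty(Z, L):
    """tr(Z L Z^T), which equals 0.5 * sum_ij W_ij * ||z_i - z_j||^2.

    The value is small when samples that are neighbors in the graph
    have similar representation columns z_i in Z.
    """
    return float(np.trace(Z @ L @ Z.T))
```

Adding a multiple of `laplacian_penalty` to a representation objective is one way to encourage neighboring samples to keep similar representations, which is the "local geometric information" idea the abstract refers to.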

2019 ◽  
Vol 20 (S22) ◽  
Author(s):  
Juan Wang ◽  
Cong-Hai Lu ◽  
Jin-Xing Liu ◽  
Ling-Yun Dai ◽  
Xiang-Zhen Kong

Abstract. Background: Identifying different types of cancer based on gene expression data has become a hotspot in bioinformatics research. Clustering gene expression data from multiple cancers into their respective classes is an important solution. However, the high dimensionality and small sample size of gene expression data, together with noise in the data, make data mining and research difficult. Although there are many effective and feasible methods to deal with this problem, these methods may still have shortcomings. Results: In this paper, we propose the graph regularized low-rank representation under symmetric and sparse constraints (sgLRR) method, in which we introduce graph regularization based on manifold learning and symmetric sparse constraints into the traditional low-rank representation (LRR). In sgLRR, the symmetric and sparse constraints alleviate the effect of raw data noise on the low-rank representation. Further, sgLRR preserves the important intrinsic local geometrical structures of the raw data by introducing graph regularization. We apply this method to cluster multi-cancer samples based on gene expression data, which improves the clustering quality. First, the gene expression data are decomposed by the sgLRR method, yielding a lowest-rank representation matrix that is symmetric and sparse. Then, an affinity matrix is constructed, and a spectral clustering algorithm, normalized cuts (Ncuts), is applied to it to complete the multi-cancer sample clustering. Conclusions: A series of comparative experiments demonstrates that the sgLRR method, based on low-rank representation, has a great advantage and remarkable performance in the clustering of multi-cancer samples.
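
The pipeline described above (representation matrix, then affinity matrix, then spectral clustering) can be sketched as follows. This is not sgLRR: it swaps in the noise-free LRR problem, whose minimizer has the closed form Z = V V^T from the skinny SVD X = U S V^T, and it uses a small normalized spectral clustering routine with k-means in place of Ncuts. The affinity |Z| + |Z^T|, the function names, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def lrr_noise_free(X, tol=1e-8):
    """Minimizer of ||Z||_* subject to X = X Z (no noise term): Z = V V^T."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))  # numerical rank
    V = Vt[:r].T
    return V @ V.T

def spectral_clusters(A, k, n_iter=20):
    """Normalized spectral clustering with a small deterministic k-means."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    E = vecs[:, :k]
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    # Farthest-point initialization keeps the example deterministic.
    centers = [E[0]]
    for _ in range(1, k):
        dists = np.min([np.sum((E - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(E[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((E[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = E[labels == j].mean(axis=0)
    return labels

# Synthetic data: six samples from each of two 2-dimensional subspaces of R^10.
rng = np.random.default_rng(0)
B1, B2 = rng.normal(size=(10, 2)), rng.normal(size=(10, 2))
X = np.hstack([B1 @ rng.normal(size=(2, 6)), B2 @ rng.normal(size=(2, 6))])

Z = lrr_noise_free(X)
A = np.abs(Z) + np.abs(Z.T)  # symmetric, nonnegative affinity
labels = spectral_clusters(A, k=2)
```

For independent subspaces and noise-free data, Z is block-diagonal up to rounding, so the affinity graph separates the two groups. Real expression data are noisy, which is where the constraints and regularizers described in the abstract come in.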


2017 ◽  
Vol 11 (3) ◽  
pp. 224-233
Author(s):  
Xiaoyun Chen ◽  
Mengzhen Liao ◽  
Xianbao Ye

Gene expression data are high-dimensional with small sample sizes. Conventional clustering techniques achieve lower accuracy on gene expression data because of this high dimensionality. Because some subspace segmentation approaches are better suited to high-dimensional spaces, this work proposes three new subspace clustering models for gene expression data sets. The proposed projection subspace clustering models are projection sparse subspace clustering, projection low-rank representation subspace clustering, and projection least-squares regression subspace clustering, which combine a projection technique with sparse subspace clustering, low-rank representation, and least-squares regression, respectively. To compute inner products in the high-dimensional space, a kernel function is applied to the projection subspace clustering models. Experimental results on six gene expression data sets show that these models are effective.
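
To illustrate how a kernel function replaces explicit inner products, here is a minimal sketch of kernelized least-squares regression self-representation. It omits the projection step of the proposed models, which the abstract does not specify. Plain least-squares regression minimizes ||X - XZ||_F^2 + lam * ||Z||_F^2 and has the solution Z = (X^T X + lam I)^{-1} X^T X; replacing the Gram matrix X^T X with a kernel matrix K gives Z = (K + lam I)^{-1} K. The RBF kernel, `gamma`, `lam`, and the function names are assumptions.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) for columns x_i of X."""
    sq = np.sum(X ** 2, axis=0)
    # Clip tiny negative values caused by floating-point rounding.
    dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    return np.exp(-gamma * dist)

def kernel_lsr(K, lam=0.1):
    """Solve (K + lam * I) Z = K, the ridge-regularized self-representation.

    Only the Gram matrix K is needed, so the high-dimensional feature
    map is never formed explicitly.
    """
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), K)
```

Passing `K = X.T @ X` recovers the linear least-squares regression solution, which makes the sketch easy to check against the non-kernel formula.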


Symmetry ◽  
2020 ◽  
Vol 12 (1) ◽  
pp. 154 ◽  
Author(s):  
Ho Sun Shon ◽  
Erdenebileg Batbaatar ◽  
Kyoung Ok Kim ◽  
Eun Jong Cha ◽  
Kyung-Ah Kim

Recently, large-scale bioinformatics and genomic data have been generated using advanced biotechnology methods, thus increasing the importance of analyzing such data. Numerous data mining methods have been developed to process genomic data in the field of bioinformatics. We extracted significant genes for the prognosis prediction of 1157 patients using gene expression data from patients with kidney cancer. We then proposed an end-to-end, cost-sensitive hybrid deep learning (COST-HDL) approach with a cost-sensitive loss function for classification tasks on imbalanced kidney cancer data. To address the data imbalance problem, we combined a deep symmetric autoencoder (whose decoder mirrors the encoder's layer structure), trained with a reconstruction loss for non-linear feature extraction, with a neural network trained with a balanced classification loss for prognosis prediction. Combined clinical data from patients with kidney cancer and gene data were used to determine the optimal classification model and to estimate classification accuracy by sample type, primary diagnosis, tumor stage, and vital status as risk factors representing the state of patients. Experimental results showed that the COST-HDL approach was more efficient with gene expression data for kidney cancer prognosis than other conventional machine learning and data mining techniques. These results could be applied to extract features from gene biomarkers for the prognosis prediction of kidney cancer and for its prevention and early diagnosis.
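
The abstract does not define the cost-sensitive loss, so the following is a generic sketch of one common choice: cross-entropy with class weights inversely proportional to class frequency. The weighting rule and the function names are assumptions, not the COST-HDL formulation.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the total weight.

    probs[i][c] is the predicted probability of class c for sample i.
    """
    total = sum(weights[y] for y in labels)
    loss = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return loss / total

# Three majority-class samples (0) and one minority-class sample (1).
labels = [0, 0, 0, 1]
weights = balanced_class_weights(labels)
```

With these weights, an error on the single minority sample costs as much as errors on all three majority samples combined, which discourages a classifier from ignoring the rare class.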


Author(s):  
Mélina Gallopin ◽  
Gilles Celeux ◽  
Florence Jaffrézic ◽  
Andrea Rau

Abstract: In co-expression analyses of gene expression data, it is often of interest to interpret clusters of co-expressed genes with respect to a set of external information, such as a potentially incomplete list of functional properties for which a subset of genes may be annotated. Based on the framework of finite mixture models, we propose a model selection criterion that takes into account such external gene annotations, providing an efficient tool for selecting a relevant number of clusters and clustering model. This criterion, called the integrated completed annotated likelihood (ICAL), is defined by adding an entropy term to a penalized likelihood to measure the concordance between a clustering partition and the external annotation information. The ICAL leads to the choice of a model that is more easily interpretable with respect to the known functional gene annotations. We illustrate the interest of this model selection criterion in conjunction with Gaussian mixture models on simulated gene expression data and on real RNA-seq data.


2015 ◽  
Vol 11 (7) ◽  
pp. 1876-1886 ◽  
Author(s):  
Wei Liu ◽  
Qiuyu Wang ◽  
Jianmei Zhao ◽  
Chunlong Zhang ◽  
Yuejuan Liu ◽  
...  

Accurately predicting the risk of cancer relapse or death is important for clinical utility.


2021 ◽  
Vol 2021 ◽  
pp. 1-16
Author(s):  
Jian Liu ◽  
Yuhu Cheng ◽  
Xuesong Wang ◽  
Shuguang Ge

Clustering of tumor samples can help identify cancer types and discover new cancer subtypes, which is essential for effective cancer treatment. Although many traditional clustering methods have been proposed for tumor sample clustering, advanced algorithms with better performance are still needed. Low-rank subspace clustering has become a popular approach in recent years. In this paper, we propose a novel one-step robust low-rank subspace segmentation method (ORLRS) for clustering tumor samples. For a gene expression data set, we seek its lowest-rank representation matrix and the noise matrix. By imposing a discrete constraint on the low-rank matrix, ORLRS learns the cluster indicators of the subspaces directly, without performing spectral clustering; that is, it performs the clustering task in one step. To improve the robustness of the method, a capped norm is adopted to remove extreme outliers in the noise matrix. Furthermore, we develop an efficient algorithm to solve the ORLRS problem. Experiments on several tumor gene expression data sets demonstrate the effectiveness of ORLRS.
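
The abstract does not give the exact capped-norm formulation, so this is a sketch of one common variant: a column-wise capped l2 norm, sum_i min(||e_i||_2, eps), applied to a noise matrix E whose columns are per-sample residuals. The threshold `eps` and the function names are assumptions.

```python
import numpy as np

def capped_l2_norm(E, eps):
    """Sum over columns of min(||e_i||_2, eps)."""
    col_norms = np.linalg.norm(E, axis=0)
    return float(np.sum(np.minimum(col_norms, eps)))

def outlier_mask(E, eps):
    """True for columns whose residual norm reaches the cap.

    Such columns contribute only the constant eps to the penalty, so
    making them larger no longer changes the objective.
    """
    return np.linalg.norm(E, axis=0) >= eps

# Column norms are 5, 0, and 50; with eps = 10 the last column is capped.
E = np.array([[3.0, 0.0, 30.0],
              [4.0, 0.0, 40.0]])
penalty = capped_l2_norm(E, eps=10.0)
mask = outlier_mask(E, eps=10.0)
```

Because the cap bounds the contribution of any single sample, an extreme outlier cannot dominate the fit the way it can under an uncapped norm.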

