CrossICC: iterative consensus clustering of cross-platform gene expression data without adjusting batch effect

2019 ◽  
Vol 21 (5) ◽  
pp. 1818-1824 ◽  
Author(s):  
Qi Zhao ◽  
Yu Sun ◽  
Zekun Liu ◽  
Hongwan Zhang ◽  
Xingyang Li ◽  
...  

Abstract: Unsupervised clustering of high-throughput gene expression data is widely adopted for cancer subtyping. However, cancer subtypes derived from a single dataset are usually not applicable across multiple datasets from different platforms. Merging different datasets is necessary to determine accurate and applicable cancer subtypes but remains challenging because of batch effects. CrossICC is an R package designed for the unsupervised clustering of gene expression data from multiple datasets/platforms without requiring batch effect adjustment. CrossICC uses an iterative strategy to derive the optimal gene signature and cluster number from a consensus similarity matrix generated by consensus clustering. The package also provides functions to visualize the identified subtypes and evaluate subtyping performance. We expect that CrossICC can be used to discover robust cancer subtypes with significant translational implications for the personalized care of cancer patients. Availability and Implementation: The package is implemented in R and available at GitHub (https://github.com/bioinformatist/CrossICC) and Bioconductor (http://bioconductor.org/packages/release/bioc/html/CrossICC.html) under the GPL v3 License.
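To make the term "consensus similarity matrix" concrete, the following is a minimal Python sketch of how such a matrix is commonly built from repeated clustering runs on resampled data. It is not CrossICC's implementation (the package itself is written in R); the function name `consensus_matrix` and the input layout (one dict of sample index to cluster label per run) are assumptions made for illustration only.

```python
from itertools import combinations


def consensus_matrix(runs, n_samples):
    """Fraction of runs in which each pair of samples shares a cluster.

    `runs` is a list of dicts mapping sample index -> cluster label.
    A sample missing from a dict was not drawn in that resampling run.
    (Hypothetical input layout, chosen for illustration.)
    """
    together = [[0] * n_samples for _ in range(n_samples)]
    sampled = [[0] * n_samples for _ in range(n_samples)]
    for labels in runs:
        # Only pairs drawn together in this run contribute to the counts.
        for i, j in combinations(sorted(labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    matrix = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        matrix[i][i] = 1.0
        for j in range(n_samples):
            if i != j and sampled[i][j]:
                matrix[i][j] = together[i][j] / sampled[i][j]
    return matrix


# Three resampled runs over four samples; sample 3 is absent from run 2.
runs = [
    {0: "a", 1: "a", 2: "b", 3: "b"},
    {0: "x", 1: "x", 2: "y"},
    {0: "p", 1: "q", 2: "q", 3: "q"},
]
m = consensus_matrix(runs, n_samples=4)
```

Cluster labels only need to be comparable within a run, so runs may use different label names. Entries near 0 or 1 indicate pairs that are consistently separated or consistently grouped.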

2012 ◽  
Vol 10 (05) ◽  
pp. 1250011
Author(s):  
NATALIA NOVOSELOVA ◽  
IGOR TOM

Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not consider the stability of the groupings produced by a clustering algorithm. Based on an approach that assesses the predictive power or stability of a partitioning, we propose a new cluster validation measure and a selection procedure to determine a suitable number of clusters. The validity measure is based on estimating the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme, or consensus clustering. In the proposed selection procedure, the stable clustering result is determined with reference to the validity measure under the null hypothesis of no cluster structure. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to several datasets. The proposed procedure produced an accurate and robust estimate of the number of clusters, in agreement with biological knowledge and gold standards of cluster quality.
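The selection idea described above can be sketched in Python as follows. The abstract does not give the formula for "clearness", so the score used here (the mean of |2c - 1| over off-diagonal consensus entries, which is 1 for a perfectly crisp matrix and 0 when every entry is 0.5) is a stand-in assumption, as are the function names `clearness` and `select_k`. The sketch only illustrates comparing observed consensus matrices against ones obtained from permuted data.

```python
def clearness(matrix):
    """Assumed stand-in score: mean |2c - 1| over off-diagonal entries.

    Requires at least two samples. This is NOT the published measure.
    """
    n = len(matrix)
    values = [
        abs(2.0 * matrix[i][j] - 1.0)
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(values) / len(values)


def select_k(observed, permuted):
    """Pick the k whose observed clearness exceeds the null by the widest gap.

    `observed` and `permuted` map a candidate cluster number k to the
    consensus matrix obtained on the original and on permuted data.
    """
    gaps = {k: clearness(observed[k]) - clearness(permuted[k]) for k in observed}
    best_k = max(gaps, key=gaps.get)
    return best_k, gaps


crisp = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
mixed = [
    [1, 0.75, 0.25, 0.25],
    [0.75, 1, 0.25, 0.25],
    [0.25, 0.25, 1, 0.75],
    [0.25, 0.25, 0.75, 1],
]
fuzzy = [[1 if i == j else 0.5 for j in range(4)] for i in range(4)]

best_k, gaps = select_k({2: crisp, 3: mixed}, {2: fuzzy, 3: fuzzy})
```

In this toy input the two-cluster consensus matrix is perfectly crisp while the permuted data yields a maximally ambiguous one, so k = 2 has the largest gap.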


IEEE Access ◽  
2019 ◽  
Vol 7 ◽  
pp. 22086-22095 ◽  
Author(s):  
Jing Xu ◽  
Peng Wu ◽  
Yuehui Chen ◽  
Qingfang Meng ◽  
Hussain Dawood ◽  
...  

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Lianxin Zhong ◽  
Qingfang Meng ◽  
Yuehui Chen

The correct classification of cancer subtypes is of great significance for the in-depth study of cancer pathogenesis and for delivering accurate treatment to cancer patients. In recent years, the classification of cancer subtypes using deep neural networks and gene expression data has become a hot topic. However, most classifiers are prone to overfitting and low classification accuracy when dealing with small-sample, high-dimensional biological data. In this paper, the Cascade Flexible Neural Forest (CFNForest) model is proposed for cancer subtype classification. CFNForest extends the traditional flexible neural tree (FNT) structure to an FNT Group Forest using a bagging ensemble strategy, and it can automatically generate the model’s structure and parameters. To deepen the FNT Group Forest without introducing new hyperparameters, a multilayer cascade framework is used, which transforms features between levels and improves the performance of the model. The proposed CFNForest model also improves operational efficiency and robustness through a sample selection mechanism between layers and by assigning different weights to the output of each layer. FNT Group Forests built on different feature sets are used to enrich the structural diversity of the model, which makes it more suitable for small-sample datasets. Experiments on RNA-seq gene expression data showed that CFNForest effectively improves the accuracy of cancer subtype classification and that the classification results are robust.
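The abstract mentions assigning different weights to the output of each cascade layer. A minimal Python sketch of one way such a weighted combination could work is given below. The normalised weighted average, the function names, and the example weights are all assumptions; the abstract does not specify how CFNForest computes or applies its layer weights.

```python
def combine_layer_outputs(layer_probs, layer_weights):
    """Weighted average of per-layer class-probability vectors.

    `layer_probs[l][c]` is layer l's probability for class c.
    Weights are normalised to sum to 1 (an assumed convention).
    """
    if len(layer_probs) != len(layer_weights):
        raise ValueError("one weight is required per layer")
    total = sum(layer_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = [0.0] * len(layer_probs[0])
    for probs, weight in zip(layer_probs, layer_weights):
        for c, p in enumerate(probs):
            combined[c] += weight * p / total
    return combined


def predict_class(layer_probs, layer_weights):
    """Index of the class with the highest combined probability."""
    combined = combine_layer_outputs(layer_probs, layer_weights)
    return max(range(len(combined)), key=combined.__getitem__)


# Three layers, two classes; deeper layers are trusted more (assumed weights).
layer_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
layer_weights = [1.0, 2.0, 3.0]
combined = combine_layer_outputs(layer_probs, layer_weights)
label = predict_class(layer_probs, layer_weights)
```

Here the first layer favours class 0, but the more heavily weighted later layers shift the combined prediction to class 1.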


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lianxin Zhong ◽  
Qingfang Meng ◽  
Yuehui Chen ◽  
Lei Du ◽  
Peng Wu

Abstract. Background: Correctly classifying cancer subtypes is of great significance for the in-depth study of cancer pathogenesis and for delivering personalized treatment to cancer patients. In recent years, the classification of cancer subtypes using deep neural networks and gene expression data has gradually become a research hotspot. However, most classifiers are prone to overfitting and low classification accuracy when dealing with small-sample, high-dimensional biological data. Results: In this paper, a laminar augmented cascading flexible neural forest (LACFNForest) model is proposed for the classification of cancer subtypes. The model is a cascading flexible neural forest that uses the deep flexible neural forest (DFNForest) as its base classifier. A hierarchical broadening ensemble method is proposed, which ensures the robustness of classification results and avoids wasting model structure and function as far as possible. We also introduce an output judgment mechanism at each layer of the forest to reduce the computational complexity of the model. The deep neural forest is extended to a densely connected deep neural forest to improve the prediction results. Experiments on RNA-seq gene expression data showed that LACFNForest performs better in the classification of cancer subtypes than conventional methods. Conclusion: The LACFNForest model effectively improves the accuracy of cancer subtype classification with good robustness. It provides a new approach to the ensemble learning of classifiers in terms of structural design.


2013 ◽  
Vol 31 (15_suppl) ◽  
pp. 1013-1013 ◽  
Author(s):  
Rene Natowicz ◽  
Tingting Jiang ◽  
Weiwei Shi ◽  
Yuan Qi ◽  
Yann Delpech ◽  
...  

1013 Background: The goal of this study was to develop a method to quantify intratumor heterogeneity of cancers using gene expression data. We compared gene expression heterogeneity between different molecular subtypes of breast cancer and between basal-like cancers with or without pathologic complete response (pCR) to neoadjuvant chemotherapy. Methods: Affymetrix U133A gene expression data of 335 stage I-III breast cancers were analyzed. Molecular class was assigned using the PAM50 predictor. All patients received neoadjuvant chemotherapy. We measured tumor heterogeneity by the Gini index (GI), calculated individually for each case over the expression of all probe sets and over random subsets. The GI was used as a metric of inequality of gene expression distributions between cases. The higher the GI, the greater the inequality of the expression distribution. Results: Basal-like cancers (n=138) had greater heterogeneity than luminal cancers (n=197) (mean GI values 24.51 vs 23.05, p<0.001), and luminal B cancers (n=71) had greater heterogeneity than luminal A cancers (n=126) (24.49 vs 22.25, p<0.001). Among the basal-like cancers, those with pCR (n=44) had significantly higher heterogeneity than cancers with residual disease (RD, n=94) (26.10 vs 23.77, p<0.001). Significant differences in GI between cancer subtypes remained for as few as 2500 randomly selected probe sets. Conclusions: Breast cancer subtypes differ in intratumor gene expression heterogeneity. A greater degree of heterogeneity correlates with greater chemotherapy sensitivity. Importantly, among basal-like cancers, only the heterogeneity metric differed significantly between cases with pCR and RD; individual gene expression values and gene signatures did not.
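For readers unfamiliar with the Gini index, the sketch below computes the standard Gini coefficient for one case's expression values in Python. It assumes non-negative values and returns a number between 0 and 1. The abstract does not state which estimator or scaling the authors used (their reported means of roughly 22 to 26 are on a different scale), so the function name `gini_index` and the formula choice are assumptions.

```python
def gini_index(values):
    """Standard Gini coefficient of non-negative values (0 = perfectly equal).

    Uses the sorted-rank form: sum((2*i - n - 1) * x_i) / (n * sum(x)),
    with i = 1..n over values sorted in ascending order.
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("at least one value is required")
    if xs[0] < 0:
        raise ValueError("values must be non-negative")
    total = sum(xs)
    if total == 0:
        raise ValueError("values must not all be zero")
    weighted = sum((2 * rank - n - 1) * x for rank, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Toy expression profiles for two hypothetical cases.
uniform_case = [5.0, 5.0, 5.0, 5.0]   # every probe set equally expressed
skewed_case = [0.0, 0.0, 0.0, 1.0]    # expression concentrated in one probe set

gi_uniform = gini_index(uniform_case)
gi_skewed = gini_index(skewed_case)
```

A case whose expression is concentrated in few probe sets yields a higher index than one with evenly distributed expression, which is the sense of "inequality" used in the abstract.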


BMC Genomics ◽  
2006 ◽  
Vol 7 (1) ◽  
Author(s):  
Andrey Ptitsyn ◽  
Matthew Hulver ◽  
William Cefalu ◽  
David York ◽  
Steven R Smith
