Cluster Analysis in R With Big Data Applications

Author(s):  
Alicia Taylor Lamere

This chapter discusses several popular clustering functions and open-source software packages in R and assesses their feasibility for use on larger datasets. These include the kmeans() function, the pvclust package, and the DBSCAN (density-based spatial clustering of applications with noise) package, which implement K-means, hierarchical, and density-based clustering, respectively. Dimension reduction methods such as PCA (principal component analysis) and SVD (singular value decomposition), as well as the choice of distance measure, are explored as ways to improve the performance of hierarchical and model-based clustering methods on larger datasets. These methods are illustrated through an application to a dataset of RNA-sequencing expression data for cancer patients obtained from the Cancer Genome Atlas Kidney Clear Cell Carcinoma (TCGA-KIRC) data collection from The Cancer Imaging Archive (TCIA).
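The idea of reducing dimension before clustering can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not the chapter's code: the function names `top_principal_component` and `kmeans`, the toy data, and the choice of a single principal component are all assumptions made for illustration. The first principal component is approximated by power iteration, and Lloyd's K-means is then run on the one-dimensional projections.

```python
import math
import random


def top_principal_component(rows, iterations=200):
    """Approximate the first principal component by power iteration.

    Assumes the data are not constant and that the (arbitrary) starting
    vector is not orthogonal to the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Deterministic, non-symmetric starting vector (illustrative choice).
    vec = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project each centered row onto the component.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, scores


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm with squared Euclidean distance."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy data (hypothetical): two well-separated groups in three dimensions.
data = [
    [1.0, 2.0, 0.5], [1.5, 2.2, 0.4], [0.8, 1.7, 0.6],
    [8.0, 9.0, 0.5], [8.4, 9.3, 0.4], [7.7, 8.6, 0.6],
]
component, scores = top_principal_component(data)
labels, centers = kmeans([[s] for s in scores], k=2)
print(labels)
```

On real expression data one would normally keep several components and use an optimized library; the point here is only the order of operations: center, project, then cluster in the reduced space.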

Author(s):  
Mouhcine El Hassani ◽  
Noureddine Falih ◽  
Belaid Bouikhalene

Classification of information is a vague and difficult-to-explore area of research, hence the emergence of grouping techniques, often referred to as clustering. It is necessary to differentiate between unsupervised and supervised classification. Clustering methods are numerous. Data partitioning and hierarchization lead to their use in parametric or non-parametric form. Their use is also influenced by algorithms of a probabilistic nature during the partitioning of data. The choice of a method depends on the clustering result that is wanted. This work focuses on classification using the density-based spatial clustering of applications with noise (DBSCAN) and DENsity-based CLUstEring (DENCLUE) algorithms through an application written in C#. Using three databases, namely the Iris dataset, the Breast Cancer Wisconsin (Diagnostic) data set, and the Bank Marketing data set, we show experimentally that the choice of the initial data parameters is important: it can accelerate processing and minimize the number of iterations, reducing the execution time of the application.
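To make the role of the initial parameters concrete, here is a minimal DBSCAN sketch. It is plain Python, not the authors' C# application, and the names `dbscan`, `region_query`, `eps`, and `min_pts`, as well as the toy points, are assumptions chosen for illustration. The neighbourhood radius `eps` and the density threshold `min_pts` decide which points become core points, and therefore how many clusters and noise points result.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Toy data (hypothetical): two dense squares and one isolated point.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (5, 5),
]
print(dbscan(points, eps=1.5, min_pts=3))  # → [0, 0, 0, 0, 1, 1, 1, 1, -1]
print(dbscan(points, eps=0.5, min_pts=3))  # every point is noise
print(dbscan(points, eps=8.0, min_pts=3))  # everything merges into cluster 0
```

The three calls use the same data and differ only in `eps`, yet they produce two clusters plus noise, all noise, and a single cluster. This brute-force version scans every point per query; it says nothing about the running times reported in the abstract.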


Author(s):  
Qiong Wu ◽  
Tianzhou Ma ◽  
Qingzhi Liu ◽  
Donald K Milton ◽  
Yuan Zhang ◽  
...  

Motivation: The analysis of gene co-expression networks (GCNs) is critical in examining gene-gene interactions and learning the underlying complex yet highly organized gene regulatory mechanisms. Numerous clustering methods have been developed to detect communities of co-expressed genes in a large network. The assumed independent community structure, however, can be oversimplified and may not adequately characterize the complex biological processes. Results: We develop a new computational package to extract interconnected communities from a gene co-expression network. We consider a pair of communities to be interconnected if a subset of genes from one community is correlated with a subset of genes from another community. The interconnected community structure is more flexible and provides a better fit to the empirical co-expression matrix. To overcome the computational challenges, we develop efficient algorithms by leveraging an advanced graph norm shrinkage approach. We validate our method and show its advantage through extensive simulation studies. We then apply our interconnected community detection method to an RNA-seq dataset from The Cancer Genome Atlas (TCGA) Acute Myeloid Leukemia (AML) study and identify essential interacting biological pathways related to the immune evasion mechanism of tumor cells. Availability: The software is available at GitHub: https://github.com/qwu1221/ICN and Figshare: https://figshare.com/articles/software/ICN-package/13229093. Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
pp. 1-18
Author(s):  
Trang T.D. Nguyen ◽  
Loan T.T. Nguyen ◽  
Anh Nguyen ◽  
Unil Yun ◽  
Bay Vo

Spatial clustering is one of the main techniques for spatial data mining and spatial data analysis. However, existing spatial clustering methods primarily focus on points distributed in planar space with the Euclidean distance measurement. Recently, NS-DBSCAN has been developed to cluster spatial point events in network space, based on the well-known Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. Compared with the NC_DT (Network-Constraint Delaunay Triangulation) clustering algorithm, NS-DBSCAN efficiently solves the problem of clustering network-constrained spatial points and visualizes the intrinsic clustering structure of spatial data by constructing density ordering charts. Its main drawback is that, while the data are processed, objects that are not assigned to any cluster cannot be removed, which wastes time, particularly when the dataset is large. To make the algorithm more efficient, we recommend removing edges that are longer than the threshold and eliminating low-density points from the density ordering table when forming clusters, along with other effective techniques. In this paper, we develop a theorem to determine the maximum length of an edge in a road segment. Based on this theorem, an algorithm is proposed to greatly improve the performance of the density-based clustering algorithm in network space (NS-DBSCAN). Experiments with our proposed algorithm, carried out in collaboration with Ho Chi Minh City, Vietnam, yield the same results as NS-DBSCAN but show an advantage in execution time.


Epigenomics ◽  
2020 ◽  
Author(s):  
Qijie Zhao ◽  
Jinan Guo ◽  
Yueshui Zhao ◽  
Jing Shen ◽  
Parham Jabbarzadeh Kaboli ◽  
...  

Background: PD-L1 and PD-L2 are ligands of PD-1. Their overexpression has been reported in different cancers. However, the underlying mechanism of PD-L1 and PD-L2 dysregulation and their related signaling pathways are still unclear in gastrointestinal cancers. Materials & methods: The expression of PD-L1 and PD-L2 was studied in The Cancer Genome Atlas and Genotype-Tissue Expression databases. The gene and protein alterations of PD-L1 and PD-L2 were analyzed in cBioPortal. The direct transcription factor regulating PD-L1/PD-L2 was determined with ChIP-seq data. The association of PD-L1/PD-L2 expression with clinicopathological parameters, survival, immune infiltration and tumor mutation burden was investigated with data from The Cancer Genome Atlas. Potential targets and pathways of PD-L1 and PD-L2 were determined by protein enrichment, WebGestalt and gene ontology. Results: Comprehensive analysis revealed that PD-L1 and PD-L2 were significantly upregulated in most types of gastrointestinal cancers and that their expression levels were positively correlated. SP1 was a key transcription factor regulating the expression of PD-L1. Conclusion: Higher PD-L1 or PD-L2 expression was significantly associated with poor overall survival, higher tumor mutation burden and more immune and stromal cell populations. Finally, HIF-1, ERBB and mTOR signaling pathways were most significantly affected by PD-L1 and PD-L2 dysregulation. Altogether, this study provides a comprehensive analysis of the dysregulation of PD-L1 and PD-L2, its underlying mechanism and downstream pathways, which adds to the knowledge of manipulating PD-L1/PD-L2 for cancer immunotherapy.


2020 ◽  
Vol 27 (11) ◽  
pp. 3021-3036 ◽  
Author(s):  
Hua Yu ◽  
Jun Ding ◽  
Hongwen Zhu ◽  
Yao Jing ◽  
Hu Zhou ◽  
...  

The lysyl oxidase (LOX) family is closely related to the progression of glioma. To establish the clinical significance of the LOX family in glioma, The Cancer Genome Atlas (TCGA) database was mined, and the analysis indicated that higher LOXL1 expression was correlated with more malignant glioma progression. The functions of LOXL1 in promoting glioma cell survival and inhibiting apoptosis were studied by gain- and loss-of-function experiments in cells and animals. LOXL1 was found to exhibit antiapoptotic activity by interacting with multiple antiapoptosis modulators, especially BAG family molecular chaperone regulator 2 (BAG2). LOXL1-D515 interacted with BAG2-K186 through a hydrogen bond, and its lysyl oxidase activity prevented BAG2 degradation by competing with K186 ubiquitylation. We then discovered that LOXL1 expression was specifically upregulated through the VEGFR-Src-CEBPA axis. Clinically, patients with higher LOXL1 levels in their blood had much more abundant BAG2 protein in glioma tissues. In conclusion, LOXL1 functions as an important mediator that increases the antiapoptotic capacity of tumor cells, and approaches targeting LOXL1 represent a potential strategy for treating glioma. In addition, blood LOXL1 levels can be used as a biomarker to monitor glioma progression.


2019 ◽  
Vol 20 (22) ◽  
pp. 5697 ◽  
Author(s):  
Michelle E. Pewarchuk ◽  
Mateus C. Barros-Filho ◽  
Brenda C. Minatel ◽  
David E. Cohn ◽  
Florian Guisier ◽  
...  

Recent studies have uncovered microRNAs (miRNAs) that have been overlooked in early genomic explorations, which show remarkable tissue- and context-specific expression. Here, we aim to identify and characterize previously unannotated miRNAs expressed in gastric adenocarcinoma (GA). Raw small RNA-sequencing data were analyzed using the miRMaster platform to predict and quantify previously unannotated miRNAs. A discovery cohort of 475 gastric samples (434 GA and 41 adjacent nonmalignant samples), collected by The Cancer Genome Atlas (TCGA), were evaluated. Candidate miRNAs were similarly assessed in an independent cohort of 25 gastric samples. We discovered 170 previously unannotated miRNA candidates expressed in gastric tissues. The expression of these novel miRNAs was highly specific to the gastric samples, 143 of which were significantly deregulated between tumor and nonmalignant contexts (p-adjusted < 0.05; fold change > 1.5). Multivariate survival analyses showed that the combined expression of one previously annotated miRNA and two novel miRNA candidates was significantly predictive of patient outcome. Further, the expression of these three miRNAs was able to stratify patients into three distinct prognostic groups (p = 0.00003). These novel miRNAs were also present in the independent cohort (43 sequences detected in both cohorts). Our findings uncover novel miRNA transcripts in gastric tissues that may have implications in the biology and management of gastric adenocarcinoma.


BMC Cancer ◽  
2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Chengwu Xiao ◽  
Wei Zhang ◽  
Meimian Hua ◽  
Huan Chen ◽  
Bin Yang ◽  
...  

Background: The tripartite motif (TRIM) family proteins exhibit oncogenic roles in various cancers. The roles of TRIM27, a member of the TRIM superfamily, in renal cell carcinoma (RCC) remained unexplored. In the current study, we aimed to investigate the clinical impact and roles of TRIM27 in the development of RCC. Methods: The mRNA levels of TRIM27 and Kaplan–Meier survival of RCC were analyzed from The Cancer Genome Atlas database. Real-time PCR and Western blotting were used to measure the mRNA and protein levels of TRIM27 both in vivo and in vitro. siRNA knockdown and exogenous overexpression of TRIM27 were used in RCC cell lines to manipulate TRIM27 expression. Results: We discovered that TRIM27 was elevated in RCC patients, and the expression of TRIM27 was closely correlated with poor prognosis. The loss-of-function and gain-of-function results illustrated that TRIM27 promotes cell proliferation and inhibits apoptosis in RCC cell lines. Furthermore, TRIM27 expression was positively associated with NF-κB expression in patients with RCC. Blocking the activity of NF-κB attenuated the TRIM27-mediated enhancement of proliferation and inhibition of apoptosis. TRIM27 directly interacted with IκBα, an inhibitor of NF-κB, to promote its ubiquitination, and the inhibitory effects of TRIM27 on IκBα led to NF-κB activation. Conclusions: Our results suggest that TRIM27 exhibits an oncogenic role in RCC by regulating NF-κB signaling. TRIM27 serves as a specific prognostic indicator for RCC, and strategies targeting the suppression of TRIM27 function may shed light on future therapeutic approaches.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Dejun Wu ◽  
Zhenhua Yin ◽  
Yisheng Ji ◽  
Lin Li ◽  
Yunxin Li ◽  
...  

LncRNAs play a pivotal role in tumorigenesis and development. However, the potential involvement of lncRNAs in colon adenocarcinoma (COAD) needs to be further explored. All the data used in this study were obtained from The Cancer Genome Atlas database, and all analyses were conducted using R software. Based on the seven prognosis-related lncRNAs finally selected, we developed a prognosis-predicting model with powerful effectiveness (training cohort, 1 year: AUC = 0.70, 95% CI = 0.57–0.78; 3 years: AUC = 0.71, 95% CI = 0.6–0.8; 5 years: AUC = 0.76, 95% CI = 0.66–0.87; validation cohort, 1 year: AUC = 0.70, 95% CI = 0.58–0.8; 3 years: AUC = 0.73, 95% CI = 0.63–0.82; 5 years: AUC = 0.68, 95% CI = 0.5–0.85). The VEGF and Notch pathways were analyzed through GSEA, and low immune and stromal scores were found in high-risk patients (immune score, cor = −0.15, P < 0.001; stromal score, cor = −0.18, P < 0.001), which may partially explain the poor prognosis of patients in the high-risk group. We screened lncRNAs that are significantly associated with the survival of patients with COAD and possibly participate in autophagy regulation. This study may provide direction for future research.

