Efficient weighted univariate clustering maps outstanding dysregulated genomic zones in human cancers

2020 ◽  
Vol 36 (20) ◽  
pp. 5027-5036 ◽  
Author(s):  
Mingzhou Song ◽  
Hua Zhong

Abstract Motivation Chromosomal patterning of gene expression in cancer can arise from aneuploidy, genome disorganization or abnormal DNA methylation. To map such patterns, we introduce a weighted univariate clustering algorithm to guarantee linear runtime, optimality and reproducibility. Results We present the chromosome clustering method, establish its optimality and runtime and evaluate its performance. It uses dynamic programming enhanced with an algorithm to reduce search-space in-place to decrease runtime overhead. Using the method, we delineated outstanding genomic zones in 17 human cancer types. We identified strong continuity in dysregulation polarity—dominance by either up- or downregulated genes in a zone—along chromosomes in all cancer types. Significantly polarized dysregulation zones specific to cancer types are found, offering potential diagnostic biomarkers. Unreported previously, a total of 109 loci with conserved dysregulation polarity across cancer types give insights into pan-cancer mechanisms. Efficient chromosomal clustering opens a window to characterize molecular patterns in cancer genome and beyond. Availability and implementation Weighted univariate clustering algorithms are implemented within the R package ‘Ckmeans.1d.dp’ (4.0.0 or above), freely available at https://cran.r-project.org/package=Ckmeans.1d.dp. Supplementary information Supplementary data are available at Bioinformatics online.
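As a rough illustration of what optimal weighted univariate clustering by dynamic programming means, the sketch below solves weighted k-means on a line exactly with a plain O(k·n²) recurrence. It is not the package's implementation: the linear-runtime search-space reduction described in the abstract is omitted, and the function name `weighted_kmeans_1d` and its arguments are hypothetical. Weights are assumed to be positive.

```python
from typing import List, Sequence, Tuple


def weighted_kmeans_1d(
    xs: Sequence[float], ws: Sequence[float], k: int
) -> Tuple[List[int], float]:
    """Exact weighted k-means on a line via a simple O(k * n^2) dynamic program.

    Hypothetical helper for illustration only. Returns the cluster label of each
    input point (clusters are numbered left to right) and the total weighted
    within-cluster sum of squares.
    """
    n = len(xs)
    if len(ws) != n:
        raise ValueError("xs and ws must have the same length")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(xs)")

    # In one dimension an optimal clustering is contiguous in sorted order.
    order = sorted(range(n), key=lambda i: xs[i])
    x = [xs[i] for i in order]
    w = [ws[i] for i in order]

    # Prefix sums of w, w*x and w*x^2 give any segment cost in O(1).
    pw, pwx, pwxx = [0.0], [0.0], [0.0]
    for xi, wi in zip(x, w):
        pw.append(pw[-1] + wi)
        pwx.append(pwx[-1] + wi * xi)
        pwxx.append(pwxx[-1] + wi * xi * xi)

    def cost(i: int, j: int) -> float:
        """Weighted sum of squares of sorted points i..j-1 around their weighted mean."""
        sw = pw[j] - pw[i]
        if sw <= 0:
            return 0.0
        swx = pwx[j] - pwx[i]
        return max(pwxx[j] - pwxx[i] - swx * swx / sw, 0.0)

    inf = float("inf")
    # best[m][j]: minimum cost of splitting the first j sorted points into m clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    cut[m][j] = i

    # Backtrack the cut positions to recover labels in sorted order.
    labels_sorted = [0] * n
    j = n
    for m in range(k, 0, -1):
        i = cut[m][j]
        for t in range(i, j):
            labels_sorted[t] = m - 1
        j = i

    # Map labels back to the original input order.
    labels = [0] * n
    for pos, original_index in enumerate(order):
        labels[original_index] = labels_sorted[pos]
    return labels, best[k][n]
```

Because the recurrence examines every cut position, the result is a global optimum and is identical on every run, which is the sense in which such a method is optimal and reproducible. For production use the R package named in the abstract is the appropriate tool.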

Cells ◽  
2020 ◽  
Vol 10 (1) ◽  
pp. 45
Author(s):  
Darío Rocha ◽  
Iris A. García ◽  
Aldana González Montoro ◽  
Andrea Llera ◽  
Laura Prato ◽  
...  

Studying tissue-independent components of cancer and defining pan-cancer subtypes could be addressed using tissue-specific molecular signatures if classification errors are controlled. Since PAM50 is a well-known, United States Food and Drug Administration (FDA)-approved and commercially available breast cancer signature, we applied it with uncertainty assessment to classify tumor samples from over 33 cancer types, discarded unassigned samples, and studied the emerging tumor-agnostic molecular patterns. The percentage of unassigned samples ranged between 55.5% and 86.9% in non-breast tissues, and gene set analysis suggested that the remaining samples could be grouped into two classes (named C1 and C2) regardless of the tissue. The C2 class was more dedifferentiated, more proliferative, with higher centrosome amplification, and potentially more TP53 and RB1 mutations. We identified 28 gene sets and 95 genes mainly associated with cell-cycle progression, cell-cycle checkpoints, and DNA damage that were consistently exacerbated in the C2 class. In some cancer types, the C1/C2 classification was associated with survival and drug sensitivity, and modulated the prognostic meaning of the immune infiltrate. Our results suggest that PAM50 could be repurposed for a pan-cancer context when paired with uncertainty assessment, resulting in two classes with molecular, biological, and clinical implications.


Author(s):  
Xiaofan Lu ◽  
Jialin Meng ◽  
Yujie Zhou ◽  
Liyun Jiang ◽  
Fangrong Yan

Abstract Summary Stratification of cancer patients into distinct molecular subgroups based on multi-omics data is an important issue in the context of precision medicine. Here, we present MOVICS, an R package for multi-omics integration and visualization in cancer subtyping. MOVICS provides a unified interface for 10 state-of-the-art multi-omics integrative clustering algorithms, and incorporates the most commonly used downstream analyses in cancer subtyping research, including characterization and comparison of identified subtypes from multiple perspectives, and verification of subtypes in an external cohort using two model-free approaches for multiclass prediction. MOVICS also creates feature-rich customizable visualizations with minimal effort. By analysing two published breast cancer cohorts, we show that MOVICS can serve a wide range of users and assist cancer therapy by moving away from the ‘one-size-fits-all’ approach to patient care. Availability and implementation MOVICS package and online tutorial are freely available at https://github.com/xlucpu/MOVICS. Supplementary information Supplementary data are available at Bioinformatics online.


2013 ◽  
Vol 411-414 ◽  
pp. 1884-1893
Author(s):  
Yong Chun Cao ◽  
Ya Bin Shao ◽  
Shuang Liang Tian ◽  
Zheng Qi Cai

Because many clustering algorithms based on GAs suffer from degeneracy and easily fall into local optima, a novel dynamic genetic algorithm for clustering problems (DGA) is proposed. The algorithm adopts variable-length coding to represent individuals and performs the crossover operation in parallel within subpopulations of individuals of the same length, which allows DGA to explore the search space more effectively and to obtain automatically the proper number of clusters and the proper partition of a given data set. The algorithm also uses a dynamic crossover probability and an adaptive mutation probability, which prevent it from getting stuck at a local optimum. Clustering experiments on three artificial data sets and two real-life data sets show that DGA achieves better performance and higher accuracy on clustering problems.


2021 ◽  
Author(s):  
Manuel Fritz ◽  
Michael Behringer ◽  
Dennis Tschechlov ◽  
Holger Schwarz

Abstract Clustering is a fundamental primitive in manifold applications. In order to achieve valuable results in exploratory clustering analyses, parameters of the clustering algorithm have to be set appropriately, which is a tremendous pitfall. We observe multiple challenges for large-scale exploration processes. On the one hand, they require specific methods to efficiently explore large parameter search spaces. On the other hand, they often exhibit large runtimes, in particular when large datasets are analyzed using clustering algorithms with super-polynomial runtimes, which repeatedly need to be executed within exploratory clustering analyses. We address these challenges as follows: First, we present LOG-Means and show that it provides estimates for the number of clusters in sublinear time regarding the defined search space, i.e., provably requiring fewer executions of a clustering algorithm than existing methods. Second, we demonstrate how to exploit fundamental characteristics of exploratory clustering analyses in order to significantly accelerate the (repetitive) execution of clustering algorithms on large datasets. Third, we show how these challenges can be tackled at the same time. To the best of our knowledge, this is the first work which simultaneously addresses the above-mentioned challenges. In our comprehensive evaluation, we show that our proposed methods significantly outperform state-of-the-art methods, thus especially supporting novice analysts for exploratory clustering analyses in large-scale exploration processes.


2019 ◽  
Vol 35 (21) ◽  
pp. 4419-4421 ◽  
Author(s):  
Sun Ah Kim ◽  
Myriam Brossard ◽  
Delnaz Roshandel ◽  
Andrew D Paterson ◽  
Shelley B Bull ◽  
...  

Abstract Summary For the analysis of high-throughput genomic data produced by next-generation sequencing (NGS) technologies, researchers need to identify linkage disequilibrium (LD) structure in the genome. In this work, we developed an R package gpart which provides clustering algorithms to define LD blocks or analysis units consisting of SNPs. The visualization tool in gpart can display the LD structure and gene positions for up to 20 000 SNPs in one image. The gpart functions facilitate construction of LD blocks and SNP partitions for vast amounts of genome sequencing data within reasonable time and memory limits in personal computing environments. Availability and implementation The R package is available at https://bioconductor.org/packages/gpart. Supplementary information Supplementary data are available at Bioinformatics online.
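To make the notion of an LD block concrete, the following sketch computes pairwise r² as the squared Pearson correlation of genotype dosages (0, 1 or 2 copies of an allele per sample) and then groups neighbouring SNPs whose adjacent-pair r² clears a threshold. This naive adjacent-pair partition is an assumption made for illustration and is not the clustering procedure implemented in gpart; the names `r_squared` and `adjacent_blocks` and the default threshold are hypothetical.

```python
from typing import List, Sequence


def r_squared(g1: Sequence[float], g2: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(g1)
    if n == 0 or len(g2) != n:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    m1 = sum(g1) / n
    m2 = sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        # A monomorphic SNP carries no LD information; treat it as uncorrelated.
        return 0.0
    return cov * cov / (v1 * v2)


def adjacent_blocks(
    genotypes: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[List[int]]:
    """Group consecutive SNP indices while each adjacent pair has r^2 >= threshold.

    `genotypes` holds one dosage vector per SNP, ordered by genomic position.
    """
    if not genotypes:
        return []
    blocks: List[List[int]] = [[0]]
    for j in range(1, len(genotypes)):
        if r_squared(genotypes[j - 1], genotypes[j]) >= threshold:
            blocks[-1].append(j)  # extend the current block
        else:
            blocks.append([j])  # start a new block at SNP j
    return blocks
```

A real LD-block method would consider all pairs within a window rather than only neighbours, which is where the time and memory limits mentioned in the abstract become relevant.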


2020 ◽  
Vol 36 (14) ◽  
pp. 4144-4153
Author(s):  
Jack LeBien ◽  
Gerald McCollam ◽  
Joel Atallah

Abstract Motivation Recent research has uncovered roles for transposable elements (TEs) in multiple evolutionary processes, ranging from somatic evolution in cancer to putatively adaptive germline evolution across species. Most models of TE population dynamics, however, have not incorporated actual genome sequence data. The effect of site integration preferences of specific TEs on evolutionary outcomes and the effects of different selection regimes on TE dynamics in a specific genome are unknown. We present a stochastic model of LINE-1 (L1) transposition in human cancer. This system was chosen because the transposition of L1 elements is well understood, the population dynamics of cancer tumors has been modeled extensively, and the role of L1 elements in cancer progression has garnered interest in recent years. Results Our model predicts that L1 retrotransposition (RT) can play either advantageous or deleterious roles in tumor progression, depending on the initial lesion size, L1 insertion rate and tumor driver genes. Small changes in the RT rate or set of driver tumor-suppressor genes (TSGs) were observed to alter the dynamics of tumorigenesis. We found high variation in the density of L1 target sites across human protein-coding genes. We also present an analysis, across three cancer types, of the frequency of homozygous TSG disruption in wild-type hosts compared to those with an inherited driver allele. Availability and implementation Source code is available at https://github.com/atallah-lab/neoplastic-evolution. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 48 (5) ◽  
pp. 2287-2302 ◽  
Author(s):  
Zishan Wang ◽  
Jiaqi Yin ◽  
Weiwei Zhou ◽  
Jing Bai ◽  
Yunjin Xie ◽  
...  

Abstract Accumulating evidence has demonstrated that transcriptional regulation is affected by DNA methylation. Understanding the perturbation of DNA methylation-mediated regulation between transcriptional factors (TFs) and targets is crucial for human diseases. However, the global landscape of DNA methylation-mediated transcriptional dysregulation (DMTD) across cancers has not been portrayed. Here, we systematically identified DMTD by integrative analysis of transcriptome, methylome and regulatome across 22 human cancer types. Our results revealed that transcriptional regulation was affected by DNA methylation, involving hundreds of methylation-sensitive TFs (MethTFs). In addition, pan-cancer MethTFs, the regulatory activity of which is generally affected by DNA methylation across cancers, exhibit dominant functional characteristics and regulate several cancer hallmarks. Moreover, pan-cancer MethTFs were found to be affected by DNA methylation in a complex pattern. Finally, we investigated the cooperation among MethTFs and identified a network module that consisted of 43 MethTFs with prognostic potential. In summary, we systematically dissected the transcriptional dysregulation mediated by DNA methylation across cancer types, and our results provide a valuable resource for both epigenetic and transcriptional regulation communities.


2020 ◽  
Vol 36 (9) ◽  
pp. 2778-2786 ◽  
Author(s):  
Shobana V Stassen ◽  
Dickson M D Siu ◽  
Kelvin C M Lee ◽  
Joshua W K Ho ◽  
Hayden K H So ◽  
...  

Abstract Motivation New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results We introduce a highly scalable graph-based clustering algorithm PARC—Phenotyping by Accelerated Refined Community-partitioning—for large-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell flow and mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without subsampling of cells, including Phenograph, FlowSOM and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell dataset of 1.1 million cells within 13 min, compared with >2 h for the next fastest graph-clustering algorithm. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis. Availability and implementation https://github.com/ShobiStassen/PARC. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
R. Greg Stacey ◽  
Michael A. Skinnider ◽  
Leonard J. Foster

Abstract Biological functions emerge from complex and dynamic networks of protein-protein interactions. Because these protein-protein interaction networks, or interactomes, represent pairwise connections within a hierarchically organized system, it is often useful to identify higher-order associations embedded within them, such as multi-member protein-complexes. Graph-based clustering techniques are widely used to accomplish this goal, and dozens of field-specific and general clustering algorithms exist. However, interactomes can be prone to errors, especially interactomes that infer interactions using high-throughput biochemical assays. Therefore, robustness to network-level variability is an important criterion for any clustering algorithm that aims to generate robust, reproducible clusters. Here, we tested the robustness of a range of graph-based clustering algorithms in the presence of network-level noise, including algorithms common across domains and those specific to protein networks. We found that the results of all clustering algorithms measured were profoundly sensitive to injected network noise. Randomly rewiring 1% of network edges yielded up to a 57% change in clustering results, indicating that clustering markedly amplified network-level noise. However, the impact of network noise on individual clusters was not uniform. We found that some clusters were consistently robust to injected network noise while others were not. Therefore, we developed the clust.perturb R package and Shiny web application, which measures the reproducibility of clusters by randomly perturbing the network. We show that clust.perturb results are predictive of real-world cluster stability: poorly reproducible clusters as identified by clust.perturb are significantly less likely to be reclustered across experiments. We conclude that quantifying the robustness of a cluster to network noise, as implemented in clust.perturb, provides a powerful tool for ranking the reproducibility of clusters, and separating stable protein complexes from spurious associations.
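The idea of scoring clusters by perturbing the network can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the clust.perturb implementation: connected components stand in for a real graph-clustering algorithm, rewiring replaces a chosen fraction of edges with random node pairs, and each original cluster is scored by its mean best Jaccard overlap with the clusters found after perturbation. All function names and defaults are hypothetical.

```python
import random
from typing import Dict, FrozenSet, List, Sequence, Tuple

Edge = Tuple[str, str]


def connected_components(nodes: Sequence[str], edges: Sequence[Edge]) -> List[FrozenSet[str]]:
    """Stand-in clustering step (an assumption): clusters are connected components."""
    adjacency: Dict[str, set] = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen: set = set()
    components: List[FrozenSet[str]] = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            members.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(frozenset(members))
    return components


def rewire(nodes: Sequence[str], edges: Sequence[Edge], fraction: float, rng: random.Random) -> List[Edge]:
    """Replace a fraction of the edges with edges between random distinct node pairs."""
    new_edges = list(edges)
    n_swap = round(fraction * len(new_edges))
    for index in rng.sample(range(len(new_edges)), n_swap):
        a, b = rng.sample(list(nodes), 2)
        new_edges[index] = (a, b)
    return new_edges


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard index of two non-empty node sets."""
    return len(a & b) / len(a | b)


def cluster_reproducibility(
    nodes: Sequence[str],
    edges: Sequence[Edge],
    fraction: float = 0.1,
    n_iter: int = 50,
    seed: int = 0,
) -> Dict[FrozenSet[str], float]:
    """Mean best-match Jaccard of each original cluster over perturbed networks."""
    rng = random.Random(seed)
    original = connected_components(nodes, edges)
    totals = {cluster: 0.0 for cluster in original}
    for _ in range(n_iter):
        perturbed = connected_components(nodes, rewire(nodes, edges, fraction, rng))
        for cluster in original:
            totals[cluster] += max(jaccard(cluster, other) for other in perturbed)
    return {cluster: total / n_iter for cluster, total in totals.items()}
```

A score near 1 means the cluster reappears almost unchanged after each perturbation, while a low score marks a cluster that depends on a few specific edges. Swapping `connected_components` for any other clustering function with the same signature leaves the scoring logic unchanged.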


2020 ◽  
Author(s):  
Eric Minwei Liu ◽  
Augustin Luna ◽  
Guanlan Dong ◽  
Chris Sander

Abstract Summary Large-scale sequencing projects, such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), have accumulated a variety of high throughput sequencing and molecular profiling data, but it is still challenging to identify potentially causal genetic mutations in cancer as well as in other diseases in an automated fashion. We developed the NetBoxR package written in the R programming language, that makes use of the NetBox algorithm to identify candidate cancer-related processes. The algorithm makes use of a network-based approach that combines prior knowledge with a network clustering algorithm, obviating the need for and the limitation of functionally curated gene sets. A key aspect of this approach is its ability to combine multiple data types, such as mutations and copy number alterations, leading to more reliable identification of functional modules. We make the tool available in the Bioconductor R ecosystem for applications in cancer research and cell biology. Availability and implementation The NetBoxR package is free and open-sourced under the GNU GPL-3 license R package available at https://www.bioconductor.org/packages/release/bioc/html/[email protected]; [email protected]; [email protected] informationNone

