NOJAH: Not Just Another Heatmap for Genome-Wide Cluster Analysis

Abstract Since their inception, several tools have been developed for cluster analysis and heatmap construction. The application of such tools to the number and types of genome-wide data available from next generation sequencing (NGS) technologies requires the adaptation of statistical concepts, such as in defining a most variable gene set, and more intricate cluster analysis methods to address multiple omic data types. Additionally, the growing number of publicly available datasets has created the desire to estimate the statistical significance of a gene signature derived from one dataset to similarly group samples based on another dataset. The currently available number of tools and their combined use for generating heatmaps, along with the several adaptations of statistical concepts for addressing the higher dimensionality of genome-wide NGS-derived data, has created a further challenge in the ability to replicate heatmap results. We introduce NOJAH (NOt Just Another Heatmap), an interactive tool that defines and implements a workflow for genome-wide cluster analysis and heatmap construction by creating and combining several tools into a single user interface. NOJAH includes several newly developed scripts for techniques that, though frequently applied, are not sufficiently documented to allow for replicability of results. These techniques include: defining a most variable gene set (a.k.a. ‘core genes’), estimating the statistical significance of a gene signature to separate samples into clusters, and performing a result-merging integrated cluster analysis. With only a user-uploaded dataset, NOJAH provides as output, among other things, the minimum documentation required for replicating heatmap results. Additionally, NOJAH contains five different existing R packages that are connected in the interface by their functionality as part of a defined workflow for genome-wide cluster analysis.
The NOJAH application tool is available at http://bbisr.shinyapps.winship.emory.edu/NOJAH/ with corresponding source code available at https://github.com/bbisr-shinyapps/NOJAH/.
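
As a rough illustration of what "defining a most variable gene set" can mean, the sketch below ranks genes by median absolute deviation (MAD) across samples and keeps the top few. The choice of MAD, the function names, and the toy gene labels are assumptions made for this example; the abstract does not state which variability measure or cutoff NOJAH uses.

```python
from statistics import median


def median_abs_deviation(values):
    """Median of absolute deviations from the median (a robust spread measure)."""
    centre = median(values)
    return median(abs(v - centre) for v in values)


def most_variable_genes(expression, top_n):
    """Return the top_n gene names ranked by MAD, most variable first.

    expression: dict mapping a gene name to its per-sample values.
    MAD is an assumed variability measure, not necessarily NOJAH's.
    """
    ranked = sorted(
        expression,
        key=lambda gene: median_abs_deviation(expression[gene]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical toy data: three genes measured in four samples.
toy_expression = {
    "GENE_A": [1.0, 1.1, 0.9, 1.0],
    "GENE_B": [0.2, 5.0, 0.1, 4.8],
    "GENE_C": [2.0, 2.5, 1.5, 3.0],
}
core_genes = most_variable_genes(toy_expression, top_n=2)
print(core_genes)
```

Recording the measure and the cutoff alongside the heatmap is the kind of documentation the abstract argues is needed for replicating results.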

Pan-cancer subtyping in a 2D-map shows substructures that are driven by specific combinations of molecular characteristics

Scientific Reports, 2016, Vol 6 (1). DOI: 10.1038/srep24949. Cited By ~ 13.

Author(s): Erdogan Taskesen, Sjoerd M. H. Huisman, Ahmed Mahfouz, Jesse H. Krijthe, Jeroen de Ridder, ...

Keyword(s): Therapy Response, Visual Exploration, Molecular Characteristics, Cancer Type, Breast Cancers, Data Types, Multiple Cancer, Genome Wide, Genome Wide Data, Cancer Types

Abstract The use of genome-wide data in cancer research, for the identification of groups of patients with similar molecular characteristics, has become a standard approach for applications in therapy-response, prognosis-prediction, and drug-development. To progress in these applications, the trend is to move from single genome-wide measurements in a single cancer-type towards measuring several different molecular characteristics across multiple cancer-types. Although current approaches shed light on molecular characteristics of various cancer-types, detailed relationships between patients within cancer clusters are unclear. We propose a novel multi-omic integration approach that exploits the joint behavior of the different molecular characteristics, supports visual exploration of the data by a two-dimensional landscape, and allows inspection of the contribution of the different genome-wide data-types. We integrated 4,434 samples across 19 cancer-types, derived from TCGA, containing gene expression, DNA-methylation, copy-number variation and microRNA expression data. Cluster analysis revealed 18 clusters, where three clusters showed a complex collection of cancer-types: squamous-cell-carcinoma, colorectal cancers, and a novel grouping of kidney-cancers. Sixty-four samples were identified outside their tissue-of-origin cluster. Known and novel patient subgroups were detected for acute myeloid leukemias and breast cancers. Quantification of the contributions of the different molecular types showed that substructures are driven by specific (combinations of) molecular characteristics.

GSEA-InContext Explorer: An interactive visualization tool for putting gene set enrichment analysis results into biological context

2019. DOI: 10.1101/659847.

Author(s): Rani K. Powers, Anthony Sun, James C. Costello

Keyword(s): Statistical Significance, Null Distribution, Enrichment Analysis, Gene Set Enrichment Analysis, Gene Set Enrichment, Gene Set, Link Type, Interactive Interface, Gene Sets, Shiny App

Abstract Summary: GSEA-InContext Explorer is a Shiny app that allows users to perform two methods of gene set enrichment analysis (GSEA). The first, GSEAPreranked, applies the GSEA algorithm in which statistical significance is estimated from a null distribution of enrichment scores generated for randomly permuted gene sets. The second, GSEA-InContext, incorporates a user-defined set of background experiments to define the null distribution and calculate statistical significance. GSEA-InContext Explorer allows the user to build custom background sets from a compendium of over 5,700 curated experiments, run both GSEAPreranked and GSEA-InContext on their own uploaded experiment, and explore the results using an interactive interface. This tool will allow researchers to visualize gene sets that are commonly enriched across experiments and identify gene sets that are uniquely significant in their experiment, thus complementing current methods for interpreting gene set enrichment results. Availability and implementation: The code for GSEA-InContext Explorer is available at: https://github.com/CostelloLab/GSEA-InContext_Explorer and the interactive tool is at: http://gsea-incontext_explorer.ngrok.io
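
The idea of estimating significance from a null distribution of enrichment scores for random gene sets can be sketched as follows. This is a deliberately simplified, unweighted running-sum statistic with an empirical two-sided p-value; it is not the GSEAPreranked or GSEA-InContext implementation, and all function names and the toy ranking are assumptions made for illustration.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score: step up on set members, down otherwise.

    Returns the running-sum value with the largest absolute deviation from 0.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, extreme = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical p-value against random gene sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    null_scores = [
        enrichment_score(ranked_genes, set(rng.sample(ranked_genes, len(gene_set))))
        for _ in range(n_perm)
    ]
    n_extreme = sum(abs(score) >= abs(observed) for score in null_scores)
    # Add-one correction so the estimate is never exactly zero.
    return observed, (n_extreme + 1) / (n_perm + 1)


# Hypothetical ranking of 20 genes; the gene set occupies the top five ranks.
ranking = [f"g{i}" for i in range(20)]
top_set = {"g0", "g1", "g2", "g3", "g4"}
observed_score, p_value = permutation_p_value(ranking, top_set)
print(observed_score, p_value)
```

A background-based variant in the spirit of GSEA-InContext would replace the random gene sets with scores computed from a chosen collection of other experiments; that step is not shown here.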

fluff: exploratory analysis and visualization of high-throughput sequencing data

PeerJ, 2016, Vol 4, pp. e2209. DOI: 10.7717/peerj.2209. Cited By ~ 28.

Author(s): Georgios Georgiou, Simon J. van Heeringen

Keyword(s): High Throughput, High Throughput Sequencing, Developmental Stages, Command Line, Clustering Methods, Sequencing Data, Link Type, High Throughput Sequencing Data, Genome Wide, Genome Wide Data

Summary. In this article we describe fluff, a software package that allows for simple exploration, clustering and visualization of high-throughput sequencing data mapped to a reference genome. The package contains three command-line tools to generate publication-quality figures in an uncomplicated manner using sensible defaults. Genome-wide data can be aggregated, clustered and visualized in a heatmap, according to different clustering methods. This includes a predefined setting to identify dynamic clusters between different conditions or developmental stages. Alternatively, clustered data can be visualized in a bandplot. Finally, fluff includes a tool to generate genomic profiles. As command-line tools, the fluff programs can easily be integrated into standard analysis pipelines. The installation is straightforward and documentation is available at http://fluff.readthedocs.org. Availability. fluff is implemented in Python and runs on Linux. The source code is freely available for download at https://github.com/simonvh/fluff.

Enriching the gene set analysis of genome-wide data by incorporating directionality of gene expression and combining statistical hypotheses and methods

Nucleic Acids Research, 2013, Vol 41 (8), pp. 4378-4391. DOI: 10.1093/nar/gkt111. Cited By ~ 353.

Author(s): Leif Väremo, Jens Nielsen, Intawat Nookaew

Keyword(s): Gene Expression, Gene Set Analysis, Gene Set, Genome Wide, Genome Wide Data, Statistical Hypotheses

RIdeogram: drawing SVG graphics to visualize and map genome-wide data on the idiograms

PeerJ Computer Science, 2020, Vol 6, pp. e251. DOI: 10.7717/peerj-cs.251. Cited By ~ 17.

Author(s): Zhaodong Hao, Dekang Lv, Ying Ge, Jisen Shi, Dolf Weijers, ...

Keyword(s): Gc Content, R Package, Whole Genome, Data Mapping, Data Types, Model Species, Chromosomal Distribution, Whole Genome Analysis, Genome Wide, Genome Wide Data

Background: Owing to the rapid advances in DNA sequencing technologies, whole genomes from more and more species are becoming available at an increasing pace. For whole-genome analysis, idiograms provide a very popular, intuitive and effective way to map and visualize genome-wide information, such as GC content, gene and repeat density, DNA methylation distribution, genomic synteny, etc. However, most available software programs and web servers are available only for a few model species, such as human, mouse and fly, or have limited application scenarios. As more and more non-model species are sequenced with chromosome-level assemblies being available, tools that can generate idiograms for a broad range of species and are capable of visualizing more data types are needed to help better understand fundamental genome characteristics. Results: The R package RIdeogram allows users to build high-quality idiograms of any species of interest. It can map continuous and discrete genome-wide data on the idiograms and visualize them in a heat map and track labels, respectively. Conclusion: The visualization of genome-wide data mapping and comparison allows users to quickly establish a clear impression of the chromosomal distribution pattern, thus making RIdeogram a useful tool for any researchers working with omics.

HextractoR: an R package for automatic extraction of hairpins from genome-wide data

2020. DOI: 10.1101/2020.10.09.333898.

Author(s): Cristian Yones, Natalia Macchiaroli, Laura Kamenetzky, Georgina Stegmayer, Diego Milone

Keyword(s): Structure Prediction, Secondary Structure Prediction, R Package, Automatic Extraction, Stem Loop, Link Type, Novel Mirna, Genome Wide, Genome Wide Data, Good Set

Abstract Extracting stem-loop sequences (hairpins) from genome-wide data is very important nowadays for some data mining tasks in bioinformatics. The genome preprocessing is very important because it has a strong influence on the later steps and the final results. For example, for novel miRNA prediction, all well-known hairpins must be properly located. Although there are some scripts that can be adapted and put together to achieve this task, they are outdated, none of them guarantees finding correspondence to well-known structures in the genome under analysis, and they do not take advantage of the latest advances in secondary structure prediction. We present here an R package for automatic extraction of hairpins from genome-wide data (HextractoR). HextractoR makes an exhaustive and smart analysis of the genome in order to obtain a very good set of short sequences for further processing. Moreover, genomes can be processed in parallel and with low memory requirements. Results obtained showed that HextractoR has effectively outperformed other methods. HextractoR is freely available at CRAN and Sourceforge.

MONET: Multi-omic patient module detection by omic selection

2020. DOI: 10.1101/2020.02.21.960062.

Author(s): Nimrod Rappoport, Roy Safra, Ron Shamir

Keyword(s): Clustering Algorithms, Simulated Data, Cell Types, Clustering Methods, Data Types, Common Structure, Genome Wide, Distinct Cluster, Genome Wide Data, Unique Approach

Abstract Recent advances in experimental biology allow creation of datasets where several genome-wide data types (called omics) are measured per sample. Integrative analysis of multi-omic datasets in general, and clustering of samples in such datasets specifically, can improve our understanding of biological processes and discover different disease subtypes. In this work we present Monet (Multi Omic clustering by Non-Exhaustive Types), which presents a unique approach to multi-omic clustering. Monet discovers modules of similar samples, such that each module is allowed to have a clustering structure for only a subset of the omics. This approach differs from most extant multi-omic clustering algorithms, which assume a common structure across all omics, and from several recent algorithms that model distinct cluster structures using Bayesian statistics. We tested Monet extensively on simulated data, on an image dataset, and on ten multi-omic cancer datasets from TCGA. Our analysis shows that Monet compares favorably with other multi-omic clustering methods. We demonstrate Monet’s biological and clinical relevance by analyzing its results for Ovarian Serous Cystadenocarcinoma. We also show that Monet is robust to missing data, can cluster genes in multi-omic datasets, and can reveal modules of cell types in single-cell multi-omic data. Our work shows that Monet is a valuable tool that can provide complementary results to those provided by extant algorithms for multi-omic analysis.

fluff: exploratory analysis and visualization of high-throughput sequencing data

2016. DOI: 10.1101/045526. Cited By ~ 1.

Author(s): Georgios Georgiou, Simon J. van Heeringen

Keyword(s): High Throughput, High Throughput Sequencing, Developmental Stages, Command Line, Clustering Methods, Sequencing Data, Link Type, High Throughput Sequencing Data, Genome Wide, Genome Wide Data

Abstract Summary: In this application note we describe fluff, a software package that allows for simple exploration, clustering and visualization of high-throughput sequencing data mapped to a reference genome. The package contains three command-line tools to generate publication-quality figures in an uncomplicated manner using sensible defaults. Genome-wide data can be aggregated, clustered and visualized in a heatmap, according to different clustering methods. This includes a predefined setting to identify dynamic clusters between different conditions or developmental stages. Alternatively, clustered data can be visualized in a bandplot. Finally, fluff includes a tool to generate genomic profiles. As command-line tools, the fluff programs can easily be integrated into standard analysis pipelines. The installation is straightforward and documentation is available at http://fluff.readthedocs.org. Availability: fluff is implemented in Python and runs on Linux. The source code is freely available for download at http://github.com/simonvh/[email protected]

A 10-gene prognostic signature points to LIMCH1 and HLA-DQB1 as important players in aggressive cervical cancer disease

British Journal of Cancer, 2021. DOI: 10.1038/s41416-021-01305-0.

Author(s): Mari K. Halle, Marte Sødal, David Forsse, Hilde Engerud, Kathrine Woie, ...

Keyword(s): Cervical Cancer, Primary Tumour, Treatment Options, Prognostic Significance, Gene Signature, Prognostic Impact, Cancer Disease, Gene Set, Patient Cohort, Protein Levels

Abstract Background: Advanced cervical cancer carries a particularly poor prognosis, and few treatment options exist. Identification of effective molecular markers is vital to improve the individualisation of treatment. We investigated transcriptional data from cervical carcinomas related to patient survival and recurrence to identify potential molecular drivers for aggressive disease. Methods: Primary tumour RNA-sequencing profiles from 20 patients with recurrence and 53 patients with cured disease were compared. Protein levels and prognostic impact for selected markers were identified by immunohistochemistry in a population-based patient cohort. Results: Comparison of tumours relative to recurrence status revealed 121 differentially expressed genes. From this gene set, a 10-gene signature with high prognostic significance (p = 0.001) was identified and validated in an independent patient cohort (p = 0.004). Protein levels of two signature genes, HLA-DQB1 (n = 389) and LIMCH1 (LIM and calponin homology domain 1) (n = 410), were independent predictors of survival (hazard ratio 2.50, p = 0.007 for HLA-DQB1 and 3.19, p = 0.007 for LIMCH1) when adjusting for established prognostic markers. HLA-DQB1 protein expression associated with programmed death ligand 1 positivity (p < 0.001). In gene set enrichment analyses, HLA-DQB1-high tumours associated with immune activation and response to interferon-γ (IFN-γ). Conclusions: This study revealed a 10-gene signature with high prognostic power in cervical cancer. HLA-DQB1 and LIMCH1 are potential biomarkers guiding cervical cancer treatment.

Genetic analysis of amyotrophic lateral sclerosis identifies contributing pathways and cell types

Science Advances, 2021, Vol 7 (3), pp. eabd9036. DOI: 10.1126/sciadv.abd9036.

Author(s): Sara Saez-Atienzar, Sara Bandres-Ciga, Rebekah G. Langston, Jonggeol J. Kim, Shing Wan Choi, ...

Keyword(s): Amyotrophic Lateral Sclerosis, Membrane Trafficking, Molecular Mechanisms, Cell Types, Polygenic Risk Score, Genome Wide, Genome Wide Data, Data Driven Approach, Single Nucleus, Lateral Sclerosis

Despite the considerable progress in unraveling the genetic causes of amyotrophic lateral sclerosis (ALS), we do not fully understand the molecular mechanisms underlying the disease. We analyzed genome-wide data involving 78,500 individuals using a polygenic risk score approach to identify the biological pathways and cell types involved in ALS. This data-driven approach identified multiple aspects of the biology underlying the disease that resolved into broader themes, namely, neuron projection morphogenesis, membrane trafficking, and signal transduction mediated by ribonucleotides. We also found that genomic risk in ALS maps consistently to GABAergic interneurons and oligodendrocytes, as confirmed in human single-nucleus RNA-seq data. Using two-sample Mendelian randomization, we nominated six differentially expressed genes (ATG16L2, ACSL5, MAP1LC3A, MAPKAPK3, PLXNB2, and SCFD1) within the significant pathways as relevant to ALS. We conclude that the disparate genetic etiologies of this fatal neurological disease converge on a smaller number of final common pathways and cell types.
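
For readers unfamiliar with the term, a polygenic risk score in its most basic form is a weighted sum of risk-allele dosages. The sketch below shows only that generic calculation; the variant identifiers and effect sizes are hypothetical, treating a missing genotype as dosage 0 is an assumption of this example, and it does not reproduce the pathway- and cell-type-specific scoring used in the study.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of allele dosages (0, 1 or 2 copies) over scored variants.

    dosages: dict mapping a variant ID to one individual's allele count.
    effect_sizes: dict mapping a variant ID to a per-allele effect estimate.
    Variants missing from `dosages` are treated as dosage 0 (an assumption).
    """
    return sum(
        effect * dosages.get(variant, 0)
        for variant, effect in effect_sizes.items()
    )


# Hypothetical variants and effect sizes, for illustration only.
example_effects = {
    "rs_hypothetical_1": 0.10,
    "rs_hypothetical_2": -0.05,
    "rs_hypothetical_3": 0.20,
}
example_person = {
    "rs_hypothetical_1": 2,
    "rs_hypothetical_2": 1,
    "rs_hypothetical_3": 0,
}
example_score = polygenic_risk_score(example_person, example_effects)
print(round(example_score, 3))
```

Restricting `effect_sizes` to the variants that fall within one pathway or one cell type's marker genes is one way such a score could be made pathway- or cell-type-specific, although the abstract does not describe the exact procedure.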