Analysis of variance when both input and output sets are high-dimensional

Abstract Motivation Modern genomic data sets often involve multiple data-layers (e.g., DNA-sequence, gene expression), each of which itself can be high-dimensional. The biological processes underlying these data-layers can lead to intricate multivariate association patterns. Results We propose and evaluate two methods for analysis of variance when both input and output sets are high-dimensional. Our approach uses random effects models to estimate the proportion of variance of vectors in the linear span of the output set that can be explained by regression on the input set. We consider a method based on an orthogonal basis (Eigen-ANOVA) and one that uses random vectors (Monte Carlo ANOVA, MC-ANOVA) in the linear span of the output set. We used simulations to assess the bias and variance of each of the methods, and to compare them with those of Partial Least Squares (PLS), an approach commonly used in multivariate high-dimensional regressions. The MC-ANOVA method gave nearly unbiased estimates in all the simulation scenarios considered. Estimates produced by Eigen-ANOVA and PLS had noticeable biases. Finally, we demonstrate insight that can be obtained with the use of MC-ANOVA and Eigen-ANOVA by applying these two methods to the study of multi-locus linkage disequilibrium in chicken genomes and to the assessment of inter-dependencies between gene expression, methylation and copy-number-variants in data from breast cancer tumors. Availability The Supplementary data include an R-implementation of each of the proposed methods as well as the scripts used in simulations and in the real-data [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


ANOVA-HD: Analysis of variance when both input and output layers are high-dimensional

PLoS ONE ◽

10.1371/journal.pone.0243251 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0243251

Author(s):

Gustavo de los Campos ◽

Torsten Pook ◽

Agustin Gonzalez-Reymundez ◽

Henner Simianer ◽

George Mias ◽

...

Keyword(s):

Breast Cancer ◽

Gene Expression ◽

Copy Number ◽

Homo Sapiens ◽

Linear Span ◽

Copy Number Variants ◽

High Dimensional ◽

Data Set ◽

Cancer Data ◽

Data Layers

Modern genomic data sets often involve multiple data-layers (e.g., DNA-sequence, gene expression), each of which itself can be high-dimensional. The biological processes underlying these data-layers can lead to intricate multivariate association patterns. We propose and evaluate two methods to determine the proportion of variance of an output data set that can be explained by an input data set when both data panels are high dimensional. Our approach uses random-effects models to estimate the proportion of variance of vectors in the linear span of the output set that can be explained by regression on the input set. We consider a method based on an orthogonal basis (Eigen-ANOVA) and one that uses random vectors (Monte Carlo ANOVA, MC-ANOVA) in the linear span of the output set. Using simulations, we show that the MC-ANOVA method gave nearly unbiased estimates. Estimates produced by Eigen-ANOVA were also nearly unbiased, except when the shared variance was very high (e.g., >0.9). We demonstrate the potential insight that can be obtained from the use of MC-ANOVA and Eigen-ANOVA by applying these two methods to the study of multi-locus linkage disequilibrium in chicken (Gallus gallus) genomes and to the assessment of inter-dependencies between gene expression, methylation, and copy-number-variants in data from breast cancer tumors from humans (Homo sapiens). Our analyses reveal that in chicken breeding populations ~50,000 evenly-spaced SNPs are enough to fully capture the span of whole-genome-sequencing genomes. In the study of multi-omic breast cancer data, we found that the span of copy-number-variants can be fully explained using either methylation or gene expression data and that roughly 74% of the variance in gene expression can be predicted from methylation data.
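The Monte Carlo idea in the abstract above (draw random vectors in the linear span of the output set, then ask how much of each vector's variance the input set explains) can be illustrated with a minimal Python sketch. This is not the authors' R implementation. In particular, it replaces the random-effects estimator described in the abstract with ordinary least squares and adjusted R², a stand-in that is only reasonable when the number of samples is much larger than the number of input columns. The function name `mc_anova_sketch` and all parameters are hypothetical.

```python
import numpy as np


def mc_anova_sketch(X, Y, n_vectors=200, seed=0):
    """Average adjusted R^2 of random vectors in span(Y) regressed on X.

    ASSUMPTION: OLS with adjusted R^2 stands in for the random-effects
    model of the paper, so this needs n > p + 1 (low-dimensional X).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if n <= p + 1:
        raise ValueError("this OLS stand-in needs n > p + 1")
    # Center both panels so no intercept column is needed.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Each column of Z is a random linear combination of the output columns,
    # i.e. a random vector in the linear span of the output set.
    W = rng.standard_normal((Y.shape[1], n_vectors))
    Z = Yc @ W
    # Fit all random vectors at once by least squares.
    coef, *_ = np.linalg.lstsq(Xc, Z, rcond=None)
    resid = Z - Xc @ coef
    ss_res = (resid ** 2).sum(axis=0)
    ss_tot = (Z ** 2).sum(axis=0)
    # Adjusted R^2 per random vector, then summarize across vectors.
    r2_adj = 1.0 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))
    return float(r2_adj.mean()), float(r2_adj.std())


# Toy data (made up for illustration): signal and noise contribute
# equal variance on average.
rng = np.random.default_rng(1)
n, p, q = 2000, 50, 20
X = rng.standard_normal((n, p))
B = rng.standard_normal((p, q)) / np.sqrt(p)
Y = X @ B + rng.standard_normal((n, q))
mean_r2, sd_r2 = mc_anova_sketch(X, Y)
```

With this toy construction the mean should land in the neighborhood of 0.5, and the spread across random vectors gives a rough sense of how much the explained proportion varies within the span of the output set.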


CyTOFmerge: integrating mass cytometry data across multiple panels

Bioinformatics ◽

10.1093/bioinformatics/btz180 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4063-4071 ◽

Cited By ~ 3

Author(s):

Tamim Abdelaal ◽

Thomas Höllt ◽

Vincent van Unen ◽

Boudewijn P F Lelieveldt ◽

Frits Koning ◽

...

Keyword(s):

Single Cell ◽

Biological Sample ◽

Supplementary Information ◽

High Dimensional ◽

Single Cell Level ◽

Supplementary Data ◽

Mass Cytometry ◽

Cell Level ◽

Cellular Markers

Abstract Motivation High-dimensional mass cytometry (CyTOF) allows the simultaneous measurement of multiple cellular markers at single-cell level, providing a comprehensive view of cell compositions. However, the power of CyTOF to explore the full heterogeneity of a biological sample at the single-cell level is currently limited by the number of markers measured simultaneously on a single panel. Results To extend the number of markers per cell, we propose an in silico method to integrate CyTOF datasets measured using multiple panels that share a set of markers. Additionally, we present an approach to select the most informative markers from an existing CyTOF dataset to be used as a shared marker set between panels. We demonstrate the feasibility of our methods by evaluating the quality of clustering and neighborhood preservation of the integrated dataset, on two public CyTOF datasets. We illustrate that by computationally extending the number of markers we can further untangle the heterogeneity of mass cytometry data, including rare cell-population detection. Availability and implementation Implementation is available on GitHub (https://github.com/tabdelaal/CyTOFmerge). Supplementary information Supplementary data are available at Bioinformatics online.


miqoGraph: fitting admixture graphs using mixed-integer quadratic optimization

Bioinformatics ◽

10.1093/bioinformatics/btaa988 ◽

2020 ◽

Author(s):

Julia Yan ◽

Nick Patterson ◽

Vagheesh M Narasimhan

Keyword(s):

Genetic Relationship ◽

Real Data ◽

Quadratic Optimization ◽

Supplementary Information ◽

Mixed Integer ◽

Supplementary Data ◽

Integer Optimization ◽

Speed Up

Abstract Summary Admixture graphs represent the genetic relationship between a set of populations through splits, drift and admixture. In this article, we present the Julia package miqoGraph, which uses mixed-integer quadratic optimization to fit topology, drift lengths and admixture proportions simultaneously. Through applications of miqoGraph to both simulated and real data, we show that integer optimization can greatly speed up and automate what is usually an arduous manual process. Availability and implementation https://github.com/juliayyan/PhylogeneticTrees.jl. Supplementary information Supplementary data are available at Bioinformatics online.


LipidFinder 2.0: advanced informatics pipeline for lipidomics discovery applications

Bioinformatics ◽

10.1093/bioinformatics/btaa856 ◽

2020 ◽

Author(s):

Jorge Alvarez-Jarreta ◽

Patricia R S Rodrigues ◽

Eoin Fahy ◽

Anne O’Connor ◽

Anna Price ◽

...

Keyword(s):

Real Data ◽

Supplementary Information ◽

Supplementary Data ◽

Scatter Plot ◽

Lipid Profiling ◽

False Discovery ◽

False Discovery Rate Method ◽

Rate Method ◽

Assess Data Quality ◽

Lipid Structures

Abstract Summary We present LipidFinder 2.0, incorporating four new modules that apply artefact filters, remove lipid and contaminant stacks, in-source fragments and salt clusters, and a new isotope deletion method which is significantly more sensitive than available open-access alternatives. We also incorporate a novel false discovery rate method, utilizing a target–decoy strategy, which allows users to assess data quality. A renewed lipid profiling method is introduced which searches three different databases from LIPID MAPS and returns bulk lipid structures only, and a lipid category scatter plot with a color-blind-friendly palette. An API interface with XCMS Online is made available on LipidFinder’s online version. We show using real data that LipidFinder 2.0 provides a significant improvement over non-lipid metabolite filtering and lipid profiling, compared to available tools. Availability and implementation LipidFinder 2.0 is freely available at https://github.com/ODonnell-Lipidomics/LipidFinder and http://lipidmaps.org/resources/tools/lipidfinder. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


APAlyzer: a bioinformatics package for analysis of alternative polyadenylation isoforms

Bioinformatics ◽

10.1093/bioinformatics/btaa266 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3907-3909 ◽

Cited By ~ 3

Author(s):

Ruijia Wang ◽

Bin Tian

Keyword(s):

Gene Expression ◽

Alternative Polyadenylation ◽

Supplementary Information ◽

Human Tissues ◽

Bioconductor Package ◽

Supplementary Data ◽

Rna Seq ◽

Eukaryotic Genes ◽

Polyadenylation Sites

Abstract Summary Most eukaryotic genes produce alternative polyadenylation (APA) isoforms. APA is dynamically regulated under different growth and differentiation conditions. Here, we present a bioinformatics package, named APAlyzer, for examining 3′UTR APA, intronic APA and gene expression changes using RNA-seq data and annotated polyadenylation sites in the PolyA_DB database. Using APAlyzer and data from the GTEx database, we present APA profiles across human tissues. Availability and implementation APAlyzer is freely available at https://bioconductor.org/packages/release/bioc/html/APAlyzer.html as an R/Bioconductor package. Supplementary information Supplementary data are available at Bioinformatics online.


Triplet-based similarity score for fully multilabeled trees with poly-occurring labels

Bioinformatics ◽

10.1093/bioinformatics/btaa676 ◽

2020 ◽

Author(s):

Simone Ciccolella ◽

Giulia Bernardini ◽

Luca Denti ◽

Paola Bonizzoni ◽

Marco Previtali ◽

...

Keyword(s):

Open Source ◽

Evolutionary History ◽

Similarity Measures ◽

Real Data ◽

Similarity Score ◽

Supplementary Information ◽

Supplementary Data ◽

Wide Range ◽

Golden Standard ◽

History Of

Abstract Motivation The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarities have been proposed in the literature, but none of them has emerged as the golden standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases. Results To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to classify correctly similar and dissimilar trees, both on simulated and on real data. Availability and implementation An open source implementation of MP3 is publicly available at https://github.com/AlgoLab/mp3treesim. Supplementary information Supplementary data are available at Bioinformatics online.


CYBERTRACK2.0: zero-inflated model-based cell clustering and population tracking method for longitudinal mass cytometry data

Bioinformatics ◽

10.1093/bioinformatics/btaa873 ◽

2020 ◽

Author(s):

Kodai Minoura ◽

Ko Abe ◽

Yuka Maeda ◽

Hiroyoshi Nishikawa ◽

Teppei Shimamura

Keyword(s):

Real Data ◽

Change Points ◽

Supplementary Information ◽

High Dimensional ◽

Mass Cytometry ◽

Cell Population Dynamics ◽

Tracking Method ◽

Statistical Framework ◽

Track Dynamics ◽

Cell Clustering

Abstract Summary Recent advancements in high-dimensional single-cell technologies, such as mass cytometry, enable longitudinal experiments to track dynamics of cell populations and identify change points where the proportions vary significantly. However, current research is limited by the lack of tools specialized for analyzing longitudinal mass cytometry data. In order to infer cell population dynamics from such data, we developed a statistical framework named CYBERTRACK2.0. The framework’s analytic performance was validated against synthetic and real data, showing that its results are consistent with previous research. Availability and implementation CYBERTRACK2.0 is available at https://github.com/kodaim1115/CYBERTRACK2. Supplementary information Supplementary data are available at Bioinformatics online.


multiclassPairs: An R package to train multiclass pair-based classifier

Bioinformatics ◽

10.1093/bioinformatics/btab088 ◽

2021 ◽

Author(s):

Nour-al-dain Marzouka ◽

Pontus Eriksson

Keyword(s):

Gene Expression ◽

Prediction Models ◽

R Package ◽

Supplementary Information ◽

Tumor Subtype ◽

Test Results ◽

Supplementary Data ◽

Classification Problems ◽

Excellent Performance ◽

Class Prediction

Abstract Motivation k–Top Scoring Pairs (kTSP) algorithms utilize in-sample gene expression feature pair rules for class prediction, and have demonstrated excellent performance and robustness. The available packages and tools primarily focus on binary prediction (i.e. two classes). However, many real-world classification problems, e.g. tumor subtype prediction, are multiclass tasks. Results Here, we present multiclassPairs, an R package to train pair-based single sample classifiers for multiclass problems. multiclassPairs offers two main methods to build multiclass prediction models, either using a one-vs-rest kTSP scheme or through a novel pair-based Random Forest approach. The package also provides options for dealing with class imbalances, multiplatform training, missing features in test data, and visualization of training and test results. Availability ‘multiclassPairs’ package is available on CRAN servers and GitHub: https://github.com/NourMarzouka/multiclassPairs Supplementary information Supplementary data are available at Bioinformatics online.
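To make the pair-rule idea concrete, the following is a minimal Python sketch of a one-vs-rest top-scoring-pairs classifier. It is not the multiclassPairs API (that package is written in R), and it simplifies the usual kTSP procedure: pairs are ranked only by the difference in how often "feature i < feature j" holds in the target class versus the rest, and the selected pairs are not forced to be disjoint. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def pair_scores(samples, labels, positive):
    """Signed score per feature pair: P(x[i] < x[j] | positive) minus
    P(x[i] < x[j] | rest)."""
    pos = [s for s, lab in zip(samples, labels) if lab == positive]
    rest = [s for s, lab in zip(samples, labels) if lab != positive]
    scores = {}
    for i, j in combinations(range(len(samples[0])), 2):
        p_pos = sum(s[i] < s[j] for s in pos) / len(pos)
        p_rest = sum(s[i] < s[j] for s in rest) / len(rest)
        scores[(i, j)] = p_pos - p_rest
    return scores


def top_k_rules(samples, labels, positive, k=2):
    """Keep the k pairs with the largest absolute score."""
    scores = pair_scores(samples, labels, positive)
    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]


def vote_fraction(sample, rules):
    """Fraction of pair rules that vote for the positive class."""
    votes = 0
    for (i, j), score in rules:
        less = sample[i] < sample[j]
        # A positive score means 'x[i] < x[j]' is typical of the class.
        votes += (less if score > 0 else (not less))
    return votes / len(rules)


def predict(sample, rules_by_class):
    """One-vs-rest: pick the class whose rules collect the most votes."""
    return max(rules_by_class,
               key=lambda c: vote_fraction(sample, rules_by_class[c]))


# Toy data (made up): class A has feature 0 highest, B feature 1, C feature 2.
samples = [[9, 1, 2], [8, 2, 1], [1, 9, 2], [2, 8, 1], [1, 2, 9], [2, 1, 8]]
labels = ["A", "A", "B", "B", "C", "C"]
rules = {c: top_k_rules(samples, labels, c, k=2) for c in ("A", "B", "C")}
```

Because each rule depends only on the ordering of two features within one sample, a prediction such as `predict([7, 3, 1], rules)` needs no normalization against other samples, which is what makes pair-based rules usable as single sample classifiers.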


ShinySOM: graphical SOM-based analysis of single-cell cytometry data

Bioinformatics ◽

10.1093/bioinformatics/btaa091 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3288-3289

Author(s):

Miroslav Kratochvíl ◽

David Bednárek ◽

Tomáš Sieger ◽

Karel Fišer ◽

Jiří Vondrášek

Keyword(s):

Single Cell ◽

High Throughput ◽

Statistical Information ◽

Supplementary Information ◽

High Dimensional ◽

Supplementary Data ◽

Mass Cytometry ◽

High Throughput Analysis ◽

Self Organizing Maps ◽

User Friendly

Abstract Summary ShinySOM offers a user-friendly interface for reproducible, high-throughput analysis of high-dimensional flow and mass cytometry data guided by self-organizing maps. The software implements a FlowSOM-style workflow, with improvements in performance, visualizations and data dissection possibilities. The outputs of the analysis include precise statistical information about the dissected samples, and R-compatible metadata useful for the batch processing of large sample volumes. Availability and implementation ShinySOM is free and open-source, available online at gitlab.com/exaexa/ShinySOM. Supplementary information Supplementary data are available at Bioinformatics online.


Differential Expression Gene Explorer (DrEdGE): a tool for generating interactive online visualizations of gene expression datasets

Bioinformatics ◽

10.1093/bioinformatics/btz972 ◽

2020 ◽

Vol 36 (8) ◽

pp. 2581-2583 ◽

Cited By ~ 2

Author(s):

Sophia C Tintori ◽

Patrick Golden ◽

Bob Goldstein

Keyword(s):

Gene Expression ◽

Caenorhabditis Elegans ◽

Differential Expression ◽

Supplementary Information ◽

Supplementary Data ◽

Online Data ◽

Web Based ◽

Neuronal Tissue ◽

Differential Expression Gene ◽

Data Visualizations

Abstract Summary Differential Expression Gene Explorer (DrEdGE) is a web-based tool that guides genomicists through easily creating interactive online data visualizations, which colleagues can query according to their own conditions to discover genes, samples or patterns of interest. We demonstrate DrEdGE’s features with three example websites generated from publicly available datasets—human neuronal tissue, mouse embryonic tissue and Caenorhabditis elegans whole embryos. DrEdGE increases the utility of large genomics datasets by removing technical obstacles to independent exploration. Availability and implementation Freely available at http://dredge.bio.unc.edu. Supplementary information Supplementary data are available at Bioinformatics online.
