U-CIE [/juː ’siː/]: Color encoding of high-dimensional data

Abstract U-CIE is a visualization method that encodes arbitrary high-dimensional data as colors using a combination of dimensionality reduction and the CIELAB color space to retain the original structure to the extent possible. We illustrate its broad applicability by visualizing single-cell data on a protein network and metagenomic data on a world map and on scatter plots. U-CIE is available as a web resource at https://u-cie.jensenlab.org/ and as an R package.
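The idea of encoding reduced coordinates as CIELAB colors can be sketched as follows. This is a minimal illustration and not the U-CIE implementation: it assumes the data have already been reduced to three dimensions, and the function names `lab_to_srgb` and `encode_points` and the target box (L* in 30 to 90, a* and b* in -40 to 40) are hypothetical choices made for this example. A single scale factor is applied to all three axes so that relative distances between points are preserved in the color space.

```python
# D65 reference white (CIE 1931 2-degree observer), used by the CIELAB definition
WHITE = (0.95047, 1.0, 1.08883)


def lab_to_srgb(L, a, b):
    """Convert one CIELAB colour (D65) to 8-bit sRGB, clipping out-of-gamut values."""
    fy = (L + 16.0) / 116.0
    fx = fy + a / 500.0
    fz = fy - b / 200.0

    def f_inv(t):
        # Inverse of the CIELAB companding function
        delta = 6.0 / 29.0
        return t ** 3 if t > delta else 3 * delta ** 2 * (t - 4.0 / 29.0)

    x, y, z = (w * f_inv(t) for w, t in zip(WHITE, (fx, fy, fz)))

    # XYZ -> linear sRGB (standard sRGB matrix)
    r = 3.2406 * x - 1.5372 * y - 0.4986 * z
    g = -0.9689 * x + 1.8758 * y + 0.0415 * z
    bl = 0.0557 * x - 0.2040 * y + 1.0570 * z

    def encode(c):
        # Clip to the displayable range, then apply the sRGB transfer function
        c = min(max(c, 0.0), 1.0)
        c = 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055
        return round(255 * c)

    return encode(r), encode(g), encode(bl)


def encode_points(points, l_range=(30.0, 90.0), ab_range=(-40.0, 40.0)):
    """Map 3-D points into a CIELAB box with one uniform scale and return sRGB triples.

    The box limits are assumptions for this sketch, not values taken from U-CIE.
    """
    targets = (l_range, ab_range, ab_range)
    axes = list(zip(*points))
    centres = [(min(ax) + max(ax)) / 2.0 for ax in axes]
    half_widths = [(max(ax) - min(ax)) / 2.0 for ax in axes]
    # Largest uniform scale that keeps every axis inside its target interval
    scale = min(
        (((t1 - t0) / 2.0) / h for (t0, t1), h in zip(targets, half_widths) if h > 0),
        default=1.0,
    )
    colours = []
    for p in points:
        lab = [
            (t0 + t1) / 2.0 + scale * (v - c)
            for v, c, (t0, t1) in zip(p, centres, targets)
        ]
        colours.append(lab_to_srgb(*lab))
    return colours
```

Calling `encode_points([(0, 0, 0), (1, 1, 1)])` returns one sRGB triple per input point. Colors that fall outside the sRGB gamut are clipped here, which distorts them slightly; a more careful method would keep the point cloud inside the displayable region instead.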


Cluster Weighted Model Based on TSNE Algorithm for High-Dimensional Data

10.21203/rs.3.rs-347795/v1 ◽

2021 ◽

Author(s):

Kehinde Olobatuyi

Keyword(s):

Mixture Models ◽

Dimensional Space ◽

High Dimensional Data ◽

Expectation Maximization Algorithm ◽

Real Data ◽

R Package ◽

High Dimensional ◽

Data Sets ◽

Dimensionality Reduction Technique ◽

Weighted Model

Abstract Similar to many machine learning models, both the accuracy and the speed of cluster weighted models (CWMs) can be hampered by high-dimensional data, which has led to previous work on parsimonious techniques to reduce the effect of the “curse of dimensionality” on mixture models. In this work, we review the background of CWMs. We further show that the parsimonious technique is not sufficient for mixture models to thrive in the presence of very high-dimensional data. We discuss a heuristic for detecting the hidden components by choosing the initial values of the location parameters using the default values in the “FlexCWM” R package. We introduce a dimensionality reduction technique called t-distributed stochastic neighbor embedding (TSNE) to enhance the parsimonious CWMs in high-dimensional space. CWMs were originally suited for regression; for classification purposes, all multi-class variables are transformed logarithmically with some noise. The parameters of the model are obtained via the expectation maximization algorithm. The effectiveness of the discussed technique is demonstrated using real data sets from different fields.


Assessing the shared variation among high-dimensional data matrices: a modified version of the Procrustean correlation coefficient

10.1101/842070 ◽

2019 ◽

Author(s):

E. Coissac ◽

C. Gonindard-Melodelima

Keyword(s):

Correlation Coefficient ◽

Partial Correlation ◽

High Dimensional Data ◽

Correlation Coefficients ◽

R Package ◽

High Dimensional ◽

Data Sets ◽

High Dimension Data ◽

Correct Estimation ◽

Shared Variation

Abstract Motivation: Molecular biology and ecology studies can produce high-dimensional data. Estimating correlations and shared variation between such data sets is an important step in disentangling the relationships between different elements of a biological system. Unfortunately, classical approaches are susceptible to producing falsely inferred correlations. Results: Here we propose a corrected version of the Procrustean correlation coefficient that is robust to high-dimensional data. This allows for a correct estimation of the shared variation between two data sets and the partial correlation coefficients between a set of matrix data. Availability: The proposed corrected coefficients are implemented in the ProcMod R package available on CRAN. The git repository is hosted at https://git.metabarcoding.org/lecasofts/[email protected]


Projected t-SNE for batch correction

Bioinformatics ◽

10.1093/bioinformatics/btaa189 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3522-3527 ◽

Cited By ~ 3

Author(s):

Emanuele Aliverti ◽

Jeffrey L Tilson ◽

Dayne L Filer ◽

Benjamin Babcock ◽

Alejandro Colaneri ◽

...

Keyword(s):

Single Cell ◽

High Dimensional Data ◽

Cell Types ◽

R Package ◽

High Dimensional ◽

Batch Effects ◽

Batch Correction ◽

Fundamental Information ◽

Cell Gene Expression ◽

Low Dimensional

Abstract Motivation: Low-dimensional representations of high-dimensional data are routinely employed in biomedical research to visualize, interpret and communicate results from different pipelines. In this article, we propose a novel procedure to directly estimate t-SNE embeddings that are not driven by batch effects. Without correction, interesting structure in the data can be obscured by batch effects. The proposed algorithm can therefore significantly aid visualization of high-dimensional data. Results: The proposed methods are based on linear algebra and constrained optimization, leading to efficient algorithms and fast computation in many high-dimensional settings. Results on artificial single-cell transcription profiling data show that the proposed procedure successfully removes multiple batch effects from t-SNE embeddings, while retaining fundamental information on cell types. When applied to single-cell gene expression data to investigate mouse medulloblastoma, the proposed method successfully removes batch effects related to mouse identifiers and the date of the experiment, while preserving clusters of oligodendrocytes, astrocytes, and endothelial cells and microglia, which are expected to lie in the stroma within or adjacent to the tumours. Availability and implementation: Source code implementing the proposed approach is available as an R package at https://github.com/emanuelealiverti/BC_tSNE, including a tutorial to reproduce the simulation studies. Contact: [email protected]


dGAselID: An R Package for Selecting a Variable Number of Features in High Dimensional Data

The R Journal ◽

10.32614/rj-2017-040 ◽

2017 ◽

Vol 9 (2) ◽

pp. 18

Author(s):

Nicolae Teodor Melita ◽

Stefan Holban

Keyword(s):

High Dimensional Data ◽

R Package ◽

Variable Number ◽

High Dimensional


DataViz: visualization of high-dimensional data in virtual reality

F1000Research ◽

10.12688/f1000research.16453.1 ◽

2018 ◽

Vol 7 ◽

pp. 1687

Author(s):

Eric Feng ◽

Xijin Ge

Keyword(s):

Virtual Reality ◽

High Dimensional Data ◽

Principal Component ◽

High Dimensional ◽

Use Case ◽

Rna Seq ◽

Development Environment ◽

3 Dimensional ◽

Scatter Plots ◽

Complex Datasets

Virtual reality (VR) simulations promote interactivity and immersion, and provide an opportunity that may help researchers gain insights from complex datasets. To explore the utility and potential of VR in graphically rendering large datasets, we have developed an application for immersive, 3-dimensional (3D) scatter plots. Developed using the Unity development environment, DataViz enables the visualization of high-dimensional data with the HTC Vive, a relatively inexpensive and modern virtual reality headset available to the general public. DataViz has the following features: (1) principal component analysis (PCA) of the dataset; (2) graphical rendering of said dataset’s 3D projection onto its first three principal components; and (3) intuitive controls and instructions for using the application. As a use case, we applied DataViz to visualize a single-cell RNA-Seq dataset. DataViz can help gain insights from complex datasets by enabling interaction with high-dimensional data.
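The projection step named in features (1) and (2) can be illustrated with a short sketch. DataViz itself is a Unity application, so this is not its code: the sketch assumes NumPy is available, and the function name `project_to_3d` is hypothetical. It computes PCA through a singular value decomposition of the centered data and returns the coordinates on the first three principal components, which are what a 3D scatter plot would display.

```python
import numpy as np


def project_to_3d(data):
    """Project rows of `data` (samples x features) onto the first three principal components.

    Returns the 3-D coordinates and the fraction of total variance explained by each
    of the three components. Assumes at least three samples and three features.
    """
    X = np.asarray(data, dtype=float)
    Xc = X - X.mean(axis=0)  # centre each feature
    # SVD of the centred data: rows of Vt are the principal axes,
    # ordered by decreasing singular value (and hence decreasing variance)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:3].T
    explained = s[:3] ** 2 / np.sum(s ** 2)
    return coords, explained


# Illustrative usage on random data standing in for an expression matrix
rng = np.random.default_rng(0)
example = rng.normal(size=(50, 10))
coords, explained = project_to_3d(example)
```

Here `coords` has one row per sample and three columns, and `explained` indicates how much of the total variance the 3D view retains. For single-cell RNA-Seq data the retained fraction is often small, which is worth checking before interpreting the plot.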


stepwiseCM: An R Package for Stepwise Classification of Cancer Samples Using Multiple Heterogeneous Data Sets

Cancer Informatics ◽

10.4137/cin.s13075 ◽

2014 ◽

Vol 13 ◽

pp. CIN.S13075

Author(s):

Askar Obulkasim ◽

Mark A van de Wiel

Keyword(s):

Waiting Times ◽

High Dimensional Data ◽

R Package ◽

Heterogeneous Data ◽

The Other ◽

High Dimensional ◽

Data Sets ◽

Classification Problems ◽

Data Types ◽

Crucial Difference

This paper presents the R/Bioconductor package stepwiseCM, which classifies cancer samples using two heterogeneous data sets in an efficient way. The algorithm is able to capture the distinct classification power of two given data types without actually combining them. The package is suited to classification problems where two different types of data sets are available on the same samples. One of these data types has measurements on all samples and the other has measurements on only some samples. The former is easy to collect and/or relatively cheap (eg, clinical covariates) compared to the latter (high-dimensional data, eg, gene expression). An additional application for which stepwiseCM has proven useful is the combination of two high-dimensional data types, eg, DNA copy number and mRNA expression. The package includes functions to project the neighborhood information in one data space to the other to determine a potential group of samples that are likely to benefit most from measuring the second type of covariates. The two heterogeneous data spaces are connected by indirect mapping. The crucial difference between the stepwise classification strategy implemented in this package and existing packages is that our approach aims to be cost-efficient by avoiding measuring additional covariates, which might be expensive or patient-unfriendly, for a potentially large subgroup of individuals. Moreover, diagnostic test results for these individuals would be quickly available, which may reduce waiting times and hence lower the patients’ distress. The improvement described remedies the key limitations of existing packages, and facilitates the use of the stepwiseCM package in diverse applications.


RRPP: An r package for fitting linear models to high‐dimensional data using residual randomization

Methods in Ecology and Evolution ◽

10.1111/2041-210x.13029 ◽

2018 ◽

Vol 9 (7) ◽

pp. 1772-1779 ◽

Cited By ~ 78

Author(s):

Michael L. Collyer ◽

Dean C. Adams

Keyword(s):

Linear Models ◽

High Dimensional Data ◽

R Package ◽

High Dimensional


A primer on high-dimensional data analysis workflows for studying visual cortex development and plasticity

10.1101/554378 ◽

2019 ◽

Cited By ~ 1

Author(s):

Justin L. Balsor ◽

David G. Jones ◽

Kathryn M. Murphy

Keyword(s):

Big Data ◽

Visual Cortex ◽

Clustering Algorithms ◽

High Dimensional Data ◽

R Package ◽

High Dimensional ◽

Data Sets ◽

Data Set ◽

Dimensional Changes ◽

Or Genes

Abstract New techniques for quantifying large numbers of proteins or genes are transforming the study of plasticity mechanisms in visual cortex (V1) into the era of big data. With those changes comes the challenge of applying new analytical methods designed for high-dimensional data. Studies of V1, however, can take advantage of the known functions that many proteins have in regulating experience-dependent plasticity to facilitate linking big data analyses with neurobiological functions. Here we discuss two workflows and provide example R code for analyzing high-dimensional changes in a group of proteins (or genes) using two data sets. The first data set includes 7 neural proteins, 9 visual conditions, and 3 regions in V1 from an animal model for amblyopia. The second data set includes 23 neural proteins and 31 ages (20d-80yrs) from human post-mortem samples of V1. Each data set presents different challenges and we describe using PCA, tSNE, and various clustering algorithms including sparse high-dimensional clustering. Also, we describe a new approach for identifying high-dimensional features and using them to construct a plasticity phenotype that identifies neurobiological differences among clusters. We include an R package “v1hdexplorer” that aggregates the various coding packages and custom visualization scripts written in R Studio.
