Assessing the shared variation among high-dimensional data matrices: a modified version of the Procrustean correlation coefficient

2019 ◽  
Author(s):  
E. Coissac ◽  
C. Gonindard-Melodelima

Abstract
Motivation: Molecular biology and ecology studies can produce high-dimensional data. Estimating correlations and the shared variation between such data sets is an important step in disentangling the relationships between different elements of a biological system. Unfortunately, classical approaches are susceptible to producing falsely inferred correlations.
Results: Here we propose a corrected version of the Procrustean correlation coefficient that is robust to high-dimensional data. This allows for a correct estimation of the shared variation between two data sets and of the partial correlation coefficients between a set of data matrices.
Availability: The proposed corrected coefficients are implemented in the ProcMod R package, available on CRAN. The git repository is hosted at https://git.metabarcoding.org/lecasofts/. Contact: [email protected]
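For readers unfamiliar with the quantity being corrected, the classical (uncorrected) Procrustean correlation between two matrices can be computed from the singular values of their cross-covariance. The sketch below is a minimal NumPy illustration of that classical coefficient, not the ProcMod implementation, and it omits the paper's correction:

```python
import numpy as np

def procrustes_corr(X, Y):
    """Classical Procrustean correlation between two data matrices with the
    same number of rows (samples). This is the uncorrected coefficient; the
    paper's ProcMod package applies a correction on top of it."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Singular values of the cross-product give the best-rotation fit
    s = np.linalg.svd(X.T @ Y, compute_uv=False)
    return s.sum() / np.sqrt((X * X).sum() * (Y * Y).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Y = X @ rng.normal(size=(5, 4)) + 0.1 * rng.normal(size=(30, 4))
r = procrustes_corr(X, Y)  # close to 1 for strongly related matrices
```

The paper's point is precisely that this uncorrected coefficient is biased upward for independent high-dimensional matrices, which is what the correction addresses.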

2021 ◽  
Author(s):  
Kehinde Olobatuyi

Abstract Like many machine learning models, cluster-weighted models (CWMs) can lose both accuracy and speed on high-dimensional data, which has motivated earlier work on parsimonious techniques for reducing the effect of the "curse of dimensionality" on mixture models. In this work, we review the background of cluster-weighted models (CWMs). We further show that parsimonious techniques alone are not sufficient for mixture models to thrive in the presence of very high-dimensional data. We discuss a heuristic for detecting the hidden components by choosing the initial values of the location parameters using the defaults in the "FlexCWM" R package. We introduce a dimensionality-reduction technique, t-distributed stochastic neighbor embedding (t-SNE), to enhance parsimonious CWMs in high-dimensional space. CWMs were originally designed for regression, so for classification purposes all multi-class variables are transformed logarithmically with some noise. The parameters of the model are obtained via the expectation-maximization (EM) algorithm. The effectiveness of the discussed technique is demonstrated on real data sets from different fields.
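The embed-then-cluster idea can be sketched with scikit-learn's TSNE and GaussianMixture as stand-ins for the FlexCWM workflow. The data, shapes, and parameters below are illustrative assumptions, and a full CWM would also model a response variable, which this sketch omits:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two well-separated groups in 50 dimensions (stand-in for real data)
X = np.vstack([rng.normal(0, 1, size=(40, 50)),
               rng.normal(4, 1, size=(40, 50))])

# Step 1: embed into 2-D with t-SNE to tame the curse of dimensionality
emb = TSNE(n_components=2, perplexity=20, random_state=1).fit_transform(X)

# Step 2: fit a 2-component Gaussian mixture (EM) in the embedded space
labels = GaussianMixture(n_components=2, random_state=1).fit_predict(emb)
```

Fitting the mixture in the 2-D embedding rather than the 50-D raw space is the dimensionality-reduction step the abstract describes.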


2014 ◽  
Vol 13 ◽  
pp. CIN.S13075
Author(s):  
Askar Obulkasim ◽  
Mark A van de Wiel

This paper presents the R/Bioconductor package stepwiseCM, which classifies cancer samples efficiently using two heterogeneous data sets. The algorithm captures the distinct classification power of the two data types without actually combining them. The package suits classification problems in which two different types of data are available for the same samples: one type measured on all samples and the other on only some of them, with the first easy to collect and/or relatively cheap (e.g., clinical covariates) compared to the second (high-dimensional data, e.g., gene expression). One additional application for which stepwiseCM has proven useful is the combination of two high-dimensional data types, e.g., DNA copy number and mRNA expression. The package includes functions to project the neighborhood information in one data space onto the other, in order to determine the group of samples likely to benefit most from measuring the second type of covariates. The two heterogeneous data spaces are connected by indirect mapping. The crucial difference between the stepwise classification strategy implemented in this package and existing packages is that our approach aims to be cost-efficient by avoiding the measurement of additional covariates, which might be expensive or patient-unfriendly, for a potentially large subgroup of individuals. Moreover, in diagnosis, test results for these individuals would be available quickly, which may reduce waiting times and hence lower patients' distress. These improvements remedy the key limitations of existing packages and facilitate the use of the stepwiseCM package in diverse applications.
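The cost-saving routing idea can be illustrated with a toy sketch: classify everyone on the cheap covariates, then "measure" the expensive data only for samples whose classification margin is low. Everything here (nearest-centroid classifier, median cutoff, simulated data) is an illustrative assumption, not the stepwiseCM algorithm:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_with_margin(X, classes, centroids):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    order = np.sort(d, axis=1)
    margin = order[:, 1] - order[:, 0]          # confidence proxy
    return classes[d.argmin(axis=1)], margin

rng = np.random.default_rng(2)
y = np.repeat([0, 1], 50)
cheap = rng.normal(y[:, None] * 1.0, 1.5, size=(100, 3))    # weak signal
costly = rng.normal(y[:, None] * 3.0, 1.0, size=(100, 20))  # strong signal

cls, cen_cheap = nearest_centroid_fit(cheap, y)
_, cen_costly = nearest_centroid_fit(costly, y)

pred, margin = predict_with_margin(cheap, cls, cen_cheap)
unsure = margin < np.median(margin)   # route only uncertain samples onward
pred[unsure] = predict_with_margin(costly[unsure], cls, cen_costly)[0]
```

Only half the samples "pay" for the expensive measurement, which is the cost-efficiency the package aims at; this toy version also ignores train/test splitting.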


2019 ◽  
Author(s):  
Justin L. Balsor ◽  
David G. Jones ◽  
Kathryn M. Murphy

Abstract New techniques for quantifying large numbers of proteins or genes are transforming the study of plasticity mechanisms in the visual cortex (V1) into the era of big data. With those changes comes the challenge of applying new analytical methods designed for high-dimensional data. Studies of V1, however, can take advantage of the known functions that many proteins have in regulating experience-dependent plasticity, which helps link big-data analyses with neurobiological functions. Here we discuss two workflows and provide example R code for analyzing high-dimensional changes in a group of proteins (or genes) using two data sets. The first data set includes 7 neural proteins, 9 visual conditions, and 3 regions in V1 from an animal model of amblyopia. The second data set includes 23 neural proteins and 31 ages (20 days to 80 years) from human post-mortem samples of V1. Each data set presents different challenges, and we describe using PCA, t-SNE, and various clustering algorithms, including sparse high-dimensional clustering. We also describe a new approach for identifying high-dimensional features and using them to construct a plasticity phenotype that identifies neurobiological differences among clusters. We include an R package, "v1hdexplorer", that aggregates the various coding packages and custom visualization scripts written in RStudio.
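As a language-neutral illustration of the first workflow step, PCA can be computed via SVD on a samples-by-proteins matrix. The shape below mirrors the first data set (9 conditions x 3 regions = 27 samples by 7 proteins), but the values are random stand-ins and the paper's own examples are in R:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical stand-in: 27 samples (9 conditions x 3 regions) x 7 proteins
X = rng.normal(size=(27, 7))

# Center, then PCA via SVD; rows of Vt are the principal directions
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                   # sample coordinates on the PCs
explained = S**2 / np.sum(S**2)      # fraction of variance per PC
```

Plotting the first two columns of `scores` colored by condition is the usual starting point before moving on to t-SNE and clustering.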


2018 ◽  
Vol 8 (2) ◽  
pp. 377-406
Author(s):  
Almog Lahav ◽  
Ronen Talmon ◽  
Yuval Kluger

Abstract A fundamental question in data analysis, machine learning, and signal processing is how to compare data points. The choice of distance metric is especially challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g., the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates into clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example, where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. We also applied our method to real gene-expression data for lung adenocarcinomas (lung cancer). Using the proposed metric, we found a partition of subjects into risk groups with good separation between their Kaplan-Meier survival plots.
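The recovery property can be checked on a simplified linear version of the setup: hidden low-dimensional points observed through a linear map into a higher-dimensional space, with the Mahalanobis distance computed via a pseudo-inverse covariance. This is an illustrative simplification; the paper builds its covariance estimate from clusters of correlated coordinates and handles nonlinear maps:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hidden 2-D variables, observed through a linear map into 10-D
Z = rng.normal(size=(200, 2))
A = rng.normal(size=(10, 2))
X = Z @ A.T

def mahalanobis_pairs(X):
    """All pairwise Mahalanobis distances, with a pseudo-inverse because
    the observed covariance is low rank."""
    Xc = X - X.mean(axis=0)
    VI = np.linalg.pinv(np.cov(Xc, rowvar=False))
    diff = Xc[:, None, :] - Xc[None, :, :]
    d2 = np.einsum('ijk,kl,ijl->ij', diff, VI, diff)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negative fp noise

d_obs = mahalanobis_pairs(X)                              # observed space
d_hid = np.linalg.norm(Z[:, None] - Z[None, :], axis=2)   # hidden space
```

In this linear case the two distance matrices agree almost exactly, which is the phenomenon the paper extends to nonlinear transformations.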


2020 ◽  
Vol 25 (4) ◽  
pp. 1376-1391
Author(s):  
Liangfu Lu ◽  
Wenbo Wang ◽  
Zhiyuan Tan

Abstract The Parallel Coordinates Plot (PCP) is a popular technique for the exploration of high-dimensional data, and researchers often apply it as an effective method to analyze and mine data. However, as data volumes grow, visual clutter and data clarity become two of the main challenges for parallel coordinates plots. Although the Arc Coordinates Plot (ACP) is a popular approach to addressing these challenges, few optimizations or improvements have been made to it. In this paper, we make three main contributions to state-of-the-art PCP methods: one improves the visual method itself, and the other two improve perceptual scalability when the scale or dimensionality of the data becomes large, as in some mobile and wireless practical applications. 1) We present an improved visualization method based on ACP, termed the double arc coordinates plot (DACP). It not only reduces the visual clutter in ACP, but also uses a dimension-based bundling method with further optimization to deal with the issues of the conventional parallel coordinates plot (PCP). 2) To reduce the clutter caused by the order of the axes and to reveal patterns hidden in the data sets, we propose our first dimensional-reordering method, a contribution-based method in DACP built on the singular value decomposition (SVD) algorithm. The approach computes an importance score for each attribute (dimension) of the data using SVD and lays the dimensions out from left to right in DACP according to that score. 3) Moreover, we propose a similarity-based reordering method based on the combination of a nonlinear correlation coefficient and the SVD algorithm; to measure the correlation between two dimensions and explain how they interact with each other, it uses non-linear correlation information measurements.
We mainly use mutual information to calculate the partial similarity of dimensions in high-dimensional data visualization, and SVD is used to measure the data globally. Lastly, we use five case scenarios to evaluate the effectiveness of DACP; the results show that our approaches not only perform well in visualizing multivariate data sets but also effectively alleviate the visual clutter of the conventional PCP, giving users a better visual experience.
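One plausible reading of the contribution-based reordering in (2) is sketched below. The abstract does not give the exact scoring formula, so the score used here (singular-value-weighted absolute loadings) is an assumption, not the paper's definition:

```python
import numpy as np

def svd_dimension_order(X):
    """Score each dimension by its absolute loading on the singular vectors,
    weighted by singular value, then order axes by decreasing score."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = (S[:, None] * np.abs(Vt)).sum(axis=0)  # importance per dimension
    return np.argsort(scores)[::-1], scores

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6))
X[:, 0] *= 5.0                       # give dimension 0 dominant variance
order, scores = svd_dimension_order(X)
```

Axes would then be drawn left to right in the order returned, so the highest-contribution dimension anchors the leftmost axis.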


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Michele Allegra ◽  
Elena Facco ◽  
Francesco Denti ◽  
Alessandro Laio ◽  
Antonietta Mira

Abstract One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
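The local-ID segmentation builds on nearest-neighbour ID estimators; the global TwoNN estimator of Facco et al., on which this line of work builds, can be sketched in a few lines. The uniform 3-D cube below is a stand-in data set, and the paper's segmentation into regions of different local ID is not shown:

```python
import numpy as np

def twonn_id(X):
    """Global TwoNN intrinsic-dimension estimate: d = N / sum(log(r2/r1)),
    where r1, r2 are each point's first- and second-neighbour distances."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    D.sort(axis=1)
    mu = D[:, 2] / D[:, 1]   # column 0 is each point's zero self-distance
    return len(X) / np.log(mu).sum()

rng = np.random.default_rng(5)
X = rng.uniform(size=(1000, 3))   # intrinsically 3-D data
id_est = twonn_id(X)
```

The estimator depends only on the ratio of the two nearest-neighbour distances, which is what makes it robust enough to apply locally and compare across regions.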


2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Douglas M. Hawkins ◽  
Edgard M. Maboudou-Tchao

Classification and prediction problems using spectral data lead to high-dimensional data sets. Spectral data are, however, different from most other high-dimensional data sets in that information usually varies smoothly with wavelength, suggesting that fitted models should also vary smoothly with wavelength. Functional data analysis, widely used in the analysis of spectral data, meets this objective by changing perspective from the raw spectra to approximations using smooth basis functions. This paper explores linear regression and linear discriminant analysis fitted directly to the spectral data, imposing penalties on the values and roughness of the fitted coefficients, and shows by example that this can lead to better fits than existing standard methodologies.
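The penalty scheme described can be sketched as ridge regression with an added roughness penalty on the second differences of the coefficient vector. This is a minimal sketch with simulated data and arbitrary penalty weights, not the paper's exact estimator or tuning procedure:

```python
import numpy as np

def smooth_ridge(X, y, lam_value=1.0, lam_rough=10.0):
    """Linear regression penalizing both coefficient size and roughness,
    suited to spectra where coefficients should vary smoothly with
    wavelength: minimizes ||y - Xb||^2 + lam_value*||b||^2 + lam_rough*||Db||^2."""
    p = X.shape[1]
    D = np.diff(np.eye(p), n=2, axis=0)   # second-difference operator
    M = X.T @ X + lam_value * np.eye(p) + lam_rough * D.T @ D
    return np.linalg.solve(M, X.T @ y)

rng = np.random.default_rng(6)
wave = np.linspace(0, np.pi, 30)
beta_true = np.sin(wave)                  # smooth "spectral" coefficients
X = rng.normal(size=(60, 30))
y = X @ beta_true + 0.5 * rng.normal(size=60)
beta_hat = smooth_ridge(X, y)
```

Raising `lam_rough` forces neighbouring coefficients (adjacent wavelengths) toward a smooth curve, which is the behaviour the paper argues spectral models should have.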

