The geometry of clinical labs and wellness states from deeply phenotyped humans

Abstract: Longitudinal multi-omics measurements are highly valuable in studying heterogeneity in health and disease phenotypes. For thousands of people, we have collected longitudinal multi-omics data. To analyze, interpret and visualize these extremely high-dimensional data, we use the Pareto Task Inference (ParTI) method. We find that the clinical labs data fall within a tetrahedron. We then use all other data types to characterize the four archetypes. We find that the tetrahedron comprises three wellness states, defining a wellness triangular plane, and one aberrant health state that captures aspects of commonality in movement away from wellness. We reveal the tradeoffs that shape the data and their hierarchy, and use longitudinal data to observe individual trajectories. We then demonstrate how movement on the tetrahedron can be used to detect unexpected trajectories, which might indicate transitions from health to disease and reveal abnormal conditions, even when all individual blood measurements are within the normal range.

Abstract 355: The integrated analysis of multiple, high-dimensional data types by joint matrix approximations of rank-1 with applications to liver cancer and glioblastoma

10.1158/1538-7445.am2014-355 ◽ 2014

Author(s): Gordon S. Okimoto

Keyword(s): Liver Cancer ◽ High Dimensional Data ◽ High Dimensional ◽ Integrated Analysis ◽ Data Types ◽ Matrix Approximations

Joint analysis of multiple high-dimensional data types using sparse matrix approximations of rank-1 with applications to ovarian and liver cancer

BioData Mining ◽ 10.1186/s13040-016-0103-7 ◽ 2016 ◽ Vol 9 (1) ◽ Cited By ~ 3

Author(s): Gordon Okimoto ◽ Ashkan Zeinalzadeh ◽ Tom Wenska ◽ Michael Loomis ◽ James B. Nation ◽ ...

Keyword(s): Liver Cancer ◽ Sparse Matrix ◽ High Dimensional Data ◽ Joint Analysis ◽ High Dimensional ◽ Data Types ◽ Matrix Approximations

Unsupervised Learning for Large Scale Data: The ATHLOS Project

10.1101/2021.04.01.21254751 ◽ 2021

Author(s): Petros Barmpas ◽ Sotiris Tasoulis ◽ Aristidis G. Vrahatis ◽ Panagiotis Anagnostou ◽ Spiros Georgakopoulos ◽ ...

Keyword(s): Unsupervised Learning ◽ Real World ◽ Large Scale ◽ High Dimensional Data ◽ Experimental Studies ◽ Mixed Data ◽ Categorical Variables ◽ High Dimensional ◽ Data Types ◽ Unified Framework

Abstract: Recent technological advancements in various domains, such as biomedicine and health, offer a plethora of big data for analysis. Part of this data pool consists of experimental studies that record many different features for each instance. This creates datasets of very high dimensionality with mixed data types, containing both numerical and categorical variables. Unsupervised learning, meanwhile, has been shown to help with high-dimensional data, allowing the discovery of unknown patterns through clustering, visualization, dimensionality reduction, and in some cases, their combination. This work highlights unsupervised learning methodologies for large-scale, high-dimensional data, providing the potential of a unified framework that combines the knowledge retrieved from clustering and visualization. The main purpose is to uncover hidden patterns in a high-dimensional mixed dataset, which we achieve through our application to a complex, real-world dataset. The experimental analysis indicates the existence of notable information, demonstrating the usefulness of the methodological framework for similar high-dimensional, mixed, real-world applications.

Visual Exploration of Relationships and Structure in Low-Dimensional Embeddings

10.31219/osf.io/ujbrs ◽ 2021

Author(s): Klaus Eckelt ◽ Andreas Hinterreiter ◽ Patrick Adelberger ◽ Conny Walchshofer ◽ Vaishali Dhanoa ◽ ...

Keyword(s): High Dimensional Data ◽ Visual Exploration ◽ High Dimensional ◽ Data Types ◽ Structural Relationships ◽ Or Groups ◽ Analysis Workflow ◽ Visual Approach ◽ Real World Datasets ◽ Low Dimensional

In this work, we propose an interactive visual approach for the exploration of structural relationships in embeddings of high-dimensional data. These structural relationships, such as item sequences, associations of items with groups, and hierarchies between groups of items, are defining properties of many real-world datasets. Nevertheless, most existing methods for the visual exploration of embeddings treat these structures as second-class citizens or do not take them into account at all. In our proposed analysis workflow, users explore enriched scatterplots of the embedding, in which relationships between items and/or groups are visually highlighted. The original high-dimensional data for single items, groups of items, or differences between connected items and groups is accessible through additional summary visualizations. We carefully tailored these summary and difference visualizations to the various data types and semantic contexts. During their exploratory analysis, users can externalize their insights by setting up additional groups and relationships between items and/or groups, thereby creating graphs that represent visual data stories. We demonstrate the utility and potential impact of our approach by means of two use cases and multiple examples from various domains.

Enter the matrix: factorization uncovers knowledge from omics

10.1101/196915 ◽ 2017 ◽ Cited By ~ 2

Author(s): Genevieve L. Stein-O’Brien ◽ Raman Arora ◽ Aedin C. Culhane ◽ Alexander V. Favorov ◽ Lana X. Garmire ◽ ...

Keyword(s): High Throughput ◽ Matrix Factorization ◽ Time Course ◽ High Dimensional Data ◽ Dimensional Structure ◽ High Dimensional ◽ Biological Knowledge ◽ Omics Data ◽ Cellular Interactions ◽ Low Dimensional

Abstract: Omics data contains signal from the molecular, physical, and kinetic inter- and intra-cellular interactions that control biological systems. Matrix factorization techniques can reveal low-dimensional structure from high-dimensional data that reflect these interactions. These techniques can uncover new biological knowledge from diverse high-throughput omics data in topics ranging from pathway discovery to time course analysis. We review exemplary applications of matrix factorization for systems-level analyses. We discuss appropriate application of these methods, their limitations, and focus on analysis of results to facilitate optimal biological interpretation. The inference of biologically relevant features with matrix factorization enables discovery from high-throughput data beyond the limits of current biological knowledge, answering questions from high-dimensional data that we have not yet thought to ask.
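
As a rough illustration of what "low-dimensional structure" means here, the sketch below computes a rank-1 approximation of a small matrix by power iteration. It is not taken from any of the papers listed on this page: the function names `rank1_approx` and `max_residual` and the toy matrix are hypothetical, and practical omics analyses would typically rely on an established numerical library and on richer factorizations (for example PCA, ICA or NMF).

```python
import math


def rank1_approx(X, iters=100):
    """Leading singular triplet (u, sigma, v) of X by power iteration.

    X is a list of rows (samples x features). Illustrative sketch only.
    """
    n_rows, n_cols = len(X), len(X[0])
    # Start from a uniform unit vector over the features.
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    sigma = 0.0
    for _ in range(iters):
        # u <- X v, normalised to unit length
        u = [sum(X[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm_u = math.sqrt(sum(a * a for a in u))
        if norm_u == 0.0:
            raise ValueError("start vector is orthogonal to the data")
        u = [a / norm_u for a in u]
        # w <- X^T u; its length estimates the leading singular value
        w = [sum(X[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        sigma = math.sqrt(sum(a * a for a in w))
        v = [a / sigma for a in w]
    return u, sigma, v


def max_residual(X, u, sigma, v):
    """Largest absolute entry of X - sigma * u v^T."""
    return max(
        abs(X[i][j] - sigma * u[i] * v[j])
        for i in range(len(X))
        for j in range(len(X[0]))
    )


# Toy matrix (made up): every row is a multiple of [2, 1], so rank is 1.
X = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
u, sigma, v = rank1_approx(X)
print(sigma, max_residual(X, u, sigma, v))
```

Here `u` plays the role of a per-sample score and `v` a per-feature loading. For real data the residual is not zero, and further components would be extracted from it.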

Random forests for high-dimensional longitudinal data

Statistical Methods in Medical Research ◽ 10.1177/0962280220946080 ◽ 2020 ◽ pp. 096228022094608

Author(s): Louis Capitaine ◽ Robin Genuer ◽ Rodolphe Thiébaut

Keyword(s): Longitudinal Data ◽ Random Forests ◽ High Dimensional Data ◽ Vaccine Trial ◽ Repeated Measurements ◽ Supervised Machine Learning ◽ Estimation Methods ◽ High Dimensional ◽ HIV Viral Load ◽ Gene Transcripts

Random forests are one of the state-of-the-art supervised machine learning methods and achieve good performance in high-dimensional settings where p, the number of predictors, is much larger than n, the number of observations. Repeated measurements generally provide additional information, so they are worth accounting for, especially when analyzing high-dimensional data. Tree-based methods have already been adapted to clustered and longitudinal data by using a semi-parametric mixed effects model, in which the non-parametric part is estimated using regression trees or random forests. We propose a general approach of random forests for high-dimensional longitudinal data. It includes a flexible stochastic model which allows the covariance structure to vary over time. Furthermore, we introduce a new method which takes intra-individual covariance into consideration to build random forests. Through simulation experiments, we then study the behavior of different estimation methods, especially in the context of high-dimensional data. Finally, the proposed method has been applied to an HIV vaccine trial including 17 HIV-infected patients with 10 repeated measurements of 20,000 gene transcripts and blood concentration of human immunodeficiency virus RNA. The approach selected 21 gene transcripts for which the association with HIV viral load was fully relevant and consistent with results observed during primary infection.

stepwiseCM: An R Package for Stepwise Classification of Cancer Samples Using Multiple Heterogeneous Data Sets

Cancer Informatics ◽ 10.4137/cin.s13075 ◽ 2014 ◽ Vol 13 ◽ pp. CIN.S13075

Author(s): Askar Obulkasim ◽ Mark A van de Wiel

Keyword(s): Waiting Times ◽ High Dimensional Data ◽ R Package ◽ Heterogeneous Data ◽ The Other ◽ High Dimensional ◽ Data Sets ◽ Classification Problems ◽ Data Types ◽ Crucial Difference

This paper presents the R/Bioconductor package stepwiseCM, which classifies cancer samples using two heterogeneous data sets in an efficient way. The algorithm is able to capture the distinct classification power of two given data types without actually combining them. The package is suited to classification problems where two different types of data sets on the same samples are available. One of these data types has measurements on all samples and the other has measurements on only some samples. The former is easy to collect and/or relatively cheap (eg, clinical covariates) compared to the latter (high-dimensional data, eg, gene expression). One additional application for which stepwiseCM has also proven useful is the combination of two high-dimensional data types, eg, DNA copy number and mRNA expression. The package includes functions to project the neighborhood information in one data space to the other to determine a potential group of samples that are likely to benefit most from measuring the second type of covariates. The two heterogeneous data spaces are connected by indirect mapping. The crucial difference between the stepwise classification strategy implemented in this package and the existing packages is that our approach aims to be cost-efficient by avoiding measuring additional covariates, which might be expensive or patient-unfriendly, for a potentially large subgroup of individuals. Moreover, for these individuals, diagnostic test results would be available quickly, which may reduce waiting times and hence lower the patients’ distress. The improvement described remedies the key limitations of existing packages, and facilitates the use of the stepwiseCM package in diverse applications.
