scholarly journals Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Laura Cantini ◽  
Pooya Zakeri ◽  
Celine Hernandez ◽  
Aurelien Naldi ◽  
Denis Thieffry ◽  
...  

AbstractHigh-dimensional multi-omics data are now standard in biology. They can greatly enhance our understanding of biological systems when effectively integrated. To achieve proper integration, joint Dimensionality Reduction (jDR) methods are among the most efficient approaches. However, several jDR methods are available, urging the need for a comprehensive benchmark with practical guidelines. We perform a systematic evaluation of nine representative jDR methods using three complementary benchmarks. First, we evaluate their performances in retrieving ground-truth sample clustering from simulated multi-omics datasets. Second, we use TCGA cancer data to assess their strengths in predicting survival, clinical annotations and known pathways/biological processes. Finally, we assess their classification of multi-omics single-cell data. From these in-depth comparisons, we observe that intNMF performs best in clustering, while MCIA offers an effective behavior across many contexts. The code developed for this benchmark study is implemented in a Jupyter notebook—multi-omics mix (momix)—to foster reproducibility, and support users and future developers.

Author(s):  
Laura Cantini ◽  
Pooya Zakeri ◽  
Celine Hernandez ◽  
Aurelien Naldi ◽  
Denis Thieffry ◽  
...  

AbstractHigh-dimensional multi-omics data are now standard in biology. They can greatly enhance our understanding of biological systems when effectively integrated. To achieve this multi-omics data integration, Joint Dimensionality Reduction (jDR) methods are among the most efficient approaches. However, several jDR methods are available, urging the need for a comprehensive benchmark with practical guidelines.We performed a systematic evaluation of nine representative jDR methods using three complementary benchmarks. First, we evaluated their performances in retrieving ground-truth sample clustering from simulated multi-omics datasets. Second, we used TCGA cancer data to assess their strengths in predicting survival, clinical annotations and known pathways/biological processes. Finally, we assessed their classification of multi-omics single-cell data.From these in-depth comparisons, we observed that intNMF performs best in clustering, while MCIA offers a consistent and effective behavior across many contexts. The full code of this benchmark is implemented in a Jupyter notebook - multi-omics mix (momix) - to foster reproducibility, and support data producers, users and future developers.


2020 ◽  
Author(s):  
Ken Chen ◽  
Shaoheng Liang ◽  
Vakul Mohanty ◽  
Jinzhuang Dou ◽  
Miao Qi ◽  
...  

Abstract A key challenge in studying organisms and diseases is to detect rare molecular programs and rare cell populations (RCPs) that drive development, differentiation, and transformation. Molecular features such as genes and proteins defining RCPs are often unknown and difficult to detect from unenriched single-cell data, using conventional dimensionality reduction and clustering-based approaches. Here, we propose a novel unsupervised approach, named SCMER, which performs UMAP style dimensionality reduction via selecting a compact set of molecular features with definitive meanings. We applied SCMER in the context of hematopoiesis, lymphogenesis, tumorigenesis, and drug resistance and response. We found that SCMER can identify non-redundant features that sensitively delineate both common cell lineages and rare cellular states ignored by current approaches. SCMER can be widely used for discovering novel molecular features in a high dimensional dataset, designing targeted, cost-effective assays for clinical applications, and facilitating multi-modality integration.


2018 ◽  
Author(s):  
Etienne Becht ◽  
Charles-Antoine Dutertre ◽  
Immanuel W. H. Kwok ◽  
Lai Guan Ng ◽  
Florent Ginhoux ◽  
...  

AbstractUniform Manifold Approximation and Projection (UMAP) is a recently-published non-linear dimensionality reduction technique. Another such algorithm, t-SNE, has been the default method for such task in the past years. Herein we comment on the usefulness of UMAP high-dimensional cytometry and single-cell RNA sequencing, notably highlighting faster runtime and consistency, meaningful organization of cell clusters and preservation of continuums in UMAP compared to t-SNE.


2014 ◽  
pp. 32-42
Author(s):  
Matthieu Voiry ◽  
Kurosh Madani ◽  
Véronique Véronique Amarger ◽  
Joël Bernier

A major step for high-quality optical surfaces faults diagnosis concerns scratches and digs defects characterization in products. This challenging operation is very important since it is directly linked with the produced optical component’s quality. A classification phase is mandatory to complete optical devices diagnosis since a number of correctable defects are usually present beside the potential “abiding” ones. Unfortunately relevant data extracted from raw image during defects detection phase are high dimensional. This can have harmful effect on the behaviors of artificial neural networks which are suitable to perform such a challenging classification. Reducing data dimension to a smaller value can decrease the problems related to high dimensionality. In this paper we compare different techniques which permit dimensionality reduction and evaluate their impact on classification tasks performances.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Muta Tah Hira ◽  
M. A. Razzaque ◽  
Claudio Angione ◽  
James Scrivens ◽  
Saladin Sawan ◽  
...  

AbstractCancer is a complex disease that deregulates cellular functions at various molecular levels (e.g., DNA, RNA, and proteins). Integrated multi-omics analysis of data from these levels is necessary to understand the aberrant cellular functions accountable for cancer and its development. In recent years, Deep Learning (DL) approaches have become a useful tool in integrated multi-omics analysis of cancer data. However, high dimensional multi-omics data are generally imbalanced with too many molecular features and relatively few patient samples. This imbalance makes a DL based integrated multi-omics analysis difficult. DL-based dimensionality reduction technique, including variational autoencoder (VAE), is a potential solution to balance high dimensional multi-omics data. However, there are few VAE-based integrated multi-omics analyses, and they are limited to pancancer. In this work, we did an integrated multi-omics analysis of ovarian cancer using the compressed features learned through VAE and an improved version of VAE, namely Maximum Mean Discrepancy VAE (MMD-VAE). First, we designed and developed a DL architecture for VAE and MMD-VAE. Then we used the architecture for mono-omics, integrated di-omics and tri-omics data analysis of ovarian cancer through cancer samples identification, molecular subtypes clustering and classification, and survival analysis. The results show that MMD-VAE and VAE-based compressed features can respectively classify the transcriptional subtypes of the TCGA datasets with an accuracy in the range of 93.2-95.5% and 87.1-95.7%. Also, survival analysis results show that VAE and MMD-VAE based compressed representation of omics data can be used in cancer prognosis. Based on the results, we can conclude that (i) VAE and MMD-VAE outperform existing dimensionality reduction techniques, (ii) integrated multi-omics analyses perform better or similar compared to their mono-omics counterparts, and (iii) MMD-VAE performs better than VAE in most omics dataset.


2018 ◽  
Vol 10 (10) ◽  
pp. 1564 ◽  
Author(s):  
Patrick Bradley ◽  
Sina Keller ◽  
Martin Weinmann

In this paper, we investigate the potential of unsupervised feature selection techniques for classification tasks, where only sparse training data are available. This is motivated by the fact that unsupervised feature selection techniques combine the advantages of standard dimensionality reduction techniques (which only rely on the given feature vectors and not on the corresponding labels) and supervised feature selection techniques (which retain a subset of the original set of features). Thus, feature selection becomes independent of the given classification task and, consequently, a subset of generally versatile features is retained. We present different techniques relying on the topology of the given sparse training data. Thereby, the topology is described with an ultrametricity index. For the latter, we take into account the Murtagh Ultrametricity Index (MUI) which is defined on the basis of triangles within the given data and the Topological Ultrametricity Index (TUI) which is defined on the basis of a specific graph structure. In a case study addressing the classification of high-dimensional hyperspectral data based on sparse training data, we demonstrate the performance of the proposed unsupervised feature selection techniques in comparison to standard dimensionality reduction and supervised feature selection techniques on four commonly used benchmark datasets. The achieved classification results reveal that involving supervised feature selection techniques leads to similar classification results as involving unsupervised feature selection techniques, while the latter perform feature selection independently from the given classification task and thus deliver generally versatile features.


Now a day’s cancer has become a deathly disease due to the abnormal growth of the cell. Many researchers are working in this area for the early prediction of cancer. For the proper classification of cancer data, demands for the identification of proper set of genes by analyzing the genomic data. Most of the researchers used microarrays to identify the cancerous genomes. However, such kind of data is high dimensional where number of genes are more compared to samples. Also the data consists of many irrelevant features and noisy data. The classification technique deal with such kind of data influences the performance of algorithm. A popular classification algorithm (i.e., Logistic Regression) is considered in this work for gene classification. Regularization techniques like Lasso with L1 penalty, Ridge with L2 penalty, and hybrid Lasso with L1/2+2 penalty used to minimize irrelevant features and avoid overfitting. However, these methods are of sparse parametric and limits to linear data. Also methods have not produced promising performance when applied to high dimensional genome data. For solving these problems, this paper presents an Additive Sparse Logistic Regression with Additive Regularization (ASLR) method to discriminate linear and non-linear variables in gene classification. The results depicted that the proposed method proved to be the best-regularized method for classifying microarray data compared to standard methods


Sign in / Sign up

Export Citation Format

Share Document