Exploring generative deep learning for omics data using log-linear models

Moritz Hess; Maren Hackenberg; Harald Binder

doi:10.1093/bioinformatics/btaa623

Exploring generative deep learning for omics data using log-linear models

Bioinformatics ◽

10.1093/bioinformatics/btaa623 ◽

2020 ◽

Vol 36 (20) ◽

pp. 5045-5053

Author(s):

Moritz Hess ◽

Maren Hackenberg ◽

Harald Binder

Keyword(s):

Deep Learning ◽

Linear Models ◽

Synthetic Data ◽

Simulated Data ◽

Image Data ◽

Supplementary Information ◽

Underlying Structure ◽

Omics Data ◽

Latent Representations ◽

Log Linear

Abstract Motivation Following many successful applications to image data, deep learning is now also increasingly considered for omics data. In particular, generative deep learning not only provides competitive prediction performance, but also allows for uncovering structure by generating synthetic samples. However, exploration and visualization is not as straightforward as with image applications. Results We demonstrate how log-linear models, fitted to the generated, synthetic data can be used to extract patterns from omics data, learned by deep generative techniques. Specifically, interactions between latent representations learned by the approaches and generated synthetic data are used to determine sets of joint patterns. Distances of patterns with respect to the distribution of latent representations are then visualized in low-dimensional coordinate systems, e.g. for monitoring training progress. This is illustrated with simulated data and subsequently with cortical single-cell gene expression data. Using different kinds of deep generative techniques, specifically variational autoencoders and deep Boltzmann machines, the proposed approach highlights how the techniques uncover underlying structure. It facilitates the real-world use of such generative deep learning techniques to gain biological insights from omics data. Availability and implementation The code for the approach as well as an accompanying Jupyter notebook, which illustrates the application of our approach, is available via the GitHub repository: https://github.com/ssehztirom/Exploring-generative-deep-learning-for-omics-data-by-using-log-linear-models. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

SMILE: Mutual Information Learning for Integration of Single-cell Omics Data

Bioinformatics ◽

10.1093/bioinformatics/btab706 ◽

2021 ◽

Author(s):

Yang Xu ◽

Priyojit Das ◽

Rachel Patton McCord

Keyword(s):

Deep Learning ◽

Mutual Information ◽

Single Cell ◽

Learning Algorithm ◽

Cellular Systems ◽

Supplementary Information ◽

Omics Data ◽

Learning Approaches ◽

Rna Seq ◽

Integrate Data

Abstract Motivation Deep learning approaches have empowered single-cell omics data analysis in many ways and generated new insights from complex cellular systems. As there is an increasing need for single cell omics data to be integrated across sources, types, and features of data, the challenges of integrating single-cell omics data are rising. Here, we present an unsupervised deep learning algorithm that learns discriminative representations for single-cell data via maximizing mutual information, SMILE (Single-cell Mutual Information Learning). Results Using a unique cell-pairing design, SMILE successfully integrates multi-source single-cell transcriptome data, removing batch effects and projecting similar cell types, even from different tissues, into the shared space. SMILE can also integrate data from two or more modalities, such as joint profiling technologies using single-cell ATAC-seq, RNA-seq, DNA methylation, Hi-C, and ChIP data. When paired cells are known, SMILE can integrate data with unmatched feature, such as genes for RNA-seq and genome wide peaks for ATAC-seq. Integrated representations learned from joint profiling technologies can then be used as a framework for comparing independent single source data. Supplementary information Supplementary data are available at Bioinformatics online. The source code of SMILE including analyses of key results in the study can be found at: https://github.com/rpmccordlab/SMILE.

Download Full-text

Inferring viral occurrence patterns through a synthetic data simulation

10.1101/2021.07.13.452220 ◽

2021 ◽

Author(s):

Ville N Pimenoff ◽

Ramon Cleries

Keyword(s):

Linear Models ◽

Population Sample ◽

Synthetic Data ◽

Interaction Patterns ◽

Viral Strain ◽

Data Simulation ◽

Synthetic Datasets ◽

Pathogen Occurrence ◽

Log Linear ◽

Occurrence Patterns

Viruses infecting humans are manifold and several of them provoke significant morbidity and mortality. Simulations creating large synthetic datasets from observed multiple viral strain infections in a limited population sample can be a powerful tool to infer significant pathogen occurrence and interaction patterns, particularly if limited number of observed data units is available. Here, to demonstrate diverse human papillomavirus (HPV) strain occurrence patterns, we used log-linear models combined with Bayesian framework for graphical independence network (GIN) analysis. That is, to simulate datasets based on modeling the probabilistic associations between observed viral data points, i.e different viral strain infections in a set of population samples. Our GIN analysis outperformed in precision all oversampling methods tested for simulating large synthetic viral strain-level prevalence dataset from observed set of HPVs data. Altogether, we demonstrate that network modeling is a potent tool for creating synthetic viral datasets for comprehensive pathogen occurrence and interaction pattern estimations.

Download Full-text

Use of Deep Learning Features in Log-Linear Models

Log-Linear Models, Extensions, and Applications ◽

10.7551/mitpress/10012.003.0005 ◽

2018 ◽

Keyword(s):

Deep Learning ◽

Linear Models ◽

Deep Learning Features ◽

Log Linear

Download Full-text

Deep-learning with synthetic data enables automated picking of cryo-EM particle images of biological macromolecules

Bioinformatics ◽

10.1093/bioinformatics/btz728 ◽

2019 ◽

Cited By ~ 2

Author(s):

Ruijie Yao ◽

Jiaqiang Qian ◽

Qiang Huang

Keyword(s):

Deep Learning ◽

Single Particle ◽

Feature Matching ◽

Synthetic Data ◽

Supplementary Information ◽

Particle Analysis ◽

Biological Macromolecules ◽

Low Contrast ◽

3D Structures ◽

Particle Images

Abstract Motivation Single-particle cryo-electron microscopy (cryo-EM) has become a powerful technique for determining 3D structures of biological macromolecules at near-atomic resolution. However, this approach requires picking huge numbers of macromolecular particle images from thousands of low-contrast, high-noisy electron micrographs. Although machine-learning methods were developed to get rid of this bottleneck, it still lacks universal methods that could automatically picking the noisy cryo-EM particles of various macromolecules. Results Here, we present a deep-learning segmentation model that employs fully convolutional networks trained with synthetic data of known 3D structures, called PARSED (PARticle SEgmentation Detector). Without using any experimental information, PARSED could automatically segment the cryo-EM particles in a whole micrograph at a time, enabling faster particle picking than previous template/feature-matching and particle-classification methods. Applications to six large public cryo-EM datasets clearly validated its universal ability to pick macromolecular particles of various sizes. Thus, our deep-learning method could break the particle-picking bottleneck in the single-particle analysis, and thereby accelerates the high-resolution structure determination by cryo-EM. Availability and implementation The PARSED package and user manual for noncommercial use are available as Supplementary Material (in the compressed file: parsed_v1.zip). Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Using the structure of genome data in the design of deep neural networks for predicting amyotrophic lateral sclerosis from genotype

Bioinformatics ◽

10.1093/bioinformatics/btz369 ◽

2019 ◽

Vol 35 (14) ◽

pp. i538-i547 ◽

Cited By ~ 4

Author(s):

Bojian Yin ◽

Marleen Balvert ◽

Rick A A van der Spek ◽

Bas E Dutilh ◽

Sander Bohté ◽

...

Keyword(s):

Amyotrophic Lateral Sclerosis ◽

Deep Learning ◽

Network Architecture ◽

Linear Models ◽

Association Studies ◽

Supplementary Information ◽

Whole Genome ◽

Promoter Regions ◽

Genome Data ◽

Lateral Sclerosis

Abstract Motivation Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease caused by aberrations in the genome. While several disease-causing variants have been identified, a major part of heritability remains unexplained. ALS is believed to have a complex genetic basis where non-additive combinations of variants constitute disease, which cannot be picked up using the linear models employed in classical genotype–phenotype association studies. Deep learning on the other hand is highly promising for identifying such complex relations. We therefore developed a deep-learning based approach for the classification of ALS patients versus healthy individuals from the Dutch cohort of the Project MinE dataset. Based on recent insight that regulatory regions harbor the majority of disease-associated variants, we employ a two-step approach: first promoter regions that are likely associated to ALS are identified, and second individuals are classified based on their genotype in the selected genomic regions. Both steps employ a deep convolutional neural network. The network architecture accounts for the structure of genome data by applying convolution only to parts of the data where this makes sense from a genomics perspective. Results Our approach identifies potentially ALS-associated promoter regions, and generally outperforms other classification methods. Test results support the hypothesis that non-additive combinations of variants contribute to ALS. Architectures and protocols developed are tailored toward processing population-scale, whole-genome data. We consider this a relevant first step toward deep learning assisted genotype–phenotype association in whole genome-sized data. Availability and implementation Our code will be available on Github, together with a synthetic dataset (https://github.com/byin-cwi/ALS-Deeplearning). The data used in this study is available to bona-fide researchers upon request. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Subtype-GAN: a deep learning approach for integrative cancer subtyping of multi-omics data

Bioinformatics ◽

10.1093/bioinformatics/btab109 ◽

2021 ◽

Author(s):

Hai Yang ◽

Rui Chen ◽

Dongdong Li ◽

Zhe Wang

Keyword(s):

Neural Network ◽

Deep Learning ◽

Latent Variables ◽

Supplementary Information ◽

Learning Approach ◽

Data Sets ◽

Omics Data ◽

Data Set ◽

Benchmark Data ◽

Cancer Pathogenesis

Abstract Motivation The discovery of cancer subtyping can help explore cancer pathogenesis, determine clinical actionability in treatment, and improve patients' survival rates. However, due to the diversity and complexity of multi-omics data, it is still challenging to develop integrated clustering algorithms for tumor molecular subtyping. Results We propose Subtype-GAN, a deep adversarial learning approach based on the multiple-input multiple-output neural network to model the complex omics data accurately. With the latent variables extracted from the neural network, Subtype-GAN uses consensus clustering and the Gaussian Mixture model to identify tumor samples' molecular subtypes. Compared with other state-of-the-art subtyping approaches, Subtype-GAN achieved outstanding performance on the benchmark data sets consisting of ∼4,000 TCGA tumors from 10 types of cancer. We found that on the comparison data set, the clustering scheme of Subtype-GAN is not always similar to that of the deep learning method AE but is identical to that of NEMO, MCCA, VAE, and other excellent approaches. Finally, we applied Subtype-GAN to the BRCA data set and automatically obtained the number of subtypes and the subtype labels of 1031 BRCA tumors. Through the detailed analysis, we found that the identified subtypes are clinically meaningful and show distinct patterns in the feature space, demonstrating the practicality of Subtype-GAN. Availability The source codes, the clustering results of Subtype-GAN across the benchmark data sets are available at https://github.com/haiyang1986/Subtype-GAN. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

forgeNet: a graph deep neural network model using tree-based ensemble classifiers for feature graph construction

Bioinformatics ◽

10.1093/bioinformatics/btaa164 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3507-3515

Author(s):

Yunchuan Kong ◽

Tianwei Yu

Keyword(s):

Deep Learning ◽

Model Building ◽

Classification Model ◽

Supplementary Information ◽

Sparse Learning ◽

Omics Data ◽

Ensemble Classifiers ◽

Feedforward Network ◽

Unique Challenge ◽

Functional Relationships

Abstract Motivation A unique challenge in predictive model building for omics data has been the small number of samples (n) versus the large amount of features (p). This ‘n≪p’ property brings difficulties for disease outcome classification using deep learning techniques. Sparse learning by incorporating known functional relationships between the biological units, such as the graph-embedded deep feedforward network (GEDFN) model, has been a solution to this issue. However, such methods require an existing feature graph, and potential mis-specification of the feature graph can be harmful on classification and feature selection. Results To address this limitation and develop a robust classification model without relying on external knowledge, we propose a forest graph-embedded deep feedforward network (forgeNet) model, to integrate the GEDFN architecture with a forest feature graph extractor, so that the feature graph can be learned in a supervised manner and specifically constructed for a given prediction task. To validate the method’s capability, we experimented the forgeNet model with both synthetic and real datasets. The resulting high classification accuracy suggests that the method is a valuable addition to sparse deep learning models for omics data. Availability and implementation The method is available at https://github.com/yunchuankong/forgeNet. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Structured sparsity regularization for analyzing high-dimensional omics data

Briefings in Bioinformatics ◽

10.1093/bib/bbaa122 ◽

2020 ◽

Cited By ~ 1

Author(s):

Susana Vinga

Keyword(s):

Statistical Learning ◽

High Performance ◽

Linear Models ◽

Clinical Decision Support Systems ◽

Clinical Decision ◽

Supplementary Information ◽

Biomedical Data ◽

Omics Data ◽

Regression Methods ◽

Estimation Algorithms

Abstract The development of new molecular and cell technologies is having a significant impact on the quantity of data generated nowadays. The growth of omics databases is creating a considerable potential for knowledge discovery and, concomitantly, is bringing new challenges to statistical learning and computational biology for health applications. Indeed, the high dimensionality of these data may hamper the use of traditional regression methods and parameter estimation algorithms due to the intrinsic non-identifiability of the inherent optimization problem. Regularized optimization has been rising as a promising and useful strategy to solve these ill-posed problems by imposing additional constraints in the solution parameter space. In particular, the field of statistical learning with sparsity has been significantly contributing to building accurate models that also bring interpretability to biological observations and phenomena. Beyond the now-classic elastic net, one of the best-known methods that combine lasso with ridge penalizations, we briefly overview recent literature on structured regularizers and penalty functions that have been applied in biomedical data to build parsimonious models in a variety of underlying contexts, from survival to generalized linear models. These methods include functions of $\ell _k$-norms and network-based penalties that take into account the inherent relationships between the features. The successful application to omics data illustrates the potential of sparse structured regularization for identifying disease’s molecular signatures and for creating high-performance clinical decision support systems towards more personalized healthcare. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.

Download Full-text

SCIM: universal single-cell matching with unpaired feature sets

Bioinformatics ◽

10.1093/bioinformatics/btaa843 ◽

2020 ◽

Vol 36 (Supplement_2) ◽

pp. i919-i927

Author(s):

Stefan G Stark ◽

Joanna Ficek ◽

Francesco Locatello ◽

Ximena Bonilla ◽

Stéphane Chevrier ◽

...

Keyword(s):

Single Cell ◽

Supplementary Information ◽

Underlying Structure ◽

Scalable Algorithms ◽

Melanoma Tumor ◽

Bipartite Matching ◽

Technological Advances ◽

Latent Representations ◽

Low Dimensional ◽

Cell Data

Abstract Motivation Recent technological advances have led to an increase in the production and availability of single-cell data. The ability to integrate a set of multi-technology measurements would allow the identification of biologically or clinically meaningful observations through the unification of the perspectives afforded by each technology. In most cases, however, profiling technologies consume the used cells and thus pairwise correspondences between datasets are lost. Due to the sheer size single-cell datasets can acquire, scalable algorithms that are able to universally match single-cell measurements carried out in one cell to its corresponding sibling in another technology are needed. Results We propose Single-Cell data Integration via Matching (SCIM), a scalable approach to recover such correspondences in two or more technologies. SCIM assumes that cells share a common (low-dimensional) underlying structure and that the underlying cell distribution is approximately constant across technologies. It constructs a technology-invariant latent space using an autoencoder framework with an adversarial objective. Multi-modal datasets are integrated by pairing cells across technologies using a bipartite matching scheme that operates on the low-dimensional latent representations. We evaluate SCIM on a simulated cellular branching process and show that the cell-to-cell matches derived by SCIM reflect the same pseudotime on the simulated dataset. Moreover, we apply our method to two real-world scenarios, a melanoma tumor sample and a human bone marrow sample, where we pair cells from a scRNA dataset to their sibling cells in a CyTOF dataset achieving 90% and 78% cell-matching accuracy for each one of the samples, respectively. Availability and implementation https://github.com/ratschlab/scim. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

A variable selection approach for highly correlated predictors in high-dimensional genomic data

Bioinformatics ◽

10.1093/bioinformatics/btab114 ◽

2021 ◽

Author(s):

Wencan Zhu ◽

Céline Lévy-Leduc ◽

Nils Ternès

Keyword(s):

Variable Selection ◽

Linear Models ◽

Synthetic Data ◽

R Package ◽

Supplementary Information ◽

High Dimensional ◽

Selection Approach ◽

Generalized Lasso ◽

Genomic Studies ◽

Highly Correlated

Abstract Motivation In genomic studies, identifying biomarkers associated with a variable of interest is a major concern in biomedical research. Regularized approaches are classically used to perform variable selection in high-dimensional linear models. However, these methods can fail in highly correlated settings. Results We propose a novel variable selection approach called WLasso, taking these correlations into account. It consists in rewriting the initial high-dimensional linear model to remove the correlation between the biomarkers (predictors) and in applying the generalized Lasso criterion. The performance of WLasso is assessed using synthetic data in several scenarios and compared with recent alternative approaches. The results show that when the biomarkers are highly correlated, WLasso outperforms the other approaches in sparse high-dimensional frameworks. The method is also illustrated on publicly available gene expression data in breast cancer. Availabilityand implementation Our method is implemented in the WLasso R package which is available from the Comprehensive R Archive Network (CRAN). Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text