Evaluating the informativeness of deep learning annotations for human complex diseases

Abstract Deep learning models have shown great promise in predicting regulatory effects from DNA sequence, but their informativeness for human complex diseases is not fully understood. Here, we evaluate genome-wide SNP annotations from two previous deep learning models, DeepSEA and Basenji, by applying stratified LD score regression to 41 diseases and traits (average N = 320K), conditioning on a broad set of coding, conserved and regulatory annotations. We aggregated annotations across all (respectively blood or brain) tissues/cell-types in meta-analyses across all (respectively 11 blood or 8 brain) traits. The annotations were highly enriched for disease heritability, but produced only limited conditionally significant results: non-tissue-specific and brain-specific Basenji-H3K4me3 for all traits and brain traits respectively. We conclude that deep learning models have yet to achieve their full potential to provide considerable unique information for complex disease, and that their conditional informativeness for disease cannot be inferred from their accuracy in predicting regulatory annotations.

Evaluating the informativeness of deep learning annotations for human complex diseases

10.1101/784439 ◽

2019 ◽

Cited By ~ 3

Author(s):

Kushal K. Dey ◽

Bryce Van de Geijn ◽

Samuel Sungil Kim ◽

Farhad Hormozdiari ◽

David R. Kelley ◽

...

Keyword(s):

Deep Learning ◽

Complex Traits ◽

Complex Disease ◽

Complex Diseases ◽

Great Promise ◽

Full Potential ◽

Learning Models ◽

Allelic Effect ◽

Meta Analyses ◽

Human Complex

Abstract Deep learning models have shown great promise in predicting genome-wide regulatory effects from DNA sequence, but their informativeness for human complex diseases and traits is not fully understood. Here, we evaluate the disease informativeness of allelic-effect annotations (absolute value of the predicted difference between reference and variant alleles) constructed using two previously trained deep learning models, DeepSEA and Basenji. We apply stratified LD score regression (S-LDSC) to 41 independent diseases and complex traits (average N=320K) to evaluate each annotation’s informativeness for disease heritability conditional on a broad set of coding, conserved, regulatory and LD-related annotations from the baseline-LD model and other sources; as a secondary metric, we also evaluate the accuracy of models that incorporate deep learning annotations in predicting disease-associated or fine-mapped SNPs. We aggregated annotations across all tissues (resp. blood cell types or brain tissues) in meta-analyses across all 41 traits (resp. 11 blood-related traits or 8 brain-related traits). These allelic-effect annotations were highly enriched for disease heritability, but produced only limited conditionally significant results: non-tissue-specific Basenji-H3K4me3 in meta-analyses across all 41 traits and brain-specific Basenji-H3K4me3 in meta-analyses across 8 brain-related traits. We conclude that deep learning models have yet to achieve their full potential to provide a considerable amount of unique information for complex disease, and that the informativeness of deep learning models for disease beyond established functional annotations cannot be inferred from metrics based on their accuracy in predicting regulatory annotations.
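
The allelic-effect annotation defined in this abstract (absolute value of the predicted difference between reference and variant alleles) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `toy_predict` is a hypothetical stand-in for a trained model such as DeepSEA or Basenji, and averaging across tissues is an assumed aggregation rule that the abstract does not specify.

```python
from typing import Callable, Dict, List


# Hypothetical stand-in for a trained sequence model: maps a DNA sequence
# to one predicted regulatory score per tissue. Real models are far richer.
def toy_predict(seq: str) -> Dict[str, float]:
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    at = 1.0 - gc
    return {"blood": gc, "brain": at * 0.5}


def allelic_effect(ref_seq: str, alt_seq: str,
                   predict: Callable[[str], Dict[str, float]]) -> Dict[str, float]:
    """Per-tissue |prediction(ref) - prediction(alt)|."""
    ref, alt = predict(ref_seq), predict(alt_seq)
    return {tissue: abs(ref[tissue] - alt[tissue]) for tissue in ref}


def aggregate(effects: Dict[str, float], tissues: List[str]) -> float:
    """Assumed aggregation: mean allelic effect over the chosen tissues."""
    return sum(effects[t] for t in tissues) / len(tissues)


# One SNP: the reference allele A at the fifth base is replaced by G.
ref_seq = "ACGTACGTAA"
alt_seq = "ACGTGCGTAA"
effects = allelic_effect(ref_seq, alt_seq, toy_predict)
all_tissue_score = aggregate(effects, list(effects))
```

Passing a subset of tissue names to `aggregate` (for example only `"blood"`) would give the tissue-specific counterpart of the all-tissue score.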

Integrative approaches to improve the informativeness of deep learning models for human complex diseases

10.1101/2020.09.08.288563 ◽

2020 ◽

Author(s):

Kushal K. Dey ◽

Samuel S. Kim ◽

Steven Gazal ◽

Joseph Nasser ◽

Jesse M. Engreitz ◽

...

Keyword(s):

Deep Learning ◽

Complex Disease ◽

Gradient Boosting ◽

Great Success ◽

Learning Models ◽

Allelic Effect ◽

Genome Wide ◽

Variant Alleles ◽

Regulatory Effects ◽

Human Complex

Abstract Deep learning models have achieved great success in predicting genome-wide regulatory effects from DNA sequence, but recent work has reported that SNP annotations derived from these predictions contribute limited unique information for human complex disease. Here, we explore three integrative approaches to improve the disease informativeness of allelic-effect annotations (predicted difference between reference and variant alleles) constructed using two previously trained deep learning models, DeepSEA and Basenji. First, we employ gradient boosting to learn optimal combinations of deep learning annotations, using (off-chromosome) fine-mapped SNPs and matched control SNPs for training. Second, we improve the specificity of these annotations by restricting them to SNPs implicated by (proximal and distal) SNP-to-gene (S2G) linking strategies, e.g. prioritizing SNPs involved in gene regulation. Third, we predict gene expression (and derive allelic-effect annotations) from deep learning annotations at SNPs implicated by S2G linking strategies, generalizing the previously proposed ExPecto approach, which incorporates deep learning annotations based on distance to TSS. We evaluated these approaches using stratified LD score regression, using functional data in blood and focusing on 11 autoimmune diseases and blood-related traits (average N=306K). We determined that the three approaches produced SNP annotations that were uniquely informative for these diseases/traits, despite the fact that linear combinations of the underlying DeepSEA and Basenji blood annotations were not uniquely informative for these diseases/traits. Our results highlight the benefits of integrating SNP annotations produced by deep learning models with other types of data, including data linking SNPs to genes.
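
The second approach, restricting annotations to SNPs implicated by an S2G linking strategy, can be illustrated by zeroing an annotation outside the linked set. This is a minimal sketch under stated assumptions: the SNP identifiers, the scores, and the `restrict_to_s2g` helper are all hypothetical, and the published method may weight or combine annotations differently.

```python
from typing import Dict, Set


def restrict_to_s2g(annotation: Dict[str, float],
                    s2g_snps: Set[str]) -> Dict[str, float]:
    """Keep an annotation value only at SNPs linked to a gene; set 0 elsewhere."""
    return {snp: (value if snp in s2g_snps else 0.0)
            for snp, value in annotation.items()}


# Hypothetical allelic-effect scores for four SNPs.
deep_learning_annotation = {"rs1": 0.80, "rs2": 0.10, "rs3": 0.55, "rs4": 0.30}
# Hypothetical set of SNPs implicated by some S2G linking strategy.
linked_snps = {"rs1", "rs3"}

restricted = restrict_to_s2g(deep_learning_annotation, linked_snps)
```

The restricted annotation keeps the original score at `rs1` and `rs3` and is zero at the unlinked SNPs, which is one simple way to make an annotation more specific.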

Unsupervised Multi-Level Feature Extraction for Improvement of Hyperspectral Classification

Remote Sensing ◽

10.3390/rs13081602 ◽

2021 ◽

Vol 13 (8) ◽

pp. 1602

Author(s):

Qiaoqiao Sun ◽

Xuefeng Liu ◽

Salah Bourennane

Keyword(s):

Feature Extraction ◽

Deep Learning ◽

Spatial Information ◽

Hyperspectral Data ◽

Great Promise ◽

Learning Models ◽

Single Level ◽

Multiple Networks ◽

Multi Level ◽

Hyperspectral Classification

Deep learning models are effective at learning features and have been successfully applied to hyperspectral images (HSIs). However, training most deep learning models requires labeled samples, and collecting labeled samples for HSIs is labor-intensive. In addition, usually only single-level features from a single layer are considered, which may result in the loss of some important information. Using multiple networks to obtain multi-level features is a solution, but at the cost of longer training time and greater computational complexity. To solve these problems, a novel unsupervised multi-level feature extraction framework based on a three-dimensional convolutional autoencoder (3D-CAE) is proposed in this paper. The designed 3D-CAE is built by stacking fully 3D convolutional layers and 3D deconvolutional layers, which allows the spectral-spatial information of targets to be mined simultaneously. In addition, the 3D-CAE can be trained in an unsupervised way without labeled samples. Moreover, the multi-level features are obtained directly from the encoded layers with different scales and resolutions, which is more efficient than using multiple networks to obtain them. The effectiveness of the proposed multi-level features is verified on two hyperspectral data sets. The results demonstrate that the proposed method has great promise in unsupervised feature learning and can further improve hyperspectral classification compared with single-level features.

SeqEnhDL: sequence-based classification of cell type-specific enhancers using deep learning models

10.1101/2020.05.13.093997 ◽

2020 ◽

Author(s):

Yupeng Wang ◽

Rosario B. Jaime-Lara ◽

Abhrarup Roy ◽

Ying Sun ◽

Xinyue Liu ◽

...

Keyword(s):

Neural Network ◽

Deep Learning ◽

Dna Sequences ◽

Cell Types ◽

Learning Models ◽

Cell Type ◽

Coding Sequences ◽

Sequence Features ◽

Cell Type Specific ◽

Different Cell Types

Abstract We propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, sequential k-mer (k=5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences were used as features for deep learning models. Three deep learning models were implemented, including multi-layer perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers including gkm-SVM and DanQ, with regard to distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL is able to directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue-specificity can be accurately identified according to their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL.
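
The k-mer fold-change features described in this abstract can be sketched roughly as follows. This is an illustrative reading of the abstract, not the SeqEnhDL implementation: the function names, the pseudocount smoothing, and the default value of 1.0 for unseen k-mers are assumptions, and the toy example uses k=2 rather than the k=5, 7, 9 and 11 reported by the authors.

```python
from collections import Counter
from typing import Dict, List


def kmer_frequencies(seqs: List[str], k: int) -> Counter:
    """Count every overlapping k-mer across a set of sequences."""
    counts: Counter = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def fold_changes(enhancers: List[str], background: List[str], k: int,
                 pseudocount: float = 1.0) -> Dict[str, float]:
    """Smoothed relative frequency of each k-mer in enhancers vs. background."""
    fg, bg = kmer_frequencies(enhancers, k), kmer_frequencies(background, k)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    kmers = set(fg) | set(bg)
    fc: Dict[str, float] = {}
    for kmer in kmers:
        # Assumed smoothing: add a pseudocount so unseen k-mers never divide by zero.
        fg_rate = (fg[kmer] + pseudocount) / (fg_total + pseudocount * len(kmers))
        bg_rate = (bg[kmer] + pseudocount) / (bg_total + pseudocount * len(kmers))
        fc[kmer] = fg_rate / bg_rate
    return fc


def featurize(seq: str, fc: Dict[str, float], k: int) -> List[float]:
    """Sequential feature vector: the fold change of the k-mer at each position."""
    return [fc.get(seq[i:i + k], 1.0) for i in range(len(seq) - k + 1)]


# Toy data (hypothetical sequences, small k for readability).
enhancers = ["GGGG", "GGCC"]
background = ["ATAT", "GGAT"]
fc = fold_changes(enhancers, background, k=2)
features = featurize("GGAT", fc, k=2)
```

Each position in `features` is above 1 when its k-mer is more frequent in the enhancer set than in the background and below 1 otherwise; a vector like this is the kind of input an MLP, CNN or RNN classifier could consume.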

A deep learning approach to the structural analysis of proteins

Interface Focus ◽

10.1098/rsfs.2019.0003 ◽

2019 ◽

Vol 9 (3) ◽

pp. 20190003 ◽

Cited By ~ 3

Author(s):

Marco Giulini ◽

Raffaello Potestio

Keyword(s):

Deep Learning ◽

Protein Structures ◽

Molecular Structures ◽

Great Promise ◽

Full Potential ◽

Test Bed ◽

Computational Biophysics ◽

Energy Fluctuation ◽

Neural Network Approach ◽

Global Properties

Deep learning (DL) algorithms hold great promise for applications in the field of computational biophysics. In fact, the vast amount of available molecular structures, as well as their notable complexity, constitutes an ideal context in which DL-based approaches can be profitably employed. To express the full potential of these techniques, though, it is a prerequisite to express the information contained in a molecule’s atomic positions and distances in a set of input quantities that the network can process. Many of the molecular descriptors devised so far are effective and manageable for relatively small structures, but become complex and cumbersome for larger ones. Furthermore, most of them are defined locally, a feature that could represent a limit for those applications where global properties are of interest. Here, we build a DL architecture capable of predicting non-trivial and intrinsically global quantities, that is, the eigenvalues of a protein’s lowest-energy fluctuation modes. This application represents a first, relatively simple test bed for the development of a neural network approach to the quantitative analysis of protein structures, and demonstrates unexpected use in the identification of mechanically relevant regions of the molecule.

SeqEnhDL: sequence-based classification of cell type-specific enhancers using deep learning models

10.21203/rs.3.rs-94396/v1 ◽

2020 ◽

Author(s):

Yupeng Wang ◽

Rosario Jaime-Lara ◽

Abhrarup Roy ◽

Ying Sun ◽

Xinyue Liu ◽

...

Keyword(s):

Neural Network ◽

Deep Learning ◽

Cell Types ◽

Regulatory Elements ◽

Learning Models ◽

Cell Type ◽

Coding Sequences ◽

Sequence Features ◽

A Genome ◽

Cell Type Specific

Abstract Objective Computational identification of cell type-specific regulatory elements on a genome-wide scale is very challenging. Results We propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, sequential k-mer (k=5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences were used as features for deep learning models. Three deep learning models were implemented, including multi-layer perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers including gkm-SVM and DanQ, with regard to distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL is able to directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue-specificity can be accurately identified according to their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL.

Comparative analysis of kidney organoid and adult human kidney single cell and single nucleus transcriptomes

10.1101/232561 ◽

2017 ◽

Cited By ~ 9

Author(s):

Haojia Wu ◽

Kohei Uchimura ◽

Erinn Donnelly ◽

Yuhei Kirita ◽

Samantha A. Morris ◽

...

Keyword(s):

Single Cell ◽

Human Kidney ◽

Cell Types ◽

Great Promise ◽

Full Potential ◽

Diverse Range ◽

Adult Human ◽

Cell Diversity ◽

Single Nucleus ◽

Kidney Organoids

Abstract Kidney organoids differentiated from human pluripotent stem cells hold great promise for understanding organogenesis, modeling disease and ultimately as a source of replacement tissue. Realizing the full potential of this technology will require better differentiation strategies based upon knowledge of the cellular diversity and differentiation state of all cells within these organoids. Here we analyze single cell gene expression in 45,227 cells isolated from 23 organoids differentiated using two different protocols. Both generate kidney organoids that contain a diverse range of kidney cells at differing ratios as well as non-renal cell types. We quantified the differentiation state of major organoid kidney cell types by comparing them against a single nucleus RNA-seq dataset of 4,259 nuclei generated from adult human kidney, revealing immaturity of all kidney organoid cell types. We reconstructed lineage relationships during organoid differentiation through pseudotemporal ordering, and identified transcription factor networks associated with fate decisions. These results define impressive kidney organoid cell diversity, identify incomplete differentiation as a major roadblock for current directed differentiation protocols and provide a human adult kidney snRNA-seq dataset against which to benchmark future progress.

SeqEnhDL: sequence-based classification of cell type-specific enhancers using deep learning models

BMC Research Notes ◽

10.1186/s13104-021-05518-7 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Yupeng Wang ◽

Rosario B. Jaime-Lara ◽

Abhrarup Roy ◽

Ying Sun ◽

Xinyue Liu ◽

...

Keyword(s):

Neural Network ◽

Deep Learning ◽

Cell Types ◽

Regulatory Elements ◽

Learning Models ◽

Cell Type ◽

Coding Sequences ◽

Sequence Features ◽

A Genome ◽

Cell Type Specific

Abstract Objective To address the challenge of computational identification of cell type-specific regulatory elements on a genome-wide scale. Results We propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, positional k-mer (k = 5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences across each nucleotide position were used as features for deep learning models. Three deep learning models were implemented, including multi-layer perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers (including gkm-SVM and DanQ) in distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL can directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue-specificity can be accurately identified based on their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL.

Deep learning models predict regulatory variants in pancreatic islets and refine type 2 diabetes association signals

eLife ◽

10.7554/elife.51503 ◽

2020 ◽

Vol 9 ◽

Cited By ~ 4

Author(s):

Agata Wesolowska-Andersen ◽

Grace Zhuo Yu ◽

Vibe Nylander ◽

Fernando Abaitua ◽

Matthias Thurner ◽

...

Keyword(s):

Type 2 Diabetes ◽

Deep Learning ◽

Pancreatic Islets ◽

Regulatory Function ◽

Learning Models ◽

Association Analyses ◽

Regulatory Variants ◽

Regulatory Effects ◽

Genomic Regions

Genome-wide association analyses have uncovered multiple genomic regions associated with type 2 diabetes (T2D), but identification of the causal variants at these regions remains a challenge. There is growing interest in the potential of deep learning models, which predict epigenome features from DNA sequence, to support inference concerning the regulatory effects of disease-associated variants. Here, we evaluate the advantages of training convolutional neural network (CNN) models on a broad set of epigenomic features collected in a single disease-relevant tissue (pancreatic islets in the case of T2D), as opposed to models trained on multiple human tissues. We report convergence of CNN-based metrics of regulatory function with conventional approaches to variant prioritization: genetic fine-mapping and regulatory annotation enrichment. We demonstrate that CNN-based analyses can refine association signals at T2D-associated loci and provide experimental validation for one such signal. We anticipate that these approaches will become routine in downstream analyses of GWAS.

A Systemic Analysis of Transcriptomic and Epigenomic Data To Reveal Regulation Patterns for Complex Disease

G3 Genes|Genome|Genetics ◽

10.1534/g3.117.042408 ◽

2017 ◽

Vol 7 (7) ◽

pp. 2271-2279 ◽

Cited By ~ 3

Author(s):

Chao Xu ◽

Ji-Gang Zhang ◽

Dongdong Lin ◽

Lan Zhang ◽

Hui Shen ◽

...

Keyword(s):

Complex Disease ◽

Graphical Model ◽

Complex Diseases ◽

Omics Data ◽

Regulatory Modules ◽

Systemic Analysis ◽

Genome Wide ◽

Mirna Expression Data ◽

Interaction Map ◽

Human Complex

Abstract Integrating diverse genomics data can provide a global view of the complex biological processes related to human complex diseases. Although substantial efforts have been made to integrate different omics data, there are at least three challenges for multi-omics integration methods: (i) how to simultaneously consider the effects of various genomic factors, since these factors jointly influence the phenotypes; (ii) how to effectively incorporate the information from publicly accessible databases and omics datasets to fully capture the interactions among (epi)genomic factors from diverse omics data; and (iii) how to combine more than two omics datasets, which has been poorly explored to date. Current integration approaches are not sufficient to address all of these challenges together. We proposed a novel integrative analysis framework that incorporates a sparse model, multivariate analysis, a Gaussian graphical model, and network analysis to address these three challenges simultaneously. Based on this strategy, we performed a systemic analysis for glioblastoma multiforme (GBM) integrating genome-wide gene expression, DNA methylation, and miRNA expression data. We identified three regulatory modules of genomic factors associated with GBM survival time and, by combining the three modules with respect to their common regulatory factors, revealed a global regulatory pattern for GBM. Our method can not only identify disease-associated dysregulated genomic factors from different omics, but, more importantly, it can incorporate the information from publicly accessible databases and omics datasets to infer a comprehensive interaction map of all these dysregulated genomic factors. Our work represents an innovative approach to enhance our understanding of the molecular genomic mechanisms underlying human complex diseases.
