Evaluating the informativeness of deep learning annotations for human complex diseases

2019
Author(s): Kushal K. Dey, Bryce Van de Geijn, Samuel Sungil Kim, Farhad Hormozdiari, David R. Kelley, ...

Abstract Deep learning models have shown great promise in predicting genome-wide regulatory effects from DNA sequence, but their informativeness for human complex diseases and traits is not fully understood. Here, we evaluate the disease informativeness of allelic-effect annotations (absolute value of the predicted difference between reference and variant alleles) constructed using two previously trained deep learning models, DeepSEA and Basenji. We apply stratified LD score regression (S-LDSC) to 41 independent diseases and complex traits (average N=320K) to evaluate each annotation’s informativeness for disease heritability conditional on a broad set of coding, conserved, regulatory and LD-related annotations from the baseline-LD model and other sources; as a secondary metric, we also evaluate the accuracy of models that incorporate deep learning annotations in predicting disease-associated or fine-mapped SNPs. We aggregated annotations across all tissues (resp. blood cell types or brain tissues) in meta-analyses across all 41 traits (resp. 11 blood-related traits or 8 brain-related traits). These allelic-effect annotations were highly enriched for disease heritability, but produced only limited conditionally significant results: only Basenji-H3K4me3 in meta-analyses across all 41 traits and brain-specific Basenji-H3K4me3 in meta-analyses across 8 brain-related traits. We conclude that deep learning models have yet to achieve their full potential to provide a considerable amount of unique information for complex disease, and that the informativeness of deep learning models for disease beyond established functional annotations cannot be inferred from metrics based on their accuracy in predicting regulatory annotations.
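The allelic-effect annotation defined in the abstract above (absolute value of the predicted difference between reference and variant alleles) can be illustrated with a minimal sketch. The function name and the example scores are hypothetical; a real analysis would take the per-SNP predictions from a trained model such as DeepSEA or Basenji, which this sketch does not call.

```python
from typing import List, Sequence


def allelic_effect(ref_pred: Sequence[float], alt_pred: Sequence[float]) -> List[float]:
    """Return |prediction(ref allele) - prediction(variant allele)| for each SNP."""
    if len(ref_pred) != len(alt_pred):
        raise ValueError("ref_pred and alt_pred must have the same length")
    return [abs(r - a) for r, a in zip(ref_pred, alt_pred)]


# Hypothetical model outputs (for example a predicted H3K4me3 signal) at three SNPs.
ref_scores = [0.75, 0.5, 0.25]
alt_scores = [0.5, 0.5, 1.0]

annotation = allelic_effect(ref_scores, alt_scores)
print(annotation)
```

A SNP whose two alleles receive the same prediction gets an annotation value of zero, so the annotation is large only where the model predicts that the variant changes regulatory activity.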

2020
Vol 11 (1)
Author(s): Kushal K. Dey, Bryce van de Geijn, Samuel Sungil Kim, Farhad Hormozdiari, David R. Kelley, ...

Abstract Deep learning models have shown great promise in predicting regulatory effects from DNA sequence, but their informativeness for human complex diseases is not fully understood. Here, we evaluate genome-wide SNP annotations from two previous deep learning models, DeepSEA and Basenji, by applying stratified LD score regression to 41 diseases and traits (average N = 320K), conditioning on a broad set of coding, conserved and regulatory annotations. We aggregated annotations across all (respectively blood or brain) tissues/cell-types in meta-analyses across all (respectively 11 blood or 8 brain) traits. The annotations were highly enriched for disease heritability, but produced only limited conditionally significant results: non-tissue-specific and brain-specific Basenji-H3K4me3 for all traits and brain traits respectively. We conclude that deep learning models have yet to achieve their full potential to provide considerable unique information for complex disease, and that their conditional informativeness for disease cannot be inferred from their accuracy in predicting regulatory annotations.


2020
Author(s): Kushal K. Dey, Samuel S. Kim, Steven Gazal, Joseph Nasser, Jesse M. Engreitz, ...

Abstract Deep learning models have achieved great success in predicting genome-wide regulatory effects from DNA sequence, but recent work has reported that SNP annotations derived from these predictions contribute limited unique information for human complex disease. Here, we explore three integrative approaches to improve the disease informativeness of allelic-effect annotations (predicted difference between reference and variant alleles) constructed using two previously trained deep learning models, DeepSEA and Basenji. First, we employ gradient boosting to learn optimal combinations of deep learning annotations, using (off-chromosome) fine-mapped SNPs and matched control SNPs for training. Second, we improve the specificity of these annotations by restricting them to SNPs implicated by (proximal and distal) SNP-to-gene (S2G) linking strategies, e.g. prioritizing SNPs involved in gene regulation. Third, we predict gene expression (and derive allelic-effect annotations) from deep learning annotations at SNPs implicated by S2G linking strategies, generalizing the previously proposed ExPecto approach, which incorporates deep learning annotations based on distance to TSS. We evaluated these approaches using stratified LD score regression, using functional data in blood and focusing on 11 autoimmune diseases and blood-related traits (average N=306K). We determined that the three approaches produced SNP annotations that were uniquely informative for these diseases/traits, despite the fact that linear combinations of the underlying DeepSEA and Basenji blood annotations were not uniquely informative for these diseases/traits. Our results highlight the benefits of integrating SNP annotations produced by deep learning models with other types of data, including data linking SNPs to genes.
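Two of the steps named in the abstract above, off-chromosome training and restriction to S2G-linked SNPs, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not give the exact split, so a simple hold-out of one target chromosome is assumed, and "restricting" an annotation is assumed to mean setting it to zero at unlinked SNPs. All function names, SNP identifiers and values are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def off_chromosome_split(snp_chrom: Dict[str, int], target_chrom: int) -> Tuple[List[str], List[str]]:
    """Assumed scheme: train on SNPs outside target_chrom, score SNPs on target_chrom."""
    train = sorted(s for s, c in snp_chrom.items() if c != target_chrom)
    test = sorted(s for s, c in snp_chrom.items() if c == target_chrom)
    return train, test


def restrict_to_linked(annotation: Dict[str, float], linked: Set[str]) -> Dict[str, float]:
    """Assumed reading of 'restricting': keep the value at S2G-linked SNPs, zero elsewhere."""
    return {s: (v if s in linked else 0.0) for s, v in annotation.items()}


# Hypothetical SNPs with their chromosomes.
snp_chrom = {"rs1": 1, "rs2": 1, "rs3": 2, "rs4": 3}
train_snps, test_snps = off_chromosome_split(snp_chrom, target_chrom=1)

# Hypothetical annotation values and a hypothetical set of gene-linked SNPs.
annotation = {"rs1": 0.4, "rs2": 0.1, "rs3": 0.9}
restricted = restrict_to_linked(annotation, linked={"rs1", "rs3"})
print(train_snps, test_snps, restricted)
```

Holding out the chromosome being scored keeps the learned combination from being fitted on the same SNPs it later annotates, which is the usual motivation for an off-chromosome design.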


2021
Vol 13 (8)
pp. 1602
Author(s): Qiaoqiao Sun, Xuefeng Liu, Salah Bourennane

Deep learning models have strong feature-learning abilities and have been successfully applied to hyperspectral images (HSIs). However, training most deep learning models requires labeled samples, and collecting labeled samples for HSIs is labor-intensive. In addition, usually only single-level features from a single layer are considered, which may result in the loss of some important information. Using multiple networks to obtain multi-level features is a solution, but at the cost of longer training time and higher computational complexity. To solve these problems, a novel unsupervised multi-level feature extraction framework based on a three-dimensional convolutional autoencoder (3D-CAE) is proposed in this paper. The designed 3D-CAE is built by stacking fully 3D convolutional layers and 3D deconvolutional layers, which allows the spectral-spatial information of targets to be mined simultaneously. Besides, the 3D-CAE can be trained in an unsupervised way without labeled samples. Moreover, the multi-level features are obtained directly from the encoded layers with different scales and resolutions, which is more efficient than using multiple networks to get them. The effectiveness of the proposed multi-level features is verified on two hyperspectral data sets. The results demonstrate that the proposed method has great promise in unsupervised feature learning and can further improve hyperspectral classification compared with single-level features.


2019
Vol 9 (3)
pp. 20190003
Author(s): Marco Giulini, Raffaello Potestio

Deep learning (DL) algorithms hold great promise for applications in the field of computational biophysics. In fact, the vast amount of available molecular structures, as well as their notable complexity, constitutes an ideal context in which DL-based approaches can be profitably employed. To express the full potential of these techniques, though, it is a prerequisite to express the information contained in a molecule’s atomic positions and distances in a set of input quantities that the network can process. Many of the molecular descriptors devised so far are effective and manageable for relatively small structures, but become complex and cumbersome for larger ones. Furthermore, most of them are defined locally, a feature that could represent a limit for those applications where global properties are of interest. Here, we build a DL architecture capable of predicting non-trivial and intrinsically global quantities, that is, the eigenvalues of a protein’s lowest-energy fluctuation modes. This application represents a first, relatively simple test bed for the development of a neural network approach to the quantitative analysis of protein structures, and demonstrates unexpected use in the identification of mechanically relevant regions of the molecule.


1995
Vol 9 (3)
pp. 169-174
Author(s): Craig H Warden, Jerome I Rotter

Identification of genes underlying complex traits has been difficult, but combined application of novel methods and mouse models provides new hope. Rare monogenic syndromes, and candidate gene and biochemical approaches are sometimes useful, but each of these approaches also has limitations. Some problems that prevent identification and isolation of genes underlying complex disease can be avoided by the use of whole genome mapping of mouse crosses or of human families. Mice have many advantages for the study of complex disease, including an extensive genetic map. A generic method has recently been developed and applied for detection of quantitative trait loci (QTLs) using whole genome maps of mouse crosses. Availability of more than 200 congenic strains provides another incentive for studies in mice. Congenic strains provide a rich, but previously unexploited, resource for the rapid identification of genes causing complex diseases. A congenic mouse strain is genetically identical to a background strain, except for a small chromosomal region derived from a donor strain. Thus, comparison of a phenotype in a congenic strain with the phenotype in its background strain allows study of the effects of single genes derived from the donor strain, isolated from the effects of other donor strain genes. Application of all or several techniques to complex disease studies in mice and in humans may lead to the identification and understanding of complex diseases whose etiology is currently unknown.


Metabolites
2019
Vol 9 (4)
pp. 66
Author(s): Michael Lee, Ting Hu

Metabolomics uses quantitative analyses of metabolites from tissues or bodily fluids to acquire a functional readout of the physiological state. Complex diseases arise from the influence of multiple factors, such as genetics, environment and lifestyle. Since genes, RNAs and proteins converge onto the terminal downstream metabolome, metabolomics datasets offer a rich source of information in a complex and convoluted presentation. Thus, powerful computational methods capable of deciphering the effects of many upstream influences have become increasingly necessary. In this review, the workflow of metabolic marker discovery is outlined from metabolite extraction to model interpretation and validation. Additionally, current metabolomics research in various complex disease areas is examined to identify gaps and trends in the use of several statistical and computational algorithms. Then, we highlight and discuss three advanced machine-learning algorithms, specifically ensemble learning, artificial neural networks, and genetic programming, that are currently less visible, but are budding with high potential for utility in metabolomics research. With an upward trend in the use of highly-accurate, multivariate models in the metabolomics literature, diagnostic biomarker panels of complex diseases are more recently achieving accuracies approaching or exceeding traditional diagnostic procedures. This review aims to provide an overview of computational methods in metabolomics and promote the use of up-to-date machine-learning and computational methods by metabolomics researchers.


2017
Vol 7 (7)
pp. 2271-2279
Author(s): Chao Xu, Ji-Gang Zhang, Dongdong Lin, Lan Zhang, Hui Shen, ...

Abstract Integrating diverse genomics data can provide a global view of the complex biological processes related to human complex diseases. Although substantial efforts have been made to integrate different omics data, there are at least three challenges for multi-omics integration methods: (i) how to simultaneously consider the effects of various genomic factors, since these factors jointly influence the phenotypes; (ii) how to effectively incorporate the information from publicly accessible databases and omics datasets to fully capture the interactions among (epi)genomic factors from diverse omics data; and (iii) how to combine more than two omics datasets, which has so far been poorly explored. Current integration approaches are not sufficient to address all of these challenges together. We proposed a novel integrative analysis framework incorporating a sparse model, multivariate analysis, a Gaussian graphical model, and network analysis to address these three challenges simultaneously. Based on this strategy, we performed a systemic analysis for glioblastoma multiforme (GBM) integrating genome-wide gene expression, DNA methylation, and miRNA expression data. We identified three regulatory modules of genomic factors associated with GBM survival time and revealed a global regulatory pattern for GBM by combining the three modules with respect to their common regulatory factors. Our method can not only identify disease-associated dysregulated genomic factors from different omics, but more importantly, it can incorporate the information from publicly accessible databases and omics datasets to infer a comprehensive interaction map of all these dysregulated genomic factors. Our work represents an innovative approach to enhance our understanding of the molecular genomic mechanisms underlying human complex diseases.


2021
Author(s): Loic Mangnier, Charles Joly-Beauparlant, Arnaud Droit, Steve Bilodeau, Alexandre Bureau

Background: The 3-dimensional (3D) conformation of the chromatin creates complex networks of noncoding regulatory regions (distal elements) and genes with important implications in gene regulation. Despite the importance of noncoding regions in complex traits, little is known about their interplay within regulatory hubs and its implications in multigenic diseases like schizophrenia. Results: Here we show that cis-regulatory hubs (CRHs) in neurons highlight functional interactions between distal elements and promoters, providing a model to explain the epigenetic mechanisms involved in complex diseases. CRHs represent a new 3D model, where several distal elements interact to create a complex network of active genes. Indeed, we found that CRHs represent functional structures, showing higher transcriptional activity. In a disease context, CRHs highlighted strong enrichments in schizophrenia-associated genes, schizophrenia-associated SNPs and schizophrenia heritability compared to equivalent tissue and non-tissue-specific structures. Also, genes that share the same distal elements converge to common biological processes associated with schizophrenia. Finally, the results showed that in a complex disease etiology, small CRHs, which link fewer distal elements to promoters, constitute a more informative structure than larger hubs. Conclusion: CRHs are a new 3D model of the chromatin interactions between gene promoters and their distal elements, highlighting causal regulatory processes and providing a better understanding of the etiology of complex diseases such as schizophrenia. Indeed, by providing a finer-scale chromosome architecture, we have genetic and statistical evidence that CRHs represent a major advancement in 3D models for studying the underlying epigenetic processes involved in complex traits.


2020
Author(s): Dean Sumner, Jiazhen He, Amol Thakkar, Ola Engkvist, Esben Jannik Bjerrum

SMILES randomization, a form of data augmentation, has previously been shown to increase the performance of deep learning models compared to non-augmented baselines. Here, we propose a novel data augmentation method we call “Levenshtein augmentation”, which considers local SMILES sub-sequence similarity between reactants and their respective products when creating training pairs. The performance of Levenshtein augmentation was tested using two state-of-the-art models: transformer and sequence-to-sequence based recurrent neural networks with attention. Levenshtein augmentation demonstrated increased performance over non-augmented data and conventionally SMILES-randomization-augmented data when used for training of baseline models. Furthermore, Levenshtein augmentation seemingly results in what we define as attentional gain: an enhancement in the pattern recognition capabilities of the underlying network to molecular motifs.
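The idea of pairing reactant and product strings by similarity can be illustrated with the standard Levenshtein (edit) distance. This is a simplified sketch, not the authors' method: the abstract speaks of local sub-sequence similarity, whereas this example uses whole-string edit distance, and the candidate SMILES variants are supplied by hand rather than generated by a cheminformatics toolkit. The helper name `closest_variant` is hypothetical.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def closest_variant(product: str, reactant_variants: List[str]) -> str:
    """Pick the reactant SMILES variant with the smallest edit distance to the product."""
    return min(reactant_variants, key=lambda s: levenshtein(s, product))


# Two equivalent SMILES strings for ethanol, paired against acetaldehyde.
variants = ["OCC", "CCO"]
product = "CC=O"
best = closest_variant(product, variants)
print(best, levenshtein(best, product))
```

Choosing the variant that is textually closest to the product gives the model training pairs in which unchanged parts of the molecule appear in similar positions on both sides.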


2019
Author(s): Mohammad Rezaei, Yanjun Li, Xiaolin Li, Chenglong Li

Introduction: The ability to discriminate among ligands binding to the same protein target in terms of their relative binding affinity lies at the heart of structure-based drug design. Any improvement in the accuracy and reliability of binding affinity prediction methods decreases the discrepancy between experimental and computational results. Objectives: The primary objectives were to find the features most relevant to binding affinity prediction, to minimize manual feature engineering, and to improve the reliability of binding affinity prediction using efficient deep learning models by tuning the model hyperparameters. Methods: The binding site of target proteins was represented as a grid box around their bound ligand. Both binary and distance-dependent occupancies were examined for how an atom affects its neighbor voxels in this grid. A combination of different features, including ANOLEA, ligand elements, and Arpeggio atom types, was used to represent the input. An efficient convolutional neural network (CNN) architecture, DeepAtom, was developed, trained and tested on the PDBbind v2016 dataset. Additionally, an extended benchmark dataset was compiled to train and evaluate the models. Results: The best DeepAtom model showed improved accuracy in binding affinity prediction on the PDBbind core subset (Pearson’s R=0.83), outperforming recent state-of-the-art models in this field. In addition, when the DeepAtom model was trained on our proposed benchmark dataset, it yielded a higher correlation than the baseline, which confirms the value of our model. Conclusions: The promising results for the predicted binding affinities are expected to pave the way for embedding deep learning models in virtual screening and rational drug design.
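The contrast between binary and distance-dependent voxel occupancy mentioned in the Methods can be sketched as follows. The abstract does not give the functional form of the distance-dependent occupancy, so a Gaussian decay is assumed here purely for illustration; the cutoff radius, `sigma`, and the coordinates are likewise hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def binary_occupancy(atom: Point, voxel: Point, radius: float) -> float:
    """1.0 if the voxel center lies within `radius` of the atom, else 0.0."""
    return 1.0 if math.dist(atom, voxel) <= radius else 0.0


def distance_dependent_occupancy(atom: Point, voxel: Point, sigma: float) -> float:
    """Assumed Gaussian decay: contribution shrinks smoothly with distance from the atom."""
    d = math.dist(atom, voxel)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


# One hypothetical atom and three voxel centers at increasing distance (in angstroms).
atom = (0.0, 0.0, 0.0)
voxels = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

hard = [binary_occupancy(atom, v, radius=1.5) for v in voxels]
soft = [distance_dependent_occupancy(atom, v, sigma=1.0) for v in voxels]
print(hard, soft)
```

The binary scheme treats every voxel inside the cutoff identically, while the smooth scheme lets nearby voxels carry more weight than distant ones, which is the distinction the abstract says was examined.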

