scholarly journals Integrative approaches to improve the informativeness of deep learning models for human complex diseases

2020 ◽  
Author(s):  
Kushal K. Dey ◽  
Samuel S. Kim ◽  
Steven Gazal ◽  
Joseph Nasser ◽  
Jesse M. Engreitz ◽  
...  

AbstractDeep learning models have achieved great success in predicting genome-wide regulatory effects from DNA sequence, but recent work has reported that SNP annotations derived from these predictions contribute limited unique information for human complex disease. Here, we explore three integrative approaches to improve the disease informativeness of allelic-effect annotations (predicted difference between reference and variant alleles) constructed using two previously trained deep learning models, DeepSEA and Basenji. First, we employ gradient boosting to learn optimal combinations of deep learning annotations, using (off-chromosome) fine-mapped SNPs and matched control SNPs for training. Second, we improve the specificity of these annotations by restricting them to SNPs implicated by (proximal and distal) SNP-to-gene (S2G) linking strategies, e.g. prioritizing SNPs involved in gene regulation. Third, we predict gene expression (and derive allelic-effect annotations) from deep learning annotations at SNPs implicated by S2G linking strategies — generalizing the previously proposed ExPecto approach, which incorporates deep learning annotations based on distance to TSS. We evaluated these approaches using stratified LD score regression, using functional data in blood and focusing on 11 autoimmune diseases and blood-related traits (average N=306K). We determined that the three approaches produced SNP annotations that were uniquely informative for these diseases/traits, despite the fact that linear combinations of the underlying DeepSEA and Basenji blood annotations were not uniquely informative for these diseases/traits. Our results highlight the benefits of integrating SNP annotations produced by deep learning models with other types of data, including data linking SNPs to genes.

2019 ◽  
Author(s):  
Kushal K. Dey ◽  
Bryce Van de Geijn ◽  
Samuel Sungil Kim ◽  
Farhad Hormozdiari ◽  
David R. Kelley ◽  
...  

AbstractDeep learning models have shown great promise in predicting genome-wide regulatory effects from DNA sequence, but their informativeness for human complex diseases and traits is not fully understood. Here, we evaluate the disease informativeness of allelic-effect annotations (absolute value of the predicted difference between reference and variant alleles) constructed using two previously trained deep learning models, DeepSEA and Basenji. We apply stratified LD score regression (S-LDSC) to 41 independent diseases and complex traits (average N=320K) to evaluate each annotation’s informativeness for disease heritability conditional on a broad set of coding, conserved, regulatory and LD-related annotations from the baseline-LD model and other sources; as a secondary metric, we also evaluate the accuracy of models that incorporate deep learning annotations in predicting disease-associated or fine-mapped SNPs. We aggregated annotations across all tissues (resp. blood cell types or brain tissues) in meta-analyses across all 41 traits (resp. 11 blood-related traits or 8 brain-related traits). These allelic-effect annotations were highly enriched for disease heritability, but produced only limited conditionally significant results – only Basenji-H3K4me3 in meta-analyses across all 41 traits and brain-specific Basenji-H3K4me3 in meta-analyses across 8 brain-related traits. We conclude that deep learning models are yet to achieve their full potential to provide considerable amount of unique information for complex disease, and that the informativeness of deep learning models for disease beyond established functional annotations cannot be inferred from metrics based on their accuracy in predicting regulatory annotations.


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Kushal K. Dey ◽  
Bryce van de Geijn ◽  
Samuel Sungil Kim ◽  
Farhad Hormozdiari ◽  
David R. Kelley ◽  
...  

Abstract Deep learning models have shown great promise in predicting regulatory effects from DNA sequence, but their informativeness for human complex diseases is not fully understood. Here, we evaluate genome-wide SNP annotations from two previous deep learning models, DeepSEA and Basenji, by applying stratified LD score regression to 41 diseases and traits (average N = 320K), conditioning on a broad set of coding, conserved and regulatory annotations. We aggregated annotations across all (respectively blood or brain) tissues/cell-types in meta-analyses across all (respectively 11 blood or 8 brain) traits. The annotations were highly enriched for disease heritability, but produced only limited conditionally significant results: non-tissue-specific and brain-specific Basenji-H3K4me3 for all traits and brain traits respectively. We conclude that deep learning models have yet to achieve their full potential to provide considerable unique information for complex disease, and that their conditional informativeness for disease cannot be inferred from their accuracy in predicting regulatory annotations.


Author(s):  
Yuejun Liu ◽  
Yifei Xu ◽  
Xiangzheng Meng ◽  
Xuguang Wang ◽  
Tianxu Bai

Background: Medical imaging plays an important role in the diagnosis of thyroid diseases. In the field of machine learning, multiple dimensional deep learning algorithms are widely used in image classification and recognition, and have achieved great success. Objective: The method based on multiple dimensional deep learning is employed for the auxiliary diagnosis of thyroid diseases based on SPECT images. The performances of different deep learning models are evaluated and compared. Methods: Thyroid SPECT images are collected with three types, they are hyperthyroidism, normal and hypothyroidism. In the pre-processing, the region of interest of thyroid is segmented and the amount of data sample is expanded. Four CNN models, including CNN, Inception, VGG16 and RNN, are used to evaluate deep learning methods. Results: Deep learning based methods have good classification performance, the accuracy is 92.9%-96.2%, AUC is 97.8%-99.6%. VGG16 model has the best performance, the accuracy is 96.2% and AUC is 99.6%. Especially, the VGG16 model with a changing learning rate works best. Conclusion: The standard CNN, Inception, VGG16, and RNN four deep learning models are efficient for the classification of thyroid diseases with SPECT images. The accuracy of the assisted diagnostic method based on deep learning is higher than that of other methods reported in the literature.


2021 ◽  
Vol 42 (Supplement_1) ◽  
Author(s):  
M Lewis ◽  
J Figueroa

Abstract   Recent health reforms have created incentives for cardiologists and accountable care organizations to participate in value-based care models for heart failure (HF). Accurate risk stratification of HF patients is critical to efficiently deploy interventions aimed at reducing preventable utilization. The goal of this paper was to compare deep learning approaches with traditional logistic regression (LR) to predict preventable utilization among HF patients. We conducted a prognostic study using data on 93,260 HF patients continuously enrolled for 2-years in a large U.S. commercial insurer to develop and validate prediction models for three outcomes of interest: preventable hospitalizations, preventable emergency department (ED) visits, and preventable costs. Patients were split into training, validation, and testing samples. Outcomes were modeled using traditional and enhanced LR and compared to gradient boosting model and deep learning models using sequential and non-sequential inputs. Evaluation metrics included precision (positive predictive value) at k, cost capture, and Area Under the Receiver operating characteristic (AUROC). Deep learning models consistently outperformed LR for all three outcomes with respect to the chosen evaluation metrics. Precision at 1% for preventable hospitalizations was 43% for deep learning compared to 30% for enhanced LR. Precision at 1% for preventable ED visits was 39% for deep learning compared to 33% for enhanced LR. For preventable cost, cost capture at 1% was 30% for sequential deep learning, compared to 18% for enhanced LR. The highest AUROCs for deep learning were 0.778, 0.681 and 0.727, respectively. These results offer a promising approach to identify patients for targeted interventions. FUNDunding Acknowledgement Type of funding sources: Private company. Main funding source(s): internally funded by Diagnostic Robotics Inc.


Foods ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 809
Author(s):  
Liyang Wang ◽  
Dantong Niu ◽  
Xinjie Zhao ◽  
Xiaoya Wang ◽  
Mengzhen Hao ◽  
...  

Traditional food allergen identification mainly relies on in vivo and in vitro experiments, which often needs a long period and high cost. The artificial intelligence (AI)-driven rapid food allergen identification method has solved the above mentioned some drawbacks and is becoming an efficient auxiliary tool. Aiming to overcome the limitations of lower accuracy of traditional machine learning models in predicting the allergenicity of food proteins, this work proposed to introduce deep learning model—transformer with self-attention mechanism, ensemble learning models (representative as Light Gradient Boosting Machine (LightGBM) eXtreme Gradient Boosting (XGBoost)) to solve the problem. In order to highlight the superiority of the proposed novel method, the study also selected various commonly used machine learning models as the baseline classifiers. The results of 5-fold cross-validation showed that the area under the receiver operating characteristic curve (AUC) of the deep model was the highest (0.9578), which was better than the ensemble learning and baseline algorithms. But the deep model need to be pre-trained, and the training time is the longest. By comparing the characteristics of the transformer model and boosting models, it can be analyzed that, each model has its own advantage, which provides novel clues and inspiration for the rapid prediction of food allergens in the future.


2021 ◽  
Author(s):  
Liam Butler ◽  
Ibrahim Karabayir ◽  
Mohammad Samie Tootooni ◽  
Majid Afshar ◽  
Ari Goldberg ◽  
...  

Background: Patients admitted to the emergency department (ED) with COVID-19 symptoms are routinely required to have chest radiographs and computed tomography (CT) scans. COVID-19 infection has been directly related to development of acute respiratory distress syndrome (ARDS) and severe infections lead to admission to intensive care and can also lead to death. The use of clinical data in machine learning models available at time of admission to ED can be used to assess possible risk of ARDS, need for intensive care unit (ICU) admission as well as risk of mortality. In addition, chest radiographs can be inputted into a deep learning model to further assess these risks. Purpose: This research aimed to develop machine and deep learning models using both structured clinical data and image data from the electronic health record (EHR) to adverse outcomes following ED admission. Materials and Methods: Light Gradient Boosting Machines (LightGBM) was used as the main machine learning algorithm using all clinical data including 42 variables. Compact models were also developed using 15 the most important variables to increase applicability of the models in clinical settings. To predict risk of the aforementioned health outcome events, transfer learning from the CheXNet model was implemented on our data as well. This research utilized clinical data and chest radiographs of 3571 patients 18 years and older admitted to the emergency department between 9th March 2020 and 29th October 2020 at Loyola University Medical Center. Main Findings: Our research results show that we can detect COVID-19 infection (AUC = 0.790 (0.746-0.835)) and predict the risk of developing ARDS (AUC = 0.781 (0.690-0.872), ICU admission (AUC = 0.675 (0.620-0.713)), and mortality (AUC = 0.759 (0.678-0.840)) at moderate accuracy from both chest X-ray images and clinical data. Principal Conclusions: The results can help in clinical decision making, especially when addressing ARDS and mortality, during the assessment of patients admitted to the ED with or without COVID-19 symptoms.


eLife ◽  
2020 ◽  
Vol 9 ◽  
Author(s):  
Agata Wesolowska-Andersen ◽  
Grace Zhuo Yu ◽  
Vibe Nylander ◽  
Fernando Abaitua ◽  
Matthias Thurner ◽  
...  

Genome-wide association analyses have uncovered multiple genomic regions associated with T2D, but identification of the causal variants at these remains a challenge. There is growing interest in the potential of deep learning models - which predict epigenome features from DNA sequence - to support inference concerning the regulatory effects of disease-associated variants. Here, we evaluate the advantages of training convolutional neural network (CNN) models on a broad set of epigenomic features collected in a single disease-relevant tissue – pancreatic islets in the case of type 2 diabetes (T2D) - as opposed to models trained on multiple human tissues. We report convergence of CNN-based metrics of regulatory function with conventional approaches to variant prioritization – genetic fine-mapping and regulatory annotation enrichment. We demonstrate that CNN-based analyses can refine association signals at T2D-associated loci and provide experimental validation for one such signal. We anticipate that these approaches will become routine in downstream analyses of GWAS.


2017 ◽  
Vol 7 (7) ◽  
pp. 2271-2279 ◽  
Author(s):  
Chao Xu ◽  
Ji-Gang Zhang ◽  
Dongdong Lin ◽  
Lan Zhang ◽  
Hui Shen ◽  
...  

Abstract Integrating diverse genomics data can provide a global view of the complex biological processes related to the human complex diseases. Although substantial efforts have been made to integrate different omics data, there are at least three challenges for multi-omics integration methods: (i) How to simultaneously consider the effects of various genomic factors, since these factors jointly influence the phenotypes; (ii) How to effectively incorporate the information from publicly accessible databases and omics datasets to fully capture the interactions among (epi)genomic factors from diverse omics data; and (iii) Until present, the combination of more than two omics datasets has been poorly explored. Current integration approaches are not sufficient to address all of these challenges together. We proposed a novel integrative analysis framework by incorporating sparse model, multivariate analysis, Gaussian graphical model, and network analysis to address these three challenges simultaneously. Based on this strategy, we performed a systemic analysis for glioblastoma multiforme (GBM) integrating genome-wide gene expression, DNA methylation, and miRNA expression data. We identified three regulatory modules of genomic factors associated with GBM survival time and revealed a global regulatory pattern for GBM by combining the three modules, with respect to the common regulatory factors. Our method can not only identify disease-associated dysregulated genomic factors from different omics, but more importantly, it can incorporate the information from publicly accessible databases and omics datasets to infer a comprehensive interaction map of all these dysregulated genomic factors. Our work represents an innovative approach to enhance our understanding of molecular genomic mechanisms underlying human complex diseases.


Sign in / Sign up

Export Citation Format

Share Document