Systematic Prediction of Regulatory Motifs from Human ChIP-Sequencing Data Based on a Deep Learning Framework

ABSTRACT Identification of transcription factor binding sites (TFBSs) and cis-regulatory motifs (motifs for short) from genomics datasets provides a powerful view of the rules governing the interactions between TFs and DNA. Existing motif prediction methods, however, are limited by high false positive rates in TFBS identification, contributions from non-sequence-specific binding, and complex and indirect binding mechanisms. High-throughput next-generation sequencing data provides unprecedented opportunities to overcome these difficulties, as it offers multiple whole-genome scale measurements of TF binding information. Uncovering this information brings new computational and modeling challenges in high-dimensional data mining and heterogeneous data integration. To improve the accuracy of TFBS identification and novel motif prediction in the human genome, we developed an advanced computational technique based on deep learning (DL) and high-performance computing, named DESSO. DESSO utilizes deep neural networks and the binomial distribution to optimize motif prediction. Our results showed that DESSO outperformed existing tools in predicting distinct motifs from the 690 in vivo ENCODE ChIP-Sequencing (ChIP-Seq) datasets for 161 human TFs in 91 cell lines. We also found that protein-protein interactions (PPIs) are prevalent among human TFs, and a total of 61 potential tethering binding events were identified among the 100 TFs in the K562 cell line. To further expand DESSO’s deep-learning capabilities, we included DNA shape features and found that (i) shape information has strong predictive power for TF-DNA binding specificity; and (ii) it aided in the identification of the shape motifs recognized by human TFs, which in turn contributed to the interpretation of TF-DNA binding in the absence of sequence recognition. DESSO and the analyses it enables will continue to improve our understanding of how gene expression is controlled by TFs and of the complexities of DNA binding.
The source code and the predicted motifs and TFBSs from the 690 ENCODE TF ChIP-Seq datasets are freely available at the DESSO web server: http://bmbl.sdstate.edu/DESSO.
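
As an illustration of how a sequence-based deep learning model of this kind can consume DNA, the following minimal sketch one-hot encodes a sequence and slides a single convolution-like filter (a "motif detector") across it. The function names, the toy filter weights, and the example sequence are assumptions made for illustration and are not taken from DESSO's source code.

```python
# Hypothetical sketch: one-hot encoding plus a single sliding "motif detector".
# Names and weights are illustrative and are not taken from DESSO.

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows ordered A, C, G, T."""
    rows = []
    for base in seq.upper():
        # Bases outside ACGT (for example N) become an all-zero row.
        rows.append([1.0 if base == letter else 0.0 for letter in ALPHABET])
    return rows


def scan(encoded, detector):
    """Slide the detector over the sequence and return one score per window."""
    width = len(detector)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * detector[offset][channel]
        scores.append(score)
    return scores


# Toy detector that rewards the 3-mer "TGA" (weights are made up).
toy_detector = [
    [0.0, 0.0, 0.0, 1.0],  # position 1 prefers T
    [0.0, 0.0, 1.0, 0.0],  # position 2 prefers G
    [1.0, 0.0, 0.0, 0.0],  # position 3 prefers A
]

scores = scan(one_hot("ACTGACGT"), toy_detector)
# Index of the highest-scoring window, i.e. the candidate motif instance.
best_start = max(range(len(scores)), key=scores.__getitem__)
```

In a trained network the detector weights would be learned from the ChIP-seq data rather than set by hand, and the highest-scoring windows would be the candidate binding sites.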

Prediction of regulatory motifs from human Chip-sequencing data using a deep learning framework

Nucleic Acids Research ◽

10.1093/nar/gkz672 ◽

2019 ◽

Vol 47 (15) ◽

pp. 7809-7824 ◽

Cited By ~ 7

Author(s):

Jinyu Yang ◽

Anjun Ma ◽

Adam D Hoppe ◽

Cankun Wang ◽

Yang Li ◽

...

Keyword(s):

Deep Learning ◽

DNA Binding ◽

Binding Sites ◽

Motif Discovery ◽

Distribution Model ◽

Sequencing Data ◽

Shape Information ◽

ChIP Sequencing ◽

Regulatory Motifs ◽

Learning Framework

Abstract The identification of transcription factor binding sites and cis-regulatory motifs is a frontier where the rules governing protein–DNA binding are being revealed. Here, we developed a new method (DEep Sequence and Shape mOtif or DESSO) for cis-regulatory motif prediction using deep neural networks and the binomial distribution model. DESSO outperformed existing tools, including DeepBind, in predicting motifs in 690 human ENCODE ChIP-sequencing datasets. Furthermore, the deep-learning framework of DESSO expanded motif discovery beyond the state of the art by allowing the identification of known and new protein–protein–DNA tethering interactions in human transcription factors (TFs). Specifically, 61 putative tethering interactions were identified among the 100 TFs expressed in the K562 cell line. In this work, the power of DESSO was further expanded by integrating the detection of DNA shape features. We found that shape information has strong predictive power for TF–DNA binding and provides new putative shape motif information for human TFs. Thus, DESSO improves the identification and structural analysis of TF binding sites by integrating the complexities of DNA binding into a deep-learning framework.
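
The abstract mentions a binomial distribution model without giving details. One common way such a model is used in motif discovery, sketched here under that assumption, is to ask whether sequences containing a candidate motif instance are over-represented among ChIP-seq peaks relative to background sequences. The function name `binomial_enrichment_p` and the counts below are hypothetical and are not taken from DESSO.

```python
import math


def binomial_enrichment_p(hits_in_peaks, n_peaks, hits_in_background, n_background):
    """Upper-tail binomial p-value P(X >= hits_in_peaks).

    X ~ Binomial(n_peaks, p0), where p0 is the fraction of background
    sequences that contain a motif instance (an assumed null model).
    """
    p0 = hits_in_background / n_background
    tail = 0.0
    for k in range(hits_in_peaks, n_peaks + 1):
        tail += math.comb(n_peaks, k) * p0 ** k * (1.0 - p0) ** (n_peaks - k)
    # Guard against floating-point sums that land marginally above 1.
    return min(tail, 1.0)


# Made-up counts: 50 peak sequences and 50 background sequences.
# 30 peaks contain the motif versus 10 background sequences.
p_enriched = binomial_enrichment_p(30, 50, 10, 50)
# Same hit rate in peaks and background: no evidence of enrichment.
p_null = binomial_enrichment_p(10, 50, 10, 50)
```

A small p-value suggests that the candidate motif occurs in peaks more often than the background rate would explain; the score cut-off that defines a "hit" could then be chosen to minimize this p-value.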

A Deep Learning Approach for Detecting Copy Number Variation in Next-Generation Sequencing Data

G3 Genes|Genome|Genetics ◽

10.1534/g3.119.400596 ◽

2019 ◽

Vol 9 (11) ◽

pp. 3575-3582 ◽

Cited By ~ 5

Author(s):

Tom Hill ◽

Robert L. Unckless

Keyword(s):

Deep Learning ◽

Next Generation Sequencing ◽

Copy Number Variation ◽

Copy Number ◽

Next Generation Sequencing Data ◽

Learning Approach ◽

Next Generation ◽

Sequencing Data ◽

Number Variation ◽

Generation Sequencing

Deep Learning Applied on Next Generation Sequencing Data Analysis

Methods in Molecular Biology - Deep Sequencing Data Analysis ◽

10.1007/978-1-0716-1103-6_9 ◽

2021 ◽

pp. 169-182

Author(s):

Artem Danilevsky ◽

Noam Shomron

Keyword(s):

Deep Learning ◽

Data Analysis ◽

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing ◽

Sequencing Data Analysis

Adversarial Vulnerability of Deep Learning Models in Analyzing Next Generation Sequencing Data

2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm49941.2020.9313421 ◽

2020 ◽

Author(s):

Amiel Meiseles ◽

Ishai Rosenberg ◽

Yair Motro ◽

Lior Rokach ◽

Jacob Moran-Gilad

Keyword(s):

Deep Learning ◽

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Learning Models ◽

Sequencing Data ◽

Generation Sequencing

Application of Deep Learning in Predicting the Prognosis of Acute Myeloid Leukemia using Cytogenetics, Age, and Mutations

Clinical Oncology and Research ◽

10.31487/j.cor.2020.03.01 ◽

2020 ◽

pp. 1-6

Author(s):

Andy N.D. Nguyen ◽

Adan Rios ◽

Brenda Mai ◽

Hanadi El Achi ◽

...

Keyword(s):

Acute Myeloid Leukemia ◽

Deep Learning ◽

Myeloid Leukemia ◽

The Cancer Genome Atlas ◽

Next Generation Sequencing Data ◽

Common Mutation ◽

Sequencing Data ◽

Learning Network ◽

Deep Learning Network ◽

Acute Myeloid

Objective: We explored how Deep Learning can be utilized to predict the prognosis of acute myeloid leukemia. Methods: From The Cancer Genome Atlas database, 94 acute myeloid leukemia cases were used in this study. Input data included age, the 10 most common cytogenetic results, and the 23 most common mutation results; the output was the prognosis (time from diagnosis to death). In our Deep Learning network, autoencoders were stacked to form a hierarchical Deep Learning model in which raw data were compressed and organized, and high-level features were extracted. The network was written in R and was designed to predict the prognosis of acute myeloid leukemia for a given case (time from diagnosis to death of either more or less than 730 days). Results: The Deep Learning network achieved an excellent accuracy of 83% in predicting prognosis. Conclusion: As a proof-of-concept study, our preliminary results demonstrated a practical application of Deep Learning in the future practice of prognostic prediction using next-generation sequencing data.
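
The outcome definition and the accuracy figure can be made concrete with a small sketch. The abstract states only that cases were split at 730 days from diagnosis to death; which class the boundary value itself belongs to, the function names, and the survival times below are assumptions for illustration and are not data from the study.

```python
THRESHOLD_DAYS = 730  # cut-off stated in the abstract


def prognosis_label(days_to_death):
    """Return 1 for survival longer than the threshold, else 0.

    Assigning exactly 730 days to class 0 is an assumption.
    """
    return 1 if days_to_death > THRESHOLD_DAYS else 0


def accuracy(true_labels, predicted_labels):
    """Fraction of cases whose predicted class matches the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Made-up survival times (days) and made-up model predictions.
survival_days = [120, 800, 730, 1500, 365, 900]
true_labels = [prognosis_label(d) for d in survival_days]
predicted = [0, 1, 1, 1, 0, 0]
acc = accuracy(true_labels, predicted)
```

With only 94 cases, an accuracy estimate of this kind carries wide uncertainty, which is consistent with the authors describing the work as a proof of concept.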

VC@Scale: Scalable and high-performance variant calling on cluster environments

GigaScience ◽

10.1093/gigascience/giab057 ◽

2021 ◽

Vol 10 (9) ◽

Author(s):

Tanveer Ahmad ◽

Zaid Al Ars ◽

H Peter Hofstee

Keyword(s):

Deep Learning ◽

High Performance ◽

High Efficiency ◽

Variant Calling ◽

Apache Spark ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Single Node ◽

Loosely Coupled ◽

GPU Clusters

Abstract Background Recently, many new deep learning–based variant-calling methods such as DeepVariant have emerged that are more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and FreeBayes, albeit at higher computational cost. Therefore, there is a need for more scalable and higher performance workflows of these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark just for distributing/scheduling data among loosely coupled applications or using I/O-based storage for storing the output of intermediate applications does not exploit the full benefit of Apache Spark in-memory processing. To achieve this, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between different workflow stages. This benefits from the ease of programmability of Python and the high efficiency of Arrow’s columnar in-memory data transformations. Results Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by >2 times for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant for both CPU-only and CPU + GPU clusters. Conclusions We show the feasibility and easy scalability of our approach to achieve high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations.
All code, scripts, and configurations used to run our implementations are publicly available and open source; see https://github.com/abs-tudelft/variant-calling-at-scale.
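
The pre-processing steps named in the abstract, coordinate sorting and duplicate marking, can be sketched on a single machine as follows. This is a simplified illustration in plain Python rather than Spark, and it is not the VC@Scale implementation: the `Read` fields and the rule of keeping the highest-quality read per (chromosome, position, strand) group are assumptions, and production tools apply more detailed criteria.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    chrom: str
    pos: int          # leftmost aligned position
    strand: str       # "+" or "-"
    quality: int      # summed base quality, used to pick the read to keep
    duplicate: bool = False


def sort_by_coordinate(reads):
    """Return reads ordered by chromosome name, then position.

    Chromosome names are compared as plain strings, a simplification.
    """
    return sorted(reads, key=lambda r: (r.chrom, r.pos))


def mark_duplicates(reads):
    """Flag all but the best read in each (chrom, pos, strand) group."""
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        if key not in best or read.quality > best[key].quality:
            best[key] = read
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        read.duplicate = best[key] is not read
    return reads


# Made-up reads: r2 and r3 share chromosome, position, and strand.
reads = [
    Read("r1", "chr2", 500, "+", 60),
    Read("r2", "chr1", 100, "+", 55),
    Read("r3", "chr1", 100, "+", 70),
    Read("r4", "chr1", 100, "-", 40),
]
processed = mark_duplicates(sort_by_coordinate(reads))
order = [r.name for r in processed]
flagged = [r.name for r in processed if r.duplicate]
```

In a distributed setting the same grouping key would determine how reads are partitioned across workers, so that every candidate duplicate set is resolved within one partition.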

Faculty Opinions recommendation of VarWalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718272765.793499663 ◽

2014 ◽

Author(s):

Gary Bader ◽

Mohamed Helmy

Keyword(s):

Next Generation Sequencing ◽

Network Analysis ◽

Next Generation Sequencing Data ◽

Cancer Genes ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Faculty Opinions recommendation of Bioinformatory-assisted analysis of next-generation sequencing data for precision medicine in pancreatic cancer.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.727775566.793536095 ◽

2017 ◽

Author(s):

Steve Pereira

Keyword(s):

Pancreatic Cancer ◽

Next Generation Sequencing ◽

Precision Medicine ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Assisted Analysis ◽

Generation Sequencing

Advancing clinical genomics and precision medicine with GVViZ: FAIR bioinformatics platform for variable gene-disease annotation, visualization, and expression analysis

Human Genomics ◽

10.1186/s40246-021-00336-1 ◽

2021 ◽

Vol 15 (1) ◽

Author(s):

Zeeshan Ahmed ◽

Eduard Gibert Renart ◽

Saman Zeeshan ◽

XinQi Dong

Keyword(s):

Data Analysis ◽

Patient Care ◽

Expression Analysis ◽

High Throughput ◽

Gene Annotation ◽

Next Generation Sequencing Data ◽

RNA-Seq ◽

Sequencing Data ◽

Complex Disorders ◽

Transcriptomics Data

Abstract Background Genetic disposition is considered critical for identifying subjects at high risk for disease development. Investigating disease-causing genes and genes with high or low expression can support finding the root causes of uncertainties in patient care. However, independent and timely high-throughput next-generation sequencing data analysis is still a challenge for non-computational biologists and geneticists. Results In this manuscript, we present a findable, accessible, interactive, and reusable (FAIR) bioinformatics platform, i.e., GVViZ (visualizing genes with disease-causing variants). GVViZ is a user-friendly, cross-platform database application for RNA-seq-driven variable and complex gene-disease data annotation and expression analysis with a dynamic heat map visualization. GVViZ has the potential to find patterns across millions of features and extract actionable information, which can support the early detection of complex disorders and the development of new therapies for personalized patient care. The execution of GVViZ is based on a set of simple instructions that users without a computational background can follow to design and perform customized data analysis. It can assimilate patients’ transcriptomics data with public, proprietary, and our in-house gene-disease databases to query, easily explore, and access information on gene annotation and classified disease phenotypes with greater visibility and customization. To test its performance and understand the clinical and scientific impact of GVViZ, we present GVViZ analysis for different chronic diseases and conditions, including Alzheimer’s disease, arthritis, asthma, diabetes mellitus, heart failure, hypertension, obesity, osteoporosis, and multiple cancer disorders.
The results are visualized using GVViZ and can be exported as image (PNG/TIFF) and text (CSV) files that include gene names, Ensembl (ENSG) IDs, quantified abundances, expressed transcript lengths, and annotated oncology and non-oncology diseases. Conclusions We emphasize that automated and interactive visualization should be an indispensable component of modern RNA-seq analysis, which is currently not the case. However, experts in clinics and researchers in life sciences can use GVViZ to visualize and interpret the transcriptomics data, making it a powerful tool to study the dynamics of gene expression and regulation. Furthermore, with successful deployment in clinical settings, GVViZ has the potential to enable high-throughput correlations between patient diagnoses based on clinical and transcriptomics data.

NGSremix: A software tool for estimating pairwise relatedness between admixed individuals from next-generation sequencing data

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab174 ◽

2021 ◽

Author(s):

Anne Krogh Nøhr ◽

Kristian Hanghøj ◽

Genis Garcia Erill ◽

Zilong Li ◽

Ida Moltke ◽

...

Keyword(s):

Next Generation Sequencing ◽

Genetic Research ◽

Likelihood Estimation ◽

Software Tool ◽

Estimation Methods ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

NGS Data ◽

Generation Sequencing

Abstract Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if it is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data, from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below, we show that our method works well and clearly outperforms the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate, all of which perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C++ as multi-threaded software and is freely available on GitHub: https://github.com/KHanghoj/NGSremix.
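
Genotype likelihoods are what allow low-depth data to be used without hard genotype calls. The sketch below computes them for one biallelic site under a simple, commonly used sequencing-error model; it illustrates the concept and is not NGSremix's code. The function names, the fixed error rate, and the example reads are assumptions.

```python
def genotype_likelihoods(bases, ref, alt, error=0.01):
    """Return [L(g=0), L(g=1), L(g=2)], where g counts copies of `alt`.

    Assumes independent reads and a uniform error rate: a read shows the
    true allele with probability 1 - error and each other base with
    probability error / 3.
    """
    def p_base_given_allele(base, allele):
        return 1.0 - error if base == allele else error / 3.0

    likelihoods = []
    for alt_copies in (0, 1, 2):
        # Probability that a read was sampled from the alt chromosome.
        alt_fraction = alt_copies / 2.0
        likelihood = 1.0
        for base in bases:
            likelihood *= (
                alt_fraction * p_base_given_allele(base, alt)
                + (1.0 - alt_fraction) * p_base_given_allele(base, ref)
            )
        likelihoods.append(likelihood)
    return likelihoods


def normalize(likelihoods):
    """Scale likelihoods so that they sum to 1 for easier comparison."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


# Made-up data: two reads at one site, one showing A and one showing G.
gl_two_reads = normalize(genotype_likelihoods("AG", ref="A", alt="G"))
# A single reference read cannot rule out a heterozygous genotype.
gl_one_read = normalize(genotype_likelihoods("A", ref="A", alt="G"))
```

With a single read, the heterozygous genotype keeps about one third of the normalized likelihood, which shows why calling a hard genotype at low depth discards information that a likelihood-based method can retain.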
