A simple convolutional neural network for prediction of enhancer–promoter interactions with DNA sequence data

Abstract Motivation Enhancer–promoter interactions (EPIs) in the genome play an important role in transcriptional regulation. EPIs can be useful in boosting statistical power and enhancing mechanistic interpretation for disease- or trait-associated genetic variants in genome-wide association studies. Instead of expensive and time-consuming biological experiments, computational prediction of EPIs with DNA sequence and other genomic data is a fast and viable alternative. In particular, deep learning and other machine learning methods have been demonstrated with promising performance. Results First, using a published human cell line dataset, we demonstrate that a simple convolutional neural network (CNN) performs as well as, if not better than, a more complicated and state-of-the-art architecture, a hybrid of a CNN and a recurrent neural network. More importantly, in spite of the well-known cell line-specific EPIs (and corresponding gene expression), in contrast to the standard practice of training and predicting for each cell line separately, we propose two transfer learning approaches to training a model using all cell lines to various extents, leading to substantially improved predictive performance. Availability and implementation Computer code is available at https://github.com/zzUMN/Combine-CNN-Enhancer-and-Promoters. Supplementary information Supplementary data are available at Bioinformatics online.
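
As a rough illustration of what a convolutional layer computes on DNA sequence input, the sketch below one-hot encodes a sequence, slides a single filter along it, and takes the global maximum. It is not the authors' model: the filter, the example sequence, and the function names (`one_hot`, `conv1d`, `global_max_pool`) are hypothetical and chosen only for demonstration.

```python
# Illustrative sketch only; not the model or code from the paper.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).
    Bases outside ACGT, such as N, become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d(encoded, kernel):
    """Slide one filter (width x 4 weights) along the sequence.
    Uses valid padding and a ReLU activation."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(max(0.0, total))
    return scores

def global_max_pool(scores):
    """Keep the strongest activation along the sequence."""
    return max(scores) if scores else 0.0

# Hypothetical filter that responds most strongly to the motif "TATA".
tata_filter = one_hot("TATA")
enhancer = "GGCTATAAGC"  # made-up sequence
activations = conv1d(one_hot(enhancer), tata_filter)
print(global_max_pool(activations))
```

In a trained network the filter weights would be learned rather than set by hand, and many filters would be applied to the enhancer and the promoter sequence before their outputs are combined.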

Convolutional neural network model to predict causal risk factors that share complex regulatory features

Nucleic Acids Research ◽

10.1093/nar/gkz868 ◽

2019 ◽

Vol 47 (22) ◽

pp. e146-e146 ◽

Cited By ~ 3

Author(s):

Taeyeop Lee ◽

Min Kyung Sung ◽

Seulkee Lee ◽

Woojin Yang ◽

Jaeho Oh ◽

...

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Association Studies ◽

Explanatory Power ◽

Physical Contact ◽

Regulatory Function ◽

Genome Wide Association Studies ◽

Functional Interpretation ◽

Causal Variants ◽

Functional Features

Abstract Major progress in disease genetics has been made through genome-wide association studies (GWASs). One of the key tasks for post-GWAS analyses is to identify causal noncoding variants with regulatory function. Here, on the basis of >2000 functional features, we developed a convolutional neural network framework for combinatorial, nonlinear modeling of complex patterns shared by risk variants scattered among multiple associated loci. When applied for major psychiatric disorders and autoimmune diseases, neural and immune features, respectively, exhibited high explanatory power while reflecting the pathophysiology of the relevant disease. The predicted causal variants were concentrated in active regulatory regions of relevant cell types and tended to be in physical contact with transcription factors while residing in evolutionarily conserved regions and resulting in expression changes of genes related to the given disease. We demonstrate some examples of novel candidate causal variants and associated genes. Our method is expected to contribute to the identification and functional interpretation of potential causal noncoding variants in post-GWAS analyses.

Convolutional neural network model to predict causal risk factors that share complex regulatory features

10.1101/725309 ◽

2019 ◽

Author(s):

Taeyeop Lee ◽

Min Kyung Sung ◽

Seulkee Lee ◽

Woojin Yang ◽

Jaeho Oh ◽

...

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Association Studies ◽

Explanatory Power ◽

Physical Contact ◽

Regulatory Function ◽

Genome Wide Association Studies ◽

Functional Interpretation ◽

Causal Variants ◽

Functional Features

Abstract Major progress in disease genetics has been made through genome-wide association studies (GWASs). One of the key tasks for post-GWAS analyses is to identify causal noncoding variants with regulatory function. Here, on the basis of > 2,000 functional features, we developed a convolutional neural network framework for combinatorial, nonlinear modeling of complex patterns shared by risk variants scattered among multiple associated loci. When applied for major psychiatric disorders and autoimmune diseases, neural and immune features, respectively, exhibited high explanatory power while reflecting the pathophysiology of the relevant disease. The predicted causal variants were concentrated in active regulatory regions of relevant cell types and tended to be in physical contact with transcription factors while residing in evolutionarily conserved regions and resulting in expression changes of genes related to the given disease. We demonstrate some examples of novel candidate causal variants and associated genes. Our method is expected to contribute to the identification and functional interpretation of causal noncoding variants in post-GWAS analyses.

Genotype Imputation from Large Reference Panels

Annual Review of Genomics and Human Genetics ◽

10.1146/annurev-genom-083117-021602 ◽

2018 ◽

Vol 19 (1) ◽

pp. 73-96 ◽

Cited By ~ 32

Author(s):

Sayantan Das ◽

Gonçalo R. Abecasis ◽

Brian L. Browning

Keyword(s):

Statistical Power ◽

Sequence Data ◽

Single Nucleotide Polymorphism Array ◽

Association Studies ◽

Genotype Imputation ◽

Genome Wide Association ◽

Whole Genome Sequence ◽

Computational Techniques ◽

Genome Wide Association Studies ◽

Genome Wide

Genotype imputation has become a standard tool in genome-wide association studies because it enables researchers to inexpensively approximate whole-genome sequence data from genome-wide single-nucleotide polymorphism array data. Genotype imputation increases statistical power, facilitates fine mapping of causal variants, and plays a key role in meta-analyses of genome-wide association studies. Only variants that were previously observed in a reference panel of sequenced individuals can be imputed. However, the rapid increase in the number of deeply sequenced individuals will soon make it possible to assemble enormous reference panels that greatly increase the number of imputable variants. In this review, we present an overview of genotype imputation and describe the computational techniques that make it possible to impute genotypes from reference panels with millions of individuals.

DeepSSV: detecting somatic small variants in paired tumor and normal sequencing data with convolutional neural network

10.1101/555680 ◽

2019 ◽

Author(s):

Jing Meng ◽

Brandon Victor ◽

Zhen He ◽

Agus Salim

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Somatic Mutations ◽

Ground Truth ◽

Variant Allele ◽

Supplementary Information ◽

Learning Approaches ◽

Limited Information ◽

Sequencing Data ◽

Reference Allele

Abstract Motivation It is of considerable interest to detect somatic mutations in paired tumor and normal sequencing data. A number of callers that are based on statistical or machine learning approaches have been developed to detect somatic small variants. However, they take into consideration only limited information about the reference and potential variant allele in both samples at a candidate somatic site. Also, they differ in how biological and technological noise is addressed. Hence, they are expected to produce divergent outputs. Results To overcome the drawbacks of existing somatic callers, we develop a deep learning-based tool called DeepSSV, which employs a convolutional neural network (CNN) model to learn increasingly abstract feature representations from the raw data in higher feature layers. DeepSSV creates a spatially-oriented representation of read alignments around the candidate somatic sites adapted for the convolutional architecture, which enables it to expand to effectively gather scattered evidence. Moreover, DeepSSV incorporates the mapping information of both reference-allele-supporting and variant-allele-supporting reads in the tumor and normal samples at a genomic site that are readily available in the pileup format file. Together, the CNN model can process the whole alignment information. Such representational richness allows the model to capture the dependencies in the sequence and identify context-based sequencing artifacts, and alleviates the need for post-call filters that heavily depend on prior knowledge. We fitted the model on ground truth somatic mutations and ran benchmarking experiments on simulated and real tumors. The benchmarking results demonstrate that DeepSSV outperforms its state-of-the-art competitors in overall F1 score. Availability and Implementation https://github.com/jingmeng-bioinformatics/[email protected] Supplementary information Supplementary data are available online.
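
For readers unfamiliar with the F1 score used in the benchmark above, the sketch below computes it from counts of true positive, false positive, and false negative calls. The function name and the example counts are hypothetical and are not results from the paper.

```python
# Illustrative sketch only; the counts below are made up.
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for a set of variant calls."""
    predicted = true_positives + false_positives  # calls the tool made
    actual = true_positives + false_negatives     # variants in the truth set
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical caller: 90 correct calls, 10 spurious calls, 30 missed variants.
example_f1 = f1_score(90, 10, 30)
print(round(example_f1, 4))
```

Because F1 combines precision and recall, a caller cannot score well by being very conservative or very permissive alone.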

A phylogenetic method to perform genome-wide association studies in microbes that accounts for population structure and recombination

10.1101/140798 ◽

2017 ◽

Cited By ~ 4

Author(s):

Caitlin Collins ◽

Xavier Didelot

Keyword(s):

Population Structure ◽

Statistical Power ◽

Sequence Data ◽

Association Studies ◽

Strong Support ◽

Simulated Data ◽

Invasive Disease ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

Abstract Genome-Wide Association Studies (GWAS) in microbial organisms have the potential to vastly improve the way we understand, manage, and treat infectious diseases. Yet, GWAS methods established thus far remain insufficiently able to capitalise on the growing wealth of bacterial and viral genetic sequence data. Facing clonal population structure and homologous recombination, existing GWAS methods struggle to achieve both the precision necessary to reject spurious findings and the power required to detect associations in microbes. In this paper, we introduce a novel phylogenetic approach that has been tailor-made for microbial GWAS, which is applicable to organisms ranging from purely clonal to frequently recombining, and to both binary and continuous phenotypes. Our approach is robust to the confounding effects of both population structure and recombination, while maintaining high statistical power to detect associations. Thorough testing via application to simulated data provides strong support for the power and specificity of our approach and demonstrates the advantages offered over alternative cluster-based and dimension-reduction methods. Two applications to Neisseria meningitidis illustrate the versatility and potential of our method, confirming previously-identified penicillin resistance loci and resulting in the identification of both well-characterised and novel drivers of invasive disease. Our method is implemented as an open-source R package called treeWAS which is freely available at https://github.com/caitiecollins/treeWAS.

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

Nature ◽

10.1038/s41586-021-03205-y ◽

2021 ◽

Vol 590 (7845) ◽

pp. 290-299 ◽

Cited By ~ 22

Author(s):

Daniel Taliun ◽

Daniel N. Harris ◽

Michael D. Kessler ◽

Jedidiah Carlson ◽

...

Keyword(s):

Rare Variants ◽

Sequence Data ◽

Association Studies ◽

Genotype Imputation ◽

Genome Wide Association Studies ◽

Phenotypic Data ◽

Treatment And Prevention ◽

Genome Wide ◽

Diverse Backgrounds ◽

Unmapped Reads

Abstract The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes)1. In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.
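
The percentages quoted above can be turned into approximate variant counts. The sketch below assumes exactly 400 million variants, which the abstract gives only as a lower bound ("more than 400 million"), so the resulting counts are rough lower bounds too.

```python
# Back-of-the-envelope arithmetic; 400 million is the abstract's lower bound.
from fractions import Fraction

total_variants = 400_000_000

rare_variants = total_variants * 97 // 100  # frequency below 1%
singletons = total_variants * 46 // 100     # seen in only one individual

# 0.01% expressed as an exact fraction of alleles.
min_frequency = Fraction(1, 100) / 100

print(rare_variants, singletons, min_frequency)
```

Under that assumption, roughly 388 million variants are rare and roughly 184 million are singletons, and an imputation floor of 0.01% corresponds to about one allele in 10,000.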

Common genetic variants with fetal effects on birth weight are enriched for proximity to genes implicated in rare developmental disorders

Human Molecular Genetics ◽

10.1093/hmg/ddab060 ◽

2021 ◽

Author(s):

Robin N Beaumont ◽

Isabelle K Mayne ◽

Rachel M Freathy ◽

Caroline F Wright

Keyword(s):

Birth Weight ◽

Statistical Power ◽

Developmental Disorders ◽

Association Studies ◽

Later Life ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Genome Wide ◽

Common Genetic Variants ◽

Causal Genes

Abstract Birth weight is an important factor in newborn survival; both low and high birth weights are associated with adverse later-life health outcomes. Genome-wide association studies (GWAS) have identified 190 loci associated with maternal or fetal effects on birth weight. Knowledge of the underlying causal genes is crucial to understand how these loci influence birth weight and the links between infant and adult morbidity. Numerous monogenic developmental syndromes are associated with birth weights at the extreme ends of the distribution. Genes implicated in those syndromes may provide valuable information to prioritize candidate genes at the GWAS loci. We examined the proximity of genes implicated in developmental disorders (DDs) to birth weight GWAS loci using simulations to test whether they fall disproportionately close to the GWAS loci. We found birth weight GWAS single nucleotide polymorphisms (SNPs) fall closer to such genes than expected both when the DD gene is the nearest gene to the birth weight SNP and also when examining all genes within 258 kb of the SNP. This enrichment was driven by genes causing monogenic DDs with dominant modes of inheritance. We found examples of SNPs in the intron of one gene marking plausible effects via different nearby genes, highlighting the closest gene to the SNP not necessarily being the functionally relevant gene. This is the first application of this approach to birth weight, which has helped identify GWAS loci likely to have direct fetal effects on birth weight, which could not previously be classified as fetal or maternal owing to insufficient statistical power.
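
A simulation-based proximity test of the kind described above can be sketched as follows. This is a toy version under stated assumptions, not the authors' procedure: the genome is a single line of made-up coordinates, the null model simply draws random gene sets of the same size as the DD gene set, and all names (`nearest_gene`, `count_hits`, `empirical_p`) are hypothetical.

```python
# Toy sketch of a simulation-based enrichment test; all data are made up.
import random

def nearest_gene(snp_pos, gene_positions):
    """Return the name of the gene closest to a SNP position."""
    return min(gene_positions, key=lambda g: abs(gene_positions[g] - snp_pos))

def count_hits(snps, gene_positions, gene_set):
    """Count SNPs whose nearest gene belongs to gene_set."""
    return sum(1 for s in snps if nearest_gene(s, gene_positions) in gene_set)

def empirical_p(snps, gene_positions, dd_genes, n_sim=1000, seed=0):
    """Compare the observed hit count with counts for random gene sets.
    Assumed null: gene sets of the same size drawn uniformly at random."""
    rng = random.Random(seed)
    observed = count_hits(snps, gene_positions, dd_genes)
    names = sorted(gene_positions)
    as_extreme = 0
    for _ in range(n_sim):
        null_set = set(rng.sample(names, len(dd_genes)))
        if count_hits(snps, gene_positions, null_set) >= observed:
            as_extreme += 1
    # Add-one correction keeps the empirical p-value above zero.
    return observed, (as_extreme + 1) / (n_sim + 1)

genes = {"g1": 100, "g2": 500, "g3": 900, "g4": 1300, "g5": 1700, "g6": 2100}
dd_genes = {"g2", "g5"}          # hypothetical developmental-disorder genes
snps = [480, 520, 1690, 1750]    # hypothetical GWAS SNP positions

observed, p_value = empirical_p(snps, genes, dd_genes)
print(observed, p_value)
```

A realistic analysis would need a null that respects gene density, gene length, and linkage disequilibrium, which this toy example ignores.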

Statistical power and utility of meta-analysis methods for cross-phenotype genome-wide association studies

PLoS ONE ◽

10.1371/journal.pone.0193256 ◽

2018 ◽

Vol 13 (3) ◽

pp. e0193256 ◽

Cited By ~ 13

Author(s):

Zhaozhong Zhu ◽

Verneri Anttila ◽

Jordan W. Smoller ◽

Phil H. Lee

Keyword(s):

Statistical Power ◽

Association Studies ◽

Meta Analysis ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Analysis Methods ◽

Genome Wide

GWASpro: a high-performance genome-wide association analysis server

Bioinformatics ◽

10.1093/bioinformatics/bty989 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2512-2514 ◽

Cited By ~ 4

Author(s):

Bongsong Kim ◽

Xinbin Dai ◽

Wenchao Zhang ◽

Zhaohong Zhuang ◽

Darlene L Sanchez ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Linear Mixed Model ◽

Association Studies ◽

Learning Curves ◽

Experimental Designs ◽

Genome Wide Association ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Genome Wide

Abstract Summary We present GWASpro, a high-performance web server for the analyses of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data, coupled with complex replicated experimental designs such as found in plant science investigations and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, by which complex experimental designs that may include replications, treatments, locations and times, can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data that may consist of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators. Availability and implementation GWASpro is freely available at https://bioinfo.noble.org/GWASPRO. Supplementary information Supplementary data are available at Bioinformatics online.
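
To make the idea of a design matrix for replicated experiments concrete, the sketch below builds an intercept plus dummy-coded columns for categorical factors such as location and replicate. It is a generic illustration and says nothing about how GWASpro itself constructs its matrices; the function name, factor names, and records are hypothetical.

```python
# Generic illustration of dummy coding; not GWASpro's implementation.
def design_matrix(records, factors):
    """Return column names and rows: an intercept plus one 0/1 column
    per non-reference level of each factor (first sorted level is the reference)."""
    columns = ["intercept"]
    levels = {}
    for factor in factors:
        levels[factor] = sorted({record[factor] for record in records})
        columns += [f"{factor}={level}" for level in levels[factor][1:]]
    rows = []
    for record in records:
        row = [1.0]
        for factor in factors:
            row += [1.0 if record[factor] == level else 0.0
                    for level in levels[factor][1:]]
        rows.append(row)
    return columns, rows

# Hypothetical field trial: two locations, two replicates each.
plots = [
    {"location": "A", "replicate": "1"},
    {"location": "A", "replicate": "2"},
    {"location": "B", "replicate": "1"},
    {"location": "B", "replicate": "2"},
]
columns, rows = design_matrix(plots, ["location", "replicate"])
print(columns)
for row in rows:
    print(row)
```

In a linear mixed model, columns like these would typically enter as fixed effects alongside the marker being tested, with relatedness handled by the random-effect term.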
