AlleleHMM: a data-driven method to identify allele-specific differences in distributed functional genomic marks

bioRxiv ◽

10.1101/389262 ◽

2018 ◽

Author(s):

Shao-Pei Chou ◽

Charles G. Danko

Keyword(s):

Dna Sequence ◽

Statistical Power ◽

Genomic Data ◽

Computational Method ◽

Functional Genomic ◽

Biochemical Processes ◽

Allele Specific ◽

Genomic Regions ◽

Hidden States ◽

Source Of Information

Abstract: How DNA sequence variation influences gene expression remains poorly understood. Diploid organisms have two homologous copies of their DNA sequence in the same nucleus, providing a rich source of information about how genetic variation affects a wealth of biochemical processes. However, few computational methods have been developed to discover allele-specific differences in functional genomic data. Existing methods either treat each SNP independently, limiting statistical power, or combine SNPs across gene annotations, preventing the discovery of allele-specific differences in unexpected genomic regions. Here we introduce AlleleHMM, a new computational method to identify blocks of neighboring SNPs that share similar allele-specific differences in mark abundance. AlleleHMM uses a hidden Markov model to divide the genome among three hidden states based on allele frequencies in genomic data: a symmetric state (state S), which shows no difference between alleles, and regions with a higher signal on the maternal (state M) or paternal (state P) allele. AlleleHMM substantially outperformed naive methods using both simulated and real genomic data, particularly when input data had realistic levels of overdispersion. Using PRO-seq data, AlleleHMM identified thousands of allele-specific blocks of transcription in both coding and non-coding genomic regions. AlleleHMM is a powerful tool for discovering allele-specific regions in functional genomic datasets.
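The three-state idea in this abstract can be illustrated with a minimal Viterbi sketch. This is not the AlleleHMM implementation: the binomial emission model, the `p_bias` and `stay` parameters, and the function names are all assumptions made for illustration, and the plain binomial ignores the overdispersion the abstract mentions.

```python
import math

STATES = ("S", "M", "P")  # symmetric, maternal-biased, paternal-biased


def log_binom(k, n, p):
    """Log binomial probability of k maternal reads out of n total reads."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def viterbi_allele_states(mat, pat, p_bias=0.8, stay=0.9):
    """Most likely S/M/P state per SNP from maternal and paternal read counts.

    Assumed (hypothetical) model: the expected maternal read fraction is 0.5
    in S, p_bias in M, and 1 - p_bias in P; each state persists with
    probability `stay` and otherwise switches to either other state equally.
    """
    p_mat = {"S": 0.5, "M": p_bias, "P": 1.0 - p_bias}
    switch = (1.0 - stay) / 2.0
    log_trans = {a: {b: math.log(stay if a == b else switch) for b in STATES}
                 for a in STATES}

    # Initialise with a uniform prior over the three states.
    score = {s: math.log(1.0 / 3.0)
             + log_binom(mat[0], mat[0] + pat[0], p_mat[s]) for s in STATES}
    back = []
    for m, p in zip(mat[1:], pat[1:]):
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + log_trans[r][s])
            ptr[s] = best
            score[s] = (prev[best] + log_trans[best][s]
                        + log_binom(m, m + p, p_mat[s]))
        back.append(ptr)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


maternal = [5, 6, 4, 9, 10, 9, 1, 0, 1]
paternal = [5, 4, 6, 1, 0, 1, 9, 10, 9]
print(viterbi_allele_states(maternal, paternal))
```

Because neighboring SNPs share a state unless the evidence for switching outweighs the transition penalty, single noisy SNPs such as the 6:4 and 4:6 sites stay in the symmetric block rather than being called on their own.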


A deep learning framework for predicting human essential genes from population and functional genomic data

10.1101/2021.12.21.473690 ◽

2021 ◽

Author(s):

Troy M LaPolice ◽

Yi-Fei Huang

Keyword(s):

Deep Learning ◽

Genetic Disorders ◽

Genomic Data ◽

Essential Genes ◽

Computational Method ◽

Functional Genomic ◽

Loss Of Function ◽

Functional Genomic Data ◽

Learning Framework ◽

Limited Power

Being able to predict essential genes intolerant to loss-of-function (LOF) mutations can dramatically improve our ability to identify genes associated with genetic disorders. Numerous computational methods have recently been developed to predict human essential genes from population genomic data; however, the existing methods have limited power in pinpointing short essential genes due to the sparsity of polymorphisms in the human genome. Here we present an evolution-based deep learning model, DeepLOF, which integrates population and functional genomic data to improve gene essentiality prediction. Compared to previous methods, DeepLOF shows unmatched performance in predicting ClinGen haploinsufficient genes, mouse essential genes, and essential genes in human cell lines. Furthermore, DeepLOF discovers 109 potentially essential genes that are too short to be identified by previous methods. Altogether, DeepLOF is a powerful computational method to aid in the discovery of essential genes.


Faculty Opinions recommendation of Finding function: evaluation methods for functional genomic data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1044091.496329 ◽

2006 ◽

Author(s):

Russ Altman

Keyword(s):

Genomic Data ◽

Evaluation Methods ◽

Function Evaluation ◽

Functional Genomic ◽

Functional Genomic Data


DNA Polymorphism in Lycopersicon and Crossing-Over per Physical Length

Genetics ◽

10.1093/genetics/150.4.1585 ◽

1998 ◽

Vol 150 (4) ◽

pp. 1585-1593 ◽

Cited By ~ 21

Author(s):

Wolfgang Stephan ◽

Charles H Langley

Keyword(s):

Dna Sequence ◽

Dna Polymorphism ◽

Sequence Polymorphism ◽

Breeding Systems ◽

Crossing Over ◽

Physical Length ◽

Interspecific Divergence ◽

Dna Sequence Polymorphism ◽

Outcrossing Species ◽

Genomic Regions

Abstract: Surveys in Drosophila have consistently found reduced levels of DNA sequence polymorphism in genomic regions experiencing low crossing-over per physical length, while these same regions exhibit normal amounts of interspecific divergence. Here we show that for 36 loci across the genomes of eight Lycopersicon species, naturally occurring DNA polymorphism (scaled by locus-specific divergence between species) is positively correlated with the density of crossing-over per physical length. Large between-species differences in the amount of DNA sequence polymorphism reflect breeding systems: selfing species show much less within-species polymorphism than outcrossing species. The strongest association of expected heterozygosity with crossing-over is found in species with intermediate levels of average nucleotide diversity. All of these observations appear to be in qualitative agreement with the hitchhiking effects caused by the fixation of advantageous mutations and/or “background selection” against deleterious mutations.


Learning a genome-wide score of human–mouse conservation at the functional genomics level

Nature Communications ◽

10.1038/s41467-021-22653-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Soo Bin Kwon ◽

Jason Ernst

Keyword(s):

Mouse Model ◽

Functional Genomics ◽

Functional Genomic ◽

Transcriptomic Data ◽

Model Studies ◽

Genome Wide ◽

A Genome ◽

Important Challenge ◽

Genomic Regions ◽

Human And Mouse

Abstract: Identifying genomic regions with functional genomic properties that are conserved between human and mouse is an important challenge in the context of mouse model studies. To address this, we develop a method to learn a score of evidence of conservation at the functional genomics level by integrating information from a compendium of epigenomic, transcription factor binding, and transcriptomic data from human and mouse. The method, Learning Evidence of Conservation from Integrated Functional genomic annotations (LECIF), trains neural networks to generate this score for the human and mouse genomes. The resulting LECIF score highlights human and mouse regions with shared functional genomic properties and captures correspondence of biologically similar human and mouse annotations. Analysis with independent datasets shows the score also highlights loci associated with similar phenotypes in both species. LECIF will be a resource for mouse model studies by identifying loci whose functional genomic properties are likely conserved.


Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks

Nature Genetics ◽

10.1038/ng.167 ◽

2008 ◽

Vol 40 (7) ◽

pp. 854-861 ◽

Cited By ~ 361

Author(s):

Jun Zhu ◽

Bin Zhang ◽

Erin N Smith ◽

Becky Drees ◽

Rachel B Brem ◽

...

Keyword(s):

Regulatory Networks ◽

Large Scale ◽

Genomic Data ◽

Functional Genomic ◽

Functional Genomic Data


Behavior-dependent cis regulation reveals genes and pathways associated with bower building in cichlid fishes

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1810140115 ◽

2018 ◽

Vol 115 (47) ◽

pp. E11081-E11090 ◽

Cited By ~ 15

Author(s):

Ryan A. York ◽

Chinar Patil ◽

Kawther Abdilleh ◽

Zachary V. Johnson ◽

Matthew A. Conte ◽

...

Keyword(s):

Gene Expression ◽

Neural Plasticity ◽

Genetic Basis ◽

Specific Expression ◽

Preferential Expression ◽

Cichlid Fishes ◽

Brain Gene Expression ◽

Genome Wide ◽

Allele Specific ◽

Genomic Regions

Many behaviors are associated with heritable genetic variation [Kendler and Greenspan (2006) Am J Psychiatry 163:1683–1694]. Genetic mapping has revealed genomic regions or, in a few cases, specific genes explaining part of this variation [Bendesky and Bargmann (2011) Nat Rev Gen 12:809–820]. However, the genetic basis of behavioral evolution remains unclear. Here we investigate the evolution of an innate extended phenotype, bower building, among cichlid fishes of Lake Malawi. Males build bowers of two types, pits or castles, to attract females for mating. We performed comparative genome-wide analyses of 20 bower-building species and found that these phenotypes have evolved multiple times with thousands of genetic variants strongly associated with this behavior, suggesting a polygenic architecture. Remarkably, F1 hybrids of a pit-digging and a castle-building species perform sequential construction of first a pit and then a castle bower. Analysis of brain gene expression in these hybrids showed that genes near behavior-associated variants display behavior-dependent allele-specific expression with preferential expression of the pit-digging species allele during pit digging and of the castle-building species allele during castle building. These genes are highly enriched for functions related to neurodevelopment and neural plasticity. Our results suggest that natural behaviors are associated with complex genetic architectures that alter behavior via cis-regulatory differences whose effects on gene expression are specific to the behavior itself.


Detection and evaluation of selection signatures in sheep

Pesquisa Agropecuária Brasileira ◽

10.1590/s0100-204x2018000500001 ◽

2018 ◽

Vol 53 (5) ◽

pp. 527-539 ◽

Cited By ~ 2

Author(s):

Tiago do Prado Paim ◽

Patrícia Ianella ◽

Samuel Rezende Paiva ◽

Alexandre Rodrigues Caetano ◽

Concepta Margaret McManus Pimentel

Keyword(s):

Statistical Methods ◽

Phenotypic Diversity ◽

Selection Process ◽

Genomic Data ◽

Research Field ◽

Economic Traits ◽

Selection Signatures ◽

Important Species ◽

Challenges And Opportunities ◽

Genomic Regions

Abstract: The recent development of genome-wide single nucleotide polymorphism (SNP) arrays made it possible to carry out several studies with different species. The selection process can increase or reduce allelic (or genic) frequencies at specific loci in the genome, besides dragging neighboring alleles in the chromosome. This way, genomic regions with increased frequencies of specific alleles are formed, characterizing selection signatures or selective sweeps. The detection of these signatures is important to characterize genetic resources, as well as to identify genes or regions involved in the control and expression of important production and economic traits. Sheep are an important species for these studies as they are dispersed worldwide and have great phenotypic diversity. Due to the large amounts of genomic data generated, specific statistical methods and software are necessary for the detection of selection signatures. Therefore, the objectives of this review are to address the main statistical methods and software currently used for the analysis of genomic data and the identification of selection signatures; to describe the results of recent works published on selection signatures in sheep; and to discuss some challenges and opportunities in this research field.


Discovery of allele-specific protein-RNA interactions in human transcriptomes

10.1101/389205 ◽

2018 ◽

Author(s):

Emad Bahrami-Samani ◽

Yi Xing

Keyword(s):

Genetic Variants ◽

Rna Binding ◽

Rna Binding Proteins ◽

Specific Protein ◽

Computational Method ◽

Joint Analysis ◽

Transcriptional Level ◽

Rna Seq ◽

Allele Specific ◽

Rna Interaction

Abstract: Gene expression is tightly regulated at the post-transcriptional level through splicing, transport, translation, and decay. RNA-binding proteins (RBPs) play key roles in post-transcriptional gene regulation, and genetic variants that alter RBP-RNA interactions can affect gene products and functions. We developed a computational method, ASPRIN (Allele-Specific Protein-RNA Interaction), that uses a joint analysis of CLIP-seq (cross-linking and immunoprecipitation followed by high-throughput sequencing) and RNA-seq data to identify genetic variants that alter RBP-RNA interactions by directly observing the allelic preference of RBP from CLIP-seq experiments as compared to RNA-seq. We used ASPRIN to systematically analyze CLIP-seq and RNA-seq data for 166 RBPs in two ENCODE (Encyclopedia of DNA Elements) cell lines. ASPRIN identified genetic variants that alter RBP-RNA interactions by modifying RBP binding motifs within RNA. Moreover, through an integrative ASPRIN analysis with population-scale RNA-seq data, we showed that ASPRIN can help reveal potential causal variants that affect alternative splicing via allele-specific protein-RNA interactions.
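One simple way to compare the allelic ratio at a variant in CLIP-seq against the ratio in RNA-seq is a two-sided Fisher's exact test on a 2x2 table of reference and alternate read counts. The sketch below is only an illustration of that kind of comparison: the abstract does not state which test ASPRIN uses, and the function name and example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(clip_ref, clip_alt, rna_ref, rna_alt):
    """Two-sided Fisher's exact p-value for a 2x2 table of allele read counts.

    Rows are assays (CLIP-seq, RNA-seq); columns are alleles (ref, alt).
    The p-value sums the hypergeometric probabilities of all tables with the
    same margins that are no more likely than the observed table.
    """
    row1 = clip_ref + clip_alt
    row2 = rna_ref + rna_alt
    col1 = clip_ref + rna_ref
    n = row1 + row2

    def prob(x):
        # Probability that the CLIP-seq row holds x of the col1 reference reads.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(clip_ref)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, total)


# Hypothetical counts: CLIP-seq reads favour the reference allele,
# while RNA-seq reads are balanced between the two alleles.
print(fisher_exact_two_sided(18, 2, 20, 20))
```

Using RNA-seq as the comparison row, rather than testing CLIP-seq counts against a fixed 50:50 expectation, keeps an allelic imbalance that is already present in expression from being mistaken for an allele-specific binding preference.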


Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects

10.1101/269316 ◽

2018 ◽

Cited By ~ 1

Author(s):

Allison A. Regier ◽

Yossi Farjoun ◽

David Larson ◽

Olga Krasheninina ◽

Hyun Min Kang ◽

...

Keyword(s):

Data Processing ◽

Genome Sequencing ◽

Statistical Power ◽

Human Genetics ◽

Variant Calling ◽

Joint Analysis ◽

Sequencing Analysis ◽

Batch Effects ◽

Many Sources ◽

Genomic Regions

Abstract: Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years to interrogate a broad range of traits, across diverse populations. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power for trait mapping, and will enable studies of genome biology, population genetics and genome function at unprecedented scale. A central challenge for joint analysis is that different WGS data processing and analysis pipelines cause substantial batch effects in combined datasets, necessitating computationally expensive reprocessing and harmonization prior to variant calling. This approach is no longer tenable given the scale of current studies and data volumes. Here, in a collaboration across multiple genome centers and NIH programs, we define WGS data processing standards that allow different groups to produce “functionally equivalent” (FE) results suitable for joint variant calling with minimal batch effects. Our approach promotes broad harmonization of upstream data processing steps, while allowing for diverse variant callers. Importantly, it allows each group to continue innovating on data processing pipelines, as long as results remain compatible. We present initial FE pipelines developed at five genome centers and show that they yield similar variant calling results – including single nucleotide (SNV), insertion/deletion (indel) and structural variation (SV) – and produce significantly less variability than sequencing replicates. Residual inter-pipeline variability is concentrated at low quality sites and repetitive genomic regions prone to stochastic effects. This work alleviates a key technical bottleneck for genome aggregation and helps lay the foundation for broad data sharing and community-wide “big-data” human genetics studies.


Comparative Performance of Popular Methods for Hybrid Detection using Genomic Data

10.1101/2020.07.27.224022 ◽

2020 ◽

Author(s):

Sungsik Kong ◽

Laura S. Kubatko

Keyword(s):

Statistical Power ◽

Evolutionary Dynamics ◽

Mean Squared Error ◽

Incomplete Lineage Sorting ◽

Introgressive Hybridization ◽

Genomic Data ◽

Lineage Sorting ◽

Comparative Performance ◽

Pattern Frequency ◽

Hybrid Detection

Abstract: Interspecific hybridization is an important evolutionary phenomenon that generates genetic variability in a population and fosters species diversity in nature. The availability of large genome scale datasets has revolutionized hybridization studies to shift from the examination of the presence or absence of hybrids in nature to the investigation of the genomic constitution of hybrids and their genome-specific evolutionary dynamics. Although a handful of methods have been proposed in an attempt to identify hybrids, accurate detection of hybridization from genomic data remains a challenging task. The available methods can be classified broadly as site pattern frequency based and population genetic clustering approaches, though the performance of the two classes of methods under different hybridization scenarios has not been extensively examined. Here, we use simulated data to comparatively evaluate the performance of four tools that are commonly used to infer hybridization events: the site pattern frequency based methods HyDe and the D-statistic (i.e., the ABBA-BABA test), and the population clustering approaches structure and ADMIXTURE. We consider single hybridization scenarios that vary in the time of hybridization and the amount of incomplete lineage sorting (ILS) for different proportions of parental contributions (γ); introgressive hybridization; multiple hybridization scenarios; and a mixture of ancestral and recent hybridization scenarios. We focus on the statistical power to detect hybridization, the false discovery rate (FDR) for the D-statistic and HyDe, and the accuracy of the estimates of γ as measured by the mean squared error for HyDe, structure, and ADMIXTURE. Both HyDe and the D-statistic demonstrate a high level of detection power in all scenarios except those with high ILS, although the D-statistic often has an unacceptably high FDR. The estimates of γ in HyDe are impressively robust and accurate whereas structure and ADMIXTURE sometimes fail to identify hybrids, particularly when the proportional parental contributions are asymmetric (i.e., when γ is close to 0). Moreover, the posterior distribution estimated using structure exhibits multimodality in many scenarios, making interpretation difficult. Our results provide guidance in selecting appropriate methods for identifying hybrid populations from genomic data.
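For readers unfamiliar with the ABBA-BABA test named in this abstract, the D-statistic for taxa ordered (((P1, P2), P3), O) is D = (n_ABBA − n_BABA) / (n_ABBA + n_BABA), where A is the outgroup allele and B the derived allele. The sketch below counts the two site patterns from four aligned sequences; the function name and toy sequences are hypothetical, and it omits the significance testing (for example, block jackknife) that a real analysis would need.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Patterson's D from four aligned sequences of equal length.

    A site is informative only if it is biallelic, P3 carries the derived
    allele (differs from the outgroup), and P1 and P2 disagree.
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if c == o or len({a, b, c, o}) != 2:
            continue  # P3 is ancestral, or the site is not biallelic
        if a == o and b == c:
            abba += 1  # pattern A B B A: P2 shares the derived allele with P3
        elif a == c and b == o:
            baba += 1  # pattern B A B A: P1 shares the derived allele with P3
    total = abba + baba
    # With no informative sites, D is undefined; 0.0 is returned by convention here.
    return (abba - baba) / total if total else 0.0


# Toy alignment: two ABBA sites, one BABA site, one invariant site.
print(d_statistic("AAGA", "GGAA", "GGGA", "AAAA"))
```

Under incomplete lineage sorting alone the two patterns are expected to be equally frequent, so D near zero is consistent with no gene flow, while an excess of either pattern suggests introgression between P3 and one of P1 or P2.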
