A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data

Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome, either through computational predictions, such as genomic conservation, or through high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations, thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. Its ability to predict functional regions, together with its generalizable statistical framework, makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu
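The abstract does not spell out the underlying model. As a rough illustration of what unsupervised learning over per-position annotations can look like, the sketch below fits a two-component mixture of independent binary annotations with EM and reports a posterior "functional" probability for each position. The mixture form, the binarized annotations, the toy data, and all names (`fit_two_class_mixture`, `TOY_ROWS`) are assumptions made for illustration; this is not GenoCanyon's actual implementation.

```python
import math
import random


def clamp(p, eps=1e-6):
    """Keep a probability strictly inside (0, 1) so log() stays finite."""
    return min(max(p, eps), 1.0 - eps)


def fit_two_class_mixture(rows, n_iter=50, seed=0):
    """EM for a two-component mixture of independent Bernoulli features.

    rows: list of equal-length 0/1 lists, one per genomic position
          (hypothetical binarized annotations).
    Returns (prior, p_func, p_non, posteriors), where posteriors come
    from the final E-step.
    """
    rng = random.Random(seed)
    n_feat = len(rows[0])
    prior = 0.5
    # Asymmetric start: the "functional" class begins with higher
    # annotation frequencies, which fixes the labeling of the components.
    p_func = [0.6 + 0.1 * rng.random() for _ in range(n_feat)]
    p_non = [0.3 + 0.1 * rng.random() for _ in range(n_feat)]
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability of the functional class per position.
        post = []
        for x in rows:
            log_f = math.log(prior)
            log_n = math.log(1.0 - prior)
            for xj, pf, pn in zip(x, p_func, p_non):
                log_f += math.log(pf if xj else 1.0 - pf)
                log_n += math.log(pn if xj else 1.0 - pn)
            m = max(log_f, log_n)  # subtract the max for numerical stability
            wf = math.exp(log_f - m)
            wn = math.exp(log_n - m)
            post.append(wf / (wf + wn))
        # M-step: re-estimate the class prior and per-feature frequencies.
        total_f = max(sum(post), 1e-12)
        total_n = max(len(rows) - sum(post), 1e-12)
        prior = clamp(sum(post) / len(rows))
        for j in range(n_feat):
            on_f = sum(r * x[j] for r, x in zip(post, rows))
            on_n = sum((1.0 - r) * x[j] for r, x in zip(post, rows))
            p_func[j] = clamp(on_f / total_f)
            p_non[j] = clamp(on_n / total_n)
    return prior, p_func, p_non, post


# Toy data (made up): three annotation-rich and three annotation-poor positions.
TOY_ROWS = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]

if __name__ == "__main__":
    _, _, _, scores = fit_two_class_mixture(TOY_ROWS)
    for row, score in zip(TOY_ROWS, scores):
        print(row, round(score, 3))
```

In this toy setting the posterior plays the role of a per-position functional score; a real whole-genome method would need continuous annotations and far more careful modeling than this sketch.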

Scientific Reports ◽ 10.1038/srep10576 ◽ 2015 ◽ Vol 5 (1) ◽ Cited By ~ 63

Author(s): Qiongshi Lu ◽ Yiming Hu ◽ Jiehuan Sun ◽ Yuwei Cheng ◽ Kei-Hoi Cheung ◽ ...

Keyword(s): Human Genome ◽ Integrated Analysis ◽ Coding Regions ◽ Statistical Framework ◽ Annotation Data

LOGO, a contextualized pre-trained language model of human genome flexibly adapts to various downstream tasks by fine-tuning

10.21203/rs.3.rs-448927/v1 ◽ 2021

Author(s): Meng Yang ◽ Haiping Huang ◽ Lichao Huang ◽ Nan Zhang ◽ Jihong Wu ◽ ...

Keyword(s): Human Genome ◽ Human Genetics ◽ Language Model ◽ Fine Tuning ◽ Human Reference Genome ◽ Coding Regions ◽ Conceptual Analogy ◽ Coding Variants ◽ Promoter Interaction

Interpretation of the non-coding genome remains an unsolved challenge in human genetics due to the impracticality of exhaustively annotating biochemically active elements in all conditions. Deep learning based computational approaches have recently emerged to help interpret non-coding regions. Here we present LOGO (Language of Genome), a self-attention based contextualized pre-trained language model that applies self-supervision techniques to learn bidirectional representations of the unlabeled human reference genome and extends to a series of downstream tasks via fine-tuning. We also explore a novel knowledge-embedded version of LOGO to incorporate prior human annotations. Experiments show that LOGO achieves 15% absolute improvement for promoter identification and up to 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art predictive power on chromatin features with only 3% of the parameters of the fully supervised convolutional neural network DeepSEA. Fine-tuned LOGO also shows outstanding performance in prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We make a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework with powerful adaptability to various tasks without substantial task-specific architecture modifications.

A statistical framework for mapping risk genes from de novo mutations in whole-genome sequencing studies

10.1101/077578 ◽ 2016

Author(s): Yuwen Liu ◽ Yanyu Liang ◽ A. Ercument Cicek ◽ Zhongshan Li ◽ Jinchen Li ◽ ...

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ De Novo ◽ Genome Wide Association Studies ◽ Whole Genome ◽ Major Advance ◽ Risk Genes ◽ Coding Sequences ◽ Coding Regions ◽ Statistical Framework

Analysis of de novo mutations (DNMs) from sequencing data of nuclear families has identified risk genes for many complex diseases, including multiple neurodevelopmental and psychiatric disorders. Most of these efforts have focused on mutations in protein-coding sequences. Evidence from genome-wide association studies (GWAS) strongly suggests that variants important to human diseases often lie in non-coding regions. Extending DNM-based approaches to non-coding sequences is, however, challenging because the functional significance of non-coding mutations is difficult to predict. We propose a new statistical framework for analyzing DNMs from whole-genome sequencing (WGS) data. This method, TADA-Annotations (TADA-A), is a major advance over the TADA method we developed earlier for DNM analysis in coding regions. TADA-A is able to incorporate many functional annotations, such as conservation and enhancer marks, learn from data which annotations are informative of pathogenic mutations, and combine both coding and non-coding mutations at the gene level to detect risk genes. It also supports meta-analysis of multiple DNM studies while adjusting for study-specific technical effects. We applied TADA-A to WGS data of ∼300 autism family trios across five studies and discovered several new autism risk genes. The software is freely available for all research uses.

Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution

10.1101/2021.09.06.459087 ◽ 2021

Author(s): Meng Yang ◽ Haiping Huang ◽ Lichao Huang ◽ Nan Zhang ◽ Jihong Wu ◽ ...

Keyword(s): Human Genome ◽ Human Genetics ◽ Language Model ◽ Variant Prioritization ◽ Human Reference Genome ◽ Coding Regions ◽ Conceptual Analogy ◽ Coding Variants ◽ Promoter Interaction

Interpretation of the non-coding genome remains an unsolved challenge in human genetics due to the impracticality of exhaustively annotating biochemically active elements in all conditions. Deep learning based computational approaches have recently emerged to help interpret non-coding regions. Here we present LOGO (Language of Genome), a self-attention based contextualized pre-trained language model containing only 2 self-attention layers with 1 million parameters, a substantially light architecture that applies self-supervision techniques to learn bidirectional representations of the unlabeled human reference genome. LOGO is then fine-tuned for sequence labelling tasks and further extended to variant prioritization via a special input encoding scheme for alternative alleles, followed by the addition of a convolutional module. Experiments show that LOGO achieves 15% absolute improvement for promoter identification and up to 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised model DeepSEA and 1% of the parameters of a recent BERT-based language model for the human genome. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We make a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework for interpreting non-coding regions, both for global sequence labeling and for variant prioritization at base resolution.

Faculty Opinions recommendation of Phylogenetic shadowing of primate sequences to find functional regions of the human genome.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽ 10.3410/f.1012491.193560 ◽ 2003

Author(s): Ulf Pettersson

Keyword(s): Human Genome ◽ Functional Regions

Will Gene Patents Impede Whole Genome Sequencing?: Deconstructing the Myth that Twenty Percent of the Human Genome is Patented

SSRN Electronic Journal ◽ 10.2139/ssrn.1894715 ◽ 2011 ◽ Cited By ~ 2

Author(s): Christopher M. Holman

Keyword(s): Whole Genome Sequencing ◽ Human Genome ◽ Genome Sequencing ◽ Whole Genome ◽ Gene Patents

MegaLMM: Mega-scale linear mixed models for genomic predictions with thousands of traits

Genome Biology ◽ 10.1186/s13059-021-02416-w ◽ 2021 ◽ Vol 22 (1)

Author(s): Daniel E. Runcie ◽ Jiayi Qu ◽ Hao Cheng ◽ Lorin Crawford

Keyword(s): Genomic Prediction ◽ Large Scale ◽ Mixed Model ◽ Human Genetics ◽ Linear Mixed Effect Model ◽ Mixed Effect ◽ Statistical Framework ◽ Effect Model ◽ Plant Data ◽ Genetic Value

Large-scale phenotype data can enhance the power of genomic prediction in plant and animal breeding, as well as human genetics. However, the statistical foundation of multi-trait genomic prediction is based on the multivariate linear mixed effect model, a tool notorious for its fragility when applied to more than a handful of traits. We present MegaLMM, a statistical framework and associated software package for mixed model analyses of a virtually unlimited number of traits. Using three examples with real plant data, we show that MegaLMM can leverage thousands of traits at once to significantly improve genetic value prediction accuracy.
