A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data

2015 ◽  
Author(s):  
Qiongshi Lu ◽  
Yiming Hu ◽  
Jiehuan Sun ◽  
Yuwei Cheng ◽  
Kei-Hoi Cheung ◽  
...  

Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome, either through computational predictions, such as genomic conservation, or through high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations, thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. Its ability to predict functional regions, together with its generalizable statistical framework, makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu
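
As a rough illustration of what unsupervised learning over annotation tracks can look like, the sketch below fits a two-class mixture by EM and reports, for each position, the posterior probability of the "functional" class. The abstract does not specify GenoCanyon's actual model, so everything here is an assumption made for illustration: the two latent classes, the binarized annotations, their conditional independence within a class, the hypothetical function name `em_two_class`, and the toy data with 5 annotations rather than 22.

```python
import math
import random

def em_two_class(X, n_iter=50, seed=0):
    """Fit a two-component mixture of independent Bernoulli annotations by EM.

    X: list of rows, each a list of 0/1 annotation flags for one position.
    Returns (pi, p_func, p_bg, post), where post[i] is the estimated
    probability that position i belongs to the "functional" class.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    pi = 0.5
    # Start the "functional" class with higher annotation rates so the
    # class labels keep their intended meaning.
    p_func = [0.6 + 0.1 * rng.random() for _ in range(d)]
    p_bg = [0.2 + 0.1 * rng.random() for _ in range(d)]
    post = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior class membership, computed in log space.
        for i, row in enumerate(X):
            log_f, log_b = math.log(pi), math.log(1 - pi)
            for j, x in enumerate(row):
                log_f += math.log(p_func[j] if x else 1 - p_func[j])
                log_b += math.log(p_bg[j] if x else 1 - p_bg[j])
            m = max(log_f, log_b)
            ef, eb = math.exp(log_f - m), math.exp(log_b - m)
            post[i] = ef / (ef + eb)
        # M-step: pseudo-counts keep every probability strictly in (0, 1).
        w = sum(post)
        pi = (w + 1.0) / (n + 2.0)
        for j in range(d):
            on_f = sum(post[i] * X[i][j] for i in range(n))
            on_b = sum((1 - post[i]) * X[i][j] for i in range(n))
            p_func[j] = (on_f + 1.0) / (w + 2.0)
            p_bg[j] = (on_b + 1.0) / ((n - w) + 2.0)
    return pi, p_func, p_bg, post

# Toy input: 10 heavily annotated positions followed by 30 unannotated ones.
toy = [[1, 1, 1, 1, 1]] * 10 + [[0, 0, 0, 0, 0]] * 30
_, _, _, toy_post = em_two_class(toy)
```

On this toy input the heavily annotated positions receive posteriors near 1 and the rest near 0; real annotation data are noisier and correlated, so the independence assumption is a simplification.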

2015 ◽  
Vol 5 (1) ◽  
Author(s):  
Qiongshi Lu ◽  
Yiming Hu ◽  
Jiehuan Sun ◽  
Yuwei Cheng ◽  
Kei-Hoi Cheung ◽  
...  

2021 ◽  
Author(s):  
Meng Yang ◽  
Haiping Huang ◽  
Lichao Huang ◽  
Nan Zhang ◽  
Jihong Wu ◽  
...  

Abstract: Interpretation of the non-coding genome remains an unsolved challenge in human genetics because it is impractical to exhaustively annotate biochemically active elements in all conditions. Deep-learning-based computational approaches have recently emerged to help interpret non-coding regions. Here we present LOGO (Language of Genome), a self-attention-based, contextualized, pre-trained language model that applies self-supervision techniques to learn bidirectional representations of the unlabeled human reference genome and extends to a series of downstream tasks via fine-tuning. We also explore a novel knowledge-embedded version of LOGO to incorporate prior human annotations. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art predictive power on chromatin features with only 3% of the parameters of the fully supervised convolutional neural network DeepSEA. Fine-tuned LOGO also shows outstanding performance in prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework that adapts to various tasks without substantial task-specific architecture modifications.


2016 ◽  
Author(s):  
Yuwen Liu ◽  
Yanyu Liang ◽  
A. Ercument Cicek ◽  
Zhongshan Li ◽  
Jinchen Li ◽  
...  

Abstract: Analysis of de novo mutations (DNMs) from sequencing data of nuclear families has identified risk genes for many complex diseases, including multiple neurodevelopmental and psychiatric disorders. Most of these efforts have focused on mutations in protein-coding sequences. Evidence from genome-wide association studies (GWAS) strongly suggests that variants important to human diseases often lie in non-coding regions. Extending DNM-based approaches to non-coding sequences is, however, challenging because the functional significance of non-coding mutations is difficult to predict. We propose a new statistical framework for analyzing DNMs from whole-genome sequencing (WGS) data. This method, TADA-Annotations (TADA-A), is a major advance over the TADA method we developed earlier for DNM analysis in coding regions. TADA-A is able to incorporate many functional annotations, such as conservation and enhancer marks, learn from data which annotations are informative of pathogenic mutations, and combine both coding and non-coding mutations at the gene level to detect risk genes. It also supports meta-analysis of multiple DNM studies while adjusting for study-specific technical effects. We applied TADA-A to WGS data of ∼300 autism family trios across five studies and discovered several new autism risk genes. The software is freely available for all research uses.
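
One simple way to picture "learning from data which annotations are informative" is a Poisson model in which an annotation multiplies the baseline mutation rate by a relative risk. The abstract does not state TADA-A's likelihood, so the model below is an assumption for illustration only: a single 0/1 annotation, fixed baseline rates, and the hypothetical function name `fit_log_relative_risk`. It is not the TADA-A software.

```python
import math

def fit_log_relative_risk(counts, base_rates, annotated):
    """MLE of beta in  count_i ~ Poisson(base_rate_i * exp(beta * annotated_i)).

    With the baseline rates held fixed and a single 0/1 annotation, setting
    the derivative of the Poisson log-likelihood to zero gives the closed form
    beta = log(observed / expected), summed over the annotated positions.
    """
    observed = sum(c for c, a in zip(counts, annotated) if a)
    expected = sum(r for r, a in zip(base_rates, annotated) if a)
    if observed == 0 or expected == 0:
        raise ValueError("need annotated positions with mutations and a non-zero baseline")
    return math.log(observed / expected)

# Toy input: 3 mutations observed where the baseline expects 1 in total.
beta = fit_log_relative_risk(
    counts=[2, 1, 0, 0],
    base_rates=[0.5, 0.5, 1.0, 1.0],
    annotated=[1, 1, 0, 0],
)
```

A positive `beta` means annotated positions carry more mutations than the baseline predicts, which is the sense in which an annotation would be called informative under this simplified model.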


2021 ◽  
Author(s):  
Meng Yang ◽  
Haiping Huang ◽  
Lichao Huang ◽  
Nan Zhang ◽  
Jihong Wu ◽  
...  

Interpretation of the non-coding genome remains an unsolved challenge in human genetics because it is impractical to exhaustively annotate biochemically active elements in all conditions. Deep-learning-based computational approaches have recently emerged to help interpret non-coding regions. Here we present LOGO (Language of Genome), a self-attention-based, contextualized, pre-trained language model with a substantially lightweight architecture (only 2 self-attention layers and 1 million parameters) that applies self-supervision techniques to learn bidirectional representations of the unlabeled human reference genome. LOGO is then fine-tuned for the sequence labeling task and further extended to the variant prioritization task via a special input encoding scheme for alternative alleles, followed by an added convolutional module. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised model DeepSEA and 1% of the parameters of a recent BERT-based language model for the human genome. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework for interpreting non-coding regions, both for global sequence labeling and for variant prioritization at base resolution.
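
For readers unfamiliar with the building block named above, the following is a minimal sketch of generic single-head scaled dot-product self-attention with no causal mask, so each position attends to positions on both sides. It is not LOGO's implementation: the abstract does not give the tokenization, head count, or layer widths, and the one-hot base encoding and identity projection matrices in the toy call are purely illustrative assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention without a causal mask.

    X: token embeddings, one row per sequence position (L x d_model).
    Wq, Wk, Wv: projection matrices (d_model x d_head).
    Returns (output, weights); weights[i][j] is how strongly position i
    attends to position j, to its left or to its right.
    """
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_head = len(Wq[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy input: the sequence "ACGT" with one-hot embeddings and identity projections.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
output, weights = self_attention(identity, identity, identity, identity)
```

In a trained model the projection matrices are learned and several such layers are stacked; the sketch only shows how one layer mixes information across positions.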


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Daniel E. Runcie ◽  
Jiayi Qu ◽  
Hao Cheng ◽  
Lorin Crawford

Abstract: Large-scale phenotype data can enhance the power of genomic prediction in plant and animal breeding, as well as in human genetics. However, the statistical foundation of multi-trait genomic prediction is the multivariate linear mixed effect model, a tool notorious for its fragility when applied to more than a handful of traits. We present a statistical framework and associated software package for mixed model analyses of a virtually unlimited number of traits. Using three examples with real plant data, we show that it can leverage thousands of traits at once to significantly improve genetic value prediction accuracy.


Genomics ◽  
2020 ◽  
Author(s):  
Xinshuai Zhang ◽  
Yao Ruan ◽  
Wukang Liu ◽  
Qian Chen ◽  
Lihong Gu ◽  
...  

2017 ◽  
Vol 77 (23) ◽  
pp. 6538-6550 ◽  
Author(s):  
Dylan Z. Kelley ◽  
Emily L. Flam ◽  
Evgeny Izumchenko ◽  
Ludmila V. Danilova ◽  
Hildegard A. Wulf ◽  
...  

2007 ◽  
Vol 35 (Web Server) ◽  
pp. W201-W205 ◽  
Author(s):  
C. D. Schmid ◽  
T. Sengstag ◽  
P. Bucher ◽  
M. Delorenzi
