LOGO, a contextualized pre-trained language model of human genome flexibly adapts to various downstream tasks by fine-tuning

Interpretation of the non-coding genome remains an unsolved challenge in human genetics because it is impractical to exhaustively annotate biochemically active elements in all conditions. Deep learning-based computational approaches have recently emerged to help interpret non-coding regions. Here we present LOGO (Language of Genome), a self-attention-based, contextualized, pre-trained language model that applies self-supervision techniques to learn bidirectional representations of the unlabeled human reference genome and extends to a series of downstream tasks via fine-tuning. We also explore a novel knowledge-embedded version of LOGO that incorporates prior human annotations. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art predictive power on chromatin features with only 3% of the parameters of the fully supervised convolutional neural network DeepSEA. Fine-tuned LOGO also shows outstanding performance in prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We make a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework that adapts to various tasks without substantial task-specific architecture modifications.
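The self-supervised objective described above can be illustrated with a minimal sketch. It assumes overlapping k-mer tokens and BERT-style random masking; the abstract does not state LOGO's actual tokenization, masking rate, or vocabulary, so the function names, `k=3`, and `mask_rate=0.15` are illustrative assumptions only.

```python
import random

def kmer_tokens(seq, k=3):
    """Split a DNA string into overlapping k-mers (stride 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens.

    Returns the masked token list and a {position: original token} dict
    that a model would be trained to predict from the surrounding context.
    """
    rng = random.Random(seed)
    # Mask at least one token, but never more than are available.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = mask_token
    return masked, targets

# Hypothetical toy sequence, not taken from the reference genome.
tokens = kmer_tokens("ACGTTGCAAC", k=3)
masked, targets = mask_tokens(tokens)
```

With overlapping k-mers, a single masked token can largely be read off its neighbours, so a practical setup would probably mask contiguous spans instead; the sketch keeps single-token masking for brevity.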


Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution

10.1101/2021.09.06.459087 ◽

2021 ◽

Author(s):

Meng Yang ◽

Haiping Huang ◽

Lichao Huang ◽

Nan Zhang ◽

Jihong Wu ◽

...

Keyword(s):

Human Genome ◽

Human Genetics ◽

Language Model ◽

Variant Prioritization ◽

Human Reference Genome ◽

Coding Regions ◽

Conceptual Analogy ◽

Coding Variants ◽

Promoter Interaction

Interpretation of the non-coding genome remains an unsolved challenge in human genetics because it is impractical to exhaustively annotate biochemically active elements in all conditions. Deep learning-based computational approaches have recently emerged to help interpret non-coding regions. Here we present LOGO (Language of Genome), a self-attention-based, contextualized, pre-trained language model with a substantially lightweight architecture (only 2 self-attention layers and 1 million parameters) that applies self-supervision techniques to learn bidirectional representations of the unlabeled human reference genome. LOGO is then fine-tuned for the sequence labelling task and further extended to the variant prioritization task via a special input encoding scheme for alternative alleles, followed by an added convolutional module. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised model DeepSEA and 1% of the parameters of a recent BERT-based language model for the human genome. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We make a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework for interpreting non-coding regions, both for global sequence labeling and for variant prioritization at base resolution.
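The abstract above does not specify the alternative-allele encoding or the convolutional module, so the following is only a toy sketch of why a one-dimensional convolution gives a local, base-resolution view of a variant. It assumes plain one-hot encoding, a single hand-written filter, and a reference-versus-alternative score difference; all names and the example sequence are hypothetical.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]

def apply_variant(ref_seq, pos, alt):
    """Return ref_seq with the base at 0-based position pos replaced by alt."""
    if alt not in BASES or not 0 <= pos < len(ref_seq):
        raise ValueError("invalid variant")
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]

def conv1d(x, kernel):
    """'Valid' 1D cross-correlation: x is L x 4, kernel is W x 4."""
    w = len(kernel)
    return [
        sum(x[i + j][c] * kernel[j][c] for j in range(w) for c in range(4))
        for i in range(len(x) - w + 1)
    ]

ref = "ACGTACGT"
alt = apply_variant(ref, 3, "A")  # T>A substitution at position 3

# Toy filter that responds to the motif "GAA" (one row per filter position).
kernel = [[0, 0, 1, 0], [1, 0, 0, 0], [1, 0, 0, 0]]

ref_scores = conv1d(one_hot(ref), kernel)
alt_scores = conv1d(one_hot(alt), kernel)
# Only windows overlapping the variant position can change.
delta = [a - r for a, r in zip(alt_scores, ref_scores)]
```

Because each output position depends only on a window of width 3, the substitution can alter at most the three windows that cover position 3, which is the locality property the abstract attributes to the convolutional module.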


A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data

10.1101/018093 ◽

2015 ◽

Cited By ~ 1

Author(s):

Qiongshi Lu ◽

Yiming Hu ◽

Jiehuan Sun ◽

Yuwei Cheng ◽

Kei-Hoi Cheung ◽

...

Keyword(s):

Human Genome ◽

Genome Annotation ◽

Human Genetics ◽

Integrated Analysis ◽

Whole Genome ◽

Coding Regions ◽

Functional Regions ◽

Statistical Framework ◽

Annotation Data ◽

High Throughput Experiments

Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome, either through computational predictions, such as genomic conservation, or through high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations, thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. Its ability to predict functional regions, together with its generalizable statistical framework, makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu
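One way to picture unsupervised scoring of functional potential from several annotations is a two-component mixture fitted by EM. The sketch below is an illustration under stated assumptions, not GenoCanyon's published model: it assumes binary annotations, conditional independence given the hidden functional state, and a toy dataset of three annotations; every name and number is hypothetical.

```python
import math

def posterior(x, pi, p_func, p_non):
    """P(functional | binary annotation vector x), assuming the annotations
    are conditionally independent given the hidden state."""
    log_f, log_n = math.log(pi), math.log(1.0 - pi)
    for xi, pf, pn in zip(x, p_func, p_non):
        log_f += math.log(pf if xi else 1.0 - pf)
        log_n += math.log(pn if xi else 1.0 - pn)
    m = max(log_f, log_n)  # subtract the max for numerical stability
    ef, en = math.exp(log_f - m), math.exp(log_n - m)
    return ef / (ef + en)

def clamp(v, eps=1e-3):
    """Keep probabilities away from 0 and 1 so the logs stay finite."""
    return min(max(v, eps), 1.0 - eps)

def fit_em(data, n_iter=50):
    """Fit the two-component Bernoulli mixture without labels."""
    n, d = len(data), len(data[0])
    # Asymmetric start so the 'functional' component is the annotation-rich one.
    pi, p_func, p_non = 0.5, [0.7] * d, [0.3] * d
    for _ in range(n_iter):
        # E-step: responsibility of the functional component for each position.
        resp = [posterior(x, pi, p_func, p_non) for x in data]
        total = sum(resp)
        # M-step: re-estimate the mixing weight and per-annotation rates.
        pi = clamp(total / n)
        for j in range(d):
            on_f = sum(r * x[j] for r, x in zip(resp, data))
            on_n = sum((1.0 - r) * x[j] for r, x in zip(resp, data))
            p_func[j] = clamp(on_f / max(total, 1e-9))
            p_non[j] = clamp(on_n / max(n - total, 1e-9))
    return pi, p_func, p_non

# Toy data: each row is one genomic position with three binary annotations.
data = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0),
        (0, 0, 1), (0, 0, 0), (0, 1, 0), (0, 0, 0)]
pi, p_func, p_non = fit_em(data)
scores = [posterior(x, pi, p_func, p_non) for x in data]
```

Here `scores` plays the role of a per-position functional potential between 0 and 1; a real annotation set would mix continuous and binary tracks and would need suitable per-annotation distributions.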


Faculty Opinions recommendation of Whole-exome sequencing of 2,000 Danish individuals and the role of rare coding variants in type 2 diabetes.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718194747.793488319 ◽

2013 ◽

Author(s):

Alon Keinan

Keyword(s):

Type 2 Diabetes ◽

Exome Sequencing ◽

Whole Exome Sequencing ◽

Whole Exome ◽

Coding Variants


Analysis of rare coding variants in 200,000 exome‐sequenced subjects reveals novel genetic risk factors for type 2 diabetes

Diabetes/Metabolism Research and Reviews ◽

10.1002/dmrr.3482 ◽

2021 ◽

Author(s):

David Curtis

Keyword(s):

Risk Factors ◽

Type 2 Diabetes ◽

Genetic Risk ◽

Genetic Risk Factors ◽

Coding Variants


Non-Coding Variants in Cancer: Mechanistic Insights and Clinical Potential for Personalized Medicine

Non-Coding RNA ◽

10.3390/ncrna7030047 ◽

2021 ◽

Vol 7 (3) ◽

pp. 47

Author(s):

Marios Lange ◽

Rodiola Begolli ◽

Antonis Giakountis

Keyword(s):

Disease Onset ◽

Cancer Genome ◽

Driver Mutations ◽

Nucleotide Polymorphisms ◽

Structural Variations ◽

Coding Regions ◽

Functional Variants ◽

Coding Variants ◽

Clinical Potential

The cancer genome is characterized by extensive variability, in the form of Single Nucleotide Polymorphisms (SNPs) or structural variations such as Copy Number Alterations (CNAs) across wider genomic areas. At the molecular level, most SNPs and/or CNAs reside in non-coding sequences, ultimately affecting the regulation of oncogenes and/or tumor suppressors in a cancer-specific manner. Notably, inherited non-coding variants can predispose to cancer decades prior to disease onset. Furthermore, accumulation of additional non-coding driver mutations during progression of the disease gives rise to genomic instability, acting as the driving force of neoplastic development and malignant evolution. Therefore, detection and characterization of such mutations can improve risk assessment for healthy carriers and expand the diagnostic and therapeutic toolbox for the patient. This review focuses on functional variants that reside in transcribed or non-transcribed non-coding regions of the cancer genome and presents a collection of appropriate state-of-the-art methodologies to study them.


Gain-of-Function Claims for Type-2-Diabetes-Associated Coding Variants in SLC16A11 Are Not Supported by the Experimental Data

Cell Reports ◽

10.1016/j.celrep.2019.09.021 ◽

2019 ◽

Vol 29 (3) ◽

pp. 778-780 ◽

Cited By ~ 3

Author(s):

Eitan Hoch ◽

Jose C. Florez ◽

Eric S. Lander ◽

Suzanne B.R. Jacobs

Keyword(s):

Experimental Data ◽

Type 2 Diabetes ◽

Gain Of Function ◽

Coding Variants


Local auxin biosynthesis acts downstream of brassinosteroids to trigger root foraging for nitrogen

Nature Communications ◽

10.1038/s41467-021-25250-x ◽

2021 ◽

Vol 12 (1) ◽

Cited By ~ 1

Author(s):

Zhongtao Jia ◽

Ricardo F. H. Giehl ◽

Nicolaus von Wirén

Keyword(s):

Lateral Roots ◽

Root Surface ◽

Nutrient Acquisition ◽

Fine Tuning ◽

Auxin Biosynthesis ◽

N Availability ◽

Hormonal Crosstalk ◽

Soil Exploration ◽

N Deficiency ◽

Coding Variants

Lateral roots (LRs) dominate the overall root surface of adult plants and are crucial for soil exploration and nutrient acquisition. When grown under mild nitrogen (N) deficiency, flowering plants develop longer LRs to enhance nutrient acquisition. This response is partly mediated by brassinosteroids (BR) and by as yet unknown mechanisms. Here, we show that local auxin biosynthesis modulates LR elongation, while allelic coding variants of YUCCA8 determine the extent of elongation under N deficiency. Up-regulation of YUCCA8/3/5/7 and of Tryptophan Aminotransferase of Arabidopsis 1 (TAA1) under mild N deficiency increases auxin accumulation in LR tips. We further demonstrate that N-dependent auxin biosynthesis in LRs acts epistatic to and downstream of a canonical BR signaling cascade. The uncovered BR-auxin hormonal module and its allelic variants emphasize the importance of fine-tuning hormonal crosstalk to boost adaptive root responses to N availability and offer a path to improve soil exploration by expanded root systems in plants.


Enhanced Language Model with Hybrid Knowledge Graph for Mathematical Topic Prediction

10.22541/au.163491250.03226531/v1 ◽

2021 ◽

Author(s):

Minghui Wu ◽

Canghong Jin ◽

Wenkang Hu ◽

Yabo Chen

Keyword(s):

Language Model ◽

Real Data ◽

Mathematical Concept ◽

Fine Tuning ◽

Knowledge Graph ◽

Mathematics Knowledge ◽

Hybrid Knowledge ◽

Model Set ◽

Set Up ◽

Mathematical Topic

Understanding mathematical topics is important for both educators and students to capture latent concepts of questions, evaluate study performance, and recommend content in online learning systems. Compared to traditional text classification, mathematical topic classification poses several main challenges: (1) mathematical questions are relatively short; (2) the same mathematical concept can be represented in various ways (i.e., calculations and applications); (3) the content of a question is complex, spanning algebra, geometry, and calculus. To overcome these problems, we propose a framework that combines content tokens and mathematical knowledge concepts throughout the whole procedure. We embed entities from mathematics knowledge graphs, integrate entities into tokens in a masked language model, set up semantic similarity-based tasks for next-sentence prediction, and fuse knowledge vectors and token vectors during the fine-tuning procedure. We also build a Chinese mathematical topic prediction dataset consisting of more than 70,000 mathematical questions with topics. Our experiments using real data demonstrate that our knowledge graph-based mathematical topic prediction model outperforms other state-of-the-art methods.


Distinctive functional regime of endogenous lncRNAs in dark regions of human genome

10.1101/2020.12.06.413880 ◽

2020 ◽

Author(s):

Anyou Wang ◽

Rong Hai

Keyword(s):

Human Genome ◽

Rna Processing ◽

Self Regulation ◽

Post Translational Modification ◽

Protein Coding ◽

Noncoding Regions ◽

Coding Regions ◽

Rnaseq Data ◽

Response To Stress ◽

Eukaryotic Genomes

Eukaryotic genomes gradually gained noncoding regions over the course of evolution, and the human genome actively transcribes >90% of its noncoding regions [1], suggesting their importance in the evolution of the human genome. Yet <1% of them have been functionally characterized [2], leaving most of the human genome in the dark. Here we systematically decode endogenous lncRNAs located in unannotated regions of the human genome and decipher a distinctive functional regime of lncRNAs hidden in massive RNAseq data. LncRNAs are distributed divergently across chromosomes, independent of protein-coding regions. Their transcription rarely initiates at promoters through polymerase II, but mostly at enhancers. Yet conventional enhancer activators (e.g. H3K4me1) account for only a small proportion of lncRNA activation, suggesting that unknown alternative mechanisms initiate the majority of lncRNAs. Meanwhile, lncRNA self-regulation also contributes notably to lncRNA activation. LncRNAs trans-regulate broad bioprocesses, including transcription and RNA processing, cell cycle, respiration, response to stress, chromatin organization, post-translational modification, and development. Overall, lncRNAs govern their own regime, distinct from that of proteins.
