LOGO, a contextualized pre-trained language model of human genome flexibly adapts to various downstream tasks by fine-tuning

Author(s):  
Meng Yang ◽  
Haiping Huang ◽  
Lichao Huang ◽  
Nan Zhang ◽  
Jihong Wu ◽  
...  

Interpretation of the non-coding genome remains an unsolved challenge in human genetics because it is impractical to exhaustively annotate biochemically active elements in all conditions. Deep learning-based computational approaches have recently emerged to help interpret non-coding regions. Here we present LOGO (Language of Genome), a self-attention-based contextualized pre-trained language model that applies self-supervision techniques to learn bidirectional representations of the unlabeled human reference genome and that extends to a series of downstream tasks via fine-tuning. We also explore a novel knowledge-embedded version of LOGO to incorporate prior human annotations. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art predictive power on chromatin features with only 3% of the parameters of the fully supervised convolutional neural network DeepSEA. Fine-tuned LOGO also shows outstanding performance in prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework with powerful adaptability to various tasks without substantial task-specific architecture modifications.

2021 ◽  
Author(s):  
Meng Yang ◽  
Haiping Huang ◽  
Lichao Huang ◽  
Nan Zhang ◽  
Jihong Wu ◽  
...  

Interpretation of the non-coding genome remains an unsolved challenge in human genetics because it is impractical to exhaustively annotate biochemically active elements in all conditions. Deep learning-based computational approaches have recently emerged to help interpret non-coding regions. Here we present LOGO (Language of Genome), a self-attention-based contextualized pre-trained language model with a substantially lightweight architecture of only 2 self-attention layers and 1 million parameters, which applies self-supervision techniques to learn bidirectional representations of the unlabeled human reference genome. LOGO is then fine-tuned for the sequence labelling task and further extended to the variant prioritization task via a special input encoding scheme for alternative alleles, followed by an added convolutional module. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised model DeepSEA and 1% of the parameters of a recent BERT-based language model for the human genome. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework for interpreting non-coding regions, both for global sequence labeling and for variant prioritization at base resolution.


2015 ◽  
Author(s):  
Qiongshi Lu ◽  
Yiming Hu ◽  
Jiehuan Sun ◽  
Yuwei Cheng ◽  
Kei-Hoi Cheung ◽  
...  

Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations, thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. Its ability to predict functional regions, together with its generalizable statistical framework, makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu


2021 ◽  
Vol 7 (3) ◽  
pp. 47
Author(s):  
Marios Lange ◽  
Rodiola Begolli ◽  
Antonis Giakountis

The cancer genome is characterized by extensive variability, in the form of Single Nucleotide Polymorphisms (SNPs) or structural variations such as Copy Number Alterations (CNAs) across wider genomic areas. At the molecular level, most SNPs and/or CNAs reside in non-coding sequences, ultimately affecting the regulation of oncogenes and/or tumor-suppressors in a cancer-specific manner. Notably, inherited non-coding variants can predispose to cancer decades prior to disease onset. Furthermore, accumulation of additional non-coding driver mutations during progression of the disease gives rise to genomic instability, acting as the driving force of neoplastic development and malignant evolution. Therefore, detection and characterization of such mutations can improve risk assessment for healthy carriers and expand the diagnostic and therapeutic toolbox for the patient. This review focuses on functional variants that reside in transcribed or non-transcribed non-coding regions of the cancer genome and presents a collection of appropriate state-of-the-art methodologies to study them.


Cell Reports ◽  
2019 ◽  
Vol 29 (3) ◽  
pp. 778-780 ◽  
Author(s):  
Eitan Hoch ◽  
Jose C. Florez ◽  
Eric S. Lander ◽  
Suzanne B.R. Jacobs

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Zhongtao Jia ◽  
Ricardo F. H. Giehl ◽  
Nicolaus von Wirén

Lateral roots (LRs) dominate the overall root surface of adult plants and are crucial for soil exploration and nutrient acquisition. When grown under mild nitrogen (N) deficiency, flowering plants develop longer LRs to enhance nutrient acquisition. This response is partly mediated by brassinosteroids (BR) and by as-yet-unknown mechanisms. Here, we show that local auxin biosynthesis modulates LR elongation, while allelic coding variants of YUCCA8 determine the extent of elongation under N deficiency. Up-regulation of YUCCA8/3/5/7 and of Tryptophan Aminotransferase of Arabidopsis 1 (TAA1) under mild N deficiency increases auxin accumulation in LR tips. We further demonstrate that N-dependent auxin biosynthesis in LRs acts epistatically to and downstream of a canonical BR signaling cascade. The uncovered BR-auxin hormonal module and its allelic variants emphasize the importance of fine-tuning hormonal crosstalk to boost adaptive root responses to N availability and offer a path to improve soil exploration by expanded root systems in plants.


Author(s):  
Minghui Wu ◽  
Canghong Jin ◽  
Wenkang Hu ◽  
Yabo Chen

Understanding mathematical topics is important for both educators and students to capture latent concepts of questions, evaluate study performance, and recommend content in online learning systems. Compared to traditional text classification, mathematical topic classification has several main challenges: (1) mathematical questions are relatively short; (2) there are various representations of the same mathematical concept (i.e., calculations and application); (3) the content of questions is complex, including algebra, geometry, and calculus. To overcome these problems, we propose a framework that combines content tokens and mathematical knowledge concepts throughout the whole procedure. We embed entities from mathematics knowledge graphs, integrate entities into tokens in a masked language model, set up semantic similarity-based tasks for next-sentence prediction, and fuse knowledge vectors and token vectors during the fine-tuning procedure. We also build a Chinese mathematical topic prediction dataset consisting of more than 70,000 mathematical questions with topics. Our experiments using real data demonstrate that our knowledge graph-based mathematical topic prediction model outperforms other state-of-the-art methods.


2020 ◽  
Author(s):  
Anyou Wang ◽  
Rong Hai

Eukaryotic genomes gradually gain noncoding regions as evolution advances, and the human genome actively transcribes >90% of its noncoding regions [1], suggesting their importance in the evolution of the human genome. Yet <1% of them have been functionally characterized [2], leaving most of the human genome in the dark. Here we systematically decode endogenous lncRNAs located in unannotated regions of the human genome and decipher a distinctive functional regime of lncRNAs hidden in massive RNAseq data. LncRNAs are distributed divergently across chromosomes, independent of protein-coding regions. Their transcription rarely initiates at promoters through polymerase II, but mostly at enhancers. Yet conventional enhancer activators (e.g. H3K4me1) account for only a small proportion of lncRNA activation, suggesting that alternative, unknown mechanisms initiate the majority of lncRNAs. Meanwhile, lncRNA self-regulation also notably contributes to lncRNA activation. LncRNAs trans-regulate broad bioprocesses, including transcription and RNA processing, cell cycle, respiration, response to stress, chromatin organization, post-translational modification, and development. Overall, lncRNAs govern their own regime, distinct from that of proteins.


2006 ◽  
Vol 7 (1) ◽  
Author(s):  
Steven C Elbein ◽  
Xiaoqin Wang ◽  
Mohammad A Karim ◽  
Winston S Chu ◽  
Kristi D Silver
