DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Abstract: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex because of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, that forms a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We show that the single pre-trained transformer model can simultaneously achieve state-of-the-art performance on many sequence prediction tasks after easy fine-tuning on small task-specific datasets. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences, for better interpretability and accurate identification of conserved sequence motifs and functional genetic variants. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance.

Bioinformatics ◽

10.1093/bioinformatics/btab083 ◽

2021 ◽

Author(s):

Yanrong Ji ◽

Zhihan Zhou ◽

Han Liu ◽

Ramana V Davuluri

Keyword(s):

DNA Sequences ◽

Regulatory Elements ◽

Ease Of Use ◽

Fine Tuning ◽

Supplementary Information ◽

Sequence Motifs ◽

Semantic Relationship ◽

Accurate Identification ◽

Conserved Sequence ◽

Genome Wide

Abstract

Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex because of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios.

Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites after easy fine-tuning on small task-specific labeled datasets. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences, for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks.

Availability and implementation: The source code and the pre-trained and fine-tuned models for DNABERT are available on GitHub (https://github.com/jerryji1993/DNABERT).

Supplementary information: Supplementary data are available at Bioinformatics online.
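The abstract does not say how a DNA sequence is turned into model input. The DNABERT paper describes representing a sequence as overlapping k-mer tokens, so the following is only a minimal sketch of that idea. The function name `kmer_tokens` and the default `k` are illustrative assumptions, not part of the DNABERT codebase.

```python
def kmer_tokens(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with a stride of 1.

    Hypothetical helper for illustration; not the DNABERT API.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    # A sequence of length n yields n - k + 1 overlapping k-mers
    # (an empty list when the sequence is shorter than k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    print(kmer_tokens("ATGGCT", k=3))
```

Because each nucleotide appears in up to k consecutive tokens, neighbouring tokens share context, which is one reason such a representation suits a bidirectional encoder.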

Conserved sequence motifs among bacterial, eukaryotic, and archaeal phosphatases that define a new phosphohydrolase superfamily

Protein Science ◽

10.1002/pro.5560070722 ◽

1998 ◽

Vol 7 (7) ◽

pp. 1647-1652 ◽

Cited By ~ 96

Author(s):

Maria Cristina Thaller ◽

Serena Schippa ◽

Gian Maria Rossolini

Keyword(s):

Sequence Motifs ◽

Conserved Sequence ◽

Conserved Sequence Motifs

Identification and Analysis of Novel Amino-Acid Sequence Repeats in Bacillus anthracis str. Ames Proteome Using Computational Tools

Comparative and Functional Genomics ◽

10.1155/2007/47161 ◽

2007 ◽

Vol 2007 ◽

pp. 1-23 ◽

Cited By ~ 2

Author(s):

G. R. Hemalatha ◽

D. Satyanarayana Rao ◽

L. Guruprasad

Keyword(s):

Amino Acid ◽

Amino Acid Residue ◽

Protein Sequence ◽

Amino Acid Residues ◽

Sequence Motifs ◽

Computational Tools ◽

Conserved Sequence ◽

Conserved Sequence Motifs ◽

Multiple Copies ◽

Domain 3

We have identified four repeats and ten domains that are novel in proteins encoded by the Bacillus anthracis str. Ames proteome using automated in silico methods. A “repeat” corresponds to a region of fewer than 55 amino acid residues that occurs more than once in the protein sequence and is sometimes present in tandem. A “domain” corresponds to a conserved region of more than 55 amino acid residues that may be present as single or multiple copies in the protein sequence. These correspond to (1) 57-amino-acid-residue PxV domain, (2) 122-amino-acid-residue FxF domain, (3) 111-amino-acid-residue YEFF domain, (4) 109-amino-acid-residue IMxxH domain, (5) 103-amino-acid-residue VxxT domain, (6) 84-amino-acid-residue ExW domain, (7) 104-amino-acid-residue NTGFIG domain, (8) 36-amino-acid-residue NxGK repeat, (9) 95-amino-acid-residue VYV domain, (10) 75-amino-acid-residue KEWE domain, (11) 59-amino-acid-residue AFL domain, (12) 53-amino-acid-residue RIDVK repeat, (13) (a) 41-amino-acid-residue AGQF repeat and (b) 42-amino-acid-residue GSAL repeat. Each repeat or domain type is characterized by specific conserved sequence motifs. We discuss the presence of these repeats and domains in proteins from other genomes and their probable secondary structure.
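The repeat/domain distinction in this abstract reduces to a length threshold plus a copy-number condition. A minimal sketch of that rule follows; the function name `classify_region` and the "unclassified" fallback are assumptions made for illustration, and the abstract does not state how a region of exactly 55 residues is treated.

```python
# Length threshold (in amino acid residues) stated in the abstract.
LENGTH_THRESHOLD = 55


def classify_region(length: int, copies: int = 1) -> str:
    """Label a conserved region using the abstract's definitions.

    Hypothetical helper: a "repeat" is shorter than 55 residues and occurs
    more than once; a "domain" is longer than 55 residues in any copy number.
    """
    if length < LENGTH_THRESHOLD and copies > 1:
        return "repeat"
    if length > LENGTH_THRESHOLD:
        return "domain"
    # Cases the abstract does not cover (e.g. exactly 55 residues,
    # or a short region seen only once).
    return "unclassified"


if __name__ == "__main__":
    # Lengths come from the abstract; the copy counts are placeholders.
    print(classify_region(36, copies=2))   # NxGK region
    print(classify_region(122, copies=1))  # FxF region
```

The lengths used in the usage lines are those listed for the NxGK and FxF regions; the copy counts are illustrative only, since the abstract does not report them.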

Conserved sequence motifs, alignment, and secondary structure for the third domain of animal 12S rRNA

Molecular Biology and Evolution ◽

10.1093/oxfordjournals.molbev.a025552 ◽

1996 ◽

Vol 13 (1) ◽

pp. 150-169 ◽

Cited By ~ 181

Author(s):

R. E. Hickson ◽

C. Simon ◽

A. Cooper ◽

G. S. Spicer ◽

J. Sullivan ◽

...

Keyword(s):

Secondary Structure ◽

12S rRNA ◽

Sequence Motifs ◽

Conserved Sequence ◽

The Third ◽

Conserved Sequence Motifs

SAND, a New Protein Family: From Nucleic Acid to Protein Structure and Function Prediction

Comparative and Functional Genomics ◽

10.1002/cfg.93 ◽

2001 ◽

Vol 2 (4) ◽

pp. 226-235 ◽

Cited By ~ 5

Author(s):

Amanda Cottage ◽

Yvonne J. K. Edwards ◽

Greg Elgar

Keyword(s):

Genomic Sequence ◽

Protein Family ◽

Sequence Motifs ◽

EST Database ◽

Computational Tools ◽

Conserved Sequence ◽

Conserved Sequence Motifs ◽

And Function ◽

New Protein ◽

Genomic Organisation

As a result of genome, EST and cDNA sequencing projects, there are huge numbers of predicted and/or partially characterised protein sequences compared with a relatively small number of proteins with experimentally determined function and structure. Thus, considerable attention is focused on the accurate prediction of gene function and structure from sequence by using bioinformatics. In the course of our analysis of genomic sequence from Fugu rubripes, we identified a novel gene, SAND, with significant sequence identity to hypothetical proteins predicted in Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, a Drosophila melanogaster gene, and mouse and human cDNAs. Here we identify a further SAND homologue in human and Arabidopsis thaliana by use of standard computational tools. We describe the genomic organisation of SAND in these evolutionarily divergent species and identify sequence homologues from EST database searches, confirming the expression of SAND in over 20 different eukaryotes. We confirm the expression of two different SAND paralogues in mammals and determine expression of one SAND in other vertebrates and eukaryotes. Furthermore, we predict structural properties of SAND and characterise conserved sequence motifs in this protein family.
