Conservation of k-mer Composition and Correlation Contribution between Introns and Intergenic Regions of Animalia Genomes

Genes ◽  
2018 ◽  
Vol 9 (10) ◽  
pp. 482 ◽  
Author(s):  
Aaron Sievers ◽  
Frederik Wenz ◽  
Michael Hausmann ◽  
Georg Hildenbrand

In this study, we pairwise-compared multiple genome regions, including genes, exons, coding DNA sequences (CDS), introns, and intergenic regions of 39 Animalia genomes, including Deuterostomia (27 species) and Protostomia (12 species), by applying established k-mer-based (alignment-free) comparison methods. We found strong correlations between the sequence structure of introns and intergenic regions, both within individual organisms and within wider phylogenetic ranges, indicating the conservation of certain structures over the full range of analyzed organisms. We analyzed these sequence structures by decomposing the correlation coefficients with respect to different sets of DNA words and quantifying the contribution of each word set to the average correlation value. We found that the conserved structures within introns, intergenic regions, and between the two were mainly a result of conserved tandem repeats with repeat units ≤ 2 bp (e.g., (AT)n), while other conserved sequence structures, such as those found between exons and CDS, were dominated by tandem repeats with repeat unit sizes of 3 bp in length and more complex DNA word patterns. We conclude that the conservation between intron and intergenic regions indicates a shared function of these sequence structures. Also, the fact that both differ in similar ways from conserved structures of known origin, especially from the conservation between exons and CDS that results from DNA codons, indicates that k-mer composition-based functional properties of introns and intergenic regions may differ from those of exons and CDS.
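The decomposition described in this abstract can be illustrated with a minimal sketch. It assumes Pearson's correlation coefficient over relative k-mer frequencies and splits it additively across a partition of the DNA words; the abstract does not give the authors' exact formula or normalization, so the function names `kmer_frequencies` and `correlation_contributions` and the example word sets are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer over the alphabet ACGT."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return {w: (counts[w] / total if total else 0.0) for w in words}


def correlation_contributions(freq_a, freq_b, word_sets):
    """Return Pearson's r and the additive share of r carried by each word set.

    Assumption: word_sets is a dict {name: iterable of words}; the shares sum
    to r only if the sets partition the full list of k-mers.
    """
    words = list(freq_a)
    n = len(words)
    mean_a = sum(freq_a[w] for w in words) / n
    mean_b = sum(freq_b[w] for w in words) / n
    dev_a = {w: freq_a[w] - mean_a for w in words}
    dev_b = {w: freq_b[w] - mean_b for w in words}
    norm = sqrt(sum(d * d for d in dev_a.values())
                * sum(d * d for d in dev_b.values()))
    if norm == 0:
        raise ValueError("correlation is undefined for constant frequencies")
    # Each word contributes one term of the covariance sum, scaled by the norm.
    per_word = {w: dev_a[w] * dev_b[w] / norm for w in words}
    r = sum(per_word.values())
    shares = {name: sum(per_word[w] for w in ws)
              for name, ws in word_sets.items()}
    return r, shares
```

With k = 2, for example, the word set {AT, TA} captures (AT)n-type repeats and the remaining fourteen dinucleotides form the complementary set; the two shares then add up to r.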

Genes ◽  
2021 ◽  
Vol 12 (10) ◽  
pp. 1571
Author(s):  
Aaron Sievers ◽  
Liane Sauer ◽  
Michael Hausmann ◽  
Georg Hildenbrand

Several strongly conserved DNA sequence patterns in and between introns and intergenic regions (IIRs) consisting of short tandem repeats (STRs) with repeat lengths <3 bp have already been described in the kingdom of Animalia. In this work, we expanded the search and analysis of conserved DNA sequence patterns to a wider range of eukaryotic genomes. Our aims were to confirm the conservation of these patterns, to support the hypothesis of functional constraints on them, and/or to identify unknown patterns. We pairwise compared genomic DNA sequences of genes, exons, CDS, introns and intergenic regions of 34 Embryophyta (land plants), 30 Protista and 29 Fungi using established k-mer-based (alignment-free) comparison methods. Additionally, the results were compared with values derived for Animalia in former studies. We confirmed strong correlations between the sequence structures of IIRs spanning the entire domain of Eukaryotes. We found that the high correlations within introns, intergenic regions and between the two are a result of conserved abundances of STRs with repeat units ≤2 bp (e.g., (AT)n). For some sequence patterns and their inverse complementary sequences, we found a violation of equal distribution on complementary DNA strands in a subset of genomes. Looking at mismatches within the identified STR patterns, we found specific preferences for certain nucleotides that were stable over all four phylogenetic kingdoms. We conclude that all of these conserved patterns between IIRs indicate a shared function of these sequence structures related to STRs.
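The "violation of equal distribution on complementary DNA strands" can be made concrete with a small sketch that compares how often a word and its inverse (reverse) complement occur on one strand. The asymmetry score used here, (n_word - n_rc) / (n_word + n_rc), is an illustrative assumption and not necessarily the statistic used by the authors; `strand_asymmetry` is a hypothetical helper name.

```python
from collections import Counter

# Translation table for complementing A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Inverse complementary sequence, read 5' to 3' on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def strand_asymmetry(seq, word):
    """Score in [-1, 1]; 0 means the word and its reverse complement are equally common.

    Overlapping occurrences are counted on the given strand only. Words that
    are their own reverse complement (e.g. AT) always score 0.
    """
    k = len(word)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    n_word = counts[word]
    n_rc = counts[reverse_complement(word)]
    if n_word + n_rc == 0:
        return 0.0
    return (n_word - n_rc) / (n_word + n_rc)
```

A pure (AC)n tract on the analyzed strand gives a score of 1 for the word AC, because its reverse complement GT appears only on the opposite strand.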


2007 ◽  
Vol 15 (03) ◽  
pp. 299-312
Author(s):  
SU-LONG NYEO ◽  
JUI-PING YU

The length distributions of simple tandem repeats in the genomes of several organisms are evaluated and found to exhibit long-range correlations in repeats related to the A and T nucleotide bases for most eukaryotes. In particular, the length distributions of the mononucleotide A/T repeat units have longer tails than those of the C/G repeat units. Also, the length distributions of the dinucleotide repeat unit CG show a simple, monotonically and rapidly decreasing behavior, while those of repeat units AT, AG and AC have complicated structures at larger repeat lengths, especially for human, mouse and rat chromosomes. These distributive behaviors are due to the CpG deficiency in different genomes with different methylation activities. In particular, methyltransferases in vertebrates appear to methylate specifically the cytosine in CpG dinucleotides, and methylated cytosine is prone to mutate to thymine by spontaneous deamination. The dinucleotide CpG would gradually decay into TpG and CpA. In addition, there is a peak in the distributions of repeat unit A at repeat-repeat separation 153 nt for humans and chimpanzees. We show that the long-tail behavior of mononucleotide repeat unit A and the peak at repeat separation 153 nt are due to the interspersed repetitive DNA sequences in humans and chimpanzees.
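A length distribution of the kind discussed in this abstract can be tabulated with a short sketch. It assumes perfect (mismatch-free) tandem repeats of a given unit and counts maximal non-overlapping tracts; the authors' actual counting procedure is not described in the abstract, and `repeat_length_distribution` and its `min_copies` parameter are hypothetical.

```python
import re
from collections import Counter


def repeat_length_distribution(seq, unit, min_copies=2):
    """Map number of tandem copies -> number of tracts with that many copies.

    Only perfect repeats of `unit` are matched; each tract is the longest run
    found by a left-to-right, non-overlapping regular-expression scan.
    """
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_copies},}}")
    return Counter(len(m.group(0)) // len(unit) for m in pattern.finditer(seq))
```

Comparing, say, the distributions for the units A and C, or for CG and AT, on the same chromosome is then a matter of calling the function once per unit and inspecting how quickly the counts fall off with the number of copies.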


Genetics ◽  
1999 ◽  
Vol 151 (2) ◽  
pp. 511-519 ◽  
Author(s):  
Robert J Kokoska ◽  
Lela Stefanovic ◽  
Andrew B Buermeyer ◽  
R Michael Liskay ◽  
Thomas D Petes

Abstract The POL30 gene of the yeast Saccharomyces cerevisiae encodes the proliferating cell nuclear antigen (PCNA), a protein required for processive DNA synthesis by DNA polymerases δ and ε. We examined the effects of the pol30-52 mutation on the stability of microsatellite (1- to 8-bp repeat units) and minisatellite (20-bp repeat units) DNA sequences. It had previously been shown that this mutation destabilizes dinucleotide repeats 150-fold and that this effect is primarily due to defects in DNA mismatch repair. From our analysis of the effects of pol30-52 on classes of repetitive DNA with longer repeat unit lengths, we conclude that this mutation may also elevate the rate of DNA polymerase slippage. The effect of pol30-52 on tracts of repetitive DNA with large repeat unit lengths was similar, but not identical, to that observed previously for pol3-t, a temperature-sensitive mutation affecting DNA polymerase δ. Strains with both pol30-52 and pol3-t mutations grew extremely slowly and had minisatellite mutation rates considerably greater than those observed in either single mutant strain.


1993 ◽  
Vol 13 (10) ◽  
pp. 6520-6529
Author(s):  
P E Warburton ◽  
J S Waye ◽  
H F Willard

Tandemly repeated DNA families appear to undergo concerted evolution, such that repeat units within a species have a higher degree of sequence similarity than repeat units from even closely related species. While intraspecies homogenization of repeat units can be explained satisfactorily by repeated rounds of genetic exchange processes such as unequal crossing over and/or gene conversion, the parameters controlling these processes remain largely unknown. Alpha satellite DNA is a noncoding tandemly repeated DNA family found at the centromeres of all human and primate chromosomes. We have used sequence analysis to investigate the molecular basis of 13 variant alpha satellite repeat units, allowing comparison of multiple independent recombination events in closely related DNA sequences. The distribution of these events within the 171-bp monomer is nonrandom and clusters in a distinct 20- to 25-bp region, suggesting possible effects of primary sequence and/or chromatin structure. The position of these recombination events may be associated with the location within the higher-order repeat unit of the binding site for the centromere-specific protein CENP-B. These studies have implications for the molecular nature of genetic recombination, mechanisms of concerted evolution, and higher-order structure of centromeric heterochromatin.


Genome ◽  
1998 ◽  
Vol 41 (3) ◽  
pp. 429-434 ◽  
Author(s):  
J B Buntjer ◽  
J A Lenstra

We describe a PCR-like reaction in which genomic DNA acts as a template as well as a primer. Interaction between genomic tandem repeat units leads to self-amplification of satellite DNA. This genomic self-priming PCR (GSP-PCR) allowed the rapid amplification of species-specific tandem repeats of horse, cattle, dolphin, and chicken. A novel specific satellite of ostrich with a repeat unit of 60 bp was isolated using this method. Key words: satellite DNA, amplification, isolation, species-specific probes.


Genes ◽  
2019 ◽  
Vol 10 (7) ◽  
pp. 542
Author(s):  
Kim ◽  
Song ◽  
Ha ◽  
Moon ◽  
Kim ◽  
...  

Variable number tandem repeats (VNTRs) in mitochondrial DNA (mtDNA) of Lentinula edodes are of interest for their role in mtDNA variation and their application as genetic markers. Sequence analysis of three L. edodes mtDNAs revealed the presence of VNTRs of two categories. Type I VNTRs consist of two types of repeat units in a symmetric distribution, whereas Type II VNTRs contain tandemly arrayed repeats of 7- or 17-bp DNA sequences. The number of repeat units varied with the mtDNA of different strains. Using the variations in VNTRs as a mitochondrial marker and the A mating type as a nuclear type marker, we demonstrated that one of the two nuclei in the donor dikaryon preferentially enters into the monokaryotic cytoplasm to establish a new dikaryon which still retains the mitochondria of the monokaryon in the individual mating. Interestingly, we found 6 VNTRs with newly added repeat units from the 22 mates, indicating that elongation of VNTRs occurs during replication of mtDNA. This, together with comparative analysis of the repeating pattern, enables us to propose a mechanistic model that explains the elongation of Type I VNTRs through reciprocal incorporation of basic repeat units, 5’-TCCCTTTAGGG-3’ and its complementary sequence (5’-CCCTAAAGGGA-3’).
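The relationship between the two basic repeat units quoted above can be checked directly: read 5' to 3', the second unit is the reverse complement of the first. The sketch below is only a verification aid; the helper names are hypothetical and it says nothing about the elongation mechanism itself.

```python
# Watson-Crick pairing used to complement each base.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return "".join(PAIRING[base] for base in reversed(seq))


def is_reverse_complement_pair(unit_a, unit_b):
    """True if unit_b is the opposite-strand reading of unit_a."""
    return reverse_complement(unit_a) == unit_b


if __name__ == "__main__":
    # Repeat units as given in the abstract.
    print(is_reverse_complement_pair("TCCCTTTAGGG", "CCCTAAAGGGA"))
```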


Genetics ◽  
2003 ◽  
Vol 164 (3) ◽  
pp. 1087-1097 ◽  
Author(s):  
F C Hsu ◽  
C J Wang ◽  
C M Chen ◽  
H Y Hu ◽  
C C Chen

Abstract Two families of tandem repeats, 180-bp and TR-1, have been found in the knobs of maize. In this study, we isolated 59 clones belonging to the TR-1 family from maize and teosinte. Southern hybridization and sequence analysis revealed that members of this family are composed of three basic sequences, A (67 bp); B (184 bp) or its variants B′ (184 bp), 2/3B (115 bp), 2/3B′ (115 bp); and C (108 bp), which are arranged in various combinations to produce repeat units that are multiples of ∼180 bp. The molecular structure of TR-1 elements suggests that: (1) the B component may evolve from the 180-bp knob repeat as a result of mutations during evolution; (2) B′ may originate from B through lateral amplification accompanied by base-pair changes; (3) C plus A may be a single sequence that is added to B and B′, probably via nonhomologous recombination; and (4) 69 bp at the 3′ end of B or B′, and the entire sequence of C can be removed from the elements by an unknown mechanism. Sequence comparisons showed partial homologies between TR-1 elements and two centromeric sequences (B repeats) of the supernumerary B chromosome. This result, together with the finding of other investigators that the B repeat is also fragmentarily homologous to the 180-bp repeat, suggests that the B repeat is derived from knob repeats in A chromosomes, which subsequently become structurally modified. Fluorescence in situ hybridization localized the B repeat to the B centromere and the 180-bp and TR-1 repeats to the proximal heterochromatin knob on the B chromosome.


Author(s):  
Yanrong Ji ◽  
Zhihan Zhou ◽  
Han Liu ◽  
Ramana V Davuluri

Abstract Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferable understanding of genomic DNA sequences based on up- and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrated its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation: The source code and the pretrained and finetuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information: Supplementary data are available at Bioinformatics online.


2013 ◽  
Vol 72 (1) ◽  
pp. 1-133 ◽  
Author(s):  
Višnja Besendorfer ◽  
Jelena Mlinarec

Abstract Satellite DNA is a genomic component present in virtually all eukaryotic organisms. The turnover of highly repetitive satellite DNA is an important element in genome organization and evolution in plants. Here we study the presence, physical distribution and abundance of the satellite DNA family AhTR1 in Anemone. Twenty-two Anemone accessions were analyzed by PCR to assess the presence of AhTR1, while fluorescence in situ hybridization and Southern hybridization were used to determine the abundance and genomic distribution of AhTR1. The AhTR1 repeat unit was PCR-amplified only in eight phylogenetically related European Anemone taxa of the Anemone section. FISH signal with AhTR1 probe was visible only in A. hortensis and A. pavonina, showing localization of AhTR1 in the regions of interstitial heterochromatin in both species. The absence of a FISH signal in the six other taxa as well as weak signal after Southern hybridization suggest that in these species AhTR1 family appears as relict sequences. Thus, the data presented here support the »library hypothesis« for AhTR1 satellite evolution in Anemone. Similar species-specific satellite DNA profiles in A. hortensis and A. pavonina support the treatment of A. hortensis and A. pavonina as one species, i.e. A. hortensis s.l.

