DiviSSR: Simple arithmetic for efficient identification of tandem repeats

Numerical or vector representations of DNA sequences have been applied for identification of specific sequence characteristics and patterns which are not evident in their character (A, C, G, T) representations. These transformations often reveal a mathematical structure to the sequences which can be captured efficiently using established mathematical methods. One such transformation, the 2-bit format, represents each nucleotide using only two bits instead of eight for efficient storage of genomic data. Here we describe a mathematical property that exists in the 2-bit representation of tandemly repeated DNA sequences. Our tool, DiviSSR (pronounced divisor), leverages this property and subsequent arithmetic for ultrafast and accurate identification of tandem repeats. DiviSSR can process the entire human genome in ~30s, and short sequence reads at a rate of >1 million reads/s on a single CPU thread. Our work also highlights the implications of using simple mathematical properties of DNA sequences for faster algorithms in genomics.
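The abstract does not spell out the property itself, so the following is only a minimal sketch of one divisibility property that holds for any 2-bit (base-4) encoding, and it is not presented as DiviSSR's actual implementation. If a window of n bases is a perfect tandem repeat of a motif of length m (with m dividing n), its base-4 integer value equals the motif's value multiplied by (4^n - 1)/(4^m - 1), so a single modulo operation tests for the repeat. The encoding table and the function names `encode_2bit`, `repeat_divisor`, and `is_perfect_repeat` are illustrative assumptions.

```python
# Assumed 2-bit encoding; any fixed assignment of 0..3 to the bases works.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_2bit(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODING[base]
    return value


def repeat_divisor(window_len: int, motif_len: int) -> int:
    """Return (4^n - 1) / (4^m - 1), i.e. 1 + 4^m + 4^(2m) + ... for n/m terms."""
    if window_len % motif_len != 0:
        raise ValueError("window length must be a multiple of motif length")
    return (4 ** window_len - 1) // (4 ** motif_len - 1)


def is_perfect_repeat(window: str, motif_len: int) -> bool:
    """True if the window is a perfect tandem repeat of a motif of motif_len bases."""
    divisor = repeat_divisor(len(window), motif_len)
    return encode_2bit(window) % divisor == 0


if __name__ == "__main__":
    print(is_perfect_repeat("ACGACGACGACG", 3))  # window built from ACG x 4
    print(is_perfect_repeat("ACGACGACGACT", 3))  # last base differs
```

Because the encoded window is always smaller than 4^n, a zero remainder implies the quotient fits in m bases, so the test has no false positives for perfect repeats. Handling of ambiguous bases such as N, and of imperfect repeats, is left out of this sketch.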

PCR-Based Synthesis of Repetitive Single-Stranded DNA for Applications to Nanobiotechnology

International Journal of Nanoscience, doi:10.1142/s0219581x05003140, 2005, Vol 04 (03), pp. 287-294

Author(s): Sima S. Zein, Alexandre A. Vetcher, Stephen D. Levene

Keyword(s): DNA Sequences, Tandem Repeats, Repetitive Sequences, Practical Interest, Telomeric Sequence, Automated Synthesis, DNA Molecules, Specific Sequence, Single Stranded DNA, PCR Method

Recent data show that assembly of repetitive-sequence, single-stranded DNA molecules (ssDNA) and carbon nanotubes (CNTs) depends on the specific sequence repeat. Therefore, it is of practical interest to assess various methods for generating single-stranded DNA molecules that contain repetitive sequences. Existing automated synthesis procedures for generating long (> 100 nt) ssDNA molecules generate ssDNA products of variable purity and yield. An alternative to automated synthesis is the polymerase chain reaction (PCR), which provides a powerful tool for the amplification of minute amounts of specific DNA sequences. Here we show that a modified asymmetric PCR method allows synthesis of long ssDNAs composed of tandem repeats of the repetitive vertebrate telomeric sequence (TTAGGG)n, and is also applicable to arbitrary (repetitive or nonrepetitive) DNA. Long, repetitive deoxynucleotides produced by automated synthesis are surprisingly heterogeneous with respect to both length and sequence. Benefits of the method described here are that long, repetitive ssDNA sequences are generated with high sequence fidelity and yield.

Differentially Amplified Repetitive Sequences Among Aegilops tauschii Subspecies and Genotypes

Frontiers in Plant Science, doi:10.3389/fpls.2021.716750, 2021, Vol 12

Author(s): Rahman Ebrahimzadegan, Fatemeh Orooji, Pengtao Ma, Ghader Mirzaghaderi

Keyword(s): DNA Sequences, Tandem Repeats, Repetitive Sequences, Aegilops Tauschii, Distribution Patterns, SNP Analysis, Specific Sequence, Factors Affecting, Illumina Sequence, Species Specific

Genomic repetitive sequences commonly show species-specific sequence type, abundance, and distribution patterns; however, their intraspecific characteristics have been poorly described. We quantified the genomic repetitive sequences and performed single nucleotide polymorphism (SNP) analysis between 29 Ae. tauschii genotypes and subspecies using publicly available raw genomic Illumina sequence reads and used fluorescence in situ hybridization (FISH) to experimentally analyze some repeats. The majority of the identified repetitive sequences had similar contents and proportions between anathera, meyeri, and strangulata subspecies. However, two Ty3/gypsy retrotransposons (CL62 and CL87) showed significantly higher abundances, and CL1, CL119, CL213, CL217 tandem repeats, and CL142 retrotransposon (Ty1/copia type) showed significantly lower abundances in subspecies strangulata compared with the subspecies anathera and meyeri. One tandem repeat and 45S ribosomal DNA (45S rDNA) abundances showed a high variation between genotypes but their abundances were not subspecies specific. Phylogenetic analysis using the repeat abundances of the aforementioned clusters placed the strangulata subsp. in a distinct clade but could not discriminate anathera and meyeri. A near complete differentiation of anathera and strangulata subspecies was observed using SNP analysis; however, var. meyeri showed higher genetic diversity. FISH using major tandem repeats could not detect differences between subspecies, although (GAA)10 signal patterns generated two different karyotype groups. Taken together, the different classes of repetitive DNA sequences have differentially accumulated between strangulata and the other two subspecies of Ae. tauschii, which is generally in agreement with spike morphology, implying that factors affecting repeatome evolution are variable even among very closely related lineages.

Sequence, Chromatin and Evolution of Satellite DNA

International Journal of Molecular Sciences, doi:10.3390/ijms22094309, 2021, Vol 22 (9), pp. 4309

Author(s): Jitendra Thakur, Jenika Packiaraj, Steven Henikoff

Keyword(s): DNA Sequences, Satellite DNA, Tandem Repeats, Large Fraction, DNA Curvature, Sequence Motifs, Specific Sequence, Satellite DNAs, DNA Repeat, Species Specific

Satellite DNA consists of abundant tandem repeats that play important roles in cellular processes, including chromosome segregation, genome organization and chromosome end protection. Most satellite DNA repeat units are either of nucleosomal length or 5–10 bp long and occupy centromeric, pericentromeric or telomeric regions. Due to high repetitiveness, satellite DNA sequences have largely been absent from genome assemblies. Although few conserved satellite-specific sequence motifs have been identified, DNA curvature, dyad symmetries and inverted repeats are features of various satellite DNAs in several organisms. Satellite DNA sequences are either embedded in highly compact gene-poor heterochromatin or specialized chromatin that is distinct from euchromatin. Nevertheless, some satellite DNAs are transcribed into non-coding RNAs that may play important roles in satellite DNA function. Intriguingly, satellite DNAs are among the most rapidly evolving genomic elements, such that a large fraction is species-specific in most organisms. Here we describe the different classes of satellite DNA sequences, their satellite-specific chromatin features, and how these features may contribute to satellite DNA biology and evolution. We also discuss how the evolution of functional satellite DNA classes may contribute to speciation in plants and animals.

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Bioinformatics, doi:10.1093/bioinformatics/btab083, 2021

Author(s): Yanrong Ji, Zhihan Zhou, Han Liu, Ramana V Davuluri

Keyword(s): DNA Sequences, Regulatory Elements, Ease Of Use, Fine Tuning, Supplementary Information, Sequence Motifs, Semantic Relationship, Accurate Identification, Conserved Sequence, Genome Wide

Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture, especially in data-scarce scenarios.

Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned to many other sequence analysis tasks.

Availability and implementation: The source code, pretrained and finetuned model for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT).

Supplementary information: Supplementary data are available at Bioinformatics online.

A species-specific satellite DNA from the entomopathogenic nematode Heterorhabditis indicus

Genome, doi:10.1139/g98-005, 1998, Vol 41 (2), pp. 148-153, Cited By ~ 8

Author(s): Monique Abadon, Eric Grenier, Christian Laumond, Pierre Abad

Keyword(s): DNA Sequence, Satellite DNA, Tandem Repeats, Sequence Data, Entomopathogenic Nematode, Consensus Sequence, Repeated Sequence, Nucleotide Sequence Analysis, Specific Sequence, Species Specific

An AluI satellite DNA family has been cloned from the entomopathogenic nematode Heterorhabditis indicus. This repeated sequence appears to be an unusually abundant satellite DNA, since it constitutes about 45% of the H. indicus genome. The consensus sequence is 174 nucleotides long and has an A + T content of 56%, with the presence of direct and inverted repeat clusters. DNA sequence data reveal that monomers are quite homogeneous. Such homogeneity suggests that some mechanism is acting to maintain the homogeneity of this satellite DNA, despite its abundance, or that this repeated sequence could have appeared recently in the genome of H. indicus. Hybridization analysis of genomic DNAs from different Heterorhabditis species shows that this satellite DNA sequence is specific to the H. indicus genome. Considering the species specificity and the high copy number of this AluI satellite DNA sequence, it could provide a rapid and powerful tool for identifying H. indicus strains.

Key words: AluI repeated DNA, tandem repeats, species-specific sequence, nucleotide sequence analysis.

Highly Condensed Potato Pericentromeric Heterochromatin Contains rDNA-Related Tandem Repeats

Genetics, doi:10.1093/genetics/162.3.1435, 2002, Vol 162 (3), pp. 1435-1444, Cited By ~ 1

Author(s): Robert M Stupar, Junqi Song, Ahmet L Tek, Zhukuan Cheng, Fenggao Dong, ...

Keyword(s): Repetitive DNA, DNA Sequences, Tandem Repeats, Intergenic Spacer, Pericentromeric Heterochromatin, DNA Repeats, Solanum Bulbocastanum, Origin And Evolution, Potato Genome, DNA Elements

The heterochromatin in eukaryotic genomes represents gene-poor regions and contains highly repetitive DNA sequences. The origin and evolution of DNA sequences in the heterochromatic regions are poorly understood. Here we report a unique class of pericentromeric heterochromatin consisting of DNA sequences highly homologous to the intergenic spacer (IGS) of the 18S•25S ribosomal RNA genes in potato. A 5.9-kb tandem repeat, named 2D8, was isolated from a diploid potato species Solanum bulbocastanum. Sequence analysis indicates that the 2D8 repeat is related to the IGS of potato rDNA. This repeat is associated with highly condensed pericentromeric heterochromatin at several hemizygous loci. The 2D8 repeat is highly variable in structure and copy number throughout the Solanum genus, suggesting that it is evolutionarily dynamic. Additional IGS-related repetitive DNA elements were also identified in the potato genome. The possible mechanism of the origin and evolution of the IGS-related repeats is discussed. We demonstrate that potato serves as an interesting model for studying repetitive DNA families because it is propagated vegetatively, thus minimizing the meiotic mechanisms that can remove novel DNA repeats.

SiteOut: an online tool to design binding site-free DNA sequences

doi:10.1101/029645, 2015, Cited By ~ 1

Author(s): Javier Estrada, Teresa Ruiz-Herrero, Clarissa Scholes, Zeba Wunderlich, Angela DePace

Keyword(s): Binding Site, DNA Sequences, Binding Sites, Regulatory Proteins, Biological Processes, Major Goal, Specific Sequence, Online Tool, Protein Binding Sites, Regulatory DNA

DNA-binding proteins control many fundamental biological processes such as transcription, recombination and replication. A major goal is to decipher the role that DNA sequence plays in orchestrating the binding and activity of such regulatory proteins. To address this goal, it is useful to rationally design DNA sequences with desired numbers, affinities and arrangements of protein binding sites. However, removing binding sites from DNA is computationally non-trivial since one risks creating new sites in the process of deleting or moving others. Here we present an online binding site removal tool, SiteOut, that enables users to design arbitrary DNA sequences that entirely lack binding sites for factors of interest. SiteOut can also be used to delete sites from a specific sequence, or to introduce site-free spacers between functional sequences without creating new sites at the junctions. In combination with commercial DNA synthesis services, SiteOut provides a powerful and flexible platform for synthetic projects that interrogate regulatory DNA. Here we describe the algorithm and illustrate the ways in which SiteOut can be used; it is publicly available at https://depace.med.harvard.edu/siteout/

Genome-Wide Identification of 5-Methylcytosine Sites in Bacterial Genomes By High-Throughput Sequencing of MspJI Restriction Fragments

doi:10.1101/2021.02.10.430591, 2021

Author(s): Brian P. Anton, Alexey Fomenkov, Victoria Wu, Richard J. Roberts

Keyword(s): Single Molecule, DNA Sequences, High Throughput Sequencing, Cost Effective, Restriction Enzymes, Specific Sequence, Genome Wide, Cost Effective Alternative, Simple Column, Sequencing Platforms

Single-molecule Real-Time (SMRT) sequencing can easily identify sites of N6-methyladenine and N4-methylcytosine within DNA sequences, but similar identification of 5-methylcytosine sites is not as straightforward. In prokaryotic DNA, methylation typically occurs within specific sequence contexts, or motifs, that are a property of the methyltransferases that “write” these epigenetic marks. We present here a straightforward, cost-effective alternative to both SMRT and bisulfite sequencing for the determination of prokaryotic 5-methylcytosine methylation motifs. The method, called MFRE-Seq, relies on excision and isolation of fully methylated fragments of predictable size using MspJI-Family Restriction Enzymes (MFREs), which depend on the presence of 5-methylcytosine for cleavage. We demonstrate that MFRE-Seq is compatible with both Illumina and Ion Torrent sequencing platforms and requires only a digestion step and simple column purification of size-selected digest fragments prior to standard library preparation procedures. We applied MFRE-Seq to numerous bacterial and archaeal genomic DNA preparations and successfully confirmed known motifs and identified novel ones. This method should be a useful complement to existing methodologies for studying prokaryotic methylomes and characterizing the contributing methyltransferases.

Genome-wide identification of 5-methylcytosine sites in bacterial genomes by high-throughput sequencing of MspJI restriction fragments

PLoS ONE, doi:10.1371/journal.pone.0247541, 2021, Vol 16 (5), pp. e0247541

Author(s): Brian P. Anton, Alexey Fomenkov, Victoria Wu, Richard J. Roberts

Keyword(s): Single Molecule, DNA Sequences, High Throughput Sequencing, Cost Effective, Restriction Enzymes, Specific Sequence, Genome Wide, Cost Effective Alternative, Simple Column, Sequencing Platforms

Single-molecule Real-Time (SMRT) sequencing can easily identify sites of N6-methyladenine and N4-methylcytosine within DNA sequences, but similar identification of 5-methylcytosine sites is not as straightforward. In prokaryotic DNA, methylation typically occurs within specific sequence contexts, or motifs, that are a property of the methyltransferases that “write” these epigenetic marks. We present here a straightforward, cost-effective alternative to both SMRT and bisulfite sequencing for the determination of prokaryotic 5-methylcytosine methylation motifs. The method, called MFRE-Seq, relies on excision and isolation of fully methylated fragments of predictable size using MspJI-Family Restriction Enzymes (MFREs), which depend on the presence of 5-methylcytosine for cleavage. We demonstrate that MFRE-Seq is compatible with both Illumina and Ion Torrent sequencing platforms and requires only a digestion step and simple column purification of size-selected digest fragments prior to standard library preparation procedures. We applied MFRE-Seq to numerous bacterial and archaeal genomic DNA preparations and successfully confirmed known motifs and identified novel ones. This method should be a useful complement to existing methodologies for studying prokaryotic methylomes and characterizing the contributing methyltransferases.

Molecular and cytogenetic analysis of repetitive DNA in pea (Pisum sativum L.)

Genome, doi:10.1139/g01-056, 2001, Vol 44 (4), pp. 716-728, Cited By ~ 21

Author(s): Pavel Neumann, Marcela Nouzová, Jirí Macas

Keyword(s): Pisum Sativum, Repetitive DNA, DNA Sequences, Genomic DNA, Tandem Repeats, Genomic Library, Chromosome Morphology, Plant Genome, Genomic Repeats, Dispersed Repeats

A set of pea DNA sequences representing the most abundant genomic repeats was obtained by combining several approaches. Dispersed repeats were isolated by screening a short-insert genomic library using genomic DNA as a probe. Thirty-two clones ranging from 149 to 2961 bp in size and from 1000 to 39 000/1C in their copy number were sequenced and further characterized. Fourteen clones were identified as retrotransposon-like sequences, based on their homologies to known elements. Fluorescence in situ hybridization using clones of reverse transcriptase and integrase coding sequences as probes revealed that corresponding retroelements were scattered along all pea chromosomes. Two novel families of tandem repeats, named PisTR-A and PisTR-B, were isolated by screening a genomic DNA library with Cot-1 DNA and by employing genomic self-priming PCR, respectively. PisTR-A repeats are 211–212 bp long, their abundance is 2 × 10^4 copies/1C, and they are partially clustered in a secondary constriction of one chromosome pair with the rest of their copies dispersed on all chromosomes. PisTR-B sequences are of similar abundance (10^4 copies/1C) but differ from the "A" family in their monomer length (50 bp), high A/T content, and chromosomal localization in a limited number of discrete bands. These bands are located mainly in (sub)telomeric and pericentromeric regions, and their patterns, together with chromosome morphology, allow discrimination of all chromosome types within the pea karyotype. Whereas both tandem repeat families are mostly specific to the genus Pisum, many of the dispersed repeats were detected in other legume species, mainly those in the genus Vicia.

Key words: repetitive DNA, plant genome, retroelements, satellite DNA, Pisum sativum.
