DiviSSR: Simple arithmetic for efficient identification of tandem repeats

2021 ◽  
Author(s):  
Akshay Kumar Avvaru ◽  
Rakesh K Mishra ◽  
Divya Tej Sowpati

Numerical or vector representations of DNA sequences have been applied for identification of specific sequence characteristics and patterns which are not evident in their character (A, C, G, T) representations. These transformations often reveal a mathematical structure to the sequences which can be captured efficiently using established mathematical methods. One such transformation, the 2-bit format, represents each nucleotide using only two bits instead of eight for efficient storage of genomic data. Here we describe a mathematical property that exists in the 2-bit representation of tandemly repeated DNA sequences. Our tool, DiviSSR (pronounced divisor), leverages this property and subsequent arithmetic for ultrafast and accurate identification of tandem repeats. DiviSSR can process the entire human genome in ~30s, and short sequence reads at a rate of >1 million reads/s on a single CPU thread. Our work also highlights the implications of using simple mathematical properties of DNA sequences for faster algorithms in genomics.
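The abstract above does not spell out the arithmetic property it refers to. As a rough illustration only, the sketch below shows one elementary fact about 2-bit encodings that fits the description: when a window of length n consists of exact copies of a motif of length m, its base-4 integer value is a multiple of (4^n - 1)/(4^m - 1). The nucleotide-to-bits mapping and the function names `encode_2bit`, `repeat_divisor` and `is_perfect_repeat` are assumptions made for this example and are not taken from DiviSSR itself.

```python
# Assumed 2-bit mapping; the tool's actual mapping may differ.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_2bit(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per nucleotide."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def repeat_divisor(window_len: int, motif_len: int) -> int:
    """Return (4^n - 1) / (4^m - 1), an integer whenever m divides n.

    In base 4 this number is a 1 followed by m-1 zeros, repeated n/m times,
    so multiplying it by any m-digit value tiles that value across the window.
    """
    if window_len % motif_len != 0:
        raise ValueError("motif length must divide window length")
    return (4 ** window_len - 1) // (4 ** motif_len - 1)


def is_perfect_repeat(seq: str, motif_len: int) -> bool:
    """True if seq is an exact tandem repeat of its first motif_len bases."""
    n = len(seq)
    if n == 0 or n % motif_len != 0:
        return False
    return encode_2bit(seq) % repeat_divisor(n, motif_len) == 0


if __name__ == "__main__":
    print(is_perfect_repeat("ACGACGACG", 3))  # exact repeat of ACG
    print(is_perfect_repeat("ACGACGACT", 3))  # last base breaks the repeat
```

Because the encoded value is always smaller than 4^n, divisibility here is both necessary and sufficient for an exact repeat, which is why a single modulo operation can stand in for a character-by-character comparison in this sketch.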

2005 ◽  
Vol 04 (03) ◽  
pp. 287-294
Author(s):  
SIMA S. ZEIN ◽  
ALEXANDRE A. VETCHER ◽  
STEPHEN D. LEVENE

Recent data show that assembly of repetitive-sequence, single-stranded DNA molecules (ssDNA) and carbon nanotubes (CNTs) depends on the specific sequence repeat. Therefore, it is of practical interest to assess various methods for generating single-stranded DNA molecules that contain repetitive sequences. Existing automated synthesis procedures for generating long (> 100 nt) ssDNA molecules generate ssDNA products of variable purity and yield. An alternative to automated synthesis is the polymerase chain reaction (PCR), which provides a powerful tool for the amplification of minute amounts of specific DNA sequences. Here we show that a modified asymmetric PCR method allows synthesis of long ssDNAs composed of tandem repeats of the repetitive vertebrate telomeric sequence (TTAGGG)n, and is also applicable to arbitrary (repetitive or nonrepetitive) DNA. Long, repetitive deoxynucleotides produced by automated synthesis are surprisingly heterogeneous with respect to both length and sequence. Benefits of the method described here are that long, repetitive ssDNA sequences are generated with high sequence fidelity and yield.


2021 ◽  
Vol 12 ◽  
Author(s):  
Rahman Ebrahimzadegan ◽  
Fatemeh Orooji ◽  
Pengtao Ma ◽  
Ghader Mirzaghaderi

Genomic repetitive sequences commonly show species-specific sequence type, abundance, and distribution patterns; however, their intraspecific characteristics have been poorly described. We quantified the genomic repetitive sequences and performed single nucleotide polymorphism (SNP) analysis between 29 Ae. tauschii genotypes and subspecies using publicly available raw genomic Illumina sequence reads and used fluorescence in situ hybridization (FISH) to experimentally analyze some repeats. The majority of the identified repetitive sequences had similar contents and proportions between anathera, meyeri, and strangulata subspecies. However, two Ty3/gypsy retrotransposons (CL62 and CL87) showed significantly higher abundances, and CL1, CL119, CL213, CL217 tandem repeats, and CL142 retrotransposon (Ty1/copia type) showed significantly lower abundances in subspecies strangulata compared with the subspecies anathera and meyeri. One tandem repeat and 45S ribosomal DNA (45S rDNA) abundances showed a high variation between genotypes but their abundances were not subspecies specific. Phylogenetic analysis using the repeat abundances of the aforementioned clusters placed the strangulata subsp. in a distinct clade but could not discriminate anathera and meyeri. A near complete differentiation of anathera and strangulata subspecies was observed using SNP analysis; however, var. meyeri showed higher genetic diversity. FISH using major tandem repeats could not detect differences between subspecies, although (GAA)10 signal patterns generated two different karyotype groups. Taken together, the different classes of repetitive DNA sequences have differentially accumulated between strangulata and the other two subspecies of Ae. tauschii, which is generally in agreement with spike morphology, implying that factors affecting repeatome evolution are variable even among very closely related lineages.


2021 ◽  
Vol 22 (9) ◽  
pp. 4309
Author(s):  
Jitendra Thakur ◽  
Jenika Packiaraj ◽  
Steven Henikoff

Satellite DNA consists of abundant tandem repeats that play important roles in cellular processes, including chromosome segregation, genome organization and chromosome end protection. Most satellite DNA repeat units are either of nucleosomal length or 5–10 bp long and occupy centromeric, pericentromeric or telomeric regions. Due to high repetitiveness, satellite DNA sequences have largely been absent from genome assemblies. Although few conserved satellite-specific sequence motifs have been identified, DNA curvature, dyad symmetries and inverted repeats are features of various satellite DNAs in several organisms. Satellite DNA sequences are either embedded in highly compact gene-poor heterochromatin or specialized chromatin that is distinct from euchromatin. Nevertheless, some satellite DNAs are transcribed into non-coding RNAs that may play important roles in satellite DNA function. Intriguingly, satellite DNAs are among the most rapidly evolving genomic elements, such that a large fraction is species-specific in most organisms. Here we describe the different classes of satellite DNA sequences, their satellite-specific chromatin features, and how these features may contribute to satellite DNA biology and evolution. We also discuss how the evolution of functional satellite DNA classes may contribute to speciation in plants and animals.


Author(s):  
Yanrong Ji ◽  
Zhihan Zhou ◽  
Han Liu ◽  
Ramana V Davuluri

Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned to many other sequence analysis tasks. Availability and implementation: The source code, pretrained and finetuned model for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information: Supplementary data are available at Bioinformatics online.


Genome ◽  
1998 ◽  
Vol 41 (2) ◽  
pp. 148-153 ◽  
Author(s):  
Monique Abadon ◽  
Eric Grenier ◽  
Christian Laumond ◽  
Pierre Abad

An AluI satellite DNA family has been cloned from the entomopathogenic nematode Heterorhabditis indicus. This repeated sequence appears to be an unusually abundant satellite DNA, since it constitutes about 45% of the H. indicus genome. The consensus sequence is 174 nucleotides long and has an A + T content of 56%, with the presence of direct and inverted repeat clusters. DNA sequence data reveal that monomers are quite homogeneous. Such homogeneity suggests that some mechanism is acting to maintain the homogeneity of this satellite DNA, despite its abundance, or that this repeated sequence could have appeared recently in the genome of H. indicus. Hybridization analysis of genomic DNAs from different Heterorhabditis species shows that this satellite DNA sequence is specific to the H. indicus genome. Considering the species specificity and the high copy number of this AluI satellite DNA sequence, it could provide a rapid and powerful tool for identifying H. indicus strains. Key words: AluI repeated DNA, tandem repeats, species-specific sequence, nucleotide sequence analysis.


Genetics ◽  
2002 ◽  
Vol 162 (3) ◽  
pp. 1435-1444 ◽  
Author(s):  
Robert M Stupar ◽  
Junqi Song ◽  
Ahmet L Tek ◽  
Zhukuan Cheng ◽  
Fenggao Dong ◽  
...  

The heterochromatin in eukaryotic genomes represents gene-poor regions and contains highly repetitive DNA sequences. The origin and evolution of DNA sequences in the heterochromatic regions are poorly understood. Here we report a unique class of pericentromeric heterochromatin consisting of DNA sequences highly homologous to the intergenic spacer (IGS) of the 18S•25S ribosomal RNA genes in potato. A 5.9-kb tandem repeat, named 2D8, was isolated from a diploid potato species Solanum bulbocastanum. Sequence analysis indicates that the 2D8 repeat is related to the IGS of potato rDNA. This repeat is associated with highly condensed pericentromeric heterochromatin at several hemizygous loci. The 2D8 repeat is highly variable in structure and copy number throughout the Solanum genus, suggesting that it is evolutionarily dynamic. Additional IGS-related repetitive DNA elements were also identified in the potato genome. The possible mechanism of the origin and evolution of the IGS-related repeats is discussed. We demonstrate that potato serves as an interesting model for studying repetitive DNA families because it is propagated vegetatively, thus minimizing the meiotic mechanisms that can remove novel DNA repeats.


2015 ◽  
Author(s):  
Javier Estrada ◽  
Teresa Ruiz-Herrero ◽  
Clarissa Scholes ◽  
Zeba Wunderlich ◽  
Angela DePace

DNA-binding proteins control many fundamental biological processes such as transcription, recombination and replication. A major goal is to decipher the role that DNA sequence plays in orchestrating the binding and activity of such regulatory proteins. To address this goal, it is useful to rationally design DNA sequences with desired numbers, affinities and arrangements of protein binding sites. However, removing binding sites from DNA is computationally non-trivial since one risks creating new sites in the process of deleting or moving others. Here we present an online binding site removal tool, SiteOut, that enables users to design arbitrary DNA sequences that entirely lack binding sites for factors of interest. SiteOut can also be used to delete sites from a specific sequence, or to introduce site-free spacers between functional sequences without creating new sites at the junctions. In combination with commercial DNA synthesis services, SiteOut provides a powerful and flexible platform for synthetic projects that interrogate regulatory DNA. Here we describe the algorithm and illustrate the ways in which SiteOut can be used; it is publicly available at https://depace.med.harvard.edu/siteout/


2021 ◽  
Author(s):  
Brian P. Anton ◽  
Alexey Fomenkov ◽  
Victoria Wu ◽  
Richard J. Roberts

Single-molecule Real-Time (SMRT) sequencing can easily identify sites of N6-methyladenine and N4-methylcytosine within DNA sequences, but similar identification of 5-methylcytosine sites is not as straightforward. In prokaryotic DNA, methylation typically occurs within specific sequence contexts, or motifs, that are a property of the methyltransferases that “write” these epigenetic marks. We present here a straightforward, cost-effective alternative to both SMRT and bisulfite sequencing for the determination of prokaryotic 5-methylcytosine methylation motifs. The method, called MFRE-Seq, relies on excision and isolation of fully methylated fragments of predictable size using MspJI-Family Restriction Enzymes (MFREs), which depend on the presence of 5-methylcytosine for cleavage. We demonstrate that MFRE-Seq is compatible with both Illumina and Ion Torrent sequencing platforms and requires only a digestion step and simple column purification of size-selected digest fragments prior to standard library preparation procedures. We applied MFRE-Seq to numerous bacterial and archaeal genomic DNA preparations and successfully confirmed known motifs and identified novel ones. This method should be a useful complement to existing methodologies for studying prokaryotic methylomes and characterizing the contributing methyltransferases.


PLoS ONE ◽  
2021 ◽  
Vol 16 (5) ◽  
pp. e0247541
Author(s):  
Brian P. Anton ◽  
Alexey Fomenkov ◽  
Victoria Wu ◽  
Richard J. Roberts

Single-molecule Real-Time (SMRT) sequencing can easily identify sites of N6-methyladenine and N4-methylcytosine within DNA sequences, but similar identification of 5-methylcytosine sites is not as straightforward. In prokaryotic DNA, methylation typically occurs within specific sequence contexts, or motifs, that are a property of the methyltransferases that “write” these epigenetic marks. We present here a straightforward, cost-effective alternative to both SMRT and bisulfite sequencing for the determination of prokaryotic 5-methylcytosine methylation motifs. The method, called MFRE-Seq, relies on excision and isolation of fully methylated fragments of predictable size using MspJI-Family Restriction Enzymes (MFREs), which depend on the presence of 5-methylcytosine for cleavage. We demonstrate that MFRE-Seq is compatible with both Illumina and Ion Torrent sequencing platforms and requires only a digestion step and simple column purification of size-selected digest fragments prior to standard library preparation procedures. We applied MFRE-Seq to numerous bacterial and archaeal genomic DNA preparations and successfully confirmed known motifs and identified novel ones. This method should be a useful complement to existing methodologies for studying prokaryotic methylomes and characterizing the contributing methyltransferases.


Genome ◽  
2001 ◽  
Vol 44 (4) ◽  
pp. 716-728 ◽  
Author(s):  
Pavel Neumann ◽  
Marcela Nouzová ◽  
Jirí Macas

A set of pea DNA sequences representing the most abundant genomic repeats was obtained by combining several approaches. Dispersed repeats were isolated by screening a short-insert genomic library using genomic DNA as a probe. Thirty-two clones ranging from 149 to 2961 bp in size and from 1000 to 39 000/1C in their copy number were sequenced and further characterized. Fourteen clones were identified as retrotransposon-like sequences, based on their homologies to known elements. Fluorescence in situ hybridization using clones of reverse transcriptase and integrase coding sequences as probes revealed that corresponding retroelements were scattered along all pea chromosomes. Two novel families of tandem repeats, named PisTR-A and PisTR-B, were isolated by screening a genomic DNA library with Cot-1 DNA and by employing genomic self-priming PCR, respectively. PisTR-A repeats are 211–212 bp long, their abundance is 2 × 10^4 copies/1C, and they are partially clustered in a secondary constriction of one chromosome pair with the rest of their copies dispersed on all chromosomes. PisTR-B sequences are of similar abundance (10^4 copies/1C) but differ from the "A" family in their monomer length (50 bp), high A/T content, and chromosomal localization in a limited number of discrete bands. These bands are located mainly in (sub)telomeric and pericentromeric regions, and their patterns, together with chromosome morphology, allow discrimination of all chromosome types within the pea karyotype. Whereas both tandem repeat families are mostly specific to the genus Pisum, many of the dispersed repeats were detected in other legume species, mainly those in the genus Vicia. Key words: repetitive DNA, plant genome, retroelements, satellite DNA, Pisum sativum.

