repeat masking
Recently Published Documents


Total documents: 8 (five years: 5)
H-index: 3 (five years: 0)

Genes ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 9
Author(s):  
Mikhail Biryukov ◽  
Kirill Ustyantsev

Retrotransposons comprise a substantial fraction of eukaryotic genomes, reaching the highest proportions in plants. Therefore, identification and annotation of retrotransposons is an important task in studying the regulation and evolution of plant genomes. The majority of computational tools for mining transposable elements (TEs) are designed for subsequent genome repeat masking, often leaving aside the element lineage classification and its protein domain composition. Additionally, studies focused on the diversity and evolution of a particular group of retrotransposons often require substantial customization efforts from researchers to adapt existing software to their needs. Here, we developed DARTS (Domain-Associated Retrotransposon Search), a computational pipeline to mine sequences of protein-coding retrotransposons based on the sequences of their conserved protein domains. Using the most abundant group of TEs in plants, long terminal repeat (LTR) retrotransposons (LTR-RTs), we show that DARTS has radically higher sensitivity for LTR-RT identification compared to the widely accepted tool LTRharvest. DARTS can be easily customized for specific user needs. As a result, DARTS returns a set of structurally annotated nucleotide and amino acid sequences which can be readily used in subsequent comparative and phylogenetic analyses. DARTS may assist researchers interested in the discovery and detailed analysis of the diversity and evolution of retrotransposons, LTR-RTs, and other protein-coding TEs.


2021 ◽  
Author(s):  
Mikhail Biryukov ◽  
Kirill Ustyantsev

Retrotransposons comprise a substantial fraction of eukaryotic genomes, reaching the highest proportions in plants. Therefore, identification and annotation of retrotransposons is an important task in studying the regulation and evolution of plant genomes. The majority of computational tools for mining transposable elements (TEs) are designed for subsequent genome repeat masking, often leaving aside the element lineage classification and its protein domain composition. Additionally, studies focused on the diversity and evolution of a particular group of retrotransposons often require substantial customization efforts from researchers to adapt existing software to their needs. Here, we developed DARTS, a computational pipeline to mine sequences of protein-coding retrotransposons based on the sequences of their conserved protein domains. Using the most abundant group of TEs in plants, long terminal repeat (LTR) retrotransposons (LTR-RTs), we show that DARTS has radically higher sensitivity for LTR-RT identification compared to the widely accepted LTRharvest tool. DARTS can be easily customized for specific user needs. As a result, DARTS returns a set of structurally annotated nucleotide and amino acid sequences which can be readily used in subsequent comparative and phylogenetic analyses. DARTS should assist researchers interested in the discovery and detailed analysis of the diversity and evolution of retrotransposons, LTR-RTs, and other protein-coding TEs.


2021 ◽  
Author(s):  
Yaoyao Wu ◽  
Lynn Johnson ◽  
Baoxing Song ◽  
Cinta Romay ◽  
Michelle Stitzer ◽  
...  

Alignments of multiple genomes are a cornerstone of comparative genomics, but generating these alignments remains technically challenging and often impractical. We developed the msa_pipeline workflow (https://bitbucket.org/bucklerlab/msa_pipeline) based on the LAST aligner to allow practical and sensitive multiple alignment of diverged plant genomes with minimal user input. Our workflow only requires a set of genomes in FASTA format as input. The workflow outputs multiple alignments in MAF format and includes utilities to help calculate genome-wide conservation scores. As high repeat content and genomic divergence are substantial challenges in plant genome alignment, we also explored the impact of different masking approaches and alignment parameters using genome assemblies of 33 grass species. Compared to conventional masking with RepeatMasker, a k-mer masking approach increased the alignment rate of CDS and non-coding functional regions by 25% and 14%, respectively. We further found that default alignment parameters generally perform well, but parameter tuning can increase the alignment rate for non-coding functional regions by over 52% compared to default LAST settings. Finally, by increasing alignment sensitivity from the default baseline, parameter tuning can increase the number of non-coding sites that can be scored for conservation by over 76%.


2021 ◽  
Author(s):  
Bruno Contreras-Moreira ◽  
Carla V Filippi ◽  
Guy Naamati ◽  
Carlos García Girón ◽  
James E Allen ◽  
...  

The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis or pangenome exploration. While homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here we benchmark a two-step approach, where repeats are first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, using the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat and nrTEplants, curated for this work). We obtained repeated genome fractions that match those reported in the literature, but with shorter repeated elements than those produced with conventional annotators. Inspection of masked regions overlapping genes revealed no preference for specific protein domains. Half of the Red-masked sequences can be successfully classified with nrTEplants, with the complete protocol taking less than 2 h on a desktop Linux box. The repeat library and the scripts to mask and annotate plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
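
The first step described above, calling repeats by k-mer counting, can be illustrated with a toy sketch. This is not Red's algorithm, which is not reproduced here; the function names `kmer_soft_mask` and `masked_fraction`, the k-mer size `k`, and the threshold `min_count` are all hypothetical choices made for illustration. The sketch simply lowercases (soft-masks) every base covered by a k-mer that occurs at least `min_count` times in the sequence.

```python
from collections import Counter


def kmer_soft_mask(seq: str, k: int = 4, min_count: int = 3) -> str:
    """Soft-mask (lowercase) bases covered by any k-mer seen >= min_count times.

    Toy illustration only: k and min_count are arbitrary assumptions,
    and real tools use far larger k and more elaborate scoring.
    """
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return seq  # sequence shorter than k: nothing to count

    # Count every overlapping k-mer in the sequence.
    counts = Counter(seq[i:i + k] for i in range(n_kmers))

    # Mark all positions covered by a frequent k-mer.
    masked = [False] * len(seq)
    for i in range(n_kmers):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True

    return "".join(b.lower() if m else b for b, m in zip(seq, masked))


def masked_fraction(masked_seq: str) -> float:
    """Fraction of bases that are soft-masked (lowercase)."""
    if not masked_seq:
        return 0.0
    return sum(c.islower() for c in masked_seq) / len(masked_seq)


if __name__ == "__main__":
    toy = "ACACACACACGTTCAGGCTA"  # a dinucleotide repeat followed by unique sequence
    result = kmer_soft_mask(toy, k=4, min_count=3)
    print(result, masked_fraction(result))
```

In a two-step protocol of the kind benchmarked here, the masked intervals produced by such a counting step would then be compared against a curated repeat library for annotation; that second step is outside the scope of this sketch.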


Genes ◽  
2020 ◽  
Vol 12 (1) ◽  
pp. 48
Author(s):  
Monika Cechova

Ever since the introduction of high-throughput sequencing following the human genome project, assembling short reads into a reference of sufficient quality has posed a significant problem, as a large portion of the human genome (an estimated 50–69%) is repetitive. As a result, a sizable proportion of sequencing reads is multi-mapping, i.e., without a unique placement in the genome. The two key parameters for whether or not a read is multi-mapping are the read length and genome complexity. Long reads are now able to span difficult, heterochromatic regions, including full centromeres, and characterize chromosomes from “telomere to telomere”. Moreover, identical reads or repeat arrays can be differentiated based on their epigenetic marks, such as methylation patterns, aiding in the assembly process. This is despite the fact that long reads still contain a modest percentage of sequencing errors, which impair both the accuracy and the speed of aligners and assemblers. Here, I review the proposed and implemented solutions to the repeat resolution and the multi-mapping read problem, as well as the downstream consequences of reference choice, repeat masking, and proper representation of sex chromosomes. I also consider the forthcoming challenges and solutions with regard to long reads, where we expect the shift from the problem of repeat localization within a single individual to the problem of repeat positioning within pangenomes.
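
The dependence of multi-mapping on read length can be made concrete with a small sketch. It assumes error-free reads sampled from every position of a single strand of a toy genome, which is a strong simplification of real mapping; the function name `multimapping_fraction` and the toy sequence are hypothetical. A read counts as multi-mapping when its exact sequence occurs more than once in the genome.

```python
from collections import Counter


def multimapping_fraction(genome: str, read_len: int) -> float:
    """Fraction of error-free reads of length read_len that occur more than once.

    Simplifying assumptions: one read per start position, forward strand only,
    exact matching, no sequencing errors.
    """
    n_reads = len(genome) - read_len + 1
    if read_len <= 0 or n_reads <= 0:
        raise ValueError("read_len must be between 1 and len(genome)")

    # Count how often each read sequence occurs across all start positions.
    counts = Counter(genome[i:i + read_len] for i in range(n_reads))

    # A read is multi-mapping if its sequence is found at more than one position.
    multi = sum(c for c in counts.values() if c > 1)
    return multi / n_reads


if __name__ == "__main__":
    repeat = "ACGGTCAT"  # hypothetical 8 bp repeat present in two copies
    toy_genome = "TTTGCA" + repeat + "GAGCCT" + repeat + "CTAAGG"
    for length in (4, 8, 12, len(toy_genome)):
        print(length, round(multimapping_fraction(toy_genome, length), 3))
```

Reads no longer than the repeat can fall entirely inside a copy and so have no unique placement, whereas a read spanning the whole toy genome is necessarily unique; genome complexity enters through how many such repeated stretches the sequence contains.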


Nature Plants ◽  
2018 ◽  
Vol 4 (10) ◽  
pp. 762-765 ◽  
Author(s):  
Philipp E. Bayer ◽  
David Edwards ◽  
Jacqueline Batley

Genome ◽  
2013 ◽  
Vol 56 (12) ◽  
pp. 729-735 ◽  
Author(s):  
Marco Ricci ◽  
Andrea Luchetti ◽  
Livia Bonandin ◽  
Barbara Mantovani

The repetitive DNA content of the stick insect species Bacillus rossius (facultative parthenogenetic), Bacillus grandii (gonochoric), and Bacillus atticus (obligate parthenogenetic) was analyzed through the survey of random genomic libraries roughly corresponding to 0.006% of the genome. By repeat masking, 19 families of transposable elements were identified (two LTR and six non-LTR retrotransposons; 11 DNA transposons). Moreover, a de novo analysis revealed, among the three libraries, the first MITE family observed in polyneopteran genomes. On the whole, transposable element abundance represented 23.3% of the genome in B. rossius, 22.9% in B. atticus, and 18% in B. grandii. Tandem repeat content in the three libraries was much lower: 1.32%, 0.64%, and 1.86% in B. rossius, B. grandii, and B. atticus, respectively. Microsatellites were the most abundant tandem repeats in all species. Minisatellites were only found in B. rossius and B. atticus, and five monomers belonging to the Bag320 satellite family were detected in B. atticus. Assuming the survey provides adequate representation of the respective genomes, the obligate parthenogenetic species (B. atticus), compared with the other two species analyzed, does not show the lower transposable element content that some theoretical and empirical studies would predict.

