RepeatModeler2: automated genomic discovery of transposable element families

10.1101/856591 ◽

2019 ◽

Cited By ~ 12

Author(s):

Jullien M. Flynn ◽

Robert Hubley ◽

Clément Goubert ◽

Jeb Rosen ◽

Andrew G. Clark ◽

...

Keyword(s):

Transposable Elements ◽

De Novo ◽

False Positive Rate ◽

Fruit Fly ◽

Sequence Coverage ◽

Genome Sequences ◽

Model Species ◽

Link Type ◽

Eukaryotic Species ◽

Ltr Retroelements

Abstract: The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a new pipeline that greatly facilitates this process. This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately three times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. The program had an extremely low false positive rate when applied to simulated genomes devoid of TEs. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, https://github.com/Dfam-consortium/TETools).

Significance: Genome sequences are being produced for more and more eukaryotic species. The bulk of these genomes is composed of parasitic, self-mobilizing transposable elements (TEs) that play important roles in organismal evolution. Thus there is a pressing need for developing software that can accurately identify the diverse set of TEs dispersed in genome sequences. Here we introduce RepeatModeler2, an easy-to-use package for the curation of reference TE libraries which can be applied to any eukaryotic species. Through several major improvements over the previous version, RepeatModeler2 is able to produce libraries that recapitulate the known composition of three model species with some of the most complex TE landscapes. Thus RepeatModeler2 will greatly enhance the discovery and annotation of TEs in genome sequences.

RepeatModeler2 for automated genomic discovery of transposable element families

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1921046117 ◽

2020 ◽

Vol 117 (17) ◽

pp. 9451-9457 ◽

Cited By ~ 26

Author(s):

Jullien M. Flynn ◽

Robert Hubley ◽

Clément Goubert ◽

Jeb Rosen ◽

Andrew G. Clark ◽

...

Keyword(s):

De Novo ◽

Fruit Fly ◽

Automated Identification ◽

Sequence Coverage ◽

Model Species ◽

Consensus Sequences ◽

Sequence Complexity ◽

Link Type ◽

Eukaryotic Genomes ◽

Ltr Retroelements

The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).
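The benchmark criterion quoted in the abstract (a consensus sequence counts as matching a curated family when both sequence identity and sequence coverage exceed 95%) can be illustrated with a small sketch. This is not code from RepeatModeler2: the `Alignment` record, its field names, and the family labels are hypothetical, and the sketch assumes the alignment statistics have already been computed by some other tool.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One consensus-vs-curated-library alignment (hypothetical record)."""
    family: str      # curated family the consensus aligned to (made-up labels below)
    identity: float  # percent sequence identity, 0-100
    coverage: float  # percent of the curated sequence covered, 0-100


def is_close_match(aln, min_identity=95.0, min_coverage=95.0):
    """True when identity AND coverage both exceed their thresholds."""
    return aln.identity > min_identity and aln.coverage > min_coverage


def matched_families(alignments, min_identity=95.0, min_coverage=95.0):
    """Return the set of curated families recovered by at least one close match."""
    return {
        aln.family
        for aln in alignments
        if is_close_match(aln, min_identity, min_coverage)
    }


# Illustrative values only; none of these numbers come from the paper.
example = [
    Alignment("famA", identity=98.2, coverage=99.0),   # passes both thresholds
    Alignment("famB", identity=96.0, coverage=80.0),   # fails on coverage
    Alignment("famC", identity=95.0, coverage=100.0),  # fails: identity not > 95
]
recovered = matched_families(example)
```

Counting families as a set, rather than counting alignments, avoids crediting a pipeline twice when several consensus sequences hit the same curated family.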

Mosquito genomes are frequently invaded by transposable elements through horizontal transfer

PLoS Genetics ◽

10.1371/journal.pgen.1008946 ◽

2020 ◽

Vol 16 (11) ◽

pp. e1008946

Author(s):

Elverson Soares de Melo ◽

Gabriel Luz Wallau

Keyword(s):

Transposable Elements ◽

Horizontal Transfer ◽

De Novo ◽

Mosquito Species ◽

Wuchereria Bancrofti ◽

Model Organisms ◽

Eukaryotic Species ◽

Horizontal Spread ◽

Horizontal Transfers

Transposable elements (TEs) are mobile genetic elements that parasitize the genomes of virtually all eukaryotic species. Due to their complexity, an in-depth TE characterization is only available for a handful of model organisms. In the present study, we performed a de novo and homology-based characterization of TEs in the genomes of 24 mosquito species and investigated their mode of inheritance. More than 40% of the genome of Aedes aegypti, Aedes albopictus, and Culex quinquefasciatus is composed of TEs, whereas the TE content varied substantially among Anopheles species (0.13%–19.55%). Class I TEs are the most abundant among mosquitoes and at least 24 TE superfamilies were found. Interestingly, TEs have been extensively exchanged by horizontal transfer (172 TE families of 16 different superfamilies) among mosquitoes in the last 30 million years. Horizontally transferred TEs represent around 7% of the genome in Aedes species and a small fraction in Anopheles genomes. Most of these horizontally transferred TEs are from the three ubiquitous LTR superfamilies: Gypsy, Bel-Pao and Copia. Searching more than 32,000 genomes, we also uncovered transfers between mosquitoes and two different phyla (Cnidaria and Nematoda) and two subphyla (Chelicerata and Crustacea), identifying a vector, the worm Wuchereria bancrofti, that enabled the horizontal spread of a Tc1-mariner element among various Anopheles species. These data also allowed us to reconstruct the horizontal transfer network of this TE involving more than 40 species. In summary, our results suggest that TEs are frequently exchanged by horizontal transfers among mosquitoes, influencing mosquito genome size and variability.

Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline

10.1101/657890 ◽

2019 ◽

Cited By ~ 7

Author(s):

Shujun Ou ◽

Weijia Su ◽

Yi Liao ◽

Kapeel Chougule ◽

Doreen Ware ◽

...

Keyword(s):

Transposable Elements ◽

Transposable Element ◽

Open Source ◽

Performance Metrics ◽

De Novo ◽

Relative Performance ◽

Sequencing Technology ◽

High Quality ◽

Link Type ◽

Assembly Algorithms

Abstract: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and allow for annotation of TEs. There are numerous methods for each class of elements with unknown relative performance metrics. We benchmarked existing programs based on a curated library of rice TEs. Using the most robust programs, we created a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a condensed TE library for annotations of structurally intact and fragmented elements. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.

Mapping-free variant calling using haplotype reconstruction from k-mer frequencies

10.1101/153619 ◽

2017 ◽

Cited By ~ 1

Author(s):

Peter Audano ◽

Shashidhar Ravishankar ◽

Fredrik Vannberg

Keyword(s):

De Novo ◽

Hybrid Methods ◽

False Positive Rate ◽

Variant Calling ◽

Genomic Variation ◽

De Bruijn Graphs ◽

Link Type ◽

Positive Rate ◽

Large Indels ◽

The Cost

Abstract

Motivation: The standard protocol for detecting variation in DNA is to map millions of short sequence reads to a known reference and find loci that differ. While this approach works well, it cannot be applied where the sample contains dense variants or is too distant from known references. De novo assembly or hybrid methods can recover genomic variation, but the cost of computation is often much higher. We developed a novel k-mer algorithm and software implementation, Kestrel, capable of characterizing densely-packed SNPs and large indels without mapping, assembly, or de Bruijn graphs.

Results: When applied to mosaic penicillin binding protein (PBP) genes in Streptococcus pneumoniae, we found near perfect concordance with assembled contigs at a fraction of the CPU time. Multilocus sequence typing (MLST) with this approach was able to bypass de novo assemblies. Kestrel has a very low false-positive rate when calling variants over the whole genome, but limitations of a purely k-mer based approach affect sensitivity.

Availability: Source code and documentation for a Java implementation of Kestrel can be found at https://github.com/paudano/kestrel. All test code for this publication is located at https://github.com/paudano/[…].
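For readers unfamiliar with the term, the "k-mer frequencies" that the abstract builds on are counts of every length-k substring observed in the sequence reads. The sketch below shows only that counting step; it is not Kestrel's algorithm (Kestrel is a Java implementation, and its haplotype reconstruction is not described in enough detail here to reproduce), and the function name and toy reads are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across a collection of reads.

    Illustrative helper, not part of Kestrel. Reads shorter than k
    contribute nothing because the inner range is empty.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


# Toy reads, invented for the example.
toy_counts = count_kmers(["ACGTAC", "CGTA"], k=3)
```

A k-mer that occurs in a reference sequence but is rare or absent in such a table hints that the sample differs from the reference at that position, which is the kind of signal a mapping-free method can exploit.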

MobiSeq: De novo SNP discovery in model and non‐model species through sequencing the flanking region of transposable elements

Molecular Ecology Resources ◽

10.1111/1755-0998.12984 ◽

2019 ◽

Vol 19 (2) ◽

pp. 512-525 ◽

Cited By ~ 1

Author(s):

Alba Rey‐Iglesia ◽

Shyam Gopalakrishnan ◽

Christian Carøe ◽

David E. Alquezar‐Planas ◽

Anne Ahlmann Nielsen ◽

...

Keyword(s):

Transposable Elements ◽

De Novo ◽

Snp Discovery ◽

Model Species ◽

Flanking Region

ZGA: a flexible pipeline for read processing, de novo assembly and annotation of prokaryotic genomes

10.1101/2021.04.27.441618 ◽

2021 ◽

Author(s):

A.A. Korzhenkov

Keyword(s):

Genome Sequencing ◽

De Novo ◽

Wide Spectrum ◽

Source Code ◽

Routine Method ◽

Genome Sequences ◽

Bioinformatic Pipeline ◽

Internet Connection ◽

Link Type ◽

Prokaryotic Genomes

Abstract: Whole genome sequencing (WGS) has become a routine method and may be applied to study a wide spectrum of scientific problems. Despite the increasing availability of genome sequencing itself, genome assembly and annotation can be a challenge for an inexperienced researcher. To solve this problem, a bioinformatic pipeline was developed to guide a user from raw sequencing reads to an annotated bacterial or archaeal genome ready for deposition to any INSDC database, such as NCBI, ENA or DDBJ. The pipeline is fully automated and does not require an internet connection after installation, which prevents data leakage and premature publication of genome sequences. The source code of the pipeline is freely available at https://github.com/laxeye/zga/. The software may be installed from popular repositories: Anaconda Cloud (https://anaconda.org/bioconda/zga/) and PyPI (https://pypi.org/project/zga/).

Phylogeny of actin and tubulin gene homologs in diverse eukaryotic species

Indian Journal of Genetics and Plant Breeding (The) ◽

10.31742/ijgpb.79s.1.20 ◽

2019 ◽

Vol 79 (01S) ◽

Author(s):

Pawan Kumar Jayaswal ◽

Asheesh Shanker ◽

Nagendra Kumar Singh

Keyword(s):

Evolutionary Relationship ◽

Genomic Data ◽

Gene Clusters ◽

Species Trees ◽

Model Species ◽

Tubulin Gene ◽

Tubulin Genes ◽

Eukaryotic Species ◽

Relationship Of ◽

Insight Into

Actin and tubulin are cytoskeleton proteins, which are important components of the cell and are conserved across species. Despite their crucial significance in cell motility and cell division, the distribution and phylogeny of actin and tubulin genes across taxa are poorly understood. Here we used publicly available genomic data of 49 model species of plants, animals, fungi and Protista to further understand the distribution of these genes among diverse eukaryotic species, using rice as reference. The highest numbers of rice actin and tubulin gene homologs were present in plants, followed by animals, fungi and Protista species, whereas ten actin and nine tubulin genes were conserved in all 49 species. Phylogenetic analysis of 19 actin and 18 tubulin genes clustered them into four major groups each. One each of the actin and tubulin gene clusters was conserved across eukaryotic species. Species trees based on the conserved actin and tubulin genes showed the evolutionary relationships of 49 different taxa clustered into plants, animals, fungi and Protista. This study provides a phylogenetic insight into the evolution of actin and tubulin genes in diverse eukaryotic species.

De novo whole-genome assembly in Chrysanthemum seticuspe, a model species of Chrysanthemums, and its application to genetic and gene discovery analysis

DNA Research ◽

10.1093/dnares/dsy048 ◽

2019 ◽

Vol 26 (3) ◽

pp. 195-203 ◽

Cited By ~ 19

Author(s):

Hideki Hirakawa ◽

Katsuhiko Sumitomo ◽

Tamotsu Hisamatsu ◽

Soichiro Nagano ◽

Kenta Shirasawa ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Gene Discovery ◽

Whole Genome ◽

Model Species

Integrating genomics into the taxonomy and systematics of the Bacteria and Archaea

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.054171-0 ◽

2014 ◽

Vol 64 (Pt_2) ◽

pp. 316-324 ◽

Cited By ~ 258

Author(s):

Jongsik Chun ◽

Fred A. Rainey

Keyword(s):

Genomic Sequence ◽

Sequence Data ◽

Original Research ◽

Rrna Gene ◽

New Taxon ◽

Genome Sequences ◽

Microbial World ◽

Content Type ◽

Link Type ◽

Type Strains

The polyphasic approach used today in the taxonomy and systematics of the Bacteria and Archaea includes the use of phenotypic, chemotaxonomic and genotypic data. The use of 16S rRNA gene sequence data has revolutionized our understanding of the microbial world and led to a rapid increase in the number of descriptions of novel taxa, especially at the species level. It has allowed in many cases for the demarcation of taxa into distinct species, but its limitations in a number of groups have resulted in the continued use of DNA–DNA hybridization. As technology has improved, next-generation sequencing (NGS) has provided a rapid and cost-effective approach to obtaining whole-genome sequences of microbial strains. Although some 12 000 bacterial or archaeal genome sequences are available for comparison, only 1725 of these are of actual type strains, limiting the use of genomic data in comparative taxonomic studies when there are nearly 11 000 type strains. Efforts to obtain complete genome sequences of all type strains are critical to the future of microbial systematics. The incorporation of genomics into the taxonomy and systematics of the Bacteria and Archaea coupled with computational advances will boost the credibility of taxonomy in the genomic era. This special issue of International Journal of Systematic and Evolutionary Microbiology contains both original research and review articles covering the use of genomic sequence data in microbial taxonomy and systematics. It includes contributions on specific taxa as well as outlines of approaches for incorporating genomics into new strain isolation to new taxon description workflows.

Identification of α-enolase as a prognostic and diagnostic precancer biomarker in oral submucous fibrosis

Journal of Clinical Pathology ◽

10.1136/jclinpath-2017-204430 ◽

2017 ◽

Vol 71 (3) ◽

pp. 228-238 ◽

Cited By ~ 4

Author(s):

Swarnendu Bag ◽

Debabrata Dutta ◽

Amrita Chaudhary ◽

Bidhan Chandra Sing ◽

Mousumi Pal ◽

...

Keyword(s):

De Novo ◽

Pain Treatment ◽

Oral Submucous Fibrosis ◽

Peptide Sequencing ◽

Pcr Analysis ◽

Rt Pcr ◽

Sequence Coverage ◽

Protein Marker ◽

Peptide Mass ◽

Submucous Fibrosis

Aims: Diagnostic ambiguities regarding the malignant potentiality of oral submucous fibrosis (OSF), an oral precancerous condition having dysplastic and non-dysplastic isoforms, are a major obstacle to early intervention in oral squamous cell carcinoma (OSCC) patients. Our goal is to identify proteomic signatures from biopsies that can be used as precancer diagnostic markers for patients suffering from OSF.

Methods: High-throughput techniques adopting de novo peptide sequencing (1D SDS-PAGE coupled nanoLC MALDI tandem mass spectrometry (MS/MS)-based peptide mass fingerprint), immunohistochemistry (IHC), Western blot (WB) and real-time PCR (RT-PCR) analysis were used for biomarker identification and multilevel validation.

Results: Alpha-enolase is identified as an overexpressed protein in biopsies of oral submucous fibrosis with dysplasia (OSFWD) compared with oral submucous fibrosis without dysplasia (OSFWT) and normal oral mucosa (NOM). Total proteome analysis of an overexpressed protein band around 47 kDa of OSFWD identifies 334 peptides corresponding to 61 human proteins. Among them, α-enolase is identified as a prime protein with the highest number of peptides (44 out of 334 peptides) and sequence coverage (66.4%). Furthermore, RT-PCR, WB and IHC analysis also show mRNA and tissue level upregulation of α-enolase in OSFWD, validating α-enolase as a precancer marker.

Conclusions: This study for the first time identifies and validates α-enolase as a novel biomarker for early diagnosis of malignant potentiality of OSF. Hence, the identified protein marker, α-enolase, can help in early therapeutic intervention for OSF patients, leading to the reduction of patients' pain and treatment cost and the enhancement of patients' quality of life.