RepeatModeler2 for automated genomic discovery of transposable element families

The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).

RepeatModeler2: automated genomic discovery of transposable element families

10.1101/856591 ◽ 2019 ◽ Cited By ~ 12

Author(s): Jullien M. Flynn ◽ Robert Hubley ◽ Clément Goubert ◽ Jeb Rosen ◽ Andrew G. Clark ◽ ...

Keyword(s): Transposable Elements ◽ De Novo ◽ False Positive Rate ◽ Fruit Fly ◽ Sequence Coverage ◽ Genome Sequences ◽ Model Species ◽ Link Type ◽ Eukaryotic Species ◽ LTR Retroelements

Abstract: The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a new pipeline that greatly facilitates this process. This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately three times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. The program had an extremely low false positive rate when applied to simulated genomes devoid of TEs. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, https://github.com/Dfam-consortium/TETools).

Significance: Genome sequences are being produced for more and more eukaryotic species. The bulk of these genomes is composed of parasitic, self-mobilizing transposable elements (TEs) that play important roles in organismal evolution. Thus, there is a pressing need for developing software that can accurately identify the diverse set of TEs dispersed in genome sequences. Here we introduce RepeatModeler2, an easy-to-use package for the curation of reference TE libraries which can be applied to any eukaryotic species. Through several major improvements over the previous version, RepeatModeler2 is able to produce libraries that recapitulate the known composition of three model species with some of the most complex TE landscapes. Thus, RepeatModeler2 will greatly enhance the discovery and annotation of TEs in genome sequences.
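
The benchmark criterion in the abstract above (consensus sequences matching a curated sequence with >95% identity and >95% coverage) can be illustrated with a small sketch. This is not the authors' evaluation code: the `Hit` record, its field names, the example identifiers, and the choice to measure coverage relative to the curated sequence length are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment between a de novo consensus and a curated family (hypothetical record)."""
    consensus_id: str
    curated_id: str
    identity: float       # fraction of identical aligned positions, 0..1
    aligned_length: int   # curated positions covered by the alignment
    curated_length: int   # full length of the curated sequence


def coverage(hit):
    """Fraction of the curated sequence covered by the alignment (assumed definition)."""
    return hit.aligned_length / hit.curated_length


def matched_families(hits, min_identity=0.95, min_coverage=0.95):
    """Return curated families recovered above both thresholds."""
    return {
        h.curated_id
        for h in hits
        if h.identity > min_identity and coverage(h) > min_coverage
    }


# Made-up example hits, not data from the paper.
hits = [
    Hit("consensus_A", "curated_X", 0.98, 4900, 5000),  # passes both thresholds
    Hit("consensus_B", "curated_Y", 0.97, 3000, 5000),  # coverage too low
    Hit("consensus_C", "curated_Z", 0.90, 1000, 1000),  # identity too low
]
matched = matched_families(hits)
print(sorted(matched))
```

Counting the size of the returned set for two libraries built from the same genome gives the kind of "times more matching sequences" comparison the abstract reports.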

DeepTE: a computational method for de novo classification of transposons with convolutional neural network

10.1101/2020.01.27.921874 ◽ 2020

Author(s): Haidong Yan ◽ Aureliano Bombarely ◽ Song Li

Keyword(s): Neural Network ◽ Convolutional Neural Network ◽ De Novo ◽ Genomic Sequence ◽ Computational Method ◽ Model Species ◽ Essential Step ◽ Genomic Sequence Analysis ◽ Eukaryotic Genomes

Abstract

Motivation: Transposable element (TE) classification is an essential step in decoding the roles of TEs in genome evolution. With a large number of genomes from non-model species becoming available, accurate and efficient TE classification has emerged as a new challenge in genomic sequence analysis.

Results: We developed a novel tool, DeepTE, which classifies unknown TEs using convolutional neural networks. DeepTE transformed sequences into input vectors based on k-mer counts. A tree-structured classification process was used in which eight models were trained to classify TEs into superfamilies and orders. DeepTE also detected domains inside TEs to correct false classifications. An additional model was trained to distinguish between non-TEs and TEs in plants. Given unclassified TEs of different species, DeepTE can classify TEs into seven orders, which include 15, 24, and 16 superfamilies in plants, metazoans, and fungi, respectively. In several benchmarking tests, DeepTE outperformed other existing tools for TE classification. In conclusion, DeepTE successfully leverages convolutional neural networks for TE classification, and can be used to precisely identify and annotate TEs in newly sequenced eukaryotic genomes.

Availability: DeepTE is accessible at https://github.com/LiLabAtVT/DeepTE.

DeepTE: a computational method for de novo classification of transposons with convolutional neural network

Bioinformatics ◽ 10.1093/bioinformatics/btaa519 ◽ 2020 ◽ Vol 36 (15) ◽ pp. 4269-4275 ◽ Cited By ~ 3

Author(s): Haidong Yan ◽ Aureliano Bombarely ◽ Song Li

Keyword(s): De Novo ◽ Genomic Sequence ◽ Computational Method ◽ Supplementary Information ◽ Supplementary Data ◽ Model Species ◽ Essential Step ◽ Genomic Sequence Analysis ◽ Eukaryotic Genomes

Abstract

Motivation: Transposable element (TE) classification is an essential step in decoding the roles of TEs in genome evolution. With a large number of genomes from non-model species becoming available, accurate and efficient TE classification has emerged as a new challenge in genomic sequence analysis.

Results: We developed a novel tool, DeepTE, which classifies unknown TEs using convolutional neural networks (CNNs). DeepTE transformed sequences into input vectors based on k-mer counts. A tree-structured classification process was used in which eight models were trained to classify TEs into superfamilies and orders. DeepTE also detected domains inside TEs to correct false classifications. An additional model was trained to distinguish between non-TEs and TEs in plants. Given unclassified TEs of different species, DeepTE can classify TEs into seven orders, which include 15, 24 and 16 superfamilies in plants, metazoans and fungi, respectively. In several benchmarking tests, DeepTE outperformed other existing tools for TE classification. In conclusion, DeepTE successfully leverages CNNs for TE classification, and can be used to precisely classify TEs in newly sequenced eukaryotic genomes.

Availability and implementation: DeepTE is accessible at https://github.com/LiLabAtVT/DeepTE.

Supplementary information: Supplementary data are available at Bioinformatics online.
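
The k-mer count encoding mentioned in the abstract above can be sketched as follows. This is a generic illustration of turning a DNA sequence into a fixed-length count vector, not DeepTE's own code: the function name, the default value of `k`, and the decision to skip k-mers containing ambiguous bases are assumptions, since the abstract does not state these details.

```python
from itertools import product


def kmer_count_vector(sequence, k=3):
    """Count every DNA k-mer in `sequence` and return a vector of length 4**k.

    Positions follow the lexicographic order of k-mers over A, C, G, T.
    K-mers containing other characters (for example N) are skipped; that
    handling is an assumption of this sketch.
    """
    alphabet = "ACGT"
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            counts[position] += 1
    return counts


# Made-up toy sequence; the final 2-mer "GN" is skipped.
vec = kmer_count_vector("ACGTACGN", k=2)
print(len(vec), sum(vec))
```

Because the vector length depends only on `k`, sequences of any length map to inputs of the same shape, which is what a fixed-architecture classifier needs.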

De novo whole-genome assembly in Chrysanthemum seticuspe, a model species of Chrysanthemums, and its application to genetic and gene discovery analysis

DNA Research ◽ 10.1093/dnares/dsy048 ◽ 2019 ◽ Vol 26 (3) ◽ pp. 195-203 ◽ Cited By ~ 19

Author(s): Hideki Hirakawa ◽ Katsuhiko Sumitomo ◽ Tamotsu Hisamatsu ◽ Soichiro Nagano ◽ Kenta Shirasawa ◽ ...

Keyword(s): Genome Assembly ◽ De Novo ◽ Gene Discovery ◽ Whole Genome ◽ Model Species

Identification of α-enolase as a prognostic and diagnostic precancer biomarker in oral submucous fibrosis

Journal of Clinical Pathology ◽ 10.1136/jclinpath-2017-204430 ◽ 2017 ◽ Vol 71 (3) ◽ pp. 228-238 ◽ Cited By ~ 4

Author(s): Swarnendu Bag ◽ Debabrata Dutta ◽ Amrita Chaudhary ◽ Bidhan Chandra Sing ◽ Mousumi Pal ◽ ...

Keyword(s): De Novo ◽ Pain Treatment ◽ Oral Submucous Fibrosis ◽ Peptide Sequencing ◽ PCR Analysis ◽ RT-PCR ◽ Sequence Coverage ◽ Protein Marker ◽ Peptide Mass ◽ Submucous Fibrosis

Aims: Diagnostic ambiguities regarding the malignant potential of oral submucous fibrosis (OSF), an oral precancerous condition having dysplastic and non-dysplastic isoforms, are a major obstacle to early intervention in oral squamous cell carcinoma (OSCC) patients. Our goal is to identify proteomic signatures from biopsies that can be used as precancer diagnostic markers for patients suffering from OSF.

Methods: High-throughput techniques, namely de novo peptide sequencing (1D SDS-PAGE coupled with nanoLC MALDI tandem mass spectrometry (MS/MS)-based peptide mass fingerprinting), immunohistochemistry (IHC), Western blot (WB) and real-time PCR (RT-PCR) analysis, were used for biomarker identification and multilevel validation.

Results: Alpha-enolase is identified as an overexpressed protein in biopsies of oral submucous fibrosis with dysplasia (OSFWD) compared with oral submucous fibrosis without dysplasia (OSFWT) and normal oral mucosa (NOM). Total proteome analysis of an overexpressed protein band around 47 kDa of OSFWD identifies 334 peptides corresponding to 61 human proteins. Among them, α-enolase is identified as the prime protein, with the highest number of peptides (44 out of 334 peptides) and the highest sequence coverage (66.4%). Furthermore, RT-PCR, WB and IHC analyses also show mRNA and tissue-level upregulation of α-enolase in OSFWD, validating α-enolase as a precancer marker.

Conclusions: This study for the first time identifies and validates α-enolase as a novel biomarker for early diagnosis of the malignant potential of OSF. Hence, the identified protein marker, α-enolase, can help in early therapeutic intervention for OSF patients, leading to reductions in patients' pain and treatment cost and to improved quality of life.
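
Sequence coverage, as reported in the results above, is conventionally the percentage of a protein's residues covered by at least one identified peptide. The sketch below shows that calculation in general terms; the function name, the 1-based inclusive span convention, and the toy numbers are assumptions for illustration and are not taken from the study.

```python
def sequence_coverage(protein_length, peptide_spans):
    """Percentage of residues covered by at least one peptide.

    `peptide_spans` holds (start, end) positions, 1-based and inclusive
    (an assumed convention). Overlapping peptides are counted once.
    """
    covered = [False] * protein_length
    for start, end in peptide_spans:
        for pos in range(max(start, 1), min(end, protein_length) + 1):
            covered[pos - 1] = True
    return 100.0 * sum(covered) / protein_length


# Made-up example: a 50-residue protein with three peptides, two overlapping.
spans = [(1, 10), (5, 20), (41, 50)]
pct = sequence_coverage(50, spans)
print(pct)
```

Note that many peptides do not imply high coverage on their own: overlapping peptides add to the peptide count without adding covered residues.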

Accurate long-read de novo assembly evaluation with Inspector

Genome Biology ◽ 10.1186/s13059-021-02527-4 ◽ 2021 ◽ Vol 22 (1)

Author(s): Yu Chen ◽ Yixin Zhang ◽ Amy Y. Wang ◽ Min Gao ◽ Zechen Chong

Keyword(s): Genome Assembly ◽ De Novo Assembly ◽ In Silico ◽ Large Scale ◽ De Novo ◽ Small Scale ◽ De Novo Genome Assembly ◽ Consensus Sequences ◽ Assembly Evaluation ◽ Long Read

Abstract: Long-read de novo genome assembly continues to advance rapidly. However, there is a lack of effective tools to accurately evaluate the assembly results, especially for structural errors. We present Inspector, a reference-free long-read de novo assembly evaluator which faithfully reports types of errors and their precise locations. Notably, Inspector can correct the assembly errors based on consensus sequences derived from raw reads covering erroneous regions. Based on in silico and long-read assembly results from multiple long-read data and assemblers, we demonstrate that in addition to providing generic metrics, Inspector can accurately identify both large-scale and small-scale assembly errors.

The Making of Long-Lasting Memories: A Fruit Fly Perspective

Frontiers in Behavioral Neuroscience ◽ 10.3389/fnbeh.2021.662129 ◽ 2021 ◽ Vol 15

Author(s): Camilla Roselli ◽ Mani Ramaswami ◽ Tamara Boto ◽ Isaac Cervantes-Sandoval

Keyword(s): Learning And Memory ◽ Neural Activity ◽ Molecular Mechanisms ◽ De Novo ◽ Fruit Fly ◽ Memory Formation ◽ Synaptic Activity ◽ Model Organisms ◽ Transient Wave ◽ Control Of Gene Expression

Understanding the nature of the molecular mechanisms underlying memory formation, consolidation, and forgetting is one of the most fascinating challenges in modern neuroscience. The encoding, stabilization and elimination of memories rely on the structural reorganization of synapses. These changes enable the facilitation or depression of neural activity in response to the acquisition of new information. In other words, these changes affect the weight of specific nodes within a neural network. We know that these plastic reorganizations require de novo protein synthesis in the context of long-term memory (LTM). This process depends on neural activity triggered by the learned experience. The use of model organisms like Drosophila melanogaster has proven essential for advancing our knowledge in the field of neuroscience. Flies offer an optimal combination of a more straightforward nervous system, composed of a limited number of cells, while still displaying complex behaviors. Studies in Drosophila neuroscience, which have expanded over several decades, have been critical for understanding the cellular and molecular mechanisms leading to the synaptic and behavioral plasticity occurring in the context of learning and memory. This is possible thanks to sophisticated technical approaches that enable precise control of gene expression in the fruit fly as well as neural manipulation, such as chemogenetics, thermogenetics, or optogenetics. The search for the identity of genes expressed as a result of memory acquisition has been an active interest since the origins of behavioral genetics. From screenings of more or less specific candidates to broader studies based on transcriptome analysis, our understanding of the genetic control behind LTM has expanded exponentially in the past years. Here we review recent literature regarding how the formation of memories induces a rapid, extensive and, in many cases, transient wave of transcriptional activity.
After a consolidation period, transcriptome changes seem more stable and likely represent the synthesis of new proteins. The complexity of the circuitry involved in memory formation and consolidation is such that there are localized changes in neural activity, both regarding temporal dynamics and the nature of neurons and subcellular locations affected, hence inducing specific temporal and localized changes in protein expression. Different types of neurons are recruited at different times into memory traces. In LTM, the synthesis of new proteins is required in specific subsets of cells. This de novo translation can take place in the somatic cytoplasm and/or locally in distinct zones of compartmentalized synaptic activity, depending on the nature of the proteins and the plasticity-inducing processes that occur. We will also review recent advances in understanding how localized changes are confined to the relevant synapse. These recent studies have led to exciting discoveries regarding proteins that were not previously involved in learning and memory processes. This invaluable information will lead to future functional studies on the roles that hundreds of new molecular actors play in modulating neural activity.

Software Evaluation for de novo Detection of Transposons

10.1101/2021.02.08.430290 ◽ 2021

Author(s): Matias Rodriguez ◽ Wojciech Makałowski

Keyword(s): Transposable Elements ◽ Genome Evolution ◽ De Novo ◽ Simulated Data ◽ Genomic Sequences ◽ Software Evaluation ◽ Easy Task ◽ Eukaryotic Genomes

Abstract: Transposable elements (TEs) are major genomic components in most eukaryotic genomes and play an important role in genome evolution. However, despite their relevance, the identification of TEs is not an easy task, and a number of tools have been developed to tackle this problem. To better understand how they perform, we tested several widely used tools for de novo TE detection and compared their performance on both simulated data and well-curated genomic sequences. The results will be helpful for identifying common issues associated with TE annotation and for evaluating how comparable the results obtained with different tools are.

gcaPDA: A Haplotype-resolved Diploid Assembler

10.1101/2021.05.31.446328 ◽ 2021

Author(s): Xie Min ◽ Linfeng Yang ◽ Chenglin Jiang ◽ Shenshen Wu ◽ Cheng Luo ◽ ...

Keyword(s): De Novo ◽ Low Complexity ◽ F1 Hybrid ◽ Functional Studies ◽ Assembly Result ◽ Eukaryotic Genomes

Generating chromosome-scale, haplotype-resolved assemblies is important for functional studies. However, current de novo assemblers are either haploid assemblers that discard allelic information, or diploid assemblers that can only tackle genomes of low complexity. Here, we report a diploid assembler, gcaPDA (gamete cells assisted Phased Diploid Assembler), which exploits haploid gamete cells to assist in resolving haplotypes. We generate chromosome-scale phased diploid assemblies for the highly heterozygous and repetitive genome of a maize F1 hybrid using gcaPDA and evaluate the assembly result thoroughly. Because it can cope with complex genomes and has fewer restrictions on its application than other diploid assemblers, gcaPDA is likely to find broad applications in studies of eukaryotic genomes.

An optimized approach for local de novo assembly of overlapping paired-end RAD reads from multiple individuals

Royal Society Open Science ◽ 10.1098/rsos.171589 ◽ 2018 ◽ Vol 5 (2) ◽ pp. 171589 ◽ Cited By ~ 4

Author(s): Yu-Long Li ◽ Dong-Xiu Xue ◽ Bai-Dong Zhang ◽ Jin-Xian Liu

Keyword(s): Data Reduction ◽ De Novo Assembly ◽ Genetic Variance ◽ Restriction Site ◽ De Novo ◽ Optimal Number ◽ RAD Sequencing ◽ Conservation Genomics ◽ Model Species

Restriction site-associated DNA (RAD) sequencing is revolutionizing studies in ecological, evolutionary and conservation genomics. However, the assembly of paired-end RAD reads with random-sheared ends is still challenging, especially for non-model species with high genetic variance. Here, we present an efficient, optimized approach implemented in a software pipeline, RADassembler, which makes full use of paired-end RAD reads with random-sheared ends from multiple individuals to assemble RAD contigs. RADassembler integrates algorithms for choosing the optimal number of mismatches within and across individuals at the clustering stage, and then uses a two-step assembly approach at the assembly stage. RADassembler also uses data reduction and parallelization strategies to improve efficiency. Assembly results on both simulated and real RAD datasets demonstrated that, compared with other tools, RADassembler consistently assembled the appropriate number of contigs with high quality, and more read pairs were properly mapped to the assembled contigs. This approach provides an optimal tool for dealing with the complexity of assembling paired-end RAD reads with random-sheared ends for non-model species in ecological, evolutionary and conservation studies. RADassembler is available at https://github.com/lyl8086/RADscripts.
