Comparative Genomics Analysis of Repetitive Elements in Ten Gymnosperm Species: “Dark Repeatome” and Its Abundance in Conifer and Gnetum Species

Repetitive elements (REs) and transposable elements (TEs) can comprise up to 80% of some plant genomes and may be essential for regulating their evolution and adaptation. “Repeatome” information is often unavailable in assembled genomes because repeat-rich genomic regions are challenging to assemble and are often missing from the final assembly. However, raw genomic sequencing data contain rich information about REs/TEs. Here, raw genomic NGS reads of 10 gymnosperm species were studied for the content and abundance patterns of their “repeatome”. We combined alignment against databases of repetitive elements with de novo assembly of highly repetitive sequences from genomic sequencing reads to characterize and calculate the abundance of known and putative repetitive elements in the genomes of 10 gymnosperm species: Pinus taeda, Pinus sylvestris, Pinus sibirica, Picea glauca, Picea abies, Abies sibirica, Larix sibirica, Juniperus communis, Taxus baccata, and Gnetum gnemon. We found that genome abundances of known and newly discovered putative repeats are specific to phylogenetically close groups of species and match biological taxa. The grouping of species based on abundances of known repeats closely matches the grouping based on abundances of newly discovered putative repeats (kChains), and both match the known taxonomic relations.


Genome size and identification of abundant repetitive sequences in Vallisneria spinulosa

PeerJ ◽

10.7717/peerj.3982 ◽

2017 ◽

Vol 5 ◽

pp. e3982 ◽

Cited By ~ 3

Author(s):

RuiJuan Feng ◽

Xin Wang ◽

Min Tao ◽

Guanchao Du ◽

Qishuo Wang

Keyword(s):

Genome Size ◽

Aquatic Plant ◽

Nuclear DNA ◽

De Novo ◽

Repetitive Sequences ◽

Nuclear DNA Content ◽

LTR Retrotransposons ◽

Sequencing Data ◽

Next Generation Sequencing (NGS) ◽

Generation Sequencing

Vallisneria spinulosa is a freshwater aquatic plant of ecological and economic importance. However, there is limited cytogenetic and genomic information on Vallisneria. In this study, we measured the nuclear DNA content of Vallisneria spinulosa by flow cytometry, performed a de novo assembly, and annotated repetitive sequences by using a combination of next-generation sequencing (NGS) and bioinformatics tools. The genome size of Vallisneria spinulosa is approximately 3,595 Mbp, of which nearly 60% consists of repetitive sequences. The majority of the repetitive sequences are LTR-retrotransposons, which comprise 43% of the genome. Although the amount of sequencing data used in this study was not sufficient for a whole-genome assembly, it could generate an overview of representative elements in the genome. These results will lay a new foundation for further studies on various species that belong to the Vallisneria genus.


Contamination as a major factor in poor Illumina assembly of microbial isolate genomes

10.1101/081885 ◽

2016 ◽

Cited By ~ 5

Author(s):

Haeyoung Jeong ◽

Jae-Goo Pan ◽

Seung-Hwan Park

Keyword(s):

Illumina Sequencing ◽

De Novo ◽

Repetitive Sequences ◽

Low Frequency ◽

Read Depth ◽

16S rRNA Genes ◽

rRNA Genes ◽

Sequencing Error ◽

Sequencing Data ◽

Long Reads

Abstract The nonhybrid hierarchical assembly of PacBio long reads is becoming the most preferred method for obtaining genomes for microbial isolates. On the other hand, among massive numbers of Illumina sequencing reads produced, there is a slim chance of re-evaluating failed microbial genome assembly (high contig number, large total contig size, and/or the presence of low-depth contigs). We generated Illumina-type test datasets with various levels of sequencing error, pretreatment (trimming and error correction), repetitive sequences, contamination, and ploidy from both simulated and real sequencing data and applied k-mer abundance analysis to quickly detect possible diagnostic signatures of poor assemblies. Contamination was the only factor leading to poor assemblies for the test dataset derived from haploid microbial genomes, resulting in an extraordinary peak within low-frequency k-mer range. When thirteen Illumina sequencing reads of microbes belonging to genera Bacillus or Paenibacillus from a single multiplexed run were subjected to a k-mer abundance analysis, all three samples leading to poor assemblies showed peculiar patterns of contamination. Read depth distribution along the contig length indicated that all problematic assemblies suffered from too many contigs with low average read coverage, where 1% to 15% of total reads were mapped to low-coverage contigs. We found that subsampling or filtering out reads having rare k-mers could efficiently remove low-level contaminants and greatly improve the de novo assemblies. An analysis of 16S rRNA genes recruited from reads or contigs and the application of read classification tools originally designed for metagenome analyses can help identify the source of a contamination. The unexpected presence of proteobacterial reads across multiple samples, which had no relevance to our lab environment, implies that such prevalent contamination might have occurred after the DNA preparation step, probably at the place where sequencing service was provided.
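
The k-mer abundance analysis and rare-k-mer filtering described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`count_kmers`, `kmer_spectrum`, `drop_reads_with_rare_kmers`), the choice of `k`, and the `min_count` threshold are all hypothetical, and the sketch ignores reverse complements and base qualities, which a real k-mer counter would usually handle.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only; an assumption)."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map abundance -> number of distinct k-mers seen with that abundance."""
    return Counter(counts.values())


def drop_reads_with_rare_kmers(reads, counts, k, min_count):
    """Keep only reads whose k-mers all occur at least min_count times."""
    kept = []
    for read in reads:
        seq = read.upper()
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        kmers = [km for km in kmers if "N" not in km]
        if kmers and all(counts[km] >= min_count for km in kmers):
            kept.append(read)
    return kept


if __name__ == "__main__":
    # Toy data: five copies of a "genome" read plus one low-level "contaminant".
    reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
    counts = count_kmers(reads, k=4)
    print(sorted(kmer_spectrum(counts).items()))
    print(drop_reads_with_rare_kmers(reads, counts, k=4, min_count=2))
```

In the toy data, the contaminant read contributes only k-mers seen once, so it shows up as an excess at abundance 1 in the spectrum and is removed by the filter. On real data the threshold would need to be chosen from the observed spectrum rather than fixed in advance.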


Telomere-to-telomere assembly of the genome of an individual Oikopleura dioica from Okinawa using Nanopore-based sequencing

BMC Genomics ◽

10.1186/s12864-021-07512-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Aleksandra Bliznina ◽

Aki Masunaga ◽

Michael J. Mansfield ◽

Yongkai Tan ◽

Andrew W. Liu ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Hybrid Approach ◽

GC Content ◽

Repetitive Elements ◽

Genome Comparison ◽

Sequencing Data ◽

Single Male ◽

Oikopleura Dioica ◽

Genomic Features

Abstract Background The larvacean Oikopleura dioica is an abundant tunicate plankton with the smallest (65–70 Mbp) non-parasitic, non-extremophile animal genome identified to date. Currently, there are two genomes available for the Bergen (OdB3) and Osaka (OSKA2016) O. dioica laboratory strains. Both assemblies have full genome coverage and high sequence accuracy. However, a chromosome-scale assembly has not yet been achieved. Results Here, we present a chromosome-scale genome assembly (OKI2018_I69) of the Okinawan O. dioica produced using long-read Nanopore and short-read Illumina sequencing data from a single male, combined with Hi-C chromosomal conformation capture data for scaffolding. The OKI2018_I69 assembly has a total length of 64.3 Mbp distributed among 19 scaffolds. 99% of the assembly is contained within five megabase-scale scaffolds. We found telomeres on both ends of the two largest scaffolds, which represent assemblies of two fully contiguous autosomal chromosomes. Each of the other three large scaffolds has telomeres at one end only, and we propose that they correspond to sex chromosomes split into a pseudo-autosomal region and X-specific or Y-specific regions. Indeed, these five scaffolds mostly correspond to equivalent linkage groups in OdB3, suggesting overall agreement in chromosomal organization between the two populations. At a more detailed level, the OKI2018_I69 assembly possesses similar genomic features in gene content and repetitive elements reported for OdB3. The Hi-C map suggests few reciprocal interactions between chromosome arms. At the sequence level, multiple genomic features such as GC content and repetitive elements are distributed differently along the short and long arms of the same chromosome. Conclusions We show that a hybrid approach of integrating multiple sequencing technologies with chromosome conformation information results in an accurate de novo chromosome-scale assembly of O. dioica’s highly polymorphic genome. This genome assembly opens up the possibility of cross-genome comparison between O. dioica populations, as well as of studies of chromosomal evolution in this lineage.


Telomere-to-telomere assembly of the genome of an individual Oikopleura dioica from Okinawa using Nanopore-based sequencing

10.1101/2020.09.11.292656 ◽

2020 ◽

Author(s):

Aleksandra Bliznina ◽

Aki Masunaga ◽

Michael J. Mansfield ◽

Yongkai Tan ◽

Andrew W. Liu ◽

...

Keyword(s):

De Novo ◽

Hybrid Approach ◽

GC Content ◽

Chromosomal Evolution ◽

Repetitive Elements ◽

Sequencing Data ◽

Single Male ◽

Oikopleura Dioica ◽

Chromosome Conformation ◽

Genomic Features

Abstract Background The larvacean Oikopleura dioica is an abundant tunicate plankton with the smallest (65–70 Mbp) non-parasitic, non-extremophile animal genome identified to date. Currently, there are two genomes available for the Bergen (OdB3) and Osaka (OSKA2016) O. dioica laboratory strains. Both assemblies have full genome coverage and high sequence accuracy. However, a chromosome-scale assembly has not yet been achieved. Results Here, we present a chromosome-scale genome assembly (OKI2018_I69) of the Okinawan O. dioica produced using long-read Nanopore and short-read Illumina sequencing data from a single male, combined with Hi-C chromosomal conformation capture data for scaffolding. The OKI2018_I69 assembly has a total length of 64.3 Mbp distributed among 19 scaffolds. 99% of the assembly is in five megabase-scale scaffolds. We found telomeres on both ends of the two largest scaffolds, which represent assemblies of two fully contiguous autosomal chromosomes. Each of the other three large scaffolds has telomeres at one end only, and we propose that they correspond to sex chromosomes split into a pseudo-autosomal region and X-specific or Y-specific regions. Indeed, these five scaffolds mostly correspond to equivalent linkage groups of OdB3, suggesting overall agreement in chromosomal organization between the two populations. At a more detailed level, the OKI2018_I69 assembly possesses similar genomic features in gene content and repetitive elements reported for OdB3. The Hi-C map suggests few reciprocal interactions between chromosome arms. At the sequence level, multiple genomic features such as GC content and repetitive elements are distributed differently along the short and long arms of the same chromosome. Conclusions We show that a hybrid approach of integrating multiple sequencing technologies with chromosome conformation information results in an accurate de novo chromosome-scale assembly of O. dioica’s highly polymorphic genome. This assembly will be a useful resource for genome-wide comparative studies between O. dioica and other species, as well as studies of chromosomal evolution in this lineage.


Insights from the first genome assembly of Onion (Allium cepa)

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab243 ◽

2021 ◽

Author(s):

Richard Finkers ◽

Martijn van Kaauwen ◽

Kai Ament ◽

Karin Burger-Meijer ◽

Raymond Egging ◽

...

Keyword(s):

Ab Initio ◽

De Novo ◽

Gene Prediction ◽

Repetitive Sequences ◽

Linkage Maps ◽

Vegetable Crop ◽

Putative Gene ◽

Final Assembly ◽

Genetic Linkage Maps ◽

High Quality Genome

Abstract Onion is an important vegetable crop with an estimated genome size of 16 Gb. We describe the de novo assembly and ab initio annotation of the genome of a doubled haploid onion line DHCU066619, which resulted in a final assembly of 14.9 Gb with an N50 of 464 Kb. Of this, 2.4 Gb was ordered into 8 pseudomolecules using four genetic linkage maps. The remainder of the genome is available in 89.6 K scaffolds. Only 72.4% of the genome could be identified as repetitive sequences, which consist, to a large extent, of (retro)transposons. In addition, an estimated 20% of the putative (retro)transposons had accumulated a large number of mutations, hampering their identification but facilitating their assembly. These elements are probably already quite old. The ab initio gene prediction indicated 540,925 putative gene models, which is far more than expected, possibly due to the presence of pseudogenes. Of these models, 47,066 showed RNASeq support. No gene-rich regions were found; genes are uniformly distributed over the genome. Analysis of synteny with A. sativum (garlic) showed collinearity but also major rearrangements between both species. This assembly is the first high-quality genome sequence available for the study of onion and will be a valuable resource for further research.
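
Several abstracts in this listing report an N50 value. As a reminder of how that statistic is commonly computed, here is a minimal Python sketch. The function name `n50` and the toy lengths are illustrative assumptions; the tools actually used by the authors are not stated in the abstracts.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() needs at least one sequence length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy assembly, lengths in arbitrary units (not data from any paper above).
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because N50 depends only on the length distribution, a high value says nothing about correctness on its own, which is why the abstracts pair it with measures such as linkage-map anchoring or BUSCO completeness.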


Genomes of the willow-galling sawflies Euura lappo and Eupontania aestiva (Hymenoptera: Tenthredinidae): a resource for research on ecological speciation, adaptation, and gall induction

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab094 ◽

2021 ◽

Author(s):

Craig Michell ◽

Saskia Wutke ◽

Manuel Aranda ◽

Tommi Nyman

Keyword(s):

De Novo ◽

Repetitive Elements ◽

Insect Order ◽

Plant Feeding ◽

Final Assembly ◽

Oxford Nanopore ◽

Long Read ◽

Hymenopteran Species ◽

Genomic Adaptation ◽

Ecological Importance

Abstract Hymenoptera are a hyperdiverse insect order represented by over 153,000 different species. As many hymenopteran species perform various crucial roles for our environment, such as pollination, herbivory, and parasitism, they are of high economic and ecological importance. There are 99 hymenopteran genomes in the NCBI database, yet only five are representative of the paraphyletic suborder Symphyta (sawflies, woodwasps, and horntails), while the rest represent the suborder Apocrita (bees, wasps, and ants). Here, using a combination of 10X Genomics linked-read sequencing, Oxford Nanopore long-read technology, and Illumina short-read data, we assembled the genomes of two willow-galling sawflies (Hymenoptera: Tenthredinidae: Nematinae: Euurina): the bud-galling species Euura lappo and the leaf-galling species Eupontania aestiva. The final assembly for E. lappo is 259.85 Mbp in size, with a contig N50 of 209.0 kbp and a BUSCO score of 93.5%. The E. aestiva genome is 222.23 Mbp in size, with a contig N50 of 49.7 kbp and a 90.2% complete BUSCO score. De novo annotation of repetitive elements showed that 27.45% of the genome was composed of repetitive elements in E. lappo and 16.89% in E. aestiva, which is a marked increase compared to previously published hymenopteran genomes. The genomes presented here provide a resource for inferring phylogenetic relationships among basal hymenopterans, comparative studies on host-related genomic adaptation in plant-feeding insects, and research on the mechanisms of plant manipulation by gall-inducing insects.


First de novo whole genome sequencing and assembly of the bar-headed goose

PeerJ ◽

10.7717/peerj.8914 ◽

2020 ◽

Vol 8 ◽

pp. e8914 ◽

Cited By ~ 1

Author(s):

Wen Wang ◽

Fang Wang ◽

Rongkai Hao ◽

Aizhen Wang ◽

Kirill Sharshov ◽

...

Keyword(s):

High Altitude ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Genome Assembly ◽

De Novo ◽

Gene Prediction ◽

Repetitive Sequences ◽

Gene Families ◽

Whole Genome ◽

Sequencing Data

Background The bar-headed goose (Anser indicus) mainly inhabits the plateau wetlands of Asia. As a specialized high-altitude species, bar-headed geese can migrate between South and Central Asia and annually fly twice over the Himalayan mountains along the central Asian flyway. The physiological, biochemical and behavioral adaptations of bar-headed geese to high-altitude living and flying have raised much interest. However, to date, there is still no genome assembly information publicly available for bar-headed geese. Methods In this study, we present the first de novo whole genome sequencing and assembly of the bar-headed goose, along with gene prediction and annotation. Results 10X Genomics sequencing produced a total of 124 Gb of sequencing data, covering the estimated genome size of the bar-headed goose 103 times (average coverage). The genome assembly comprised 10,528 scaffolds, with a total length of 1.143 Gb and a scaffold N50 of 10.09 Mb. Annotation of the bar-headed goose genome assembly identified a total of 102 Mb (8.9%) of repetitive sequences, 16,428 protein-coding genes, and 282 tRNAs. In total, we determined that there were 63 expanded and 20 contracted gene families in the bar-headed goose compared with the other 15 vertebrates. We also performed a positive selection analysis between the bar-headed goose and the closely related low-altitude goose, swan goose (Anser cygnoides), to uncover its genetic adaptations to the Qinghai-Tibetan Plateau. Conclusion We reported the currently most complete genome sequence of the bar-headed goose. Our assembly will provide a valuable resource to enhance further studies of the gene functions of the bar-headed goose. The data will also be valuable for facilitating studies of the evolution, population genetics and high-altitude adaptations of the bar-headed goose at the genomic level.
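
The coverage and repeat-content figures in this abstract follow from simple ratios, which the short Python sketch below reproduces. The helper names are hypothetical, and the "implied genome size" is only a back-calculation from the two reported numbers (124 Gb of data at 103x), not a value stated by the authors.

```python
def implied_genome_size(total_bases, mean_coverage):
    """Genome size implied by total sequenced bases and mean coverage."""
    return total_bases / mean_coverage


def percent_of_assembly(feature_bases, assembly_bases):
    """Share of the assembly occupied by a feature class, in percent."""
    return 100.0 * feature_bases / assembly_bases


if __name__ == "__main__":
    # Figures taken from the abstract: 124 Gb of reads, 103x coverage,
    # 102 Mb of repeats in a 1.143 Gb assembly.
    size_gb = implied_genome_size(124e9, 103) / 1e9
    repeats_pct = percent_of_assembly(102e6, 1.143e9)
    print(f"implied genome size: {size_gb:.2f} Gb")
    print(f"repeat content: {repeats_pct:.1f}%")
```

The repeat percentage agrees with the 8.9% quoted in the abstract, and the implied genome size lands slightly above the 1.143 Gb assembly length, as one would expect if a small part of the genome remains unassembled.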


Reference Quality Assembly of the 3.5 Gb genome of Capsicum annuum from a Single Linked-Read Library

10.1101/152777 ◽

2017 ◽

Cited By ~ 1

Author(s):

Amanda M. Hulse-Kemp ◽

Shamoni Maheshwari ◽

Kevin Stoffel ◽

Theresa A. Hill ◽

David Jaffe ◽

...

Keyword(s):

Capsicum Annuum ◽

De Novo ◽

Repetitive Sequences ◽

Complete Recovery ◽

Plant Genomes ◽

Final Assembly ◽

Eukaryotic Genes ◽

Reference Quality ◽

Pepper Genome ◽

Genome Assemblies

Abstract Background Linked-Read sequencing technology has recently been employed successfully for de novo assembly of multiple human genomes; however, the utility of this technology for complex plant genomes is unproven. We evaluated the technology for this purpose by sequencing the 3.5 gigabase (Gb) diploid pepper (Capsicum annuum) genome with a single Linked-Read library. Plant genomes, including pepper, are characterized by long, highly similar repetitive sequences. Accordingly, significant effort is used to ensure the sequenced plant is highly homozygous and the resulting assembly is a haploid consensus. With a phased assembly approach, we targeted a heterozygous F1 derived from a wide cross to assess the ability to derive both haplotypes for a pungency gene characterized by a large insertion/deletion. Results The Supernova software generated a highly ordered, more contiguous sequence assembly than all currently available C. annuum reference genomes. Eighty-four percent of the final assembly was anchored and oriented using four de novo linkage maps. A comparison of the annotation of conserved eukaryotic genes indicated the completeness of assembly. The validity of the phased assembly is further demonstrated with the complete recovery of both 2.5 kb insertion/deletion haplotypes of the PUN1 locus in the F1 sample that represents pungent and non-pungent peppers. Conclusions The most contiguous pepper genome assembly to date has been generated through this work, which demonstrates that Linked-Read library technology provides a rapid tool to assemble de novo complex, highly repetitive, heterozygous plant genomes. This technology can provide an opportunity to cost-effectively develop high-quality reference genome assemblies for other complex plants and compare structural and gene differences through accurate haplotype reconstruction.


Transcriptome sequencing reveals high isoform diversity in the ant Formica exsecta

PeerJ ◽

10.7717/peerj.3998 ◽

2017 ◽

Vol 5 ◽

pp. e3998 ◽

Cited By ~ 4

Author(s):

Kishor Dhaygude ◽

Kalevi Trontti ◽

Jenni Paviala ◽

Claire Morandin ◽

Christopher Wheat ◽

...

Keyword(s):

RNA Sequencing ◽

De Novo ◽

Splice Variants ◽

Transcriptome Assembly ◽

Sequencing Data ◽

Genetic Studies ◽

Final Assembly ◽

Isoform Diversity ◽

Gene Ontologies ◽

Scaffolding Software

Transcriptome resources for social insects have the potential to provide new insight into polyphenism, i.e., how divergent phenotypes arise from the same genome. Here we present a transcriptome based on paired-end RNA sequencing data for the ant Formica exsecta (Formicidae, Hymenoptera). The RNA sequencing libraries were constructed from samples of several life stages of both sexes and female castes of queens and workers, in order to maximize representation of expressed genes. We first compare the performance of common assembly and scaffolding software (Trinity, Velvet-Oases, and SOAPdenovo-trans) in producing de novo assemblies. Second, we annotate the resulting expressed contigs against the currently published genomes of ants and other insects, including the honeybee, to filter genes that have annotation evidence of being true genes. Our pipeline resulted in a final assembly of altogether 39,262 mRNA transcripts, with an average coverage of >300X, belonging to 17,496 unique genes with annotation in the related ant species. Of these genes, 536 were unique to one caste or sex only, highlighting the importance of comprehensive sampling. Our final assembly also showed expression of several splice variants in 6,975 genes, and we show that accounting for splice variants affects the outcome of downstream analyses such as gene ontologies. Our transcriptome provides an outstanding resource for future genetic studies on F. exsecta and other ant species, and the presented transcriptome assembly can be adapted to any non-model species that has genomic resources available from a related taxon.


DeepGRP: engineering a software tool for predicting genomic repetitive elements using Recurrent Neural Networks with attention

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00199-0 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Fabian Hausmann ◽

Stefan Kurtz

Keyword(s):

Machine Translation ◽

De Novo ◽

Repetitive Sequences ◽

Software Tool ◽

Repetitive Elements ◽

Training Data ◽

Implementation Framework ◽

Neural Machine Translation ◽

Species Specific

Abstract Background Repetitive elements contribute a large part of eukaryotic genomes. For example, about 40 to 50% of human, mouse and rat genomes are repetitive. So identifying and classifying repeats is an important step in genome annotation. This annotation step is traditionally performed using alignment-based methods, either in a de novo approach or by aligning the genome sequence to a species-specific set of repetitive sequences. Recently, Li (Bioinformatics 35:4408–4410, 2019) developed a novel software tool, dna-brnn, to annotate repetitive sequences using a recurrent neural network trained on sample annotations of repetitive elements. Results We have developed the methods of dna-brnn further and engineered a new software tool, DeepGRP. This combines the basic concepts of Li (Bioinformatics 35:4408–4410, 2019) with current techniques developed for neural machine translation, the attention mechanism, for the task of nucleotide-level annotation of repetitive elements. An evaluation on the human genome shows a 20% improvement of the Matthews correlation coefficient for the predictions delivered by DeepGRP, when compared to dna-brnn. DeepGRP predicts two additional classes of repeats (compared to dna-brnn) and is able to transfer repeat annotations, using RepeatMasker-based training data, to a different species (mouse). Additionally, we could show that DeepGRP predicts repeats annotated in the Dfam database, but not annotated by RepeatMasker. DeepGRP is highly scalable due to its implementation in the TensorFlow framework. For example, the GPU-accelerated version of DeepGRP is approx. 1.8 times faster than dna-brnn, approx. 8.6 times faster than RepeatMasker and over 100 times faster than HMMER searching for models of the Dfam database. Conclusions By incorporating methods from neural machine translation, DeepGRP achieves a consistent improvement of the quality of the predictions compared to dna-brnn. Improved running times are obtained by employing TensorFlow as implementation framework and the use of GPUs. By incorporating two additional classes of repeats, DeepGRP provides more complete annotations, which were evaluated against three state-of-the-art tools for repeat annotation.
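
The evaluation metric named in this abstract, the Matthews correlation coefficient, can be sketched for the simplest case of a binary per-nucleotide labelling (repeat vs. non-repeat). This is an assumption made for illustration: the abstract mentions several repeat classes, and the paper's exact evaluation procedure is not given here. The function names and the toy label vectors are hypothetical.

```python
import math


def confusion_counts(truth, predicted):
    """Count TP, TN, FP, FN for two equal-length sequences of 0/1 labels."""
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")
    tp = tn = fp = fn = 0
    for t, p in zip(truth, predicted):
        if t and p:
            tp += 1
        elif not t and not p:
            tn += 1
        elif p:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """Binary MCC; returns 0.0 when the denominator is zero (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy per-base labels: 1 = annotated as repeat, 0 = not a repeat.
    truth = [1, 1, 0, 0, 1, 0]
    predicted = [1, 0, 0, 0, 1, 1]
    print(matthews_corrcoef(*confusion_counts(truth, predicted)))
```

MCC ranges from -1 to 1 and stays informative when the two classes are unbalanced, which is a typical situation for nucleotide-level repeat annotation.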
