On the (Im)possibility to Reconstruct Plasmids from Whole Genome Short-Read Sequencing Data

2016 ◽  
Author(s):  
Sergio Arredondo-Alonso ◽  
Willem van Schaik ◽  
Rob J. Willems ◽  
Anita C. Schürch

Abstract: Plasmids are autonomous extra-chromosomal elements in bacterial cells that can carry genes that are important for bacterial survival. To benchmark algorithms for automated plasmid sequence reconstruction from short-read sequencing data, we selected 42 publicly available complete bacterial genome sequences which were assembled by a combination of long- and short-read data. The selected bacterial genome sequence projects span 12 genera, containing 148 plasmids. We predicted plasmids from short-read data with four different programs (PlasmidSPAdes, Recycler, cBar and PlasmidFinder) and compared the outcome to the reference sequences. PlasmidSPAdes reconstructs plasmids based on coverage differences in the assembly graph. It reconstructed most of the reference plasmids (recall = 0.82) but approximately a quarter of the predicted plasmid contigs were false positives (precision = 0.76). PlasmidSPAdes merged 83% of the predictions from genomes with multiple plasmids in a single bin. Recycler searches the assembly graph for sub-graphs corresponding to circular sequences and correctly predicted small plasmids but failed with long plasmids (recall = 0.12, precision = 0.30). cBar, which applies pentamer frequency composition analysis to detect plasmid-derived contigs, showed an overall recall and precision of 0.78 and 0.64. However, cBar only categorizes contigs as plasmid-derived and does not bin the different plasmids correctly within a bacterial isolate. PlasmidFinder, which searches for matches in a replicon database, had the highest precision (1.0) but was restricted by the contents of its database and the contig length obtained from de novo assembly (recall = 0.36). Surprisingly, PlasmidSPAdes and Recycler detected single isolated components corresponding to putative novel small plasmids (<10 kbp) which were also predicted as plasmids by cBar. This study shows that it is possible to automatically predict plasmid sequences, but only for small plasmids. The reconstruction of large plasmids (>50 kbp) containing repeated sequences remains challenging and limits the high-throughput analysis of WGS data.

Author Summary: Short-read sequencing of the DNA of bacteria is often used to understand characteristics such as antibiotic resistance. However, the assembly of short-read sequencing data with the goal of reconstructing a complete genome is often fragmented and leaves gaps. Therefore, independently replicating DNA fragments called plasmids cannot easily be identified from an assembly. Lately, a number of programs have been developed to enable the automated prediction of the sequences of plasmids. Here we tested these programs by comparing their outcomes with complete genome sequences. None of the tested programs were able to fully and unambiguously predict distinct plasmid sequences. All programs performed best with the prediction of plasmids smaller than 50 kbp. Larger plasmids were only correctly predicted if they were present as a single contig in the assembly. While predictions by PlasmidSPAdes and cBar contained most of the plasmids, they were merged with or indistinguishable from other plasmids and sometimes chromosome sequences. PlasmidFinder missed most plasmids but all its predictions were correct. Without manual steps or long-read sequencing information, plasmid reconstruction from short-read sequencing data remains challenging.
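
The recall and precision values quoted in the abstract above follow the usual definitions: recall is the share of reference plasmid sequence that a tool recovers, and precision is the share of a tool's predictions that are truly plasmid. The sketch below shows only the arithmetic. The function name and the counts are hypothetical, picked so that the result rounds to the reported PlasmidSPAdes figures; they are not taken from the study's data, and the study's own unit of counting may differ.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw counts; 0.0 when a ratio is undefined."""
    predicted = true_positives + false_positives   # everything the tool called plasmid
    actual = true_positives + false_negatives      # everything that really is plasmid
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r = precision_recall(true_positives=76, false_positives=24, false_negatives=17)
print(f"precision={p:.2f} recall={r:.2f} false-positive share={1 - p:.2f}")
```

With a precision of 0.76, the false-positive share of the predictions is 1 - 0.76 = 0.24, which is the "approximately a quarter" mentioned in the abstract.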

2019 ◽  
Vol 8 (34) ◽  
Author(s):  
Natsuki Tomariguchi ◽  
Kentaro Miyazaki

Rubrobacter xylanophilus strain AA3-22, belonging to the phylum Actinobacteria, was isolated from nonvolcanic Arima Onsen (hot spring) in Japan. Here, we report the complete genome sequence of this organism, which was obtained by combining Oxford Nanopore long-read and Illumina short-read sequencing data.


2017 ◽  
Author(s):  
Alexander Seitz ◽  
Friederike Hanssen ◽  
Kay Nieselt

The reconstruction of genomes using mapping-based approaches with short reads experiences difficulties when resolving repetitive regions. These repetitive regions in genomes result in low mapping qualities of the respective reads, which in turn lead to many bases that the genotypers leave unresolved. Currently, the reconstruction of these regions is often based on modified references in which the repetitive regions are masked. However, for many references, such masked genomes are not available or are based on repetitive regions of other genomes. Our idea is to identify repetitive regions in the reference genome de novo. These regions can then be reconstructed separately using short-read sequencing data. Afterwards, the reconstructed repetitive sequence can be inserted into the reconstructed genome. We present the program DACCOR, which performs these steps automatically. Our results show an increased base pair resolution of the repetitive regions in the reconstruction of Treponema pallidum samples, resulting in fewer unresolved bases.
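
The first step described above, finding repetitive regions in a reference de novo, can be illustrated with a deliberately naive sketch. This is not DACCOR's algorithm. It assumes exact repeats only, marks every k-mer that occurs more than once, and merges overlapping or adjacent hits into half-open intervals. The function name `repeated_regions` and the default `k` are hypothetical.

```python
from collections import defaultdict


def repeated_regions(reference, k=12):
    """Return half-open (start, end) intervals covered by exactly repeated k-mers."""
    positions = defaultdict(list)
    for i in range(len(reference) - k + 1):
        positions[reference[i:i + k]].append(i)

    # Start positions of every k-mer that occurs at least twice.
    starts = sorted(i for hits in positions.values() if len(hits) > 1 for i in hits)

    regions = []
    for start in starts:
        end = start + k
        if regions and start <= regions[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]


# Toy reference: the word GATTACA appears twice, separated by CCG.
print(repeated_regions("GATTACACCGGATTACA", k=4))
```

On the toy string this reports the two copies of GATTACA as the intervals (0, 7) and (10, 17). A real tool would also need to handle inexact repeats and reverse complements, which this sketch ignores.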


2020 ◽  
Vol 9 (21) ◽  
Author(s):  
Kentaro Miyazaki ◽  
Apirak Wiseschart ◽  
Kusol Pootanakit ◽  
Kei Kitahara

We isolated the novel strain Vibrio rotiferianus AM7 from the shell of an abalone. In this article, we report the complete genome sequence of this organism, which was obtained by combining Oxford Nanopore long-read and Illumina short-read sequencing data.


2021 ◽  
Author(s):  
Benjamin Jaegle ◽  
Luz Mayela Soto-Jimenez ◽  
Robin Burns ◽  
Fernando A. Rabanal ◽  
Magnus Nordborg

Background: It is becoming apparent that genomes harbor massive amounts of structural variation, and that this variation has largely gone undetected for technical reasons. In addition to being inherently interesting, structural variation can cause artifacts when short-read sequencing data are mapped to a reference genome. In particular, spurious SNPs (that do not show Mendelian segregation) may result from mapping of reads to duplicated regions. Recalling SNPs using the raw reads of the 1001 Arabidopsis Genomes Project, we identified 3.3 million heterozygous SNPs (44% of total). Given that Arabidopsis thaliana (A. thaliana) is highly selfing, we hypothesized that these SNPs reflected cryptic copy number variation, and investigated them further. Results: While genuine heterozygosity should occur in tracts within individuals, heterozygosity at a particular locus is instead shared across individuals in a manner that strongly suggests it reflects segregating duplications rather than actual heterozygosity. Focusing on pseudo-heterozygosity in annotated genes, we used GWAS to map the position of the duplicates, identifying 2500 putatively duplicated genes. The results were validated using de novo genome assemblies from six lines. Specific examples included an annotated gene and nearby transposon that, in fact, transpose together. Conclusions: Our study confirms that most heterozygous SNP calls in A. thaliana are artifacts, and suggests that great caution is needed when analysing SNP data from short-read sequencing. The finding that 10% of annotated genes are copy-number variable, and the realization that neither gene- nor transposon-annotation necessarily tells us what is actually mobile in the genome, suggest that future analyses based on independently assembled genomes will be very informative.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e4742 ◽  
Author(s):  
Alexander Seitz ◽  
Friederike Hanssen ◽  
Kay Nieselt

The reconstruction of genomes using mapping-based approaches with short reads experiences difficulties when resolving repetitive regions. These repetitive regions in genomes result in low mapping qualities of the respective reads, which in turn lead to many unresolved bases. Currently, the reconstruction of these regions is often based on modified references in which the repetitive regions are masked. However, for many references, such masked genomes are not available or are based on repetitive regions of other genomes. Our idea is to identify repetitive regions in the reference genome de novo. These regions can then be reconstructed separately using short-read sequencing data. Afterward, the reconstructed repetitive sequence can be inserted into the reconstructed genome. We present the program DACCOR (detection, characterization, and reconstruction of repetitive regions), which performs these steps automatically. Our results show an increased base pair resolution of the repetitive regions in the reconstruction of Treponema pallidum samples, resulting in fewer unresolved bases.


2021 ◽  
Author(s):  
Ari Löytynoja

Variation within human genomes is distributed unevenly and variants show spatial clustering. DNA-replication related template switching is a poorly known mutational mechanism capable of causing major chromosomal rearrangements as well as creating short inverted sequence copies that appear as local mutation clusters in sequence comparisons. We reanalyzed haplotype-resolved genome assemblies representing 25 human populations and multinucleotide variants aggregated from 140,000 human sequencing experiments. We found local template switching to explain thousands of complex mutation clusters across the human genome, the loci segregating within and between populations with a small number appearing as de novo mutations. We developed computational tools for genotyping candidate template switch loci using short-read sequencing data and for identification of template switch events using both short-read data and genotype data. These tools will enable building a catalogue of affected loci and studying the cellular mechanisms behind template switching both in healthy organisms and in disease. Strikingly, we noticed that widely-used analysis pipelines for short-read sequencing data - capable of identifying single nucleotide changes - may miss TSM-origin inversions of tens of base pairs, potentially invalidating medical genetic studies searching for causative alleles behind genetic diseases.


Data in Brief ◽  
2021 ◽  
Vol 34 ◽  
pp. 106674
Author(s):  
Stafny DSouza ◽  
Koushik Ponnanna ◽  
Amruthavalli Chokkanna ◽  
Nallur Ramachandra


