Strategies and difficulties in assembling highly recombinogenic plant organelle genomes: a case study

2015 ◽  
Author(s):  
Concita Cantarella ◽  
Rachele Tamburino ◽  
Nunzia Scotti ◽  
Teodoro Cardi ◽  
Nunzio D'Agostino

Plant mitochondrial genomes are larger and more complex than those of other eukaryotes, a feature widely attributed to their recombinogenic nature. The mitochondrial DNA (mtDNA) is usually represented as a single circular map, the so-called master molecule. This molecule includes repeated sequences, some of which are able to recombine, generating sub-genomic molecules in varying amounts that depend on the balance between their recombination and replication rates. Recent advances in DNA sequencing technology have given a huge boost to plant mitochondrial genome projects. Conventional approaches to mitochondrial genome sequencing involve extraction and enrichment of mitochondrial DNA, cloning, and sequencing. Large repeats and the dynamic organization of the mitochondrial genome complicate de novo sequence assembly from short reads. The PacBio RS long-read sequencing platform promises increased read length and unbiased genome coverage, and thus the potential to produce genome sequence data of finished quality (fewer gaps and longer contigs). However, recently published articles have shown that PacBio sequencing alone is still not sufficient to resolve mtDNA assembly-related issues. Here we present a preliminary hybrid assembly of a potato mtDNA based on both PacBio and Illumina reads, and we discuss the strategies and obstacles in assembling genomes that contain recombinationally active repeated sequences, which serve as a constant source of rearrangements.
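The way a recombining repeat pair can turn one master circle into sub-genomic molecules can be illustrated with a toy model. The sketch below is not the authors' method; it assumes the two repeat copies are in direct orientation, and the function name `recombine_direct_repeats` and the segment labels are hypothetical.

```python
from typing import List, Tuple


def recombine_direct_repeats(master: List[str], repeat: str) -> Tuple[List[str], List[str]]:
    """Split a circular segment list at two direct copies of `repeat`.

    `master` lists segment names in order around the circle (the last
    segment joins back to the first). Assumption: `repeat` occurs exactly
    twice, both copies in the same orientation.
    Returns two sub-circles, each keeping one copy of the repeat.
    """
    positions = [i for i, seg in enumerate(master) if seg == repeat]
    if len(positions) != 2:
        raise ValueError("repeat must occur exactly twice in the master circle")
    i, j = positions
    sub_a = master[i:j]              # first repeat copy and the segments before the second copy
    sub_b = master[j:] + master[:i]  # second repeat copy and the remaining segments
    return sub_a, sub_b


# Hypothetical master circle: R a b R c d e (circular)
master = ["R", "a", "b", "R", "c", "d", "e"]
sub_a, sub_b = recombine_direct_repeats(master, "R")
print(sub_a)  # ['R', 'a', 'b']
print(sub_b)  # ['R', 'c', 'd', 'e']
```

An assembler that sees reads from all three molecules at once has to reconcile several valid paths through the same repeat, which is one reason such genomes are hard to assemble into a single map.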


2021 ◽  
Author(s):  
Kishwar Shafin ◽  
Trevor Pesout ◽  
Pi-Chuan Chang ◽  
Maria Nattestad ◽  
Alexey Kolesnikov ◽  
...  

Long-read sequencing has the potential to transform variant detection by reaching regions that are currently difficult to map and by routinely linking adjacent variants to enable read-based phasing. Third-generation nanopore sequence data has demonstrated long read lengths, but current interpretation methods for its novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single nucleotide variant identification method at whole-genome scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution that performs better than the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio-HiFi-polished).
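The Q35+ and Q40+ figures are easier to read as error rates. The following minimal sketch assumes they are Phred-scaled quality values, for which the error probability is 10^(-Q/10); the function name `phred_to_error_rate` is illustrative and does not come from the pipeline.

```python
def phred_to_error_rate(q: float) -> float:
    """Convert a Phred-scaled quality value Q into an error probability."""
    return 10 ** (-q / 10)


for q in (35, 40):
    p = phred_to_error_rate(q)
    # Report the reciprocal: roughly how many bases per expected error
    print(f"Q{q}: error rate {p:.2e}, about 1 error per {round(1 / p)} bases")
```

Under this convention, Q40 corresponds to about one error per 10,000 bases and Q35 to about one error per 3,162 bases.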


2020 ◽  
Author(s):  
Anna E. Syme ◽  
Todd G.B. McLay ◽  
Frank Udovicic ◽  
David J. Cantrill ◽  
Daniel J. Murphy

Abstract Although organelle genomes are typically represented as single, static, circular molecules, there is evidence that the chloroplast genome exists in two structural haplotypes and that the mitochondrial genome can display multiple circular, linear or branching forms. We sequenced and assembled chloroplast and mitochondrial genomes of the Golden Wattle, Acacia pycnantha, using long reads, iterative baiting to extract organelle-only reads, and several assembly algorithms to explore genomic structure. Using a de novo assembly approach agnostic to previous hypotheses about structure, we found that different assemblies revealed contrasting arrangements of genomic segments, a hypothesis supported by mapped reads spanning alternate paths.


The mitochondrial genomes of higher plants are among the largest and most complex organelle genomes described. They are generally multicircular or partly linear; in some species, extrachromosomal plasmids are present. It is proposed that inter- and intramolecular homologous recombination can account for the diversity of the observed genome organizations. The ability of mitochondria to fuse establishes a panmictic mitochondrial DNA population which is in recombinational equilibrium. It is suggested that this suppresses the base mutation rate, and unequal partitioning of the cytoplasm during cell division can lead to the rapid evolution of mitochondrial genome structure. This contrasts with the observed rates of base-sequence and genome evolution in chloroplasts. This difference can be accounted for solely by the inability of chloroplasts to fuse, thereby preventing chloroplast genome panmixis.


GigaScience ◽  
2020 ◽  
Vol 9 (10) ◽  
Author(s):  
Willem de Koning ◽  
Milad Miladi ◽  
Saskia Hiltemann ◽  
Astrid Heikema ◽  
John P Hays ◽  
...  

Abstract Background: Long-read sequencing can be applied to generate very long contigs and even completely assembled genomes at relatively low cost and with minimal sample preparation. As a result, long-read sequencing platforms are becoming more popular. In this respect, the Oxford Nanopore Technologies–based long-read sequencing “nanopore” platform is becoming a widely used tool with a broad range of applications and end-users. However, the need to explore and manipulate the complex data generated by long-read sequencing platforms necessitates accompanying specialized bioinformatics platforms and tools to process the long-read data correctly. Importantly, such tools should additionally help democratize bioinformatics analysis by enabling easy access and ease-of-use solutions for researchers. Results: The Galaxy platform provides a user-friendly interface to computational command line–based tools, handles the software dependencies, and provides refined workflows. The users do not have to possess programming experience or extended computer skills. The interface enables researchers to perform powerful bioinformatics analysis, including the assembly and analysis of short- or long-read sequence data. The newly developed “NanoGalaxy” is a Galaxy-based toolkit for analysing long-read sequencing data, which is suitable for diverse applications, including de novo genome assembly from genomic, metagenomic, and plasmid sequence reads. Conclusions: A range of best-practice tools and workflows for long-read sequence genome assembly has been integrated into a NanoGalaxy platform to facilitate easy access and use of bioinformatics tools for researchers. NanoGalaxy is freely available at the European Galaxy server https://nanopore.usegalaxy.eu with supporting self-learning training material available at https://training.galaxyproject.org.


2020 ◽  
Author(s):  
Yuya Kiguchi ◽  
Suguru Nishijima ◽  
Naveen Kumar ◽  
Masahira Hattori ◽  
Wataru Suda

Abstract Background: The ecological and biological features of the indigenous phage community (virome) in the human gut microbiome are poorly understood, possibly due to many fragmented contigs and fewer complete genomes based on conventional short-read metagenomics. Long-read sequencing technologies have attracted attention as an alternative approach to reconstruct long and accurate contigs from microbial communities. However, the impact of long-read metagenomics on human gut virome analysis has not been well evaluated. Results: Here we present chimera-less PacBio long-read metagenomics of multiple displacement amplification (MDA)-treated human gut virome DNA. The method included the development of a novel bioinformatics tool, SACRA (Split Amplified Chimeric Read Algorithm), which efficiently detects and splits numerous chimeric reads in PacBio reads from the MDA-treated virome samples. SACRA treatment of PacBio reads from five samples markedly reduced the average chimera ratio from 72 to 1.5%, generating chimera-less PacBio reads with an average read-length of 1.8 kb. De novo assembly of the chimera-less long reads generated contigs with an average N50 length of 11.1 kb, whereas those of MiSeq short reads from the same samples were 0.7 kb, dramatically improving contig extension. Alignment of both contig sets generated 378 high-quality merged contigs (MCs) composed of the minimum scaffolds of 434 MiSeq and 637 PacBio contigs, respectively, and also identified numerous MiSeq short fragmented contigs ≤500 bp additionally aligned to MCs, which possibly originated from a small fraction of MiSeq chimeric reads. The alignment also revealed that fragmentations of the scaffolded MiSeq contigs were caused primarily by genomic complexity of the community, including local repeats, hypervariable regions, and highly conserved sequences in and between the phage genomes. 
We identified 142 complete and near-complete phage genomes including 108 novel genomes, varying from 5 to 185 kb in length, the majority of which were predicted to be Microviridae phages including several variants with homologous but distinct genomes, which were fragmented in MiSeq contigs. Conclusions: Long-read metagenomics coupled with SACRA provides an improved method to reconstruct accurate and extended phage genomes from MDA-treated virome samples of the human gut, and potentially from other environmental virome samples.
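The contig comparison above is expressed as N50. As a reminder of what that statistic measures, here is a minimal sketch using the common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length). The contig lengths are made-up toy values, not data from the study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to half the total
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


# Toy contig lengths (hypothetical): total is 20, and 6 + 5 = 11 covers half
print(n50([2, 3, 4, 5, 6]))  # 5
```

The abstract reports an average N50 across samples, so a value like this would be computed per sample and then averaged.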


Author(s):  
Ann McCartney ◽  
Elena Hilario ◽  
Seung-Sub Choi ◽  
Joseph Guhlin ◽  
Jessie Prebble ◽  
...  

We used long-read sequencing data generated from Knightia excelsa R.Br., a nectar-producing Proteaceae tree endemic to Aotearoa New Zealand, to explore how sequencing data type, volume and workflows can impact final assembly accuracy and chromosome construction. Establishing a high-quality genome for this species has specific cultural importance to Māori, the indigenous people, as well as commercial importance to honey producers in Aotearoa New Zealand. Assemblies were produced by five long-read assemblers using data subsampled based on read lengths, two polishing strategies, and two Hi-C mapping methods. Our results from subsampling the data by read length showed that each assembler tested performed differently depending on the coverage and the read length of the data. Assemblies that used longer read lengths (>30 kb) and lower coverage were the most contiguous and the most complete in terms of k-mers and genes. The final genome assembly was constructed into pseudo-chromosomes using all available data assembled with FLYE, polished using Racon/Medaka/Pilon combined, scaffolded using SALSA2 and AllHiC, curated using Juicebox, and validated by synteny with Macadamia. We highlighted the importance of developing assembly workflows based on the volume and type of sequencing data and of establishing a set of robust quality metrics for generating high-quality assemblies. Scaffolding analyses highlighted that problems found in the initial assemblies could not be resolved accurately by utilizing Hi-C data and that scaffolded assemblies were more accurate when the underlying contig assembly was of higher accuracy. These findings provide insight into what is required for future high-quality de novo assemblies of non-model organisms.


2019 ◽  
Author(s):  
Mitchell R. Vollger ◽  
Glennis A. Logsdon ◽  
Peter A. Audano ◽  
Arvis Sulovari ◽  
David Porubsky ◽  
...  

Abstract The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective stand-alone technology for de novo assembly of human genomes.


2017 ◽  
Vol 37 (03) ◽  
pp. 125-136
Author(s):  
Tolulope A. Agunbiade ◽  
Brad S. Coates ◽  
Weilin Sun ◽  
Mu-Rou Tsai ◽  
Maria Carmen Valero ◽  
...  

Abstract Maruca vitrata (Fabricius, 1787) is a cryptic pantropical species of Lepidoptera comprising two unique strains, one inhabiting the American continents (New World strain) and one inhabiting regions spanning from Africa through to Southeast Asia and Northern Australia (Old World strain). In this study, we de novo assembled the complete mitochondrial genome sequence of the New World legume pod borer, M. vitrata, from shotgun sequence data generated on an Illumina HiSeq 2000. Phylogenomic comparisons were made with other previously published mitochondrial genome sequences from crambid moths, including the Old World strain of M. vitrata. The 15,385 bp M. vitrata (New World) sequence has an 80.7% A+T content and encodes the 13 protein-coding, 2 ribosomal RNA and 22 transfer RNA genes in the typical orientation and arrangement of lepidopteran mitochondrial DNAs. Mitochondrial genome-wide comparison between the New and Old World strains of M. vitrata detected 476 polymorphic sites (4.23% nucleotide divergence) with an excess of synonymous substitution as a result of purifying selection. Furthermore, this level of sequence variation suggests that these strains diverged ~1.83 to 2.12 million years ago, assuming a linear rate of short-term substitution. The de novo assemblies of mitochondrial genomes from next-generation sequencing (NGS) reads provide readily available data for similar comparative studies.
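A figure such as the 80.7% A+T content is a simple base-composition statistic. The sketch below shows one way to compute it, assuming ambiguous bases such as N are excluded from the denominator; the function name `at_content` and the example sequence are hypothetical and not taken from the study.

```python
def at_content(seq: str) -> float:
    """Return the A+T percentage of a DNA sequence.

    Assumption: only unambiguous bases (A, C, G, T) are counted, so N and
    other IUPAC ambiguity codes are ignored.
    """
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    at = sum(1 for base in counted if base in "AT")
    return 100.0 * at / len(counted)


# Toy sequence (hypothetical): 2 of the 4 unambiguous bases are A or T
print(at_content("ATNNGC"))  # 50.0
```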


2019 ◽  
Author(s):  
Hannes Becher ◽  
Richard A Nichols

Abstract Nuclear inserts derived from mitochondrial DNA (Numts) encode valuable information. Being mostly non-functional, and accumulating mutations more slowly than mitochondrial sequence, they act like molecular fossils: they preserve information on the ancestral sequences of the mitochondrial DNA. In addition, changes to the Numt sequence since their insertion into the nuclear genome carry information about the nuclear phylogeny. These attributes cannot be reliably exploited if Numt sequence is confused with the mitochondrial genome (mtDNA). The analysis of mtDNA would be similarly compromised by any confusion, for example producing misleading results in DNA barcoding that used mtDNA sequence. We propose a method to distinguish Numts from mtDNA, without the need for comprehensive assembly of the nuclear genome or the physical separation of organelles and nuclei. It exploits the different biases of long and short-read sequencing. We find that short-read data yield mainly mtDNA sequences, whereas long-read sequencing strongly enriches for Numt sequences. We demonstrate the method using genome-skimming (coverage < 1x) data obtained on Illumina short-read and PacBio long-read technology from DNA extracted from six grasshopper individuals. The mitochondrial genome sequences were assembled from the short-read data despite the presence of Numts. The PacBio data contained a much higher proportion of Numt reads (over 16-fold), making us caution against the use of long-read methods for studies using mitochondrial loci. We obtained two estimates of the genomic proportion of Numts. Finally, we introduce “tangle plots”, a way of visualising Numt structural rearrangements and comparing them between samples.

