Nanopore long reads enable the first complete genome assembly of a Malaysian Vibrio parahaemolyticus isolate bearing the pVa plasmid associated with acute hepatopancreatic necrosis disease

F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 2108 ◽  
Author(s):  
Han Ming Gan ◽  
Christopher M Austin

Background: The genome of Vibrio parahaemolyticus MVP1, isolated from a Malaysian aquaculture farm with shrimp acute hepatopancreatic necrosis disease (AHPND), was previously sequenced using Illumina MiSeq and assembled de novo, producing a relatively fragmented assembly. Although the binary toxin genes linked to AHPND were identified in the MVP1 draft genome, they were localized on a very small contig, precluding proper analysis of the gene neighbourhood. Methods: The genome of MVP1 was sequenced on the Nanopore MinION to obtain long reads and improve genome contiguity. De novo genome assembly was performed using a long-read-only assembler followed by genome polishing, as well as a hybrid assembler. Results: Long-read assembly produced three complete circular MVP1 contigs: chromosome 1, chromosome 2 and the pVa plasmid encoding the pirABvp binary toxin genes. Polishing of the long-read assembly with Illumina short reads was necessary to remove indel errors. Complete assembly of the pVa plasmid could not be achieved using Illumina reads because identical repetitive elements flanking the binary toxin genes led to multiple contigs. These regions were fully spanned by the Nanopore long reads, resulting in a single contig. Alignment of Illumina reads to the complete genome assembly indicated a sequencing bias, as read depth was lowest in low-GC genomic regions. Comparative genomic analysis revealed a gene cluster coding for additional insecticidal toxins in chromosome 2 of MVP1 that may further contribute to host pathogenesis, pending functional validation. Scanning of publicly available V. parahaemolyticus genomes revealed the presence of a single AinS-family quorum-sensing system that can be targeted for future microbial management. Conclusions: We generated the first chromosome-scale genome assembly of a Malaysian pirABVp-bearing V. parahaemolyticus isolate. Structural variations identified from comparative genomic analysis provide new insights into the genomic features of V. parahaemolyticus MVP1 that may be associated with host colonization and pathogenicity.

2019 ◽  
Author(s):  
Han Ming Gan ◽  
Christopher M. Austin

Background: Vibrio parahaemolyticus MVP1 was isolated from a Malaysian aquaculture farm affected by shrimp acute hepatopancreatic necrosis disease (AHPND). Its genome was previously sequenced on the Illumina MiSeq platform and assembled de novo, producing a relatively fragmented assembly. Although the binary toxin genes linked to AHPND were identified in the MVP1 draft genome, they were localized on a very small contig, precluding proper analysis of the gene neighbourhood. Methods: The genome of Vibrio parahaemolyticus MVP1 was sequenced on the Nanopore MinION device to obtain long reads that can span longer repeats and improve genome contiguity. De novo genome assembly was subsequently performed using a long-read-only assembler (Flye) followed by genome polishing, as well as a hybrid assembler (Unicycler). Results: Long-read-only assembly produced three complete circular MVP1 contigs consisting of chromosome 1, chromosome 2 and the pVa plasmid that encodes the pirABvp binary toxin genes. Polishing of the long-read assembly with Illumina short reads was necessary to remove indel errors. The complete assembly of the pVa plasmid could not be achieved using Illumina reads because identical repetitive elements flanking the binary toxin genes led to multiple contigs, whereas these regions were fully spanned by the Nanopore long reads, resulting in a single contig. In addition, alignment of Illumina reads to the complete genome assembly indicated a sequencing bias, as read depth was lowest in low-GC genomic regions. Comparative genomic analysis revealed the presence of a gene cluster coding for additional insecticidal toxins in chromosome 2 of MVP1 that may further contribute to host pathogenesis, pending functional validation. Scanning of all publicly available V. parahaemolyticus genomes revealed the presence of a single AinS-family quorum-sensing system in this species that can be targeted for future microbial management. Conclusions: We generated the first chromosome-scale genome assembly of a Malaysian pirABVp-bearing V. parahaemolyticus isolate. Structural variations identified from comparative genomic analysis provide new insights into the genomic features of V. parahaemolyticus MVP1 that may be associated with host colonization and pathogenicity.
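The observation that read depth was lowest in low-GC regions presupposes a per-window GC profile of the assembly. The following is a minimal sketch of how such a profile could be computed, not the authors' method; the function names and the non-overlapping window scheme are assumptions made for illustration. The resulting values could then be compared with per-window read depth obtained from an alignment.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases; None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window, step=None):
    """Return (start, gc_fraction) for each window along seq.

    `window` and `step` are illustrative parameters; by default the
    windows do not overlap. A sequence shorter than one window is
    reported as a single window.
    """
    step = step or window
    last_start = max(len(seq) - window + 1, 1)
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, last_start, step)
    ]


if __name__ == "__main__":
    # Toy sequence: a GC-rich half followed by an AT-rich half.
    for start, gc in windowed_gc("GGCCAATT", window=4):
        print(start, gc)
```

Windows with low GC fractions would be the ones to inspect for reduced Illumina depth.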


2019 ◽  
Author(s):  
Nabil Girollet ◽  
Bernadette Rubio ◽  
Pierre-François Bert

Grapevine is one of the most important fruit species in the world. To better understand the genetic basis of trait variation and to facilitate the breeding of new genotypes, we sequenced, assembled, and annotated the genome of the American native Vitis riparia, one of the main species used worldwide for rootstock and scion breeding. A total of 164 Gb of raw DNA reads were obtained from Vitis riparia, resulting in a 225X depth of coverage. We generated a de novo genome assembly of V. riparia using PacBio long reads, which was phased with 10x Genomics Chromium linked reads. At the chromosome level, a 500 Mb genome was generated with a scaffold N50 size of 1 Mb. More than 34% of the whole genome was identified as repeat sequences, and 37,207 protein-coding genes were predicted. This genome assembly sets the stage for comparative genomic analysis of the diversification and adaptation of grapevine and will provide a solid resource for further genetic analysis and breeding of this economically important species.
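The scaffold N50 reported above is a contiguity statistic: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. As a minimal sketch of the standard definition (the function name and the toy lengths are illustrative, not taken from the paper):

```python
def n50(lengths):
    """Return the N50 of an iterable of sequence lengths.

    Lengths are sorted from longest to shortest and accumulated until
    the running total reaches half of the total assembly size; the
    length at which that happens is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy scaffold lengths: total is 20, so half is 10.
    # Accumulating 6 + 5 = 11 reaches the halfway point at length 5.
    print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why reference-aware variants such as NGA50 are also used.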


2021 ◽  
Author(s):  
Xinxin Yi ◽  
Jing Liu ◽  
Shengcai Chen ◽  
Hao Wu ◽  
Min Liu ◽  
...  

Cultivated soybean (Glycine max) is an important source of protein and oil. Many elite cultivars with different traits have been developed for different conditions. Each soybean strain has its own genetic diversity, and the availability of more high-quality soybean genomes can enhance comparative genomic analysis for identifying the genetic underpinnings of its unique traits. In this study, we constructed a high-quality de novo assembly of the elite soybean cultivar Jidou 17 (JD17) with chromosome contiguity and high accuracy. We annotated 52,840 gene models and reconstructed 74,054 high-quality full-length transcripts. We performed a genome-wide comparative analysis of the JD17 reference genome against three published soybean genomes (WM82, ZH13 and W05), which identified five large inversions and two large translocations specific to JD17, 20,984 - 46,912 PAVs spanning 13.1 - 46.9 Mb in size, and 5 - 53 large PAV clusters larger than 500 kb. Between JD17 and these three genomes, 1,695,741 - 3,664,629 SNPs and 446,689 - 800,489 indels were identified and annotated. Symbiotic nitrogen fixation (SNF) genes were identified and the effects of these variants were further evaluated. It was found that the coding sequences of 9 nitrogen fixation-related genes were greatly affected. The high-quality genome assembly of JD17 can serve as a valuable reference for soybean functional genomics research.


2021 ◽  
Author(s):  
Lauren Coombe ◽  
Janet X Li ◽  
Theodora Lo ◽  
Johnathan Wong ◽  
Vladimir Nikolic ◽  
...  

Background: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results: LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which include initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 2.0-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently runs in under five hours using less than 23 GB of RAM. Conclusions: Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.


Author(s):  
Robert Vaser ◽  
Mile Šikić

We present new methods for the improvement of long-read de novo genome assembly incorporated into a straightforward tool called Raven (https://github.com/lbcb-sci/raven). Compared with other assemblers, Raven is one of the two fastest, reconstructs the sequenced genome in the fewest fragments, has better or comparable accuracy, and maintains similar performance across various genomes. Raven takes 500 CPU hours to assemble a 44x human genome dataset into only 259 fragments.


2020 ◽  
Author(s):  
Mohamed Awad ◽  
Xiangchao Gan

High-quality genome assembly has wide applications in genetics and medical studies. However, it is still very challenging to achieve gap-free chromosome-scale assemblies using current workflows for long-read platforms. Here we propose GALA (Gap-free long-read assembler), a chromosome-by-chromosome assembly method implemented through a multi-layer computer graph that identifies mis-assemblies within preliminary assemblies or chimeric raw reads and partitions the data into chromosome-scale linkage groups. The subsequent independent assembly of each linkage group generates a gap-free assembly free from the mis-assembly errors which usually hamper existing workflows. This flexible framework also allows us to integrate data from various technologies, such as Hi-C, genetic maps, a reference genome and even motif analyses, to generate gap-free chromosome-scale assemblies. We de novo assembled the C. elegans and A. thaliana genomes using combined PacBio and Nanopore sequencing data from publicly available datasets. We also demonstrated the new method’s applicability with a gap-free assembly of a human genome with the help of a reference genome. In addition, GALA showed promising performance for PacBio high-fidelity long reads. Thus, our method enables straightforward assembly of genomes with multiple data sources and overcomes barriers that at present restrict the application of de novo genome assembly technology.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lauren Coombe ◽  
Janet X. Li ◽  
Theodora Lo ◽  
Johnathan Wong ◽  
Vladimir Nikolic ◽  
...  

Background: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results: LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which include initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. Conclusions: Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
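The abstract notes that ntLink joins contigs using lightweight minimizer mappings. As a rough illustration of the general minimizer idea only, and not of ntLink's actual implementation, the sketch below picks the lexicographically smallest k-mer in each window of w consecutive k-mers. The function names, the lexicographic ordering, and the toy sequences are assumptions; production tools typically order k-mers by hash value and handle both strands.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (leftmost on ties). Adjacent windows
    often pick the same k-mer, so the result is a sparse sample.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first minimal index, i.e. the leftmost tie.
        offset = min(range(w), key=lambda j: window[j])
        chosen.add((start + offset, window[offset]))
    return sorted(chosen)


def shared_minimizers(a, b, k, w):
    """Minimizer k-mers present in both sequences (a crude mapping signal)."""
    kmers_a = {kmer for _, kmer in minimizers(a, k, w)}
    kmers_b = {kmer for _, kmer in minimizers(b, k, w)}
    return kmers_a & kmers_b


if __name__ == "__main__":
    contig = "ACGTAC"   # toy contig
    read = "GTACGG"     # toy long read overlapping the contig
    print(minimizers(contig, k=2, w=3))
    print(shared_minimizers(contig, read, k=2, w=3))
```

A long read that shares minimizers with two different contigs is evidence that the contigs are adjacent, which is the kind of signal a scaffolder can use to order and orient them.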


2021 ◽  
Vol 12 ◽  
Author(s):  
Sigmund Ramberg ◽  
Bjørn Høyheim ◽  
Tone-Kari Knutsdatter Østbye ◽  
Rune Andreassen

Atlantic salmon (Salmo salar) is a major species produced in world aquaculture and an important vertebrate model organism for studying the process of rediploidization following whole genome duplication events (Ss4R, 80 mya). The current Salmo salar transcriptome is largely generated from genome sequence based in silico predictions supported by ESTs and short-read sequencing data. However, recent progress in long-read sequencing technologies now allows for full-length transcript sequencing from single RNA-molecules. This study provides a de novo full-length mRNA transcriptome from liver, head-kidney and gill materials. A pipeline was developed based on Iso-seq sequencing of long-reads on the PacBio platform (HQ reads) followed by error-correction of the HQ reads by short-reads from the Illumina platform. The pipeline successfully processed more than 1.5 million long-reads and more than 900 million short-reads into error-corrected HQ reads. A surprisingly high percentage (32%) represented expressed interspersed repeats, while the remaining were processed into 71 461 full-length mRNAs from 23 071 loci. Each transcript was supported by several single-molecule long-read sequences and at least three short-reads, assuring a high sequence accuracy. On average, each gene was represented by three isoforms. Comparisons to the current Atlantic salmon transcripts in the RefSeq database showed that the long-read transcriptome validated 25% of all known transcripts, while the remaining full-length transcripts were novel isoforms, but few were transcripts from novel genes. A comparison to the current genome assembly indicates that the long-read transcriptome may aid in improving transcript annotation as well as provide long-read linkage information useful for improving the genome assembly. 
More than 80% of transcripts were assigned GO terms, and thousands of transcripts were from genes or splice variants expressed in an organ-specific manner, demonstrating that hybrid error-corrected long-read transcriptomes may be applied to study genes and splice variants expressed in certain organs or conditions (e.g., challenge materials). In conclusion, this is the single largest contribution of full-length mRNAs in Atlantic salmon. The results will be of great value to salmon genomics research, and the pipeline outlined may be applied to generate additional de novo transcriptomes in Atlantic salmon or in similar projects in other species.


2021 ◽  
Author(s):  
Xiao Luo ◽  
Xiongbin Kang ◽  
Alexander Schoenhuth

Haplotype-aware diploid genome assembly is crucial in genomics, precision medicine, and many other disciplines. Long-read sequencing technologies have greatly improved genome assembly thanks to advantages of read length. However, current long-read assemblers usually introduce disturbing biases or fail to capture the haplotype diversity of the diploid genome. Here, we present phasebook, a novel approach for reconstructing the haplotypes of diploid genomes from long reads de novo. Benchmarking experiments demonstrate that our method outperforms other approaches in terms of haplotype coverage by large margins, while preserving competitive performance or even achieving advantages in terms of all other aspects relevant for genome assembly.


Author(s):  
Guangtu Gao ◽  
Susana Magadan ◽  
Geoffrey C Waldbieser ◽  
Ramey C Youngblood ◽  
Paul A Wheeler ◽  
...  

There is still a need to improve the contiguity of the rainbow trout reference genome and to use multiple genetic backgrounds that will represent the genetic diversity of this species. The Arlee doubled haploid line originated from a domesticated hatchery strain that was originally collected from the northern California coast. The Canu pipeline was used to generate a de novo genome assembly of the Arlee line from high-coverage PacBio long-read sequence data. The assembly was further improved with Bionano optical maps and Hi-C proximity ligation sequence data to generate 32 major scaffolds corresponding to the karyotype of the Arlee line (2N = 64). The assembly is composed of 938 scaffolds with an N50 of 39.16 Mb and a total length of 2.33 Gb, of which ∼95% was in 32 chromosome sequences with only 438 gaps between contigs and scaffolds. In rainbow trout the haploid chromosome number can vary from 29 to 32. In the Arlee karyotype the haploid chromosome number is 32 because chromosomes Omy04, 14 and 25 are divided into six acrocentric chromosomes. Additional structural variations identified in the Arlee genome included the major inversions on chromosomes Omy05 and Omy20 and 15 additional smaller inversions that will require further validation. This is also the first rainbow trout genome assembly that includes a scaffold with the sex-determination gene (sdY) in the chromosome Y sequence. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes on chromosomes Omy12 and Omy13.

