W2RAP: a pipeline for high quality, robust assemblies of large complex genomes from short read data

Abstract: Producing high-quality whole-genome shotgun de novo assemblies from plant and animal species with large and complex genomes using low-cost short-read sequencing technologies remains a challenge. But when the right sequencing data, with appropriate quality control, are assembled using approaches focused on robustness of the process rather than maximization of a single metric such as the usual contiguity estimators, good-quality assemblies with informative value for comparative analyses can be produced. Here we present a complete method, described from data generation and quality control (QC) through to scaffolding of complex genomes using Illumina short reads, and its application to plant and human datasets. We show how to use the w2rap pipeline following a metric-guided approach to produce cost-effective assemblies. The assemblies are highly accurate, provide good coverage of the genome and show good short-range contiguity. Our pipeline has already enabled the rapid, cost-effective generation of de novo genome assemblies from large, polyploid crop species with a focus on comparative genomics. Availability: w2rap is available under the MIT license, with some subcomponents under GPL licenses. A ready-to-run Docker image with all software prerequisites and example data is also available. http://github.com/bioinfologics/w2rap http://github.com/bioinfologics/w2rap-contigger

Harmonization of whole-genome sequencing for outbreak surveillance of Enterobacteriaceae and Enterococci

Microbial Genomics ◽ 10.1099/mgen.0.000567 ◽ 2021 ◽ Vol 7 (7)

Author(s): Casper Jamin ◽ Sien De Koster ◽ Stefanie van Koeveringe ◽ Dieter De Coninck ◽ Klaas Mensaert ◽ ...

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ Type Species ◽ De Novo ◽ Whole Genome ◽ Data Generation ◽ Sequencing Data ◽ Content Type ◽ Link Type ◽ Antimicrobial Resistance Genes

Whole-genome sequencing (WGS) is becoming the de facto standard for bacterial typing and outbreak surveillance of resistant bacterial pathogens. However, interoperability for WGS of bacterial outbreaks is poorly understood. We hypothesized that harmonization of WGS for outbreak surveillance is achievable through the use of identical protocols for both data generation and data analysis. A set of 30 bacterial isolates, comprising various species belonging to the family Enterobacteriaceae and the genus Enterococcus, was selected and sequenced using the same protocol on the Illumina MiSeq platform in each individual centre. All generated sequencing data were analysed by one centre using BioNumerics (6.7.3) for (i) genotyping of origins of replication and antimicrobial resistance genes, (ii) core-genome multi-locus sequence typing (cgMLST) for Escherichia coli and Klebsiella pneumoniae and whole-genome multi-locus sequence typing (wgMLST) for all species. Additionally, a split k-mer analysis was performed to determine the number of SNPs between samples. A precision of 99.0% and an accuracy of 99.2% were achieved for genotyping. Based on cgMLST, a discrepant allele was called in only 2/27 and 3/15 comparisons between two genomes, for E. coli and K. pneumoniae, respectively. Based on wgMLST, the number of discrepant alleles ranged from 0 to 7 (average 1.6). For SNPs, this ranged from 0 to 11 SNPs (average 3.4). Furthermore, we demonstrate that using different de novo assemblers to analyse the same dataset introduces up to 150 SNPs, which surpasses most thresholds for bacterial outbreaks. This shows the importance of harmonizing data processing in the surveillance of bacterial outbreaks. In summary, multi-centre WGS for bacterial surveillance is achievable, but only if protocols are harmonized.
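
The count of discrepant alleles between two cgMLST or wgMLST profiles, as reported above, can be illustrated with a minimal Python sketch. It is not the BioNumerics procedure used in the study: the function name, the locus names and the rule of skipping loci that are missing in either profile are all assumptions made for illustration.

```python
from typing import Dict, Optional


def allele_differences(profile_a: Dict[str, Optional[int]],
                       profile_b: Dict[str, Optional[int]]) -> int:
    """Count loci whose allele numbers differ between two MLST profiles.

    Loci that are absent or uncalled (None) in either profile are skipped.
    This is an assumed convention, not one taken from the study.
    """
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(
        1
        for locus in shared_loci
        if profile_a[locus] is not None
        and profile_b[locus] is not None
        and profile_a[locus] != profile_b[locus]
    )


# Hypothetical profiles of the same isolate sequenced at two centres.
centre_1 = {"locus_1": 3, "locus_2": 12, "locus_3": 7, "locus_4": None}
centre_2 = {"locus_1": 3, "locus_2": 14, "locus_3": 7, "locus_4": 5}

# locus_2 differs; locus_4 is uncalled at centre 1 and is therefore ignored.
print(allele_differences(centre_1, centre_2))  # prints 1
```

How missing loci are handled changes the distance, so the rule should be fixed before results from different centres are compared.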

Ultra-low input single tube linked-read library method enables short-read NGS systems to generate highly accurate and economical long-range sequencing information for de novo genome assembly and haplotype phasing

10.1101/852947 ◽ 2019 ◽ Cited By ~ 3

Author(s): Zhoutao Chen ◽ Long Pham ◽ Tsai-Chin Wu ◽ Guoya Mo ◽ Yu Xia ◽ ...

Keyword(s): Long Range ◽ De Novo ◽ Low Cost ◽ Cost Effective ◽ De Novo Genome Assembly ◽ Short Read ◽ Single Tube ◽ Haplotype Phasing ◽ A Genome ◽ Long Read

Abstract: Long-range sequencing information is required for haplotype phasing, de novo assembly and structural variation detection. Current long-read sequencing technologies can provide valuable long-range information, but at a high cost, with low accuracy and a high DNA input requirement. We have developed a single-tube Transposase Enzyme Linked Long-read Sequencing (TELL-Seq™) technology, which enables a low-cost, high-accuracy and high-throughput short-read next-generation sequencer to routinely generate over 100 kb of long-range sequencing information with as little as 0.1 ng of input material. In a PCR tube, millions of clonally barcoded beads are used to uniquely barcode long DNA molecules in an open bulk reaction without dilution or compartmentation. The barcode-linked reads are used to successfully assemble genomes ranging from microbes to human. These linked reads also generate megabase-long phased blocks and provide a cost-effective tool for detecting structural variants in a genome, which are important for identifying compound heterozygosity in recessive Mendelian diseases and for discovering genetic drivers and diagnostic biomarkers in cancers.

Efficient long single molecule sequencing for cost effective and accurate sequencing, haplotyping, and de novo assembly

10.1101/324392 ◽ 2018

Author(s): Ou Wang ◽ Robert Chin ◽ Xiaofang Cheng ◽ Michelle Ka Wu ◽ Qing Mao ◽ ...

Keyword(s): Single Molecule ◽ Genome Assembly ◽ De Novo ◽ Low Cost ◽ Variant Calling ◽ Cost Effective ◽ High Quality ◽ Single Molecule Sequencing ◽ Single Tube ◽ Complex Structural

Obtaining accurate sequences from long DNA molecules is very important for genome assembly and other applications. Here we describe single tube long fragment read (stLFR), a technology that enables this at low cost. It is based on adding the same barcode sequence to sub-fragments of the original long DNA molecule (DNA co-barcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique co-barcoding of over 8 million 20-300 kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high-quality variant calling and phasing into contigs with an N50 of up to 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole-genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high-quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.

PgRC: Pseudogenome based Read Compressor

10.1101/710822 ◽ 2019

Author(s): Tomasz Kowalski ◽ Szymon Grabowski

Keyword(s): High Throughput ◽ Compression Ratio ◽ High Throughput Sequencing ◽ Sequencing Data ◽ High Quality ◽ Link Type ◽ Sequencing Technologies ◽ Significant Interest ◽ The One ◽ Shortest Common Superstring

Abstract: Motivation: The amount of sequencing data from High-Throughput Sequencing technologies grows at a pace exceeding the one predicted by Moore’s law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources. Results: We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 18 and 21 percent on average, respectively, while being at least comparably fast in decompression. Availability: PgRC can be downloaded from https://github.com/kowallus/[email protected]

Dual indexed design of in-Drop single-cell RNA-seq libraries improves sequencing quality and throughput

10.1101/835488 ◽ 2019

Author(s): Austin N. Southard Smith ◽ Alan J. Simmons ◽ Bob Chen ◽ Angela L. Jones ◽ Marisol A. Ramirez Solano ◽ ...

Keyword(s): Single Cell ◽ High Throughput ◽ Cost Effective ◽ Quality Data ◽ Sequencing Data ◽ High Quality ◽ High Data ◽ Sequencing Technologies ◽ Effective Manner ◽ Sequencing Quality

Abstract: The increasing demand for single-cell RNA-sequencing (scRNA-seq) experiments, in terms of both the number of experiments and the number of cells queried per experiment, necessitates higher sequencing depth coupled with high data quality. New high-throughput sequencers, such as the Illumina NovaSeq 6000, enable this demand to be met in a cost-effective manner. However, current scRNA-seq library designs present compatibility challenges with newer sequencing technologies, such as index-hopping, and their ability to generate high-quality data has yet to be systematically evaluated. Here, we engineered a new dual-indexed library structure, called TruDrop, on top of the inDrop scRNA-seq platform to solve these compatibility challenges, such that TruDrop libraries and standard Illumina libraries can be sequenced alongside each other on the NovaSeq. We overcame the index-hopping issue, demonstrated significant improvements in base-calling accuracy, and provided an example of multiplexing twenty-four scRNA-seq libraries simultaneously. We showed favorable comparisons in transcriptional diversity of TruDrop compared with prior library structures. Our approach enables cost-effective, high-throughput generation of high-quality sequencing data, which should enable more routine use of scRNA-seq technologies.

WENGAN: Efficient and high quality hybrid de novo assembly of human genomes

10.1101/840447 ◽ 2019 ◽ Cited By ~ 1

Author(s): Alex Di Genova ◽ Elena Buena-Atienza ◽ Stephan Ossowski ◽ Marie-France Sagot

Keyword(s): De Novo ◽ Computational Cost ◽ Sequence Information ◽ Sequencing Data ◽ High Quality ◽ Sequencing Technologies ◽ Human Genomes ◽ Long Reads ◽ Long Read ◽ Genome Assemblies

The continuous improvement of long-read sequencing technologies along with the development of ad hoc algorithms has launched a new de novo assembly era that promises high-quality genomes. However, it has proven difficult to use only long reads to generate accurate genome assemblies of large, repeat-rich human genomes. To date, most of the human genomes assembled from long error-prone reads add accurate short reads to further polish the consensus quality. Here, we report the development of a novel algorithm for hybrid assembly, WENGAN, and the de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms that exploit the sequence information of short and long reads to tackle assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 16.67-62.06 Mb), few assembly errors (contig NGA50: 10.9-45.91 Mb), good consensus quality (QV: 27.79-33.61), and high gene completeness (BUSCO complete: 94.6-95.1%), while consuming low computational resources (CPU hours: 153-1027). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 62.06 Mb (NGA50: 45.91 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb). Providing the highest quality at low computational cost, WENGAN is an important step towards the democratization of the de novo assembly of human genomes. The WENGAN assembler is available at https://github.com/adigenova/wengan
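
To put the consensus quality values above in context, the following Python sketch converts between a quality value and a per-base error probability. It assumes the reported QV is Phred-scaled (QV = -10 log10 of the error probability), which the abstract does not state explicitly; the function names are illustrative.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10 * math.log10(error_rate)


# The QV range reported in the abstract, read as Phred-scaled values.
for qv in (27.79, 33.61):
    rate = qv_to_error_rate(qv)
    print(f"QV {qv}: about one error per {1 / rate:,.0f} bases")
```

On this scale, QV 30 corresponds to one expected error per 1,000 bases, and each additional 10 units reduces the error rate tenfold.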

Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information

GigaScience ◽ 10.1093/gigascience/giz125 ◽ 2019 ◽ Vol 8 (12) ◽ Cited By ~ 6

Author(s): Hui-Su Kim ◽ Sungwon Jeon ◽ Changjae Kim ◽ Yeon Kyung Kim ◽ Yun Sung Cho ◽ ...

Keyword(s): Human Genome ◽ Single Molecule ◽ Genome Assembly ◽ Reference Genome ◽ De Novo ◽ Low Cost ◽ Cost Effective ◽ Sequencing Data ◽ SMRT Sequencing ◽ Human Genome Assembly

Abstract: Background: Long DNA reads produced by single-molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short-read DNA fragments. For de novo assembly, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the favorite options. However, PacBio's SMRT sequencing is expensive for a full human genome assembly and costs more than US$40,000 for 30× coverage as of 2019. ONT PromethION sequencing, on the other hand, is 1/12 the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio's SMRT sequencing in relation to the quality. Findings: We performed whole-genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64× coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with an N50 of 16.7 Mb and a total genome length of 2.8 Gb. It was comparable to a KOREF assembly constructed using PacBio at 62× coverage (188 Gb, 2,695 contigs, and an N50 of 17.9 Mb). When we applied Hi-C–derived long-range mapping data, an even higher quality assembly for the 64× coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mb. Conclusion: The pore-based PromethION approach provided a high-quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and was more cost-effective than PacBio at comparable quality measurements.
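
The N50 values quoted above follow the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal Python sketch of that definition is given here; the contig lengths are invented for illustration and are not from the KOREF assemblies.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    sorted_lengths = sorted(lengths, reverse=True)
    total = sum(sorted_lengths)
    if total <= 0:
        raise ValueError("at least one contig with positive length is required")
    running = 0
    for length in sorted_lengths:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for a positive total length")


# Hypothetical contig lengths (arbitrary units); the total is 300.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # prints 70, because 80 + 70 = 150 reaches half of 300

# Simple arithmetic on the reported figures: 193 Gb of reads at 64x coverage
# implies a genome size of about 3.0 Gb used for the coverage estimate.
implied_genome_size = 193e9 / 64
print(f"{implied_genome_size / 1e9:.2f} Gb")
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why metrics such as NGA50 or consensus QV are usually reported alongside it.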

Comparison of long-read methods for sequencing and assembly of a plant genome

GigaScience ◽ 10.1093/gigascience/giaa146 ◽ 2020 ◽ Vol 9 (12)

Author(s): Valentine Murigneux ◽ Subash Kumar Rai ◽ Agnelo Furtado ◽ Timothy J C Bruxner ◽ Wei Tian ◽ ...

Keyword(s): De Novo ◽ Cost Effective ◽ Genome Project ◽ Plant Genome ◽ Sequencing Data ◽ Pacific Biosciences ◽ Sequencing Technologies ◽ Oxford Nanopore ◽ Long Read ◽ The Cost

Abstract: Background: Sequencing technologies have advanced to the point where it is possible to generate high-accuracy, haplotype-resolved, chromosome-scale assemblies. Several long-read sequencing technologies are available, and a growing number of algorithms have been developed to assemble the reads generated by those technologies. When starting a new genome project, it is therefore challenging to select the most cost-effective sequencing technology, as well as the most appropriate software for assembly and polishing. It is thus important to benchmark different approaches applied to the same sample. Results: Here, we report a comparison of 3 long-read sequencing technologies applied to the de novo assembly of a plant genome, Macadamia jansenii. We have generated sequencing data using Pacific Biosciences (Sequel I), Oxford Nanopore Technologies (PromethION), and BGI (single-tube Long Fragment Read) technologies for the same sample. Several assemblers were benchmarked in the assembly of Pacific Biosciences and Nanopore reads. Results obtained from combining long-read technologies or short-read and long-read technologies are also presented. The assemblies were compared for contiguity, base accuracy, and completeness, as well as sequencing costs and DNA material requirements. Conclusions: The 3 long-read technologies produced highly contiguous and complete genome assemblies of M. jansenii. At the time of sequencing, the cost associated with each method was significantly different, but continuous improvements in technologies have resulted in greater accuracy, increased throughput, and reduced costs. We propose updating this comparison regularly with reports on significant iterations of the sequencing technologies.

Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information

10.1101/674804 ◽ 2019 ◽ Cited By ~ 2

Author(s): Hui-Su Kim ◽ Sungwon Jeon ◽ Changjae Kim ◽ Yeon Kyung Kim ◽ Yun Sung Cho ◽ ...

Keyword(s): Human Genome ◽ Single Molecule ◽ Genome Assembly ◽ Reference Genome ◽ De Novo ◽ Low Cost ◽ Cost Effective ◽ Sequencing Data ◽ SMRT Sequencing ◽ Human Genome Assembly

Abstract: Background: Long DNA reads produced by single-molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short-read DNA fragments. For de novo assembly, PacBio and Oxford Nanopore Technologies (ONT) are the favorite options. However, PacBio’s SMRT sequencing is expensive for a full human genome assembly and costs over 40,000 USD for 30x coverage as of 2019. ONT PromethION sequencing, on the other hand, is one-twelfth the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio’s SMRT sequencing in relation to the quality. Findings: We performed whole-genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64x coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with an N50 of 16.7 Mbp and a total genome length of 2.8 Gbp. It was comparable to a KOREF assembly constructed using PacBio at 62x coverage (188 Gbp, 2,695 contigs and an N50 of 17.9 Mbp). When we applied Hi-C-derived long-range mapping data, an even higher quality assembly for the 64x coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mbp. Conclusion: The pore-based PromethION approach provides a good-quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and is more cost-effective than PacBio at comparable quality measurements.

Chloroplast Genomes of Two Species of Cypripedium: Expanded Genome Size and Proliferation of AT-Biased Repeat Sequences

Frontiers in Plant Science ◽ 10.3389/fpls.2021.609729 ◽ 2021 ◽ Vol 12

Author(s): Yan-Yan Guo ◽ Jia-Xing Yang ◽ Hong-Kun Li ◽ Hu-Sheng Zhao

Keyword(s): Genome Size ◽ De Novo ◽ GC Content ◽ Single Copy ◽ Sequencing Data ◽ Short Read ◽ Coding Regions ◽ Repeat Sequences ◽ Sequencing Technologies ◽ Chloroplast Genomes

The size of the chloroplast genome (plastome) of autotrophic angiosperms is generally conserved. However, the chloroplast genomes of some lineages are greatly expanded, which may render assembling these genomes from short-read sequencing data more challenging. Here, we present the sequencing, assembly, and annotation of the chloroplast genomes of Cypripedium tibeticum and Cypripedium subtropicum. We de novo assembled the chloroplast genomes of the two species with a combination of short-read Illumina data and long-read PacBio data. The plastomes of the two species are characterized by expanded genome size, proliferated AT-rich repeat sequences, low GC content and gene density, as well as low substitution rates of the coding genes. The plastomes of C. tibeticum (197,815 bp) and C. subtropicum (212,668 bp) are substantially larger than those of the three species sequenced in previous studies. The plastome of C. subtropicum is the longest reported in Orchidaceae to date. Despite the increase in genome size, the gene order and gene number of the plastomes are conserved, with the exception of a ∼75 kb inversion in the large single copy (LSC) region shared by the two species. Most striking is the record-setting low GC content in C. subtropicum (28.2%). Moreover, the plastome expansion of the two species is strongly correlated with the proliferation of AT-biased non-coding regions: the non-coding content of C. subtropicum is in excess of 57%. The genus provides a typical example of plastome expansion induced by the expansion of non-coding regions. Considering the pros and cons of different sequencing technologies, we recommend hybrid assembly based on long and short reads for the sequencing of plastomes with AT-biased base composition.
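
GC content, the statistic highlighted above, is the percentage of G and C bases among the unambiguous bases of a sequence. The short Python sketch below shows the calculation; the function name and the choice to ignore ambiguous bases such as N are assumptions, and the example sequence is invented rather than taken from a Cypripedium plastome.

```python
def gc_content(sequence: str) -> float:
    """Return the GC content of a DNA sequence as a percentage.

    Only unambiguous bases (A, C, G, T) are counted; anything else,
    such as N, is ignored. This handling is an assumed convention.
    """
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc_count = sum(1 for base in bases if base in "GC")
    return 100.0 * gc_count / len(bases)


# Invented AT-rich fragment: 2 of its 10 bases are G or C.
print(gc_content("ATATATGCAT"))  # prints 20.0
```

Applied to a complete plastome sequence read from a FASTA file, the same function would give a genome-wide figure comparable to the 28.2% reported for C. subtropicum.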
