High-coverage, long-read sequencing of Han Chinese trio reference samples

Abstract Single-molecule long-read sequencing datasets were generated for a son-father-mother trio of Han Chinese descent that is part of the Genome in a Bottle (GIAB) consortium portfolio. The dataset was generated using the Pacific Biosciences Sequel System. The son was sequenced to an average coverage of 60× and each parent to 30×, with N50 subread lengths between 16 and 18 kb. Raw reads and reads aligned to both GRCh37 and GRCh38 are available at the NCBI GIAB FTP site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/), and the raw read data are archived in NCBI SRA (SRX4739017, SRX4739121, and SRX4739122). This dataset is available for anyone to develop and evaluate long-read bioinformatics methods.
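
For readers unfamiliar with the N50 statistic quoted in the abstract, the following is a minimal sketch of one common definition: the length L such that reads of length L or longer contain at least half of all sequenced bases. The function name `n50` and the toy read lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths.

    Walk the lengths from longest to shortest and return the first
    length at which the running total reaches half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical toy read lengths, not real subread data.
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)
```

With the toy input, the total is 20 bases; the two longest reads (6 and 5) already cover 11 of them, so the N50 is 5.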

Scientific Data ◽

10.1038/s41597-019-0098-2 ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 2

Author(s):

Ying-Chih Wang ◽

Nathan D. Olson ◽

Gintaras Deikus ◽

Hardik Shah ◽

Aaron M. Wenger ◽

...

Keyword(s):

Han Chinese ◽

High Coverage ◽

Long Read ◽

Reference Samples

Erratum: Corrigendum: Long-read sequencing of the human cytomegalovirus transcriptome with the Pacific Biosciences RSII platform

Scientific Data ◽

10.1038/sdata.2018.32 ◽

2018 ◽

Vol 5 (1) ◽

Author(s):

Zsolt Balázs ◽

Dóra Tombácz ◽

Attila Szűcs ◽

Michael Snyder ◽

Zsolt Boldogkői

Keyword(s):

Human Cytomegalovirus ◽

Pacific Biosciences ◽

The Pacific ◽

Long Read

Microsatellite marker discovery using single molecule real-time circular consensus sequencing on the Pacific Biosciences RS

BioTechniques ◽

10.2144/000114104 ◽

2013 ◽

Vol 55 (5) ◽

Cited By ~ 15

Author(s):

Markus A. Grohme ◽

Roberto Frias Soler ◽

Michael Wink ◽

Marcus Frohme

Keyword(s):

Real Time ◽

Microsatellite Marker ◽

Single Molecule ◽

Pacific Biosciences ◽

The Pacific ◽

Marker Discovery ◽

Circular Consensus Sequencing

Preparation of next-generation DNA sequencing libraries from ultra-low amounts of input DNA: Application to single-molecule, real-time (SMRT) sequencing on the Pacific Biosciences RS II.

10.1101/003566 ◽

2014 ◽

Cited By ~ 4

Author(s):

Castle Raley ◽

David Munroe ◽

Kristie Jones ◽

Yu-Chih Tsai ◽

Yan Guo ◽

...

Keyword(s):

DNA Sequencing ◽

Single Molecule ◽

SMRT Sequencing ◽

High Input ◽

Single Molecule Sequencing ◽

Pacific Biosciences ◽

Preparation Conditions ◽

PacBio RS II ◽

The Pacific ◽

The Common

We have developed and validated an amplification-free method for generating DNA sequencing libraries from very low amounts of input DNA (500 picograms to 20 nanograms) for single-molecule sequencing on the Pacific Biosciences (PacBio) RS II sequencer. The common challenge of high input requirements for single-molecule sequencing is overcome by using a carrier DNA in conjunction with optimized sequencing preparation conditions and re-use of the MagBead-bound complex. Here we describe how this method can be used to produce sequencing yields comparable to those generated from standard input amounts while using 1000-fold less starting material.

A chromosome-level genome assembly for the Pacific oyster Crassostrea gigas

GigaScience ◽

10.1093/gigascience/giab020 ◽

2021 ◽

Vol 10 (3) ◽

Cited By ~ 1

Author(s):

Carolina Peñaloza ◽

Alejandro P Gutierrez ◽

Lél Eöry ◽

Shan Wang ◽

Ximing Guo ◽

...

Keyword(s):

Crassostrea Gigas ◽

Pacific Oyster ◽

Sequence Data ◽

High Coverage ◽

Protein Coding ◽

Rolling Circle ◽

Repeat Elements ◽

Pacific Biosciences ◽

The Pacific ◽

Chromosome Level

Abstract Background: The Pacific oyster (Crassostrea gigas) is a bivalve mollusc with vital roles in coastal ecosystems and aquaculture globally. While extensive genomic tools are available for C. gigas, highly contiguous reference genomes are required to support both fundamental and applied research. Herein we report the creation and annotation of a chromosome-level assembly for C. gigas. Findings: High-coverage long- and short-read sequence data generated on Pacific Biosciences and Illumina platforms were used to generate an initial assembly, which was then scaffolded into 10 pseudo-chromosomes using both Hi-C sequencing and a high-density linkage map. The assembly has a scaffold N50 of 58.4 Mb and a contig N50 of 1.8 Mb, representing a step advance on the previously published C. gigas assembly. Annotation based on Pacific Biosciences Iso-Seq and Illumina RNA-Seq resulted in identification of ∼30,000 putative protein-coding genes. Annotation of putative repeat elements highlighted an enrichment of Helitron rolling-circle transposable elements, suggesting their potential role in shaping the evolution of the C. gigas genome. Conclusions: This new chromosome-level assembly will be an enabling resource for genetics and genomics studies to support fundamental insight into bivalve biology, as well as for selective breeding of C. gigas in aquaculture.

A chromosome-level genome assembly for the Pacific oyster (Crassostrea gigas)

10.1101/2020.09.25.313494 ◽

2020 ◽

Author(s):

Carolina Peñaloza ◽

Alejandro P. Gutierrez ◽

Lel Eory ◽

Shan Wang ◽

Ximing Guo ◽

...

Keyword(s):

Crassostrea Gigas ◽

Pacific Oyster ◽

Sequence Data ◽

Gene Annotation ◽

Marine Bivalve ◽

High Coverage ◽

Rolling Circle ◽

Pacific Biosciences ◽

The Pacific ◽

Chromosome Level

Abstract The Pacific oyster (Crassostrea gigas) is a marine bivalve species with vital roles in coastal ecosystems and aquaculture globally. While extensive genomic tools are available for C. gigas, highly contiguous reference genomes are required to support both fundamental and applied research. In the current study, high-coverage long- and short-read sequence data generated on Pacific Biosciences and Illumina platforms from a single female specimen were used to generate an initial assembly, which was then scaffolded into 10 pseudo-chromosomes using both Hi-C sequencing and a high-density SNP linkage map. The final assembly has a scaffold N50 of 58.4 Mb and a contig N50 of 1.8 Mb, representing a step advance on the previously published C. gigas assembly. The new assembly was annotated using Pacific Biosciences Iso-Seq and Illumina RNA-Seq data, identifying 30K putative protein-coding genes, with an average of 3.9 transcripts per gene. Annotation of putative repeat elements highlighted an inverse relationship with gene density, and identified putative centromeres of the metacentric chromosomes. An enrichment of Helitron rolling-circle transposable elements was observed, suggesting their potential role in shaping the evolution of the C. gigas genome. This new chromosome-level assembly will be an enabling resource for genetics and genomics studies to support fundamental insight into bivalve biology, as well as for genetic improvement of C. gigas in aquaculture breeding programmes.

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

10.1101/008003 ◽

2014 ◽

Cited By ~ 13

Author(s):

Konstantin Berlin ◽

Sergey Koren ◽

Chen-Shan Chin ◽

James Drake ◽

Jane M Landolin ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Locality Sensitive Hashing ◽

Model Organisms ◽

SMRT Sequencing ◽

High Coverage ◽

Celera Assembler ◽

Single Molecule Sequencing ◽

Long Reads ◽

Long Read

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
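
The idea behind MinHash-based overlap detection can be illustrated with a short sketch. This is a simplified illustration of MinHash similarity estimation in general, not the MHAP implementation: the helper names, the k-mer size, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    # Salted BLAKE2b stands in for a family of independent hash functions.
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=4, num_hashes=64):
    """For each hash function, keep the minimum hash value over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """The fraction of matching minima estimates the Jaccard similarity
    of the two underlying k-mer sets."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Hypothetical toy reads; real long reads are thousands of bases long.
read_a = "ACGTACGTTAGCCGATAGGCTA"
read_b = "CGTTAGCCGATAGGCTATTGCA"
similarity = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
```

Comparing two fixed-size sketches is much cheaper than comparing the full k-mer sets, which is what makes this kind of estimate attractive for finding candidate overlaps among many noisy reads.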

Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing

Nature Communications ◽

10.1038/s41467-019-12493-y ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 26

Author(s):

Peter Edge ◽

Vikas Bansal

Keyword(s):

Single Molecule ◽

Variant Calling ◽

Small Scale ◽

Whole Genome ◽

Limited Information ◽

Single Nucleotide Variants ◽

Pacific Biosciences ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read

Abstract Whole-genome sequencing using sequencing technologies such as Illumina enables the accurate detection of small-scale variants but provides limited information about haplotypes and variants in repetitive regions of the human genome. Single-molecule sequencing (SMS) technologies such as Pacific Biosciences and Oxford Nanopore generate long reads that can potentially address the limitations of short-read sequencing. However, the high error rate of SMS reads makes it challenging to detect small-scale variants in diploid genomes. We introduce a variant calling method, Longshot, which leverages the haplotype information present in SMS reads to accurately detect and phase single-nucleotide variants (SNVs) in diploid genomes. We demonstrate that Longshot achieves very high accuracy for SNV detection using whole-genome Pacific Biosciences data, outperforms existing variant calling methods, and enables variant detection in duplicated regions of the genome that cannot be mapped using short reads.

Long-read sequencing of the human cytomegalovirus transcriptome with the Pacific Biosciences RSII platform

Scientific Data ◽

10.1038/sdata.2017.194 ◽

2017 ◽

Vol 4 (1) ◽

Cited By ~ 13

Author(s):

Zsolt Balázs ◽

Dóra Tombácz ◽

Attila Szűcs ◽

Michael Snyder ◽

Zsolt Boldogkői

Keyword(s):

RNA Sequencing ◽

Human Cytomegalovirus ◽

Preparation Methods ◽

Human Lung Fibroblast ◽

RNA Molecules ◽

Pacific Biosciences ◽

The Pacific ◽

Sequencing Studies ◽

Long Read ◽

Indispensable Tool

Abstract Long-read RNA sequencing allows for the precise characterization of full-length transcripts, which makes it an indispensable tool in transcriptomics. The human cytomegalovirus (HCMV) genome was first sequenced in 1989, and although short-read sequencing studies have uncovered much of the complexity of its transcriptome, only a few of its transcripts have been fully annotated. We hereby present a long-read RNA sequencing dataset of HCMV-infected human lung fibroblast cells sequenced on the Pacific Biosciences RSII platform. Seven SMRT cells were sequenced using oligo(dT) primers to reverse transcribe poly(A)-selected RNA molecules, and one library was prepared using random primers for the reverse transcription of the rRNA-depleted sample. Our dataset contains 122,636 human and 33,086 viral (HCMV strain Towne) reads. The described data include raw and processed sequencing files, and combined with other datasets, they can be used to validate transcriptome analysis tools, to compare library preparation methods, to test base calling algorithms or to identify genetic variants.

lra: A long read aligner for sequences and contigs

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009078 ◽

2021 ◽

Vol 17 (6) ◽

pp. e1009078

Author(s):

Jingwen Ren ◽

Mark J. P. Chaisson

Keyword(s):

Dynamic Programming ◽

Single Molecule ◽

De Novo Assembly ◽

De Novo ◽

Concave Function ◽

Single Molecule Sequencing ◽

Link Type ◽

Oxford Nanopore ◽

Concave Cost ◽

Long Read

It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1 kb and when discovering variation from ONT reads, with runtimes that are comparable (1.05-3.76×) to current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in Bioconda (https://anaconda.org/bioconda/lra) and on GitHub (https://github.com/ChaissonLab/LRA).
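
To make the contrast between linear and concave gap costs concrete, here is a minimal sketch of chaining exact-match anchors under each penalty. It uses a naive quadratic dynamic program rather than lra's sparse algorithm, and the anchor format `(query_pos, ref_pos, length)`, the logarithmic cost, and all parameter values are assumptions chosen for illustration only.

```python
import math


def linear_gap(gap, slope=1.0):
    """Gap cost that grows in proportion to gap length."""
    return slope * gap


def concave_gap(gap, scale=2.0):
    """Gap cost that grows logarithmically, so long gaps are penalized
    far less per base than short ones (an assumed concave function)."""
    return scale * math.log1p(gap)


def best_chain_score(anchors, gap_cost):
    """Score the best colinear chain of anchors with a naive O(n^2) DP.

    Each anchor is (query_pos, ref_pos, length). The gap between two
    chained anchors is the difference between their diagonals.
    """
    anchors = sorted(anchors)
    best = []
    for j, (qj, rj, lj) in enumerate(anchors):
        score = lj  # chain consisting of this anchor alone
        for i in range(j):
            qi, ri, li = anchors[i]
            # Anchor i must end before anchor j starts on both sequences.
            if qi + li <= qj and ri + li <= rj:
                gap = abs((rj - ri) - (qj - qi))
                score = max(score, best[i] + lj - gap_cost(gap))
        best.append(score)
    return max(best, default=0)


# Hypothetical anchors flanking a 500-base deletion in the read.
anchors = [(0, 0, 20), (20, 520, 20)]
linear_score = best_chain_score(anchors, linear_gap)
concave_score = best_chain_score(anchors, concave_gap)
```

In this toy case the linear penalty makes joining the two anchors worse than keeping either one alone, so the chain breaks at the deletion, whereas the concave penalty keeps both anchors in a single chain that spans the variant.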
