High-coverage, long-read sequencing of Han Chinese trio reference samples

2019 ◽  
Author(s):  
Ying-Chih Wang ◽  
Nathan D Olson ◽  
Gintaras Deikus ◽  
Hardik Shah ◽  
Aaron M Wenger ◽  
...  

Abstract Single-molecule long-read sequencing datasets were generated for a son-father-mother trio of Han Chinese descent that is part of the Genome in a Bottle (GIAB) consortium portfolio. The datasets were generated using the Pacific Biosciences Sequel System. The son was sequenced to an average coverage of 60× and each parent to 30×, with N50 subread lengths between 16 and 18 kb. Raw reads and reads aligned to both GRCh37 and GRCh38 are available at the NCBI GIAB ftp site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/), and the raw read data are archived in NCBI SRA (SRX4739017, SRX4739121, and SRX4739122). This dataset is available for anyone to develop and evaluate long-read bioinformatics methods.
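The N50 subread length quoted above is the length L such that subreads of length L or longer contain at least half of all sequenced bases. A minimal sketch of that calculation follows; the function name `n50` and the example lengths are illustrative assumptions, not values from the dataset.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    The N50 is the length L such that reads of length >= L together
    contain at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical subread lengths in bases (not taken from the GIAB data).
example_lengths = [2_000, 4_000, 9_000, 16_000, 18_000, 21_000]
example_n50 = n50(example_lengths)
```

Because the statistic is weighted by bases rather than by read count, a few long subreads can pull the N50 well above the median read length.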

2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Ying-Chih Wang ◽  
Nathan D. Olson ◽  
Gintaras Deikus ◽  
Hardik Shah ◽  
Aaron M. Wenger ◽  
...  

2018 ◽  
Vol 5 (1) ◽  
Author(s):  
Zsolt Balázs ◽  
Dóra Tombácz ◽  
Attila Szűcs ◽  
Michael Snyder ◽  
Zsolt Boldogkői

2014 ◽  
Author(s):  
Castle Raley ◽  
David Munroe ◽  
Kristie Jones ◽  
Yu-Chih Tsai ◽  
Yan Guo ◽  
...  

We have developed and validated an amplification-free method for generating DNA sequencing libraries from very low amounts of input DNA (500 picograms to 20 nanograms) for single-molecule sequencing on the Pacific Biosciences (PacBio) RS II sequencer. The common challenge of high input requirements for single-molecule sequencing is overcome by using carrier DNA in conjunction with optimized sequencing preparation conditions and re-use of the MagBead-bound complex. Here we describe how this method can be used to produce sequencing yields comparable to those generated from standard input amounts while using 1000-fold less starting material.


GigaScience ◽  
2021 ◽  
Vol 10 (3) ◽  
Author(s):  
Carolina Peñaloza ◽  
Alejandro P Gutierrez ◽  
Lél Eöry ◽  
Shan Wang ◽  
Ximing Guo ◽  
...  

Abstract Background The Pacific oyster (Crassostrea gigas) is a bivalve mollusc with vital roles in coastal ecosystems and aquaculture globally. While extensive genomic tools are available for C. gigas, highly contiguous reference genomes are required to support both fundamental and applied research. Herein we report the creation and annotation of a chromosome-level assembly for C. gigas. Findings High-coverage long- and short-read sequence data generated on Pacific Biosciences and Illumina platforms were used to generate an initial assembly, which was then scaffolded into 10 pseudo-chromosomes using both Hi-C sequencing and a high-density linkage map. The assembly has a scaffold N50 of 58.4 Mb and a contig N50 of 1.8 Mb, representing a step advance on the previously published C. gigas assembly. Annotation based on Pacific Biosciences Iso-Seq and Illumina RNA-Seq resulted in identification of ∼30,000 putative protein-coding genes. Annotation of putative repeat elements highlighted an enrichment of Helitron rolling-circle transposable elements, suggesting their potential role in shaping the evolution of the C. gigas genome. Conclusions This new chromosome-level assembly will be an enabling resource for genetics and genomics studies to support fundamental insight into bivalve biology, as well as for selective breeding of C. gigas in aquaculture.


2020 ◽  
Author(s):  
Carolina Peñaloza ◽  
Alejandro P. Gutierrez ◽  
Lel Eory ◽  
Shan Wang ◽  
Ximing Guo ◽  
...  

Abstract The Pacific oyster (Crassostrea gigas) is a marine bivalve species with vital roles in coastal ecosystems and aquaculture globally. While extensive genomic tools are available for C. gigas, highly contiguous reference genomes are required to support both fundamental and applied research. In the current study, high-coverage long- and short-read sequence data generated on Pacific Biosciences and Illumina platforms from a single female specimen were used to generate an initial assembly, which was then scaffolded into 10 pseudo-chromosomes using both Hi-C sequencing and a high-density SNP linkage map. The final assembly has a scaffold N50 of 58.4 Mb and a contig N50 of 1.8 Mb, representing a step advance on the previously published C. gigas assembly. The new assembly was annotated using Pacific Biosciences Iso-Seq and Illumina RNA-Seq data, identifying 30K putative protein-coding genes, with an average of 3.9 transcripts per gene. Annotation of putative repeat elements highlighted an inverse relationship with gene density and identified putative centromeres of the metacentric chromosomes. An enrichment of Helitron rolling-circle transposable elements was observed, suggesting their potential role in shaping the evolution of the C. gigas genome. This new chromosome-level assembly will be an enabling resource for genetics and genomics studies to support fundamental insight into bivalve biology, as well as for genetic improvement of C. gigas in aquaculture breeding programmes.


2014 ◽  
Author(s):  
Konstantin Berlin ◽  
Sergey Koren ◽  
Chen-Shan Chin ◽  
James Drake ◽  
Jane M Landolin ◽  
...  

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
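The idea behind MinHash-based overlap detection can be sketched as follows: each read is reduced to a fixed-size sketch of minimum k-mer hash values, and the fraction of matching sketch positions estimates the Jaccard similarity of the two reads' k-mer sets. This is a simplified illustration only, not the MHAP implementation; the function names, the use of seeded BLAKE2b hashes, and the default `k` and `num_hashes` values are all assumptions made for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Seeded 64-bit hash of a k-mer (illustrative choice of hash family)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=8, num_hashes=128):
    """Keep the minimum hash value over all k-mers for each seeded hash."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree, an estimate of Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Two hypothetical reads that share a 30-base stretch.
shared = "ACGTTGCAAGGCTTACGGATCCGATTACAG"
read_a = "TTGACCA" + shared
read_b = shared + "GGCATCA"
similarity = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
```

Comparing two fixed-size sketches costs the same regardless of read length, which is what makes this kind of filter attractive before running a full alignment on candidate pairs.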


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Peter Edge ◽  
Vikas Bansal

Abstract Whole-genome sequencing using sequencing technologies such as Illumina enables the accurate detection of small-scale variants but provides limited information about haplotypes and variants in repetitive regions of the human genome. Single-molecule sequencing (SMS) technologies such as Pacific Biosciences and Oxford Nanopore generate long reads that can potentially address the limitations of short-read sequencing. However, the high error rate of SMS reads makes it challenging to detect small-scale variants in diploid genomes. We introduce a variant calling method, Longshot, which leverages the haplotype information present in SMS reads to accurately detect and phase single-nucleotide variants (SNVs) in diploid genomes. We demonstrate that Longshot achieves very high accuracy for SNV detection using whole-genome Pacific Biosciences data, outperforms existing variant calling methods, and enables variant detection in duplicated regions of the genome that cannot be mapped using short reads.


2017 ◽  
Vol 4 (1) ◽  
Author(s):  
Zsolt Balázs ◽  
Dóra Tombácz ◽  
Attila Szűcs ◽  
Michael Snyder ◽  
Zsolt Boldogkői

Abstract Long-read RNA sequencing allows for the precise characterization of full-length transcripts, which makes it an indispensable tool in transcriptomics. The human cytomegalovirus (HCMV) genome was first sequenced in 1989, and although short-read sequencing studies have uncovered much of the complexity of its transcriptome, only a few of its transcripts have been fully annotated. We hereby present a long-read RNA sequencing dataset of HCMV-infected human lung fibroblast cells sequenced on the Pacific Biosciences RSII platform. Seven SMRT cells were sequenced using oligo(dT) primers to reverse transcribe poly(A)-selected RNA molecules, and one library was prepared using random primers for the reverse transcription of the rRNA-depleted sample. Our dataset contains 122,636 human and 33,086 viral (HCMV strain Towne) reads. The described data include raw and processed sequencing files, and combined with other datasets, they can be used to validate transcriptome analysis tools, to compare library preparation methods, to test base calling algorithms or to identify genetic variants.


2021 ◽  
Vol 17 (6) ◽  
pp. e1009078
Author(s):  
Jingwen Ren ◽  
Mark J. P. Chaisson

It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1 kb and when discovering variation from ONT reads, with runtimes that are comparable (1.05-3.76×) to those of current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in bioconda (https://anaconda.org/bioconda/lra) and GitHub (https://github.com/ChaissonLab/LRA).
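To illustrate why the shape of the gap cost matters when chaining exact matches, the sketch below chains anchors with a plain quadratic-time dynamic program and compares a linear gap cost with a concave one. It is not lra's sparse algorithm, and the logarithmic cost function, its parameters, and the anchor coordinates are assumptions chosen only for this example.

```python
import math


def linear_gap_cost(gap, per_base=1.0):
    """Gap cost that grows in proportion to gap length."""
    return per_base * gap


def concave_gap_cost(gap, open_cost=2.0, scale=3.0):
    """Concave gap cost: each extra base costs less as the gap grows."""
    if gap == 0:
        return 0.0
    return open_cost + scale * math.log(1 + gap)


def best_chain(anchors, gap_cost):
    """Return (score, chain) for the best-scoring chain of anchors.

    Each anchor is (read_pos, ref_pos, length), an exact match.
    This is an O(n^2) dynamic program, used here for clarity.
    """
    if not anchors:
        return 0.0, []
    order = sorted(anchors)
    n = len(order)
    score = [float(a[2]) for a in order]
    prev = [-1] * n
    for j in range(n):
        xj, yj, lj = order[j]
        for i in range(j):
            xi, yi, li = order[i]
            # Anchor i must end before anchor j starts in both sequences.
            if xi + li <= xj and yi + li <= yj:
                # Gap length is the shift between the two anchors' diagonals.
                gap = abs((xj - xi) - (yj - yi))
                candidate = score[i] + lj - gap_cost(gap)
                if candidate > score[j]:
                    score[j] = candidate
                    prev[j] = i
    end = max(range(n), key=score.__getitem__)
    best_score = score[end]
    chain = []
    while end != -1:
        chain.append(order[end])
        end = prev[end]
    chain.reverse()
    return best_score, chain


# Two hypothetical 30-base anchors flanking a 500-base deletion in the read.
anchors = [(0, 0, 30), (30, 530, 30)]
linear_score, linear_chain = best_chain(anchors, linear_gap_cost)
concave_score, concave_chain = best_chain(anchors, concave_gap_cost)
```

With these illustrative parameters the linear cost makes joining the two anchors worse than keeping either one alone, so the chain breaks at the deletion, whereas the concave cost keeps both anchors in a single chain spanning the event.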

