scholarly journals Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads

2019 ◽  
Author(s):  
Mitchell R. Vollger ◽  
Glennis A. Logsdon ◽  
Peter A. Audano ◽  
Arvis Sulovari ◽  
David Porubsky ◽  
...  

AbstractThe sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective stand-alone technology for de novo assembly of human genomes.

2019 ◽  
Author(s):  
Karen H. Miga ◽  
Sergey Koren ◽  
Arang Rhie ◽  
Mitchell R. Vollger ◽  
Ariel Gershman ◽  
...  

After nearly two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has been finished end to end, and hundreds of unresolved gaps persist 1,2. The remaining gaps include ribosomal rDNA arrays, large near-identical segmental duplications, and satellite DNA arrays. These regions harbor largely unexplored variation of unknown consequence, and their absence from the current reference genome can lead to experimental artifacts and hide true variants when re-sequencing additional human genomes. Here we present a de novo human genome assembly that surpasses the continuity of GRCh38 2, along with the first gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome 3, we reconstructed the ∼2.8 megabase centromeric satellite DNA array and closed all 29 remaining gaps in the current reference, including new sequence from the human pseudoautosomal regions and cancer-testis ampliconic gene families (CT-X and GAGE). This complete chromosome X, combined with the ultra-long nanopore data, also allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time. These results demonstrate that finishing the human genome is now within reach and will enable ongoing efforts to complete the remaining human chromosomes.


2019 ◽  
Author(s):  
Kishwar Shafin ◽  
Trevor Pesout ◽  
Ryan Lorig-Roach ◽  
Marina Haukness ◽  
Hugh E. Olsen ◽  
...  

AbstractPresent workflows for producing human genome assemblies from long-read technologies have cost and production time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63x coverage, 42 Kb read N50, 90% median read identity and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta - a de novo long read assembler, and MarginPolish & HELEN - a suite of nanopore assembly polishing algorithms. On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid and trio-binned human samples in terms of accuracy, cost, and time and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.


F1000Research ◽  
2018 ◽  
Vol 7 ◽  
pp. 1391
Author(s):  
Evan Biederstedt ◽  
Jeffrey C. Oliver ◽  
Nancy F. Hansen ◽  
Aarti Jajoo ◽  
Nathan Dunn ◽  
...  

Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. Here we present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.


2018 ◽  
Author(s):  
Alexander Lim ◽  
Bryan Naidenov ◽  
Haley Bates ◽  
Karyn Willyerd ◽  
Timothy Snider ◽  
...  

AbstractDisruptive innovations in long-range, cost-effective direct template nucleic acid sequencing are transforming clinical and diagnostic medicine. A multidrug resistant strain and a pan-susceptible strain ofMannheimia haemolytica, isolated from pneumonic bovine lung samples, were respectively sequenced at 146x and 111x coverage with Oxford Nanopore Technologies MinION.De novoassembly produced a complete genome for the non-resistant strain and a nearly complete assembly for the drug resistant strain. Functional annotation using RAST (Rapid Annotations using Subsystems Technology), CARD (Comprehensive Antibiotic Resistance Database) and ResFinder databases identified genes conferring resistance to different classes of antibiotics including beta lactams, tetracyclines, lincosamides, phenicols, aminoglycosides, sulfonamides and macrolides. Antibiotic resistance phenotypes of theM. haemolyticastrains were confirmed with minimum inhibitory concentration (MIC) assays. The sequencing capacity of highly portable MinION devices was verified by sub-sampling sequencing reads; potential for antimicrobial resistance determined by identification of resistance genes in the draft assemblies with as little as 5,437 MinION reads corresponded to all classes of MIC assays. The resulting quality assemblies and AMR gene annotation highlight efficiency of ultra long-read, whole-genome sequencing (WGS) as a valuable tool in diagnostic veterinary medicine.


2021 ◽  
Vol 10 ◽  
Author(s):  
Edith Heard ◽  
Alexander D Johnson ◽  
Jan O Korbel ◽  
Charles Lee ◽  
Michael P Snyder ◽  
...  

While the human genome represents the most accurate vertebrate reference assembly to date, it still contains numerous gaps, including centromeric and other large repeat-containing regions – often termed the “dark side” of the genome – many of which are of fundamental biological importance. Miga et al. present the first gapless assembly of the human X chromosome, with the help of ultra-long-read nanopore reads generated for the haploid complete hydatidiform mole (CHM13) genome. They reconstruct the ~3.1 megabase centromeric satellite DNA array and map DNA methylation patterns across complex tandem repeats and satellite arrays. This Telomere-to-Telomere assembly provides a superior human X chromosome reference enabling future sex-determination and X-linked disease research, and provides a path towards finishing the entire human genome sequence.


2016 ◽  
Author(s):  
Scott L. Allen ◽  
Emily K. Delaney ◽  
Artyom Kopp ◽  
Stephen F. Chenoweth

ABSTRACTLong read sequencing technology promises to greatly enhancede novoassembly of genomes for non-model species. While error rates have been a large stumbling block, sequencing at high coverage allows reads to be self-corrected. Here we sequence andde novoassemble the genome ofDrosophila serrata, a non-model species from themontiumsubgroup that has been well studied for clines and sexual selection. Using 11 PacBio SMRT cells, we generated 12 Gbp of raw sequence data comprising approximately 65x whole genome coverage. Read lengths averaged 8,940 bp (NRead50 12,200) with the longest read at 53 Kbp. We self-corrected reads using the PBDagCon algorithm and assembled the genome using the MHAP algorithm within the PBcR assembler. Total genome length was 198 Mbp with an N50 just under 1 Mbp. Contigs displayed a high degree of arm-level conservation withD. melanogaster. We also provide an initial annotation for this genome usingin silicogene predictions that were supported by RNA-seq data.


2014 ◽  
Author(s):  
Kristi E Kim ◽  
Paul Peluso ◽  
Primo Baybayan ◽  
Patricia Jane Yeadon ◽  
Charles Yu ◽  
...  

Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characterisitcs of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4-C2 and P5-C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.


2021 ◽  
Vol 12 ◽  
Author(s):  
Fiza Liaquat ◽  
Muhammad Farooq Hussain Munis ◽  
Samiah Arif ◽  
Urooj Haroon ◽  
Jianxin Shi ◽  
...  

Schima superba (Theaceae) is a subtropical evergreen tree and is used widely for forest firebreaks and gardening. It is a plant that tolerates salt and typically accumulates elevated amounts of manganese in the leaves. With large ecological amplitude, this tree species grows quickly. Due to its substantial biomass, it has a great potential for soil remediation. To evaluate the thorough framework of the mRNA, we employed PacBio sequencing technology for the first time to generate S. Superba transcriptome. In this analysis, overall, 511,759 full length non-chimeric reads were acquired, and 163,834 high-quality full-length reads were obtained. Overall, 93,362 open reading frames were obtained, of which 78,255 were complete. In gene annotation analyses, the Kyoto Encyclopedia of Genes and Genomes (KEGG), Clusters of Orthologous Genes (COG), Gene Ontology (GO), and Non-Redundant (Nr) databases were allocated 91,082, 71,839, 38,914, and 38,376 transcripts, respectively. To identify long non-coding RNAs (lncRNAs), we utilized four computational methods associated with protein families (Pfam), Cooperative Data Classification (CPC), Coding Assessing Potential Tool (CPAT), and Coding Non-Coding Index (CNCI) databases and observed 8,551, 9,174, 20,720, and 18,669 lncRNAs, respectively. Moreover, nine genes were randomly selected for the expression analysis, which showed the highest expression of Gene 6 (Na_Ca_ex gene), and CAX (CAX-interacting protein 4) was higher in manganese (Mn)-treated group. This work provided significant number of full-length transcripts and refined the annotation of the reference genome, which will ease advanced genetic analyses of S. superba.


2021 ◽  
Vol 12 ◽  
Author(s):  
Kuan Yao ◽  
Narjol González-Escalona ◽  
Maria Hoffmann

Plasmids play a major role in bacterial adaptation to environmental stress and often contribute to antibiotic resistance and disease virulence. Although the complete sequence of each plasmid is essential for studying plasmid biology, most antibiotic resistance and virulence plasmids in Salmonella are present only in a low copy number, making extraction and sequencing difficult. Long read sequencing technologies require higher concentrations of DNA to provide optimal results. To resolve this problem, we assessed the sufficiency of multiple displacement amplification (MDA) for replicating Salmonella plasmid DNA to a satisfactory concentration for accurate sequencing and multiplexing. Nine Salmonella enterica isolates, representing nine different serovars carrying plasmids for which sequence data are already available at NCBI, were cultured and their plasmids isolated using an alkaline lysis extraction protocol. We then used the Phi29 polymerase to perform MDA, thereby obtaining enough plasmid DNA for long read sequencing. These amplified plasmids were multiplexed and sequenced on one single molecule, real-time (SMRT) cell with the Pacific Biosciences (Pacbio) Sequel sequencer. We were able to close all Salmonella plasmids (sizes ranged from 38 to 166 Kb) with sequencing coverage from 24 to 2,582X. This protocol, consisting of plasmid isolation, MDA, and multiplex sequencing, is an effective and fast method for closing high-molecular weight and low-copy-number plasmids. This high throughput protocol reduces the time and cost of plasmid closure.


2017 ◽  
Author(s):  
Mircea Cretu Stancu ◽  
Markus J. van Roosmalen ◽  
Ivo Renkens ◽  
Marleen Nieboer ◽  
Sjors Middelkamp ◽  
...  

AbstractStructural genomic variants form a common type of genetic alteration underlying human genetic disease and phenotypic variation. Despite major improvements in genome sequencing technology and data analysis, the detection of structural variants still poses challenges, particularly when variants are of high complexity. Emerging long-read single-molecule sequencing technologies provide new opportunities for detection of structural variants. Here, we demonstrate sequencing of the genomes of two patients with congenital abnormalities using the ONT MinION at 11x and 16x mean coverage, respectively. We developed a bioinformatic pipeline - NanoSV - to efficiently map genomic structural variants (SVs) from the long-read data. We demonstrate that the nanopore data are superior to corresponding short-read data with regard to detection of de novo rearrangements originating from complex chromothripsis events in the patients. Additionally, genome-wide surveillance of SVs, revealed 3,253 (33%) novel variants that were missed in short-read data of the same sample, the majority of which are duplications < 200bp in size. Long sequencing reads enabled efficient phasing of genetic variations, allowing the construction of genome-wide maps of phased SVs and SNVs. We employed read-based phasing to show that all de novo chromothripsis breakpoints occurred on paternal chromosomes and we resolved the long-range structure of the chromothripsis. This work demonstrates the value of long-read sequencing for screening whole genomes of patients for complex structural variants.


Sign in / Sign up

Export Citation Format

Share Document