Segmental duplications and their variation in a complete human genome

2021
Author(s):  
Mitchell R. Vollger ◽  
Xavi Guitart ◽  
Philip C. Dishuck ◽  
Ludovica Mercuri ◽  
William T. Harvey ◽  
...  

Despite their importance in disease and evolution, highly identical segmental duplications (SDs) have been among the last regions of the human reference genome (GRCh38) to be finished. Based on a complete telomere-to-telomere human genome (T2T CHM13), we present the first comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence, increasing the genome-wide estimate from 5.4% to 7.0% (218 Mbp). An analysis of 266 human genomes shows that 91% of the new T2T CHM13 SD sequence (68.3 Mbp) better represents human copy number. We find that SDs show increased single-nucleotide variation diversity when compared to unique regions; we characterize methylation signatures that correlate with duplicate gene transcription and predict 182 novel protein-coding gene candidates. We find that 63% (35.11/55.7 Mbp) of acrocentric chromosomes consist of SDs distinct from rDNA and satellite sequences. Acrocentric SDs are 1.75-fold longer (p=0.00034) than other SDs, are frequently shared with autosomal pericentromeric regions, and are heteromorphic among human chromosomes. Comparing long-read assemblies from other human (n=12) and nonhuman primate (n=5) genomes, we use the T2T CHM13 genome to systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant genes (LPA, SMN) and duplicated genes (TBC1D3, SRGAP2C, ARHGAP11B) important in the expansion of the human frontal cortex. The analysis reveals unprecedented patterns of structural heterozygosity and massive evolutionary differences in SD organization between humans and their closest living relatives.
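The genome-wide SD estimate above is, in effect, the merged length of all SD intervals divided by the assembly length. As a minimal sketch of that bookkeeping (not the authors' pipeline; the interval values and the assumed T2T-CHM13 genome size below are purely illustrative):

```python
# Minimal sketch: estimate the fraction of a genome covered by segmental
# duplications from a BED-like list of SD intervals. Intervals and the
# assumed genome size are illustrative, not the study's actual data.

def merged_coverage(intervals):
    """Sum of bases covered by (chrom, start, end) intervals, merging overlaps."""
    total = 0
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:                 # overlapping or adjacent: extend the run
                cur_end = max(cur_end, end)
            else:                                # disjoint: bank the previous run
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total

# Toy example: two overlapping SDs on one chromosome.
sds = [("chr1", 100_000, 300_000), ("chr1", 250_000, 500_000)]
genome_size = 3_054_000_000                      # assumed assembly length, for illustration
print(f"SD fraction: {merged_coverage(sds) / genome_size:.2%}")
```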

Author(s):  
Sergey Nurk ◽  
Brian P. Walenz ◽  
Arang Rhie ◽  
Mitchell R. Vollger ◽  
Glennis A. Logsdon ◽  
...  

Abstract Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced PacBio HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a significant modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultra-long Oxford Nanopore reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance towards the complete assembly of human genomes. Availability: HiCanu is implemented within the Canu assembly framework and is available from https://github.com/marbl/canu.
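Homopolymer compression, the first of the HiCanu steps named above, collapses each run of identical bases into a single base so that the dominant HiFi error mode (homopolymer length errors) does not break read overlaps. A minimal illustration of the transformation only (not HiCanu's implementation, which also tracks run lengths so the final consensus can be restored):

```python
def homopolymer_compress(seq):
    """Collapse runs of identical bases into a single base (e.g. AAATTC -> ATC).

    Minimal sketch of the general idea; HiCanu's actual implementation lives in
    the Canu codebase and is considerably more involved.
    """
    if not seq:
        return ""
    out = [seq[0]]
    for base in seq[1:]:
        if base != out[-1]:
            out.append(base)
    return "".join(out)

print(homopolymer_compress("GGGATTTTACCA"))  # -> GATACA
```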


Author(s):  
Jouni Sirén ◽  
Jean Monlong ◽  
Xian Chang ◽  
Adam M. Novak ◽  
Jordan M. Eizenga ◽  
...  

Abstract We introduce Giraffe, a pangenome short-read mapper that can efficiently map to a collection of haplotypes threaded through a sequence graph. Giraffe, part of the variation graph toolkit (vg), maps reads to thousands of human genomes at around the same speed that BWA-MEM maps reads to a single reference genome, while maintaining comparable accuracy to VG-MAP, vg's original mapper. We have developed efficient genotyping pipelines using Giraffe. We demonstrate improvements in genotyping for single-nucleotide variants (SNVs), insertions and deletions (indels), and structural variations (SVs) genome-wide. We use Giraffe to genotype and phase 167 thousand structural variations ascertained from long-read studies in 5,202 human genomes sequenced with short reads, including the complete 1000 Genomes Project dataset, at an average cost of $1.50 per sample. We determine the frequency of these variations in diverse human populations, characterize their complex allelic variation, and identify thousands of expression quantitative trait loci (eQTLs) driven by these variations.
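The core data-structure idea, mapping against haplotypes threaded through a sequence graph, can be pictured as a graph whose haplotypes are walks of node IDs and whose candidate alignments are restricted to walks supported by at least one haplotype. This is a conceptual toy only; Giraffe uses a compressed haplotype index (the GBWT) rather than literal path lists like these:

```python
# Conceptual sketch only: a sequence graph with haplotypes stored as walks of
# node IDs, and a check for whether a candidate walk is supported by any
# haplotype. Node IDs and sequences below are invented for illustration.

from typing import Dict, List, Tuple

class HaplotypeGraph:
    def __init__(self):
        self.node_seq: Dict[int, str] = {}           # node ID -> sequence
        self.haplotypes: List[Tuple[int, ...]] = []  # each haplotype is a walk of node IDs

    def add_node(self, node_id: int, seq: str) -> None:
        self.node_seq[node_id] = seq

    def add_haplotype(self, walk: Tuple[int, ...]) -> None:
        self.haplotypes.append(walk)

    def walk_is_supported(self, walk: Tuple[int, ...]) -> bool:
        """True if the walk appears contiguously in at least one stored haplotype."""
        n = len(walk)
        for hap in self.haplotypes:
            for i in range(len(hap) - n + 1):
                if hap[i:i + n] == walk:
                    return True
        return False

g = HaplotypeGraph()
for nid, seq in [(1, "ACGT"), (2, "T"), (3, "C"), (4, "GGA")]:
    g.add_node(nid, seq)
g.add_haplotype((1, 2, 4))   # haplotype carrying the T allele
g.add_haplotype((1, 3, 4))   # haplotype carrying the C allele
print(g.walk_is_supported((1, 2, 4)))  # True:  seen in a haplotype
print(g.walk_is_supported((2, 3)))     # False: these alleles never co-occur
```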


2020
Author(s):  
Richard Kuo ◽  
Yuanyuan Cheng ◽  
Runxuan Zhang ◽  
John W.S. Brown ◽  
Jacqueline Smith ◽  
...  

Abstract Background: The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein-coding genes. Accurate high-throughput long-read transcript sequencing can now provide additional evidence for rare transcripts and genes, such as mono-exonic and non-coding genes, that were previously either undetectable or impossible to differentiate from sequencing noise.
Results: We developed the Transcriptome Annotation by Modular Algorithms (TAMA) software to leverage the power of long-read transcript sequencing and address the issues with current data processing pipelines. TAMA achieved high sensitivity and precision for gene and transcript model predictions in both reference-guided and unguided approaches in our benchmark tests using simulated Pacific Biosciences (PacBio) and Nanopore sequencing data and real PacBio datasets. By analyzing PacBio Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using TAMA and other commonly used tools, we found that the convention of using alignment identity to measure error correction performance does not reflect the actual gain in accuracy of predicted transcript models. In addition, inter-read error correction can cause major changes to read mapping, resulting in potentially over 6,000 erroneous gene model predictions in the Iso-Seq-based human genome annotation. Using TAMA's genome assembly-based error correction and gene feature evidence, we predicted 2,566 putative novel non-coding genes and 1,557 putative novel protein-coding gene models.
Conclusions: Long-read transcript sequencing data have the power to identify novel genes within the highly annotated human genome. The parameter tuning and extensive output information of the TAMA software package allow for in-depth exploration of eukaryotic transcriptomes. We have found long-read evidence for thousands of unannotated genes within the human genome. Further development in sequencing library preparation and data processing is required to differentiate sequencing noise from real genes in long-read RNA sequencing data.
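The distinction drawn above between alignment identity and transcript-model accuracy can be made concrete: error correction can raise percent identity while leaving the predicted splice-junction chain unchanged, or even shift a junction so the model becomes wrong. A toy illustration under simplified assumptions (the junction-matching criterion here is not TAMA's algorithm, and all numbers are invented):

```python
# Illustrative sketch: alignment identity versus whether the transcript model
# (its ordered chain of splice junctions) matches the reference annotation.

def percent_identity(matches: int, aligned_length: int) -> float:
    return 100.0 * matches / aligned_length

def same_transcript_model(pred_junctions, ref_junctions) -> bool:
    """Compare transcript models by their ordered (donor, acceptor) splice junctions."""
    return list(pred_junctions) == list(ref_junctions)

ref_junctions = [(1000, 2000), (2300, 4000)]

# Before correction: noisier alignment, but the junction chain already matches.
print(percent_identity(matches=930, aligned_length=1000))                   # 93.0
print(same_transcript_model([(1000, 2000), (2300, 4000)], ref_junctions))   # True

# After inter-read correction: higher identity, but a shifted junction -> wrong model.
print(percent_identity(matches=995, aligned_length=1000))                   # 99.5
print(same_transcript_model([(1000, 2000), (2310, 4000)], ref_junctions))   # False
```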


2019
Author(s):  
Kishwar Shafin ◽  
Trevor Pesout ◽  
Ryan Lorig-Roach ◽  
Marina Haukness ◽  
Hugh E. Olsen ◽  
...  

Abstract Present workflows for producing human genome assemblies from long-read technologies have cost and production-time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63× coverage, 42 kb read N50, 90% median read identity, and 6.5× coverage in reads longer than 100 kb, using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta, a de novo long-read assembler, and MarginPolish & HELEN, a suite of nanopore assembly polishing algorithms. On a single commercial compute node, Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid, and trio-binned human samples in terms of accuracy, cost, and time, and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that the addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.
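Two of the metrics quoted above, read N50 and Phred-scaled consensus quality (QV), are simple to compute; a short sketch with toy numbers:

```python
import math

def read_n50(lengths):
    """N50: the length L such that reads of length >= L cover at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def qv(identity):
    """Phred-scaled quality, e.g. 99.9% identity -> QV30."""
    return -10.0 * math.log10(1.0 - identity)

lengths = [10_000, 20_000, 30_000, 40_000, 50_000]   # toy read lengths
print(read_n50(lengths))    # 40000: 50k + 40k = 90k, which is >= half of 150k
print(round(qv(0.999)))     # 30
print(round(qv(0.99999)))   # 50 (the QV50 reported for the HiCanu CHM13 assembly above)
```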


2019
Author(s):  
Mitchell R. Vollger ◽  
Glennis A. Logsdon ◽  
Peter A. Audano ◽  
Arvis Sulovari ◽  
David Porubsky ◽  
...  

Abstract The sequence and assembly of human genomes using long-read sequencing technologies have revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequence is recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of assembling centromeric DNA and the largest regions of segmental duplication with existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective stand-alone technology for de novo assembly of human genomes.
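The NG50 quoted here differs from the read N50 sketched above in that the denominator is the genome (or region) size rather than the total assembly size, so an incomplete assembly is penalized. A brief sketch with invented contig lengths and an assumed 10 Mbp region size:

```python
def ng50(contig_lengths, genome_size):
    """NG50: the contig length L such that contigs of length >= L cover at least
    half of the *genome or region* (plain N50 would use half of the assembly)."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= genome_size:
            return length
    return 0   # the assembly covers less than half of the genome

# Toy pericentromeric example with an assumed 10 Mbp region size.
contigs = [4_000_000, 1_500_000, 500_000, 480_600, 191_500]
print(ng50(contigs, genome_size=10_000_000))   # 1500000: 4.0 + 1.5 Mbp >= 5 Mbp
```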


F1000Research
2018
Vol 7
pp. 1391
Author(s):  
Evan Biederstedt ◽  
Jeffrey C. Oliver ◽  
Nancy F. Hansen ◽  
Aarti Jajoo ◽  
Nathan Dunn ◽  
...  

Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. Here we present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.
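The construction step described above, merging input sequences where they are homologous and sequence-identical, can be pictured as walking the columns of the multiple sequence alignment and emitting a shared node where every sequence agrees and a bubble where they differ. A toy sketch of that idea only (NovoGraph itself operates genome-wide on contig alignments and emits VCF):

```python
# Simplified sketch: collapse MSA columns into shared nodes where all input
# sequences carry the same base, and into per-allele nodes (a bubble) where
# they differ. The tiny in-memory MSA below is invented for illustration.

def msa_to_graph(msa_rows):
    """msa_rows: equal-length aligned strings ('-' = gap). Returns a list of
    columns, each a dict mapping allele -> set of row indices carrying it."""
    graph = []
    for col in zip(*msa_rows):
        alleles = {}
        for row_idx, base in enumerate(col):
            if base != "-":
                alleles.setdefault(base, set()).add(row_idx)
        graph.append(alleles)
    return graph

msa = ["ACGTA",
       "ACCTA",
       "AC-TA"]
for i, column in enumerate(msa_to_graph(msa)):
    kind = "shared node" if len(column) == 1 else "bubble"
    print(i, column, "->", kind)
```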


2019
Author(s):  
David Jebb ◽  
Zixia Huang ◽  
Martin Pippel ◽  
Graham M. Hughes ◽  
Ksenia Lavrichenko ◽  
...  

Abstract Bats account for ~20% of all extant mammal species and are considered exceptional given their extraordinary adaptations, including biosonar, true flight, extreme longevity, and unparalleled immune systems. To understand these adaptations, we generated reference-quality genomes of six species representing the key divergent lineages. We assembled these genomes with a novel pipeline incorporating state-of-the-art long-read and long-range sequencing and assembly techniques. The genomes were annotated using a maximal-evidence approach combining de novo predictions, protein/mRNA alignments, Iso-Seq long-read and RNA-seq short-read transcripts, and gene projections from our new TOGA pipeline, retrieving virtually all (>99%) mammalian BUSCO genes. Phylogenetic analyses of 12,931 protein-coding genes and 10,857 conserved non-coding elements identified across 48 mammalian genomes helped to resolve bats' closest extant relatives within Laurasiatheria, supporting a basal position for bats within Scrotifera. Genome-wide screens along the bat ancestral branch revealed (a) selection on hearing-related genes (e.g. LRP2, SERPINB6, TJP2), suggesting that laryngeal echolocation is a shared ancestral trait of bats; (b) selection on immunity-related genes (e.g. INAVA, CXCL13, NPSR1) and loss of immunity-related proteins (e.g. LRRC70, IL36G), including components of pro-inflammatory NF-kB signalling; and (c) expansion of the APOBEC family, associated with restricting viral infection, transposon activity, and interferon signalling. We also identified unique integrated viruses, indicating that bats have a history of tolerating viral pathogens that are lethal to other mammal species. Non-coding RNA analyses identified variant and novel microRNAs, revealing regulatory relationships that may contribute to phenotypic diversity in bats. Together, our reference-quality genomes, high-quality annotations, genome-wide screens, and in vitro tests revealed previously unknown genomic adaptations in bats that may explain their extraordinary traits.


Author(s):  
Anton Bankevich ◽  
Pavel Pevzner

Abstract Long-read technologies have revolutionized genome assembly and enabled resolution of bridged repeats (i.e., repeats that are spanned by some reads) in various genomes. However, the problem of resolving unbridged repeats (such as long segmental duplications in the human genome) remains largely unsolved, making it a major obstacle towards achieving the goal of complete genome assemblies. Moreover, the challenge of resolving unbridged repeats is not limited to eukaryotic genomes but also impairs assemblies of bacterial genomes and metagenomes. We describe the mosaicFlye algorithm for resolving complex unbridged repeats based on differences between repeat copies and show how it improves assemblies of the human genome as well as bacterial genomes and metagenomes. In particular, we show that mosaicFlye produces a complete assembly of both arms of human chromosome 6.
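The underlying idea, telling unbridged repeat copies apart by the small differences between them, can be illustrated by grouping reads according to the alleles they carry at a few distinguishing positions, so that each group corresponds to one repeat copy. This is a toy illustration with invented positions and reads, not mosaicFlye's algorithm, which also handles divergence estimation, consensus building, and iterative read assignment:

```python
from collections import defaultdict

def group_reads_by_signature(read_alleles):
    """read_alleles: {read_id: {position: allele}}.
    Groups reads that carry identical allele signatures across the
    distinguishing positions they cover."""
    groups = defaultdict(list)
    for read_id, alleles in read_alleles.items():
        signature = tuple(sorted(alleles.items()))
        groups[signature].append(read_id)
    return groups

reads = {
    "read1": {1200: "A", 5400: "G"},   # consistent with repeat copy 1
    "read2": {1200: "A", 5400: "G"},
    "read3": {1200: "T", 5400: "C"},   # consistent with repeat copy 2
}
for signature, members in group_reads_by_signature(reads).items():
    print(signature, members)
```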


2019
Vol 2 (1)
Author(s):  
Lu Zhang ◽  
Xin Zhou ◽  
Ziming Weng ◽  
Arend Sidow

Abstract Detection of structural variants (SVs) on the basis of read alignment to a reference genome remains a difficult problem. De novo assembly, traditionally used to generate reference genomes, offers an alternative for SV detection. However, it has not been applied broadly to human genomes because of the fundamental limitations of short-fragment approaches and the high cost of long-read technologies. Here we show that 10× linked-read sequencing supports accurate SV detection. We examined variants in six de novo 10× assemblies with diverse experimental parameters from two commonly used human cell lines: NA12878 and NA24385. The assemblies are effective for detecting mid-size SVs, which were discovered by simple pairwise alignment of the assemblies' contigs to the reference (hg38). Our study also shows that base-pair-level SV breakpoint accuracy is high, with a majority of SVs having precisely correct sizes and breakpoints. Setting the ancestral state of SV loci by comparison to ape orthologs allows inference of the actual molecular mechanism (insertion or deletion) that caused the mutation; in about half of the cases, the mechanism is the opposite of the reference-based call. We uncover 214 SVs that may have been maintained as polymorphisms in the human lineage since before our divergence from chimpanzees. Overall, we show that de novo assembly of 10× linked-read data can achieve cost-effective SV detection for personal genomes.
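The ancestral-state logic mentioned above can be sketched as a small decision rule: a variant called as a deletion relative to the reference is reinterpreted as an insertion (in the reference lineage) when the ape ortholog carries the shorter allele, and vice versa. A hedged sketch with placeholder labels, not the authors' exact procedure:

```python
# Minimal sketch of inferring the molecular mechanism of an SV from the
# ancestral (ape ortholog) allele. Allele labels are simplified placeholders.

def infer_mechanism(reference_call: str, ancestral_allele: str) -> str:
    """reference_call: 'DEL' or 'INS' relative to the reference genome.
    ancestral_allele: 'long' or 'short', as seen in the ape ortholog."""
    if reference_call == "DEL":
        # Reference carries the long allele; the sample carries the short one.
        return "deletion" if ancestral_allele == "long" else "insertion (in reference lineage)"
    else:
        # Reference carries the short allele; the sample carries the long one.
        return "insertion" if ancestral_allele == "short" else "deletion (in reference lineage)"

print(infer_mechanism("DEL", "long"))    # a true deletion in the sample lineage
print(infer_mechanism("DEL", "short"))   # the reference-based call flips
```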

