Epigenetic Patterns in a Complete Human Genome

The completion of the first telomere-to-telomere human genome, T2T-CHM13, enables exploration of the full epigenome, removing limitations previously imposed by the missing reference sequence. Existing epigenetic studies omit unassembled and unmappable genomic regions (e.g. centromeres, pericentromeres, acrocentric chromosome arms, subtelomeres, segmental duplications, tandem repeats). Leveraging the new assembly, we were able to measure enrichment of epigenetic marks with short reads using k-mer assisted mapping methods. This granted array-level enrichment information to characterize the epigenetic regulation of these satellite repeats. Using nanopore sequencing data, we generated base-level maps of the most complete human methylome ever produced. We examined methylation patterns in satellite DNA and revealed organized patterns of methylation along individual molecules. When exploring the centromeric epigenome, we discovered a distinctive dip in centromere methylation consistent with active sites of kinetochore assembly. Through long-read chromatin accessibility measurements (nanoNOMe) paired to CUT&RUN data, we found the hypomethylated region was extremely inaccessible and paired to CENP-A/B binding. With long reads we interrogated allele-specific, long-range epigenetic patterns in complex macro-satellite arrays such as those involved in X chromosome inactivation. Using the single-molecule measurements, we can cluster reads based on methylation status alone, distinguishing epigenetically heterogeneous and homogeneous areas. The analysis provides a framework to investigate the most elusive regions of the human genome, applying both long and short-read technology to grant new insights into epigenetic regulation.
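The idea of grouping reads by methylation status alone can be illustrated with a minimal sketch. It assumes each read has already been reduced to a complete vector of binary calls (1 = methylated, 0 = unmethylated) over the same CpG sites, and it uses a plain two-group k-means; the function name `cluster_reads` and the toy data are hypothetical, and this is not necessarily the clustering method used in the study.

```python
def cluster_reads(reads, max_iter=20):
    """Two-group k-means over per-read binary methylation calls (1 = methylated)."""
    # Seed the two centroids with the least and the most methylated read.
    ordered = sorted(reads, key=sum)
    centroids = [list(map(float, ordered[0])), list(map(float, ordered[-1]))]
    labels = None
    for _ in range(max_iter):
        new_labels = []
        for read in reads:
            # Squared Euclidean distance from this read to each centroid.
            dists = [sum((x - c) ** 2 for x, c in zip(read, cen)) for cen in centroids]
            new_labels.append(0 if dists[0] <= dists[1] else 1)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for k in (0, 1):
            members = [r for r, lab in zip(reads, labels) if lab == k]
            if members:
                # New centroid = per-site methylation frequency within the group.
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy example: five reads covering the same four CpG sites.
toy_reads = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
]
labels = cluster_reads(toy_reads)
print(labels)
```

In a region where the two groups have very different per-site methylation frequencies the locus would look epigenetically heterogeneous; where nearly all reads fall into one group it would look homogeneous. Real data would also need handling of missing calls, which this sketch omits.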

Telomere-to-telomere assembly of a complete human X chromosome

10.1101/735928 ◽

2019 ◽

Cited By ~ 43

Author(s):

Karen H. Miga ◽

Sergey Koren ◽

Arang Rhie ◽

Mitchell R. Vollger ◽

Ariel Gershman ◽

...

Keyword(s):

Human Genome ◽

X Chromosome ◽

Satellite DNA ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Hydatidiform Mole ◽

Segmental Duplications ◽

High Coverage ◽

Current Reference

After nearly two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has been finished end to end, and hundreds of unresolved gaps persist 1,2. The remaining gaps include ribosomal rDNA arrays, large near-identical segmental duplications, and satellite DNA arrays. These regions harbor largely unexplored variation of unknown consequence, and their absence from the current reference genome can lead to experimental artifacts and hide true variants when re-sequencing additional human genomes. Here we present a de novo human genome assembly that surpasses the continuity of GRCh38 2, along with the first gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome 3, we reconstructed the ∼2.8 megabase centromeric satellite DNA array and closed all 29 remaining gaps in the current reference, including new sequence from the human pseudoautosomal regions and cancer-testis ampliconic gene families (CT-X and GAGE). This complete chromosome X, combined with the ultra-long nanopore data, also allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time. These results demonstrate that finishing the human genome is now within reach and will enable ongoing efforts to complete the remaining human chromosomes.

Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads

10.1101/635037 ◽

2019 ◽

Cited By ~ 7

Author(s):

Mitchell R. Vollger ◽

Glennis A. Logsdon ◽

Peter A. Audano ◽

Arvis Sulovari ◽

David Porubsky ◽

...

Keyword(s):

Human Genome ◽

Single Molecule ◽

Tandem Repeats ◽

De Novo ◽

Sequence Data ◽

Gene Annotation ◽

Hydatidiform Mole ◽

High Fidelity ◽

Human Genomes ◽

Long Read

Abstract The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective stand-alone technology for de novo assembly of human genomes.
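For readers unfamiliar with the NG50 statistic quoted above, the following is a minimal sketch of how N50 and NG50 are commonly computed from a list of contig lengths. The function name `nx50` and the toy lengths are hypothetical; the only difference between the two statistics is whether the halfway point is taken from the assembly's own total length (N50) or from an assumed reference or region size (NG50).

```python
def nx50(lengths, total=None):
    """Return the length L such that contigs of length >= L cover at least half of `total`.

    With total=None the assembly's own summed length is used (N50);
    passing an expected genome or region size gives NG50.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the contigs cover less than half of `total`


contigs = [80, 70, 50, 40, 30, 20]   # toy contig lengths, arbitrary units
n50 = nx50(contigs)                  # half of 290 is reached by 80 + 70
ng50 = nx50(contigs, total=400)      # half of 400 is reached by 80 + 70 + 50
print(n50, ng50)
```

Because NG50 is normalized to a fixed size, it allows two assemblies of the same region (such as the HiFi and CLR assemblies compared here) to be compared even when their total assembled lengths differ.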

Critical assessment of bioinformatics methods for the characterization of pathological repeat expansions with single-molecule sequencing data

Briefings in Bioinformatics ◽

10.1093/bib/bbz099 ◽

2019 ◽

Vol 21 (6) ◽

pp. 1971-1986 ◽

Cited By ~ 1

Author(s):

Matteo Chiara ◽

Federico Zambelli ◽

Ernesto Picardi ◽

David S Horner ◽

Graziano Pesole

Keyword(s):

Single Molecule ◽

Tandem Repeats ◽

Simulated Data ◽

Detailed Comparison ◽

Sequencing Data ◽

Single Molecule Sequencing ◽

Sequencing Technologies ◽

Repeat Expansions

Abstract A number of studies have reported the successful application of single-molecule sequencing technologies to the determination of the size and sequence of pathological expanded microsatellite repeats over the last 5 years. However, different custom bioinformatics pipelines were employed in each study, preventing meaningful comparisons and somewhat limiting the reproducibility of the results. In this review, we provide a brief summary of state-of-the-art methods for the characterization of expanded repeat alleles, along with a detailed comparison of bioinformatics tools for the determination of repeat length and sequence, using both real and simulated data. Our reanalysis of publicly available human genome sequencing data suggests a modest, but statistically significant, increase of the error rate of single-molecule sequencing technologies at genomic regions containing short tandem repeats. However, we observe that all the methods herein tested, irrespective of the strategy used for the analysis of the data (either based on the alignment or assembly of the reads), show high levels of sensitivity in both the detection of expanded tandem repeats and the estimation of the expansion size, suggesting that approaches based on single-molecule sequencing technologies are highly effective for the detection and quantification of tandem repeat expansions and contractions.
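As a rough illustration of what "determination of repeat length" means at the level of a single read, the sketch below counts the longest run of back-to-back exact copies of a motif. It is deliberately naive and is not one of the tools compared in the review: the function name `longest_repeat_run` is hypothetical, the motif is assumed to be non-empty, and real single-molecule reads contain errors that interrupt exact runs, which is why dedicated alignment- or assembly-based methods are used in practice.

```python
import re


def longest_repeat_run(seq, motif):
    """Return the largest number of back-to-back exact copies of `motif` in `seq`."""
    # Match one or more consecutive copies of the motif.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    return max(
        (len(m.group(0)) // len(motif) for m in pattern.finditer(seq)),
        default=0,  # motif never occurs
    )


# Toy read: a run of three CAG units, then a shorter run of two.
read = "ACGCAGCAGCAGTTCAGCAG"
units = longest_repeat_run(read, "CAG")
print(units)
```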

Tandem repeats structure of gel-forming mucin domains could be revealed by SMRT sequencing data

10.21203/rs.3.rs-112828/v1 ◽

2020 ◽

Author(s):

Tiange Lang

Keyword(s):

Single Molecule ◽

Tandem Repeats ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Coding Region ◽

SMRT Sequencing ◽

Single Nucleotide ◽

Sequencing Technologies ◽

Long Reads ◽

Great Complexity

Abstract Background. Gel-forming mucin domains of mucin genes show great complexity with tandem repeats (TRs), thus making it difficult to study the sequences. Methods. With the coming of single molecule real-time (SMRT) sequencing technologies, we manage to present the sequence structure of mucin domains via SMRT long reads for MUC2, MUC5AC, MUC5B and MUC6. Results. Our study shows that for different individuals, single nucleotide polymorphisms (SNPs) could be found in mucin domains of MUC2, MUC5AC, MUC5B and MUC6, while different numbers of tandem repeats could be found in mucin domains of MUC2 and MUC6. Conclusions. This information will provide new insights on obtaining the sequence of tandem repeat regions located in coding regions.

Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information

GigaScience ◽

10.1093/gigascience/giz125 ◽

2019 ◽

Vol 8 (12) ◽

Cited By ~ 6

Author(s):

Hui-Su Kim ◽

Sungwon Jeon ◽

Changjae Kim ◽

Yeon Kyung Kim ◽

Yun Sung Cho ◽

...

Keyword(s):

Human Genome ◽

Single Molecule ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Low Cost ◽

Cost Effective ◽

Sequencing Data ◽

SMRT Sequencing ◽

Human Genome Assembly

Abstract Background Long DNA reads produced by single-molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short-read DNA fragments. For de novo assembly, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the favorite options. However, PacBio's SMRT sequencing is expensive for a full human genome assembly and costs more than $40,000 US for 30× coverage as of 2019. ONT PromethION sequencing, on the other hand, is 1/12 the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio's SMRT sequencing in relation to the quality. Findings We performed whole-genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64× coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mb and a total genome length of 2.8 Gb. It was comparable to a KOREF assembly constructed using PacBio at 62× coverage (188 Gb, 2,695 contigs, and N50s of 17.9 Mb). When we applied Hi-C–derived long-range mapping data, an even higher quality assembly for the 64× coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mb. Conclusion The pore-based PromethION approach provided a high-quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and was more cost-effective than PacBio at comparable quality measurements.
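The coverage figures in this abstract follow the usual relationship coverage = total sequenced bases / genome size. A minimal sketch of that arithmetic, using only the numbers quoted above, is shown below; the function name `implied_genome_size` is hypothetical, and the genome size it returns is simply what the stated yield and coverage imply, not a value reported by the authors.

```python
def implied_genome_size(total_bases, coverage):
    """Genome size implied by a sequencing yield and a stated fold coverage."""
    return total_bases / coverage


# Figures quoted in the abstract: 193 Gb at 64x (PromethION), 188 Gb at 62x (PacBio).
ont_size = implied_genome_size(193e9, 64)
pacbio_size = implied_genome_size(188e9, 62)
print(f"PromethION: {ont_size / 1e9:.2f} Gb, PacBio: {pacbio_size / 1e9:.2f} Gb")
```

Both datasets imply a reference size of roughly 3.0 Gb, which suggests the two coverage values were computed against the same genome size and are therefore directly comparable.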

Nanopore-based single molecule sequencing of the D4Z4 array responsible for facioscapulohumeral muscular dystrophy

10.1101/157040 ◽

2017 ◽

Author(s):

Satomi Mitsuhashi ◽

So Nakagawa ◽

Mahoko Takahashi Ueda ◽

Tadashi Imanishi ◽

Martin C Frith ◽

...

Keyword(s):

Muscular Dystrophy ◽

Single Molecule ◽

GC Content ◽

Facioscapulohumeral Muscular Dystrophy ◽

Repeat Sequence ◽

Reference Sequence ◽

Sequencing Data ◽

BAC Clone ◽

D4Z4 Repeat ◽

D4Z4 Array

Abstract Subtelomeric macrosatellite repeats are difficult to sequence using conventional sequencing methods owing to the high similarity among repeat units and high GC content. Sequencing these repetitive regions is challenging, even with recent improvements in sequencing technologies. Among these repeats, a haplotype carrying a particular sequence and shortening of the D4Z4 array on human chromosome 4q35 causes one of the most prevalent forms of muscular dystrophy with autosomal-dominant inheritance, facioscapulohumeral muscular dystrophy (FSHD). Here, we applied a nanopore-based ultra-long read sequencer to sequence a BAC clone containing 13 D4Z4 repeats and flanking regions. We successfully obtained the whole D4Z4 repeat sequence, including the pathogenic gene DUX4 in the last D4Z4 repeat. The estimated sequence accuracy of the total repeat region was 99.8% based on a comparison with the reference sequence. Errors were typically observed between purine or between pyrimidine bases. Further, we analyzed the D4Z4 sequence from publicly available ultra-long whole human genome sequencing data obtained by nanopore sequencing. This technology may be a new tool for studying D4Z4 repeats and pathomechanism of FSHD in the future and has the potential to widen our understanding of subtelomeric regions.

Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information

10.1101/674804 ◽

2019 ◽

Cited By ~ 2

Author(s):

Hui-Su Kim ◽

Sungwon Jeon ◽

Changjae Kim ◽

Yeon Kyung Kim ◽

Yun Sung Cho ◽

...

Keyword(s):

Human Genome ◽

Single Molecule ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Low Cost ◽

Cost Effective ◽

Sequencing Data ◽

SMRT Sequencing ◽

Human Genome Assembly

Abstract Background Long DNA reads produced by single molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short read DNA fragments. For de novo assembly, PacBio and Oxford Nanopore Technologies (ONT) are favorite options. However, PacBio’s SMRT sequencing is expensive for a full human genome assembly and costs over 40,000 USD for 30x coverage as of 2019. ONT PromethION sequencing, on the other hand, is one-twelfth the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio’s SMRT sequencing in relation to the quality. Findings We performed whole genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64x coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mbp and a total genome length of 2.8 Gbp. It was comparable to a KOREF assembly constructed using PacBio at 62x coverage (188 Gbp, 2,695 contigs and N50s of 17.9 Mbp). When we applied Hi-C-derived long-range mapping data, an even higher quality assembly for the 64x coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mbp. Conclusion The pore-based PromethION approach provides a good quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and is more cost-effective than PacBio at comparable quality measurements.

Tandem repeats structure of gel-forming mucin domains could be revealed by SMRT sequencing data

10.21203/rs.3.rs-112828/v2 ◽

2021 ◽

Author(s):

Tiange Lang

Keyword(s):

Single Molecule ◽

Tandem Repeats ◽

The Body ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Coding Region ◽

SMRT Sequencing ◽

Epithelial Surface ◽

Sequencing Technologies ◽

Long Reads

Abstract Mucins are large glycoproteins that cover and protect the epithelial surfaces of the body. Gel-forming mucin domains of mucin genes are rich in proline, threonine, and serine that are heavily glycosylated. These domains show great complexity with tandem repeats (TRs), thus making it difficult to study the sequences. With the coming of single molecule real-time (SMRT) sequencing technologies, we manage to present the sequence structure of mucin domains via SMRT long reads for gel-forming mucins MUC2, MUC5AC, MUC5B and MUC6. Our study shows that for different individuals, single nucleotide polymorphisms (SNPs) could be found in mucin domains of MUC2, MUC5AC, MUC5B and MUC6, while different numbers of tandem repeats could be found in mucin domains of MUC2 and MUC6. Furthermore, we obtain the sequence of the MUC2, MUC5AC, and MUC5B mucin domains in a Chinese individual at accuracy of possibly maximum 99.98%, 99.93%, and 99.76%, respectively. We report a new method to obtain the DNA sequence of gel-forming mucin domains. This method will provide new insights on obtaining the sequence of tandem repeat regions located in coding regions. The sequences obtained with this method give more information for studying gel-forming mucin domains.

Long-read sequencing technology indicates genome-wide effects of non-B DNA on polymerization speed and error rate

10.1101/237461 ◽

2017 ◽

Author(s):

Wilfried M. Guiblet ◽

Marzia A. Cremona ◽

Monika Cechova ◽

Robert S. Harris ◽

Iva Kejnovska ◽

...

Keyword(s):

Human Genome ◽

Single Molecule ◽

Tandem Repeats ◽

Neurological Diseases ◽

Error Rates ◽

Polymerization Kinetics ◽

Sequencing Error ◽

DNA Polymerization ◽

Sequencing Errors ◽

Genome Wide

Abstract DNA conformation may deviate from the classical B-form in ~13% of the human genome. Non-B DNA regulates many cellular processes; however, its effects on DNA polymerization speed and accuracy have not been investigated genome-wide. Such an inquiry is critical for understanding neurological diseases and cancer genome instability. Here we present the first simultaneous examination of DNA polymerization kinetics and errors in the human genome sequenced with Single-Molecule-Real-Time technology. We show that polymerization speed differs between non-B and B-DNA: it decelerates at G-quadruplexes and fluctuates periodically at disease-causing tandem repeats. Analyzing polymerization kinetics profiles, we predict and validate experimentally non-B DNA formation for a novel motif. We demonstrate that several non-B motifs affect sequencing errors (e.g., G-quadruplexes increase error rates) and that sequencing errors are positively associated with polymerase slowdown. Finally, we show that highly divergent G4 motifs have pronounced polymerization slowdown and high sequencing error rates, suggesting similar mechanisms for sequencing errors and germline mutations.

Repeat expansion and methylation state analysis with nanopore sequencing

10.1101/480285 ◽

2018 ◽

Cited By ~ 3

Author(s):

Pay Gießelmann ◽

Björn Brändl ◽

Etienne Raimondeau ◽

Rebecca Bowen ◽

Christian Rohrandt ◽

...

Keyword(s):

Single Molecule ◽

Signal Analysis ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Methylation Status ◽

Target Sequence ◽

Nanopore Sequencing ◽

Repeat Expansions ◽

Enrichment Strategy ◽

Short Tandem

Expansions of short tandem repeats are genetic variants that have been implicated in neuropsychiatric and other disorders, but their assessment remains challenging with current molecular methods. Here, we developed a Cas12a-based enrichment strategy for nanopore sequencing that, combined with a new algorithm for raw signal analysis, enables us to efficiently target, sequence and precisely quantify repeat numbers as well as their DNA methylation status. Taking advantage of these single-molecule nanopore signals therefore provides unprecedented opportunities to study pathological repeat expansions.
