Comprehensive Analysis of Human Subtelomeres by Whole Genome Mapping

Mapping Intimacies ◽

10.1101/728519 ◽

2019 ◽

Author(s):

Eleanor Young ◽

Heba Z. Abid ◽

Pui-Yan Kwok ◽

Harold Riethman ◽

Ming Xiao

Keyword(s):

Long Range ◽

Genome Mapping ◽

Subtelomeric Region ◽

Reference Sequence ◽

Segmental Duplications ◽

Structural Variations ◽

Sequence Characterization ◽

Current Reference ◽

Sequencing Technologies ◽

Chromosome Arm

AbstractDetailed comprehensive knowledge of the structures of individual long-range telomere-terminal haplotypes are needed to understand their impact on telomere function, and to delineate the population structure and evolution of subtelomere regions. However, the abundance of large evolutionarily recent segmental duplications and high levels of large structural variations have complicated both the mapping and sequence characterization of human subtelomere regions. Here, we use high throughput optical mapping of large single DNA molecules in nanochannel arrays for 154 human genomes from 26 populations to present a comprehensive look at human subtelomere structure and variation. The results catalog many novel long-range subtelomere haplotypes and determine the frequencies and contexts of specific subtelomeric duplicons on each chromosome arm, helping to clarify the currently ambiguous nature of many specific subtelomere structures as represented in the current reference sequence (HG38). The organization and content of some duplicons in subtelomeres appear to show both chromosome arm and population-specific trends. Based upon these trends we estimate a timeline for the spread of these duplication blocks.Author SummaryThe ends of human chromosomes have caps called telomeres that are essential. These telomeres are influenced by the portions of DNA next to them, a region known as the subtelomere. We need to better understand the subtelomeric region to understand how it impacts the telomeres. This subtelomeric region is not well described in the current references. This is due to large variations in this region and portions that are repeated many times, making current sequencing technologies struggle to capture these regions. Many of these variations are evolutionary recent. Here we use 154 different samples from the 26 geographic regions of the world to gain a better understanding of the variation in these regions. We found many new haplotypes and clarified the haplotypes existing in the current reference. We then examined population and chromosome specific trends.

MUM&Co: accurate detection of all SV types through whole-genome alignment

Bioinformatics ◽

10.1093/bioinformatics/btaa115 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3242-3243 ◽

Cited By ~ 2

Author(s):

Samuel O’Donnell ◽

Gilles Fischer

Keyword(s):

De Novo ◽

Supplementary Information ◽

Genome Alignment ◽

Whole Genome ◽

Structural Variations ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Human Genomes ◽

Whole Genome Alignment ◽

Primary Output

Abstract Summary MUM&Co is a single bash script to detect structural variations (SVs) utilizing whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50 bp. Its versatility depends upon the WGA and therefore benefits from contiguous de-novo assemblies generated by third generation sequencing technologies. Benchmarked against five WGA SV-calling tools, MUM&Co outperforms all tools on simulated SVs in yeast, plant and human genomes and performs similarly in two real human datasets. Additionally, MUM&Co is particularly unique in its ability to find inversions in both simulated and real datasets. Lastly, MUM&Co’s primary output is an intuitive tabulated file containing a list of SVs with only necessary genomic details. Availability and implementation https://github.com/SAMtoBAM/MUMandCo. Supplementary information Supplementary data are available at Bioinformatics online.

RECORD: Reference-Assisted Genome Assembly for Closely Related Genomes

International Journal of Genomics ◽

10.1155/2015/563482 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Krisztian Buza ◽

Bartek Wilczynski ◽

Norbert Dojer

Keyword(s):

Reference Genome ◽

De Novo ◽

Real Data ◽

Reference Sequence ◽

Individual Genome ◽

Single Experiment ◽

Sequencing Technologies ◽

Sequencing Cost ◽

The Individual ◽

Assembly Software

Background. Next-generation sequencing technologies are now producing multiple times the genome size in total reads from a single experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and the reference genome is used.Results. We provide a new approach that allows researchers to reconstruct genomes very closely related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate new, modified reference sequence that is closer to the actual sequenced genome and has a full coverage. In this paper, we describe our approach and test its implementation called RECORD. We evaluate RECORD on both simulated and real data. We made our software publicly available on sourceforge.Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly software.

Genome assembly of the maize inbred line A188 provides a new reference genome for functional genomics

10.1101/2021.03.15.435372 ◽

2021 ◽

Author(s):

Fei Ge ◽

Jingtao Qu ◽

Peng Liu ◽

Lang Pan ◽

Chaoying Zou ◽

...

Keyword(s):

Single Molecule ◽

Inbred Line ◽

Genome Mapping ◽

Maize Inbred Line ◽

Sequencing Data ◽

Structural Variations ◽

Single Molecule Sequencing ◽

Maize Genetic ◽

Induction Ratio ◽

Phenotypic Variations

Heretofore, little is known about the mechanism underlying the genotype-dependence of embryonic callus (EC) induction, which has severely inhibited the development of maize genetic engineering. Here, we report the genome sequence and annotation of a maize inbred line with high EC induction ratio, A188, which is assembled from single-molecule sequencing and optical genome mapping. We assembled a 2,210 Mb genome with a scaffold N50 size of 11.61 million bases (Mb), compared to those of 9.73 Mb for B73 and 10.2 Mb for Mo17. Comparative analysis revealed that ~30% of the predicted A188 genes had large structural variations to B73, Mo17 and W22 genomes, which caused considerable protein divergence and might lead to phenotypic variations between the four inbred lines. Combining our new A188 genome, previously reported QTLs and RNA sequencing data, we reveal 8 large structural variation genes and 4 differentially expressed genes playing potential roles in EC induction.

The design and construction of reference pangenome graphs with minigraph

Genome Biology ◽

10.1186/s13059-020-02168-z ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 7

Author(s):

Heng Li ◽

Xiaowen Feng ◽

Chong Chu

Keyword(s):

Data Model ◽

Reference Genome ◽

Structural Variants ◽

Current Reference ◽

Sequencing Technologies ◽

Recent Advances ◽

Multiple Genomes ◽

Design And Construction

Abstract The recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinate of the linear reference genome. We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.

LongAGE: defining breakpoints of genomic structural variants through optimal and memory efficient alignments of long reads

Bioinformatics ◽

10.1093/bioinformatics/btaa703 ◽

2020 ◽

Author(s):

Quang Tran ◽

Alexej Abyzov

Keyword(s):

Copy Number Variants ◽

Supplementary Information ◽

Segmental Duplications ◽

Structural Variations ◽

Single Nucleotide ◽

Long Reads ◽

Nucleotide Resolution ◽

Single Nucleotide Resolution ◽

Genomic Structural Variants ◽

Memory Efficient

Abstract Summary Defining the precise location of structural variations (SVs) at single-nucleotide breakpoint resolution is a challenging problem due to large gaps in alignment. Previously, Alignment with Gap Excision (AGE) enabled us to define breakpoints of SVs at single-nucleotide resolution; however, AGE requires a vast amount of memory when aligning a pair of long sequences. To address this, we developed a memory-efficient implementation—LongAGE—based on the classical Hirschberg algorithm. We demonstrate an application of LongAGE for resolving breakpoints of SVs embedded into segmental duplications on Pacific Biosciences (PacBio) reads that can be longer than 10 kb. Furthermore, we observed different breakpoints for a deletion and a duplication in the same locus, providing direct evidence that such multi-allelic copy number variants (mCNVs) arise from two or more independent ancestral mutations. Availability and implementation LongAGE is implemented in C++ and available on Github at https://github.com/Coaxecva/LongAGE. Supplementary information Supplementary data are available at Bioinformatics online.

Shot-Gun Sequencing Strategy for Long-Range Genome Mapping: A Pilot Study

Genomics ◽

10.1006/geno.1994.1307 ◽

1994 ◽

Vol 21 (3) ◽

pp. 495-500 ◽

Cited By ~ 13

Author(s):

Eugene R. Zabarovsky ◽

Vladimir I. Kashuba ◽

Bertil Pettersson ◽

Nikolai Petrov ◽

Vladimir Zakharyev ◽

...

Keyword(s):

Pilot Study ◽

Long Range ◽

Genome Mapping ◽

Sequencing Strategy

ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs

10.1101/2020.01.13.905240 ◽

2020 ◽

Author(s):

Lauren Coombe ◽

Vladimir Nikolić ◽

Justin Chu ◽

Inanc Birol ◽

René L. Warren

Keyword(s):

Reference Sequence ◽

Biological Research ◽

Closely Related Species ◽

Draft Assembly ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

Genome Assemblies ◽

High Quality Genome ◽

Reference Human Genome

AbstractSummaryThe ability to generate high-quality genome sequences is cornerstone to modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short read assembly with a draft long read assembly, and a draft assembly with an assembly from a closely-related species. When scaffolding a human short read assembly using the reference human genome or a long read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 m, using less than 11 GB of RAM. Compared to existing reference-guided assemblers, ntJoin generates highly contiguous assemblies faster and using less memory.Availability and implementationntJoin is written in C++ and Python, and is freely available at https://github.com/bcgsc/[email protected]

Illumina TruSeq synthetic long-reads empowerde novoassembly and resolve complex, highly repetitive transposable elements

10.1101/001834 ◽

2014 ◽

Cited By ~ 1

Author(s):

Rajiv C McCoy ◽

Ryan W Taylor ◽

Timothy A Blauwkamp ◽

Joanna L Kelley ◽

Michael Kertesz ◽

...

Keyword(s):

Transposable Elements ◽

Reference Genome ◽

De Novo ◽

Model Organism ◽

Genomic Analysis ◽

High Sequence Identity ◽

Current Reference ◽

Sequencing Technologies ◽

Long Reads ◽

Whole Genomes

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including thede novoassembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or present in complex genomic arrangements. While TEs strongly affect genome function and evolution, most currentde novoassembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly parallel library preparation and local assembly of short read data and achieve lengths of 1.5-18.5 Kbp with an extremely low error rate (∼0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organismDrosophila melanogaster(reference genome strainy;cn,bw,sp) achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 of annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long- reads, and likely other methods that generate long reads, offer a powerful approach to improvede novoassemblies of whole genomes.

Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

10.21203/rs.3.rs-712747/v1 ◽

2021 ◽

Author(s):

Arang Rhie ◽

Ann Mc Cartney ◽

Kishwar Shafin ◽

Michael Alonge ◽

Andrey Bzikadze ◽

...

Keyword(s):

Genome Assembly ◽

Tandem Repeats ◽

Hydatidiform Mole ◽

Segmental Duplications ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Human Genome Assembly ◽

Long Read ◽

Genome Assemblies ◽

Oxford Nanopore Technologies

Abstract Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies

Telomere-to-telomere assembly of a complete human X chromosome

10.1101/735928 ◽

2019 ◽

Cited By ~ 43

Author(s):

Karen H. Miga ◽

Sergey Koren ◽

Arang Rhie ◽

Mitchell R. Vollger ◽

Ariel Gershman ◽

...

Keyword(s):

Human Genome ◽

X Chromosome ◽

Satellite Dna ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Hydatidiform Mole ◽

Segmental Duplications ◽

High Coverage ◽

Current Reference

After nearly two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has been finished end to end, and hundreds of unresolved gaps persist 1,2. The remaining gaps include ribosomal rDNA arrays, large near-identical segmental duplications, and satellite DNA arrays. These regions harbor largely unexplored variation of unknown consequence, and their absence from the current reference genome can lead to experimental artifacts and hide true variants when re-sequencing additional human genomes. Here we present a de novo human genome assembly that surpasses the continuity of GRCh38 2, along with the first gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome 3, we reconstructed the ∼2.8 megabase centromeric satellite DNA array and closed all 29 remaining gaps in the current reference, including new sequence from the human pseudoautosomal regions and cancer-testis ampliconic gene families (CT-X and GAGE). This complete chromosome X, combined with the ultra-long nanopore data, also allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time. These results demonstrate that finishing the human genome is now within reach and will enable ongoing efforts to complete the remaining human chromosomes.