Rapid and ongoing evolution of repetitive sequence structures in human centromeres

Yuta Suzuki; Eugene W. Myers; Shinichi Morishita

doi:10.1126/sciadv.abd9230

Science Advances ◽

10.1126/sciadv.abd9230 ◽

2020 ◽

Vol 6 (50) ◽

pp. eabd9230

Author(s):

Yuta Suzuki ◽

Eugene W. Myers ◽

Shinichi Morishita

Keyword(s):

Sequence Variation ◽

Structural Diversity ◽

Higher Order ◽

Human Populations ◽

Sequence Evolution ◽

Diverse Populations ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Long Read

Our understanding of centromere sequence variation across human populations is limited because centromeres consist of extremely long nested repeat structures, called higher-order repeats, that are challenging to sequence. Here, we analyzed chromosomes 11, 17, and X using long-read sequencing data for 36 individuals from diverse populations, including a Han Chinese trio and 21 Japanese individuals. We revealed substantial structural diversity with many previously unidentified variant higher-order repeats specific to individuals, characterizing rapid, haplotype-specific evolution of human centromeric arrays, while frequent single-nucleotide variants are largely conserved. We found a characteristic pattern shared among prevalent variants in human and chimpanzee. Our findings pave the way for studying sequence evolution in human and primate centromeres.

Long-read Data Revealed Structural Diversity in Human Centromere Sequences

10.1101/784785 ◽

2019 ◽

Author(s):

Yuta Suzuki ◽

Gene Myers ◽

Shinichi Morishita

Keyword(s):

Dna Sequences ◽

Structural Diversity ◽

Higher Order ◽

Human Populations ◽

Sequence Evolution ◽

Single Nucleotide Variants ◽

True Nature ◽

Human Centromere ◽

Long Read ◽

Novel Variant

Abstract. Centromeres invariably serve as the loci of kinetochore assembly in all eukaryotic cells, but their underlying DNA sequences evolve rapidly. Human centromeres are characterized by their extremely repetitive structures, i.e., higher-order repeats, rendering the region one of the most difficult parts of the genome to assess. Consequently, our understanding of centromere sequence variations across human populations is limited. Here, we analyzed chromosomes 11, 17, and X using long sequencing reads of two European and two Asian genomes, and our results show that human centromere sequences exhibit substantial structural diversity, harboring many novel variant higher-order repeats specific to individuals, while frequent single-nucleotide variants are largely conserved. Our findings add another dimension to our knowledge of centromeres, challenging the notion of stable human centromeres. The discovery of such diversity prompts further deep sequencing of human populations to understand the true nature of sequence evolution in human centromeres.

Detecting and phasing minor single-nucleotide variants from long-read sequencing data

Nature Communications ◽

10.1038/s41467-021-23289-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Zhixing Feng ◽

Jose C. Clemente ◽

Brandon Wong ◽

Eric E. Schadt

Keyword(s):

Genetic Heterogeneity ◽

Error Rates ◽

Metagenomic Data ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Technological Limitations

Abstract. Cellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants play an instrumental role in deciphering cellular genetic heterogeneity, but they are still difficult tasks because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, have provided an opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrate that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥0.011%) from long-read metagenomic data.

Co-variation of viral recombination with single nucleotide variants during virus evolution revealed by CoVaMa

10.1101/2021.09.14.460373 ◽

2021 ◽

Author(s):

Shiyi Wang ◽

Stephanea L. Sotcheff ◽

Christian M. Gallardo ◽

Elizabeth Jaworski ◽

Bruce E. Torbett ◽

...

Keyword(s):

Viral Evolution ◽

Flock House Virus ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Viral Recombination ◽

Genomic Deletions ◽

Bioinformatic Tools ◽

Long Read

Abstract. Adaptation of viruses to their environments occurs through the acquisition of both novel single-nucleotide variants (SNVs) and recombination events including insertions, deletions, and duplications. The co-occurrence of SNVs in individual viral genomes during their evolution has been well described. However, unlike covariation of SNVs, the correlation of recombination events with each other or with SNVs has been difficult to study because of their inherent genetic complexity and a lack of bioinformatic tools. Here, we expanded our previously reported CoVaMa pipeline (v0.1) to measure linkage disequilibrium between recombination events and SNVs within both short-read and long-read sequencing datasets. We demonstrate this approach using long-read nanopore sequencing data acquired from Flock House virus (FHV) serially passaged in vitro. We found SNVs that were either correlated or anti-correlated with large genomic deletions generated by nonhomologous recombination that give rise to Defective-RNAs. We also analyzed NGS data from longitudinal HIV samples derived from a patient undergoing antiretroviral therapy who proceeded to virological failure. We found correlations between insertions in the p6Gag and mutations in Gag cleavage sites. This report confirms previous findings and provides insights on novel associations between SNVs and specific recombination events within the viral genome and their role in viral evolution.
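As an illustration of what measuring linkage disequilibrium between an SNV and a recombination event involves, the following is a minimal sketch and not CoVaMa's implementation. It assumes each read that spans both sites has already been reduced to a pair of booleans; the function name `linkage_disequilibrium` and this read encoding are hypothetical choices made for the example.

```python
from typing import Iterable, Tuple


def linkage_disequilibrium(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (D, r_squared) for two binary events observed on the same reads.

    Each element of `pairs` is (has_snv, has_deletion) for one read that
    covers both sites. This encoding is an assumption made for illustration.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("no reads cover both sites")

    # Marginal frequencies of each event and the joint frequency of both.
    p_a = sum(a for a, _ in pairs) / n
    p_b = sum(b for _, b in pairs) / n
    p_ab = sum(a and b for a, b in pairs) / n

    # D is the excess of the joint frequency over independence;
    # r^2 normalizes D^2 by the product of the marginal variances.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


if __name__ == "__main__":
    # Hypothetical counts: the SNV and the deletion mostly occur together.
    reads = ([(True, True)] * 40 + [(False, False)] * 40
             + [(True, False)] * 10 + [(False, True)] * 10)
    d, r2 = linkage_disequilibrium(reads)
    print(f"D = {d:.3f}, r^2 = {r2:.3f}")
```

A positive D indicates that the two events co-occur on reads more often than independence would predict, and a negative D indicates anti-correlation. Long reads matter here because both sites must be observed on the same molecule.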

Detecting and phasing minor single-nucleotide variants from long-read sequencing data

10.1101/2020.09.25.314252 ◽

2020 ◽

Author(s):

Zhixing Feng ◽

Jose Clemente ◽

Brandon Wong ◽

Eric E. Schadt

Keyword(s):

Genetic Heterogeneity ◽

Error Rates ◽

Metagenomic Data ◽

Significant Advance ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read

Abstract. Cellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants (phasing determines whether multiple variants come from the same haplotype) play an instrumental role in deciphering cellular genetic heterogeneity, but remain difficult because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, have provided an unprecedented opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrated that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥ 0.011%) from long-read metagenomic data. Our approach, therefore, presents a significant advance towards the complete deciphering of cellular genetic heterogeneity.

VISOR: a versatile haplotype-aware structural variant simulator for short- and long-read sequencing

Bioinformatics ◽

10.1093/bioinformatics/btz719 ◽

2019 ◽

Author(s):

Davide Bolognini ◽

Ashley Sanders ◽

Jan O Korbel ◽

Alberto Magi ◽

Vladimir Benes ◽

...

Keyword(s):

Single Cell ◽

Supplementary Information ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Cancer Heterogeneity ◽

Long Reads ◽

Long Read ◽

Complex Structural ◽

Error Profiles

Abstract. Summary: VISOR is a tool for haplotype-specific simulations of simple and complex structural variants (SVs). The method is applicable to haploid, diploid or higher ploidy simulations for bulk or single-cell sequencing data. SVs are implanted into FASTA haplotypes at single-basepair resolution, optionally with nearby single-nucleotide variants. Short or long reads are drawn at random from these haplotypes using standard error profiles. Double- or single-stranded data can be simulated and VISOR supports the generation of haplotype-tagged BAM files. The tool further includes methods to interactively visualize simulated variants in single-stranded data. The versatility of VISOR is unmet by comparable tools and it lays the foundation to simulate haplotype-resolved cancer heterogeneity data in bulk or at single-cell resolution. Availability and implementation: VISOR is implemented in Python 3.6, open-source and freely available at https://github.com/davidebolo1993/VISOR. Documentation is available at https://davidebolo1993.github.io/visordoc/. Supplementary information: Supplementary data are available at Bioinformatics online.

Frequency of Important CYP450 Enzyme Gene Polymorphisms in the Iranian Population in Comparison with Other Major Populations: A Comprehensive Review of the Human Data

Journal of Personalized Medicine ◽

10.3390/jpm11080804 ◽

2021 ◽

Vol 11 (8) ◽

pp. 804

Author(s):

Navid Neyshaburinezhad ◽

Hengameh Ghasim ◽

Mohammadreza Rouini ◽

Youssef Daali ◽

Yalda H. Ardakani

Keyword(s):

Cytochrome P450 ◽

Meta Analysis ◽

Clinical Importance ◽

Copy Number Variations ◽

Iranian Population ◽

Human Populations ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Drug Dosing ◽

Linkage Information

Genetic polymorphisms in cytochrome P450 genes can alter the metabolic activity of clinically important medicines. Thus, single nucleotide variants (SNVs) and copy number variations (CNVs) in CYP genes are leading determinants of drug pharmacokinetics and toxicity and serve as pharmacogenetic biomarkers for drug dosing, efficacy, and safety. The distribution of cytochrome P450 alleles differs significantly between populations, with important implications for personalized drug therapy and healthcare programs. To provide a meta-analysis of CYP allele polymorphisms with clinical importance, we brought together whole-genome and exome sequencing data from 800 unrelated individuals from the Iranian population (100 subjects from each of 8 major ethnic groups of Iran) and 63,269 unrelated individuals from five major human populations (EUR, AMR, AFR, EAS and SAS). By integrating these datasets with population-specific linkage information, we derived the frequencies of 140 CYP haplotypes related to 9 important CYP450 isoenzymes (CYP1A2, CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP2E1, CYP3A4 and CYP3A5), giving a large resource for major genetic determinants of drug metabolism. Furthermore, we evaluated the more frequent Iranian alleles and compared the dataset with that of the Caucasian population. Finally, the similarity of the Iranian population SNVs with other populations was investigated.

Implications of Genetic Distance to Reference and De Novo Genome Assembly for Clinical Genomics in Africans

10.1101/2020.09.25.20201780 ◽

2020 ◽

Author(s):

Daniel Shriner ◽

Adebowale Adeyemo ◽

Charles Rotimi

Keyword(s):

Genetic Distance ◽

De Novo ◽

Reference Sequence ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

De Novo Genome Assembly ◽

Single Nucleotide ◽

Clinical Genomics ◽

Advantages And Disadvantages ◽

False Discovery

In clinical genomics, variant calling from short-read sequencing data typically relies on a pan-genomic, universal human reference sequence. A major limitation of this approach is that the number of reads that incorrectly map or fail to map increases as the reads diverge from the reference sequence. In the context of genome sequencing of genetically diverse Africans, we investigate the advantages and disadvantages of using a de novo assembly of the read data as the reference sequence in single sample calling. Conditional on sufficient read depth, the alignment-based and assembly-based approaches yielded comparable sensitivity and false discovery rates for single nucleotide variants when benchmarked against a gold standard call set. The alignment-based approach yielded coverage of an additional 270.8 Mb over which sensitivity was lower and the false discovery rate was higher. Although both approaches detected and missed clinically relevant variants, the assembly-based approach identified more such variants than the alignment-based approach. Of particular relevance to individuals of African descent, the assembly-based approach identified four heterozygous genotypes containing the sickle allele whereas the alignment-based approach identified no occurrences of the sickle allele. Variant annotation using dbSNP and gnomAD identified systematic biases in these databases due to underrepresentation of Africans. Using the counts of homozygous alternate genotypes from the alignment-based approach as a measure of genetic distance to the reference sequence GRCh38.p12, we found that the numbers of misassemblies, total variant sites, potentially novel single nucleotide variants (SNVs), and certain variant classes (e.g., splice acceptor variants, stop loss variants, missense variants, synonymous variants, and variants absent from gnomAD) were significantly correlated with genetic distance.
In contrast, genomic coverage and other variant classes (e.g., ClinVar pathogenic or likely pathogenic variants, start loss variants, stop gain variants, splice donor variants, incomplete terminal codons, variants with CADD score ≥20) were not correlated with genetic distance. With improvement in coverage, the assembly-based approach can offer a viable alternative to the alignment-based approach, with the advantage that it can obviate the need to generate diverse human reference sequences or collections of alternate scaffolds.
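The benchmarking step mentioned in this abstract, sensitivity and false discovery rate against a gold standard call set, can be sketched as follows. This is a generic illustration and not the authors' pipeline: the function name `benchmark_calls` and the representation of a variant as a `(chrom, pos, ref, alt)` tuple are assumptions, and real comparisons usually also normalize variant representation and restrict to confident regions.

```python
import math
from typing import Iterable, Tuple

# Assumed variant key for this sketch: (chromosome, position, ref, alt).
Variant = Tuple[str, int, str, str]


def benchmark_calls(called: Iterable[Variant],
                    truth: Iterable[Variant]) -> Tuple[float, float]:
    """Return (sensitivity, false_discovery_rate) for a call set.

    sensitivity = TP / (TP + FN): fraction of gold-standard variants recovered.
    FDR         = FP / (TP + FP): fraction of calls absent from the gold standard.
    Either value is NaN when its denominator is zero.
    """
    called_set, truth_set = set(called), set(truth)
    tp = len(called_set & truth_set)   # calls confirmed by the gold standard
    fp = len(called_set - truth_set)   # calls not in the gold standard
    fn = len(truth_set - called_set)   # gold-standard variants that were missed

    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    fdr = fp / (tp + fp) if (tp + fp) > 0 else math.nan
    return sensitivity, fdr


if __name__ == "__main__":
    # Hypothetical toy call sets, for illustration only.
    truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")]
    called = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
              ("chr2", 50, "G", "A"), ("chr3", 10, "A", "T")]
    sens, fdr = benchmark_calls(called, truth)
    print(f"sensitivity = {sens:.2f}, FDR = {fdr:.2f}")
```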

Genotyping common, large structural variations in 5,202 genomes using pangenomes, the Giraffe mapper, and the vg toolkit

10.1101/2020.12.04.412486 ◽

2020 ◽

Cited By ~ 1

Author(s):

Jouni Sirén ◽

Jean Monlong ◽

Xian Chang ◽

Adam M. Novak ◽

Jordan M. Eizenga ◽

...

Keyword(s):

Human Populations ◽

Structural Variations ◽

Single Nucleotide ◽

Human Genomes ◽

Genome Wide ◽

Sequence Graph ◽

Long Read ◽

Comparable Accuracy ◽

Single Nucleotide Variations ◽

Allelic Variations

Abstract. We introduce Giraffe, a pangenome short read mapper that can efficiently map to a collection of haplotypes threaded through a sequence graph. Giraffe, part of the variation graph toolkit (vg)1, maps reads to thousands of human genomes at around the same speed that BWA-MEM2 maps reads to a single reference genome, while maintaining comparable accuracy to VG-MAP, vg’s original mapper. We have developed efficient genotyping pipelines using Giraffe. We demonstrate improvements in genotyping for single nucleotide variations (SNVs), insertions and deletions (indels) and structural variations (SVs) genome-wide. We use Giraffe to genotype and phase 167 thousand structural variations ascertained from long read studies in 5,202 human genomes sequenced with short reads, including the complete 1000 Genomes Project dataset, at an average cost of $1.50 per sample. We determine the frequency of these variations in diverse human populations, characterize their complex allelic variations and identify thousands of expression quantitative trait loci (eQTLs) driven by these variations.

1202. Multimodal Sequencing of a Clonal Case Cluster of Carbapenem-Resistant Citrobacter Reveals Unexpectedly Rapid Dynamics of KPC3-Containing Plasmids

Open Forum Infectious Diseases ◽

10.1093/ofid/ofy210.1035 ◽

2018 ◽

Vol 5 (suppl_1) ◽

pp. S364-S364

Author(s):

Roby Bhattacharyya ◽

Alejandro Pironti ◽

Bruce J Walker ◽

Abigail Manson ◽

Virginia Pierce ◽

...

Keyword(s):

Point Mutations ◽

Illumina Miseq ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Single Nucleotide ◽

Carbapenem Resistant ◽

Oxford Nanopore ◽

Close Relationship ◽

Long Read ◽

Carbapenem Resistant Enterobacteriaceae

Abstract. Background: Carbapenem-resistant Enterobacteriaceae (CRE) are a major public health threat. We report four clonally related Citrobacter freundii isolates harboring the blaKPC-3 carbapenemase in April–May 2017 that are nearly identical to a strain from 2014 at the same institution. Despite differing by ≤5 single nucleotide polymorphisms (SNPs), these isolates exhibited dramatic differences in carbapenemase plasmid architecture. Methods: We sequenced four carbapenem-resistant C. freundii isolates from 2017 and compared them with an ongoing CRE surveillance project at our institution. SNPs were identified from Illumina MiSeq data aligned to a reference genome using the variant caller Pilon. Plasmids were assembled from Illumina and Oxford Nanopore sequencing data using Unicycler. Results: The four 2017 isolates differed from one another by 0–5 chromosomal SNPs; two were identical. With one exception, these isolates differed by >38,000 SNPs from 25 C. freundii isolates sequenced from 2013 to 2017 at the same institution for CRE surveillance. The exception was a 2014 isolate that differed by 13–16 SNPs from each 2017 isolate, with 13 SNPs common to all four. Each C. freundii isolate harbored wild-type blaKPC-3. Despite the close relationship among the 2017 cluster, the plasmids harboring the blaKPC-3 genes differed dramatically: the carbapenemase occurred in one of the two different plasmids, with rearrangements between these plasmids across isolates. The related 2014 isolate harbored both plasmids, each with a separate copy of blaKPC-3. No transmission chains were found between any of the affected patients. Conclusion: WGS confirmed clonality among four contemporaneous blaKPC-3-containing C. freundii isolates, and marked similarity with a 2014 isolate, within an institution. That only 13–16 SNPs varied between the 2014 and 2017 isolates suggests durable persistence of the blaKPC-3 gene within this lineage in a hospital ecosystem.
The plasmids harboring these carbapenemase genes proved remarkably plastic, with plasmid loss and rearrangements occurring on the same time scale as two to three chromosomal point mutations. Combining short and long-read sequencing in a case cluster uniquely revealed unexpectedly rapid dynamics of carbapenemase plasmids, providing critical insight into their manner of spread. Disclosures: M. J. Ferraro, SeLux Diagnostics: Scientific Advisor and Shareholder, Consulting fee. D. C. Hooper, SeLux Diagnostics: Scientific Advisor, Consulting fee.

Highly multiplexed, fast and accurate nanopore sequencing for verification of synthetic DNA constructs and sequence libraries

Synthetic Biology ◽

10.1093/synbio/ysz025 ◽

2019 ◽

Vol 4 (1) ◽

Cited By ~ 4

Author(s):

Andrew Currin ◽

Neil Swainston ◽

Mark S Dunstan ◽

Adrian J Jervis ◽

Paul Mulherin ◽

...

Keyword(s):

Synthetic Biology ◽

Dna Sequencing ◽

Cost Effective ◽

Polymorphism Analysis ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Synthetic Dna ◽

Design Build ◽

Hardware Costs

Abstract. Synthetic biology utilizes the Design–Build–Test–Learn pipeline for the engineering of biological systems. Typically, this requires the construction of specifically designed, large and complex DNA assemblies. The availability of cheap DNA synthesis and automation enables high-throughput assembly approaches, which generate a heavy demand for DNA sequencing to verify correctly assembled constructs. Next-generation sequencing is ideally positioned to perform this task; however, with expensive hardware costs and bespoke data analysis requirements, few laboratories utilize this technology in-house. Here a workflow for highly multiplexed sequencing is presented, capable of fast and accurate sequence verification of DNA assemblies using nanopore technology. A novel sample barcoding system using polymerase chain reaction is introduced, and sequencing data are analyzed through a bespoke analysis algorithm. Crucially, this algorithm overcomes the problem of high-error-rate nanopore data (which typically prevents identification of single nucleotide variants) through statistical analysis of strand bias, permitting accurate sequence analysis with single-base resolution. As an example, 576 constructs (6 × 96 well plates) were processed in a single workflow in 72 h (from Escherichia coli colonies to analyzed data). Given our procedure’s low hardware costs and highly multiplexed capability, this provides cost-effective access to powerful DNA sequencing for any laboratory, with applications beyond synthetic biology including directed evolution, single nucleotide polymorphism analysis and gene synthesis.