scholarly journals Rapid and ongoing evolution of repetitive sequence structures in human centromeres

2020 ◽  
Vol 6 (50) ◽  
pp. eabd9230
Author(s):  
Yuta Suzuki ◽  
Eugene W. Myers ◽  
Shinichi Morishita

Our understanding of centromere sequence variation across human populations is limited by its extremely long nested repeat structures called higher-order repeats that are challenging to sequence. Here, we analyzed chromosomes 11, 17, and X using long-read sequencing data for 36 individuals from diverse populations including a Han Chinese trio and 21 Japanese. We revealed substantial structural diversity with many previously unidentified variant higher-order repeats specific to individuals characterizing rapid, haplotype-specific evolution of human centromeric arrays, while frequent single-nucleotide variants are largely conserved. We found a characteristic pattern shared among prevalent variants in human and chimpanzee. Our findings pave the way for studying sequence evolution in human and primate centromeres.

2019 ◽  
Author(s):  
Yuta Suzuki ◽  
Gene Myers ◽  
Shinichi Morishita

ABSTRACTCentromeres invariably serve as the loci of kinetochore assembly in all eukaryotic cells, but their underlying DNA sequences evolve rapidly. Human centromeres are characterized by their extremely repetitive structures, i.e., higher-order repeats, rendering the region one of the most difficult parts of the genome to assess. Consequently, our understanding of centromere sequence variations across human populations is limited. Here, we analyzed chromosomes 11, 17, and X using long sequencing reads of two European and two Asian genomes, and our results show that human centromere sequences exhibit substantial structural diversity, harboring many novel variant higher-order repeats specific to individuals, while frequent single-nucleotide variants are largely conserved. Our findings add another dimension to our knowledge of centromeres, challenging the notion of stable human centromeres. The discovery of such diversity prompts further deep sequencing of human populations to understand the true nature of sequence evolution in human centromeres.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Zhixing Feng ◽  
Jose C. Clemente ◽  
Brandon Wong ◽  
Eric E. Schadt

AbstractCellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants play an instrumental role in deciphering cellular genetic heterogeneity, but they are still difficult tasks because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, provide an opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrate that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥0.011%) from long-read metagenomic data.


2021 ◽  
Author(s):  
Shiyi Wang ◽  
Stephanea L. Sotcheff ◽  
Christian M. Gallardo ◽  
Elizabeth Jaworski ◽  
Bruce E. Torbett ◽  
...  

AbstractAdaptation of viruses to their environments occurs through the acquisition of both novel Single-Nucleotide Variants (SNV) and recombination events including insertions, deletions, and duplications. The co-occurrence of SNVs in individual viral genomes during their evolution has been well-described. However, unlike covariation of SNVs, studying the correlation between recombination events with each other or with SNVs has been hampered by their inherent genetic complexity and a lack of bioinformatic tools. Here, we expanded our previously reported CoVaMa pipeline (v0.1) to measure linkage disequilibrium between recombination events and SNVs within both short-read and long-read sequencing datasets. We demonstrate this approach using long-read nanopore sequencing data acquired from Flock House virus (FHV) serially passaged in vitro. We found SNVs that were either correlated or anti-correlated with large genomic deletions generated by nonhomologous recombination that give rise to Defective-RNAs. We also analyzed NGS data from longitudinal HIV samples derived from a patient undergoing antiretroviral therapy who proceeded to virological failure. We found correlations between insertions in the p6Gag and mutations in Gag cleavage sites. This report confirms previous findings and provides insights on novel associations between SNVs and specific recombination events within the viral genome and their role in viral evolution.


2020 ◽  
Author(s):  
Zhixing Feng ◽  
Jose Clemente ◽  
Brandon Wong ◽  
Eric E. Schadt

AbstractCellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, co-infection of multiple pathogens. Detecting and phasing minor variants, which is to determine whether multiple variants are from the same haplotype, play an instrumental role in deciphering cellular genetic heterogeneity, but are still difficult because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, have provided an unprecedented opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrated that iGDA can accurately reconstruct haplotypes in closely-related strains of the same species (divergence ≥ 0.011%) from long-read metagenomic data. Our approach, therefore, presents a significant advance towards the complete deciphering of cellular genetic heterogeneity.


2019 ◽  
Author(s):  
Davide Bolognini ◽  
Ashley Sanders ◽  
Jan O Korbel ◽  
Alberto Magi ◽  
Vladimir Benes ◽  
...  

Abstract Summary VISOR is a tool for haplotype-specific simulations of simple and complex structural variants (SVs). The method is applicable to haploid, diploid or higher ploidy simulations for bulk or single-cell sequencing data. SVs are implanted into FASTA haplotypes at single-basepair resolution, optionally with nearby single-nucleotide variants. Short or long reads are drawn at random from these haplotypes using standard error profiles. Double- or single-stranded data can be simulated and VISOR supports the generation of haplotype-tagged BAM files. The tool further includes methods to interactively visualize simulated variants in single-stranded data. The versatility of VISOR is unmet by comparable tools and it lays the foundation to simulate haplotype-resolved cancer heterogeneity data in bulk or at single-cell resolution. Availability and implementation VISOR is implemented in python 3.6, open-source and freely available at https://github.com/davidebolo1993/VISOR. Documentation is available at https://davidebolo1993.github.io/visordoc/. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 11 (8) ◽  
pp. 804
Author(s):  
Navid Neyshaburinezhad ◽  
Hengameh Ghasim ◽  
Mohammadreza Rouini ◽  
Youssef Daali ◽  
Yalda H. Ardakani

Genetic polymorphisms in cytochrome P450 genes can cause alteration in metabolic activity of clinically important medicines. Thus, single nucleotide variants (SNVs) and copy number variations (CNVs) in CYP genes are leading factors of drug pharmacokinetics and toxicity and form pharmacogenetics biomarkers for drug dosing, efficacy, and safety. The distribution of cytochrome P450 alleles differs significantly between populations with important implications for personalized drug therapy and healthcare programs. To provide a meta-analysis of CYP allele polymorphisms with clinical importance, we brought together whole-genome and exome sequencing data from 800 unrelated individuals of Iranian population (100 subjects from 8 major ethnics of Iran) and 63,269 unrelated individuals of five major human populations (EUR, AMR, AFR, EAS and SAS). By integrating these datasets with population-specific linkage information, we evolved the frequencies of 140 CYP haplotypes related to 9 important CYP450 isoenzymes (CYP1A2, CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP2E1, CYP3A4 and CYP3A5) giving a large resource for major genetic determinants of drug metabolism. Furthermore, we evaluated the more frequent Iranian alleles and compared the dataset with the Caucasian race. Finally, the similarity of the Iranian population SNVs with other populations was investigated.


2020 ◽  
Author(s):  
Daniel Shriner ◽  
Adebowale Adeyemo ◽  
Charles Rotimi

In clinical genomics, variant calling from short-read sequencing data typically relies on a pan-genomic, universal human reference sequence. A major limitation of this approach is that the number of reads that incorrectly map or fail to map increase as the reads diverge from the reference sequence. In the context of genome sequencing of genetically diverse Africans, we investigate the advantages and disadvantages of using a de novo assembly of the read data as the reference sequence in single sample calling. Conditional on sufficient read depth, the alignment-based and assembly-based approaches yielded comparable sensitivity and false discovery rates for single nucleotide variants when benchmarked against a gold standard call set. The alignment-based approach yielded coverage of an additional 270.8 Mb over which sensitivity was lower and the false discovery rate was higher. Although both approaches detected and missed clinically relevant variants, the assembly-based approach identified more such variants than the alignment-based approach. Of particular relevance to individuals of African descent, the assembly-based approach identified four heterozygous genotypes containing the sickle allele whereas the alignment-based approach identified no occurrences of the sickle allele. Variant annotation using dbSNP and gnomAD identified systematic biases in these databases due to underrepresentation of Africans. Using the counts of homozygous alternate genotypes from the alignment-based approach as a measure of genetic distance to the reference sequence GRCh38.p12, we found that the numbers of misassemblies, total variant sites, potentially novel single nucleotide variants (SNVs), and certain variant classes (e.g., splice acceptor variants, stop loss variants, missense variants, synonymous variants, and variants absent from gnomAD) were significantly correlated with genetic distance. In contrast, genomic coverage and other variant classes (e.g., ClinVar pathogenic or likely pathogenic variants, start loss variants, stop gain variants, splice donor variants, incomplete terminal codons, variants with CADD score ≥20) were not correlated with genetic distance. With improvement in coverage, the assembly-based approach can offer a viable alternative to the alignment-based approach, with the advantage that it can obviate the need to generate diverse human reference sequences or collections of alternate scaffolds.


Author(s):  
Jouni Sirén ◽  
Jean Monlong ◽  
Xian Chang ◽  
Adam M. Novak ◽  
Jordan M. Eizenga ◽  
...  

ABSTRACTWe introduce Giraffe, a pangenome short read mapper that can efficiently map to a collection of haplotypes threaded through a sequence graph. Giraffe, part of the variation graph toolkit (vg)1, maps reads to thousands of human genomes at around the same speed BWA-MEM2 maps reads to a single reference genome, while maintaining comparable accuracy to VG-MAP, vg’s original mapper. We have developed efficient genotyping pipelines using Giraffe. We demonstrate improvements in genotyping for single nucleotide variations (SNVs), insertions and deletions (indels) and structural variations (SVs) genome-wide. We use Giraffe to genotype and phase 167 thousands structural variations ascertained from long read studies in 5,202 human genomes sequenced with short reads, including the complete 1000 Genomes Project dataset, at an average cost of $1.50 per sample. We determine the frequency of these variations in diverse human populations, characterize their complex allelic variations and identify thousands of expression quantitative trait loci (eQTLs) driven by these variations.


2018 ◽  
Vol 5 (suppl_1) ◽  
pp. S364-S364
Author(s):  
Roby Bhattacharyya ◽  
Alejandro Pironti ◽  
Bruce J Walker ◽  
Abigail Manson ◽  
Virginia Pierce ◽  
...  

Abstract Background Carbapenem-resistant Enterobacteriaceae (CRE) are a major public health threat. We report four clonally related Citrobacter freundii isolates harboring the blaKPC-3 carbapenemase in April–May 2017 that are nearly identical to a strain from 2014 at the same institution. Despite differing by ≤5 single nucleotide polymorphisms (SNPs), these isolates exhibited dramatic differences in carbapenemase plasmid architecture. Methods We sequenced four carbapenem-resistant C. freundii isolates from 2017 and compared them with an ongoing CRE surveillance project at our institution. SNPs were identified from Illumina MiSeq data aligned to a reference genome using the variant caller Pilon. Plasmids were assembled from Illumina and Oxford Nanopore sequencing data using Unicycler. Results The four 2017 isolates differed from one another by 0–5 chromosomal SNPs; two were identical. With one exception, these isolates differed by >38,000 SNPs from 25 C. freundii isolates sequenced from 2013 to 2017 at the same institution for CRE surveillance. The exception was a 2014 isolate that differed by 13–16 SNPs from each 2017 isolate, with 13 SNPs common to all four. Each C. freundii isolate harbored wild-type blaKPC-3. Despite the close relationship among the 2017 cluster, the plasmids harboring the blaKPC-3 genes differed dramatically: the carbapenemase occurred in one of the two different plasmids, with rearrangements between these plasmids across isolates. The related 2014 isolate harbored both plasmids, each with a separate copy of blaKPC-3. No transmission chains were found between any of the affected patients. Conclusion WGS confirmed clonality among four contemporaneous blaKPC-3-containing C. freundii isolates, and marked similarity with a 2014 isolate, within an institution. That only 13–16 SNPs varied between the 2014 and 2017 isolates suggests durable persistence of the blaKPC-3 gene within this lineage in a hospital ecosystem. The plasmids harboring these carbapenemase genes proved remarkably plastic, with plasmid loss and rearrangements occurring on the same time scale as two to three chromosomal point mutations. Combining short and long-read sequencing in a case cluster uniquely revealed unexpectedly rapid dynamics of carbapenemase plasmids, providing critical insight into their manner of spread. Disclosures M. J. Ferraro, SeLux Diagnostics: Scientific Advisor and Shareholder, Consulting fee. D. C. Hooper, SeLux Diagnostics: Scientific Advisor, Consulting fee.


2019 ◽  
Vol 4 (1) ◽  
Author(s):  
Andrew Currin ◽  
Neil Swainston ◽  
Mark S Dunstan ◽  
Adrian J Jervis ◽  
Paul Mulherin ◽  
...  

Abstract Synthetic biology utilizes the Design–Build–Test–Learn pipeline for the engineering of biological systems. Typically, this requires the construction of specifically designed, large and complex DNA assemblies. The availability of cheap DNA synthesis and automation enables high-throughput assembly approaches, which generates a heavy demand for DNA sequencing to verify correctly assembled constructs. Next-generation sequencing is ideally positioned to perform this task, however with expensive hardware costs and bespoke data analysis requirements few laboratories utilize this technology in-house. Here a workflow for highly multiplexed sequencing is presented, capable of fast and accurate sequence verification of DNA assemblies using nanopore technology. A novel sample barcoding system using polymerase chain reaction is introduced, and sequencing data are analyzed through a bespoke analysis algorithm. Crucially, this algorithm overcomes the problem of high-error rate nanopore data (which typically prevents identification of single nucleotide variants) through statistical analysis of strand bias, permitting accurate sequence analysis with single-base resolution. As an example, 576 constructs (6 × 96 well plates) were processed in a single workflow in 72 h (from Escherichia coli colonies to analyzed data). Given our procedure’s low hardware costs and highly multiplexed capability, this provides cost-effective access to powerful DNA sequencing for any laboratory, with applications beyond synthetic biology including directed evolution, single nucleotide polymorphism analysis and gene synthesis.


Sign in / Sign up

Export Citation Format

Share Document