Short template switch events explain mutation clusters in the human genome

2016
Author(s): Ari Löytynoja, Nick Goldman

Abstract: Resequencing efforts are uncovering the extent of genetic variation in humans and provide data to study the evolutionary processes shaping our genome. One recurring puzzle in both intra- and inter-species studies is the high frequency of complex mutations comprising multiple nearby base substitutions or insertion-deletions. We devised a generalized mutation model of template switching during replication that extends existing models of genome rearrangement, and used this to study the role of template switch events in the origin of such mutation clusters. Applied to the human genome, our model detects thousands of template switch events during the evolution of human and chimp from their common ancestor, and hundreds of events between two independently sequenced human genomes. While many of these are consistent with the template switch mechanism previously proposed for bacteria but not thought significant in higher organisms, our model also identifies new types of mutations that create short inversions, some flanked by paired inverted repeats. The local template switch process can create numerous complex mutation patterns, including hairpin loop structures, and explains multi-nucleotide mutations and compensatory substitutions without invoking positive selection, complicated and speculative mechanisms, or implausible coincidence. Clustered sequence differences are challenging for mapping and variant calling methods, and we show that detection of mutation clusters with current resequencing methodologies is difficult and many erroneous variant annotations exist in human reference data. Template switch events such as those we have uncovered may have been neglected as an explanation for complex mutations because of biases in commonly used analyses. Incorporation of our model into reference-based analysis pipelines and comparisons of de novo-assembled genomes will lead to improved understanding of genome variation and evolution.

2015
Author(s): Justin M Zook, David Catoe, Jennifer McDaniel, Lindsay Vang, Noah Spies, ...

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.


2021
Author(s): Ari Löytynoja

Variation within human genomes is distributed unevenly and variants show spatial clustering. DNA-replication related template switching is a poorly known mutational mechanism capable of causing major chromosomal rearrangements as well as creating short inverted sequence copies that appear as local mutation clusters in sequence comparisons. We reanalyzed haplotype-resolved genome assemblies representing 25 human populations and multinucleotide variants aggregated from 140,000 human sequencing experiments. We found local template switching to explain thousands of complex mutation clusters across the human genome, the loci segregating within and between populations with a small number appearing as de novo mutations. We developed computational tools for genotyping candidate template switch loci using short-read sequencing data and for identification of template switch events using both short-read data and genotype data. These tools will enable building a catalogue of affected loci and studying the cellular mechanisms behind template switching both in healthy organisms and in disease. Strikingly, we noticed that widely-used analysis pipelines for short-read sequencing data - capable of identifying single nucleotide changes - may miss TSM-origin inversions of tens of base pairs, potentially invalidating medical genetic studies searching for causative alleles behind genetic diseases.
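
As a toy illustration of why a short inverted sequence copy shows up as a local mutation cluster in a sequence comparison, the sketch below overwrites a short region with the reverse complement of a nearby stretch and lists the positions that then differ from the original. This is not the authors' model or software; the function names, coordinates, and example sequence are all hypothetical and chosen only for demonstration.

```python
# Toy illustration only: NOT the published template switch model or tools.
# All names and coordinates below are hypothetical.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def toy_template_switch(seq: str, start: int, end: int, src_start: int) -> str:
    """Overwrite seq[start:end] with the reverse complement of an
    equally long stretch that begins at src_start (assumed to model
    copying from the complementary strand in the opposite direction)."""
    length = end - start
    copied = reverse_complement(seq[src_start:src_start + length])
    return seq[:start] + copied + seq[end:]


def mismatch_positions(a: str, b: str) -> list[int]:
    """Positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    ancestral = "AAAACCCCGGGGTTTT"
    # In-place inversion of positions 4..7 (the source is the region itself).
    derived = toy_template_switch(ancestral, start=4, end=8, src_start=4)
    print(derived)
    print(mismatch_positions(ancestral, derived))
```

A single event here produces several adjacent differences at once, which a base-by-base comparison would otherwise read as multiple independent substitutions.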


2019
Vol 35 (2)
pp. 105-118
Author(s): Vinh Le

The advent of genomic technologies has led to the current genomic era. Large-scale human genome projects have resulted in a huge amount of genomic data. Analyzing human genomes is a challenging task involving several key steps: short-read alignment, variant calling, and variant annotation. In this paper, the state-of-the-art computational methods and databases for each step are analyzed to suggest a practical and efficient guideline for whole human genome analyses. This paper also discusses frameworks to combine variants from various genome analysis pipelines to obtain reliable variants. Finally, we address the advantages as well as the discordances of widely used variant annotation methods for evaluating the clinical significance of variants. The review will empower bioinformaticians to efficiently perform human genome analyses and, more importantly, help genetic consultants understand and properly interpret mutations for clinical purposes.


2019
Author(s): Gan Ai, Kun Yang, Yuee Tian, Wenwu Ye, Hai Zhu, ...

Abstract: Widespread in oomycetes, the RXLR effector features conserved RXLR-dEER motifs at its N terminus. Every known Phytophthora or Hyaloperonospora pathogen harbors hundreds of RXLRs. In Pythium species, however, no RXLR effector has been characterized yet. Here, we developed a stringent method for de novo identification of RXLRs and characterized 359 putative RXLR effectors from nine tested Pythium species. Phylogenetic analysis revealed a single superfamily formed by all oomycetous RXLRs, suggesting that they descend from a common ancestor. RXLR effectors from Pythium and Phytophthora species exhibited similar sequence features, protein structures and genome locations. In particular, the mosquito biological agent P. guiyangense contains a significantly larger RXLR repertoire than the other eight Pythium species examined, which may result from gene duplication and genome rearrangement events as indicated by synteny analysis. Expression pattern analysis of RXLR-encoding genes in the plant pathogen P. ultimum detected transcripts from the vast majority of predicted RXLRs, with some of them being induced at infection stages. One such RXLR showed necrosis-inducing activity. Furthermore, all predicted RXLRs were cloned from two biocontrol agents, P. oligandrum and P. periplocum. Three of them were found to encode effectors inducing defense response in Nicotiana benthamiana. Taken together, our findings represent the first complete synopsis of Pythium RXLR effectors, which provides critical clues on their evolutionary patterns as well as the mechanisms of their interactions with diverse hosts.
Author summary: Pathogens from the Pythium genus are widespread across multiple ecological niches. Most of them are soilborne plant pathogens whereas others cause infectious diseases in mammals. Some Pythium species can be used as biocontrol agents for plant diseases or mosquito management. Although phylogenetically close oomycete pathogens secrete RXLR effectors to enable infection, no RXLR protein was previously characterized in any Pythium species. Here we developed a stringent method to predict Pythium RXLR effectors and compared them with known RXLRs from other species. All oomycetous RXLRs form a huge superfamily, which indicates they may share a common ancestor. Our sequence analysis results suggest that the expansion of RXLR repertoire results from gene duplication and genome recombination events. We further demonstrated that most predicted Pythium RXLRs can be transcribed and some of them encode effectors exhibiting pathogenic or defense-inducing activities. This work expands our understanding of RXLR evolution in oomycetes in general, and provides novel insights into the molecular interactions between Pythium pathogens and their diverse hosts.


2020
Author(s): Beth Osia, Thamer Alsulaiman, Tyler Jackson, Juraj Kramara, Suely Oliveira, ...

Abstract: Microhomology-mediated break-induced replication (MMBIR) is a mechanism of polymerase template switching at microhomology, which can produce complex genomic rearrangements (CGRs), underlies neurological and metabolic diseases, and contributes to cancer development. Yet, the extent of MMBIR activity in genomes is poorly understood due to difficulty in directly identifying MMBIR events by whole genome sequencing (WGS). Here, by using our newly developed MMBSearch software, we directly detect MMBIR events in human genomes and report substantial differences in frequency and complexity of MMBIR events between normal and cancer cells. MMBIR events appear only as germline variants in normal human fibroblast cells but readily accumulate de novo across several cancer types. Detailed analysis of MMBIR mutations in lung adenocarcinomas revealed MMBIR-initiated chromosome fusions that disrupted potential tumor suppressor genes and induced CGRs. Our findings document MMBIR as a trigger for widespread genomic instability and highlight MMBIR as a potential driver of tumor evolution.


2019
Author(s): Kishwar Shafin, Trevor Pesout, Ryan Lorig-Roach, Marina Haukness, Hugh E. Olsen, ...

Abstract: Present workflows for producing human genome assemblies from long-read technologies have cost and production time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63x coverage, 42 Kb read N50, 90% median read identity and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta - a de novo long read assembler, and MarginPolish & HELEN - a suite of nanopore assembly polishing algorithms. On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid and trio-binned human samples in terms of accuracy, cost, and time and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.
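
The pairing of 99.9% identity with QV30 matches the usual Phred-scaled convention, Q = -10 log10(1 - identity). The short sketch below assumes that convention is the one intended; the function name is illustrative and is not part of Shasta, MarginPolish, or HELEN.

```python
import math


def identity_to_qv(identity: float) -> float:
    """Convert a fractional sequence identity (0 <= identity < 1) into a
    Phred-scaled quality value, assuming Q = -10 * log10(error rate)."""
    if not 0.0 <= identity < 1.0:
        raise ValueError("identity must be in [0, 1)")
    error_rate = 1.0 - identity
    return -10.0 * math.log10(error_rate)


if __name__ == "__main__":
    # 99.9% identity corresponds to roughly one error per 1,000 bases.
    print(round(identity_to_qv(0.999), 2))
```

Under the same convention, each additional 10 QV units corresponds to a tenfold lower error rate.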


2019
Author(s): Mitchell R. Vollger, Glennis A. Logsdon, Peter A. Audano, Arvis Sulovari, David Porubsky, ...

Abstract: The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective stand-alone technology for de novo assembly of human genomes.


2021
Vol 14 (1)
Author(s): Brianna Sierra Chrisman, Kelley Paskov, Nate Stockham, Kevin Tabatabaei, Jae-Yoon Jung, ...

Abstract: The evolutionary dynamics of SARS-CoV-2 have been carefully monitored since the COVID-19 pandemic began in December 2019. However, analysis has focused primarily on single nucleotide polymorphisms and largely ignored the role of insertions and deletions (indels) as well as recombination in SARS-CoV-2 evolution. Using sequences from the GISAID database, we catalogue over 100 insertions and deletions in the SARS-CoV-2 consensus sequences. We hypothesize that these indels are artifacts of recombination events between SARS-CoV-2 replicates whereby RNA-dependent RNA polymerase (RdRp) re-associates with a homologous template at a different locus (“imperfect homologous recombination”). We provide several independent pieces of evidence that suggest this. (1) The indels from the GISAID consensus sequences are clustered at specific regions of the genome. (2) These regions are also enriched for 5’ and 3’ breakpoints in the transcription regulatory site (TRS) independent transcriptome, presumably sites of RdRp template-switching. (3) Within raw reads, these indel hotspots have cases of both high intra-host heterogeneity and intra-host homogeneity, suggesting that these indels are both consequences of de novo recombination events within a host and artifacts of previous recombination. We briefly analyze the indels in the context of RNA secondary structure, noting that indels preferentially occur in “arms” and loop structures of the predicted folded RNA, suggesting that secondary structure may be a mechanism for TRS-independent template-switching in SARS-CoV-2 or other coronaviruses. These insights into the relationship between structural variation and recombination in SARS-CoV-2 can improve our reconstructions of the SARS-CoV-2 evolutionary history as well as our understanding of the process of RdRp template-switching in RNA viruses.


F1000Research
2018
Vol 7
pp. 1391
Author(s): Evan Biederstedt, Jeffrey C. Oliver, Nancy F. Hansen, Aarti Jajoo, Nathan Dunn, ...

Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. Here we present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.


2021
Author(s): Tamanna Yasmin, Phil Grayson, Margaret F. Docker, Sara V. Good

The sea lamprey genome undergoes programmed genome rearrangement (PGR) in which ~20% is jettisoned from somatic cells soon after fertilization. Although the role of PGR in embryonic development has been studied, the role of the germline-specific region (GSR) in gonad development is unknown. We analysed RNA-sequence data from 28 sea lamprey gonads sampled across life-history stages, generated a genome-guided de novo superTranscriptome with annotations, and identified genes in the GSR. We found that the 638 genes in the GSR are enriched for reproductive processes, exhibit 36x greater odds of being expressed in testes than ovaries, show little evidence of conserved synteny with other chordates, and most have putative paralogues in the GSR and/or somatic genomes. Further, several of these genes play known roles in sex determination and differentiation in other vertebrates. We conclude that the GSR of sea lamprey plays an important role in testicular differentiation and potentially sex determination.

