Drastic mutations can be hidden in short-read mapping: thousands of mutation clusters in human genome are explained by short-range template switching

Mapping Intimacies ◽

10.1101/2021.11.26.470150 ◽

2021 ◽

Author(s):

Ari Löytynoja

Keyword(s):

Human Genome ◽

De Novo ◽

Human Populations ◽

Template Switching ◽

Sequencing Data ◽

De Novo Mutations ◽

Sequence Comparisons ◽

Short Read ◽

Short Read Sequencing ◽

Template Switch

Variation within human genomes is distributed unevenly and variants show spatial clustering. DNA replication-related template switching is a poorly known mutational mechanism capable of causing major chromosomal rearrangements as well as creating short inverted sequence copies that appear as local mutation clusters in sequence comparisons. We reanalyzed haplotype-resolved genome assemblies representing 25 human populations and multinucleotide variants aggregated from 140,000 human sequencing experiments. We found that local template switching explains thousands of complex mutation clusters across the human genome; the loci segregate within and between populations, with a small number appearing as de novo mutations. We developed computational tools for genotyping candidate template switch loci using short-read sequencing data and for identification of template switch events using both short-read data and genotype data. These tools will enable building a catalogue of affected loci and studying the cellular mechanisms behind template switching both in healthy organisms and in disease. Strikingly, we noticed that widely used analysis pipelines for short-read sequencing data, although capable of identifying single nucleotide changes, may miss inversions of tens of base pairs that originate from template switch mutations (TSMs), potentially invalidating medical genetic studies searching for causative alleles behind genetic diseases.
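The abstract's point that a short inverted copy shows up as a local mutation cluster in a sequence comparison can be illustrated with a toy example. This is only a sketch: the sequences, coordinates, and function names below are made up for illustration and are not taken from the paper or its tools.

```python
# Toy model only: replace a short segment with its reverse complement and
# compare position by position. Sequences and coordinates are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def invert_segment(seq: str, start: int, end: int) -> str:
    """Replace seq[start:end] with its reverse complement (an in-place inversion)."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


def mismatch_positions(a: str, b: str) -> list[int]:
    """Positions where two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


reference = "TTGACCATGCAAGGTCTAGCCTGA"
sample = invert_segment(reference, 8, 18)  # a 10 bp inversion
cluster = mismatch_positions(reference, sample)

print(sample)
print(cluster)  # several mismatches, all confined to positions 8..17
```

A single event therefore looks like many independent base substitutions packed into ten positions when the two sequences are compared column by column.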

A Family-Based Probabilistic Method for Capturing De Novo Mutations from High-Throughput Short-Read Sequencing Data

Statistical Applications in Genetics and Molecular Biology ◽

10.2202/1544-6115.1713 ◽

2012 ◽

Vol 11 (2) ◽

Cited By ~ 11

Author(s):

Reed A. Cartwright ◽

Julie Hussin ◽

Jonathan E. M. Keebler ◽

Eric A. Stone ◽

Philip Awadalla

Keyword(s):

High Throughput ◽

De Novo ◽

Probabilistic Method ◽

Sequencing Data ◽

De Novo Mutations ◽

Short Read ◽

Short Read Sequencing ◽

Family Based

Short template switch events explain mutation clusters in the human genome

10.1101/038380 ◽

2016 ◽

Author(s):

Ari Löytynoja ◽

Nick Goldman

Keyword(s):

Human Genome ◽

Genome Rearrangement ◽

Common Ancestor ◽

De Novo ◽

Variant Calling ◽

Template Switching ◽

Human Genomes ◽

Switch Mechanism ◽

Template Switch

Resequencing efforts are uncovering the extent of genetic variation in humans and provide data to study the evolutionary processes shaping our genome. One recurring puzzle in both intra- and inter-species studies is the high frequency of complex mutations comprising multiple nearby base substitutions or insertion-deletions. We devised a generalized mutation model of template switching during replication that extends existing models of genome rearrangement, and used this to study the role of template switch events in the origin of such mutation clusters. Applied to the human genome, our model detects thousands of template switch events during the evolution of human and chimp from their common ancestor, and hundreds of events between two independently sequenced human genomes. While many of these are consistent with the template switch mechanism previously proposed for bacteria but not thought significant in higher organisms, our model also identifies new types of mutations that create short inversions, some flanked by paired inverted repeats. The local template switch process can create numerous complex mutation patterns, including hairpin loop structures, and explains multi-nucleotide mutations and compensatory substitutions without invoking positive selection, complicated and speculative mechanisms, or implausible coincidence. Clustered sequence differences are challenging for mapping and variant calling methods, and we show that detection of mutation clusters with current resequencing methodologies is difficult and many erroneous variant annotations exist in human reference data. Template switch events such as those we have uncovered may have been neglected as an explanation for complex mutations because of biases in commonly used analyses. Incorporation of our model into reference-based analysis pipelines and comparisons of de novo-assembled genomes will lead to improved understanding of genome variation and evolution.

DACCOR - Detection, charACterization, and reconstruction of Repetitive regions in bacterial genomes

10.7287/peerj.preprints.3480v1 ◽

2017 ◽

Author(s):

Alexander Seitz ◽

Friederike Hanssen ◽

Kay Nieselt

Keyword(s):

Base Pair ◽

Repetitive Sequence ◽

Reference Genome ◽

De Novo ◽

Treponema Pallidum ◽

Sequencing Data ◽

Bacterial Genomes ◽

Short Read ◽

Short Reads ◽

Short Read Sequencing

The reconstruction of genomes using mapping-based approaches with short reads experiences difficulties when resolving repetitive regions. These repetitive regions in genomes result in low mapping qualities of the respective reads, which in turn leave many bases unresolved by the genotypers. Currently, the reconstruction of these regions is often based on modified references in which the repetitive regions are masked. However, for many references such masked genomes are not available or are based on repetitive regions of other genomes. Our idea is to identify repetitive regions in the reference genome de novo. These regions can then be used to reconstruct them separately using short read sequencing data. Afterwards the reconstructed repetitive sequence can be inserted into the reconstructed genome. We present the program DACCOR, which performs these steps automatically. Our results show an increased base pair resolution of the repetitive regions in the reconstruction of Treponema pallidum samples, resulting in fewer unresolved bases.
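The idea of identifying repetitive regions in a reference de novo can be sketched with a simple k-mer multiplicity scan. This is an assumption-laden illustration, not DACCOR's actual algorithm: the function name, the toy reference, and the choice of k are all hypothetical, and only the forward strand is scanned.

```python
from collections import defaultdict


def repeated_kmer_positions(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer that occurs more than once to its 0-based start positions.

    Forward strand only; a real repeat finder would also handle reverse
    complements and approximate (non-identical) repeats.
    """
    positions = defaultdict(list)
    for i in range(len(reference) - k + 1):
        positions[reference[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


# Hypothetical toy reference with one exact 12 bp repeat.
toy_reference = "ACGTTGCA" + "GGATCCGGATAC" + "TTTT" + "GGATCCGGATAC" + "CAGT"
repeats = repeated_kmer_positions(toy_reference, k=12)
print(repeats)
```

Regions flagged this way are the ones where short reads would map ambiguously, which is why the abstract proposes reconstructing them separately.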

Extensive gene duplication in Arabidopsis revealed by pseudo-heterozygosity

10.1101/2021.11.15.468652 ◽

2021 ◽

Author(s):

Benjamin Jaegle ◽

Luz Mayela Soto-Jimenez ◽

Robin Burns ◽

Fernando A. Rabanal ◽

Magnus Nordborg

Keyword(s):

Copy Number ◽

Structural Variation ◽

De Novo ◽

Sequencing Data ◽

Heterozygous Snps ◽

Mendelian Segregation ◽

Short Read ◽

Short Read Sequencing ◽

Snp Data ◽

Number Variation

Background: It is becoming apparent that genomes harbor massive amounts of structural variation, and that this variation has largely gone undetected for technical reasons. In addition to being inherently interesting, structural variation can cause artifacts when short-read sequencing data are mapped to a reference genome. In particular, spurious SNPs (that do not show Mendelian segregation) may result from mapping of reads to duplicated regions. Recalling SNPs using the raw reads of the 1001 Arabidopsis Genomes Project, we identified 3.3 million heterozygous SNPs (44% of total). Given that Arabidopsis thaliana (A. thaliana) is highly selfing, we hypothesized that these SNPs reflected cryptic copy number variation, and investigated them further. Results: While genuine heterozygosity should occur in tracts within individuals, heterozygosity at a particular locus is instead shared across individuals in a manner that strongly suggests it reflects segregating duplications rather than actual heterozygosity. Focusing on pseudo-heterozygosity in annotated genes, we used GWAS to map the position of the duplicates, identifying 2500 putatively duplicated genes. The results were validated using de novo genome assemblies from six lines. Specific examples included an annotated gene and nearby transposon that, in fact, transpose together. Conclusions: Our study confirms that most heterozygous SNP calls in A. thaliana are artifacts, and suggests that great caution is needed when analysing SNP data from short-read sequencing. The finding that 10% of annotated genes are copy-number variable, and the realization that neither gene- nor transposon-annotation necessarily tells us what is actually mobile in the genome, suggest that future analyses based on independently assembled genomes will be very informative.

DACCOR–Detection, characterization, and reconstruction of repetitive regions in bacterial genomes

PeerJ ◽

10.7717/peerj.4742 ◽

2018 ◽

Vol 6 ◽

pp. e4742 ◽

Cited By ~ 1

Author(s):

Alexander Seitz ◽

Friederike Hanssen ◽

Kay Nieselt

Keyword(s):

Base Pair ◽

Repetitive Sequence ◽

Reference Genome ◽

De Novo ◽

Treponema Pallidum ◽

Sequencing Data ◽

Bacterial Genomes ◽

Short Read ◽

Short Reads ◽

Short Read Sequencing

The reconstruction of genomes using mapping-based approaches with short reads experiences difficulties when resolving repetitive regions. These repetitive regions in genomes result in low mapping qualities of the respective reads, which in turn lead to many unresolved bases. Currently, the reconstruction of these regions is often based on modified references in which the repetitive regions are masked. However, for many references, such masked genomes are not available or are based on repetitive regions of other genomes. Our idea is to identify repetitive regions in the reference genome de novo. These regions can then be used to reconstruct them separately using short read sequencing data. Afterward, the reconstructed repetitive sequence can be inserted into the reconstructed genome. We present the program DACCOR (detection, characterization, and reconstruction of repetitive regions), which performs these steps automatically. Our results show an increased base pair resolution of the repetitive regions in the reconstruction of Treponema pallidum samples, resulting in fewer unresolved bases.

Illumina short-read sequencing data, de novo assembly and annotations of the Drosophila nasuta nasuta genome

Data in Brief ◽

10.1016/j.dib.2020.106674 ◽

2021 ◽

Vol 34 ◽

pp. 106674

Author(s):

Stafny DSouza ◽

Koushik Ponnanna ◽

Amruthavalli Chokkanna ◽

Nallur Ramachandra

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Drosophila Nasuta ◽

Drosophila Nasuta Nasuta

Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms

BMC Bioinformatics ◽

10.1186/1471-2105-13-170 ◽

2012 ◽

Vol 13 (1) ◽

pp. 170 ◽

Cited By ~ 24

Author(s):

Berat Z Haznedaroglu ◽

Darryl Reeves ◽

Hamid Rismani-Yazdi ◽

Jordan Peccia

Keyword(s):

High Throughput ◽

Functional Annotation ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

De Novo Transcriptome Assembly ◽

Sequencing Data ◽

De Novo Transcriptome ◽

Short Read ◽

Short Read Sequencing

DACCOR - Detection, charACterization, and reconstruction of Repetitive regions in bacterial genomes

10.7287/peerj.preprints.3480 ◽

2017 ◽

Author(s):

Alexander Seitz ◽

Friederike Hanssen ◽

Kay Nieselt

Keyword(s):

Base Pair ◽

Repetitive Sequence ◽

Reference Genome ◽

De Novo ◽

Treponema Pallidum ◽

Sequencing Data ◽

Bacterial Genomes ◽

Short Read ◽

Short Reads ◽

Short Read Sequencing

De novo mutations across 1,465 diverse genomes reveal mutational insights and reductions in the Amish founder population

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1902766117 ◽

2020 ◽

Vol 117 (5) ◽

pp. 2560-2569 ◽

Cited By ~ 5

Author(s):

Michael D. Kessler ◽

Douglas P. Loesch ◽

James A. Perry ◽

Nancy L. Heard-Costa ◽

Daniel Taliun ◽

...

Keyword(s):

Genetic Variation ◽

De Novo ◽

Whole Genome Sequencing Data ◽

Human Populations ◽

Founder Population ◽

Sequencing Data ◽

De Novo Mutations ◽

Narrow Sense Heritability ◽

High Coverage ◽

Genome Wide

De novo mutations (DNMs), or mutations that appear in an individual despite not being seen in their parents, are an important source of genetic variation whose impact is relevant to studies of human evolution, genetics, and disease. Utilizing high-coverage whole-genome sequencing data as part of the Trans-Omics for Precision Medicine (TOPMed) Program, we called 93,325 single-nucleotide DNMs across 1,465 trios from an array of diverse human populations, and used them to directly estimate and analyze DNM counts, rates, and spectra. We find a significant positive correlation between local recombination rate and local DNM rate, and that DNM rate explains a substantial portion (8.98 to 34.92%, depending on the model) of the genome-wide variation in population-level genetic variation from 41K unrelated TOPMed samples. Genome-wide heterozygosity does correlate with DNM rate, but only explains <1% of variation. While we are underpowered to see small differences, we do not find significant differences in DNM rate between individuals of European, African, and Latino ancestry, nor across ancestrally distinct segments within admixed individuals. However, we did find significantly fewer DNMs in Amish individuals, even when compared with other Europeans, and even after accounting for parental age and sequencing center. Specifically, we found significant reductions in the number of C→A and T→C mutations in the Amish, which seem to underpin their overall reduction in DNMs. Finally, we calculated near-zero estimates of narrow sense heritability (h²), which suggest that variation in DNM rate is significantly shaped by nonadditive genetic effects and the environment.
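The definition of a DNM given in the abstract (an allele seen in the child but in neither parent) can be written as a naive trio check. This is a minimal sketch under strong assumptions: genotypes are taken as error-free tuples of alleles, and the function name is hypothetical. Real callers must also account for sequencing and genotyping errors, which this does not attempt.

```python
def is_candidate_de_novo(child: tuple, mother: tuple, father: tuple) -> bool:
    """Return True if the child carries an allele absent from both parents.

    Genotypes are tuples of alleles at one site, e.g. ("C", "A").
    Assumes error-free genotype calls, which real data never guarantee.
    """
    parental_alleles = set(mother) | set(father)
    return any(allele not in parental_alleles for allele in child)


# Child carries an A that neither C/C parent has: a candidate DNM.
print(is_candidate_de_novo(("C", "A"), ("C", "C"), ("C", "C")))
# Child's T is inherited from the heterozygous mother: not a candidate.
print(is_candidate_de_novo(("C", "T"), ("C", "T"), ("C", "C")))
```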

On the (Im)possibility to Reconstruct Plasmids from Whole Genome Short-Read Sequencing Data

10.1101/086744 ◽

2016 ◽

Cited By ~ 11

Author(s):

Sergio Arredondo-Alonso ◽

Willem van Schaik ◽

Rob J. Willems ◽

Anita C. Schürch

Keyword(s):

Complete Genome ◽

De Novo ◽

Bacterial Genome ◽

Composition Analysis ◽

Bacterial Survival ◽

Sequencing Data ◽

Genome Sequences ◽

High Throughput Analysis ◽

Short Read ◽

Short Read Sequencing

Abstract: Plasmids are autonomous extra-chromosomal elements in bacterial cells that can carry genes that are important for bacterial survival. To benchmark algorithms for automated plasmid sequence reconstruction from short read sequencing data, we selected 42 publicly available complete bacterial genome sequences which were assembled by a combination of long- and short-read data. The selected bacterial genome sequence projects span 12 genera, containing 148 plasmids. We predicted plasmids from short-read data with four different programs (PlasmidSPAdes, Recycler, cBar and PlasmidFinder) and compared the outcome to the reference sequences. PlasmidSPAdes reconstructs plasmids based on coverage differences in the assembly graph. It reconstructed most of the reference plasmids (recall = 0.82) but approximately a quarter of the predicted plasmid contigs were false positives (precision = 0.76). PlasmidSPAdes merged 83% of the predictions from genomes with multiple plasmids in a single bin. Recycler searches the assembly graph for sub-graphs corresponding to circular sequences and correctly predicted small plasmids but failed with long plasmids (recall = 0.12, precision = 0.30). cBar, which applies pentamer frequency composition analysis to detect plasmid-derived contigs, showed an overall recall and precision of 0.78 and 0.64. However, cBar only categorizes contigs as plasmid-derived and does not bin the different plasmids correctly within a bacterial isolate. PlasmidFinder, which searches for matches in a replicon database, had the highest precision (1.0) but was restricted by the contents of its database and the contig length obtained from de novo assembly (recall = 0.36). Surprisingly, PlasmidSPAdes and Recycler detected single isolated components corresponding to putative novel small plasmids (<10 kbp) which were also predicted as plasmids by cBar. This study shows that it is possible to automatically predict plasmid sequences, but only for small plasmids. The reconstruction of large plasmids (>50 kbp) containing repeated sequences remains challenging and limits the high-throughput analysis of WGS data.

Author Summary: Short read sequencing of the DNA of bacteria is often used to understand characteristics such as antibiotic resistance. However the assembly of short read sequencing data with the goal of reconstructing a complete genome is often fragmented and leaves gaps. Therefore independently replicating DNA fragments called plasmids cannot easily be identified from an assembly. Lately a number of programs have been developed to enable the automated prediction of the sequences of plasmids. Here we tested these programs by comparing their outcomes with complete genome sequences. None of the tested programs were able to fully and unambiguously predict distinct plasmid sequences. All programs performed best with the prediction of plasmids smaller than 50 kbp. Larger plasmids were only correctly predicted if they were present as a single contig in the assembly. While predictions by PlasmidSPAdes and cBar contained most of the plasmids, they were merged with or indistinguishable from other plasmids and sometimes chromosome sequences. PlasmidFinder missed most plasmids but all its predictions were correct. Without manual steps or long-read sequencing information, plasmid reconstruction from short read sequencing data remains challenging.
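The recall and precision figures quoted in the abstract follow the standard definitions, which can be sketched as below. The counts in the example are hypothetical toy values and are not the study's data; the function name is likewise an assumption for illustration.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN).

    Returns 0.0 for a metric whose denominator is zero.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 3 predictions correct, 1 wrong, 2 real plasmids missed.
precision, recall = precision_recall(true_pos=3, false_pos=1, false_neg=2)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Read this way, a tool such as PlasmidFinder in the abstract (precision 1.0, recall 0.36) makes no wrong calls but misses most of the reference plasmids.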
