Towards a better understanding of the low recall of insertion variants with short-read based variant callers

Abstract: Background: Since 2009, numerous tools have been developed to detect structural variants using short-read technologies. Insertions >50 bp are among the hardest types to discover and are drastically underrepresented in gold standard variant callsets. The advent of long-read technologies has completely changed the situation. In 2019, two independent cross-technology studies published the most complete variant callsets with sequence-resolved insertions in human individuals. Among the reported insertions, only 17 to 28% could be discovered with short-read based tools. Results: In this work, we performed an in-depth analysis of these unprecedented insertion callsets in order to investigate the causes of such failures. We first established a precise classification of insertion variants according to four layers of characterization: the nature and size of the inserted sequence, the genomic context of the insertion site and the breakpoint junction complexity. Because these levels are intertwined, we then used simulations to characterize the impact of each complexity factor on the recall of several structural variant callers. We showed that most reported insertions exhibited characteristics that may interfere with their discovery: 63% were tandem repeat expansions, 38% contained homology larger than 10 bp within their breakpoint junctions and 70% were located in simple repeats. Consequently, the recall of short-read based variant callers was significantly lower for such insertions (6% for tandem repeats vs 56% for mobile element insertions). Simulations showed that the most influential factor was the insertion type rather than the genomic context, with various difficulties being handled differently among the tested structural variant callers, and they highlighted the lack of sequence resolution for most insertion calls. Conclusions: Our results explain the low recall by pointing out several difficulty factors among the observed insertion features and provide avenues for improving SV caller algorithms and their combinations.
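
The per-type recall comparison reported in this abstract (for example, 6% for tandem repeats versus 56% for mobile element insertions) can be illustrated with a minimal sketch. The function name `recall_by_type`, the record layout and the toy data are hypothetical and are not taken from the paper; deciding whether a caller's output matches a truth insertion is assumed to have been done upstream.

```python
from collections import defaultdict


def recall_by_type(truth, called_ids):
    """Compute recall per insertion type.

    truth: iterable of (insertion_id, insertion_type) from a reference callset.
    called_ids: set of truth insertion ids that a caller recovered
                (the matching step itself is assumed, not shown).
    """
    totals = defaultdict(int)
    found = defaultdict(int)
    for ins_id, ins_type in truth:
        totals[ins_type] += 1
        if ins_id in called_ids:
            found[ins_type] += 1
    # Recall = recovered truth insertions / all truth insertions, per type.
    return {t: found[t] / totals[t] for t in totals}


# Toy data (hypothetical), only to show the shape of the inputs.
truth = [
    ("ins1", "tandem_repeat"),
    ("ins2", "tandem_repeat"),
    ("ins3", "mobile_element"),
    ("ins4", "mobile_element"),
]
called = {"ins1", "ins3", "ins4"}
recall = recall_by_type(truth, called)
```

Stratifying recall this way makes it visible when a high overall figure hides a category, such as tandem repeat expansions, that a caller almost never recovers.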


Towards a better understanding of the low recall of insertion variants with short-read based variant callers

10.1101/2020.06.09.142232 ◽

2020 ◽

Author(s):

Wesley Delage ◽

Julien Thevenon ◽

Claire Lemaitre

Keyword(s):

Gold Standard ◽

Insertion Site ◽

Structural Variants ◽

Genomic Context ◽

Breakpoint Junction ◽

Short Read ◽

Depth Analysis ◽

Long Read ◽

The Impact

Abstract: Since 2009, numerous tools have been developed to detect structural variants (SVs) using short-read technologies. Insertions >50 bp are among the hardest types to discover and are drastically underrepresented in gold standard variant callsets. The advent of long-read technologies has completely changed the situation. In 2019, two independent cross-technology studies published the most complete variant callsets with sequence-resolved insertions in human individuals. Among the reported insertions, only 17 to 37% could be discovered with short-read based tools. In this work, we performed an in-depth analysis of these unprecedented insertion callsets in order to investigate the causes of such failures. We first established a precise classification of insertion variants according to four layers of characterization: the nature and size of the inserted sequence, the genomic context of the insertion site and the breakpoint junction complexity. Because these levels are intertwined, we then used simulations to characterize the impact of each complexity factor on the recall of several SV callers. Simulations showed that the most influential factor was the insertion type rather than the genomic context, with various difficulties being handled differently among the tested SV callers, and they highlighted the lack of sequence resolution for most insertion calls. Our results explain the low recall by pointing out several difficulty factors among the observed insertion features and provide avenues for improving SV caller algorithms and their combinations.


CaBagE: A Cas9-based Background Elimination strategy for targeted, long-read DNA sequencing

PLoS ONE ◽

10.1371/journal.pone.0241253 ◽

2021 ◽

Vol 16 (4) ◽

pp. e0241253

Author(s):

Amelia D. Wallace ◽

Thomas A. Sasani ◽

Jordan Swanier ◽

Brooke L. Gates ◽

Jeff Greenland ◽

...

Keyword(s):

Dna Sequencing ◽

Tandem Repeats ◽

Short Read ◽

Background Elimination ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Repeat Expansions ◽

Sequencing Platforms ◽

Als Patients

A substantial fraction of the human genome is difficult to interrogate with short-read DNA sequencing technologies due to paralogy, complex haplotype structures, or tandem repeats. Long-read sequencing technologies, such as Oxford Nanopore’s MinION, enable direct measurement of complex loci without introducing many of the biases inherent to short-read methods, though they suffer from relatively lower throughput. This limitation has motivated recent efforts to develop amplification-free strategies to target and enrich loci of interest for subsequent sequencing with long reads. Here, we present CaBagE, a method for target enrichment that is efficient and useful for sequencing large, structurally complex targets. The CaBagE method leverages the stable binding of Cas9 to its DNA target to protect desired fragments from digestion with exonuclease. Enriched DNA fragments are then sequenced with Oxford Nanopore’s MinION long-read sequencing technology. Enrichment with CaBagE resulted in a median of 116X coverage (range 39–416) of target loci when tested on five genomic targets ranging from 4 to 20 kb in length using healthy donor DNA. Four cancer gene targets were enriched in a single reaction and multiplexed on a single MinION flow cell. We further demonstrate the utility of CaBagE in two ALS patients with C9orf72 short tandem repeat expansions to produce genotype estimates commensurate with genotypes derived from repeat-primed PCR for each individual. With CaBagE there is a physical enrichment of on-target DNA in a given sample prior to sequencing. This feature allows adaptability across sequencing platforms and potential use as an enrichment strategy for applications beyond sequencing. CaBagE is a rapid enrichment method that can illuminate regions of the ‘hidden genome’ underlying human disease.


CaBagE: a Cas9-based Background Elimination strategy for targeted, long-read DNA sequencing

10.1101/2020.10.13.337253 ◽

2020 ◽

Author(s):

Amelia Wallace ◽

Thomas A. Sasani ◽

Jordan Swanier ◽

Brooke L. Gates ◽

Jeff Greenland ◽

...

Keyword(s):

Dna Sequencing ◽

Tandem Repeats ◽

Short Read ◽

Background Elimination ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Repeat Expansions ◽

Sequencing Platforms ◽

Als Patients

Abstract: A substantial fraction of the human genome is difficult to interrogate with short-read DNA sequencing technologies due to paralogy, complex haplotype structures, or tandem repeats. Long-read sequencing technologies, such as Oxford Nanopore’s MinION, enable direct measurement of complex loci without introducing many of the biases inherent to short-read methods, though they suffer from relatively lower throughput. This limitation has motivated recent efforts to develop amplification-free strategies to target and enrich loci of interest for subsequent sequencing with long reads. Here, we present CaBagE, a novel method for target enrichment that is efficient and useful for sequencing large, structurally complex targets. The CaBagE method leverages the stable binding of Cas9 to its DNA target to protect desired fragments from digestion with exonuclease. Enriched DNA fragments are then sequenced with Oxford Nanopore’s MinION long-read sequencing technology. Enrichment with CaBagE resulted in up to 416X coverage of target loci when tested on five genomic targets ranging from 4 to 20 kb in length using healthy donor DNA. Four cancer gene targets were enriched in a single reaction and multiplexed on a single MinION flow cell. We further demonstrate the utility of CaBagE in two ALS patients with C9orf72 short tandem repeat expansions to produce genotype estimates commensurate with genotypes derived from repeat-primed PCR for each individual. With CaBagE there is a physical enrichment of on-target DNA in a given sample prior to sequencing. This feature allows adaptability across sequencing platforms and potential use as an enrichment strategy for applications beyond sequencing. CaBagE is a rapid enrichment method that can illuminate regions of the ‘hidden genome’ underlying human disease.


Human-specific tandem repeat expansion and differential gene expression during primate evolution

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1912175116 ◽

2019 ◽

Vol 116 (46) ◽

pp. 23243-23253 ◽

Cited By ~ 13

Author(s):

Arvis Sulovari ◽

Ruiyang Li ◽

Peter A. Audano ◽

David Porubsky ◽

Mitchell R. Vollger ◽

...

Keyword(s):

Tandem Repeat ◽

Tandem Repeats ◽

Sequence Data ◽

Variable Number ◽

Specific Expression ◽

Sequence Composition ◽

Transcription Profiles ◽

Long Read ◽

Repeat Expansions ◽

Human Specific

Short tandem repeats (STRs) and variable number tandem repeats (VNTRs) are important sources of natural and disease-causing variation, yet they have been problematic to resolve in reference genomes and genotype with short-read technology. We created a framework to model the evolution and instability of STRs and VNTRs in apes. We phased and assembled 3 ape genomes (chimpanzee, gorilla, and orangutan) using long-read and 10x Genomics linked-read sequence data for 21,442 human tandem repeats discovered in 6 haplotype-resolved assemblies of Yoruban, Chinese, and Puerto Rican origin. We define a set of 1,584 STRs/VNTRs expanded specifically in humans, including large tandem repeats affecting coding and noncoding portions of genes (e.g., MUC3A, CACNA1C). We show that short interspersed nuclear element–VNTR–Alu (SVA) retrotransposition is the main mechanism for distributing GC-rich human-specific tandem repeat expansions throughout the genome but with a bias against genes. In contrast, we observe that VNTRs not originating from retrotransposons have a propensity to cluster near genes, especially in the subtelomere. Using tissue-specific expression from human and chimpanzee brains, we identify genes where transcript isoform usage differs significantly, likely caused by cryptic splicing variation within VNTRs. Using single-cell expression from cerebral organoids, we observe a strong effect for genes associated with transcription profiles analogous to intermediate progenitor cells. Finally, we compare the sequence composition of some of the largest human-specific repeat expansions and identify 52 STRs/VNTRs with at least 40 uninterrupted pure tracts as candidates for genetically unstable regions associated with disease.


Comparison of different technologies for the decipherment of the whole genome sequence of Campylobacter jejuni BfR-CA-14430

Gut Pathogens ◽

10.1186/s13099-019-0340-7 ◽

2019 ◽

Vol 11 (1) ◽

Author(s):

Lennard Epping ◽

Julia C. Golz ◽

Marie-Theres Knüver ◽

Charlotte Huber ◽

Andrea Thürmer ◽

...

Keyword(s):

Campylobacter Jejuni ◽

Genome Sequence ◽

Bacterial Species ◽

Illumina Miseq ◽

Chicken Meat ◽

Whole Genome ◽

Short Read ◽

Plasmid Sequence ◽

Depth Analysis ◽

Long Read

Abstract: Background: Campylobacter jejuni is a zoonotic pathogen that infects the human gut through the food chain, mainly through consumption of undercooked chicken meat, ready-to-eat food cross-contaminated by raw chicken, or raw milk. In the last decades, C. jejuni has increasingly become the most common bacterial cause of foodborne infections in high-income countries, costing public health systems billions of euros each year. Currently, different whole genome sequencing techniques such as short-read bridge amplification and long-read single molecule real-time sequencing techniques are applied for in-depth analysis of bacterial species, in particular, Illumina MiSeq, PacBio and MinION. Results: In this study, we analyzed a recently isolated C. jejuni strain from chicken meat by short- and long-read data from Illumina, PacBio and MinION sequencing technologies. For comparability, this strain is used in the German PAC-CAMPY research consortium in several studies, including phenotypic analysis of biofilm formation, natural transformation and in vivo colonization models. The complete assembled genome sequence most likely consists of a chromosome of 1,645,980 bp covering 1665 coding sequences as well as a plasmid sequence of 41,772 bp that encodes 46 genes. Multilocus sequence typing revealed that the strain belongs to the clonal complex CC-21 (ST-44), which is known to be involved in C. jejuni human infections, including outbreaks. Furthermore, we discovered resistance determinants and a point mutation in the DNA gyrase (gyrA) that render the bacterium resistant against ampicillin, tetracycline and (fluoro-)quinolones. Conclusion: The comparison of Illumina MiSeq, PacBio and MinION sequencing and analyses with different assembly tools enabled us to reconstruct a complete chromosome as well as a circular plasmid sequence of the C. jejuni strain BfR-CA-14430. Illumina short-read sequencing in combination with either PacBio or MinION can substantially improve the quality of the complete chromosome and epichromosomal elements on the level of mismatches and insertions/deletions, depending on the assembly program used.


Long-read sequencing across the C9orf72 ‘GGGGCC’ repeat expansion: implications for clinical use and genetic discovery efforts in human disease

10.1101/176651 ◽

2018 ◽

Cited By ~ 1

Author(s):

Mark T. W. Ebbert ◽

Stefan Farrugia ◽

Jonathon Sens ◽

Karen Jansen-West ◽

Tania F. Gendron ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Repeat Expansion ◽

Whole Genome ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Repeat Expansions ◽

Targeted Approach

Abstract: Background: Many neurodegenerative diseases are caused by nucleotide repeat expansions, but most expansions, like the C9orf72 ‘GGGGCC’ (G4C2) repeat that causes approximately 5-7% of all amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD) cases, are too long to sequence using short-read sequencing technologies. It is unclear whether long-read sequencing technologies can traverse these long, challenging repeat expansions. Here, we demonstrate that two long-read sequencing technologies, Pacific Biosciences’ (PacBio) and Oxford Nanopore Technologies’ (ONT), can sequence through disease-causing repeats cloned into plasmids, including the FTD/ALS-causing G4C2 repeat expansion. We also report the first long-read sequencing data characterizing the C9orf72 G4C2 repeat expansion at the nucleotide level in two symptomatic expansion carriers using PacBio whole-genome sequencing and a no-amplification (No-Amp) targeted approach based on CRISPR/Cas9. Results: Both the PacBio and ONT platforms successfully sequenced through the repeat expansions in plasmids. Throughput on the MinION was a challenge for whole-genome sequencing; we were unable to attain reads covering the human C9orf72 repeat expansion using 15 flow cells. We obtained 8x coverage across the C9orf72 locus using the PacBio Sequel, accurately reporting the unexpanded allele at eight repeats, and reading through the entire expansion with 1324 repeats (7941 nucleotides). Using the No-Amp targeted approach, we attained >800x coverage and were able to identify the unexpanded allele, closely estimate expansion size, and assess nucleotide content in a single experiment. We estimate the individual’s repeat region was >99% G4C2 content, though we cannot rule out small interruptions. Conclusions: Our findings indicate that long-read sequencing is well suited to characterizing known repeat expansions, and for discovering new disease-causing, disease-modifying, or risk-modifying repeat expansions that have gone undetected with conventional short-read sequencing. The PacBio No-Amp targeted approach may have future potential in clinical and genetic counseling environments. Larger and deeper long-read sequencing studies in C9orf72 expansion carriers will be important to determine heterogeneity and whether the repeats are interrupted by non-G4C2 content, potentially mitigating or modifying disease course or age of onset, as interruptions are known to do in other repeat-expansion disorders. These results have broad implications across all diseases where the genetic etiology remains unclear.
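
As a rough illustration of what "G4C2 content" of a repeat region could mean, the sketch below counts non-overlapping exact `GGGGCC` units in a sequence and reports the fraction of bases they cover. This is an assumption-laden simplification, not the authors' method: the function name `g4c2_content` is hypothetical, the region is assumed to be already extracted and on the forward strand, and sequencing errors are ignored.

```python
def g4c2_content(region, motif="GGGGCC"):
    """Return (unit_count, fraction of bases covered by exact motif copies).

    Counts non-overlapping exact copies of the motif anywhere in the region;
    interrupted or error-containing units are simply not counted.
    """
    if not region:
        return 0, 0.0
    units = region.count(motif)  # str.count counts non-overlapping matches
    return units, units * len(motif) / len(region)


# Toy region (hypothetical): nine pure units followed by one interrupted unit.
toy_region = "GGGGCC" * 9 + "GGAGCC"
units, fraction = g4c2_content(toy_region)
```

In the toy region, nine of the ten six-base units are exact, so the covered fraction is 54 of 60 bases. Real reads would need error-tolerant alignment rather than exact string matching.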


Robust detection of tandem repeat expansions from long DNA reads

10.1101/356931 ◽

2018 ◽

Cited By ~ 1

Author(s):

Satomi Mitsuhashi ◽

Martin C Frith ◽

Takeshi Mizuguchi ◽

Satoko Miyatake ◽

Tomoko Toyota ◽

...

Keyword(s):

Tandem Repeat ◽

Tandem Repeats ◽

Genetic Diseases ◽

Error Rates ◽

Robust Detection ◽

Sequencing Errors ◽

Tandem Repeat Sequences ◽

Long Read ◽

Repeat Expansions ◽

The Many

Abstract: Tandemly repeated sequences are highly mutable and variable features of genomes. Tandem repeat expansions are responsible for a growing list of human diseases, even though it is hard to determine tandem repeat sequences with current DNA sequencing technology. Recent long-read technologies are promising, because the DNA reads are often longer than the repetitive regions, but are hampered by high error rates. Here, we report robust detection of human repeat expansions from careful alignments of long (PacBio and nanopore) reads to a reference genome. Our method (tandem-genotypes) is robust to systematic sequencing errors, inexact repeats with fuzzy boundaries, and low sequencing coverage. By comparing to healthy controls, we can prioritize pathological expansions within the top 10 out of 700000 tandem repeats in the genome. This may help to elucidate the many genetic diseases whose causes remain unknown.


Recent advances in the detection of repeat expansions with short-read next-generation sequencing

F1000Research ◽

10.12688/f1000research.13980.1 ◽

2018 ◽

Vol 7 ◽

pp. 736 ◽

Cited By ~ 31

Author(s):

Melanie Bahlo ◽

Mark F Bennett ◽

Peter Degorski ◽

Rick M Tankard ◽

Martin B Delatycki ◽

...

Keyword(s):

Tandem Repeats ◽

Diagnostic Yield ◽

Genetic Disorders ◽

Next Generation ◽

Base Pairs ◽

Short Read ◽

Whole Exome ◽

Repeat Expansions ◽

Simple Repeats ◽

Santa Cruz Genome

Short tandem repeats (STRs), also known as microsatellites, are commonly defined as consisting of tandemly repeated nucleotide motifs of 2–6 base pairs in length. STRs appear throughout the human genome, and about 239,000 are documented in the Simple Repeats Track available from the UCSC (University of California, Santa Cruz) genome browser. STRs vary in size, producing highly polymorphic markers commonly used as genetic markers. A small fraction of STRs (about 30 loci) have been associated with human disease whereby one or both alleles exceed an STR-specific threshold in size, leading to disease. Detection of repeat expansions is currently performed with polymerase chain reaction–based assays or with Southern blots for large expansions. The tests are expensive and time-consuming and are not always conclusive, leading to lengthy diagnostic journeys for patients, potentially including missed diagnoses. The advent of whole exome and whole genome sequencing has identified the genetic cause of many genetic disorders; however, analysis pipelines are focused primarily on the detection of short nucleotide variations and short insertions and deletions (indels). Until recently, repeat expansions, with the exception of the smallest expansion (SCA6), were not detectable in next-generation short-read sequencing datasets and would have been ignored in most analyses. In the last two years, four analysis methods with accompanying software (ExpansionHunter, exSTRa, STRetch, and TREDPARSE) have been released. Although a comprehensive comparative analysis of the performance of these methods across all known repeat expansions is still lacking, it is clear that these methods are a valuable addition to any existing analysis pipeline. Here, we detail how to assess short-read data for evidence of expansions, reviewing all four methods and outlining their strengths and weaknesses. 
Implementation of these methods should lead to increased diagnostic yield of repeat expansion disorders for known STR loci and has the potential to detect novel repeat expansions.
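
The definition of an STR given above (a nucleotide motif of 2–6 base pairs repeated in tandem) can be sketched with a regular expression. This is only an illustration of the definition, not how ExpansionHunter, exSTRa, STRetch or TREDPARSE work; the function name `find_strs` and the `min_copies` threshold are hypothetical choices.

```python
import re


def find_strs(seq, min_copies=3):
    """Find exact tandem repeats with a 2-6 bp motif.

    Returns a list of (start, motif, copies). The lazy quantifier makes the
    regex prefer the shortest motif that satisfies the copy threshold.
    Note: runs of a single base (e.g. AAAAAA) also match, as motif "AA".
    """
    # One motif copy, then at least (min_copies - 1) further exact copies.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        motif = m.group(1)
        copies = len(m.group(0)) // len(motif)
        hits.append((m.start(), motif, copies))
    return hits


cag_hits = find_strs("TTCAGCAGCAGCAGTT")
g4c2_hits = find_strs("GGGGCC" * 3)
```

Exact matching of this kind only works on short, error-free sequence; it says nothing about expansions longer than a read, which is the problem the reviewed tools address.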


Precise characterization of somatic structural variations and mobile element insertions from paired long-read sequencing data with nanomonsv

10.1101/2020.07.22.214262 ◽

2020 ◽

Author(s):

Yuichi Shiraishi ◽

Junji Koya ◽

Kenichi Chiba ◽

Yuki Saito ◽

Ai Okada ◽

...

Keyword(s):

Matched Control ◽

Mobile Element ◽

Sequencing Data ◽

Structural Variations ◽

Short Read ◽

Long Read ◽

Functional Consequences ◽

Single Base Resolution ◽

Mutational Processes

Abstract: We introduce our novel software, nanomonsv, for detecting somatic structural variations (SVs) at single-base resolution using tumor and matched control long-read sequencing data. Using paired long-read sequencing data from three cancer cell lines and their matched lymphoblastoid lines, we demonstrate that our approach can identify not only somatic SVs that can be captured with short-read technologies but also novel ones, especially those whose breakpoints are located in repeat regions. In addition, we have developed a workflow for classifying mobile element insertions while elucidating their in-depth properties such as 5′ truncations and internal inversions, as well as source sites in the case of LINE1 transductions. Finally, we identify complex SVs probably caused by replication mechanisms or telomere crisis by examining the co-occurrence of multiple somatic SVs in common supporting reads. In summary, our approaches applied to cancer long-read sequencing data can reveal various features of somatic SVs and will lead to further understanding of mutational processes and functional consequences of somatic SVs.


ExpansionHunter Denovo: A computational method for locating known and novel repeat expansions in short-read sequencing data

10.1101/863035 ◽

2019 ◽

Author(s):

Egor Dolzhenko ◽

Mark F. Bennett ◽

Phillip A. Richmond ◽

Brett Trost ◽

Sai Chen ◽

...

Keyword(s):

Tandem Repeats ◽

Simulated Data ◽

Computational Method ◽

Detection Methods ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Monogenic Disorders ◽

Genome Wide ◽

Repeat Expansions

Abstract: Expansions of short tandem repeats are responsible for over 40 monogenic disorders, and undoubtedly many more pathogenic repeat expansions (REs) remain to be discovered. Existing methods for detecting REs in short-read sequencing data require predefined repeat catalogs. However, recent discoveries have emphasized the need for detection methods that do not require candidate repeats to be specified in advance. To address this need, we introduce ExpansionHunter Denovo, an efficient catalog-free method for genome-wide detection of REs. Analysis of real and simulated data shows that our method can identify large expansions of 41 out of 44 pathogenic repeats, including nine recently reported non-reference REs not discoverable via existing methods. ExpansionHunter Denovo is freely available at https://github.com/Illumina/ExpansionHunterDenovo
