Ultrafast, alignment-free detection of repeat expansions in NGS and RNAseq data

Abstract. Short tandem repeat expansions are an established cause of diseases such as Huntington’s disease. Bioinformatic methods for detecting repeat expansions (REs) in short-read sequencing have revealed new repeat expansions in humans. Current bioinformatic methods to detect repeat expansions require alignment information to identify repetitive motif enrichment at genomic locations. We present superSTR, an ultrafast method that does not require alignment, capable of efficiently processing DNA and RNA sequencing data. superSTR is applied to UK Biobank data to efficiently analyse 49,953 whole exomes in a screening experiment, identifying known mutations as well as diseases not previously associated with REs. superSTR also identifies repeat expansion motifs in RNAseq data, demonstrated in several disorders and cell lines. superSTR is a new tool for the most efficient repeat expansion detection currently possible and complements existing locus-specific, reference-dependent repeat expansion analysis tools. superSTR is available from https://github.com/bahlolab/superSTR.
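To make the idea of alignment-free screening concrete, the following is a minimal sketch of how repetitive reads could be flagged from raw sequence alone. It is not the superSTR algorithm; the function names, the period range of 1 to 6 bases, and the 0.8 threshold are all assumptions chosen for illustration.

```python
def longest_periodic_run(read: str, period: int) -> int:
    """Length of the longest stretch in which every base equals the base
    `period` positions before it (a perfect tandem repeat of that period)."""
    best = run = 0
    for i in range(period, len(read)):
        if read[i] == read[i - period]:
            run += 1
            best = max(best, run)
        else:
            run = 0
    # A run of r consecutive matches spans r + period bases.
    return best + period if best else 0


def repeat_fraction(read: str, max_period: int = 6):
    """Return (period, fraction of the read covered) for the smallest
    period that achieves the highest coverage. Hypothetical helper."""
    best_period, best_frac = 0, 0.0
    for period in range(1, max_period + 1):
        if len(read) <= period:
            break
        frac = longest_periodic_run(read, period) / len(read)
        if frac > best_frac:
            best_period, best_frac = period, frac
    return best_period, best_frac


def flag_repeat_reads(reads, min_fraction: float = 0.8):
    """Keep reads that are mostly one perfect tandem repeat.
    The 0.8 threshold is an arbitrary illustrative choice."""
    return [r for r in reads if repeat_fraction(r)[1] >= min_fraction]
```

No reference genome or alignment is consulted at any point; each read is scored on its own sequence. A practical tool would also need to tolerate sequencing errors and interrupted repeats, which this sketch does not.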

REscan: inferring repeat expansions and structural variation in paired-end short read sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa753 ◽

2020 ◽

Author(s):

Russell Lewis McLaughlin

Keyword(s):

Structural Variation ◽

Sequence Data ◽

Neurological Diseases ◽

Repeat Expansion ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Repeat Expansions ◽

Paired End Sequencing

Abstract. Motivation: Repeat expansions are an important class of genetic variation in neurological diseases. However, the identification of novel repeat expansions using conventional sequencing methods is a challenge due to their typical lengths relative to short sequence reads and the difficulty of producing accurate and unique alignments for repetitive sequence. This latter property can nevertheless be harnessed in paired-end sequencing data to infer the possible locations of repeat expansions and other structural variation. Results: This article presents REscan, a command-line utility that infers repeat expansion loci from paired-end short read sequencing data by reporting the proportion of reads orientated towards a locus that do not have an adequately mapped mate. A high REscan statistic relative to a population of data suggests a repeat expansion locus for experimental follow-up. This approach is validated using genome sequence data for 259 cases of amyotrophic lateral sclerosis, of which 24 are positive for a large repeat expansion in C9orf72, showing that REscan statistics readily discriminate repeat expansion carriers from non-carriers. Availability and implementation: C source code at https://github.com/rlmcl/rescan (GNU General Public Licence v3).
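The statistic described in the abstract, the proportion of reads orientated towards a locus whose mate is not adequately mapped, can be sketched as follows. This is an illustration only and does not reproduce the REscan C implementation: the `Read` record, the `mate_ok` flag (standing in for whatever "adequately mapped" means in practice), and the 500 bp window are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Read:
    pos: int          # leftmost aligned position of the read
    is_reverse: bool  # True if the read aligns to the reverse strand
    mate_ok: bool     # hypothetical flag: mate is "adequately mapped"


def faces_locus(read: Read, locus: int, window: int) -> bool:
    """A forward read upstream of the locus, or a reverse read downstream
    of it, points towards the locus (within an assumed window)."""
    if read.is_reverse:
        return locus < read.pos <= locus + window
    return locus - window <= read.pos < locus


def unmapped_mate_proportion(reads: Iterable[Read], locus: int,
                             window: int = 500) -> Optional[float]:
    """Proportion of locus-facing reads without an adequately mapped mate.
    Returns None when no read faces the locus."""
    facing = [r for r in reads if faces_locus(r, locus, window)]
    if not facing:
        return None
    return sum(not r.mate_ok for r in facing) / len(facing)
```

A value that is high relative to the same statistic in a population of other samples would, following the abstract, mark the locus as a candidate for follow-up.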

Long-read sequencing across the C9orf72 ‘GGGGCC’ repeat expansion: implications for clinical use and genetic discovery efforts in human disease

10.1101/176651 ◽

2018 ◽

Cited By ~ 1

Author(s):

Mark T. W. Ebbert ◽

Stefan Farrugia ◽

Jonathon Sens ◽

Karen Jansen-West ◽

Tania F. Gendron ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Repeat Expansion ◽

Whole Genome ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Repeat Expansions ◽

Targeted Approach

Abstract. Background: Many neurodegenerative diseases are caused by nucleotide repeat expansions, but most expansions, like the C9orf72 ‘GGGGCC’ (G4C2) repeat that causes approximately 5-7% of all amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD) cases, are too long to sequence using short-read sequencing technologies. It is unclear whether long-read sequencing technologies can traverse these long, challenging repeat expansions. Here, we demonstrate that two long-read sequencing technologies, Pacific Biosciences’ (PacBio) and Oxford Nanopore Technologies’ (ONT), can sequence through disease-causing repeats cloned into plasmids, including the FTD/ALS-causing G4C2 repeat expansion. We also report the first long-read sequencing data characterizing the C9orf72 G4C2 repeat expansion at the nucleotide level in two symptomatic expansion carriers using PacBio whole-genome sequencing and a no-amplification (No-Amp) targeted approach based on CRISPR/Cas9. Results: Both the PacBio and ONT platforms successfully sequenced through the repeat expansions in plasmids. Throughput on the MinION was a challenge for whole-genome sequencing; we were unable to attain reads covering the human C9orf72 repeat expansion using 15 flow cells. We obtained 8x coverage across the C9orf72 locus using the PacBio Sequel, accurately reporting the unexpanded allele at eight repeats, and reading through the entire expansion with 1324 repeats (7941 nucleotides). Using the No-Amp targeted approach, we attained >800x coverage and were able to identify the unexpanded allele, closely estimate expansion size, and assess nucleotide content in a single experiment. We estimate the individual’s repeat region was >99% G4C2 content, though we cannot rule out small interruptions. Conclusions: Our findings indicate that long-read sequencing is well suited to characterizing known repeat expansions, and for discovering new disease-causing, disease-modifying, or risk-modifying repeat expansions that have gone undetected with conventional short-read sequencing. The PacBio No-Amp targeted approach may have future potential in clinical and genetic counseling environments. Larger and deeper long-read sequencing studies in C9orf72 expansion carriers will be important to determine heterogeneity and whether the repeats are interrupted by non-G4C2 content, potentially mitigating or modifying disease course or age of onset, as interruptions are known to do in other repeat-expansion disorders. These results have broad implications across all diseases where the genetic etiology remains unclear.

Genome-wide sequencing as a first-tier screening test for short tandem repeat expansions

Genome Medicine ◽

10.1186/s13073-021-00932-9 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Indhu-Shree Rajan-Babu ◽

Junran J. Peng ◽

Readman Chiu ◽

Shelin Adam ◽

Christele Du Souich ◽

...

Keyword(s):

Next Generation Sequencing ◽

Size Range ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Full Mutation ◽

Genome Wide ◽

Repeat Expansions ◽

Generation Sequencing ◽

Short Tandem

Abstract. Background: Screening for short tandem repeat (STR) expansions in next-generation sequencing data can enable diagnosis, optimal clinical management/treatment, and accurate genetic counseling of patients with repeat expansion disorders. We aimed to develop an efficient computational workflow for reliable detection of STR expansions in next-generation sequencing data and demonstrate its clinical utility. Methods: We characterized the performance of eight STR analysis methods (lobSTR, HipSTR, RepeatSeq, ExpansionHunter, TREDPARSE, GangSTR, STRetch, and exSTRa) on next-generation sequencing datasets of samples with known disease-causing full-mutation STR expansions and genomes simulated to harbor repeat expansions at selected loci and optimized their sensitivity. We then used a machine learning decision tree classifier to identify an optimal combination of methods for full-mutation detection. In Burrows-Wheeler Aligner (BWA)-aligned genomes, the ensemble approach of using ExpansionHunter, STRetch, and exSTRa performed the best (precision = 82%, recall = 100%, F1-score = 90%). We applied this pipeline to screen 301 families of children with suspected genetic disorders. Results: We identified 10 individuals with full-mutations in the AR, ATXN1, ATXN8, DMPK, FXN, or HTT disease STR locus in the analyzed families. Additional candidates identified in our analysis include two probands with borderline ATXN2 expansions between the established repeat size range for reduced-penetrance and full-penetrance full-mutation and seven individuals with FMR1 CGG repeats in the intermediate/premutation repeat size range. In 67 probands with a prior negative clinical PCR test for the FMR1, FXN, or DMPK disease STR locus, or the spinocerebellar ataxia disease STR panel, our pipeline did not falsely identify aberrant expansion. We performed clinical PCR tests on seven (out of 10) full-mutation samples identified by our pipeline and confirmed the expansion status in all, showing absolute concordance between our bioinformatics and molecular findings. Conclusions: We have successfully demonstrated the application of a well-optimized bioinformatics pipeline that promotes the utility of genome-wide sequencing as a first-tier screening test to detect expansions of known disease STRs. Interrogating clinical next-generation sequencing data for pathogenic STR expansions using our ensemble pipeline can improve diagnostic yield and enhance clinical outcomes for patients with repeat expansion disorders.
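The reported F1-score follows from the stated precision and recall, since F1 is their harmonic mean. A short check, using only the two figures given in the abstract (the function name is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted for the ExpansionHunter + STRetch + exSTRa ensemble.
f1 = f1_score(0.82, 1.00)
print(f"F1 = {f1:.3f}")  # about 0.901, consistent with the reported 90%
```

With recall at 100%, the F1-score is limited entirely by precision, which is why it sits between the two values and closer to the lower one.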

ExpansionHunter Denovo: A computational method for locating known and novel repeat expansions in short-read sequencing data

10.1101/863035 ◽

2019 ◽

Author(s):

Egor Dolzhenko ◽

Mark F. Bennett ◽

Phillip A. Richmond ◽

Brett Trost ◽

Sai Chen ◽

...

Keyword(s):

Tandem Repeats ◽

Simulated Data ◽

Computational Method ◽

Detection Methods ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Monogenic Disorders ◽

Genome Wide ◽

Repeat Expansions

Abstract. Expansions of short tandem repeats are responsible for over 40 monogenic disorders, and undoubtedly many more pathogenic repeat expansions (REs) remain to be discovered. Existing methods for detecting REs in short-read sequencing data require predefined repeat catalogs. However, recent discoveries have emphasized the need for detection methods that do not require candidate repeats to be specified in advance. To address this need, we introduce ExpansionHunter Denovo, an efficient catalog-free method for genome-wide detection of REs. Analysis of real and simulated data shows that our method can identify large expansions of 41 out of 44 pathogenic repeats, including nine recently reported non-reference REs not discoverable via existing methods. ExpansionHunter Denovo is freely available at https://github.com/Illumina/ExpansionHunterDenovo

An update on the neurological short tandem repeat expansion disorders and the emergence of long-read sequencing diagnostics

Acta Neuropathologica Communications ◽

10.1186/s40478-021-01201-x ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

Sanjog R. Chintalaphani ◽

Sandy S. Pineda ◽

Ira W. Deveson ◽

Kishore R. Kumar

Keyword(s):

Tandem Repeat ◽

Short Tandem Repeat ◽

Fragile X ◽

Cost Effective ◽

Repeat Expansion ◽

Main Body ◽

Cgg Repeat ◽

Long Read ◽

Repeat Expansions ◽

Short Tandem

Abstract. Background: Short tandem repeat (STR) expansion disorders are an important cause of human neurological disease. They have an established role in more than 40 different phenotypes including the myotonic dystrophies, Fragile X syndrome, Huntington’s disease, the hereditary cerebellar ataxias, amyotrophic lateral sclerosis and frontotemporal dementia. Main body: STR expansions are difficult to detect and may explain unsolved diseases, as highlighted by recent findings including: the discovery of a biallelic intronic ‘AAGGG’ repeat in RFC1 as the cause of cerebellar ataxia, neuropathy, and vestibular areflexia syndrome (CANVAS); and the finding of ‘CGG’ repeat expansions in NOTCH2NLC as the cause of neuronal intranuclear inclusion disease and a range of clinical phenotypes. However, established laboratory techniques for diagnosis of repeat expansions (repeat-primed PCR and Southern blot) are cumbersome, low-throughput and poorly suited to parallel analysis of multiple gene regions. While next generation sequencing (NGS) has been increasingly used, established short-read NGS platforms (e.g., Illumina) are unable to genotype large and/or complex repeat expansions. Long-read sequencing platforms recently developed by Oxford Nanopore Technology and Pacific Biosciences promise to overcome these limitations to deliver enhanced diagnosis of repeat expansion disorders in a rapid and cost-effective fashion. Conclusion: We anticipate that long-read sequencing will rapidly transform the detection of short tandem repeat expansion disorders for both clinical diagnosis and gene discovery.

ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data

Genome Biology ◽

10.1186/s13059-020-02017-z ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 6

Author(s):

Egor Dolzhenko ◽

Mark F. Bennett ◽

Phillip A. Richmond ◽

Brett Trost ◽

Sai Chen ◽

...

Keyword(s):

Computational Method ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Repeat Expansions

Characterization of FMR1 Repeat Expansion and Intragenic Variants by Indirect Sequence Capture

Frontiers in Genetics ◽

10.3389/fgene.2021.743230 ◽

2021 ◽

Vol 12 ◽

Author(s):

Valentina Grosso ◽

Luca Marcolungo ◽

Simone Maestri ◽

Massimiliano Alfano ◽

Denise Lavezzari ◽

...

Keyword(s):

Repeat Expansion ◽

Single Nucleotide ◽

Short Read Sequencing ◽

Sequence Capture ◽

Long Read ◽

Repeat Expansions ◽

Nucleotide Resolution ◽

Generation Sequencing ◽

Single Nucleotide Resolution

Traditional methods for the analysis of repeat expansions, which underlie genetic disorders such as fragile X syndrome (FXS), lack single-nucleotide resolution in repeat analysis and the ability to characterize causative variants outside the repeat array. These drawbacks can be overcome by long-read and short-read sequencing, respectively. However, the routine application of next-generation sequencing in the clinic requires target enrichment, and none of the available methods allows parallel analysis of long DNA fragments using both sequencing technologies. In this study, we investigated the use of indirect sequence capture (Xdrop technology) coupled to Nanopore and Illumina sequencing to characterize FMR1, the gene responsible for FXS. We achieved the efficient enrichment (> 200×) of large target DNA fragments (~60–80 kbp) encompassing the entire FMR1 gene. The analysis of Xdrop-enriched samples by Nanopore long-read sequencing allowed the complete characterization of repeat lengths in samples with normal, pre-mutation, and full mutation status (> 1 kbp), and correctly identified repeat interruptions relevant for disease prognosis and transmission. Single-nucleotide variants (SNVs) and small insertions/deletions (indels) could be detected in the same samples by Illumina short-read sequencing, completing the mutational testing through the identification of pathogenic variants within the FMR1 gene when no typical CGG repeat expansion is detected. The study successfully demonstrated the parallel analysis of repeat expansions and SNVs/indels in the FMR1 gene at single-nucleotide resolution by combining Xdrop enrichment with two next-generation sequencing approaches. With the appropriate optimization necessary for clinical settings, the system could both facilitate the study of genotype–phenotype correlation in FXS and enable a more efficient diagnosis and genetic counseling for patients and their relatives.

Secondary structural choice of DNA and RNA associated with CGG/CCG trinucleotide repeat expansion rationalizes the RNA misprocessing in FXTAS

Scientific Reports ◽

10.1038/s41598-021-87097-y ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Yogeeshwar Ajjugal ◽

Narendar Kolimi ◽

Thenmalarchelvi Rathinavelan

Keyword(s):

Rna Binding ◽

Trinucleotide Repeat ◽

Rna Binding Proteins ◽

Fragile X ◽

Repeat Expansion ◽

Mobility Shift ◽

Cgg Repeat ◽

Dna And Rna ◽

Structural Choice ◽

Sense Strand

Abstract. CGG tandem repeat expansion in the 5′-untranslated region of the fragile X mental retardation-1 (FMR1) gene leads to unusual nucleic acid conformations, hence causing genetic instabilities. We show that the number of G…G (in CGG repeat) or C…C (in CCG repeat) mismatches (other than A…T, T…A, C…G and G…C canonical base pairs) dictates the secondary structural choice of the sense and antisense strands of the FMR1 gene and their corresponding transcripts in fragile X-associated tremor/ataxia syndrome (FXTAS). The circular dichroism (CD) spectra and electrophoretic mobility shift assay (EMSA) reveal that CGG DNA (sense strand of the FMR1 gene) and its transcript favor a quadruplex structure. CD, EMSA and molecular dynamics (MD) simulations also show that more than four C…C mismatches cannot be accommodated in the RNA duplex consisting of the CCG repeat (antisense transcript); instead, it favors an i-motif conformational intermediate. Such a preference for unusual secondary structures provides a convincing justification for the RNA foci formation due to the sequestration of RNA-binding proteins to the bidirectional transcripts and the repeat-associated non-AUG translation that are observed in FXTAS. The results presented here also suggest that small molecule modulators that can destabilize FMR1 CGG DNA and RNA quadruplex structures could be promising candidates for treating FXTAS.

Diagnostics of short tandem repeat expansion variants using massively parallel sequencing and componential tools

European Journal of Human Genetics ◽

10.1038/s41431-018-0302-4 ◽

2018 ◽

Vol 27 (3) ◽

pp. 400-407 ◽

Cited By ~ 3

Author(s):

Rick H. de Leeuw ◽

Dominique Garnier ◽

Rosemarie M. J. M. Kroon ◽

Corinne G. C. Horlings ◽

Emile de Meijer ◽

...

Keyword(s):

Tandem Repeat ◽

Short Tandem Repeat ◽

Massively Parallel Sequencing ◽

Repeat Expansion ◽

Massively Parallel ◽

Parallel Sequencing ◽

Short Tandem

High resolution copy number inference in cancer using short-molecule nanopore sequencing

10.1101/2020.12.28.424602 ◽

2020 ◽

Author(s):

Timour Baslan ◽

Sam Kovaka ◽

Fritz J. Sedlazeck ◽

Yanming Zhang ◽

Robert Wappel ◽

...

Keyword(s):

Copy Number ◽

Cost Effective ◽

Chromosome Analysis ◽

Ease Of Use ◽

Precision Oncology ◽

Nanopore Sequencing ◽

Dna Molecules ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing

Abstract. Genome copy number is an important source of genetic variation in health and disease. In cancer, clinically actionable Copy Number Alterations (CNAs) can be inferred from short-read sequencing data, enabling genomics-based precision oncology. Emerging Nanopore sequencing technologies offer the potential for broader clinical utility, for example in smaller hospitals, due to lower instrument cost, higher portability, and ease of use. Nonetheless, Nanopore sequencing devices are limited in terms of the number of retrievable sequencing reads/molecules compared to short-read sequencing platforms. This represents a challenge for applications that require high read counts such as CNA inference. To address this limitation, we targeted the sequencing of short-length DNA molecules loaded at optimized concentration in an effort to increase sequence read/molecule yield from a single nanopore run. We show that sequencing short DNA molecules reproducibly returns high read counts and allows high quality CNA inference. We demonstrate the clinical relevance of this approach by accurately inferring CNAs in acute myeloid leukemia samples. The data show that, compared to traditional approaches such as chromosome analysis/cytogenetics, short molecule nanopore sequencing returns more sensitive, accurate copy number information in a cost-effective and expeditious manner, including for multiplex samples. Our results provide a framework for the sequencing of relatively short DNA molecules on nanopore devices with applications in research and medicine, including, but not limited to, CNAs.
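Why CNA inference depends on read count can be seen in a minimal read-depth sketch: reads are counted in fixed genomic bins and each bin is compared with the sample's typical bin. This is a generic illustration and not the authors' pipeline; the bin size, the median normalisation, and the pseudocount are assumptions, and real workflows add steps such as GC correction and segmentation.

```python
import math
from statistics import median


def bin_counts(read_starts, chrom_length: int, bin_size: int):
    """Count reads per fixed-width bin along one chromosome.
    Assumes every start satisfies 0 <= start < chrom_length."""
    n_bins = math.ceil(chrom_length / bin_size)
    counts = [0] * n_bins
    for start in read_starts:
        counts[start // bin_size] += 1
    return counts


def log2_ratios(counts, pseudocount: float = 0.5):
    """log2 of each bin's count relative to the median bin.
    Values near 0 suggest the typical copy number; clearly positive or
    negative values suggest gain or loss. The pseudocount avoids log(0)."""
    reference = median(counts)
    return [math.log2((c + pseudocount) / (reference + pseudocount))
            for c in counts]
```

With few reads per bin, counting noise swamps the roughly one-unit log2 shift expected from a doubling or halving of copy number, which is why higher read yield per run matters for this application.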
