REViewer: Haplotype-resolved visualization of read alignments in and around tandem repeats

Background: Expansions of short tandem repeats are the cause of many neurogenetic disorders including familial amyotrophic lateral sclerosis, Huntington disease, and many others. Multiple methods have been recently developed that can identify repeat expansions in whole genome or exome sequencing data. Despite the widely-recognized need for visual assessment of variant calls in clinical settings, current computational tools lack the ability to produce such visualizations for repeat expansions. Expanded repeats are difficult to visualize because they correspond to large insertions relative to the reference genome and involve many misaligning and ambiguously aligning reads. Results: We implemented REViewer, a computational method for visualization of sequencing data in genomic regions containing long repeat expansions. To generate a read pileup, REViewer reconstructs local haplotype sequences and distributes reads to these haplotypes in a way that is most consistent with the fragment lengths and evenness of read coverage. To create appropriate training materials for onboarding new users, we performed a concordance study involving 12 scientists involved in STR research. We used the results of this study to create a user guide that describes the basic principles of using REViewer as well as a guide to the typical features of read pileups that correspond to low confidence repeat genotype calls. Additionally, we demonstrated that REViewer can be used to annotate clinically-relevant repeat interruptions by comparing visual assessment results of 44 FMR1 repeat alleles with the results of triplet repeat primed PCR. For 38 of these alleles, the results of visual assessment were consistent with triplet repeat primed PCR. Conclusions: Read pileup plots generated by REViewer offer an intuitive way to visualize sequencing data in regions containing long repeat expansions. 
Laboratories can use REViewer to assess the quality of repeat genotype calls as well as to visually detect interruptions or other imperfections in the repeat sequence and the surrounding flanking regions.

ExpansionHunter Denovo: A computational method for locating known and novel repeat expansions in short-read sequencing data

10.1101/863035 ◽

2019 ◽

Author(s):

Egor Dolzhenko ◽

Mark F. Bennett ◽

Phillip A. Richmond ◽

Brett Trost ◽

Sai Chen ◽

...

Keyword(s):

Tandem Repeats ◽

Simulated Data ◽

Computational Method ◽

Detection Methods ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Monogenic Disorders ◽

Genome Wide ◽

Repeat Expansions

Abstract Expansions of short tandem repeats are responsible for over 40 monogenic disorders, and undoubtedly many more pathogenic repeat expansions (REs) remain to be discovered. Existing methods for detecting REs in short-read sequencing data require predefined repeat catalogs. However, recent discoveries have emphasized the need for detection methods that do not require candidate repeats to be specified in advance. To address this need, we introduce ExpansionHunter Denovo, an efficient catalog-free method for genome-wide detection of REs. Analysis of real and simulated data shows that our method can identify large expansions of 41 out of 44 pathogenic repeats, including nine recently reported non-reference REs not discoverable via existing methods. ExpansionHunter Denovo is freely available at https://github.com/Illumina/ExpansionHunterDenovo

Critical assessment of bioinformatics methods for the characterization of pathological repeat expansions with single-molecule sequencing data

Briefings in Bioinformatics ◽

10.1093/bib/bbz099 ◽

2019 ◽

Vol 21 (6) ◽

pp. 1971-1986 ◽

Cited By ~ 1

Author(s):

Matteo Chiara ◽

Federico Zambelli ◽

Ernesto Picardi ◽

David S Horner ◽

Graziano Pesole

Keyword(s):

Single Molecule ◽

Tandem Repeats ◽

Simulated Data ◽

Detailed Comparison ◽

Sequencing Data ◽

Single Molecule Sequencing ◽

Sequencing Technologies ◽

Repeat Expansions

Abstract Over the last 5 years, a number of studies have reported the successful application of single-molecule sequencing technologies to the determination of the size and sequence of pathological expanded microsatellite repeats. However, different custom bioinformatics pipelines were employed in each study, preventing meaningful comparisons and somewhat limiting the reproducibility of the results. In this review, we provide a brief summary of state-of-the-art methods for the characterization of expanded repeat alleles, along with a detailed comparison of bioinformatics tools for the determination of repeat length and sequence, using both real and simulated data. Our reanalysis of publicly available human genome sequencing data suggests a modest, but statistically significant, increase of the error rate of single-molecule sequencing technologies at genomic regions containing short tandem repeats. However, we observe that all the methods herein tested, irrespective of the strategy used for the analysis of the data (either based on the alignment or assembly of the reads), show high levels of sensitivity in both the detection of expanded tandem repeats and the estimation of the expansion size, suggesting that approaches based on single-molecule sequencing technologies are highly effective for the detection and quantification of tandem repeat expansions and contractions.

Analysis Techniques

DNA Fingerprinting ◽

10.1093/oso/9780716770015.003.0009 ◽

1993 ◽

Keyword(s):

Tandem Repeats ◽

Dna Analysis ◽

Restriction Enzymes ◽

Repeat Sequence ◽

Dot Blot ◽

Analysis Techniques ◽

Base Sequences ◽

Optimum Reaction ◽

Flanking Regions

Conventional DNA analysis techniques include cleavage of DNA by restriction enzymes, fragment electrophoresis, Southern transfer, probe labeling, probe-genomic fragment hybridization, and print detection (Cawood 1989, Sambrook 1989, Berger 1987). Details of the assay conditions may vary considerably depending on the specific probes hybridized. Endonuclease digestion, electrophoresis, and Southern transfer are not required with simple dot-blot procedures. The quality of the final result can be no greater than the quality of the input DNA specimen and the attention of the analyst to assay details. The format of the analysis blot must be carefully considered to include control specimens and a broad range of size markers. The analyst must also be certain about the sizes of the profile fragments to accurately determine whether matches exist between crime evidence and suspect specimens, or between offspring and putative parent specimens, and to calculate the match probabilities. Restriction enzymes cleave DNA at specific recognition base sequences. It is important to choose an enzyme with sites flanking the repeats when fragments consisting of different numbers of tandem repeats are to be characterized for DNA profiling. Cleavage within a repeat sequence will result in the production of small fragments that may be unresolvable. The choice of enzyme, in this respect, is accomplished either by trial and error or by knowledge of the base sequence of the fragment flanking regions. The optimum reaction conditions vary for each enzyme; consequently, suppliers usually provide information sheets for the user. Digestion temperature and buffer salt concentration are the critical features. The reaction mixture can be overlaid with a few drops of paraffin oil to prevent vapor formation and changes in the buffer concentration. This applies mainly to enzymes such as Taq I that require high reaction temperatures (65°C in this example). Unless specifically indicated otherwise, three buffers of different ionic strength will accommodate most enzymes.
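The point about choosing an enzyme whose sites flank the repeat, rather than fall inside it, can be illustrated with a small in-silico digest. The sketch below is not taken from the chapter: the function name `digest`, the toy sequences, and the cut offset are illustrative assumptions. It uses the Taq I recognition sequence TCGA, cut after the first base.

```python
def digest(sequence: str, site: str, cut_offset: int) -> list[str]:
    """Cut `sequence` at every occurrence of `site`.

    The cut is placed `cut_offset` bases into each recognition site.
    Hypothetical helper for illustration only.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments


# Toy locus (assumed): a CAG repeat with one TCGA site in each flank.
flanked = "AAGGTCGAACC" + "CAG" * 6 + "GGTTCGATT"
flanked_lengths = [len(f) for f in digest(flanked, "TCGA", 1)]

# Toy locus (assumed): the recognition site is itself the repeat unit.
internal = "AA" + "TCGA" * 4 + "GG"
internal_lengths = [len(f) for f in digest(internal, "TCGA", 1)]

print(flanked_lengths)   # one long fragment carries the whole repeat
print(internal_lengths)  # the repeat is chopped into unit-sized pieces
```

In the first case the middle fragment grows or shrinks with the repeat count, which is what a length-based profile needs. In the second the repeat is reduced to fragments of a few bases each, regardless of how many units were present.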

ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data

Genome Biology ◽

10.1186/s13059-020-02017-z ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 6

Author(s):

Egor Dolzhenko ◽

Mark F. Bennett ◽

Phillip A. Richmond ◽

Brett Trost ◽

Sai Chen ◽

...

Keyword(s):

Computational Method ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Repeat Expansions

REscan: inferring repeat expansions and structural variation in paired-end short read sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa753 ◽

2020 ◽

Author(s):

Russell Lewis McLaughlin

Keyword(s):

Structural Variation ◽

Sequence Data ◽

Neurological Diseases ◽

Repeat Expansion ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Repeat Expansions ◽

Paired End Sequencing

Abstract Motivation: Repeat expansions are an important class of genetic variation in neurological diseases. However, the identification of novel repeat expansions using conventional sequencing methods is a challenge due to their typical lengths relative to short sequence reads and the difficulty of producing accurate and unique alignments for repetitive sequence. This latter property can nevertheless be harnessed in paired-end sequencing data to infer the possible locations of repeat expansions and other structural variation. Results: This article presents REscan, a command-line utility that infers repeat expansion loci from paired-end short read sequencing data by reporting the proportion of reads orientated towards a locus that do not have an adequately mapped mate. A high REscan statistic relative to a population of data suggests a repeat expansion locus for experimental follow-up. This approach is validated using genome sequence data for 259 cases of amyotrophic lateral sclerosis, of which 24 are positive for a large repeat expansion in C9orf72, showing that REscan statistics readily discriminate repeat expansion carriers from non-carriers. Availability and implementation: C source code at https://github.com/rlmcl/rescan (GNU General Public Licence v3).
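The statistic described above, the proportion of reads orientated towards a locus whose mate is not adequately mapped, can be sketched in a few lines. This is a simplified illustration and not REscan's C implementation: the `Read` record, the `window` parameter, and the pre-computed `mate_ok` flag (standing in for whatever "adequately mapped" means in the tool) are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    pos: int           # leftmost aligned position of the read
    is_reverse: bool   # True if the read aligns to the reverse strand
    mate_ok: bool      # True if the mate counts as adequately mapped (assumed flag)


def towards_locus(read: Read, locus: int) -> bool:
    """Forward reads upstream and reverse reads downstream point at the locus."""
    if read.is_reverse:
        return read.pos > locus
    return read.pos < locus


def unmapped_mate_fraction(reads: list[Read], locus: int,
                           window: int = 500) -> Optional[float]:
    """Fraction of locus-facing reads within `window` bases lacking a good mate."""
    facing = [r for r in reads
              if abs(r.pos - locus) <= window and towards_locus(r, locus)]
    if not facing:
        return None  # no informative reads near this locus
    return sum(not r.mate_ok for r in facing) / len(facing)


reads = [
    Read(900, False, False),   # faces the locus, mate missing
    Read(950, False, True),    # faces the locus, mate fine
    Read(1100, True, False),   # faces the locus, mate missing
    Read(1050, False, True),   # points away from the locus, ignored
    Read(100, False, False),   # outside the window, ignored
]
stat = unmapped_mate_fraction(reads, locus=1000)
```

As the abstract notes, a single value means little on its own; it becomes informative when compared against the same statistic across a population of samples.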

How Long Are Long Tandem Repeats? A Challenge for Current Methods of Whole-Genome Sequence Assembly: The Case of Satellites in Caenorhabditis elegans

Genes ◽

10.3390/genes9100500 ◽

2018 ◽

Vol 9 (10) ◽

pp. 500

Author(s):

Juan A. Subirana ◽

Xavier Messeguer

Keyword(s):

Caenorhabditis Elegans ◽

Sanger Sequencing ◽

Tandem Repeats ◽

Whole Genome Sequence ◽

Nanopore Sequencing ◽

Original Sequence ◽

Genome Sequence Assembly ◽

Long Read ◽

Genomic Regions ◽

Caenorhabditis Elegans Genome

Repetitive genome regions have been difficult to sequence, mainly because of the comparatively small size of the fragments used in assembly. Satellites or tandem repeats are very abundant in nematodes and offer an excellent playground to evaluate different assembly methods. Here, we compare the structure of satellites found in three different assemblies of the Caenorhabditis elegans genome: the original sequence obtained by Sanger sequencing, an assembly based on PacBio technology, and an assembly using Nanopore sequencing reads. In general, satellites were found in equivalent genomic regions, but the new long-read methods (PacBio and Nanopore) tended to result in longer assembled satellites. Important differences exist between the assemblies resulting from the two long-read technologies, such as the sizes of long satellites. Our results also suggest that the lengths of some annotated genes with internal repeats which were assembled using Sanger sequencing are likely to be incorrect.

Accurate measurement of microsatellite length by disrupting its tandem repeat structure

10.1101/2021.12.09.471828 ◽

2021 ◽

Author(s):

Dan Levy ◽

Zihua Wang ◽

Andrea Moffitt ◽

Michael H. Wigler

Keyword(s):

Tandem Repeat ◽

Error Rate ◽

Tandem Repeats ◽

Clinical Applications ◽

Error Rates ◽

Sequence Motifs ◽

High Error Rate ◽

Repeat Structure ◽

Flanking Regions ◽

Simple Sequence

Replication of tandem repeats of simple sequence motifs, also known as microsatellites, is error prone and variable lengths frequently occur during population expansions. Therefore, microsatellite length variations could serve as markers for cancer. However, accurate error-free quantitation of microsatellite lengths is difficult with current methods because of a high error rate during amplification and sequencing. We have solved this problem by using partial mutagenesis to disrupt enough of the repeat structure so that it can replicate faithfully, yet not so much that the flanking regions cannot be reliably identified. In this work we use bisulfite mutagenesis to convert a C to a U, later read as T. Compared to untreated templates, we achieve three orders of magnitude reduction in the error rate per round of replication. By requiring two independent first copies of an initial template, we reach error rates below one in a million. We discuss potential clinical applications of this method.
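The idea of partially disrupting a repeat by C-to-T conversion can be pictured with a toy simulation. The sketch below is only an illustration under stated assumptions: the function name `partial_c_to_t`, the per-base `conversion_rate`, and the example sequence are hypothetical, and the code models nothing about the actual chemistry or the authors' protocol.

```python
import random


def partial_c_to_t(sequence: str, conversion_rate: float,
                   rng: random.Random) -> str:
    """Convert each C to T independently with probability `conversion_rate`.

    Mimics reading a bisulfite-converted C (a U) as T; the rate is an
    assumed parameter, not a value from the abstract.
    """
    return "".join(
        "T" if base == "C" and rng.random() < conversion_rate else base
        for base in sequence
    )


rng = random.Random(0)
template = "GGAT" + "CA" * 10 + "TTGC"   # toy CA microsatellite with flanks
converted = partial_c_to_t(template, 0.5, rng)

# Positions where the converted copy differs from the template.
changed = [i for i, (a, b) in enumerate(zip(template, converted)) if a != b]
```

A partially converted copy keeps its length but no longer consists of identical repeat units, which is the property the abstract relies on to make replication more faithful while leaving the flanks recognizable.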

Genome Size Estimation and Full-Length Transcriptome of Sphingonotus tsinlingensis: Genetic Background of a Drought-Adapted Grasshopper

Frontiers in Genetics ◽

10.3389/fgene.2021.678625 ◽

2021 ◽

Vol 12 ◽

Author(s):

Lu Zhao ◽

Hang Wang ◽

Ping Li ◽

Kuo Sun ◽

De-Long Guan ◽

...

Keyword(s):

Genome Size ◽

Genetic Background ◽

Tandem Repeats ◽

Full Length ◽

Arid Environments ◽

Size Estimation ◽

Sequencing Data ◽

Protein Coding ◽

Grasshopper Species ◽

Hsp Genes

Sphingonotus Fieber, 1852 (Orthoptera: Acrididae), is a grasshopper genus comprising approximately 170 species, all of which prefer dry environments such as deserts, steppes, and stony benchlands. In this study, we aimed to examine the adaptation of grasshopper species to arid environments. The genome size of Sphingonotus tsinlingensis was estimated using flow cytometry, and the first high-quality full-length transcriptome of this species was produced. The genome size of S. tsinlingensis is approximately 12.8 Gb. Based on 146.98 Gb of PacBio sequencing data, 221.47 Mb of full-length transcripts were assembled. Among these, 88,693 non-redundant isoforms were identified with an N50 value of 2,726 bp, which was markedly longer than in previous grasshopper transcriptome assemblies. In total, 48,502 protein-coding sequences were identified, and 37,569 were annotated using public gene function databases. Moreover, 36,488 simple tandem repeats, 12,765 long non-coding RNAs, and 414 transcription factors were identified. According to gene functions, 61 cytochrome P450 (CYP450) and 66 heat shock protein (HSP) genes, which may be associated with drought adaptation of S. tsinlingensis, were identified. We compared the transcriptome of S. tsinlingensis with those of two other grasshopper species that are less tolerant to drought, namely Mongolotettix japonicus and Gomphocerus licenti. We observed that the expression of CYP450 and HSP genes was higher in S. tsinlingensis. We produced the first full-length transcriptome of a Sphingonotus species that has an ultra-large genome. The assembly characteristics were better than those of all known grasshopper transcriptomes. This full-length transcriptome may thus be used to understand the genetic background and evolution of grasshoppers.
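For readers unfamiliar with the N50 value quoted above, the usual definition is the length L such that sequences of length L or longer account for at least half of the total assembled bases. The sketch below computes it for a list of lengths; the function name and the toy input are illustrative, and the authors' exact tooling is not stated in the abstract.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of lengths")


toy_lengths = [2, 3, 4, 5, 6]   # total of 20 bases
toy_n50 = n50(toy_lengths)
```

Because the statistic is weighted by length, a handful of long isoforms can raise it considerably, so it is usually read alongside the isoform count and total assembled size.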

Transposable element expression in tumors is associated with immune infiltration and increased antigenicity

Nature Communications ◽

10.1038/s41467-019-13035-2 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 16

Author(s):

Yu Kong ◽

Christopher M. Rose ◽

Ashley A. Cass ◽

Alexander G. Williams ◽

Martine Darwish ◽

...

Keyword(s):

Dna Methylation ◽

De Novo ◽

Computational Method ◽

The Cancer Genome Atlas ◽

Potential Consequence ◽

Sequencing Data ◽

Antiviral Responses ◽

Genome Wide ◽

Cancer Genome Atlas ◽

Demethylation Agent

Abstract Profound global loss of DNA methylation is a hallmark of many cancers. One potential consequence of this is the reactivation of transposable elements (TEs) which could stimulate the immune system via cell-intrinsic antiviral responses. Here, we develop REdiscoverTE, a computational method for quantifying genome-wide TE expression in RNA sequencing data. Using The Cancer Genome Atlas database, we observe increased expression of over 400 TE subfamilies, of which 262 appear to result from a proximal loss of DNA methylation. The most recurrent TEs are among the evolutionarily youngest in the genome, predominantly expressed from intergenic loci, and associated with antiviral or DNA damage responses. Treatment of glioblastoma cells with a demethylation agent results in both increased TE expression and de novo presentation of TE-derived peptides on MHC class I molecules. Therapeutic reactivation of tumor-specific TEs may synergize with immunotherapy by inducing inflammation and the display of potentially immunogenic neoantigens.

Recovery of non-reference sequences missing from the human reference genome

BMC Genomics ◽

10.1186/s12864-019-6107-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 3

Author(s):

Ran Li ◽

Xiaomeng Tian ◽

Peng Yang ◽

Yingzhi Fan ◽

Ming Li ◽

...

Keyword(s):

Human Genome ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Precise Location ◽

Protein Coding ◽

Human Reference Genome ◽

Mhc Haplotype ◽

Reference Sequences ◽

Flanking Regions

Abstract Background: Non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variations with NRS exist. Results: Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their locations. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions: Our study detected a large number of NRS, including many alternate alleles that were previously uncharacterized. We suggest that the origin of alternate alleles is associated with tandem repeats. Our results enrich the spectrum of genetic variations in the human genome.