Discovery and Quality Analysis of a Comprehensive Set of Structural Variants and Short Tandem Repeats

Abstract: Structural variants (SVs) and short tandem repeats (STRs) are important sources of genetic diversity but are not routinely analyzed in genetic studies because they are difficult to accurately identify and genotype. Because SVs and STRs range in size and type, it is necessary to apply multiple algorithms that incorporate different types of evidence from sequencing data and employ complex filtering strategies to discover a comprehensive set of high-quality and reproducible variants. Here we assembled a set of 719 deep whole genome sequencing (WGS) samples (mean 42x) from 477 distinct individuals, which we used to discover and genotype a wide spectrum of SV and STR variants using five algorithms. We used 177 unique pairs of genetic replicates to identify factors that affect variant call reproducibility and developed a systematic filtering strategy to create one of the most complete and well-characterized maps of SVs and STRs to date.

Discovery and quality analysis of a comprehensive set of structural variants and short tandem repeats

Nature Communications ◽

10.1038/s41467-020-16481-5 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 1

Author(s):

David Jakubosky ◽

Erin N. Smith ◽

Matteo D’Antonio ◽

Marc Jan Bonder ◽

...

Keyword(s):

Short Tandem Repeats ◽

Tandem Repeats ◽

Quality Analysis ◽

Structural Variants ◽

Short Tandem

Properties of structural variants and short tandem repeats associated with gene expression and complex traits

Nature Communications ◽

10.1038/s41467-020-16482-4 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 1

Author(s):

David Jakubosky ◽

Matteo D’Antonio ◽

Marc Jan Bonder ◽

Craig Smail ◽

...

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Structural Variants ◽

Short Tandem

Accuracy of short tandem repeats genotyping tools in whole exome sequencing data

F1000Research ◽

10.12688/f1000research.22639.1 ◽

2020 ◽

Vol 9 ◽

pp. 200 ◽

Cited By ~ 2

Author(s):

Andreas Halman ◽

Alicia Oshlack

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Error Rate ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Sequencing Data ◽

Exome Sequencing Data ◽

Whole Exome ◽

Whole Exome Sequencing Data ◽

Short Tandem

Background: Short tandem repeats are an important source of genetic variation. They are highly mutable, and repeat expansions are associated with dozens of human disorders, such as Huntington's disease and spinocerebellar ataxias. Technical advances in sequencing technology have made it possible to analyse these repeats at large scale; however, accurate genotyping is still a challenging task. We compared four different short tandem repeat genotyping tools on whole exome sequencing data to determine their genotyping performance and limits, which will aid other researchers in choosing a suitable tool and parameters for analysis. Methods: The analysis was performed on the Simons Simplex Collection dataset, where we used a novel method of evaluation with accuracy determined by the rate of homozygous calls on the X chromosome of male samples. In total we analysed 433 samples and around a million genotypes for evaluating tools on whole exome sequencing data. Results: All tools performed relatively well when genotyping repeats of 3-6 bp in length, and performance could be improved with coverage and quality score filtering. However, genotyping homopolymers was challenging for all tools, and a high error rate was present across different thresholds of coverage and quality scores. Interestingly, dinucleotide repeats displayed a high error rate as well, which was found to be mainly caused by the AC/TG repeats. Overall, LobSTR was able to make the most calls and was also the fastest tool, while RepeatSeq and HipSTR exhibited the lowest heterozygous error rate at low coverage. Conclusions: All tools have different strengths and weaknesses, and the choice may depend on the application. In this analysis we demonstrated the effect of using different filtering parameters and offered recommendations based on the trade-off between the best accuracy of genotyping and the highest number of calls.
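The evaluation idea in the Methods can be illustrated with a minimal Python sketch. It assumes that a male sample carries a single X chromosome, so any heterozygous STR call there (outside the pseudoautosomal regions, which this sketch does not handle) can be counted as a genotyping error. The `StrCall` record, its fields, and the `min_coverage` parameter are hypothetical names chosen for illustration; they are not taken from the paper or from any of the tools it compares.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class StrCall:
    """One STR genotype call on the X chromosome of a male sample (hypothetical record)."""
    allele_a: int   # repeat length of the first called allele
    allele_b: int   # repeat length of the second called allele
    coverage: int   # number of reads supporting the call


def heterozygous_error_rate(calls: Iterable[StrCall], min_coverage: int = 0) -> float:
    """Fraction of retained calls that are heterozygous.

    On a male X chromosome every true genotype is hemizygous, so a call
    with two different alleles is treated as an error.
    """
    kept = [c for c in calls if c.coverage >= min_coverage]
    if not kept:
        raise ValueError("no calls pass the coverage filter")
    heterozygous = sum(1 for c in kept if c.allele_a != c.allele_b)
    return heterozygous / len(kept)


# Made-up example calls, not data from the study.
calls = [
    StrCall(12, 12, coverage=30),
    StrCall(9, 9, coverage=8),
    StrCall(15, 16, coverage=5),   # heterozygous call: counted as an error
    StrCall(7, 7, coverage=22),
]

print(heterozygous_error_rate(calls))                   # 0.25
print(heterozygous_error_rate(calls, min_coverage=10))  # 0.0
```

Raising `min_coverage` lowers the error rate in this toy example but also discards half of the calls, which mirrors the trade-off between accuracy and number of calls that the abstract describes.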

NGSEP3: accurate variant calling across species and sequencing protocols

Bioinformatics ◽

10.1093/bioinformatics/btz275 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4716-4723 ◽

Cited By ~ 7

Author(s):

Daniel Tello ◽

Juanita Gil ◽

Cristian D Loaiza ◽

John J Riascos ◽

Nicolás Cardozo ◽

...

Keyword(s):

Short Tandem Repeats ◽

Tandem Repeats ◽

High Throughput Sequencing ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

Sequencing Data ◽

Comparative Accuracy ◽

Downstream Analysis ◽

Short Tandem

Motivation: Accurate detection, genotyping and downstream analysis of genomic variants from high-throughput sequencing data are fundamental features in modern production pipelines for genetic-based diagnosis in medicine or genomic selection in plant and animal breeding. Our research group maintains the Next-Generation Sequencing Experience Platform (NGSEP) as a precise, efficient and easy-to-use software solution for these features. Results: Understanding that incorrect alignments around short tandem repeats are an important source of genotyping errors, we implemented in NGSEP new algorithms for realignment and haplotype clustering of reads spanning indels and short tandem repeats. We performed extensive benchmark experiments comparing NGSEP to state-of-the-art software using real data from three sequencing protocols and four species with different distributions of repetitive elements. NGSEP consistently shows comparative accuracy and better efficiency compared to the existing solutions. We expect that this work will contribute to the continuous improvement of quality in variant calling needed for modern applications in medicine and agriculture. Availability and implementation: NGSEP is available as open source software at http://ngsep.sf.net. Supplementary information: Supplementary data are available at Bioinformatics online.

A worldwide map of swine short tandem repeats and their associations with evolutionary and environmental adaptations

Genetics Selection Evolution ◽

10.1186/s12711-021-00631-4 ◽

2021 ◽

Vol 53 (1) ◽

Author(s):

Zhongzi Wu ◽

Huanfa Gong ◽

Mingpeng Zhang ◽

Xinkai Tong ◽

Huashui Ai ◽

...

Keyword(s):

Environmental Variables ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Nucleotide Polymorphisms ◽

Loss Of Function ◽

Sequencing Data ◽

Wild Boars ◽

A Genome ◽

Average Annual Temperature ◽

Short Tandem

Background: Short tandem repeats (STRs) are genetic markers with a greater mutation rate than single nucleotide polymorphisms (SNPs) and are widely used in genetic studies and forensics. However, most studies in pigs have focused only on SNPs or on a limited number of STRs. Results: This study screened 394 deep-sequenced genomes from 22 domesticated pig breeds/populations worldwide, wild boars from both Europe and Asia, and numerous outgroup Suidaes, and identified a set of 878,967 polymorphic STRs (pSTRs), which represents the largest repository of pSTRs in pigs to date. We found multiple lines of evidence that pSTRs in coding regions were affected by purifying selection. The enrichment of trinucleotide pSTRs in coding sequences (CDS), 5′UTR and H3K4me3 regions suggests that trinucleotide STRs serve as important components in the exons and promoters of the corresponding genes. We demonstrated that, compared to SNPs, pSTRs provide comparable or even greater accuracy in determining the breed identity of individuals. We identified pSTRs that showed significant population differentiation between domestic pigs and wild boars in Asia and Europe. We also observed that some pSTRs were significantly associated with environmental variables, such as average annual temperature or altitude of the originating sites of Chinese indigenous breeds, among which we identified loss-of-function and/or expanded STRs overlapping with genes such as AHR, LAS1L and PDK1. Finally, our results revealed that several pSTRs show stronger signals in domestic pig–wild boar differentiation or association with the analysed environmental variables than the flanking SNPs within a 100-kb window. Conclusions: This study provides a genome-wide high-density map of pSTRs in diverse pig populations based on genome sequencing data, enabling a more comprehensive characterization of their roles in evolutionary and environmental adaptation.

STRipy: a graphical application for enhanced genotyping of pathogenic short tandem repeats in sequencing data

10.1101/2021.06.13.448220 ◽

2021 ◽

Author(s):

Andreas Halman ◽

Egor Dolzhenko ◽

Alicia Oshlack

Keyword(s):

Genome Sequencing ◽

Fragment Length ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

High Throughput Sequencing ◽

Causal Variant ◽

Sequencing Data ◽

Pathogenic Variants ◽

Set Up ◽

Short Tandem

Abstract: Short tandem repeats (STRs) are highly polymorphic with high mutation rates, and expansions of STRs have been implicated as the causal variant in diseases. The application of genome sequencing in patients has recently allowed many new discoveries, with over 50 disease-causing loci known to date. There are several tools which allow genotyping of STRs from high-throughput sequencing (HTS) data. However, running these tools out of the box allows only around half of the known disease-causing loci to be genotyped, with lengths often limited to either read or fragment length, which is less than the pathogenic cut-off for some diseases. While analysis tools can be customised to genotype extra loci, this requires proficiency in bioinformatics to set up, use, and analyse the resulting data, limiting their widespread usage by other researchers and clinicians. To address these issues, we have created new software called STRipy that has an intuitive graphical interface and requires no specific skills for usage, thus significantly simplifying detection of STR expansions from human HTS data. STRipy is able to target all known disease-causing STRs, with genotyping performed by an established tool, ExpansionHunter, that is incorporated into the software. We have added functionality to STRipy to work with long alleles exceeding the fragment length. STRipy was validated using over 60 thousand simulated samples and was shown to work on whole genome sequencing of biological samples with pathogenic variants. Finally, we have used STRipy to acquire genotypes of pathogenic loci for thousands of samples from various populations, which are provided to the user along with the data from the literature to assist with results interpretation. We believe the simplicity and breadth of STRipy will increase the testing of STR diseases in current datasets, resulting in further diagnoses of rare diseases caused by STR expansions.

Importance of polymorphic SNPs, short tandem repeats and structural variants for differential gene expression among inbred C57BL/6 and C57BL/10 substrains

10.1101/2020.03.16.993683 ◽

2020 ◽

Author(s):

Milad Mortazavi ◽

Yangsu Ren ◽

Shubham Saini ◽

Danny Antaki ◽

Celine St. Pierre ◽

...

Keyword(s):

Gene Expression ◽

Differentially Expressed Genes ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Expression Patterns ◽

Unintended Consequences ◽

Differentially Expressed ◽

Structural Variants ◽

Short Tandem ◽

New Mutations

Abstract: C57BL/6J is the most widely used inbred mouse strain and is the basis for the mouse reference genome. In addition to C57BL/6J, several other C57BL/6 and C57BL/10 substrains exist. Previous studies have documented extensive phenotypic and genetic differences among these substrains, which are presumed to be due to the accumulation of new mutations. These differences can be used for genome-wide association studies. They can also have unintended consequences for reproducibility when substrain differences are not properly accounted for. In this paper, we performed genomic sequencing and RNA-sequencing in the hippocampus of 9 C57BL/6 and 5 C57BL/10 substrains. We identified 985,329 SNPs, 150,344 short tandem repeats (STRs) and 896 structural variants (SVs), out of which 330,178 SNPs and 14,367 STRs differentiated the C57BL/6 and C57BL/10 groups. We found several regions that contained dense polymorphisms. We also identified 578 differentially expressed genes for C57BL/6 substrains and 37 differentially expressed genes for C57BL/10 substrains (FDR < 0.01). We then identified nearby SNPs, STRs and SVs that matched the gene expression patterns. In so doing, we identified SVs in coding regions of Wdfy1, Ide, Fgfbp3 and Btaf1 that explain the expression patterns observed. We replicated several previously reported gene expression differences between substrains (Nnt, Gabra2) as well as many novel gene expression differences (e.g. Kcnc2). Our results illustrate the impact of new mutations on gene expression among these substrains and provide a resource for future mapping studies.

Genomic properties of structural variants and short tandem repeats that impact gene expression and complex traits in humans

10.1101/714477 ◽

2019 ◽

Cited By ~ 3

Author(s):

David Jakubosky ◽

Matteo D’Antonio ◽

Marc Jan Bonder ◽

Craig Smail ◽

Margaret K.R. Donovan ◽

...

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Evolutionary Constraint ◽

Structural Variants ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Genomic Locations ◽

Short Tandem

Abstract: Structural variants (SVs) and short tandem repeats (STRs) comprise a broad group of diverse DNA variants which vastly differ in their sizes and distributions across the genome. Here, we show that different SV classes and STRs differentially impact gene expression and complex traits. Functional differences between SV classes and STRs include their genomic locations relative to eGenes, likelihood of being associated with multiple eGenes, associated eGene types (e.g., coding, noncoding, level of evolutionary constraint), effect sizes, linkage disequilibrium with tagging single nucleotide variants used in GWAS, and likelihood of being associated with GWAS traits. We also identified a set of high-impact SVs/STRs associated with the expression of three or more eGenes via chromatin loops and showed they are highly enriched for being associated with GWAS traits. Our study provides insights into the genomic properties of structural variant classes and short tandem repeats that impact gene expression and human traits.

Accuracy of short tandem repeats genotyping tools in whole exome sequencing data

10.1101/2020.02.03.933002 ◽

2020 ◽

Author(s):

Andreas Halman ◽

Alicia Oshlack

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Error Rate ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Sequencing Data ◽

Exome Sequencing Data ◽

Whole Exome ◽

Whole Exome Sequencing Data ◽

Short Tandem

Background: Short tandem repeats are an important source of genetic variation. They are highly mutable, and repeat expansions are associated with dozens of human disorders, such as Huntington’s disease and spinocerebellar ataxias. Technical advances in sequencing technology have made it possible to analyse these repeats at large scale; however, accurate genotyping is still a challenging task. We compared four different short tandem repeat genotyping tools on whole exome sequencing data to determine their genotyping performance and limits, which will aid other researchers in choosing a suitable tool and parameters for analysis. Methods: The analysis was performed on the Simons Simplex Collection dataset, where we used a novel method of evaluation with accuracy determined by the rate of homozygous calls on the X chromosome of male samples. In total we analysed 433 samples and around a million genotypes for evaluating tools on whole exome sequencing data. Results: All tools performed relatively well when genotyping repeats of 3-6 bp in length, and performance could be improved with coverage and quality score filtering. However, genotyping homopolymers was challenging for all tools, and a high error rate was present across different thresholds of coverage and quality scores. Interestingly, dinucleotide repeats displayed a high error rate as well, which was found to be mainly caused by the AC/TG repeats. Overall, LobSTR was able to make the most calls and was also the fastest tool, while RepeatSeq and HipSTR exhibited the lowest heterozygous error rate at low coverage. Conclusions: All tools have different strengths and weaknesses, and the choice may depend on the type of analysis. In this analysis we demonstrated the effect of using different filtering parameters and offered recommendations based on the trade-off between the best accuracy of genotyping and the highest number of calls.
