A reference haplotype panel for genome-wide imputation of short tandem repeats

Abstract: Short tandem repeats (STRs) are involved in dozens of Mendelian disorders and have been implicated in a variety of complex traits. However, existing technologies focusing on single nucleotide polymorphisms (SNPs) have not allowed for systematic STR association studies. Here, we leverage next-generation sequencing data from 479 families to create a SNP+STR reference haplotype panel for genome-wide imputation of STRs into SNP data. Imputation achieved an average of 97% concordance between genotyped and imputed STR genotypes in an external dataset compared to 63% expected under a random model. Performance varied widely across STRs, with near perfect concordance at bi-allelic STRs vs. 70% at highly polymorphic forensics markers. We demonstrate that imputation increases power over individual SNPs to detect STR associations using simulated phenotypes and gene expression data. This resource will enable the first large-scale STR association studies using existing SNP datasets, and will likely yield new insights into complex traits.
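The concordance figures quoted in this abstract compare imputed STR genotypes against directly genotyped ones. A minimal sketch of one plausible way to score this is shown here. It assumes each diploid call is a pair of allele lengths and that a call scores 0, 0.5, or 1 according to how many alleles match; the abstract does not state the exact definition, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: a genotype is an unordered pair of allele lengths (repeat copies).
Genotype = Tuple[int, int]


def call_concordance(truth: Genotype, imputed: Genotype) -> float:
    """Return 0.0, 0.5 or 1.0: the fraction of alleles shared by two calls."""
    remaining: List[int] = list(truth)
    shared = 0
    for allele in imputed:
        if allele in remaining:
            # Each true allele may be matched at most once.
            remaining.remove(allele)
            shared += 1
    return shared / 2


def mean_concordance(
    truth_calls: Sequence[Genotype], imputed_calls: Sequence[Genotype]
) -> float:
    """Average per-call concordance over samples at one STR locus."""
    if len(truth_calls) != len(imputed_calls):
        raise ValueError("truth and imputed call lists must be the same length")
    if not truth_calls:
        raise ValueError("at least one call is required")
    scores = [call_concordance(t, i) for t, i in zip(truth_calls, imputed_calls)]
    return sum(scores) / len(scores)
```

Allele order within a call is ignored, so `(10, 12)` and `(12, 10)` count as the same genotype. A highly polymorphic locus offers more ways to miss an allele, which is consistent with the lower concordance the abstract reports for forensic markers.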

Nature Communications ◽

10.1038/s41467-018-06694-0 ◽

2018 ◽

Vol 9 (1) ◽

Cited By ~ 12

Author(s):

Shubham Saini ◽

Ileena Mitra ◽

Nima Mousavi ◽

Stephanie Feupe Fotsing ◽

Melissa Gymrek

Keyword(s):

Short Tandem Repeats ◽

Tandem Repeats ◽

Genome Wide ◽

Reference Haplotype ◽

Short Tandem

Multi-tissue analysis reveals short tandem repeats as ubiquitous regulators of gene expression and complex traits

10.1101/495226 ◽

2018 ◽

Cited By ~ 3

Author(s):

Stephanie Feupe Fotsing ◽

Jonathan Margoliash ◽

Catherine Wang ◽

Shubham Saini ◽

Richard Yanicky ◽

...

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Tissue Expression ◽

Genome Wide ◽

Limited Power ◽

Inflammatory Bowel ◽

Sequencing And Expression ◽

Short Tandem

Abstract: Short tandem repeats (STRs) have been implicated in a variety of complex traits in humans. However, genome-wide studies of the effects of STRs on gene expression thus far have had limited power to detect associations and provide insights into putative mechanisms. Here, we leverage whole genome sequencing and expression data for 17 tissues from the Genotype-Tissue Expression Project (GTEx) to identify STRs for which repeat number is associated with expression of nearby genes (eSTRs). Our analysis reveals more than 28,000 eSTRs. We employ fine-mapping to quantify the probability that each eSTR is causal and characterize a group of the top 1,400 fine-mapped eSTRs. We identify hundreds of eSTRs linked with published GWAS signals and implicate specific eSTRs in complex traits including height, schizophrenia, inflammatory bowel disease, and intelligence. Overall, our results support the hypothesis that eSTRs contribute to a range of human phenotypes and will serve as a valuable resource for future studies of complex traits.
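An eSTR, as described in this abstract, is an STR whose repeat number is associated with expression of a nearby gene. The sketch below illustrates the basic idea with a simple least-squares fit of expression on STR dosage. It assumes dosage is the sum of the two allele lengths and omits covariates and significance testing; the abstract does not specify the model, and all names are hypothetical.

```python
from typing import Sequence, Tuple


def str_dosage(genotype: Tuple[int, int]) -> int:
    """Assumed dosage: total repeat copies across both alleles."""
    return genotype[0] + genotype[1]


def slope_and_correlation(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares slope of y on x and the Pearson correlation."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length sequences with at least 3 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary across samples")
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


# Toy data: expression rises by 0.5 units per additional repeat copy.
genotypes = [(10, 10), (10, 11), (11, 11), (11, 12)]
expression = [1.0, 1.5, 2.0, 2.5]
dosages = [str_dosage(g) for g in genotypes]
slope, r = slope_and_correlation(dosages, expression)
```

In a real analysis the slope would be tested against zero for every STR-gene pair, with adjustment for population structure and other covariates.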

Spatially coordinated heterochromatinization of distal short tandem repeats in fragile X syndrome

10.1101/2021.04.23.441217 ◽

2021 ◽

Author(s):

Linda Zhou ◽

Chunmin Ge ◽

Thomas Malachowski ◽

Ji Hun Kim ◽

Keerthivasan Raanin Chandradoss ◽

...

Keyword(s):

Fragile X Syndrome ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Fragile X ◽

Repeat Expansion ◽

Genome Wide ◽

A Genome ◽

In Trans ◽

Surveillance Mechanism ◽

Short Tandem

Abstract: Short tandem repeat (STR) instability is causally linked to pathologic transcriptional silencing in a subset of repeat expansion disorders. In fragile X syndrome (FXS), instability of a single CGG STR tract is thought to repress FMR1 via local DNA methylation. Here, we report the acquisition of more than ten Megabase-sized H3K9me3 domains in FXS, including a 5-8 Megabase block around FMR1. Distal H3K9me3 domains encompass synaptic genes with STR instability, and spatially co-localize in trans concurrently with FMR1 CGG expansion and the dissolution of TADs. CRISPR engineering of mutation-length FMR1 CGG to normal-length preserves heterochromatin, whereas cut-out to pre-mutation-length attenuates a subset of H3K9me3 domains. Overexpression of a pre-mutation-length CGG de-represses both FMR1 and distal heterochromatinized genes, indicating that long-range H3K9me3-mediated silencing is exquisitely sensitive to STR length. Together, our data uncover a genome-wide surveillance mechanism by which STR tracts spatially communicate over vast distances to heterochromatinize the pathologically unstable genome in FXS.

One-Sentence Summary: Heterochromatinization of distal synaptic genes with repeat instability in fragile X is reversible by overexpression of a pre-mutation length CGG tract.

Better estimation of SNP heritability from summary statistics provides a new understanding of the genetic architecture of complex traits

10.1101/284976 ◽

2018 ◽

Cited By ~ 6

Author(s):

Doug Speed ◽

David J Balding

Keyword(s):

Complex Traits ◽

Genetic Architecture ◽

Large Scale ◽

Association Studies ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Confounding Bias ◽

Conserved Regions ◽

Genome Wide ◽

Variation Explained

LD Score Regression (LDSC) has been widely applied to the results of genome-wide association studies. However, its estimates of SNP heritability are derived from an unrealistic model in which each SNP is expected to contribute equal heritability. As a consequence, LDSC tends to over-estimate confounding bias, under-estimate the total phenotypic variation explained by SNPs, and provide misleading estimates of the heritability enrichment of SNP categories. Therefore, we present SumHer, software for estimating SNP heritability from summary statistics using more realistic heritability models. After demonstrating its superiority over LDSC, we apply SumHer to the results of 24 large-scale association studies (average sample size 121 000). First we show that these studies have tended to substantially over-correct for confounding, and as a result the number of genome-wide significant loci has been under-reported by about 20%. Next we estimate enrichment for 24 categories of SNPs defined by functional annotations. A previous study using LDSC reported that conserved regions were 13-fold enriched, and found a further twelve categories with above 2-fold enrichment. By contrast, our analysis using SumHer finds that conserved regions are only 1.6-fold (SD 0.06) enriched, and that no category has enrichment above 1.7-fold. SumHer provides an improved understanding of the genetic architecture of complex traits, which enables more efficient analysis of future genetic data.
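For orientation, the equal-heritability assumption criticized in this abstract can be written out using the standard form of the LD Score Regression equation as it is usually presented in the LDSC literature. The symbols below are not defined in the abstract and are introduced here only for illustration: N is the sample size, M the number of SNPs, h² the total SNP heritability, a the confounding term, and ℓ_j the LD score of SNP j.

```latex
% Expected association test statistic for SNP j under LDSC
\mathbb{E}\left[\chi^2_j\right] = 1 + N a + \frac{N h^2}{M}\,\ell_j,
\qquad
\ell_j = \sum_{k} r^2_{jk}

% Underlying assumption: every SNP contributes the same expected heritability
\mathbb{E}\left[h^2_j\right] = \frac{h^2}{M}
```

The intercept term 1 + N a is what LDSC attributes to confounding. If the per-SNP heritability model is wrong, part of the true signal can be absorbed into that intercept, which is consistent with the over-correction the abstract describes.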

Properties of structural variants and short tandem repeats associated with gene expression and complex traits

Nature Communications ◽

10.1038/s41467-020-16482-4 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 1

Author(s):

David Jakubosky ◽

Matteo D’Antonio ◽

Marc Jan Bonder ◽

Craig Smail ◽

...

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Structural Variants ◽

Short Tandem

circVAR database: genome-wide archive of genetic variants for human circular RNAs

10.21203/rs.3.rs-48904/v2 ◽

2020 ◽

Author(s):

Min Zhao ◽

Hong Qu

Keyword(s):

Genetic Variants ◽

Complex Traits ◽

Large Scale ◽

Rna Binding ◽

Rna Binding Proteins ◽

Association Studies ◽

Chromosome 17 ◽

Circular Rnas ◽

Genome Wide Association Studies ◽

Genome Wide

Abstract Background: Circular RNAs (circRNAs) play important roles in regulating gene expression through binding miRNAs and RNA binding proteins. Genetic variation of circRNAs may affect complex traits/diseases by changing their binding efficiency to target miRNAs and proteins. There is a growing demand for investigations of the functions of genetic changes using large-scale experimental evidence. However, there is no online genetic resource for circRNA genes. Results: We performed extensive genetic annotation of 295,526 circRNAs integrated from circBase, circNet and circRNAdb. All pre-computed genetic variants were presented at our online resource, circVAR, with data browsing and search functionality. We explored the chromosome-based distribution of circRNAs and their associated variants. We found that, based on mapping to the 1000 Genomes and ClinVar databases, chromosome 17 has a relatively large number of circRNAs and associated common and health-related genetic variants. Following the annotation of genome-wide association study (GWAS)-based circRNA variants, we found many non-coding variants within circRNAs, suggesting novel mechanisms for common diseases reported from GWAS. For cancer-based somatic variants, we found that chromosome 7 has many highly complex mutations that have been overlooked in previous research. Conclusion: We used the circVAR database to collect SNPs and small insertions and deletions (INDELs) in putative circRNA regions and to identify their potential phenotypic information. To provide a reusable resource for the circRNA research community, we have published all the pre-computed genetic data concerning circRNAs and associated genes together with data query and browsing functions at http://soft.bioinfo-minzhao.org/circvar.

Towards Development of Clustering Applications for Large-Scale Comparative Genotyping and Kinship Analysis Using Y-Short Tandem Repeats

OMICS A Journal of Integrative Biology ◽

10.1089/omi.2014.0136 ◽

2015 ◽

Vol 19 (6) ◽

pp. 361-367 ◽

Cited By ~ 3

Author(s):

Ali Seman ◽

Azizian Mohd Sapawi ◽

Mohd Zaki Salleh

Keyword(s):

Large Scale ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Kinship Analysis ◽

Short Tandem

Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1619944114 ◽

2017 ◽

Vol 114 (22) ◽

pp. 5671-5676 ◽

Cited By ~ 18

Author(s):

Michael D. Edge ◽

Bridget F. B. Algee-Hewitt ◽

Trevor J. Pemberton ◽

Jun Z. Li ◽

Noah A. Rosenberg

Keyword(s):

Linkage Disequilibrium ◽

Data Aggregation ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Genome Wide ◽

Genomic Marker ◽

Record Matching ◽

Privacy Risks ◽

Forensic Genetic ◽

Short Tandem

Combining genotypes across datasets is central in facilitating advances in genetics. Data aggregation efforts often face the challenge of record matching—the identification of dataset entries that represent the same individual. We show that records can be matched across genotype datasets that have no shared markers based on linkage disequilibrium between loci appearing in different datasets. Using two datasets for the same 872 people—one with 642,563 genome-wide SNPs and the other with 13 short tandem repeats (STRs) used in forensic applications—we find that 90–98% of forensic STR records can be connected to corresponding SNP records and vice versa. Accuracy increases to 99–100% when ∼30 STRs are used. Our method expands the potential of data aggregation, but it also suggests privacy risks intrinsic in maintenance of databases containing even small numbers of markers—including databases of forensic significance.

Mapping short tandem repeats for liver gene expression traits helps prioritize potential causal variants for complex traits in pigs

Journal of Animal Science and Biotechnology ◽

10.1186/s40104-021-00658-z ◽

2022 ◽

Vol 13 (1) ◽

Author(s):

Zhongzi Wu ◽

Huanfa Gong ◽

Zhimin Zhou ◽

Tao Jiang ◽

Ziqi Lin ◽

...

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Genome Wide Association Study ◽

Complex Trait ◽

Validation Population ◽

Causal Variants ◽

Liver Gene ◽

Short Tandem

Abstract Background: Short tandem repeats (STRs) were recently found to have significant impacts on gene expression and diseases in humans, but their roles in gene expression and complex traits in pigs remain unexplored. This study investigates the effects of STRs on gene expression in liver tissues based on the whole-genome sequences and RNA-Seq data of a discovery cohort of 260 F6 individuals and a validation population of 296 F7 individuals from a heterogeneous population generated from crosses among eight pig breeds. Results: We identified 5203 and 5868 significant expression STRs (eSTRs, FDR < 1%) in the F6 and F7 populations, respectively, most of which could be reciprocally validated (π1 = 0.92). The eSTRs explained 27.5% of the cis-heritability of gene expression traits on average. We further identified 235 and 298 fine-mapped STRs through the Bayesian fine-mapping approach in the F6 and F7 pigs, respectively, which were significantly enriched in intron, ATAC peak, compartment A and H3K4me3 regions. We identified 20 fine-mapped STRs located in 100 kb windows upstream and downstream of published complex trait-associated SNPs, which colocalized with epigenetic markers such as H3K27ac and ATAC peaks. These included eSTRs of the CLPB, PGLS, PSMD6 and DHDH genes, which are linked with genome-wide association study (GWAS) SNPs for blood-related traits, leg conformation, growth-related traits, and meat quality traits, respectively. Conclusions: This study provides insights into the effects of STRs on gene expression traits. The identified eSTRs are valuable resources for prioritizing causal STRs for complex traits in pigs.
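The eSTR counts in this abstract are reported at a false discovery rate (FDR) below 1%. The abstract does not say which FDR procedure was used, so the sketch below uses the Benjamini-Hochberg step-up procedure purely as a common illustrative choice. The function name and the example p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], fdr: float = 0.01) -> List[bool]:
    """Flag tests that pass the Benjamini-Hochberg step-up procedure.

    Returns one boolean per input p-value, in the original order.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Toy example: five STR-gene tests at a 5% FDR threshold.
calls = benjamini_hochberg([0.001, 0.2, 0.004, 0.04, 0.5], fdr=0.05)
```

Because the procedure is step-up, a test can pass even when a smaller p-value ranked before it misses its own threshold, as long as a later rank satisfies the inequality.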

CNest: A Novel Copy Number Association Discovery Method Uncovers 862 New Associations from 200,629 Whole Exome Sequence Datasets in the UK Biobank

10.1101/2021.08.19.456963 ◽

2021 ◽

Author(s):

Tomas W Fitzgerald ◽

Ewan Birney

Keyword(s):

Copy Number ◽

Large Scale ◽

Association Studies ◽

Genomic Variation ◽

Next Generation Sequencing Data ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Genome Wide ◽

The Uk ◽

Ngs Data

Copy number variation (CNV) has long been known to influence human traits and has a rich history of research into common and rare genetic disease. Although CNV is accepted as an important class of genomic variation, progress on copy number (CN) phenotype associations from next-generation sequencing (NGS) data has been limited, in part due to the relative difficulty of CNV detection and an enrichment for large numbers of false positives. To date, most successful CN genome-wide association studies (CN-GWAS) have focused on using predictive measures of dosage intolerance or gene burden tests to gain sufficient power for detecting CN effects. Here we present a novel method for large-scale CN analysis from NGS data, generating robust CN estimates and allowing CN-GWAS to be performed genome-wide in discovery mode. We provide a detailed analysis in the large-scale UK Biobank resource and a specifically designed software package for deriving CN estimates from NGS data that are robust enough to be used for CN-GWAS. We use these methods to perform genome-wide CN-GWAS analysis across 78 human traits, discovering 862 genetic associations that are likely to contribute strongly to trait distributions based solely on their CN or by acting in concert with other genetic variation. Finally, we undertake an analysis comparing CNV and SNP association signals across the same traits and samples, defining specific CNV association classes based on whether they could be detected using standard SNP-GWAS in the UK Biobank.
