Efficient Variant Set Mixed Model Association Tests for Continuous and Binary Traits in Large-Scale Whole-Genome Sequencing Studies

Abstract: With advances in Whole Genome Sequencing (WGS) technology, more advanced statistical methods for testing genetic association with rare variants are being developed. Methods in which variants are grouped for analysis are also known as variant-set, gene-based, and aggregate unit tests. The burden test and Sequence Kernel Association Test (SKAT) are two widely used variant-set tests, which were originally developed for samples of unrelated individuals and have later been extended to family data with known pedigree structures. However, computationally efficient and powerful variant-set tests are needed to make analyses tractable in large-scale WGS studies with complex study samples. In this paper, we propose the variant-Set Mixed Model Association Tests (SMMAT) for continuous and binary traits using the generalized linear mixed model framework. These tests can be applied to large-scale WGS studies involving samples with population structure and relatedness, such as in the National Heart, Lung, and Blood Institute’s Trans-Omics for Precision Medicine (TOPMed) program. SMMAT tests share the same null model for different variant sets, and a virtue of this null model, which includes covariates only, is that it needs to be fit only once for all tests in each genome-wide analysis. Simulation studies show that all the proposed SMMAT tests correctly control type I error rates for both continuous and binary traits in the presence of population structure and relatedness. We also illustrate our tests in a real data example of analysis of plasma fibrinogen levels in the TOPMed program (n = 23,763), using the Analysis Commons, a cloud-based computing platform.
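
The idea of fitting one covariates-only null model and reusing it across variant sets can be sketched as follows. This is a minimal illustration, not the SMMAT implementation: it assumes unrelated samples and a continuous trait, so the null model is ordinary least squares rather than a generalized linear mixed model with random effects, and the function names `fit_null` and `burden_test` are hypothetical.

```python
import math
import numpy as np

def fit_null(y, X):
    """Fit the covariates-only null model once (plain OLS; an assumption,
    the paper uses a mixed model that also handles relatedness)."""
    Q, _ = np.linalg.qr(X)                 # orthonormal basis of covariate space
    resid = y - Q @ (Q.T @ y)              # null-model residuals
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return {"Q": Q, "resid": resid, "sigma2": sigma2}

def burden_test(null, G, weights=None):
    """Score-type burden test for one variant set, reusing the null fit.
    G is an (n_samples, n_variants) genotype matrix coded 0/1/2."""
    if weights is None:
        weights = np.ones(G.shape[1])
    burden = G @ weights                   # collapse the set into one score
    # Remove the part of the burden explained by the covariates.
    burden_adj = burden - null["Q"] @ (null["Q"].T @ burden)
    var = null["sigma2"] * (burden_adj @ burden_adj)
    if var <= 0:
        return float("nan"), float("nan")  # no usable variation in this set
    stat = (burden_adj @ null["resid"]) ** 2 / var
    # Upper tail of a chi-square with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Simulated data: one set with a real effect, one without.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
G_causal = rng.binomial(2, 0.05, size=(n, 10))
G_neutral = rng.binomial(2, 0.05, size=(n, 10))
y = X @ np.array([1.0, 0.5]) + 2.0 * G_causal.sum(axis=1) + rng.normal(size=n)

null = fit_null(y, X)                      # fitted once
for name, G in [("causal", G_causal), ("neutral", G_neutral)]:
    stat, p = burden_test(null, G)         # reused for every variant set
    print(name, stat, p)
```

The design point is that `fit_null` does the expensive work once, while each call to `burden_test` needs only a few matrix-vector products per variant set.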

Group-based variant calling leveraging next-generation supercomputing for large-scale whole-genome sequencing studies

BMC Bioinformatics ◽

10.1186/s12859-015-0736-4 ◽

2015 ◽

Vol 16 (1) ◽

Cited By ~ 11

Author(s):

Kristopher A. Standish ◽

Tristan M. Carland ◽

Glenn K. Lockwood ◽

Wayne Pfeiffer ◽

Mahidhar Tatineni ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Variant Calling ◽

Whole Genome ◽

Next Generation ◽

Sequencing Studies

Identification of putative causal loci in whole-genome sequencing data via knockoff statistics

Nature Communications ◽

10.1038/s41467-021-22889-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Zihuai He ◽

Linxi Liu ◽

Chen Wang ◽

Yann Le Guen ◽

Justin Lee ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Rare Variants ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Association Tests ◽

Sequencing Project ◽

Risk Variants ◽

Sequencing Studies

Abstract: The analysis of whole-genome sequencing studies is challenging due to the large number of rare variants in noncoding regions and the lack of natural units for testing. We propose a statistical method to detect and localize rare and common risk variants in whole-genome sequencing studies based on a recently developed knockoff framework. It can (1) prioritize causal variants over associations due to linkage disequilibrium, thereby improving interpretability; (2) help distinguish the signal due to rare variants from shadow effects of significant common variants nearby; (3) integrate multiple knockoffs for improved power, stability, and reproducibility; and (4) flexibly incorporate state-of-the-art and future association tests to achieve the benefits proposed here. In applications to whole-genome sequencing data from the Alzheimer’s Disease Sequencing Project (ADSP) and COPDGene samples from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program, we show that our method, compared with conventional association tests, can lead to substantially more discoveries.

Identification of putative causal loci in whole-genome sequencing data via knockoff statistics

10.1101/2021.03.08.434451 ◽

2021 ◽

Author(s):

Zihuai He ◽

Linxi Liu ◽

Chen Wang ◽

Yann Le Guen ◽

Justin Lee ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Rare Variants ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Association Tests ◽

Sequencing Project ◽

Risk Variants ◽

Sequencing Studies

Abstract: The analysis of whole-genome sequencing studies is challenging due to the large number of rare variants in noncoding regions and the lack of natural units for testing. We propose a statistical method to detect and localize rare and common risk variants in whole-genome sequencing studies based on a recently developed knockoff framework. It can (1) prioritize causal variants over associations due to linkage disequilibrium, thereby improving interpretability; (2) help distinguish the signal due to rare variants from shadow effects of significant common variants nearby; (3) integrate multiple knockoffs for improved power, stability and reproducibility; and (4) flexibly incorporate state-of-the-art and future association tests to achieve the benefits proposed here. In applications to whole-genome sequencing data from the Alzheimer’s Disease Sequencing Project (ADSP) and COPDGene samples from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program, we show that our method, compared with conventional association tests, can lead to substantially more discoveries.

Haplocheck: Phylogeny-based Contamination Detection in Mitochondrial and Whole-Genome Sequencing Studies

10.1101/2020.05.06.080952 ◽

2020 ◽

Cited By ~ 1

Author(s):

Hansi Weissensteiner ◽

Lukas Forer ◽

Liane Fendt ◽

Azin Kheirkhah ◽

Antonio Salas ◽

...

Keyword(s):

Mitochondrial Genome ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Nuclear DNA ◽

Nuclear Genome ◽

Whole Genome ◽

Sequencing Studies ◽

Project Data ◽

The Impact

Abstract: Within-species contamination is a major issue in sequencing studies, especially for mitochondrial studies. Contamination can be detected by analysing the nuclear genome or by inspecting the heteroplasmic sites in the mitochondrial genome. Existing methods using the nuclear genome are computationally expensive, and no suitable tool for detecting contamination in large-scale mitochondrial datasets is available. Here we present haplocheck, a tool that requires only the mitochondrial genome to detect contamination in both mitochondrial and whole-genome sequencing studies. Haplocheck is able to distinguish between contaminated and real heteroplasmic sites using the mitochondrial phylogeny. By applying haplocheck to the 1000 Genomes Project data, we (1) show high concordance in contamination estimates between mitochondrial and nuclear DNA and (2) quantify the impact of mitochondrial copy numbers on the mitochondrial-based contamination results. Haplocheck complements leading nuclear DNA-based contamination tools, and can therefore be used as a proxy tool in nuclear genome studies. Haplocheck is available both as a command-line tool at https://github.com/genepi/haplocheck and as a cloud web-service producing interactive reports that facilitate navigation through the phylogeny of contaminated samples.

A framework for detecting noncoding rare variant associations of large-scale whole-genome sequencing studies

10.1101/2021.11.05.467531 ◽

2021 ◽

Author(s):

Zilin Li ◽

Xihao Li ◽

Hufeng Zhou ◽

Sheila M Gaynor ◽

Margaret Sunitha Selvaraj ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Association Analysis ◽

Genome Sequencing ◽

Large Scale ◽

Rare Variants ◽

Whole Genome ◽

Computationally Efficient ◽

Annotation Information ◽

Sequencing Studies ◽

Complex Human Traits

Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare variants' (RVs) associations with complex human traits. Variant set analysis is a powerful approach to study RV association, and a key component of it is constructing RV sets for analysis. However, existing methods have limited ability to define analysis units in the noncoding genome. Furthermore, there is a lack of robust pipelines for comprehensive and scalable noncoding RV association analysis. Here we propose a computationally-efficient noncoding RV association-detection framework that uses STAAR (variant-set test for association using annotation information) to group noncoding variants in gene-centric analysis based on functional categories. We also propose SCANG (scan the genome)-STAAR, which uses dynamic window sizes and incorporates multiple functional annotations, in a non-gene-centric analysis. We furthermore develop STAARpipeline to perform flexible noncoding RV association analysis, including gene-centric analysis as well as fixed-window-based and dynamic-window-based non-gene-centric analysis. We apply STAARpipeline to identify noncoding RV sets associated with four quantitative lipid traits in 21,015 discovery samples from the Trans-Omics for Precision Medicine (TOPMed) program and replicate several noncoding RV associations in an additional 9,123 TOPMed samples.

0306 Exploring the feasibility of using copy number variants as genetic markers through large-scale whole genome sequencing experiments

Journal of Animal Science ◽

10.2527/jam2016-0306 ◽

2016 ◽

Vol 94 (suppl_5) ◽

pp. 146-146

Author(s):

D. M. Bickhart ◽

L. Xu ◽

J. L. Hutchison ◽

J. B. Cole ◽

D. J. Null ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genetic Markers ◽

Genome Sequencing ◽

Copy Number ◽

Large Scale ◽

Copy Number Variants ◽

Whole Genome

Plasmids or no plasmids? A comparison between the Agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

A large-scale whole-genome sequencing analysis reveals false positives of bacterial essential genes

Applied Microbiology and Biotechnology ◽

10.1007/s00253-021-11702-3 ◽

2021 ◽

Author(s):

Yuanhao Li ◽

Bo Jiang ◽

Weijun Dai

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

False Positives ◽

Essential Genes ◽

Whole Genome ◽

Sequencing Analysis

Improving tuberculosis surveillance by detecting international transmission using publicly available whole-genome sequencing data

10.1101/834150 ◽

2019 ◽

Author(s):

Andrea Sanchini ◽

Christine Jandrasits ◽

Julius Tembrockhaus ◽

Thomas Andreas Kohl ◽

Christian Utpatel ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Added Value ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

International Transmission ◽

The Public ◽

Public Dataset ◽

Public Repositories

Abstract: Introduction: Improving the surveillance of tuberculosis (TB) is especially important for multidrug-resistant (MDR) and extensively drug-resistant (XDR)-TB. The large amount of publicly available whole-genome sequencing (WGS) data for TB gives us the chance to re-use data and to perform additional analysis at a large scale. Aim: We assessed the usefulness of raw WGS data of global MDR/XDR-TB isolates available from public repositories to improve TB surveillance. Methods: We extracted raw WGS data and the related metadata of Mycobacterium tuberculosis isolates available from the Sequence Read Archive. We compared this public dataset with WGS data and metadata of 131 MDR- and XDR-TB isolates from Germany in 2012-2013. Results: We aggregated a dataset that includes 1,081 MDR and 250 XDR isolates among which we identified 133 molecular clusters. In 16 clusters, the isolates were from at least two different countries. For example, cluster2 included 56 MDR/XDR isolates from Moldova, Georgia, and Germany. By comparing the WGS data from Germany and the public dataset, we found that 11 clusters contained at least one isolate from Germany and at least one isolate from another country. We could, therefore, connect TB cases despite missing epidemiological information. Conclusion: We demonstrated the added value of using WGS raw data from public repositories to contribute to TB surveillance. By comparing the German and the public dataset, we identified potential international transmission events. Thus, using this approach might support the interpretation of national surveillance results in an international context.