Accurate, scalable cohort variant calls using DeepVariant and GLnexus

Author(s):  
Taedong Yun ◽  
Helen Li ◽  
Pi-Chuan Chang ◽  
Michael F. Lin ◽  
Andrew Carroll ◽  
...  

Abstract Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready variants remains challenging. Here we introduce an open-source cohort variant-calling method using the highly accurate caller DeepVariant and the scalable merging tool GLnexus. We optimized callset quality based on benchmark samples and Mendelian consistency across many sample sizes and sequencing specifications, resulting in substantial quality improvements and cost savings over existing best practices. We further evaluated our pipeline on the 1000 Genomes Project (1KGP) samples, showing superior quality metrics and imputation performance. We publicly release the 1KGP callset to foster development of broad studies of genetic variation.

Author(s):  
Taedong Yun ◽  
Helen Li ◽  
Pi-Chuan Chang ◽  
Michael F Lin ◽  
Andrew Carroll ◽  
...  

Abstract Motivation Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging. Results We introduce an open-source cohort-calling method that uses the highly accurate caller DeepVariant and the scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimized the method across a range of cohort sizes, sequencing methods, and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices, at reduced cost. We further evaluate our pipeline on the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently generated GATK Best Practices pipeline. Availability and Implementation We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open source, and the optimized GLnexus setup discovered in this study is integrated into GLnexus public releases v1.2.2 and later. Supplementary information Supplementary data are available at Bioinformatics online.
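The Mendelian-consistency metric used above to tune the callset can be illustrated with a minimal sketch. This is not the paper's implementation; the genotype encoding and function names are illustrative assumptions.

```python
from itertools import product

def mendelian_consistent(child, father, mother):
    """Check whether a child's diploid genotype can be formed by
    inheriting one allele from each parent (ignoring de novo mutation).
    Genotypes are tuples of allele indices, e.g. (0, 1) for a 0/1 het."""
    possible = {tuple(sorted(p)) for p in product(father, mother)}
    return tuple(sorted(child)) in possible

def trio_consistency_rate(trios):
    """Fraction of trio genotype calls that are Mendelian-consistent."""
    ok = sum(mendelian_consistent(c, f, m) for c, f, m in trios)
    return ok / len(trios)

# (child, father, mother) genotypes at three sites
trios = [((0, 1), (0, 0), (1, 1)),   # consistent: child must be 0/1
         ((1, 1), (0, 0), (0, 1)),   # inconsistent: father can only give 0
         ((0, 0), (0, 1), (0, 1))]   # consistent
print(trio_consistency_rate(trios))  # 2 of 3 sites consistent
```

In practice this check is applied over millions of sites per trio, and the violation rate serves as a proxy for genotyping error when no benchmark truth set exists for a sample.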


PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254363
Author(s):  
Aji John ◽  
Kathleen Muenzen ◽  
Kristiina Ausmees

Advances in whole-genome sequencing have greatly reduced the cost and time of obtaining raw genetic information, but the computational requirements of analysis remain a challenge. Serverless computing has emerged as an alternative to using dedicated compute resources, but its utility has not been widely evaluated for standardized genomic workflows. In this study, we define and execute a best-practice joint variant calling workflow using the SWEEP workflow management system. We present an analysis of performance and scalability, and discuss the utility of the serverless paradigm for executing workflows in the field of genomics research. The GATK best-practice short germline joint variant calling pipeline was implemented as a SWEEP workflow comprising 18 tasks. The workflow was executed on Illumina paired-end read samples from the European and African super populations of the 1000 Genomes project phase III. Cost and runtime increased linearly with increasing sample size, although runtime was driven primarily by a single task for larger problem sizes. Execution took a minimum of around 3 hours for 2 samples, up to nearly 13 hours for 62 samples, with costs ranging from $2 to $70.
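Because cost is reported to scale linearly with sample count, the two endpoints quoted above ($2 for 2 samples, $70 for 62) imply a back-of-envelope per-sample rate. The sketch below is a hypothetical interpolation, not part of the SWEEP system.

```python
def linear_cost(n_samples, n1=2, c1=2.0, n2=62, c2=70.0):
    """Interpolate cost under the linear scaling reported for the
    SWEEP joint-calling workflow; endpoints taken from the abstract."""
    slope = (c2 - c1) / (n2 - n1)   # roughly $1.13 per additional sample
    return c1 + slope * (n_samples - n1)

print(round(linear_cost(30), 2))   # rough estimate for a 30-sample cohort
```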


Author(s):  
Jouni Sirén ◽  
Erik Garrison ◽  
Adam M Novak ◽  
Benedict Paten ◽  
Richard Durbin

Abstract Motivation The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. Results We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. Availability and implementation Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. Supplementary information Supplementary data are available at Bioinformatics online.
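The distinction between true haplotype paths and spurious recombinant paths can be sketched with a toy edge-count index. This is far simpler than the GBWT, which compactly stores complete paths; keeping only edges, as below, is exactly why local information cannot rule out unlikely recombinants.

```python
from collections import defaultdict

def index_haplotypes(haplotypes):
    """Toy stand-in for a haplotype index: count, for each edge
    (pair of adjacent graph node IDs), how many haplotypes use it."""
    support = defaultdict(int)
    for hap in haplotypes:
        for a, b in zip(hap, hap[1:]):
            support[(a, b)] += 1
    return support

def path_is_supported(support, path):
    """A path passes this weak filter if every edge it uses occurs in
    at least one true haplotype."""
    return all(support[(a, b)] > 0 for a, b in zip(path, path[1:]))

# Two haplotypes threaded through a small variant graph (nodes are IDs)
haps = [(1, 2, 4, 5), (1, 3, 4, 6)]
idx = index_haplotypes(haps)
print(path_is_supported(idx, (1, 2, 4, 5)))  # True: a real haplotype
print(path_is_supported(idx, (1, 2, 4, 6)))  # True: recombinant, yet edge-consistent
print(path_is_supported(idx, (1, 2, 3, 4)))  # False: edge (2, 3) never observed
```

The second query shows the limitation the GBWT addresses: an edge-level index accepts recombinants of true haplotypes, whereas a full path index can score how much of a query is carried by actual observed haplotypes.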


2020 ◽  
Vol 36 (16) ◽  
pp. 4449-4457 ◽  
Author(s):  
Florian Privé ◽  
Keurcien Luu ◽  
Michael G B Blum ◽  
John J McGrath ◽  
Bjarni J Vilhjálmsson

Abstract Motivation Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes Project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in the R packages bigsnpr and bigutilsr. Results For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes Project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work will be of interest for anyone using PCA in analyses of genetic data, as well as of other omics data. Availability and implementation R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html.
All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code. Supplementary information Supplementary data are available at Bioinformatics online.
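The outlier-detection and homogeneous-ancestry steps mentioned above amount to flagging samples far from the bulk in PC space. The sketch below is a crude pure-Python analogue, assuming a median-distance cutoff; bigutilsr uses considerably more robust statistics.

```python
def median(xs):
    """Median of a non-empty list of numbers."""
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def pc_outliers(pcs, k=6.0):
    """Flag samples whose Euclidean distance from the coordinate-wise
    median in PC space exceeds k times the median distance.
    `pcs` is a list of per-sample PC coordinate tuples."""
    center = tuple(median([p[d] for p in pcs]) for d in range(len(pcs[0])))
    dists = [sum((x - c) ** 2 for x, c in zip(p, center)) ** 0.5 for p in pcs]
    cutoff = k * median(dists)
    return [d > cutoff for d in dists]

# Four samples in a tight cluster plus one distant-ancestry sample
pcs = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1), (5.0, 5.0)]
print(pc_outliers(pcs))  # only the last sample is flagged
```

Restricting an analysis to the unflagged samples is one simple way to obtain the homogeneous-ancestry subset the abstract describes.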


2019 ◽  
Vol 4 ◽  
pp. 50 ◽  
Author(s):  
Ernesto Lowy-Gallego ◽  
Susan Fairley ◽  
Xiangqun Zheng-Bradley ◽  
Magali Ruffier ◽  
Laura Clarke ◽  
...  

We present a set of biallelic SNVs and INDELs, from 2,548 samples spanning 26 populations from the 1000 Genomes Project, called de novo on GRCh38. We believe this will be a useful reference resource for those using GRCh38. It represents an improvement over the “lift-overs” of the 1000 Genomes Project data that have been available to date by encompassing all of the GRCh38 primary assembly autosomes and pseudo-autosomal regions, including novel, medically relevant loci. Here, we describe how the data set was created and benchmark our call set against that produced by the final phase of the 1000 Genomes Project on GRCh37 and the lift-over of that data to GRCh38.
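Benchmarking one call set against another, as described above, typically reduces to set comparisons over variant keys. A minimal sketch, with a simplified key and no allele normalization (real comparisons must normalize representation first):

```python
def benchmark(calls, truth):
    """Precision, recall, and F1 of a call set against a truth set,
    comparing variants as (chrom, pos, ref, alt) keys."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)
    precision = tp / len(calls) if calls else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 300, "G", "A")}
calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 400, "T", "C")}
print(benchmark(calls, truth))  # precision, recall, and F1 are all 2/3 here
```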


2019 ◽  
Vol 4 ◽  
pp. 50 ◽  
Author(s):  
Ernesto Lowy-Gallego ◽  
Susan Fairley ◽  
Xiangqun Zheng-Bradley ◽  
Magali Ruffier ◽  
Laura Clarke ◽  
...  

We present biallelic SNVs called from 2,548 samples across 26 populations from the 1000 Genomes Project, called directly on GRCh38. We believe this will be a useful reference resource for those using GRCh38, representing an improvement over the “lift-overs” of the 1000 Genomes Project data that have been available to date and providing a resource necessary for the full adoption of GRCh38 by the community. Here, we describe how the call set was created and provide benchmarking data describing how our call set compares to that produced by the final phase of the 1000 Genomes Project on GRCh37.


2019 ◽  
Author(s):  
Jouni Sirén ◽  
Erik Garrison ◽  
Adam M. Novak ◽  
Benedict Paten ◽  
Richard Durbin

Abstract Motivation The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. Results We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform (GBWT). We demonstrate the scalability of the new implementation by building a whole-genome index of the 5,008 haplotypes of the 1000 Genomes Project, and an index of all 108,070 TOPMed Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. Availability Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt, and https://github.com/jltsiren/gcsa2. Supplementary information Supplementary data are available.


2019 ◽  
Vol 101 ◽  
Author(s):  
Allison J. Cox ◽  
Fillan Grady ◽  
Gabriel Velez ◽  
Vinit B. Mahajan ◽  
Polly J. Ferguson ◽  
...  

Abstract Compound heterozygotes occur when different variants at the same locus on both maternal and paternal chromosomes produce a recessive trait. Here we present the tool VarCount for the quantification of variants at the individual level. We used VarCount to characterize compound heterozygous coding variants in patients with epileptic encephalopathy and in the 1000 Genomes Project participants. The Epi4k data contains variants identified by whole exome sequencing in patients with either Lennox-Gastaut Syndrome (LGS) or infantile spasms (IS), as well as their parents. We queried the Epi4k dataset (264 trios) and the phased 1000 Genomes Project data (2504 participants) for recessive variants. To assess enrichment, transcript counts were compared between the Epi4k and 1000 Genomes Project participants using minor allele frequency (MAF) cutoffs of 0.5 and 1.0%, and including all ancestries or only probands of European ancestry. In the Epi4k participants, we found enrichment for rare, compound heterozygous variants in six genes, including three involved in neuronal growth and development – PRTG (p = 0.00086, 1% MAF, combined ancestries), TNC (p = 0.022, 1% MAF, combined ancestries) and MACF1 (p = 0.0245, 0.5% MAF, EU ancestry). Due to the total number of transcripts considered in these analyses, the enrichment detected was not significant after correction for multiple testing and higher powered or prospective studies are necessary to validate the candidacy of these genes. However, PRTG, TNC and MACF1 are potential novel recessive epilepsy genes and our results highlight that compound heterozygous variants should be considered in sporadic epilepsy.
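With phased data such as the 1000 Genomes calls used above, compound heterozygotes can be detected by requiring a heterozygous alternate allele on each parental haplotype of the same gene. A minimal sketch, assuming a simplified per-variant encoding (VarCount's actual logic and input format are not reproduced here):

```python
from collections import defaultdict

def compound_hets(variants):
    """Return genes carrying a candidate compound heterozygote: at least
    one heterozygous alternate allele on each phased haplotype.
    `variants` is a list of (gene, hap0_allele, hap1_allele) records,
    where 1 marks the alternate allele on that haplotype."""
    hap_hits = defaultdict(lambda: [0, 0])
    for gene, h0, h1 in variants:
        if h0 != h1:                  # heterozygous site
            hap_hits[gene][0] += h0   # alt carried on haplotype 0
            hap_hits[gene][1] += h1   # alt carried on haplotype 1
    return [g for g, (a, b) in hap_hits.items() if a and b]

variants = [("PRTG", 1, 0), ("PRTG", 0, 1),   # hets in trans: compound het
            ("TNC", 1, 0), ("TNC", 1, 0)]     # both alts in cis: not compound
print(compound_hets(variants))  # ['PRTG']
```

The cis/trans distinction in the example is the crux: two rare hets in one gene only knock out both copies when they lie on opposite haplotypes, which is why phasing (or trio data) is required.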


2011 ◽  
Vol 39 (16) ◽  
pp. 7058-7076 ◽  
Author(s):  
Xinmeng Jasmine Mu ◽  
Zhi John Lu ◽  
Yong Kong ◽  
Hugo Y. K. Lam ◽  
Mark B. Gerstein
