Accurate, scalable cohort variant calls using DeepVariant and GLnexus

Bioinformatics ◽

10.1093/bioinformatics/btaa1081 ◽

2021 ◽

Author(s):

Taedong Yun ◽

Helen Li ◽

Pi-Chuan Chang ◽

Michael F Lin ◽

Andrew Carroll ◽

...

Keyword(s):

Best Practices ◽

Quality Metrics ◽

Supplementary Information ◽

Public Research ◽

Supplementary Data ◽

Quality Improvements ◽

1000 Genomes Project ◽

Individual Level ◽

1000 Genomes ◽

Population Scale

Abstract Motivation Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging. Results We introduce an open-source cohort-calling method that uses the highly-accurate caller DeepVariant and scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimized the method across a range of cohort sizes, sequencing methods, and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices with reduced cost. We further evaluate our pipeline in the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently-generated GATK Best Practices pipeline. Availability and Implementation We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open-sourced, and the optimized GLnexus setup discovered in this study is also integrated into GLnexus public releases v1.2.2 and later. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Accurate, scalable cohort variant calls using DeepVariant and GLnexus

10.1101/2020.02.10.942086 ◽

2020 ◽

Cited By ~ 4

Author(s):

Taedong Yun ◽

Helen Li ◽

Pi-Chuan Chang ◽

Michael F. Lin ◽

Andrew Carroll ◽

...

Keyword(s):

Genetic Variation ◽

Best Practices ◽

Open Source ◽

Variant Calling ◽

Cost Savings ◽

Quality Improvements ◽

1000 Genomes Project ◽

Genetic Analyses ◽

1000 Genomes ◽

Population Scale

AbstractPopulation-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready variants remains challenging. Here we introduce an open-source cohort variant-calling method using the highly-accurate caller DeepVariant and scalable merging tool GLnexus. We optimized callset quality based on benchmark samples and Mendelian consistency across many sample sizes and sequencing specifications, resulting in substantial quality improvements and cost savings over existing best practices. We further evaluated our pipeline in the 1000 Genomes Project (1KGP) samples, showing superior quality metrics and imputation performance. We publicly release the 1KGP callset to foster development of broad studies of genetic variation.

Download Full-text

Haplotype-aware graph indexes

Bioinformatics ◽

10.1093/bioinformatics/btz575 ◽

2019 ◽

Cited By ~ 6

Author(s):

Jouni Sirén ◽

Erik Garrison ◽

Adam M Novak ◽

Benedict Paten ◽

Richard Durbin

Keyword(s):

Genetic Variation ◽

Precision Medicine ◽

Chromosome 17 ◽

Supplementary Information ◽

Whole Genome ◽

Supplementary Data ◽

1000 Genomes Project ◽

1000 Genomes ◽

Burrows Wheeler Transform ◽

Haplotype Information

Abstract Motivation The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. Results We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. Availability and implementation Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Haplotype-aware graph indexes

10.1101/559583 ◽

2019 ◽

Cited By ~ 7

Author(s):

Jouni Sirén ◽

Erik Garrison ◽

Adam M. Novak ◽

Benedict Paten ◽

Richard Durbin

Keyword(s):

Genetic Variation ◽

Chromosome 17 ◽

Supplementary Information ◽

Whole Genome ◽

Supplementary Data ◽

1000 Genomes Project ◽

1000 Genomes ◽

Link Type ◽

Supplementary Material ◽

Haplotype Information

AbstractMotivationThe variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are nonbiological, unlikely recombinations of true haplotypes.ResultsWe augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheelertransform (GBWT). We demonstrate the scalability of the new implementation by building a whole-genome index of the 5,008 haplotypes of the 1000 Genomes Project, and an index of all 108,070 TOPMed Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.AvailabilityOur software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt, and https://github.com/jltsiren/[email protected] informationSupplementary data are available.

Download Full-text

popSTR2 enables clinical and population-scale genotyping of microsatellites

Bioinformatics ◽

10.1093/bioinformatics/btz913 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2269-2271 ◽

Cited By ~ 4

Author(s):

Snædis Kristmundsdottir ◽

Hannes P Eggertsson ◽

Gudny A Arnadottir ◽

Bjarni V Halldorsson

Keyword(s):

Population Based ◽

Supplementary Information ◽

Supplementary Data ◽

Clinical Sequencing ◽

Manual Inspection ◽

Repeat Expansions ◽

Population Scale

Abstract Summary popSTR2 is an update and augmentation of our previous work ‘popSTR: a population-based microsatellite genotyper’. To make genotyping sensitive to inter-sample differences, we supply a kernel to estimate sample-specific slippage rates. For clinical sequencing purposes, a panel of known pathogenic repeat expansions is provided along with a script that scans and flags for manual inspection markers indicative of a pathogenic expansion. Like its predecessor, popSTR2 allows for joint genotyping of samples at a population scale. We now provide a binning method that makes the microsatellite genotypes more amenable to analysis within standard association pipelines and can increase association power. Availability and implementation https://github.com/DecodeGenetics/popSTR. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Efficient toolkit implementing best practices for principal component analysis of population genetic data

Bioinformatics ◽

10.1093/bioinformatics/btaa520 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4449-4457 ◽

Cited By ~ 4

Author(s):

Florian Privé ◽

Keurcien Luu ◽

Michael G B Blum ◽

John J McGrath ◽

Bjarni J Vilhjálmsson

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Best Practices ◽

Principal Component ◽

Genetic Data ◽

Uk Biobank ◽

1000 Genomes Project ◽

1000 Genomes ◽

R Packages ◽

The Uk

ABSTRACT Motivation Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr. Results For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data. Availability and implementation R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Analysis of genomic variation in non-coding elements using population-scale sequencing data from the 1000 Genomes Project

Nucleic Acids Research ◽

10.1093/nar/gkr342 ◽

2011 ◽

Vol 39 (16) ◽

pp. 7058-7076 ◽

Cited By ~ 49

Author(s):

Xinmeng Jasmine Mu ◽

Zhi John Lu ◽

Yong Kong ◽

Hugo Y. K. Lam ◽

Mark B. Gerstein

Keyword(s):

Genomic Variation ◽

Sequencing Data ◽

1000 Genomes Project ◽

1000 Genomes ◽

Population Scale

Download Full-text

WDSPdb: an updated resource for WD40 proteins

Bioinformatics ◽

10.1093/bioinformatics/btz460 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4824-4826 ◽

Cited By ~ 6

Author(s):

Jing Ma ◽

Ke An ◽

Jing-Bo Zhou ◽

Nuo-Si Wu ◽

Yang Wang ◽

...

Keyword(s):

Web Site ◽

Large Family ◽

Supplementary Information ◽

Supplementary Data ◽

Wd40 Repeat ◽

Tertiary Structures ◽

Cellular Processes ◽

1000 Genomes ◽

Repeat Proteins ◽

The Web

AbstractSummaryThe WD40-repeat proteins are a large family of scaffold molecules that assemble complexes in various cellular processes. Obtaining their structures is the key to understanding their interaction details. We present WDSPdb 2.0, a significantly updated resource providing accurately predicted secondary and tertiary structures and featured sites annotations. Based on an optimized pipeline, WDSPdb 2.0 contains about 600 thousand entries, an increase of 10-fold, and integrates more than 37 000 variants from sources of ClinVar, Cosmic, 1000 Genomes, ExAC, IntOGen, cBioPortal and IntAct. In addition, the web site is largely improved for visualization, exploring and data downloading.Availability and implementationhttp://www.wdspdb.com/wdsp/ or http://wu.scbb.pkusz.edu.cn/wdsp/.Supplementary informationSupplementary data are available at Bioinformatics online.

Download Full-text

Efficient toolkit implementing best practices for principal component analysis of population genetic data

10.1101/841452 ◽

2019 ◽

Cited By ~ 2

Author(s):

Florian Privé ◽

Keurcien Luu ◽

Michael G.B. Blum ◽

John J. McGrath ◽

Bjarni J. Vilhjálmsson

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Best Practices ◽

Principal Component ◽

Genetic Data ◽

Component Analysis ◽

Uk Biobank ◽

1000 Genomes Project ◽

1000 Genomes ◽

The Uk

AbstractPrincipal Component Analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (1) capturing Linkage Disequilibrium (LD) structure instead of population structure, (2) projected PCs that suffer from shrinkage bias, (3) detecting sample outliers, and (4) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr.For example, we find that PC19 to PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16-18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr.Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.

Download Full-text

1000 Genomes Project reveals human variation

Nature ◽

10.1038/news.2010.567 ◽

2010 ◽

Cited By ~ 3

Author(s):

Alla Katsnelson

Keyword(s):

Human Variation ◽

1000 Genomes Project ◽

1000 Genomes

Download Full-text

ccNetViz: a WebGL-based JavaScript library for visualization of large networks

Bioinformatics ◽

10.1093/bioinformatics/btaa559 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4527-4529

Author(s):

Ales Saska ◽

David Tichy ◽

Robert Moore ◽

Achilles Rasquinha ◽

Caner Akdas ◽

...

Keyword(s):

Systems Biology ◽

Complex Networks ◽

Open Source ◽

High Speed ◽

A Priori ◽

Supplementary Information ◽

Network Visualization ◽

Supplementary Data ◽

Web Based ◽

Flow Of Information

Abstract Summary Visualizing a network provides a concise and practical understanding of the information it represents. Open-source web-based libraries help accelerate the creation of biologically based networks and their use. ccNetViz is an open-source, high speed and lightweight JavaScript library for visualization of large and complex networks. It implements customization and analytical features for easy network interpretation. These features include edge and node animations, which illustrate the flow of information through a network as well as node statistics. Properties can be defined a priori or dynamically imported from models and simulations. ccNetViz is thus a network visualization library particularly suited for systems biology. Availability and implementation The ccNetViz library, demos and documentation are freely available at http://helikarlab.github.io/ccNetViz/. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text