Evaluation of serverless computing for scalable execution of a joint variant calling workflow

Aji John; Kathleen Muenzen; Kristiina Ausmees

doi:10.1371/journal.pone.0254363

Evaluation of serverless computing for scalable execution of a joint variant calling workflow

PLoS ONE ◽

10.1371/journal.pone.0254363 ◽

2021 ◽

Vol 16 (7) ◽

pp. e0254363

Author(s):

Aji John ◽

Kathleen Muenzen ◽

Kristiina Ausmees

Keyword(s):

Genetic Information ◽

Best Practice ◽

Workflow Management ◽

Variant Calling ◽

Phase III ◽

1000 Genomes Project ◽

1000 Genomes ◽

Genomics Research ◽

The Cost ◽

Analysis Of Performance

Advances in whole-genome sequencing have greatly reduced the cost and time of obtaining raw genetic information, but the computational requirements of analysis remain a challenge. Serverless computing has emerged as an alternative to using dedicated compute resources, but its utility has not been widely evaluated for standardized genomic workflows. In this study, we define and execute a best-practice joint variant calling workflow using the SWEEP workflow management system. We present an analysis of performance and scalability, and discuss the utility of the serverless paradigm for executing workflows in the field of genomics research. The GATK best-practice short germline joint variant calling pipeline was implemented as a SWEEP workflow comprising 18 tasks. The workflow was executed on Illumina paired-end read samples from the European and African super populations of the 1000 Genomes project phase III. Cost and runtime increased linearly with increasing sample size, although runtime was driven primarily by a single task for larger problem sizes. Execution took a minimum of around 3 hours for 2 samples, up to nearly 13 hours for 62 samples, with costs ranging from $2 to $70.
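As a rough illustration of the linear scaling reported above, the sketch below draws a straight line through the two endpoints quoted in the abstract (about 3 hours and $2 for 2 samples, about 13 hours and $70 for 62 samples) and reads off an estimate for an intermediate sample count. The endpoints are the abstract's approximate figures, the straight-line fit is an assumption, and the function names are hypothetical; the result is a back-of-the-envelope estimate, not the authors' measured data.

```python
def linear_estimate(n, n_lo, y_lo, n_hi, y_hi):
    """Interpolate y at n along the straight line through (n_lo, y_lo) and (n_hi, y_hi)."""
    slope = (y_hi - y_lo) / (n_hi - n_lo)
    return y_lo + slope * (n - n_lo)


# Approximate endpoints quoted in the abstract (assumed exact for this sketch).
N_LO, N_HI = 2, 62            # number of samples
RUNTIME_H = (3.0, 13.0)       # hours at N_LO and N_HI
COST_USD = (2.0, 70.0)        # dollars at N_LO and N_HI


def estimate(n_samples):
    """Return (runtime in hours, cost in USD) under the straight-line assumption."""
    runtime = linear_estimate(n_samples, N_LO, RUNTIME_H[0], N_HI, RUNTIME_H[1])
    cost = linear_estimate(n_samples, N_LO, COST_USD[0], N_HI, COST_USD[1])
    return runtime, cost


if __name__ == "__main__":
    hours, dollars = estimate(32)
    print(f"32 samples: ~{hours:.1f} h, ~${dollars:.0f}")
```

Under these assumptions each additional sample adds roughly 10 minutes of runtime and a little over $1 in cost. The abstract notes that a single task dominates runtime at larger problem sizes, so extrapolating beyond 62 samples with this line would be speculative.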

Accurate, scalable cohort variant calls using DeepVariant and GLnexus

10.1101/2020.02.10.942086 ◽

2020 ◽

Cited By ~ 4

Author(s):

Taedong Yun ◽

Helen Li ◽

Pi-Chuan Chang ◽

Michael F. Lin ◽

Andrew Carroll ◽

...

Keyword(s):

Genetic Variation ◽

Best Practices ◽

Open Source ◽

Variant Calling ◽

Cost Savings ◽

Quality Improvements ◽

1000 Genomes Project ◽

Genetic Analyses ◽

1000 Genomes ◽

Population Scale

Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready variants remains challenging. Here we introduce an open-source cohort variant-calling method using the highly-accurate caller DeepVariant and scalable merging tool GLnexus. We optimized callset quality based on benchmark samples and Mendelian consistency across many sample sizes and sequencing specifications, resulting in substantial quality improvements and cost savings over existing best practices. We further evaluated our pipeline in the 1000 Genomes Project (1KGP) samples, showing superior quality metrics and imputation performance. We publicly release the 1KGP callset to foster development of broad studies of genetic variation.
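The Mendelian consistency criterion mentioned above can be illustrated with a minimal check on a single trio genotype. This is a generic sketch for autosomal diploid sites, not the evaluation code used by the authors; the function name and the tuple-of-allele-indices representation are assumptions made for illustration.

```python
def mendelian_consistent(child, mother, father):
    """Return True if the child's genotype can be formed by taking one
    allele from the mother and one from the father.

    Each genotype is an unordered pair of allele indices, e.g. (0, 1),
    at an autosomal diploid site (assumption of this sketch).
    """
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


def consistency_rate(trio_genotypes):
    """Fraction of (child, mother, father) genotype triples that are consistent."""
    trio_genotypes = list(trio_genotypes)
    if not trio_genotypes:
        raise ValueError("no genotypes supplied")
    ok = sum(mendelian_consistent(c, m, f) for c, m, f in trio_genotypes)
    return ok / len(trio_genotypes)
```

A callset with fewer Mendelian violations across trios is generally taken as higher quality, although a small number of violations is expected from genuine de novo mutations.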

Design considerations for workflow management systems use in production genomics research and the clinic

10.1101/2021.04.03.437906 ◽

2021 ◽

Author(s):

Azza E Ahmed ◽

Joshua Allen ◽

Tajesvi Bhat ◽

Prakruthi Burra ◽

Christina E Fliege ◽

...

Keyword(s):

Complex Analysis ◽

Workflow Management ◽

Variant Calling ◽

Management Systems ◽

Systematic Evaluation ◽

Workflow Management Systems ◽

Healthcare Settings ◽

Genomics Research ◽

Bioinformatics Application ◽

Big Data Technologies

Background: The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. Results: This work provides an approach and systematic evaluation of key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, where both were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, ease of development, along with adoption and usage in research labs and healthcare settings. This article aims to answer the question: "Which WfMS should be chosen for a given bioinformatics application, regardless of analysis type?" Conclusions: The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations of tools and utilities for other purposes, like big data technologies, interoperability, and provenance.

Design considerations for workflow management systems use in production genomics research and the clinic

Scientific Reports ◽

10.1038/s41598-021-99288-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Azza E. Ahmed ◽

Joshua M. Allen ◽

Tajesvi Bhat ◽

Prakruthi Burra ◽

Christina E. Fliege ◽

...

Keyword(s):

Complex Analysis ◽

Workflow Management ◽

Variant Calling ◽

Management Systems ◽

Systematic Evaluation ◽

Workflow Management Systems ◽

Healthcare Settings ◽

Genomics Research ◽

Bioinformatics Application ◽

Big Data Technologies

The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach and systematic evaluation of key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, where both were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, ease of development, along with adoption and usage in research labs and healthcare settings. This article aims to answer the question: "Which WfMS should be chosen for a given bioinformatics application, regardless of analysis type?" The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations of tools and utilities for other purposes, like big data technologies, interoperability, and provenance.

Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project

Wellcome Open Research ◽

10.12688/wellcomeopenres.15126.2 ◽

2019 ◽

Vol 4 ◽

pp. 50 ◽

Cited By ~ 7

Author(s):

Ernesto Lowy-Gallego ◽

Susan Fairley ◽

Xiangqun Zheng-Bradley ◽

Magali Ruffier ◽

Laura Clarke ◽

...

Keyword(s):

De Novo ◽

Variant Calling ◽

Final Phase ◽

1000 Genomes Project ◽

Data Set ◽

1000 Genomes ◽

Project Data

We present a set of biallelic SNVs and INDELs, from 2,548 samples spanning 26 populations from the 1000 Genomes Project, called de novo on GRCh38. We believe this will be a useful reference resource for those using GRCh38. It represents an improvement over the “lift-overs” of the 1000 Genomes Project data that have been available to date by encompassing all of the GRCh38 primary assembly autosomes and pseudo-autosomal regions, including novel, medically relevant loci. Here, we describe how the data set was created and benchmark our call set against that produced by the final phase of the 1000 Genomes Project on GRCh37 and the lift-over of that data to GRCh38.

Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project

Wellcome Open Research ◽

10.12688/wellcomeopenres.15126.1 ◽

2019 ◽

Vol 4 ◽

pp. 50 ◽

Cited By ~ 2

Author(s):

Ernesto Lowy-Gallego ◽

Susan Fairley ◽

Xiangqun Zheng-Bradley ◽

Magali Ruffier ◽

Laura Clarke ◽

...

Keyword(s):

Variant Calling ◽

Final Phase ◽

1000 Genomes Project ◽

1000 Genomes ◽

Project Data

We present biallelic SNVs called from 2,548 samples across 26 populations from the 1000 Genomes Project, called directly on GRCh38. We believe this will be a useful reference resource for those using GRCh38, representing an improvement over the “lift-overs” of the 1000 Genomes Project data that have been available to date and providing a resource necessary for the full adoption of GRCh38 by the community. Here, we describe how the call set was created and provide benchmarking data describing how our call set compares to that produced by the final phase of the 1000 Genomes Project on GRCh37.

In trans variant calling reveals enrichment for compound heterozygous variants in genes involved in neuronal development and growth.

Genetics Research ◽

10.1017/s0016672319000065 ◽

2019 ◽

Vol 101 ◽

Cited By ~ 1

Author(s):

Allison J. Cox ◽

Fillan Grady ◽

Gabriel Velez ◽

Vinit B. Mahajan ◽

Polly J. Ferguson ◽

...

Keyword(s):

Multiple Testing ◽

Variant Calling ◽

Epileptic Encephalopathy ◽

European Ancestry ◽

Compound Heterozygous ◽

Recessive Trait ◽

1000 Genomes Project ◽

1000 Genomes ◽

Project Participants ◽

Compound Heterozygous Variants

Compound heterozygotes occur when different variants at the same locus on both maternal and paternal chromosomes produce a recessive trait. Here we present the tool VarCount for the quantification of variants at the individual level. We used VarCount to characterize compound heterozygous coding variants in patients with epileptic encephalopathy and in the 1000 Genomes Project participants. The Epi4k data contains variants identified by whole exome sequencing in patients with either Lennox-Gastaut Syndrome (LGS) or infantile spasms (IS), as well as their parents. We queried the Epi4k dataset (264 trios) and the phased 1000 Genomes Project data (2504 participants) for recessive variants. To assess enrichment, transcript counts were compared between the Epi4k and 1000 Genomes Project participants using minor allele frequency (MAF) cutoffs of 0.5 and 1.0%, and including all ancestries or only probands of European ancestry. In the Epi4k participants, we found enrichment for rare, compound heterozygous variants in six genes, including three involved in neuronal growth and development: PRTG (p = 0.00086, 1% MAF, combined ancestries), TNC (p = 0.022, 1% MAF, combined ancestries) and MACF1 (p = 0.0245, 0.5% MAF, EU ancestry). Due to the total number of transcripts considered in these analyses, the enrichment detected was not significant after correction for multiple testing, and higher-powered or prospective studies are necessary to validate the candidacy of these genes. However, PRTG, TNC and MACF1 are potential novel recessive epilepsy genes, and our results highlight that compound heterozygous variants should be considered in sporadic epilepsy.
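The definition of a compound heterozygote given above can be sketched on phased genotypes as follows. This is an illustrative toy, not VarCount's implementation: the function name, the input layout, and the restriction to biallelic sites with genotypes written as "1|0" or "0|1" are all assumptions, and the gene and variant identifiers in the usage example are placeholders.

```python
from collections import defaultdict


def compound_het_genes(variants):
    """Return the sorted list of genes carrying at least one heterozygous
    variant on each of the two haplotypes in a single individual.

    variants: iterable of (gene, variant_id, phased_genotype) tuples, where
    phased_genotype is a string such as "1|0" or "0|1" (biallelic sites only;
    homozygous and unrecognised genotypes are ignored in this sketch).
    """
    # For each gene, track variant IDs seen on the first and second haplotype.
    hits = defaultdict(lambda: (set(), set()))
    for gene, variant_id, genotype in variants:
        if genotype == "1|0":
            hits[gene][0].add(variant_id)
        elif genotype == "0|1":
            hits[gene][1].add(variant_id)
    return sorted(g for g, (first, second) in hits.items() if first and second)


if __name__ == "__main__":
    toy = [
        ("GENE_A", "var1", "1|0"),
        ("GENE_A", "var2", "0|1"),   # opposite haplotype: in trans
        ("GENE_B", "var3", "1|0"),
        ("GENE_B", "var4", "1|0"),   # same haplotype: in cis, not counted
        ("GENE_C", "var5", "1|1"),   # homozygous: ignored here
    ]
    print(compound_het_genes(toy))
```

In practice the variant list would first be filtered by consequence and minor allele frequency (the abstract uses MAF cutoffs of 0.5% and 1.0%) before counting.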

Quality control of large genome datasets using genome fingerprints

10.1101/600254 ◽

2019 ◽

Author(s):

Max Robinson ◽

Gustavo Glusman

Keyword(s):

Best Practice ◽

Comparison Method ◽

Modern Human ◽

Reference Sequence ◽

Genome Comparison ◽

Human Genetic Variation ◽

Current Version ◽

1000 Genomes Project ◽

1000 Genomes ◽

Genotype Concordance

The 1000 Genomes Project is a foundational resource to modern human biomedicine, serving as a standard reference for human genetic variation. Recently, new versions of the 1000 Genomes Project dataset were released, expressed relative to the current version of the human reference sequence (GRCh38) and partially validated by benchmarking against reference truth sets from the Genome In A Bottle Consortium. We used our ultrafast genome comparison method (genome fingerprinting) to evaluate four versions of the 1000 Genomes Project datasets. These comparisons revealed several discrepancies in dataset membership, multiple cryptic relationships, overall changes in biallelic SNV counts, and more significant changes in SNV counts, heterozygosity and genotype concordance affecting a subset of the individuals. Based on these observations, we recommend performing global dataset comparisons, using genome fingerprints and other metrics, to supplement ‘best practice’ benchmarking relative to predefined truth sets.

Improving variant calling using population data and deep learning

10.1101/2021.01.06.425550 ◽

2021 ◽

Author(s):

Nae-Chyun Chen ◽

Alexey Kolesnikov ◽

Sidharth Goel ◽

Taedong Yun ◽

Pi-Chuan Chang ◽

...

Keyword(s):

Deep Learning ◽

Large Scale ◽

Variant Calling ◽

Population Data ◽

Training Data ◽

1000 Genomes Project ◽

1000 Genomes ◽

Population Information ◽

Scale Population ◽

The Impact

Large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering which trades recall for precision. In this study, we modify DeepVariant to add a new channel encoding population allele frequencies from the 1000 Genomes Project. We show that this model reduces variant calling errors, improving both precision and recall. We assess the impact of using population-specific or diverse reference panels. We achieve the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.

de novo variant calling identifies cancer mutation profiles in the 1000 Genomes Project

10.1101/2021.05.27.445979 ◽

2021 ◽

Author(s):

Jeffrey K. Ng ◽

Pankaj Vats ◽

Elyn Fritz-Waters ◽

Evin M. Padhi ◽

Zachary L. Payne ◽

...

Keyword(s):

Cell Line ◽

De Novo ◽

Variant Calling ◽

B Cell Lymphoma ◽

Bimodal Distribution ◽

1000 Genomes Project ◽

1000 Genomes ◽

Age Related ◽

Detailed Assessment ◽

Paternal Parent

Detection of de novo variants (DNVs) is critical for studies of disease-related variation and mutation rates. We developed a GPU-based workflow to call DNVs, using 602 trios from the 1000 Genomes Project as a control. We detected 445,711 DNVs, having a bimodal distribution, with peaks at 200 and 2000 DNVs. The excess DNVs are cell line artifacts that increase with cell passage. A reduction in DNVs at CpG sites, and in the percentage of DNVs with a paternal parent-of-origin, as the number of DNVs increases supports this finding. Detailed assessment of individual NA12878 across multiple genome datasets from 2012 to 2020 reveals an increasing number of DNVs over time. Mutation signature analysis across the set revealed individuals had either 1) age-related, 2) B-cell lymphoma, or 3) no prominent signatures. Our approach provides an important advancement for DNV detection and shows that cell line artifacts present in lymphoblastoid cell lines are not always random.

1000 Genomes Project reveals human variation

Nature ◽

10.1038/news.2010.567 ◽

2010 ◽

Cited By ~ 3

Author(s):

Alla Katsnelson

Keyword(s):

Human Variation ◽

1000 Genomes Project ◽

1000 Genomes