Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants

Whole-genome sequencing (WGS) is a fundamental technology for research to advance precision medicine, but the limited availability of portable and user-friendly workflows for WGS analyses poses a major challenge for many research groups and hampers scientific progress. Here we present Sarek, an open-source workflow to detect germline variants and somatic mutations based on sequencing data from WGS, whole-exome sequencing (WES), or gene panels. Sarek features (i) easy installation, (ii) robust portability across different computer environments, (iii) comprehensive documentation, (iv) transparent and easy-to-read code, and (v) extensive quality metrics reporting. Sarek is implemented in the Nextflow workflow language and supports both Docker and Singularity containers as well as Conda environments, making it ideal for easy deployment on any POSIX-compatible computer and in cloud compute environments. Sarek follows the GATK best-practice recommendations for read alignment and pre-processing, and includes a wide range of software for the identification and annotation of germline and somatic single-nucleotide variants, insertion and deletion variants, structural variants, tumour sample purity, and variations in ploidy and copy number. Sarek offers easy, efficient, and reproducible WGS analyses, and can readily be used both as a production workflow at sequencing facilities and as a powerful stand-alone tool for individual research groups. The Sarek source code, documentation and installation instructions are freely available at https://github.com/nf-core/sarek and at https://nf-co.re/sarek/.


Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants

F1000Research ◽

10.12688/f1000research.16665.2 ◽

2020 ◽

Vol 9 ◽

pp. 63 ◽

Cited By ~ 3

Author(s):

Maxime Garcia ◽

Szilveszter Juhos ◽

Malin Larsson ◽

Pall I. Olason ◽

Marcel Martin ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Scientific Progress ◽

Whole Genome ◽

Sequencing Analysis ◽

Research Groups ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Link Type ◽

Wide Range


Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants

10.1101/316976 ◽

2018 ◽

Cited By ~ 4

Author(s):

Maxime Garcia ◽

Szilveszter Juhos ◽

Malin Larsson ◽

Pall I. Olason ◽

Marcel Martin ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Open Source ◽

Genome Sequencing ◽

Development Project ◽

Whole Genome ◽

Sequencing Analysis ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Insertion And Deletion ◽

Sample Heterogeneity

Abstract. Summary: Whole-genome sequencing (WGS) is a cornerstone of precision medicine, but portable and reproducible open-source workflows for WGS analyses of germline and somatic variants are lacking. We present Sarek, a modular, comprehensive, and easy-to-install workflow, combining a range of software for the identification and annotation of single-nucleotide variants (SNVs), insertion and deletion variants (indels), structural variants, tumor sample heterogeneity, and karyotyping from germline or paired tumor/normal samples. Sarek is implemented in a bioinformatics workflow language (Nextflow) with Docker and Singularity compatible containers, ensuring easy deployment and full reproducibility at any Linux based compute cluster or cloud computing environment. Sarek supports the human reference genomes GRCh37 and GRCh38, and can readily be used both as a core production workflow at sequencing facilities and as a powerful stand-alone tool for individual research groups. Availability: Source code and instructions for local installation are available at GitHub (https://github.com/SciLifeLab/Sarek) under the MIT open-source license, and we invite the research community to contribute additional functionality as a collaborative open-source development project.


Identification of Pathogenic Structural Variants in Rare Disease Patients through Genome Sequencing

10.1101/627661 ◽

2019 ◽

Cited By ~ 2

Author(s):

James M. Holt ◽

Camille L. Birch ◽

Donna M. Brown ◽

Manavalan Gajapathy ◽

Nadiya Sosonkina ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Rare Disease ◽

Genome Sequencing ◽

Genomic Analysis ◽

Whole Genome ◽

Structural Variants ◽

Single Nucleotide Variants ◽

Standard Clinical Practice ◽

Wide Range ◽

Genetic Features

Abstract. Purpose: Clinical whole genome sequencing is becoming more common for determining the molecular diagnosis of rare disease. However, standard clinical practice often focuses on small variants such as single nucleotide variants and small insertions/deletions. This leaves a wide range of larger “structural variants” that are not commonly analyzed in patients. Methods: We developed a pipeline for processing structural variants for patients who received whole genome sequencing through the Undiagnosed Diseases Network (UDN). This pipeline called structural variants, stored them in an internal database, and filtered the variants based on internal frequencies and external annotations. The remaining variants were manually inspected and then interesting findings were reported as research variants to clinical sites in the UDN. Results: Of 477 analyzed UDN cases, 286 cases (≈ 60%) received at least one structural variant as a research finding. The variants in 16 cases (≈ 4%) are considered “Certain” or “Highly likely” molecularly diagnosed and another 4 cases are currently in review. Of those 20 cases, at least 13 were identified originally through our pipeline with one finding leading to identification of a new disease. As part of this paper, we have also released the collection of variant calls identified in our cohort along with heterozygous and homozygous call counts. This data is available at https://github.com/HudsonAlpha/UDN_SV_export. Conclusion: Structural variants are key genetic features that should be analyzed during routine clinical genomic analysis. For our UDN patients, structural variants helped solve ≈ 4% of the total number of cases (≈ 13% of all genome sequencing solves), a success rate we expect to improve with better tools and greater understanding of the human genome.


Harmonization of whole-genome sequencing for outbreak surveillance of Enterobacteriaceae and Enterococci

Microbial Genomics ◽

10.1099/mgen.0.000567 ◽

2021 ◽

Vol 7 (7) ◽

Author(s):

Casper Jamin ◽

Sien De Koster ◽

Stefanie van Koeveringe ◽

Dieter De Coninck ◽

Klaas Mensaert ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Type Species ◽

De Novo ◽

Whole Genome ◽

Data Generation ◽

Sequencing Data ◽

Content Type ◽

Link Type ◽

Antimicrobial Resistance Genes

Whole-genome sequencing (WGS) is becoming the de facto standard for bacterial typing and outbreak surveillance of resistant bacterial pathogens. However, interoperability for WGS of bacterial outbreaks is poorly understood. We hypothesized that harmonization of WGS for outbreak surveillance is achievable through the use of identical protocols for both data generation and data analysis. A set of 30 bacterial isolates, comprising various species belonging to the family Enterobacteriaceae and the genus Enterococcus, was selected and sequenced using the same protocol on the Illumina MiSeq platform in each individual centre. All generated sequencing data were analysed by one centre using BioNumerics (6.7.3) for (i) genotyping of origins of replication and antimicrobial resistance genes, and (ii) core-genome multi-locus sequence typing (cgMLST) for Escherichia coli and Klebsiella pneumoniae and whole-genome multi-locus sequence typing (wgMLST) for all species. Additionally, a split k-mer analysis was performed to determine the number of SNPs between samples. A precision of 99.0% and an accuracy of 99.2% were achieved for genotyping. Based on cgMLST, a discrepant allele was called in only 2/27 and 3/15 comparisons between two genomes, for E. coli and K. pneumoniae, respectively. Based on wgMLST, the number of discrepant alleles ranged from 0 to 7 (average 1.6). For SNPs, this ranged from 0 to 11 SNPs (average 3.4). Furthermore, we demonstrate that using different de novo assemblers to analyse the same dataset introduces up to 150 SNPs, which surpasses most thresholds for bacterial outbreaks. This shows the importance of harmonizing data processing for surveillance of bacterial outbreaks. In summary, multi-centre WGS for bacterial surveillance is achievable, but only if protocols are harmonized.
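The genotyping result above is reported as precision and accuracy. As a minimal sketch of how these two metrics are conventionally derived from confusion-matrix counts, the Python below uses the standard textbook definitions. The function names and the counts are hypothetical and chosen only for illustration; they are not taken from the study, and the abstract does not state how its calls were tallied.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of positive calls that are correct: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


def accuracy(true_pos: int, true_neg: int, false_pos: int, false_neg: int) -> float:
    """Fraction of all calls that are correct: (TP + TN) / total."""
    total = true_pos + true_neg + false_pos + false_neg
    return (true_pos + true_neg) / total


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not data from the study).
    tp, tn, fp, fn = 90, 880, 10, 20
    print(f"precision = {precision(tp, fp):.3f}")
    print(f"accuracy  = {accuracy(tp, tn, fp, fn):.3f}")
```

Precision ignores true negatives, so the two values can diverge when absent genes (true negatives) greatly outnumber present ones.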


eSCAN: Scan Regulatory Regions for Aggregate Association Testing using Whole Genome Sequencing Data

10.1101/2020.11.30.405266 ◽

2020 ◽

Author(s):

Yingxi Yang ◽

Yuchen Yang ◽

Le Huang ◽

Jai G. Broome ◽

Adolfo Correa ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

New Technologies ◽

Real Data ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Association Testing ◽

Wide Range ◽

Sequencing Studies

Abstract. With advances in whole genome sequencing (WGS) technology, multiple statistical methods for aggregate association testing have been developed. Many common approaches aggregate variants in a given genomic window of a fixed or varying size and do not rely on existing knowledge to define appropriate test units. As a result, most identified regions are not clearly linked to genes, which limits biological understanding. Functional information from new technologies (such as Hi-C and its derivatives), which can help link enhancers to the genes they affect, can be leveraged to predefine variant sets for aggregate testing in WGS. Therefore, in this paper we propose the eSCAN (Scan the Enhancers) method for genome-wide assessment of enhancer regions in sequencing studies, combining the advantages of dynamic window selection in SCANG with the advantages of increased incorporation of genomic annotation. eSCAN searches biologically meaningful windows, increasing power and aiding biological interpretation, as demonstrated by simulation studies under a wide range of scenarios. We also apply eSCAN for association analysis of blood cell traits using TOPMed WGS data from the Women’s Health Initiative (WHI) and the Jackson Heart Study (JHS). Results from this real data example show that eSCAN is able to capture more significant signals, and these signals are of shorter length and drive association of larger regions detected by other methods.


Whole-genome sequencing analysis of clozapine-induced myocarditis

10.1101/2021.07.26.21261157 ◽

2021 ◽

Author(s):

Ankita Narang ◽

Paul Lacaze ◽

Kathlyn Ronaldson ◽

John McNeil ◽

Mahesh Jayaram ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Genome Wide Association Study ◽

Copy Number Variants ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Analysis ◽

Sequencing Data ◽

Dna Variation ◽

Rare Genetic Variants

One of the concerns limiting the use of clozapine in schizophrenia treatment is the risk of rare but potentially fatal myocarditis. Our previous genome-wide association study and human leucocyte antigen analyses identified putative loci associated with clozapine-induced myocarditis. However, the contribution of DNA variation in cytochrome P450 genes, copy number variants and rare deleterious variants has not been investigated. We explored these classes of DNA variation using whole-genome sequencing data from 25 cases with clozapine-induced myocarditis and 25 demographically matched clozapine-tolerant control subjects. Based on rare variant gene-burden analysis, we identified 15 genes (MLLT6, CADPS, TACC2, L3MBTL4, NPY, SLC25A21, PARVB, GPR179, ACAD9, NOL8, C5orf33, FAM127A, AFDN, SLC6A11, PXDN) nominally associated (p<0.05) with clozapine-induced myocarditis. Of these genes, 13 were expressed in human myocardial tissue. Although independent replication of these findings is required, our study provides preliminary insights into the potential role of rare genetic variants in susceptibility to clozapine-induced myocarditis.


From partial to whole genome imputation of SARS-CoV-2 for epidemiological surveillance

10.1101/2021.04.13.439668 ◽

2021 ◽

Author(s):

Francisco M Ortuno ◽

Carlos Loucera ◽

Carlos S Casimiro-Soriguer ◽

Jose A Lepe ◽

Pedro Camacho Martinez ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

High Rate ◽

Epidemiological Surveillance ◽

Primary Data ◽

Whole Genome ◽

Sequencing Data ◽

Wide Range ◽

Commercial Kits ◽

Almost All

The current SARS-CoV-2 pandemic has emphasized the utility of viral whole genome sequencing in the surveillance and control of the pathogen. An unprecedented ongoing global initiative is increasingly producing hundreds of thousands of sequences worldwide. However, the complex circumstances in which viruses are sequenced, along with the demand for urgent results, cause a high rate of incomplete, and therefore useless, sequences. Nevertheless, viral sequences evolve in the context of a complex phylogeny, and therefore different positions along the genome are in linkage disequilibrium. An imputation method would therefore be able to predict missing positions from the available sequencing data. We developed impuSARS, an application that includes Minimac, the most widely used strategy for genomic data imputation, and, taking advantage of the enormous number of SARS-CoV-2 whole genome sequences available, built a reference panel containing 239,301 sequences. The impuSARS application was tested in a wide range of conditions (continuous fragments, amplicons or sparse individual positions missing), showing great fidelity when reconstructing the original sequences. The impuSARS application is also able to impute whole genomes from commercial kits covering less than 20% of the genome, or only from the Spike protein, with a precision of 0.96. It also recovers the lineage with 100% precision for almost all lineages, even in very poorly covered genomes (< 20%). Imputation can improve the pace of SARS-CoV-2 sequencing production by recovering many incomplete or low-quality sequences that would otherwise be discarded. impuSARS can be incorporated in any primary data processing pipeline for SARS-CoV-2 whole genome sequencing.


A simplified workflow for the analysis of whole-genome sequencing data from Pristionchus pacificus mutant lines

10.1101/2020.11.12.379388 ◽

2020 ◽

Author(s):

Christian Rödelsperger

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Model Organism ◽

Model Systems ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Pristionchus Pacificus ◽

Mutant Lines

Abstract. Nematodes are attractive model systems to understand the genetic basis of various biological processes ranging from development to complex behaviors. In particular, mutagenesis experiments combined with whole-genome sequencing have proven to be one of the most effective methods to identify core players of multiple biological pathways. To enable experimentalists to apply such integrative genetic and bioinformatic analysis in the case of the satellite model organism Pristionchus pacificus, I present a simplified workflow for the analysis of whole-genome data from mutant lines and corresponding mapping panels. Individual components are based on well-maintained and widely used software packages and are extended by 50 lines of code for the analysis and visualization of allele frequencies. The effectiveness of this workflow is demonstrated by an application to recently generated data of a P. pacificus mutant line, where it reduced the number of candidate mutations from an initial set of 3,500 single nucleotide variants to ten.
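The abstract mentions a short script for analysing allele frequencies but does not describe it. The Python below is only a generic sketch of the idea, assuming that each variant comes with reference and alternate read counts and that candidates are kept when the alternate allele dominates at sufficient depth. The `Variant` type, the function names, and the `min_depth` and `min_af` thresholds are all hypothetical and are not the author's code or parameters.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical record: position plus reads supporting each allele."""
    chrom: str
    pos: int
    ref_reads: int
    alt_reads: int


def alt_allele_frequency(v: Variant) -> float:
    """Fraction of reads supporting the alternate allele (0.0 if no coverage)."""
    depth = v.ref_reads + v.alt_reads
    return v.alt_reads / depth if depth > 0 else 0.0


def candidate_variants(
    variants: List[Variant], min_depth: int = 10, min_af: float = 0.9
) -> List[Variant]:
    """Keep variants with enough coverage and a high alternate allele frequency.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return [
        v
        for v in variants
        if (v.ref_reads + v.alt_reads) >= min_depth
        and alt_allele_frequency(v) >= min_af
    ]


if __name__ == "__main__":
    # Made-up example variants for illustration only.
    calls = [
        Variant("chrI", 1200, ref_reads=1, alt_reads=29),   # high AF, kept
        Variant("chrI", 5400, ref_reads=15, alt_reads=15),  # AF 0.5, dropped
        Variant("chrX", 880, ref_reads=0, alt_reads=4),     # low depth, dropped
    ]
    for v in candidate_variants(calls):
        print(v.chrom, v.pos, round(alt_allele_frequency(v), 3))
```

A filter of this kind narrows a long variant list to those consistent with a fixed mutation; a real analysis would also use the mapping panel to restrict candidates to a linked genomic interval.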


Whole‐Genome Sequencing Analysis Using Next‐Generation Sequencing Data

Current Protocols Essential Laboratory Techniques ◽

10.1002/cpet.2 ◽

2016 ◽

Vol 12 (1) ◽

Author(s):

Chi Kent Ho ◽

Xiaohui Cui ◽

Sharon Grubner ◽

Christopher A. Larson ◽

Ying Wei ◽

...

Keyword(s):

Next Generation Sequencing ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Next Generation Sequencing Data ◽

Whole Genome ◽

Sequencing Analysis ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing


Whole Genome Sequence Analysis of SARS-CoV-2 Strains Circulating in Malaysia During First Wave and Early Second Wave of Infections.

10.21203/rs.3.rs-81152/v1 ◽

2020 ◽

Author(s):

Zarina Mohd Zawawi ◽

Jeyanthi Suppiah ◽

Jeevanathan Kalyanasundram ◽

Muhammad Afif Azizan ◽

Shuhaila Mat-Sharani ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome Sequence ◽

Health Concern ◽

Whole Genome ◽

Sequencing Analysis ◽

Full Genome ◽

Sequencing Data ◽

Genome Sequences ◽

Synonymous Mutations

Abstract. Background: Since December 2019, the outbreak of COVID-19 has raised great public health concern globally. Here, we report the whole genome sequencing analysis of SARS-CoV-2 strains in Malaysia isolated from six patients diagnosed with COVID-19. Methods: SARS-CoV-2 viral RNA extracted from clinical specimens and isolates was subjected to whole genome sequencing using the NextSeq 500 platform. The sequencing data were assembled into full genome sequences using Megahit, and a phylogenetic tree was constructed using Mega X software. Results: Six full genome sequences of SARS-CoV-2, comprising strains from the 1st wave (25th January 2020) and 2nd wave (27th February 2020) of infection, were obtained. Downstream analysis demonstrated diversity among the Malaysian strains, with several synonymous and non-synonymous mutations in four of the six cases, affecting the genes M, orf1ab, and S of the SARS-CoV-2 virus. The phylogenetic analysis revealed that the viral genome sequences of Malaysian SARS-CoV-2 strains clustered under the ancestral Type B. Conclusion: This study characterized the evolution of the SARS-CoV-2 virus during its circulation in Malaysia. Continuous monitoring and analysis of the whole genome sequences of confirmed cases would be crucial to further understand the genetic evolution of the virus.
