Kevlar: a mapping-free framework for accurate discovery of de novo variants

Abstract Motivation Discovery of genetic variants by whole genome sequencing has proven a powerful approach to study the etiology of complex genetic disorders. Elucidation of all variants is a necessary step in identifying causative variants and disease genes. In particular, there is an increased interest in detection of de novo variation and investigation of its role in various disorders. State-of-the-art methods for variant discovery rely on mapping reads from each individual to a reference genome and predicting variants from differences observed between the mapped reads and the reference genome. This process typically results in millions of variant predictions, most of which are inherited and irrelevant to the phenotype of interest. To distinguish between inherited variation and novel variation resulting from de novo germline mutation, whole-genome sequencing of close relatives (especially parents and siblings) is commonly used. However, standard mapping-based approaches tend to have a high false-discovery rate for de novo variant prediction, which in many cases arises from problems with read mapping. This is a particular challenge in predicting de novo indels and structural variants. Results We have developed a mapping-free method, Kevlar, for de novo variant discovery based on direct comparison of sequence content between related individuals. Kevlar identifies high-abundance k-mers unique to the individual of interest and retrieves the reads containing these k-mers. These reads are easily partitioned into disjoint sets by shared k-mer content for subsequent locus-by-locus processing and variant calling. Kevlar also utilizes a novel probabilistic approach to score and rank the variant predictions to identify the most likely de novo variants. We evaluated Kevlar on simulated and real pedigrees, and demonstrate its ability to detect both de novo SNVs and indels with high sensitivity and specificity. Availability https://github.com/kevlar-dev/kevlar
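The k-mer steps described in this abstract can be illustrated with a toy sketch. This is not Kevlar's implementation: the function names, the default `k`, and the `min_abundance` threshold are hypothetical, k-mers are counted exactly in memory, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of a read."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count k-mer occurrences across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def novel_kmers(proband_reads, parent_read_sets, k=4, min_abundance=2):
    """Return k-mers abundant in the proband and absent from every parent.

    k and min_abundance are illustrative values, not Kevlar defaults.
    """
    proband = count_kmers(proband_reads, k)
    parents = [count_kmers(reads, k) for reads in parent_read_sets]
    return {
        kmer
        for kmer, n in proband.items()
        if n >= min_abundance and all(kmer not in p for p in parents)
    }


def partition_reads(reads, novel, k):
    """Group reads into disjoint sets linked by shared novel k-mers.

    Reads without any novel k-mer are dropped. Uses a small union-find.
    """
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # novel k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        hits = [km for km in kmers(read, k) if km in novel]
        if not hits:
            continue
        parent[idx] = idx
        for km in hits:
            if km in first_seen:
                parent[find(idx)] = find(first_seen[km])
            else:
                first_seen[km] = idx

    groups = {}
    for idx in parent:
        groups.setdefault(find(idx), []).append(reads[idx])
    return list(groups.values())
```

Each returned group corresponds to one candidate locus that could then be processed independently. A real tool would need memory-efficient k-mer counting and canonical k-mers to handle both strands.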

Whole Genome Sequencing Refines Knowledge on the Population Structure of Mycobacterium bovis from a Multi-Host Tuberculosis System

Microorganisms ◽

10.3390/microorganisms9081585 ◽

2021 ◽

Vol 9 (8) ◽

pp. 1585

Author(s):

Ana C. Reis ◽

Liliana C. M. Salvador ◽

Suelee Robbe-Austerman ◽

Rogério Tenreiro ◽

Ana Botelho ◽

...

Keyword(s):

Population Structure ◽

Whole Genome Sequencing ◽

Wild Boar ◽

Genome Sequencing ◽

Mycobacterium Bovis ◽

Red Deer ◽

Variable Number Tandem Repeat ◽

Variant Calling ◽

Whole Genome ◽

Network Analyses

Classical molecular analyses of Mycobacterium bovis based on spoligotyping and Variable Number Tandem Repeat (MIRU-VNTR) brought the first insights into the epidemiology of animal tuberculosis (TB) in Portugal, showing high genotypic diversity of circulating strains that mostly cluster within the European 2 clonal complex. Previous surveillance provided valuable information on the prevalence and spatial occurrence of TB and highlighted prevalent genotypes in areas where livestock and wild ungulates are sympatric. However, links at the wildlife–livestock interfaces were established mainly via classical genotype associations. Here, we apply whole genome sequencing (WGS) to cattle, red deer and wild boar isolates to reconstruct the M. bovis population structure in a multi-host, multi-region disease system and to explore links at a fine genomic scale between M. bovis from wildlife hosts and cattle. Whole genome sequences of 44 representative M. bovis isolates, obtained between 2003 and 2015 from three TB hotspots, were compared through single nucleotide polymorphism (SNP) variant calling analyses. Consistent with previous results combining classical genotyping with Bayesian population admixture modelling, SNP-based phylogenies support the branching of this M. bovis population into five genetic clades, three with apparent geographic specificities, as well as the establishment of an SNP catalogue specific to each clade, which may be explored in the future as phylogenetic markers. The core genome alignment of SNPs was integrated within a spatiotemporal metadata framework to further structure this M. bovis population by host species and TB hotspots, providing a baseline for network analyses in different epidemiological and disease control contexts. WGS of M. bovis isolates from Portugal is reported for the first time in this pilot study, refining the spatiotemporal context of TB at the wildlife–livestock interface and providing further support to the key role of red deer and wild boar on disease maintenance. The SNP diversity observed within this dataset supports the natural circulation of M. bovis for a long time period, as well as multiple introduction events of the pathogen in this Iberian multi-host system.

Clinical-grade whole-genome sequencing and 3′ transcriptome analysis of colorectal cancer patients

Genome Medicine ◽

10.1186/s13073-021-00852-8 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Agata Stodolna ◽

Miao He ◽

Mahesh Vasipalli ◽

Zoya Kingsbury ◽

Jennifer Becq ◽

...

Keyword(s):

Colorectal Cancer ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Transcriptome Analysis ◽

Variant Calling ◽

Standard Of Care ◽

Genomic Variation ◽

Whole Genome ◽

Clinical Grade ◽

Pathway Gene

Abstract Background Clinical-grade whole-genome sequencing (cWGS) has the potential to become the standard of care within the clinic because of its breadth of coverage and lack of bias towards certain regions of the genome. Colorectal cancer presents a difficult treatment paradigm, with over 40% of patients presenting at diagnosis with metastatic disease. We hypothesised that cWGS coupled with 3′ transcriptome analysis would give new insights into colorectal cancer. Methods Patients underwent PCR-free whole-genome sequencing and alignment and variant calling using a standardised pipeline to output SNVs, indels, SVs and CNAs. Additional insights into the mutational signatures and tumour biology were gained by the use of 3′ RNA-seq. Results Fifty-four patients were studied in total. Driver analysis identified the Wnt pathway gene APC as the only consistently mutated driver in colorectal cancer. Alterations in the PI3K/mTOR pathways were seen as previously observed in CRC. Multiple private CNAs, SVs and gene fusions were unique to individual tumours. Approximately 30% of patients had a tumour mutational burden of > 10 mutations/Mb of DNA, suggesting suitability for immunotherapy. Conclusions Clinical whole-genome sequencing offers a potential avenue for the identification of private genomic variation that may confer sensitivity to targeted agents and offer patients new options for targeted therapies.
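The tumour mutational burden figure quoted in this abstract is a simple ratio, sketched below. The function names and the `callable_bases` parameter are hypothetical; only the 10 mutations/Mb cut-off comes from the abstract, and the study's own counting rules are not described here.

```python
def tumour_mutational_burden(n_somatic_mutations, callable_bases):
    """Return somatic mutations per megabase of callable sequence."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return n_somatic_mutations / (callable_bases / 1_000_000)


def is_high_tmb(tmb, threshold=10.0):
    """True when TMB exceeds the threshold (> 10 mutations/Mb in the abstract)."""
    return tmb > threshold
```

For example, 45,000 somatic mutations over 3,000,000,000 callable bases give 15 mutations/Mb, which exceeds the cut-off.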

Effective variant filtering and expected candidate variant yield in studies of rare human disease

npj Genomic Medicine ◽

10.1038/s41525-021-00227-3 ◽

2021 ◽

Vol 6 (1) ◽

Author(s):

Brent S. Pedersen ◽

Joe M. Brown ◽

Harriet Dashnow ◽

Amelia D. Wallace ◽

Matt Velinder ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Rare Disease ◽

Genome Sequencing ◽

Autosomal Dominant ◽

De Novo ◽

Autosomal Dominant Inheritance ◽

Compound Heterozygous ◽

Whole Genome ◽

Dominant Inheritance ◽

Family Based

Abstract In studies of families with rare disease, it is common to screen for de novo mutations, as well as recessive or dominant variants that explain the phenotype. However, the filtering strategies and software used to prioritize high-confidence variants vary from study to study. In an effort to establish recommendations for rare disease research, we explore effective guidelines for variant (SNP and INDEL) filtering and report the expected number of candidates for de novo dominant, recessive, and autosomal dominant modes of inheritance. We derived these guidelines using two large family-based cohorts that underwent whole-genome sequencing, as well as two family cohorts with whole-exome sequencing. The filters are applied to common attributes, including genotype-quality, sequencing depth, allele balance, and population allele frequency. The resulting guidelines yield ~10 candidate SNP and INDEL variants per exome, and 18 per genome for recessive and de novo dominant modes of inheritance, with substantially more candidates for autosomal dominant inheritance. For family-based, whole-genome sequencing studies, this number includes an average of three de novo, ten compound heterozygous, one autosomal recessive, four X-linked variants, and roughly 100 candidate variants following autosomal dominant inheritance. The slivar software we developed to establish and rapidly apply these filters to VCF files is available at https://github.com/brentp/slivar under an MIT license, and includes documentation and recommendations for best practices for rare disease analysis.
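A de novo filter over the four attributes named in this abstract (genotype quality, depth, allele balance, population allele frequency) might look like the sketch below. It does not use slivar or its expression syntax; the `Genotype` class, the function name, and every threshold value are hypothetical placeholders, not the paper's recommended cut-offs.

```python
from dataclasses import dataclass


@dataclass
class Genotype:
    alt_alleles: int  # 0 = hom-ref, 1 = het, 2 = hom-alt
    gq: int           # genotype quality
    depth: int        # total read depth at the site
    alt_depth: int    # reads supporting the alternate allele

    @property
    def allele_balance(self):
        """Fraction of reads supporting the alternate allele."""
        return self.alt_depth / self.depth if self.depth else 0.0


def is_candidate_de_novo(kid, mom, dad, pop_af,
                         min_gq=20, min_depth=10,
                         ab_range=(0.2, 0.8), max_pop_af=0.001):
    """Apply illustrative trio filters; all thresholds are assumptions."""
    # Every family member needs a confident, well-covered genotype call.
    if any(g.gq < min_gq or g.depth < min_depth for g in (kid, mom, dad)):
        return False
    # The child must be heterozygous with a plausible allele balance.
    low, high = ab_range
    if kid.alt_alleles != 1 or not (low <= kid.allele_balance <= high):
        return False
    # Both parents must be hom-ref with no reads supporting the alt allele.
    if any(p.alt_alleles != 0 or p.alt_depth > 0 for p in (mom, dad)):
        return False
    # The variant must be rare in the reference population.
    return pop_af <= max_pop_af
```

Requiring zero alternate reads in both parents is a deliberately strict choice in this sketch; it trades some sensitivity for fewer false positives caused by undercalled parental heterozygotes.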

Estimating sequencing error rates using families

BioData Mining ◽

10.1186/s13040-021-00259-6 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Kelley Paskov ◽

Jae-Yoon Jung ◽

Brianna Chrisman ◽

Nate T. Stockham ◽

Peter Washington ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Exome Sequencing ◽

Genome Sequencing ◽

Variant Calling ◽

Error Rates ◽

Sequencing Error ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Platform ◽

Whole Exome

Abstract Background As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since it allows us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. Results We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. 
Conclusion Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
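The notion of a Mendelian error that this abstract builds on can be sketched for biallelic autosomal sites as follows. The function names and the 0/1/2 genotype coding are assumptions made for illustration. The sketch only counts observable Mendelian errors; it does not reproduce the paper's method for extrapolating them into per-sample precision and recall.

```python
def is_mendelian_error(child, mother, father):
    """Check a trio genotype at a biallelic autosomal site.

    Genotypes are coded as alternate-allele counts: 0, 1 or 2.
    """
    # Number of alt alleles each parent can transmit (0 or 1).
    transmittable = {0: {0}, 1: {0, 1}, 2: {1}}
    possible = {
        m + f
        for m in transmittable[mother]
        for f in transmittable[father]
    }
    return child not in possible


def mendelian_error_rate(trio_genotypes):
    """Fraction of (child, mother, father) sites that violate inheritance."""
    sites = list(trio_genotypes)
    if not sites:
        return 0.0
    errors = sum(is_mendelian_error(*site) for site in sites)
    return errors / len(sites)
```

Many genotyping errors stay consistent with Mendelian inheritance (for example, a heterozygous child miscalled as hom-ref when both parents are heterozygous), which is why the abstract speaks of observing only a fraction of sequencing errors this way.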

Norgal: extraction and de novo assembly of mitochondrial DNA from whole-genome sequencing data

BMC Bioinformatics ◽

10.1186/s12859-017-1927-y ◽

2017 ◽

Vol 18 (1) ◽

Cited By ~ 21

Author(s):

Kosai Al-Nakeeb ◽

Thomas Nordahl Petersen ◽

Thomas Sicheritz-Pontén

Keyword(s):

Mitochondrial DNA ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

De Novo Assembly ◽

De Novo ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

Evaluating coverage bias in next-generation sequencing of Escherichia coli

PLoS ONE ◽

10.1371/journal.pone.0253440 ◽

2021 ◽

Vol 16 (6) ◽

pp. e0253440

Author(s):

Samantha Gunasekera ◽

Sam Abraham ◽

Marc Stegger ◽

Stanley Pang ◽

Penghao Wang ◽

...

Keyword(s):

Escherichia Coli ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

De Novo ◽

Fragment Size ◽

GC Content ◽

Main Concern ◽

Whole Genome ◽

Assembly Quality ◽

Coverage Bias

Whole-genome sequencing is essential to many facets of infectious disease research. However, technical limitations such as bias in coverage and tagmentation, and difficulties characterising genomic regions with extreme GC content have created significant obstacles in its use. Illumina has claimed that the recently released DNA Prep library preparation kit, formerly known as Nextera Flex, overcomes some of these limitations. This study aimed to assess bias in coverage, tagmentation, GC content, average fragment size distribution, and de novo assembly quality using both the Nextera XT and DNA Prep kits from Illumina. When performing whole-genome sequencing on Escherichia coli and where coverage bias is the main concern, the DNA Prep kit may provide higher quality results; though de novo assembly quality, tagmentation bias and GC content related bias are unlikely to improve. Based on these results, laboratories with existing workflows based on Nextera XT would see minor benefits in transitioning to the DNA Prep kit if they were primarily studying organisms with neutral GC content.

Contributions of de novo variants to systemic lupus erythematosus

European Journal of Human Genetics ◽

10.1038/s41431-020-0698-5 ◽

2020 ◽

Vol 29 (1) ◽

pp. 184-193 ◽

Cited By ~ 1

Author(s):

Jonas Carlsson Almlöf ◽

Sara Nystedt ◽

Aikaterini Mechtidou ◽

Dag Leonard ◽

Maija-Leena Eloranta ◽

...

Keyword(s):

Systemic Lupus Erythematosus ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Lupus Erythematosus ◽

De Novo ◽

Whole Genome ◽

Gene Promoters ◽

Single Nucleotide Variants ◽

Systemic Lupus ◽

Promoter Regions

Abstract By performing whole-genome sequencing in a Swedish cohort of 71 parent-offspring trios, in which the child in each family is affected by systemic lupus erythematosus (SLE, OMIM 152700), we investigated the contribution of de novo variants to risk of SLE. We found de novo single nucleotide variants (SNVs) to be significantly enriched in gene promoters in SLE patients compared with healthy controls at a level corresponding to 26 de novo promoter SNVs more in each patient than expected. We identified 12 de novo SNVs in promoter regions of genes that have been previously implicated in SLE, or that have functions that could be of relevance to SLE. Furthermore, we detected three missense de novo SNVs, five de novo insertion-deletions, and three de novo structural variants with potential to affect the expression of genes that are relevant for SLE. Based on enrichment analysis, disease-affecting de novo SNVs are expected to occur in one-third of SLE patients. This study shows that de novo variants in promoters commonly contribute to the genetic risk of SLE. The fact that de novo SNVs in SLE were enriched to promoter regions highlights the importance of using whole-genome sequencing for identification of de novo variants.
