Bayesian inference of infectious disease transmission from whole genome sequence data

2013 ◽  
Author(s):  
Xavier Didelot ◽  
Jennifer Gardy ◽  
Caroline Colijn

Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered: how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely-sampled outbreaks from genomic data whilst considering within-host diversity. We infer a time-labelled phylogeny using BEAST, then infer a transmission network via Markov chain Monte Carlo. We find that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak displayed similar uncertainty, although the correct source case and several clusters of epidemiologically linked cases were identified. We conclude that genomics cannot wholly replace traditional epidemiology, but that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.
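To illustrate the kind of two-stage inference this abstract describes, here is a minimal, hypothetical sketch of Metropolis-Hastings sampling over who-infected-whom networks, assuming infection times have already been obtained (e.g., from a BEAST time-labelled phylogeny). The likelihood, names, and data are illustrative, not the authors' actual model.

```python
# Toy MCMC over transmission networks: the chain state maps each case to
# its putative infector, and the (assumed) likelihood scores each edge by
# a generation-time density on the gap between infection times.
import math
import random
from collections import Counter

from scipy.stats import gamma

inf_times = {"A": 0.0, "B": 1.2, "C": 2.9, "D": 3.1}  # toy infection times
gen_time = gamma(a=2.0, scale=1.0)                    # assumed generation-time model

def log_lik(net):
    """Sum of generation-time log-densities over transmission edges."""
    ll = 0.0
    for case, source in net.items():
        if source is None:                            # index case
            continue
        delta = inf_times[case] - inf_times[source]
        if delta <= 0:                                # infector must come first
            return -math.inf
        ll += gen_time.logpdf(delta)
    return ll

def propose(net):
    """Reassign one non-index case to a different random source."""
    new = dict(net)
    case = random.choice([c for c, s in new.items() if s is not None])
    new[case] = random.choice([c for c in new if c != case])
    return new

state = {"A": None, "B": "A", "C": "A", "D": "A"}     # initial network
ll = log_lik(state)
edge_counts = Counter()
for _ in range(20000):
    cand = propose(state)
    cand_ll = log_lik(cand)
    if random.random() < math.exp(min(0.0, cand_ll - ll)):
        state, ll = cand, cand_ll                     # Metropolis acceptance
    edge_counts.update((s, c) for c, s in state.items() if s is not None)

print(edge_counts.most_common(3))  # posterior support for individual edges
```

The counts over sampled edges play the role of the posterior transmission network: uncertainty shows up as support spread across several plausible infectors, which is exactly the behaviour the abstract reports.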

2013 ◽  
Vol 368 (1614) ◽  
pp. 20120202 ◽  
Author(s):  
Nicholas J. Croucher ◽  
Simon R. Harris ◽  
Yonatan H. Grad ◽  
William P. Hanage

Sequence data are well established in the reconstruction of the phylogenetic and demographic scenarios that have given rise to outbreaks of viral pathogens. The application of similar methods to bacteria has been hindered mainly by the lack of high-resolution nucleotide sequence data from quality samples. Developing and already available genomic methods have greatly increased the amount of data that can be used to characterize an isolate and its relationship to others. However, differences in sequencing platforms and data analysis mean that these enhanced data come with a cost in terms of portability: results from one laboratory may not be directly comparable with those from another. Moreover, genomic data for many bacteria bear the mark of a history including extensive recombination, which has the potential to greatly confound phylogenetic and coalescent analyses. Here, we discuss the exacting requirements of genomic epidemiology, and means by which the distorting signal of recombination can be minimized so that growing genomic datasets from bacterial pathogens can be fully leveraged.
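One common heuristic for minimizing the distorting signal of recombination is to mask genome windows with anomalously high SNP density relative to a clonal-frame expectation. The toy sketch below illustrates that idea only; real tools (e.g., Gubbins or ClonalFrameML) use statistically principled versions, and the window size and threshold here are arbitrary.

```python
# Flag genome windows whose SNP count exceeds a clonal expectation;
# dense clusters of SNPs are more likely imported by recombination
# than generated by independent point mutations.
def mask_recombinant_windows(snp_positions, genome_len, window=1000, max_snps=5):
    """Return (start, end) windows with suspiciously many SNPs."""
    masked = []
    for start in range(0, genome_len, window):
        count = sum(start <= p < start + window for p in snp_positions)
        if count > max_snps:
            masked.append((start, start + window))
    return masked

snps = [120, 150, 180, 190, 210, 260, 5000, 9100]  # toy SNP coordinates
print(mask_recombinant_windows(snps, genome_len=10000))
# [(0, 1000)] -- the dense cluster is flagged; isolated SNPs are kept
```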


2018 ◽  
Author(s):  
Jesse Eaton ◽  
Jingyi Wang ◽  
Russell Schwartz

Phylogenetic reconstruction of tumor evolution has emerged as a crucial tool for making sense of the complexity of emerging cancer genomic data sets. Despite the growing use of phylogenetics in cancer studies, though, the field has only slowly adapted to the many ways that tumor evolution differs from classic species evolution. One crucial question in that regard is how to handle inference of structural variations (SVs), which are a major mechanism of evolution in cancers but have been largely neglected in tumor phylogenetics to date, in part due to the challenges of reliably detecting and typing SVs and interpreting them phylogenetically. We present a novel method for reconstructing evolutionary trajectories of SVs from bulk whole-genome sequence data via joint deconvolution and phylogenetics, to infer clonal subpopulations and reconstruct their ancestry. We establish a novel likelihood model for joint deconvolution and phylogenetic inference on bulk SV data and formulate an associated optimization algorithm. We demonstrate the approach to be efficient and accurate for realistic scenarios of SV mutation on simulated data. Application to breast cancer genomic data from The Cancer Genome Atlas (TCGA) shows it to be practical and effective at reconstructing features of SV-driven evolution in single tumors. All code can be found at https://github.com/jaebird123/tusv.
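A minimal sketch of the deconvolution half of such a model: bulk variant frequencies F are modelled as a mixture U of clone genotypes C (F = U C), and clone proportions can be recovered by non-negative least squares. The actual method jointly infers C, U, and a phylogeny; this toy assumes C is known and ignores the tree constraint.

```python
# Recover clone mixture fractions from bulk SV frequencies by NNLS,
# given a known clone-by-variant genotype matrix.
import numpy as np
from scipy.optimize import nnls

# Rows: clones; columns: SVs (1 = clone carries the variant).
C = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 1, 0],
              [1, 0, 0, 1]], dtype=float)

true_U = np.array([0.1, 0.2, 0.4, 0.3])  # hidden clone proportions
F = true_U @ C                           # observed bulk frequencies

U_hat, residual = nnls(C.T, F)           # min ||C^T u - F|| subject to u >= 0
print(np.round(U_hat, 3))                # ~ [0.1, 0.2, 0.4, 0.3]
```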


PLoS Genetics ◽  
2021 ◽  
Vol 17 (12) ◽  
pp. e1009944 ◽
Author(s):  
Torsten Pook ◽  
Adnane Nemri ◽  
Eric Gerardo Gonzalez Segovia ◽  
Daniel Valle Torres ◽  
Henner Simianer ◽  
...  

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics, requiring geneticists and breeders to find a balance between data quality and the number of genotyped lines under a variety of different existing genotyping technologies when resources are limited. In this work, we propose a new imputation pipeline (“HBimpute”) that can be used to generate high-quality genomic data from low read-depth whole-genome-sequence data. The key idea of the pipeline is the use of haplotype blocks from the software HaploBlocker to identify locally similar lines and subsequently use the reads of all locally similar lines in the variant calling for a specific line. The effectiveness of the pipeline is showcased on a dataset of 321 doubled haploid lines of a European maize landrace, which were sequenced at 0.5X read-depth. The overall imputation error rates are cut in half compared to state-of-the-art software like BEAGLE and STITCH, while the average read-depth is increased to 83X, thus enabling the calling of copy number variation. The usefulness of the obtained imputed data panel is further evaluated by comparing the performance of sequence data in common breeding applications to that of genomic data generated with a genotyping array. For both genome-wide association studies and genomic prediction, results are on par with or even slightly better than results obtained with high-density array data (600k). In particular for genomic prediction, we observe slightly higher data quality for the sequence data than for the 600k array, in the form of higher prediction accuracies. This occurred specifically when reducing the data panel to the set of markers overlapping between sequence and array, indicating that sequence data can benefit from the same marker ascertainment as used in the array design to increase the quality and usability of genomic data.
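A schematic of the pipeline's central idea as described in the abstract: lines sharing a haplotype block are treated as locally identical, so their sparse low-depth reads can be pooled before a variant is called. The data structures and calling rule below are illustrative only, not HBimpute's implementation.

```python
# Pool per-line read counts across members of a haplotype block and
# call the majority allele; every member inherits the pooled call.
from collections import Counter

# Per-line reads at one site (0.5X depth: most lines see 0 or 1 read).
reads_at_site = {
    "line_01": Counter({"A": 1}),
    "line_02": Counter(),            # no coverage at this site
    "line_03": Counter({"A": 1}),
    "line_04": Counter({"G": 1}),    # likely sequencing error
}

# Lines assigned to the same local haplotype block (e.g., by HaploBlocker).
block_members = ["line_01", "line_02", "line_03", "line_04"]

def call_pooled_allele(reads, members):
    """Pool reads across block members and call the majority allele."""
    pooled = Counter()
    for line in members:
        pooled += reads[line]
    if not pooled:
        return None                  # still no evidence; leave missing
    return pooled.most_common(1)[0][0]

print(call_pooled_allele(reads_at_site, block_members))  # "A"
```

Pooling is what turns a nominal 0.5X per-line depth into the effective 83X the abstract reports: each call draws on the reads of every locally similar line.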


2014 ◽  
Author(s):  
Caroline Colijn ◽  
Jennifer Gardy

Whole genome sequencing is becoming popular as a tool for understanding outbreaks of communicable diseases, with phylogenetic trees being used to identify individual transmission events or to characterize outbreak-level transmission dynamics. Existing methods to infer transmission dynamics from sequence data rely on well-characterised infectious periods and epidemiological and clinical metadata, which may not always be available, and typically require computationally intensive analysis focussing on the branch lengths of phylogenetic trees. We sought to determine whether the topological structures of phylogenetic trees contain signatures of the overall transmission patterns underlying an outbreak. Here we use simulated outbreaks to train and then test computational classifiers, and we test the method on data from two real-world outbreaks. We find that different transmission patterns result in quantitatively different phylogenetic tree shapes. We describe five topological features that summarize a phylogeny’s structure and find that computational classifiers based on these are capable of predicting an outbreak’s transmission dynamics. The method is robust to variations in the transmission parameters and network types, and recapitulates the known epidemiology of previously characterized real-world outbreaks. We conclude that there are simple structural properties of phylogenetic trees which, when combined, can distinguish communicable disease outbreaks with a super-spreader, homogeneous transmission, and chains of transmission. This is possible using genome data alone and can be done during an outbreak. We discuss the implications for management of outbreaks.
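A hedged sketch of the kind of topological summaries such classifiers use (the paper's exact five features are not reproduced here). Trees are toy nested tuples; the features shown, Colless imbalance and cherry count, are two standard tree-shape statistics that separate ladder-like chains of transmission from homogeneous spread.

```python
# Compute two tree-shape features on toy binary trees encoded as
# nested tuples, with strings as leaves.
def leaf_count(node):
    if isinstance(node, str):
        return 1
    left, right = node
    return leaf_count(left) + leaf_count(right)

def colless(node):
    """Sum over internal nodes of |#leaves(left) - #leaves(right)|."""
    if isinstance(node, str):
        return 0
    left, right = node
    return abs(leaf_count(left) - leaf_count(right)) + colless(left) + colless(right)

def cherries(node):
    """Count internal nodes whose two children are both leaves."""
    if isinstance(node, str):
        return 0
    left, right = node
    here = int(isinstance(left, str) and isinstance(right, str))
    return here + cherries(left) + cherries(right)

ladder = ("a", ("b", ("c", ("d", ("e", "f")))))     # chain-like transmission
balanced = (("a", "b"), (("c", "d"), ("e", "f")))   # homogeneous spread
print(colless(ladder), cherries(ladder))     # 10 1
print(colless(balanced), cherries(balanced)) # 2 3
```

Feature vectors like these, computed from topology alone, are what a classifier can be trained on, with no branch lengths or epidemiological metadata required.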


2017 ◽  
Author(s):  
Phelim Bradley ◽  
Henk C Den Bakker ◽  
Eduardo P. C. Rocha ◽  
Gil McVean ◽  
Zamin Iqbal

Genome sequencing of pathogens is now ubiquitous in microbiology, yet the sequence archives are effectively no longer searchable for arbitrary sequences. Furthermore, the exponential growth of these archives is likely to be further spurred by automated diagnostics. To unlock their use for scientific research and real-time surveillance, we have combined knowledge about bacterial genetic variation with ideas used in web search to build a DNA search engine for microbial data that can grow incrementally. We indexed the complete global corpus of bacterial and viral whole genome sequence data (447,833 genomes), using four orders of magnitude less storage than previous methods. The method allows future scaling to millions of genomes. This renders the global archive accessible to sequence search, which we demonstrate with three applications: ultra-fast search for the resistance genes MCR1-3, analysis of host range for 2827 plasmids, and quantification of the rise of antibiotic resistance prevalence in the sequence archives.
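A toy sketch of one Bloom-filter-based approach to this kind of k-mer indexing: each genome is summarized as a small bit array of its k-mers, and a query sequence is reported for genomes whose filter contains all of its k-mers. The sizes, hashing, and layout below are illustrative, not the published data structure.

```python
# Index genomes as Bloom-filter-like bit arrays over k-mers and search
# for a query sequence across the whole index.
import hashlib

K, BITS, HASHES = 11, 2048, 3

def kmers(seq):
    return {seq[i:i + K] for i in range(len(seq) - K + 1)}

def positions(kmer):
    """HASHES pseudo-random bit positions for one k-mer."""
    for salt in range(HASHES):
        digest = hashlib.sha256(f"{salt}:{kmer}".encode()).digest()
        yield int.from_bytes(digest[:4], "big") % BITS

def build_filter(seq):
    bits = bytearray(BITS)
    for kmer in kmers(seq):
        for pos in positions(kmer):
            bits[pos] = 1
    return bits

def contains(bits, query):
    """True if every query k-mer hits only set bits (may rarely be a false positive)."""
    return all(all(bits[p] for p in positions(k)) for k in kmers(query))

genomes = {"g1": "ACGTACGTGGTACCATGCAAGT" * 3, "g2": "TTTTGGGGCCCCAAAA" * 4}
index = {name: build_filter(seq) for name, seq in genomes.items()}
print([n for n, b in index.items() if contains(b, "ACGTACGTGGTACC")])  # ['g1']
```

The appeal of this design is that filters are tiny and independent, so new genomes can be indexed incrementally without rebuilding the archive, which is the scaling property the abstract emphasizes.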


Author(s):  
Marjolein E.M. Toorians ◽  
Ailene MacPherson ◽  
T. Jonathan Davies

With the worldwide decline of biodiversity coinciding with an increase in disease outbreaks, investigating this link is more important than ever. This review outlines the modelling methods commonly used for pathogen transmission in animal host systems. There are many ways a pathogen can invade and spread through a host population, and the assumptions of the transmission model used to capture disease propagation determine the predicted outbreak potential, summarized by the net reproductive success (R0). This review offers insight into the assumptions and motivation behind common transmission mechanisms and introduces a general framework in which the contact rate, the most important parameter in disease dynamics, determines the mode of transmission. Using the general contact function introduced here within this framework, we provide a guide for disease ecologists on how to pick the contact function that best suits their system. Additionally, this manuscript attempts to bridge the gap between mathematical disease modelling and the controversial and heavily debated disease-diversity relationship by extending the summarized models to multi-host systems and explaining the role of host diversity in disease transmission. By organizing the mechanisms of transmission into a stepwise process, this review will serve as a guide for modelling pathogens in multi-host systems. We further place these models in the broader context of host diversity and its effect on disease outbreaks by introducing a novel method to include host species' evolutionary history in the framework.
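A hedged sketch of the general idea behind such contact functions: writing the transmission term as beta*S*I/N**q lets a single exponent q interpolate between density-dependent (q = 0, contacts scale with population size) and frequency-dependent (q = 1, contacts fixed) transmission, with direct consequences for R0. Parameter values below are arbitrary and the model is a generic SIR, not the review's specific framework.

```python
# SIR dynamics with a generalized transmission term beta*S*I/N**q,
# plus the corresponding basic reproduction number.
def sir_step(S, I, R, beta, gamma, q, dt=0.01):
    """One Euler step of an SIR model with transmission beta*S*I/N**q."""
    N = S + I + R
    new_inf = beta * S * I / N**q   # q=0: density-dependent, q=1: frequency-dependent
    dS = -new_inf
    dI = new_inf - gamma * I
    dR = gamma * I
    return S + dS * dt, I + dI * dt, R + dR * dt

def r0(beta, gamma, q, N):
    """Basic reproduction number implied by this transmission term."""
    return beta * N**(1 - q) / gamma

print(r0(beta=0.3, gamma=0.1, q=1, N=1000))  # 3.0: independent of N
print(r0(beta=0.3, gamma=0.1, q=0, N=1000))  # 3000.0: scales with N
```

The contrast in the two printed values is the crux of the review's argument: whether R0 grows with host density or not depends entirely on the assumed contact function, which is why that choice matters so much in multi-host, diversity-disease settings.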


2002 ◽  
Vol 10 (3) ◽  
pp. 131-143 ◽  
Author(s):  
Marc S. Halfon ◽  
Alan M. Michelson

One of the foremost challenges of 21st century biological research will be to decipher the complex genetic regulatory networks responsible for embryonic development. The recent explosion of whole genome sequence data and of genome-wide transcriptional profiling methods, such as microarrays, coupled with the development of sophisticated computational tools for exploiting and analyzing genomic data, provide a significant starting point for regulatory network analysis. In this article we review some of the main methodological issues surrounding genome annotation, transcriptional profiling, and computational prediction of cis-regulatory elements and discuss how the power of model genetic organisms can be used to experimentally verify and extend the results of genomic research.
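One computational approach this review covers, prediction of cis-regulatory elements, often begins with scanning sequence using a position weight matrix (PWM) for a transcription factor. A minimal illustrative sketch follows; the matrix and threshold are invented for illustration, not taken from the article.

```python
# Scan a DNA sequence with a log-odds PWM and report windows scoring
# above a threshold as candidate binding sites.
PWM = [  # hypothetical 4-bp motif, log-odds per position for A/C/G/T
    {"A": 1.2, "C": -1.0, "G": -1.0, "T": 0.1},
    {"A": -1.0, "C": 1.5, "G": -0.5, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.4, "T": -0.5},
    {"A": 0.2, "C": -1.0, "G": -1.0, "T": 1.1},
]

def scan(seq, pwm, threshold=3.0):
    """Return (position, score) for windows scoring above threshold."""
    hits = []
    for i in range(len(seq) - len(pwm) + 1):
        score = sum(col[base] for col, base in zip(pwm, seq[i:i + len(pwm)]))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits

print(scan("TTACGTGGACGT", PWM))  # [(2, 5.2), (8, 5.2)]
```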


2021 ◽  
Author(s):  
Torsten Pook ◽  
Adnane Nemri ◽  
Eric Gerardo Gonzalez Segovia ◽  
Henner Simianer ◽  
Chris Carolin Schoen

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics, requiring geneticists and breeders to find a balance between data quality and the number of genotyped lines under a variety of different existing technologies when resources are limited. In this work, we propose a new imputation pipeline ("HBimpute") that can be used to generate high-quality genomic data from low read-depth whole-genome-sequence data. The key idea of the pipeline is the use of haplotype blocks from the software HaploBlocker to identify locally similar lines and merge their reads locally. The effectiveness of the pipeline is showcased on a dataset of 321 doubled haploid lines of a European maize landrace, which were sequenced at 0.5X read-depth. Overall imputation error rates are cut in half compared to the state-of-the-art software BEAGLE, while the average read-depth is increased to 83X, thus enabling the calling of structural variation. The usefulness of the obtained imputed data panel is further evaluated by comparing its performance in common breeding applications to that of genomic data from a 600k array. In particular for genome-wide association studies, the sequence data perform slightly better. Furthermore, genomic prediction based on the markers overlapping between the array and the sequence data leads to slightly higher predictive ability for the imputed sequence data, indicating that the data quality obtained from low read-depth sequencing is on par with or even slightly higher than that of high-density array data. When all sequence markers are included, the predictive ability is slightly reduced, indicating overall lower data quality for non-array markers.
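As an illustration of the genomic prediction comparison described here, a schematic ridge-regression predictor (GBLUP-like in spirit) on simulated marker data; all names and numbers are invented and this is not the authors' pipeline.

```python
# Ridge regression on a 0/1 marker matrix; predictive ability is the
# correlation between predicted and observed phenotypes in held-out lines.
import numpy as np

rng = np.random.default_rng(0)
n_lines, n_markers = 200, 500
X = rng.integers(0, 2, size=(n_lines, n_markers)).astype(float)  # DH lines: 0/1
true_effects = rng.normal(0, 0.1, n_markers)
y = X @ true_effects + rng.normal(0, 1.0, n_lines)               # phenotypes

train, test = slice(0, 150), slice(150, None)
lam = 10.0                                                       # shrinkage
XtX = X[train].T @ X[train] + lam * np.eye(n_markers)
beta = np.linalg.solve(XtX, X[train].T @ y[train])               # ridge solution

pred = X[test] @ beta
print(np.corrcoef(pred, y[test])[0, 1])  # predictive ability on held-out lines
```

Running this once with array markers and once with (imputed) sequence markers is the shape of the comparison the abstract reports; differences in predictive ability then reflect differences in underlying data quality.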

