Deconvolution and phylogeny inference of structural variations in tumor genomic samples

2018 ◽  
Author(s):  
Jesse Eaton ◽  
Jingyi Wang ◽  
Russell Schwartz

Abstract: Phylogenetic reconstruction of tumor evolution has emerged as a crucial tool for making sense of the complexity of emerging cancer genomic data sets. Despite the growing use of phylogenetics in cancer studies, though, the field has only slowly adapted to many ways that tumor evolution differs from classic species evolution. One crucial question in that regard is how to handle inference of structural variations (SVs), which are a major mechanism of evolution in cancers but have been largely neglected in tumor phylogenetics to date, in part due to the challenges of reliably detecting and typing SVs and interpreting them phylogenetically. We present a novel method for reconstructing evolutionary trajectories of SVs from bulk whole-genome sequence data via joint deconvolution and phylogenetics, to infer clonal subpopulations and reconstruct their ancestry. We establish a novel likelihood model for joint deconvolution and phylogenetic inference on bulk SV data and formulate an associated optimization algorithm. We demonstrate the approach to be efficient and accurate for realistic scenarios of SV mutation on simulated data. Application to breast cancer genomic data from The Cancer Genome Atlas (TCGA) shows it to be practical and effective at reconstructing features of SV-driven evolution in single tumors. All code can be found at https://github.com/jaebird123/tusv

2017 ◽  
Vol 49 (1) ◽  
Author(s):  
Long Chen ◽  
Amanda J. Chamberlain ◽  
Coralie M. Reich ◽  
Hans D. Daetwyler ◽  
Ben J. Hayes

2020 ◽  
Author(s):  
Idowu A. Taiwo ◽  
Nike Adeleye ◽  
Fatimah O. Anwoju ◽  
Adeyemi Adeyinka ◽  
Ijeoma C. Uzoma ◽  
...  

Abstract

Background: Coronaviruses are a group of viruses that belong to the Family Coronaviridae, Genus Betacoronavirus. In December 2019, a new coronavirus disease (COVID-19) characterized by severe respiratory symptoms was discovered. The causative pathogen was a novel coronavirus known as 2019-nCoV and later as SARS-CoV-2. Within two months of its discovery, COVID-19 became a pandemic causing widespread morbidity and mortality.

Methodology: Whole genome sequence data of SARS-CoV-2 isolated from Nigerian COVID-19 cases were downloaded from the GISAID database. A total of 18 sequences that satisfied quality assurance (length ≥ 29700 nts and number of unknown bases denoted as ‘N’ ≤ 5%) were used for the study. Multiple sequence alignment (MSA) was done in MAFFT (Version 7.471), SNP calling was implemented in DnaSP (Version 6.12.03), and the results were visualized in Jalview (Version 2.11.1.0). Phylogenetic analysis was done with MEGA X software.

Results: Nigerian SARS-CoV-2 had 99.9% genomic similarity with four large conserved genomic regions. A total of 66 SNPs were identified, of which 31 were informative. Nucleotide diversity assessment gave Pi = 0.00048 and an average SNP frequency of 2.22 SNPs per 1000 nts. Non-coding genomic regions, particularly the 5’UTR and 3’UTR, had a SNP density of 3.77 and 35.4 respectively. The region with the highest SNP density was ORF10, with a frequency of 8.55 SNPs/1000 nts. The majority (72.2%) of viruses in Nigeria are of L lineage, with a preponderance of the D614G mutation, which accounted for 11 (61.1%) of the 18 viral sequences. Nigerian SARS-CoV-2 sequences revealed 3 major clades, namely Oyo, Ekiti and Osun, on a maximum likelihood phylogenetic tree.

Conclusion and Recommendation: Nigerian SARS-CoV-2 reveals a high mutation rate together with a preponderance of L lineage and D614G mutants. The implication of these mutations for SARS-CoV-2 virulence and the need for more aggressive testing and treatment of COVID-19 in Nigeria are discussed. Additionally, attempts to produce testing kits for COVID-19 in Nigeria should consider the conserved regions identified in this study. Strict adherence to COVID-19 preventive measures is recommended in view of the Nigerian SARS-CoV-2 phylogenetic clustering pattern, which suggests intensive community transmission possibly rooted in the communal culture characteristic of many ethnicities in Nigeria.
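
The two summary statistics reported above can be illustrated with a minimal sketch. It assumes nucleotide diversity is taken as the average number of pairwise differences per site over an alignment, and it uses 29,700 nts as a stand-in genome length (the abstract gives this only as a minimum sequence length). The function names and the toy alignment are hypothetical, and every mismatch is counted, so gaps and ambiguous bases are not treated the way a tool such as DnaSP would treat them.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average number of pairwise differences per site (a simple estimate of Pi)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")
    pairs = list(combinations(aligned, 2))
    # Count mismatching positions for every unordered pair of sequences.
    total_diffs = sum(
        sum(1 for x, y in zip(a, b) if x != y) for a, b in pairs
    )
    return total_diffs / (len(pairs) * length)


def snps_per_kb(n_snps, region_length):
    """SNP density expressed as SNPs per 1000 nucleotides."""
    return 1000 * n_snps / region_length


# Hypothetical toy alignment, not data from the study.
toy_alignment = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC", "ACGAACGTAT"]
pi = nucleotide_diversity(toy_alignment)

# 66 SNPs over an assumed 29,700-nt genome.
density = snps_per_kb(66, 29700)
print(f"Pi = {pi:.4f}, SNP density = {density:.2f} per 1000 nts")
```

Under the assumed genome length, 66 SNPs correspond to roughly 2.22 SNPs per 1000 nts, which agrees with the average frequency quoted in the Results.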


2013 ◽  
Author(s):  
Xavier Didelot ◽  
Jennifer Gardy ◽  
Caroline Colijn

Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered -- how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely-sampled outbreaks from genomic data whilst considering within-host diversity. We infer a time-labelled phylogeny using BEAST, then infer a transmission network via a Monte-Carlo Markov Chain. We find that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak displayed similar uncertainty, although the correct source case and several clusters of epidemiologically linked cases were identified. We conclude that genomics cannot wholly replace traditional epidemiology, but that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.


2021 ◽  
Author(s):  
Tyler Steven Brown ◽  
Aimee R. Taylor ◽  
Olufunmilayo Arogbokun ◽  
Caroline O. Buckee ◽  
Hsiao-Han Chang

Measuring gene flow between malaria parasite populations in different geographic locations can provide strategic information for malaria control interventions. Multiple important questions pertaining to the design of such studies remain unanswered, limiting efforts to operationalize genomic surveillance tools for routine public health use. This report evaluates numerically the ability to distinguish different levels of gene flow between malaria populations, using different amounts of real and simulated data, where data are simulated using parameters that approximate different epidemiological conditions. Specifically, using Plasmodium falciparum whole genome sequence data and sequence data simulated for a metapopulation with different migration rates and effective population sizes, we compare two estimators of gene flow, explore the number of genetic markers and number of individuals required to reliably rank highly connected locations, and describe how these thresholds change given different effective population sizes and migration rates. Our results have implications for the design and implementation of malaria genomic surveillance efforts.


2019 ◽  
Author(s):  
K. N. Dinh ◽  
R. Jaksik ◽  
M. Kimmel ◽  
A. Lambert ◽  
S. Tavaré

Abstract: Recent years have produced a large amount of work on inference about cancer evolution from mutations identified in cancer samples. Much of the modeling work has been based on classical models of population genetics, generalized to accommodate time-varying cell population size. Reverse-time genealogical views of such models, commonly known as coalescents, have been used to infer aspects of the past of growing populations. Another approach is to use branching processes, the simplest scenario being the linear birth-death process (lbdp), a binary fission Markov age-dependent branching process. A genealogical view of such models is also available. The two approaches lead to similar but not identical results. Inference from evolutionary models of DNA often exploits summary statistics of the sequence data, a common one being the so-called Site Frequency Spectrum (SFS). In a sequencing experiment with a known number of sequences, we can estimate, for each site at which a novel somatic mutation has arisen, the number of cells that carry that mutation. The sites are then grouped according to the number of copies of the mutant they carry. The SFS can be computed from the statistics of mutations in a sample of cells in which DNA has been sequenced. In this paper, we examine how the SFS based on birth-death processes differs from that based on the coalescent model. The difference may stem from the different sampling mechanisms in the two approaches. However, we also show mathematically and computationally that despite this, they can be made quantitatively comparable, at least for the range of parameters typical for tumor cell populations. We also present a model of tumor evolution with selective sweeps, based on coalescence, and demonstrate how it may help in understanding the past history of a tumor as well as the influence of data pre-processing. We illustrate the theory with applications to several examples of The Cancer Genome Atlas tumors.
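
The definition of the Site Frequency Spectrum given above can be made concrete with a short sketch. It assumes the data are available as a binary cell-by-site matrix, which is an idealization of single-cell calls and not the paper's own data format. The function name and the toy matrix are hypothetical.

```python
from collections import Counter


def site_frequency_spectrum(genotypes):
    """Return a list whose entry k-1 is the number of sites carried by exactly k cells.

    genotypes[i][j] is 1 if sampled cell i carries the mutation at site j, else 0.
    Sites carried by no sampled cell are not counted.
    """
    n_cells = len(genotypes)
    # Column sums: how many sampled cells carry the mutant at each site.
    carriers_per_site = [sum(column) for column in zip(*genotypes)]
    tally = Counter(carriers_per_site)
    return [tally.get(k, 0) for k in range(1, n_cells + 1)]


# Hypothetical sample: 4 cells (rows) by 5 mutated sites (columns).
toy_genotypes = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
]
sfs = site_frequency_spectrum(toy_genotypes)
print(sfs)
```

In this toy sample three sites are carried by a single cell, one by two cells, and one by all four, so the spectrum is dominated by low-frequency mutations.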


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Sai Chen ◽  
Peter Krusche ◽  
Egor Dolzhenko ◽  
Rachel M. Sherman ◽  
Roman Petrovski ◽  
...  

Abstract: Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.


PLoS Genetics ◽  
2021 ◽  
Vol 17 (12) ◽  
pp. e1009944
Author(s):  
Torsten Pook ◽  
Adnane Nemri ◽  
Eric Gerardo Gonzalez Segovia ◽  
Daniel Valle Torres ◽  
Henner Simianer ◽  
...  

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics, requiring geneticists and breeders to find a balance between data quality and the number of genotyped lines under a variety of different existing genotyping technologies when resources are limited. In this work, we are proposing a new imputation pipeline (“HBimpute”) that can be used to generate high-quality genomic data from low read-depth whole-genome-sequence data. The key idea of the pipeline is the use of haplotype blocks from the software HaploBlocker to identify locally similar lines and subsequently use the reads of all locally similar lines in the variant calling for a specific line. The effectiveness of the pipeline is showcased on a dataset of 321 doubled haploid lines of a European maize landrace, which were sequenced at 0.5X read-depth. The overall imputing error rates are cut in half compared to state-of-the-art software like BEAGLE and STITCH, while the average read-depth is increased to 83X, thus enabling the calling of copy number variation. The usefulness of the obtained imputed data panel is further evaluated by comparing the performance of sequence data in common breeding applications to that of genomic data generated with a genotyping array. For both genome-wide association studies and genomic prediction, results are on par or even slightly better than results obtained with high-density array data (600k). In particular for genomic prediction, we observe slightly higher data quality for the sequence data compared to the 600k array in the form of higher prediction accuracies. This occurred specifically when reducing the data panel to the set of overlapping markers between sequence and array, indicating that sequencing data can benefit from the same marker ascertainment as used in the array process to increase the quality and usability of genomic data.


2017 ◽  
Author(s):  
Phelim Bradley ◽  
Henk C Den Bakker ◽  
Eduardo P. C. Rocha ◽  
Gil McVean ◽  
Zamin Iqbal

Abstract: Genome sequencing of pathogens is now ubiquitous in microbiology, and the sequence archives are effectively no longer searchable for arbitrary sequences. Furthermore, the exponential increase of these archives is likely to be further spurred by automated diagnostics. To unlock their use for scientific research and real-time surveillance we have combined knowledge about bacterial genetic variation with ideas used in web-search, to build a DNA search engine for microbial data that can grow incrementally. We indexed the complete global corpus of bacterial and viral whole genome sequence data (447,833 genomes), using four orders of magnitude less storage than previous methods. The method allows future scaling to millions of genomes. This renders the global archive accessible to sequence search, which we demonstrate with three applications: ultra-fast search for resistance genes MCR1-3, analysis of host-range for 2827 plasmids, and quantification of the rise of antibiotic resistance prevalence in the sequence archives.


2019 ◽  
Author(s):  
Sai Chen ◽  
Peter Krusche ◽  
Egor Dolzhenko ◽  
Rachel M. Sherman ◽  
Roman Petrovski ◽  
...  

Abstract: Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.

