Reconstruction of clone- and haplotype-specific cancer genome karyotypes from bulk tumor samples

2019
Author(s):
Sergey Aganezov
Benjamin J. Raphael

Abstract
Many cancer genomes are extensively rearranged with highly aberrant chromosomal karyotypes. These genome rearrangements, or structural variants, can be detected in tumor DNA sequencing data by abnormal mapping of sequence reads to the reference genome. However, nearly all cancer sequencing to date is of bulk tumor samples, which consist of a heterogeneous mixture of normal cells and subpopulations of cancer cells, or clones, that harbor distinct somatic structural variants. We introduce a novel algorithm, Reconstructing Cancer Karyotypes (RCK), to reconstruct haplotype-specific karyotypes of one or more rearranged cancer genomes, or clones, that best explain the read alignments from a bulk tumor sample. RCK leverages specific evolutionary constraints on the somatic mutation process in cancer to reduce ambiguity in the deconvolution of admixed DNA sequence data into multiple haplotype-specific cancer karyotypes. In particular, RCK relies on generalizations of the infinite sites assumption that a genome rearrangement is highly unlikely to occur at the same nucleotide position more than once during somatic evolution. RCK’s comprehensive model allows us to incorporate information from both short- and long-read sequencing technologies and is applicable to bulk tumor samples containing a mixture of an arbitrary number of derived genomes. We compared RCK to the state-of-the-art method ReMixT on a dataset of 17 primary and metastatic prostate cancer samples. We demonstrate that ReMixT’s limited support for heterogeneity and lack of evolutionary constraints lead to reconstruction of implausible karyotypes. In contrast, RCK infers cancer karyotypes that better explain read alignments from bulk tumor samples and are consistent with a reasonable evolutionary model. RCK’s reconstructions of clone- and haplotype-specific karyotypes will aid further studies of the role of intra-tumor heterogeneity in cancer development and response to treatment.
RCK is available at https://github.com/raphael-group/RCK.
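
To illustrate why deconvolving a bulk sample is ambiguous, here is a minimal forward-model sketch: the copy number observed for one segment haplotype in a bulk sample is taken to be the fraction-weighted average over the normal cells and the clones. The function name, the clone fractions, and the copy numbers are all hypothetical, and this is not RCK's actual model or API.

```python
def mixed_copy_number(fractions, copy_numbers):
    """Fraction-weighted average copy number of one segment haplotype.

    fractions    -- proportion of cells contributed by each genome (must sum to 1)
    copy_numbers -- copy number of the segment haplotype in each genome
    """
    if len(fractions) != len(copy_numbers):
        raise ValueError("one copy number is needed per genome")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(f * c for f, c in zip(fractions, copy_numbers))


# Hypothetical mixture: 40% normal cells, 50% clone 1, 10% clone 2.
fractions = [0.4, 0.5, 0.1]
haplotype_a = [1, 2, 3]  # assumed copy numbers on haplotype A
haplotype_b = [1, 0, 1]  # assumed copy numbers on haplotype B

mix_a = mixed_copy_number(fractions, haplotype_a)
mix_b = mixed_copy_number(fractions, haplotype_b)
```

Many different combinations of fractions and per-clone copy numbers yield the same averages, which is the ambiguity that additional constraints, such as the evolutionary ones described in the abstract, are meant to reduce.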

2017
Author(s):
Kyle S. Smith
Debashis Ghosh
Katherine S. Pollard
Subhajyoti De

Abstract
By accumulating somatic mutations, cancer genomes evolve, diverging away from the genome of the host. It remains unclear to what extent somatic evolutionary divergence is comparable across different regions of the cancer genome versus concentrated in specific genomic elements. We present a novel computational framework, SASE-mapper, to identify genomic regions that show signatures of accelerated somatic evolution (SASE) in a subset of samples in a cohort, marked by accumulation of an excess of somatic mutations compared to that expected based on local, context-aware background mutation rates in the cancer genomes. Analyzing tumor whole-genome sequencing data for 365 samples from 5 cohorts, we detect recurrent SASE at a genome-wide scale. The SASEs were enriched for genomic elements associated with active chromatin, and regulatory regions of several known cancer genes had SASE in multiple cohorts. Regions with SASE carried specific mutagenic signatures and often co-localized within the 3D nuclear space, suggesting a common basis. A subset of SASEs was frequently associated with regulatory changes in key cancer pathways and also with poor clinical outcome. While the SASE-associated mutations were not necessarily recurrent at base-pair resolution, the SASEs recurrently targeted the same functional regions, with similar consequences. It is likely that regulatory redundancy and plasticity promote prevalence of SASE-like patterns in the cancer genomes.
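
The idea of an "excess of somatic mutations compared to that expected" can be sketched as an upper-tail Poisson test. This is only an illustration under the assumption that mutation counts in a region follow a Poisson distribution with a known background rate. It is not SASE-mapper's actual statistic, and the region length, rate, and observed count below are hypothetical.

```python
import math


def poisson_upper_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    # Accumulate P(X <= observed - 1) term by term to avoid large factorials.
    term = math.exp(-expected)
    cdf = term
    for k in range(1, observed):
        term *= expected / k
        cdf += term
    return max(0.0, 1.0 - cdf)


region_length = 2000     # bp, hypothetical region
background_rate = 1e-3   # mutations per bp, hypothetical local background
expected = region_length * background_rate  # expected mutations in the region

# Probability of seeing 8 or more mutations when about 2 are expected.
p_value = poisson_upper_tail(8, expected)
```

A small tail probability marks the region as carrying more mutations than the background predicts. A context-aware background, as the abstract describes, would replace the single constant rate used here.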


2017
Author(s):
Jeremiah Wala
Pratiti Bandopadhayay
Noah Greenwald
Ryan O’Rourke
Ted Sharpe
...

Abstract
Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA’s performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs, and substantially improved detection performance for variants in the 20-300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (< 1,000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types, and found that templated-sequence insertions occur in ~4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized SVs.


2019
Author(s):
David Wyllie
Trien Do
Richard Myers
Vlad Nikolayevskyy
Derrick Crook
...

Abstract
Background: The prevalence, association with disease status, and public health impact of infection with mixtures of M. tuberculosis strains are unclear, in part due to limitations of existing methods for detecting mixed infections.

Methods: We developed an algorithm to identify mixtures of M. tuberculosis strains using next generation sequencing data, assessing performance using simulated sequences. We identified mixed M. tuberculosis strains when there was at least one mixed nucleotide position, and where both the mixture’s components were present in similar isolates from other individuals. We determined risk factors for mixed infection among isolations of M. tuberculosis in England using logistic regression. We used survival analyses to assess the association between mixed infection and putative transmission.

Findings: 6,560 isolations of TB were successfully sequenced in England in 2016-2018. Of 3,691 (56%) specimens for which similar sequences had been isolated from at least two other individuals, 341 (9.2%) were mixed. Infection with lineages other than Lineage 4 was associated with mixed infection. Among the 1,823 individuals with pulmonary infection with Lineage 4 M. tuberculosis, mixed infection was associated with significantly increased risk of subsequent isolation of closely related organisms from a different individual (HR 1.43, 95% CI 1.05, 1.94), indicative of transmission.

Interpretation: Mixtures of transmissible strains occur in at least 5% of tuberculosis infections in England; when present in pulmonary disease, such mixtures are associated with an increased risk of tuberculosis transmission.

Funding: Public Health England; NIHR Health Protection Research Unit Oxford; European Union.

Research in Context

Evidence Before This Study: We searched Pubmed using the search terms ‘tuberculosis’ and ‘mixed’ or ‘mixture’ for English Language articles published up to 1 April 2019. Studies, most performed without the benefit of genomic sequencing, report mixed TB infection from a range of medium and high prevalence areas and show it to be associated with delayed treatment response. Modelling suggests detection and treatment of mixed TB infection is an important goal for TB eradication campaigns. Although routine DNA sequencing of M. tuberculosis isolates is becoming widespread, efficient methods for detecting mixed infection from such data are underdeveloped, and the true prevalence of mixed infection and its association with transmission are unclear.

Added Value of This Study: This study investigated a large series of TB isolations obtained as part of a routine Mycobacterial sequencing program by two reference laboratories, in a low incidence area, England. We developed an efficient generalisable approach to identify transmitted mixed M. tuberculosis infection; our approach is capable of sensitive and specific detection of a single mixed nucleotide position. We identified mixed infection of similar strains (‘microvariation’) in about 9.2% of the M. tuberculosis samples which we were able to assess, and found evidence of increased transmission from individuals with mixed infection.

Implications of All the Available Evidence: TB microvariation is a risk factor for TB transmission, even in the low incidence area studied. Although an efficient and highly specific technique identifying microvariation exists, it relies on comparison with similar sequences isolated from other patients. Sharing of sequence data from the many TB sequencing programs being deployed globally will increase the sensitivity of microvariation detection, and may assist targeted public health interventions.
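
The percentages quoted in the Findings can be rechecked from the stated counts. The short sketch below does only that arithmetic. Reading the "at least 5%" figure as mixed specimens divided by all sequenced isolations is an assumption on our part, not something the abstract states.

```python
sequenced = 6560    # isolations successfully sequenced
assessable = 3691   # specimens with similar sequences from >= 2 other individuals
mixed = 341         # assessable specimens classified as mixed

# Share of sequenced isolations that could be assessed for mixture.
assessable_share = assessable / sequenced

# Mixed fraction among assessable specimens (reported as 9.2%).
mixed_among_assessable = mixed / assessable

# Assumed reading of the lower bound: mixed specimens over all sequenced isolations.
mixed_lower_bound = mixed / sequenced

print(f"{assessable_share:.1%} assessable, "
      f"{mixed_among_assessable:.1%} mixed among assessable, "
      f"{mixed_lower_bound:.1%} of all sequenced")
```

Under that reading, the lower bound treats every specimen that could not be assessed as unmixed, which is why it is smaller than the 9.2% figure.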


2021
Author(s):
Maureen Rebecca Smith
Maria Trofimova
Ariane Weber
Yannick Duport
Denise Kuhnert
...

As of May 2021, over 160 million SARS-CoV-2 infections have been reported worldwide. Yet the true number of infections is unknown and believed to exceed the reported numbers severalfold, depending on national testing policies that can strongly affect the proportion of undetected cases. To overcome this testing bias and better assess SARS-CoV-2 transmission dynamics, we propose a genome-based computational pipeline, GInPipe, to reconstruct the SARS-CoV-2 incidence dynamics through time. After validating GInPipe against in silico generated outbreak data, as well as more complex phylodynamic analyses, we use the pipeline to reconstruct incidence histories in Denmark, Scotland, Switzerland, and Victoria (Australia) solely from viral sequence data. The proposed method robustly reconstructs the different pandemic waves in the investigated countries and regions, does not require phylodynamic reconstruction, and can be directly applied to publicly deposited SARS-CoV-2 sequencing data sets. We observe differences in the relative magnitude of reconstructed versus reported incidences during times with sparse availability of diagnostic tests. Using the reconstructed incidence dynamics, we assess how testing policies may have affected the probability of diagnosing and reporting infected individuals. We find that under-reporting was highest in mid-2020 in all analysed countries, coinciding with liberal testing policies at times of low test capacities. Due to the increased use of real-time sequencing, it is envisaged that GInPipe can complement established surveillance tools to monitor the SARS-CoV-2 pandemic and evaluate testing policies. The method executes within minutes on very large data sets and is freely available as a fully automated pipeline from https://github.com/KleistLab/GInPipe.


2017
Author(s):
Rebecca Elyanow
Hsin-Ta Wu
Benjamin J. Raphael

Abstract
Structural variation, including large deletions, duplications, inversions, translocations, and other rearrangements, is common in human and cancer genomes. A number of methods have been developed to identify structural variants from Illumina short-read sequencing data. However, reliable identification of structural variants remains challenging because many variants have breakpoints in repetitive regions of the genome and thus are difficult to identify with short reads. The recently developed linked-read sequencing technology from 10X Genomics combines a novel barcoding strategy with Illumina sequencing. This technology labels all reads that originate from a small number (~5-10) of DNA molecules ~50 kbp in length with the same molecular barcode. These barcoded reads contain long-range sequence information that is advantageous for identification of structural variants. We present Novel Adjacency Identification with Barcoded Reads (NAIBR), an algorithm to identify structural variants in linked-read sequencing data. NAIBR predicts novel adjacencies in an individual genome resulting from structural variants using a probabilistic model that combines multiple signals in barcoded reads. We show that NAIBR outperforms several existing methods for structural variant identification, including two recent methods that also analyze linked reads, on simulated sequencing data and 10X whole-genome sequencing data from the NA12878 human genome and the HCC1954 breast cancer cell line. Several of the novel somatic structural variants identified in HCC1954 overlap known cancer genes.
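
One way to picture the long-range information carried by barcodes is to count the barcodes shared between two distant genomic windows. The sketch below shows only that single signal on made-up data. It is not NAIBR's probabilistic model, and the function name, read tuples, and barcode labels are hypothetical.

```python
def shared_barcodes(reads, window_a, window_b):
    """Return the set of barcodes observed in both genomic windows.

    reads    -- list of (chromosome, position, barcode) tuples
    window_* -- (chromosome, start, end), treated as a half-open interval
    """
    def barcodes_in(window):
        chrom, start, end = window
        return {bc for c, pos, bc in reads if c == chrom and start <= pos < end}

    return barcodes_in(window_a) & barcodes_in(window_b)


# Hypothetical barcoded reads near two candidate breakpoints.
reads = [
    ("chr1", 1_000, "BX1"),
    ("chr1", 1_500, "BX2"),
    ("chr1", 1_800, "BX3"),
    ("chr5", 90_200, "BX1"),
    ("chr5", 90_700, "BX2"),
    ("chr5", 90_900, "BX4"),
]

support = shared_barcodes(reads, ("chr1", 0, 2_000), ("chr5", 90_000, 91_000))
```

Because each barcode tags only a handful of molecules, two windows that are far apart in the reference but share many barcodes are candidates for a novel adjacency. A real method would also need to account for chance sharing.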


2016
Author(s):
Miika J Ahdesmäki
Brad Chapman
Pablo E Cingolani
Oliver Hofmann
Aleksandr Sidoruk
...

Abstract
Sensitivity of short-read DNA sequencing for gene fusion detection is improving, but is hampered by the significant amount of noise composed of uninteresting or false positive hits in the data. In this paper we describe a tiered prioritisation approach to extract high-impact gene fusion events. Using cell line and patient DNA sequence data, we improve the annotation and interpretation of structural variant calls to best highlight likely cancer-driving fusions. We also considerably improve on the automated visualisation of the high-impact structural variants to highlight the effects of the variants on the resulting transcripts. The resulting framework makes clinically actionable structural variants considerably easier to detect.


2020
Author(s):
Dilan S. R. Patiranage
Elodie Rey
Nazgol Emrani
Gordon Wellman
Karl Schmid
...

Abstract
Quinoa germplasm preserves useful and substantial genetic variation, yet it remains untapped due to a lack of implementation of modern breeding tools. We have integrated field and sequence data to characterize a large diversity panel of quinoa. Whole-genome sequencing of 310 accessions revealed 2.9 million polymorphic high-confidence SNP loci. Highland and Lowland quinoa were clustered into two main groups, with FST divergence of 0.36 and fast LD decay of 6.5 and 49.8 kb, respectively. A genome-wide association study uncovered 600 SNPs stably associated with 17 agronomic traits. Two candidate genes are associated with thousand seed weight, and a resistance gene analog is associated with downy mildew resistance. We also identified pleiotropically acting loci for four agronomic traits that respond strongly to photoperiod and are hence important for adaptation to different environments. This work demonstrates the use of re-sequencing data of a partially domesticated orphan crop to rapidly identify marker-trait associations, and provides the underpinning elements for genomics-enabled quinoa breeding.


PeerJ
2017
Vol 5
pp. e3166
Author(s):
Miika J. Ahdesmäki
Brad A. Chapman
Pablo Cingolani
Oliver Hofmann
Aleksandr Sidoruk
...

Sensitivity of short-read DNA sequencing for gene fusion detection is improving, but is hampered by the significant amount of noise composed of uninteresting or false positive hits in the data. In this paper we describe a tiered prioritisation approach to extract high-impact gene fusion events from existing structural variant calls. Using cell line and patient DNA sequence data, we improve the annotation and interpretation of structural variant calls to best highlight likely cancer-driving fusions. We also considerably improve on the automated visualisation of the high-impact structural variants to highlight the effects of the variants on the resulting transcripts. The resulting framework makes clinically actionable structural variants considerably easier to detect.


2019
Author(s):
Doruk Beyter
Helga Ingimundardottir
Asmundur Oddsson
Hannes P. Eggertsson
Eythor Bjornsson
...

Long-read sequencing (LRS) promises to improve characterization of structural variants (SVs), a major source of genetic diversity. We generated LRS data on 3,622 Icelanders using Oxford Nanopore Technologies, and identified a median of 22,636 SVs per individual (a median of 13,353 insertions and 9,474 deletions), spanning a median of 10 Mb per haploid genome. We discovered a set of 133,886 reliably genotyped SV alleles and imputed them into 166,281 individuals to explore their effects on diseases and other traits. We discovered an association with a rare (AF = 0.037%) deletion of the first exon of PCSK9. Carriers of this deletion have 0.93 mmol/L (1.31 SD) lower LDL cholesterol levels than the population average (p-value = 7.0 × 10^-20). We also discovered an association with a multi-allelic SV inside a large repeat region, contained within single long reads, in an exon of ACAN. Within this repeat region we found 11 alleles that differ in the number of repeats of a 57-bp motif, and observed a linear relationship (0.016 SD per motif inserted, p = 6.2 × 10^-18) between the number of repeats carried and height. These results show that SVs can be accurately characterized at population scale using long-read sequence data in a genome-wide non-targeted approach and demonstrate how SVs impact phenotypes.
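
The reported effect sizes lend themselves to a quick back-of-the-envelope check. The sketch below derives the LDL standard deviation implied by "0.93 mmol/L (1.31 SD)" and applies the stated per-motif slope for height. The variable and function names are hypothetical, and applying the linear slope to any particular number of motifs is an extrapolation for illustration only.

```python
# PCSK9 deletion: the same effect is given in mmol/L and in SD units.
ldl_effect_mmol = 0.93
ldl_effect_sd = 1.31
implied_ldl_sd = ldl_effect_mmol / ldl_effect_sd  # mmol/L per 1 SD

# ACAN repeat: reported linear effect on height per 57-bp motif inserted.
HEIGHT_SD_PER_MOTIF = 0.016
MOTIF_LENGTH_BP = 57


def height_shift_sd(extra_motifs):
    """Predicted height difference in SD units for a given number of extra motifs."""
    return HEIGHT_SD_PER_MOTIF * extra_motifs


def inserted_bp(extra_motifs):
    """Sequence length added by the extra motifs, in base pairs."""
    return MOTIF_LENGTH_BP * extra_motifs


shift_for_five = height_shift_sd(5)
length_for_five = inserted_bp(5)
```

The implied LDL standard deviation comes out at roughly 0.71 mmol/L, and five additional motifs (285 bp) would correspond to about 0.08 SD of height under the linear relationship.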


2018
Author(s):
Li Charlie Xia
Dongmei Ai
Hojoon Lee
Noemi Andor
Chao Li
...

Abstract
Background: Simulating genome sequence data with features can facilitate the development and benchmarking of structural variant analysis programs. However, there are a limited number of data simulators that provide structural variants in silico. Moreover, there is a paucity of programs that generate structural variants with different allelic fractions and haplotypes.

Findings: We developed SVEngine, an open source tool to address this need. SVEngine simulates next generation sequencing data with embedded structural variations. As input, SVEngine takes template haploid sequences (FASTA) and an external variant file, a variant distribution file and/or a clonal phylogeny tree file (NEWICK). Subsequently, it simulates and outputs sequence contigs (FASTAs), sequence reads (FASTQs) and/or post-alignment files (BAMs). All of the files contain the desired variants, along with BED files containing the ground truth. SVEngine’s flexible design process enables one to specify size, position, and allelic fraction for deletion, insertion, duplication, inversion and translocation variants. Finally, SVEngine simulates sequence data that replicates the characteristics of a sequencing library with mixed sizes of DNA insert molecules. SVEngine is highly parallelized to reduce the simulation time.

Conclusions: We demonstrated the versatile features of SVEngine and its improved runtime in comparisons with other available simulators. SVEngine’s features include the simulation of locus-specific variant frequency designed to mimic the phylogeny of cancer clonal evolution. We validated the accuracy of the simulations. Our evaluation included checking various sequencing mapping features such as coverage change, read clipping, insert size shift and neighbouring hanging read pairs for representative variant types. SVEngine is implemented as a standard Python package and is freely available for academic use at: https://bitbucket.org/charade/svengine.
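
The idea of embedding a variant in a template sequence while recording the ground truth can be shown with a toy example. The sketch below applies a single deletion to a short string and emits a BED-style truth line. It does not use SVEngine's interface, and the function name, contig label, and sequence are hypothetical.

```python
def embed_deletion(template, start, end, chrom="chr_toy"):
    """Delete template[start:end] and return (mutated sequence, BED truth line).

    Coordinates are 0-based and half-open, following the BED convention.
    """
    if not 0 <= start < end <= len(template):
        raise ValueError("deletion must lie within the template")
    mutated = template[:start] + template[end:]
    bed_line = f"{chrom}\t{start}\t{end}\tDEL"
    return mutated, bed_line


template = "ACGTACGTACGTACGT"  # hypothetical 16-bp haploid template
mutated, truth = embed_deletion(template, 4, 10)
```

A full simulator would go on to sample reads from the mutated sequence at a chosen allelic fraction. Keeping the truth record alongside the data is what allows variant callers to be benchmarked against it.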

