Inferring clonal heterogeneity in cancer using SNP arrays and whole genome sequencing

2019 ◽  
Vol 35 (17) ◽  
pp. 2924-2931
Author(s):  
Mark R Zucker ◽  
Lynne V Abruzzo ◽  
Carmen D Herling ◽  
Lynn L Barron ◽  
Michael J Keating ◽  
...  

Abstract Motivation Clonal heterogeneity is common in many types of cancer, including chronic lymphocytic leukemia (CLL). Previous research suggests that the presence of multiple distinct cancer clones is associated with clinical outcome. Detection of clonal heterogeneity from high throughput data, such as sequencing or single nucleotide polymorphism (SNP) array data, is important for gaining a better understanding of cancer and may improve prediction of clinical outcome or response to treatment. Here, we present a new method, CloneSeeker, for inferring clonal heterogeneity from sequencing data, SNP array data, or both. Results We generated simulated SNP array and sequencing data and applied CloneSeeker along with two other methods. We demonstrate that CloneSeeker is more accurate than existing algorithms at determining the number of clones, distribution of cancer cells among clones, and mutation and/or copy numbers belonging to each clone. Next, we applied CloneSeeker to SNP array data from samples of 258 previously untreated CLL patients to gain a better understanding of the characteristics of CLL tumors and to elucidate the relationship between clonal heterogeneity and clinical outcome. We found that a significant majority of CLL patients appear to have multiple clones distinguished by copy number alterations alone. We also found that the presence of multiple clones corresponded with significantly worse survival among CLL patients. These findings may prove useful for improving the accuracy of prognosis and design of treatment strategies. Availability and implementation Code available on R-Forge: https://r-forge.r-project.org/projects/CloneSeeker/ Supplementary information Supplementary data are available at Bioinformatics online.

2017 ◽  
Author(s):  
Zilu Zhou ◽  
Weixin Wang ◽  
Li-San Wang ◽  
Nancy Ruonan Zhang

Abstract Motivation Copy number variations (CNVs) are gains and losses of DNA segments and have been associated with disease. Many large-scale genetic association studies are performing CNV analysis using whole exome sequencing (WES) and whole genome sequencing (WGS). In many of these studies, previous SNP-array data are available. An integrated cross-platform analysis is expected to improve resolution and accuracy, yet there is no tool for effectively combining data from sequencing and array platforms. The detection of CNVs using sequencing data alone can also be further improved by the utilization of allele-specific reads. Results We propose a statistical framework, integrated Copy Number Variation detection algorithm (iCNV), which can be applied to multiple study designs: WES only, WGS only, SNP array only, or any combination of SNP and sequencing data. iCNV applies platform-specific normalization, utilizes allele-specific reads from sequencing and integrates matched NGS and SNP-array data by a Hidden Markov Model (HMM). We compare integrated two-platform CNV detection using iCNV to naive intersection or union of platforms and show that iCNV increases sensitivity and robustness. We also assess the accuracy of iCNV on WGS data only, and show that the utilization of allele-specific reads improves CNV detection accuracy compared to existing methods. Availability https://github.com/zhouzilu/ Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Jie Huang ◽  
Stefano Pallotti ◽  
Qianling Zhou ◽  
Marcus Kleber ◽  
Xiaomeng Xin ◽  
...  

Abstract The identification of rare haplotypes may greatly expand our knowledge of the genetic architecture of both complex and monogenic traits. To this end, we developed PERHAPS (Paired-End short Reads-based HAPlotyping from next-generation Sequencing data), a new and simple approach to directly call haplotypes from short-read, paired-end Next Generation Sequencing (NGS) data. To benchmark this method, we considered the APOE classic polymorphism (*1/*2/*3/*4), since it represents one of the best examples of functional polymorphism arising from the haplotype combination of two Single Nucleotide Polymorphisms (SNPs). We leveraged the large Whole Exome Sequencing (WES) and SNP-array data obtained from the multi-ethnic UK BioBank (UKBB, N=48,855). By applying PERHAPS, based on piecing together the paired-end reads according to their FASTQ-labels, we extracted the haplotype data, along with their frequencies and the individual diplotype. Concordance rates between WES directly called diplotypes and the ones generated through statistical pre-phasing and imputation of SNP-array data are extremely high (>99%), whether stratifying the sample by SNP-array genotyping batch or by self-reported ethnic group. Hardy-Weinberg Equilibrium tests and the comparison of obtained haplotype frequencies with the ones available from the 1000 Genomes Project further supported the reliability of PERHAPS. Notably, we were able to determine the existence of the rare APOE*1 haplotype in two unrelated African subjects from UKBB, supporting its presence at appreciable frequency (approximately 0.5%) in the African Yoruba population. Despite acknowledging some technical shortcomings, PERHAPS represents a novel and simple approach that will partly overcome the limitations in direct haplotype calling from short read-based sequencing.
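The idea of calling a haplotype directly from read pairs that span both SNPs can be illustrated with a minimal Python sketch. It is not the PERHAPS implementation: the function name, the input format and the `min_support` threshold are hypothetical, and the allele table uses the standard APOE definition at rs429358 and rs7412, which the abstract itself does not spell out.

```python
from collections import Counter

# Standard APOE haplotype definition keyed by (rs429358 allele, rs7412 allele).
# This table is an assumption taken from the usual APOE nomenclature, not from the abstract.
APOE_HAPLOTYPES = {
    ("C", "T"): "APOE*1",
    ("T", "T"): "APOE*2",
    ("T", "C"): "APOE*3",
    ("C", "C"): "APOE*4",
}


def call_diplotype(fragments, min_support=2):
    """Call a diplotype from read pairs observed at both SNPs.

    fragments: iterable of (allele_at_rs429358, allele_at_rs7412), one tuple per
    read pair; use None where the pair does not cover a site.
    min_support: hypothetical minimum number of read pairs needed to accept a haplotype.
    """
    counts = Counter()
    for first, second in fragments:
        haplotype = APOE_HAPLOTYPES.get((first, second))
        if haplotype is not None:  # skip pairs that miss a site or show other bases
            counts[haplotype] += 1
    supported = [h for h, n in counts.most_common() if n >= min_support]
    if not supported:
        return None  # not enough evidence to call
    top = supported[:2]
    if len(top) == 1:  # a single supported haplotype is read as homozygous
        top = top * 2
    return tuple(sorted(top))


fragments = [("T", "C")] * 5 + [("C", "C")] * 4 + [("C", None)]
print(call_diplotype(fragments))
```

Because each informative read pair carries both alleles on one molecule, no statistical phasing is needed in this toy setting; real data would also require base-quality and mapping filters.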


2020 ◽  
Vol 36 (15) ◽  
pp. 4369-4371
Author(s):  
Andrew Whalen ◽  
Gregor Gorjanc ◽  
John M Hickey

Abstract Summary AlphaFamImpute is an imputation package for calling, phasing and imputing genome-wide genotypes in outbred full-sib families from single nucleotide polymorphism (SNP) array and genotype-by-sequencing (GBS) data. GBS data are increasingly being used to genotype individuals, especially when SNP arrays do not exist for a population of interest. Low-coverage GBS produces data with a large number of missing or incorrect naïve genotype calls, which can be improved by identifying shared haplotype segments between full-sib individuals. Here, we present AlphaFamImpute, an algorithm specifically designed to exploit the genetic structure of full-sib families. It performs imputation using a two-step approach. In the first step, it phases and imputes parental genotypes based on the segregation states of their offspring (i.e. which pair of parental haplotypes the offspring inherited). In the second step, it phases and imputes the offspring genotypes by detecting which haplotype segments the offspring inherited from their parents. With a series of simulations, we find that AlphaFamImpute obtains high-accuracy genotypes, even when the parents are not genotyped and individuals are sequenced at <1x coverage. Availability and implementation AlphaFamImpute is available as a Python package from the AlphaGenes website http://www.AlphaGenes.roslin.ed.ac.uk/AlphaFamImpute. Supplementary information Supplementary data are available at Bioinformatics online.


2022 ◽  
Vol 12 ◽  
Author(s):  
Lawrence E. Bramham ◽  
Tongtong Wang ◽  
Erin E. Higgins ◽  
Isobel A. P. Parkin ◽  
Guy C. Barker ◽  
...  

Turnip mosaic virus (TuMV) induces disease in susceptible hosts, notably impacting cultivation of important crop species of the Brassica genus. Few effective plant viral disease management strategies exist with the majority of current approaches aiming to mitigate the virus indirectly through control of aphid vector species. Multiple sources of genetic resistance to TuMV have been identified previously, although the majority are strain-specific and have not been exploited commercially. Here, two Brassica juncea lines (TWBJ14 and TWBJ20) with resistance against important TuMV isolates (UK 1, vVIR24, CDN 1, and GBR 6) representing the most prevalent pathotypes of TuMV (1, 3, 4, and 4, respectively) and known to overcome other sources of resistance, have been identified and characterized. Genetic inheritance of both resistances was determined to be based on a recessive two-gene model. Using both single nucleotide polymorphism (SNP) array and genotyping by sequencing (GBS) methods, quantitative trait loci (QTL) analyses were performed using first backcross (BC1) genetic mapping populations segregating for TuMV resistance. Pairs of statistically significant TuMV resistance-associated QTLs with additive interactive effects were identified on chromosomes A03 and A06 for both TWBJ14 and TWBJ20 material. Complementation testing between these B. juncea lines indicated that one resistance-linked locus was shared. Following established resistance gene nomenclature for recessive TuMV resistance genes, these new resistance-associated loci have been termed retr04 (chromosome A06, TWBJ14, and TWBJ20), retr05 (A03, TWBJ14), and retr06 (A03, TWBJ20). Genotyping by sequencing data investigated in parallel to robust SNP array data was highly suboptimal, with informative data not established for key BC1 parental samples. This necessitated careful consideration and the development of new methods for processing compromised data. 
Using reductive screening of potential markers according to allelic variation and the recombination observed across BC1 samples genotyped, compromised GBS data was rendered functional with near-equivalent QTL outputs to the SNP array data. The reductive screening strategy employed here offers an alternative to methods relying upon imputation or artificial correction of genotypic data and may prove effective for similar biparental QTL mapping studies.


2019 ◽  
Author(s):  
Barbara Tabak ◽  
Gordon Saksena ◽  
Coyin Oh ◽  
Galen F. Gao ◽  
Barbara Hill Meyers ◽  
...  

Abstract Motivation Somatic copy-number alterations (SCNAs) play an important role in cancer development. Systematic noise in sequencing and array data presents a significant challenge to the inference of SCNAs for cancer genome analyses. As part of The Cancer Genome Atlas (TCGA), the Broad Institute Genome Characterization Center developed the Tangent copy-number inference pipeline to generate copy-number profiles using single-nucleotide polymorphism (SNP) array and whole-exome sequencing (WES) data from over 10,000 pairs of tumors and matched normal samples. Here, we describe the Tangent pipeline, which begins with DNA sequencing data in the form of .bam files or raw SNP array probe-level intensity data, and ends with segmented copy-number calls to facilitate the identification of novel genes potentially targeted by SCNAs. We also describe a modification of Tangent, Pseudo-Tangent, which enables denoising through comparisons between tumor profiles when few normal samples are available. Results Tangent Normalization offers substantial signal-to-noise ratio (SNR) improvements compared to conventional normalization methods in both SNP array and WES analyses. The improvement in SNRs is achieved primarily through noise reduction with minimal effect on signal. Pseudo-Tangent also reduces noise when few normal samples are available. Tangent and Pseudo-Tangent are broadly applicable and enable more accurate inference of SCNAs from DNA sequencing and array data. Availability and Implementation Tangent is available at https://github.com/coyin/tangent and as a Docker image (https://hub.docker.com/r/coyin/tangent). Tangent is also the normalization method for the Copy Number pipeline in Genome Analysis Toolkit 4 (GATK4).


2018 ◽  
Vol 35 (13) ◽  
pp. 2300-2302 ◽  
Author(s):  
Yasminka A Jakubek ◽  
F Anthony San Lucas ◽  
Paul Scheet

Abstract Motivation Genetic analysis of cancer regularly includes two or more samples from the same patient. Somatic copy number alterations leading to allelic imbalance (AI) play a critical role in cancer initiation and progression. Directional analysis and visualization of the alleles in imbalance in multi-sample settings allow for inference of recurrent mutations, providing insights into mutation rates, clonality and the genomic architecture and etiology of cancer. Results The REpeat Chromosomal changes Uncovered by Reflection (RECUR) is an R application for the comparative analysis of AI profiles derived from SNP array and next-generation sequencing data. The algorithm accepts genotype calls and ‘B allele’ frequencies (BAFs) from at least two samples derived from the same individual. For a predefined set of genomic regions with AI, RECUR compares BAF values among samples. In the presence of AI, the expected value of a BAF can shift in two possible directions, reflecting an increased or decreased abundance of the maternal haplotype, relative to the paternal. The phenomenon of opposite haplotype shifts, or ‘mirrored subclonal allelic imbalance’, is a form of heterogeneity, and has been linked to clinico-pathological features of cancer. RECUR detects such genomic segments of opposite haplotypes in imbalance and plots BAF values for all samples, using a two-color scheme for intuitive visualization. Availability and implementation RECUR is available as an R application. Source code and documentation are available at scheet.org. Supplementary information Supplementary data are available at Bioinformatics online.
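The notion of opposite haplotype shifts can be made concrete with a small Python sketch. It is an illustration only and does not reproduce RECUR: the function name and the sign-agreement statistic are assumptions, and the inputs are taken to be BAFs at the same heterozygous SNPs within one region of AI, in the same order for both samples.

```python
def mirror_fraction(baf_a, baf_b):
    """Fraction of SNPs whose BAF deviates from 0.5 in opposite directions.

    baf_a, baf_b: BAF values at the same heterozygous SNPs in two samples
    from one individual. Values near 1.0 suggest that opposite haplotypes
    are in excess in the two samples; values near 0.0 suggest the same one.
    """
    opposite = 0
    informative = 0
    for a, b in zip(baf_a, baf_b):
        shift_a, shift_b = a - 0.5, b - 0.5
        if shift_a == 0 or shift_b == 0:
            continue  # no direction to compare at this SNP
        informative += 1
        if shift_a * shift_b < 0:
            opposite += 1
    return opposite / informative if informative else float("nan")


sample_1 = [0.70, 0.30, 0.70, 0.30]
sample_2 = [0.30, 0.70, 0.30, 0.70]
print(mirror_fraction(sample_1, sample_2))
```

A real analysis would work with noisy BAFs and would need a statistical test rather than a raw fraction; the sketch only shows why per-SNP direction is the informative quantity.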


2019 ◽  
Vol 20 (S25) ◽  
Author(s):  
Fei Luo

Abstract Background Copy number alterations (CNAs) are tightly associated with cancers, so detecting them accurately is one of the most important tasks in cancer genomics. A series of CNA detection methods have been proposed, and new ones are still being developed. Due to the complexity of CNAs in cancers, no CNA detection method has been accepted as the gold-standard caller. Several evaluation studies have attempted to characterize the performance of typical CNA detection methods. Limited by the scale of their evaluation data, these comparisons have not reached a consensus, and researchers remain unsure how to choose a suitable CNA caller for their analysis. A more comprehensive evaluation of typical CNA detection methods is therefore needed. Results In this work, we use a large-scale real dataset from the CAGEKID consortium to evaluate a total of 12 typical CNA detection methods. These methods are the most widely used in cancer research and are commonly used as benchmarks for newly proposed CNA detection methods. The dataset comprises SNP array data on 94 samples and whole genome sequencing data on 10 samples. Evaluations cover the current scenarios of CNA detection: detection on SNP array data, on sequencing data with matched tumor and normal samples, and on sequencing data with a single tumor sample. First, three SNP-based methods are ranked. The results of the best SNP-based method are then used as the benchmark to compare six matched-sample methods and three single-tumor-sample methods in terms of preprocessing, recall rate, Jaccard index and segmentation characteristics. Conclusions Our survey reveals the strengths and weaknesses of 12 typical methods. We explain, from a methodological standpoint, why the methods show specific characteristics. Finally, we present guiding principles for choosing a suitable CNA detection method under specific conditions. Some unsolved problems and expectations for upcoming CNA detection methods are also addressed.
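Recall rate and Jaccard index against a benchmark call set can be sketched in Python as follows. The abstract does not say whether the paper computes these at the base-pair or the segment level, so the base-pair definition below is an assumption, as are the function names and the half-open `(start, end)` interval format on a single chromosome.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(intervals):
    """Number of bases covered by the intervals, counting overlaps once."""
    return sum(end - start for start, end in merge(intervals))


def overlap_length(a, b):
    """Number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        low = max(a[i][0], b[j][0])
        high = min(a[i][1], b[j][1])
        if low < high:
            total += high - low
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard_and_recall(calls, benchmark):
    """Base-pair Jaccard index and recall of calls against a benchmark set."""
    shared = overlap_length(calls, benchmark)
    benchmark_bases = total_length(benchmark)
    union = total_length(calls) + benchmark_bases - shared
    jaccard = shared / union if union else 0.0
    recall = shared / benchmark_bases if benchmark_bases else 0.0
    return jaccard, recall


# Hypothetical coordinates, for illustration only.
print(jaccard_and_recall([(100, 200), (300, 400)], [(150, 350)]))
```

Recall rewards covering the benchmark regardless of extra calls, whereas the Jaccard index also penalizes bases called outside the benchmark, which is why the two are usually reported together.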


2021 ◽  
Vol 12 ◽  
Author(s):  
Ting-Hsuan Sun ◽  
Yu-Hsuan Joni Shao ◽  
Chien-Lin Mao ◽  
Miao-Neng Hung ◽  
Yi-Yun Lo ◽  
...  

Background: Single-nucleotide polymorphism (SNP) arrays are an ideal technology for genotyping genetic variants in mass screening. However, using SNP arrays to detect rare variants [with a minor allele frequency (MAF) of <1%] is still a challenge because of noise signals and batch effects. An approach that improves the genotyping quality is needed for clinical applications. Methods: We developed a quality-control procedure for rare variants which integrates different algorithms, filters, and experiments to increase the accuracy of variant calling. Using data from the TWB 2.0 custom Axiom array, we adopted an advanced normalization adjustment to prevent false calls caused by splitting the cluster and a rare het adjustment which decreases false calls in rare variants. The concordance of allelic frequencies from array data was compared to those from sequencing datasets of Taiwanese individuals. Finally, genotyping results were used to detect familial hypercholesterolemia (FH), thrombophilia (TH), and maturity-onset diabetes of the young (MODY) to assess the performance in disease screening. All heterozygous calls were verified by Sanger sequencing or qPCR. The positive predictive value (PPV) of each step was estimated to evaluate the performance of our procedure. Results: We analyzed SNP array data from 43,433 individuals, which interrogated 267,247 rare variants. The advanced normalization and rare het adjustment methods adjusted genotype calling of 168,134 variants (96.49%). We further removed 3916 probesets which were discordant in MAFs between the SNP array and sequencing data. The PPV for detecting pathogenic variants with 0.01%<MAF≤1% exceeded 99.37%. PPVs for those with an MAF of ≤0.01% improved from 95% to 100% for FH, 42.11% to 85.19% for TH, and 18.24% to 72.22% for MODY after adopting our rare variant quality-control procedure and experimental verification. Conclusion: Adopting our quality-control procedure, SNP arrays can adequately detect variants with MAF values ranging 0.01%∼0.1%. For variants with MAF values of ≤0.01%, experimental validation is needed unless sequencing data from a homogeneous population of >10,000 are available. The results demonstrated our procedure could perform correct genotype calling of rare variants. It provides a solution for pathogenic variant detection using SNP arrays. The approach brings tremendous promise for implementing precision medicine in medical practice.
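For reference, the positive predictive value used above is the share of positive array calls that experimental verification confirms. A minimal Python sketch follows; the function name and the counts in the usage line are hypothetical and are not taken from the study.

```python
def positive_predictive_value(confirmed, unconfirmed):
    """PPV = confirmed calls / all positive calls.

    confirmed: positive array calls verified experimentally (true positives).
    unconfirmed: positive array calls that failed verification (false positives).
    """
    total = confirmed + unconfirmed
    if total == 0:
        raise ValueError("PPV is undefined when there are no positive calls")
    return confirmed / total


# Hypothetical counts, for illustration only.
print(f"{positive_predictive_value(9, 1):.2%}")
```

Because the denominator counts only positive calls, PPV for very rare variants is sensitive to even a handful of false calls, which is consistent with the abstract's recommendation of experimental validation at MAF ≤0.01%.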


2019 ◽  
Author(s):  
Matthew A. Myers ◽  
Gryte Satas ◽  
Benjamin J. Raphael

Background: Determining the clonal composition and somatic evolution of a tumor greatly aids in accurate prognosis and effective treatment for cancer. In order to understand how a tumor evolves over time and/or in response to treatment, multiple recent studies have performed longitudinal DNA sequencing of tumor samples from the same patient at several different time points. However, none of the existing algorithms that infer clonal composition and phylogeny using several bulk tumor samples from the same patient integrate the information that these samples were obtained from longitudinal observations. Results: We introduce a model for a longitudinally-observed phylogeny and derive constraints that longitudinal samples impose on the reconstruction of a phylogeny from bulk samples. These constraints form the basis for a new algorithm, Cancer Analysis of Longitudinal Data through Evolutionary Reconstruction (CALDER), which infers phylogenetic trees from longitudinal bulk DNA sequencing data. We show on simulated data that constraints from longitudinal sampling can substantially reduce ambiguity when deriving a phylogeny from multiple bulk tumor samples, each a mixture of tumor clones. On real data, where there is often considerable uncertainty in the clonal composition of a sample, longitudinal constraints yield more parsimonious phylogenies with fewer tumor clones per sample. We demonstrate that CALDER reconstructs more plausible phylogenies than existing methods on two longitudinal DNA sequencing datasets from chronic lymphocytic leukemia patients. These findings show the advantages of directly incorporating temporal information from longitudinal sampling into tumor evolution studies. Availability: CALDER is available at https://github.com/raphael-group.

