Inferring clonal heterogeneity in cancer using SNP arrays and whole genome sequencing

Mark R Zucker; Lynne V Abruzzo; Carmen D Herling; Lynn L Barron; Michael J Keating; Zachary B Abrams; Nyla Heerema; Kevin R Coombes

doi:10.1093/bioinformatics/btz057

Bioinformatics ◽

10.1093/bioinformatics/btz057 ◽

2019 ◽

Vol 35 (17) ◽

pp. 2924-2931

Author(s):

Mark R Zucker ◽

Lynne V Abruzzo ◽

Carmen D Herling ◽

Lynn L Barron ◽

Michael J Keating ◽

...

Keyword(s):

Clinical Outcome ◽

Snp Array ◽

Treatment Strategies ◽

Lymphocytic Leukemia ◽

Response To Treatment ◽

Supplementary Information ◽

Sequencing Data ◽

Clonal Heterogeneity ◽

Array Data ◽

Copy Numbers

Abstract Motivation Clonal heterogeneity is common in many types of cancer, including chronic lymphocytic leukemia (CLL). Previous research suggests that the presence of multiple distinct cancer clones is associated with clinical outcome. Detection of clonal heterogeneity from high-throughput data, such as sequencing or single nucleotide polymorphism (SNP) array data, is important for gaining a better understanding of cancer and may improve prediction of clinical outcome or response to treatment. Here, we present a new method, CloneSeeker, for inferring clonal heterogeneity from sequencing data, SNP array data, or both. Results We generated simulated SNP array and sequencing data and applied CloneSeeker along with two other methods. We demonstrate that CloneSeeker is more accurate than existing algorithms at determining the number of clones, distribution of cancer cells among clones, and mutation and/or copy numbers belonging to each clone. Next, we applied CloneSeeker to SNP array data from samples of 258 previously untreated CLL patients to gain a better understanding of the characteristics of CLL tumors and to elucidate the relationship between clonal heterogeneity and clinical outcome. We found that a significant majority of CLL patients appear to have multiple clones distinguished by copy number alterations alone. We also found that the presence of multiple clones corresponded with significantly worse survival among CLL patients. These findings may prove useful for improving the accuracy of prognosis and design of treatment strategies. Availability and implementation Code available on R-Forge: https://r-forge.r-project.org/projects/CloneSeeker/ Supplementary information Supplementary data are available at Bioinformatics online.
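To make "distribution of cancer cells among clones" concrete, the following is a minimal sketch and not CloneSeeker's actual algorithm or API. It assumes a sample is a mixture of clones with known cell fractions and known integer copy numbers at one genomic segment, and computes the average copy number that a bulk measurement would see, plus its log2 ratio against a diploid reference. The function names, the example fractions, and the choice of log base 2 are illustrative assumptions.

```python
import math


def expected_copy_number(fractions, copy_numbers):
    """Average copy number of one segment across a mixture of clones.

    fractions: share of cells in each clone (assumed to sum to 1).
    copy_numbers: total copy number of the segment in each clone.
    """
    if len(fractions) != len(copy_numbers):
        raise ValueError("need one copy number per clone")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("clone fractions must sum to 1")
    return sum(f * c for f, c in zip(fractions, copy_numbers))


def log2_ratio(avg_copy_number, reference_ploidy=2):
    """Log2 ratio of the observed average copy number to the reference."""
    return math.log2(avg_copy_number / reference_ploidy)


# Hypothetical sample: 60% of cells are diploid at this segment,
# 40% belong to a clone carrying a one-copy deletion.
avg = expected_copy_number([0.6, 0.4], [2, 1])
print(round(avg, 3), round(log2_ratio(avg), 3))
```

Inference runs in the opposite direction: given observed ratios across many segments, a method searches for the clone fractions and copy numbers that best explain them.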

Integrative DNA copy number detection and genotyping from sequencing and array-based platforms

10.1101/172700 ◽

2017 ◽

Cited By ~ 2

Author(s):

Zilu Zhou ◽

Weixin Wang ◽

Li-San Wang ◽

Nancy Ruonan Zhang

Keyword(s):

Copy Number ◽

Association Studies ◽

Snp Array ◽

Supplementary Information ◽

Detection Accuracy ◽

Sequencing Data ◽

Array Data ◽

Combining Data ◽

Allele Specific ◽

Cnv Detection

Abstract Motivation Copy number variations (CNVs) are gains and losses of DNA segments and have been associated with disease. Many large-scale genetic association studies are performing CNV analysis using whole exome sequencing (WES) and whole genome sequencing (WGS). In many of these studies, previous SNP-array data are available. An integrated cross-platform analysis is expected to improve resolution and accuracy, yet there is no tool for effectively combining data from sequencing and array platforms. The detection of CNVs using sequencing data alone can also be further improved by the utilization of allele-specific reads. Results We propose a statistical framework, integrated Copy Number Variation detection algorithm (iCNV), which can be applied to multiple study designs: WES only, WGS only, SNP array only, or any combination of SNP and sequencing data. iCNV applies platform-specific normalization, utilizes allele-specific reads from sequencing and integrates matched NGS and SNP-array data by a Hidden Markov Model (HMM). We compare integrated two-platform CNV detection using iCNV to naive intersection or union of platforms and show that iCNV increases sensitivity and robustness. We also assess the accuracy of iCNV on WGS data only, and show that the utilization of allele-specific reads improves CNV detection accuracy compared to existing methods. Availability https://github.com/zhouzilu/[email protected], [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

PERHAPS: Paired-End short Reads-based HAPlotyping from next-generation Sequencing data

Briefings in Bioinformatics ◽

10.1093/bib/bbaa320 ◽

2020 ◽

Author(s):

Jie Huang ◽

Stefano Pallotti ◽

Qianling Zhou ◽

Marcus Kleber ◽

Xiaomeng Xin ◽

...

Keyword(s):

Next Generation Sequencing ◽

Snp Array ◽

Simple Approach ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Short Read ◽

Array Data ◽

Short Reads ◽

Generation Sequencing

Abstract The identification of rare haplotypes may greatly expand our knowledge of the genetic architecture of both complex and monogenic traits. To this end, we developed PERHAPS (Paired-End short Reads-based HAPlotyping from next-generation Sequencing data), a new and simple approach to directly call haplotypes from short-read, paired-end Next Generation Sequencing (NGS) data. To benchmark this method, we considered the APOE classic polymorphism (*1/*2/*3/*4), since it represents one of the best examples of functional polymorphism arising from the haplotype combination of two Single Nucleotide Polymorphisms (SNPs). We leveraged the large Whole Exome Sequencing (WES) and SNP-array data obtained from the multi-ethnic UK BioBank (UKBB, N=48,855). By applying PERHAPS, based on piecing together the paired-end reads according to their FASTQ-labels, we extracted the haplotype data, along with their frequencies and the individual diplotype. Concordance rates between WES directly called diplotypes and the ones generated through statistical pre-phasing and imputation of SNP-array data are extremely high (>99%), whether the sample is stratified by SNP-array genotyping batch or by self-reported ethnic group. Hardy-Weinberg Equilibrium tests and the comparison of obtained haplotype frequencies with the ones available from the 1000 Genomes Project further supported the reliability of PERHAPS. Notably, we were able to determine the existence of the rare APOE*1 haplotype in two unrelated African subjects from UKBB, supporting its presence at appreciable frequency (approximately 0.5%) in the African Yoruba population. Despite acknowledging some technical shortcomings, PERHAPS represents a novel and simple approach that will partly overcome the limitations in direct haplotype calling from short read-based sequencing.
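The read-pair idea can be illustrated with a minimal sketch, which is not the PERHAPS implementation. It assumes that the alleles observed on a single sequenced fragment (both mates, joined by their shared read name) come from the same chromosome, so a fragment covering both SNPs reports one haplotype directly. The SNP identifiers and the allele-to-haplotype table below are assumptions stated on the forward strand, the input format is hypothetical, and the diplotype rule is naive: it ignores sequencing error and read-depth thresholds.

```python
from collections import Counter

# Assumed mapping from (rs429358 allele, rs7412 allele) to APOE haplotype.
APOE_HAPLOTYPES = {
    ("C", "T"): "*1",
    ("T", "T"): "*2",
    ("T", "C"): "*3",
    ("C", "C"): "*4",
}


def count_haplotypes(fragments):
    """Count haplotypes over fragments that cover both SNPs.

    fragments: iterable of dicts mapping SNP id -> observed base,
    one dict per read pair (hypothetical pre-parsed input).
    """
    counts = Counter()
    for frag in fragments:
        alleles = (frag.get("rs429358"), frag.get("rs7412"))
        hap = APOE_HAPLOTYPES.get(alleles)
        if hap is not None:  # skip fragments missing a SNP or with other bases
            counts[hap] += 1
    return counts


def naive_diplotype(counts):
    """Report the one or two best-supported haplotypes as a diplotype."""
    top = [hap for hap, _ in counts.most_common(2)]
    if not top:
        return None
    if len(top) == 1:
        top = top * 2  # a single observed haplotype is treated as homozygous
    return "/".join(sorted(top))


fragments = [
    {"rs429358": "T", "rs7412": "C"},
    {"rs429358": "C", "rs7412": "C"},
    {"rs429358": "T", "rs7412": "C"},
    {"rs429358": "T"},  # covers only one SNP, so it is uninformative
]
print(naive_diplotype(count_haplotypes(fragments)))
```

Because phase is read directly from fragments, no reference panel or statistical phasing is needed, but the two SNPs must lie close enough to be spanned by one read pair.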

AlphaFamImpute: high-accuracy imputation in full-sib families from genotype-by-sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa499 ◽

2020 ◽

Vol 36 (15) ◽

pp. 4369-4371

Author(s):

Andrew Whalen ◽

Gregor Gorjanc ◽

John M Hickey

Keyword(s):

Snp Array ◽

High Accuracy ◽

Supplementary Information ◽

Sequencing Data ◽

Single Nucleotide ◽

Genome Wide ◽

Sib Families ◽

Genotype By Sequencing ◽

Low Coverage ◽

Python Package

Abstract Summary AlphaFamImpute is an imputation package for calling, phasing and imputing genome-wide genotypes in outbred full-sib families from single nucleotide polymorphism (SNP) array and genotype-by-sequencing (GBS) data. GBS data are increasingly being used to genotype individuals, especially when SNP arrays do not exist for a population of interest. Low-coverage GBS produces data with a large number of missing or incorrect naïve genotype calls, which can be improved by identifying shared haplotype segments between full-sib individuals. Here, we present AlphaFamImpute, an algorithm specifically designed to exploit the genetic structure of full-sib families. It performs imputation using a two-step approach. In the first step, it phases and imputes parental genotypes based on the segregation states of their offspring (i.e. which pair of parental haplotypes the offspring inherited). In the second step, it phases and imputes the offspring genotypes by detecting which haplotype segments the offspring inherited from their parents. With a series of simulations, we find that AlphaFamImpute obtains high-accuracy genotypes, even when the parents are not genotyped and individuals are sequenced at <1x coverage. Availability and implementation AlphaFamImpute is available as a Python package from the AlphaGenes website http://www.AlphaGenes.roslin.ed.ac.uk/AlphaFamImpute. Supplementary information Supplementary data are available at Bioinformatics online.

Characterization and Mapping of retr04, retr05 and retr06 Broad-Spectrum Resistances to Turnip Mosaic Virus in Brassica juncea, and the Development of Robust Methods for Utilizing Recalcitrant Genotyping Data

Frontiers in Plant Science ◽

10.3389/fpls.2021.787354 ◽

2022 ◽

Vol 12 ◽

Author(s):

Lawrence E. Bramham ◽

Tongtong Wang ◽

Erin E. Higgins ◽

Isobel A. P. Parkin ◽

Guy C. Barker ◽

...

Keyword(s):

Mosaic Virus ◽

Brassica Juncea ◽

Allelic Variation ◽

Snp Array ◽

Genotyping By Sequencing ◽

Turnip Mosaic Virus ◽

Interactive Effects ◽

Sequencing Data ◽

Array Data ◽

Tumv Resistance

Turnip mosaic virus (TuMV) induces disease in susceptible hosts, notably impacting cultivation of important crop species of the Brassica genus. Few effective plant viral disease management strategies exist with the majority of current approaches aiming to mitigate the virus indirectly through control of aphid vector species. Multiple sources of genetic resistance to TuMV have been identified previously, although the majority are strain-specific and have not been exploited commercially. Here, two Brassica juncea lines (TWBJ14 and TWBJ20) with resistance against important TuMV isolates (UK 1, vVIR24, CDN 1, and GBR 6) representing the most prevalent pathotypes of TuMV (1, 3, 4, and 4, respectively) and known to overcome other sources of resistance, have been identified and characterized. Genetic inheritance of both resistances was determined to be based on a recessive two-gene model. Using both single nucleotide polymorphism (SNP) array and genotyping by sequencing (GBS) methods, quantitative trait loci (QTL) analyses were performed using first backcross (BC1) genetic mapping populations segregating for TuMV resistance. Pairs of statistically significant TuMV resistance-associated QTLs with additive interactive effects were identified on chromosomes A03 and A06 for both TWBJ14 and TWBJ20 material. Complementation testing between these B. juncea lines indicated that one resistance-linked locus was shared. Following established resistance gene nomenclature for recessive TuMV resistance genes, these new resistance-associated loci have been termed retr04 (chromosome A06, TWBJ14, and TWBJ20), retr05 (A03, TWBJ14), and retr06 (A03, TWBJ20). Genotyping by sequencing data investigated in parallel to robust SNP array data was highly suboptimal, with informative data not established for key BC1 parental samples. This necessitated careful consideration and the development of new methods for processing compromised data. 
Using reductive screening of potential markers according to allelic variation and the recombination observed across BC1 samples genotyped, compromised GBS data was rendered functional with near-equivalent QTL outputs to the SNP array data. The reductive screening strategy employed here offers an alternative to methods relying upon imputation or artificial correction of genotypic data and may prove effective for similar biparental QTL mapping studies.

The Tangent copy-number inference pipeline for cancer genome analyses

10.1101/566505 ◽

2019 ◽

Cited By ~ 3

Author(s):

Barbara Tabak ◽

Gordon Saksena ◽

Coyin Oh ◽

Galen F. Gao ◽

Barbara Hill Meyers ◽

...

Keyword(s):

Dna Sequencing ◽

Copy Number ◽

Signal To Noise Ratio ◽

Snp Array ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Sequencing Data ◽

Array Data ◽

Link Type ◽

Genome Analyses

Abstract Motivation Somatic copy-number alterations (SCNAs) play an important role in cancer development. Systematic noise in sequencing and array data presents a significant challenge to the inference of SCNAs for cancer genome analyses. As part of The Cancer Genome Atlas (TCGA), the Broad Institute Genome Characterization Center developed the Tangent copy-number inference pipeline to generate copy-number profiles using single-nucleotide polymorphism (SNP) array and whole-exome sequencing (WES) data from over 10,000 pairs of tumors and matched normal samples. Here, we describe the Tangent pipeline, which begins with DNA sequencing data in the form of .bam files or raw SNP array probe-level intensity data, and ends with segmented copy-number calls to facilitate the identification of novel genes potentially targeted by SCNAs. We also describe a modification of Tangent, Pseudo-Tangent, which enables denoising through comparisons between tumor profiles when few normal samples are available. Results Tangent Normalization offers substantial signal-to-noise ratio (SNR) improvements compared to conventional normalization methods in both SNP array and WES analyses. The improvement in SNRs is achieved primarily through noise reduction with minimal effect on signal. Pseudo-Tangent also reduces noise when few normal samples are available. Tangent and Pseudo-Tangent are broadly applicable and enable more accurate inference of SCNAs from DNA sequencing and array data. Availability and Implementation Tangent is available at https://github.com/coyin/tangent and as a Docker image (https://hub.docker.com/r/coyin/tangent). Tangent is also the normalization method for the Copy Number pipeline in Genome Analysis Toolkit 4 (GATK4). Contact [email protected], [email protected], [email protected]

Directional allelic imbalance profiling and visualization from multi-sample data with RECUR

Bioinformatics ◽

10.1093/bioinformatics/bty885 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2300-2302 ◽

Cited By ~ 3

Author(s):

Yasminka A Jakubek ◽

F Anthony San Lucas ◽

Paul Scheet

Keyword(s):

Allelic Imbalance ◽

Critical Role ◽

Snp Array ◽

Supplementary Information ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Recurrent Mutations ◽

Two Samples ◽

Chromosomal Changes ◽

Color Scheme

Abstract Motivation Genetic analysis of cancer regularly includes two or more samples from the same patient. Somatic copy number alterations leading to allelic imbalance (AI) play a critical role in cancer initiation and progression. Directional analysis and visualization of the alleles in imbalance in multi-sample settings allow for inference of recurrent mutations, providing insights into mutation rates, clonality and the genomic architecture and etiology of cancer. Results The REpeat Chromosomal changes Uncovered by Reflection (RECUR) is an R application for the comparative analysis of AI profiles derived from SNP array and next-generation sequencing data. The algorithm accepts genotype calls and ‘B allele’ frequencies (BAFs) from at least two samples derived from the same individual. For a predefined set of genomic regions with AI, RECUR compares BAF values among samples. In the presence of AI, the expected value of a BAF can shift in two possible directions, reflecting an increased or decreased abundance of the maternal haplotype, relative to the paternal. The phenomenon of opposite haplotype shifts, or ‘mirrored subclonal allelic imbalance’, is a form of heterogeneity, and has been linked to clinico-pathological features of cancer. RECUR detects such genomic segments of opposite haplotypes in imbalance and plots BAF values for all samples, using a two-color scheme for intuitive visualization. Availability and implementation RECUR is available as an R application. Source code and documentation are available at scheet.org. Supplementary information Supplementary data are available at Bioinformatics online.
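The notion of BAF shifts in two possible directions can be sketched as follows. This is a simplified illustration, not RECUR's algorithm: it assumes two samples from the same individual, BAF values at the same heterozygous SNPs within one region of allelic imbalance, and an expected balanced BAF of 0.5. The function name and the sign-agreement score are assumptions made for the example.

```python
def direction_concordance(baf_a, baf_b, center=0.5):
    """Fraction of SNPs whose BAF deviates from `center` in the same
    direction in both samples.

    A value near 1 suggests the same haplotype is in excess in both
    samples; a value near 0 suggests opposite haplotypes, i.e. a
    mirrored pattern. Returns None if no SNP is informative.
    """
    same = total = 0
    for a, b in zip(baf_a, baf_b):
        dev_a, dev_b = a - center, b - center
        if dev_a == 0 or dev_b == 0:
            continue  # no shift in one sample: uninformative SNP
        total += 1
        if (dev_a > 0) == (dev_b > 0):
            same += 1
    return same / total if total else None


# Hypothetical BAFs at four heterozygous SNPs in one imbalanced region.
sample_1 = [0.65, 0.35, 0.66, 0.34]
same_haplotype = [0.60, 0.40, 0.62, 0.38]
mirrored = [0.40, 0.60, 0.38, 0.62]

print(direction_concordance(sample_1, same_haplotype))  # 1.0
print(direction_concordance(sample_1, mirrored))        # 0.0
```

With noisy real data the score would fall between these extremes, so a practical tool needs a statistical test rather than a raw fraction.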

A systematic evaluation of copy number alterations detection methods on real SNP array and deep sequencing data

BMC Bioinformatics ◽

10.1186/s12859-019-3266-7 ◽

2019 ◽

Vol 20 (S25) ◽

Cited By ~ 1

Author(s):

Fei Luo

Keyword(s):

Copy Number ◽

Large Scale ◽

Detection Method ◽

Snp Array ◽

Detection Methods ◽

Copy Number Alterations ◽

Sequencing Data ◽

Array Data ◽

Matched Samples ◽

Single Tumor

Abstract Background Copy number alterations (CNAs) are tightly associated with cancers, so detecting them accurately is one of the most important tasks in cancer genomics. A series of CNA detection methods have been proposed and new ones are still being developed. Due to the complexity of CNAs in cancers, no CNA detection method has been accepted as the gold-standard caller. Several evaluation studies have attempted to characterize the performance of typical CNA detection methods. Limited by the scale of their evaluation data, these comparisons have not reached a consensus, and researchers remain unsure how to choose a suitable CNA caller for their analysis. A more comprehensive evaluation of typical CNA detection methods is therefore needed. Results In this work, we use a large-scale real dataset from the CAGEKID consortium to evaluate a total of 12 typical CNA detection methods. These methods are the most widely used in cancer research and are commonly used as benchmarks for newly proposed CNA detection methods. The dataset comprises SNP array data on 94 samples and whole genome sequencing data on 10 samples. Evaluations cover the current scenarios of CNA detection: detecting CNAs on SNP array data, on sequencing data with matched tumor and normal samples, and on sequencing data with a single tumor sample. First, three SNP-based methods are ranked. Subsequently, the best SNP-based method's results are used as the benchmark to compare six matched-sample methods and three single-tumor-sample methods in terms of preprocessing, recall rate, Jaccard index and segmentation characteristics. Conclusions Our survey thoroughly reveals the strengths and weaknesses of 12 typical methods. We explain why methods show specific characteristics from a methodological standpoint. Finally, we present guiding principles for choosing a suitable CNA detection method under specific conditions. Some unsolved problems and expectations for upcoming CNA detection methods are also addressed.
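Recall rate and Jaccard index for CNA calls can be computed in several ways; the sketch below shows one plausible base-pair-level definition and is not taken from the paper. It assumes segments are half-open `(start, end)` intervals on a single chromosome, ignores the gain/loss label, and treats one call set as the benchmark. All function names and the example coordinates are hypothetical.

```python
def merge(segments):
    """Sort segments and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(segments):
    return sum(end - start for start, end in segments)


def intersection_length(a, b):
    """Bases covered by both merged, sorted segment lists (two-pointer sweep)."""
    i = j = overlap = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            overlap += hi - lo
        # advance whichever segment ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlap


def recall_and_jaccard(benchmark, called):
    """Recall = shared bases / benchmark bases; Jaccard = shared / union."""
    bench, calls = merge(benchmark), merge(called)
    shared = intersection_length(bench, calls)
    union = total_length(bench) + total_length(calls) - shared
    recall = shared / total_length(bench) if bench else 0.0
    jaccard = shared / union if union else 0.0
    return recall, jaccard


# Hypothetical segments on one chromosome.
benchmark = [(100, 200), (300, 400)]
called = [(150, 250), (300, 350)]
print(recall_and_jaccard(benchmark, called))
```

Recall rewards covering the benchmark and ignores over-calling, whereas the Jaccard index also penalizes called bases outside the benchmark, which is why the two are usually reported together.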

A Novel Quality-Control Procedure to Improve the Accuracy of Rare Variant Calling in SNP Arrays

Frontiers in Genetics ◽

10.3389/fgene.2021.736390 ◽

2021 ◽

Vol 12 ◽

Author(s):

Ting-Hsuan Sun ◽

Yu-Hsuan Joni Shao ◽

Chien-Lin Mao ◽

Miao-Neng Hung ◽

Yi-Yun Lo ◽

...

Keyword(s):

Quality Control ◽

Rare Variant ◽

Rare Variants ◽

Snp Array ◽

Variant Calling ◽

Control Procedure ◽

Sequencing Data ◽

Snp Arrays ◽

Array Data ◽

Quality Control Procedure

Background: Single-nucleotide polymorphism (SNP) arrays are an ideal technology for genotyping genetic variants in mass screening. However, using SNP arrays to detect rare variants [with a minor allele frequency (MAF) of <1%] is still a challenge because of noise signals and batch effects. An approach that improves the genotyping quality is needed for clinical applications. Methods: We developed a quality-control procedure for rare variants which integrates different algorithms, filters, and experiments to increase the accuracy of variant calling. Using data from the TWB 2.0 custom Axiom array, we adopted an advanced normalization adjustment to prevent false calls caused by splitting the cluster and a rare het adjustment which decreases false calls in rare variants. The concordance of allelic frequencies from array data was compared to those from sequencing datasets of Taiwanese. Finally, genotyping results were used to detect familial hypercholesterolemia (FH), thrombophilia (TH), and maturity-onset diabetes of the young (MODY) to assess the performance in disease screening. All heterozygous calls were verified by Sanger sequencing or qPCR. The positive predictive value (PPV) of each step was estimated to evaluate the performance of our procedure. Results: We analyzed SNP array data from 43,433 individuals, which interrogated 267,247 rare variants. The advanced normalization and rare het adjustment methods adjusted the genotype calls of 168,134 variants (96.49%). We further removed 3916 probesets which were discordant in MAFs between the SNP array and sequencing data. The PPV for detecting pathogenic variants with 0.01%<MAF≤1% exceeded 99.37%. PPVs for those with an MAF of ≤0.01% improved from 95% to 100% for FH, 42.11% to 85.19% for TH, and 18.24% to 72.22% for MODY after adopting our rare variant quality-control procedure and experimental verification. Conclusion: Adopting our quality-control procedure, SNP arrays can adequately detect variants with MAF values ranging 0.01%∼0.1%. For variants with MAF values of ≤0.01%, experimental validation is needed unless sequencing data from a homogeneous population of >10,000 are available. The results demonstrated that our procedure could perform correct genotype calling of rare variants. It provides a solution for pathogenic variant detection using SNP arrays. The approach brings tremendous promise for implementing precision medicine in medical practice.
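For readers unfamiliar with the metric, positive predictive value is the share of positive calls that are confirmed: PPV = TP / (TP + FP). The sketch below is a generic illustration with made-up counts; it does not reproduce the study's data, and the function name is an assumption.

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = confirmed positive calls / all positive calls."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("no positive calls, PPV is undefined")
    return true_positives / called


# Hypothetical example: 20 heterozygous calls from the array,
# 18 of which are confirmed by an orthogonal assay.
print(f"{positive_predictive_value(18, 2):.2%}")
```

Because the denominator counts only positive calls, PPV for very rare variants is sensitive to even a handful of false calls, which is consistent with the abstract's recommendation of experimental validation at the lowest MAF range.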

Performance comparison of SNP detection tools with illumina exome sequencing data—an assessment using both family pedigree information and sample-matched SNP array data

Nucleic Acids Research ◽

10.1093/nar/gku392 ◽

2014 ◽

Vol 42 (12) ◽

pp. e101-e101 ◽

Cited By ~ 31

Author(s):

Ming Yi ◽

Yongmei Zhao ◽

Li Jia ◽

Mei He ◽

Electron Kebebew ◽

...

Keyword(s):

Exome Sequencing ◽

Snp Array ◽

Performance Comparison ◽

Pedigree Information ◽

Sequencing Data ◽

Snp Detection ◽

Array Data ◽

Exome Sequencing Data ◽

Family Pedigree

Inferring tumor evolution from longitudinal samples

10.1101/526814 ◽

2019 ◽

Cited By ~ 3

Author(s):

Matthew A. Myers ◽

Gryte Satas ◽

Benjamin J. Raphael

Keyword(s):

Dna Sequencing ◽

Phylogenetic Trees ◽

Simulated Data ◽

Lymphocytic Leukemia ◽

Response To Treatment ◽

Tumor Evolution ◽

Sequencing Data ◽

Considerable Uncertainty ◽

Longitudinal Sampling ◽

Clonal Composition

Background: Determining the clonal composition and somatic evolution of a tumor greatly aids in accurate prognosis and effective treatment for cancer. In order to understand how a tumor evolves over time and/or in response to treatment, multiple recent studies have performed longitudinal DNA sequencing of tumor samples from the same patient at several different time points. However, none of the existing algorithms that infer clonal composition and phylogeny using several bulk tumor samples from the same patient integrate the information that these samples were obtained from longitudinal observations. Results: We introduce a model for a longitudinally-observed phylogeny and derive constraints that longitudinal samples impose on the reconstruction of a phylogeny from bulk samples. These constraints form the basis for a new algorithm, Cancer Analysis of Longitudinal Data through Evolutionary Reconstruction (CALDER), which infers phylogenetic trees from longitudinal bulk DNA sequencing data. We show on simulated data that constraints from longitudinal sampling can substantially reduce ambiguity when deriving a phylogeny from multiple bulk tumor samples, each a mixture of tumor clones. On real data, where there is often considerable uncertainty in the clonal composition of a sample, longitudinal constraints yield more parsimonious phylogenies with fewer tumor clones per sample. We demonstrate that CALDER reconstructs more plausible phylogenies than existing methods on two longitudinal DNA sequencing datasets from chronic lymphocytic leukemia patients. These findings show the advantages of directly incorporating temporal information from longitudinal sampling into tumor evolution studies. Availability: CALDER is available at https://github.com/raphael-group.
