Integrative DNA copy number detection and genotyping from sequencing and array-based platforms

2017 ◽  
Author(s):  
Zilu Zhou ◽  
Weixin Wang ◽  
Li-San Wang ◽  
Nancy Ruonan Zhang

Abstract Motivation Copy number variations (CNVs) are gains and losses of DNA segments and have been associated with disease. Many large-scale genetic association studies are performing CNV analysis using whole exome sequencing (WES) and whole genome sequencing (WGS). In many of these studies, previous SNP-array data are available. An integrated cross-platform analysis is expected to improve resolution and accuracy, yet there is no tool for effectively combining data from sequencing and array platforms. The detection of CNVs using sequencing data alone can also be further improved by using allele-specific reads. Results We propose a statistical framework, integrated Copy Number Variation detection algorithm (iCNV), which can be applied to multiple study designs: WES only, WGS only, SNP array only, or any combination of SNP and sequencing data. iCNV applies platform-specific normalization, utilizes allele-specific reads from sequencing and integrates matched NGS and SNP-array data by a Hidden Markov Model (HMM). We compare integrated two-platform CNV detection using iCNV to naive intersection or union of platforms and show that iCNV increases sensitivity and robustness. We also assess the accuracy of iCNV on WGS data only, and show that the utilization of allele-specific reads improves CNV detection accuracy compared to existing methods. Availability https://github.com/zhouzilu/[email protected], [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

2019 ◽  
Vol 35 (17) ◽  
pp. 2924-2931
Author(s):  
Mark R Zucker ◽  
Lynne V Abruzzo ◽  
Carmen D Herling ◽  
Lynn L Barron ◽  
Michael J Keating ◽  
...  

Abstract Motivation Clonal heterogeneity is common in many types of cancer, including chronic lymphocytic leukemia (CLL). Previous research suggests that the presence of multiple distinct cancer clones is associated with clinical outcome. Detection of clonal heterogeneity from high-throughput data, such as sequencing or single nucleotide polymorphism (SNP) array data, is important for gaining a better understanding of cancer and may improve prediction of clinical outcome or response to treatment. Here, we present a new method, CloneSeeker, for inferring clonal heterogeneity from sequencing data, SNP array data, or both. Results We generated simulated SNP array and sequencing data and applied CloneSeeker along with two other methods. We demonstrate that CloneSeeker is more accurate than existing algorithms at determining the number of clones, distribution of cancer cells among clones, and mutation and/or copy numbers belonging to each clone. Next, we applied CloneSeeker to SNP array data from samples of 258 previously untreated CLL patients to gain a better understanding of the characteristics of CLL tumors and to elucidate the relationship between clonal heterogeneity and clinical outcome. We found that a significant majority of CLL patients appear to have multiple clones distinguished by copy number alterations alone. We also found that the presence of multiple clones corresponded with significantly worse survival among CLL patients. These findings may prove useful for improving the accuracy of prognosis and design of treatment strategies. Availability and implementation Code available on R-Forge: https://r-forge.r-project.org/projects/CloneSeeker/ Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Xizhi Luo ◽  
Fei Qin ◽  
Guoshuai Cai ◽  
Feifei Xiao

Abstract Motivation Copy number variation plays important roles in human complex diseases. Detecting copy number variants (CNVs) amounts to identifying mean shifts in genetic intensities in order to locate chromosomal breakpoints, a step referred to as chromosomal segmentation. Many segmentation algorithms have been developed under the strong assumption that observations at genetic loci are independent, and they assume each locus has an equal chance of being a breakpoint (i.e. a boundary of a CNV). However, this assumption is violated from a genetics perspective because genomic positions are correlated, for example through linkage disequilibrium (LD). Our study showed that the LD structure is related to the location distribution of CNVs, which indeed presents a non-random pattern on the genome. To generate more accurate CNVs, we proposed a novel algorithm, LDcnv, that models the CNV data together with its biological characteristics relating to genetic dependence structure (i.e. LD). Results We theoretically demonstrated the correlation structure of CNV data in SNP arrays, which further supports the necessity of integrating biological structure into statistical methods for CNV detection. We therefore developed LDcnv, which integrates the genomic correlation structure with a local search strategy into statistical modeling of the CNV intensities. To evaluate the performance of LDcnv, we conducted extensive simulations and analyzed large-scale HapMap datasets. We showed that LDcnv presented high accuracy, stability and robustness in CNV detection and higher precision in detecting short CNVs compared to existing methods. This new segmentation algorithm has a wide scope of potential application with data from various high-throughput technology platforms. Availability and implementation https://github.com/FeifeiXiaoUSC/LDcnv. Supplementary information Supplementary data are available at Bioinformatics online.
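The segmentation step described in this abstract (locating a breakpoint as a mean shift in intensities) can be illustrated with a minimal sketch. This is not LDcnv: it is a generic least-squares search for a single change point, and it makes exactly the independence assumption the abstract criticizes. The function name `best_single_breakpoint` is hypothetical.

```python
import numpy as np

def best_single_breakpoint(intensities):
    """Return (k, cost): the split index k that minimizes the total
    within-segment sum of squared errors when the signal is divided
    into intensities[:k] and intensities[k:].

    Illustrative only: assumes independent observations and exactly
    one mean shift, unlike LD-aware methods.
    """
    x = np.asarray(intensities, dtype=float)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    csum = np.cumsum(x)        # running sum of values
    csq = np.cumsum(x ** 2)    # running sum of squared values
    best_k, best_cost = None, np.inf
    for k in range(1, n):
        # SSE of a segment = sum(x^2) - (sum(x))^2 / length
        left_sse = csq[k - 1] - csum[k - 1] ** 2 / k
        right_sum = csum[-1] - csum[k - 1]
        right_sse = (csq[-1] - csq[k - 1]) - right_sum ** 2 / (n - k)
        cost = left_sse + right_sse
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost
```

Applying the function recursively to each resulting segment gives a simple binary segmentation; a stopping rule (for example a penalty on the number of breakpoints) would be needed in practice.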


2019 ◽  
Author(s):  
Barbara Tabak ◽  
Gordon Saksena ◽  
Coyin Oh ◽  
Galen F. Gao ◽  
Barbara Hill Meyers ◽  
...  

Abstract Motivation Somatic copy-number alterations (SCNAs) play an important role in cancer development. Systematic noise in sequencing and array data presents a significant challenge to the inference of SCNAs for cancer genome analyses. As part of The Cancer Genome Atlas (TCGA), the Broad Institute Genome Characterization Center developed the Tangent copy-number inference pipeline to generate copy-number profiles using single-nucleotide polymorphism (SNP) array and whole-exome sequencing (WES) data from over 10,000 pairs of tumors and matched normal samples. Here, we describe the Tangent pipeline, which begins with DNA sequencing data in the form of .bam files or raw SNP array probe-level intensity data, and ends with segmented copy-number calls to facilitate the identification of novel genes potentially targeted by SCNAs. We also describe a modification of Tangent, Pseudo-Tangent, which enables denoising through comparisons between tumor profiles when few normal samples are available. Results Tangent Normalization offers substantial signal-to-noise ratio (SNR) improvements compared to conventional normalization methods in both SNP array and WES analyses. The improvement in SNRs is achieved primarily through noise reduction with minimal effect on signal. Pseudo-Tangent also reduces noise when few normal samples are available. Tangent and Pseudo-Tangent are broadly applicable and enable more accurate inference of SCNAs from DNA sequencing and array data. Availability and Implementation Tangent is available at https://github.com/coyin/tangent and as a Docker image (https://hub.docker.com/r/coyin/tangent). Tangent is also the normalization method for the Copy Number pipeline in Genome Analysis Toolkit 4 (GATK4). [email protected], [email protected], [email protected]


2018 ◽  
Author(s):  
Brendan O’Fallon ◽  
Jacob Durtschi ◽  
Tracey Lewis ◽  
Devin Close

Abstract Copy number variants (CNVs) play a significant role in human heredity and disease; however, sensitive and specific characterization of CNVs from NGS data has remained challenging. Detection is especially problematic for hybridization-capture data in which read counts are the sole source of copy number information. We describe two algorithmic adaptations that improve CNV detection accuracy in a Hidden Markov Model (HMM) context. First, we present a method for computing target- and copy number state-specific emission distributions. Second, we demonstrate that the Pointwise Maximum a posteriori (PMAP) HMM decoding procedure yields improved sensitivity for small CNV calls compared to the more common Viterbi HMM decoder. We develop a prototype implementation, called Cobalt, and compare it to other CNV detection tools using sets of simulated and previously detected CNVs with sizes spanning a single exon up to a full chromosome. In both the simulation and previously detected CNV studies, Cobalt shows similar sensitivity but significantly improved positive predictive value (PPV) compared to other callers. Overall sensitivity is 80%-90% for deletion CNVs spanning 1-4 targets and 90%-100% for larger deletion events, while sensitivity is somewhat lower for small duplication CNVs. At similar sensitivity, Cobalt typically makes 5X fewer total calls than other callers.
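The difference between the two decoders named in this abstract can be sketched generically. Viterbi returns the single most probable joint state path, whereas pointwise MAP picks, at each position, the state with the highest posterior marginal computed by the forward-backward algorithm. The code below is a textbook sketch, not Cobalt's implementation; the function names and the log-space parameter layout (`log_pi`, `log_A`, `log_B`) are assumptions, and all probabilities are assumed to be strictly positive.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis.
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def viterbi(log_pi, log_A, log_B):
    """Most probable joint path. log_B[t, k] = log P(obs_t | state k)."""
    T, K = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[i, j]: come from i, move to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):                # trace back pointers
        path[t - 1] = back[t, path[t]]
    return path

def pointwise_map(log_pi, log_A, log_B):
    """Per-position argmax of posterior marginals (forward-backward)."""
    T, K = log_B.shape
    alpha = np.empty((T, K))
    beta = np.zeros((T, K))
    alpha[0] = log_pi + log_B[0]
    for t in range(1, T):
        alpha[t] = _logsumexp(alpha[t - 1][:, None] + log_A, axis=0) + log_B[t]
    for t in range(T - 2, -1, -1):
        beta[t] = _logsumexp(log_A + (log_B[t + 1] + beta[t + 1])[None, :], axis=1)
    log_post = alpha + beta
    log_post -= _logsumexp(log_post, axis=1)[:, None]   # normalize each row
    return log_post.argmax(axis=1), np.exp(log_post)
```

Because pointwise MAP scores each position separately, a short run of targets with strong evidence can be called even when the transition penalty would lead Viterbi to absorb it into the surrounding state. The trade-off is that the pointwise path is not guaranteed to be a high-probability path as a whole.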


2018 ◽  
Author(s):  
Zhongyang Zhang ◽  
Haoxiang Cheng ◽  
Xiumei Hong ◽  
Antonio F. Di Narzo ◽  
Oscar Franzen ◽  
...  

Abstract The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV a) identifies and eliminates batch effects at the raw data level; b) assembles individual CNV calls from multiple existing callers with complementary strengths into CNV regions (CNVRs) by a heuristic algorithm; c) re-genotypes each CNVR with a local likelihood model adjusted by global information across multiple CNVRs; d) refines CNVR boundaries by local correlation structure in copy number intensities; e) provides direct CNV genotyping accompanied by a confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.


2019 ◽  
Vol 20 (S25) ◽  
Author(s):  
Fei Luo

Abstract Background Copy number alterations (CNAs) are tightly associated with cancers, so detecting them accurately is one of the most important tasks in cancer genomics. A series of CNA detection methods have been proposed and new ones are still being developed. Due to the complexity of CNAs in cancers, no CNA detection method has been accepted as the gold-standard caller. Several evaluation studies have attempted to characterize the performance of typical CNA detection methods. Limited by the scale of their evaluation data, these comparisons have not reached a consensus, and researchers remain unsure how to choose a suitable CNA caller for their analysis. A more comprehensive evaluation of typical CNA detection methods is therefore needed. Results In this work, we use a large-scale real dataset from the CAGEKID consortium to evaluate a total of 12 typical CNA detection methods. These methods are the most widely used in cancer research and are routinely used as benchmarks for newly proposed CNA detection methods. The dataset comprises SNP array data on 94 samples and whole genome sequencing data on 10 samples. Evaluations cover the current scenarios of CNA detection: detection on SNP array data, on sequencing data with matched tumor and normal samples, and on sequencing data with a single tumor sample. First, three SNP-based methods are ranked. Subsequently, the results of the best SNP-based method are used as the benchmark to compare six matched-sample methods and three single-tumor-sample methods in terms of preprocessing, recall rate, Jaccard index and segmentation characteristics. Conclusions Our survey thoroughly reveals the strengths and weaknesses of the 12 typical methods. We explain why methods show specific characteristics from a methodological standpoint. Finally, we present guiding principles for choosing a suitable CNA detection method under specific conditions. Some unsolved problems and expectations for upcoming CNA detection methods are also addressed.
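Two of the comparison metrics named in this abstract, recall rate and Jaccard index, can be computed on a per-base basis between a caller's segments and a benchmark set. The sketch below is one plausible definition, not necessarily the one used in the survey; it assumes half-open `(start, end)` intervals on a single chromosome, and all function names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def total_length(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))

def overlap_length(a, b):
    """Number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def jaccard(calls, benchmark):
    """Shared bases divided by bases covered by either set."""
    inter = overlap_length(calls, benchmark)
    union = total_length(calls) + total_length(benchmark) - inter
    return inter / union if union else 0.0

def recall(calls, benchmark):
    """Fraction of benchmark bases recovered by the calls."""
    denom = total_length(benchmark)
    return overlap_length(calls, benchmark) / denom if denom else 0.0
```

A per-base definition rewards callers for boundary accuracy; an event-level definition (counting a benchmark CNA as recovered if any call overlaps it) would rank callers differently, so the choice should be stated alongside any reported numbers.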


2015 ◽  
Vol 32 (6) ◽  
pp. 926-928 ◽  
Author(s):  
Xuefeng Wang ◽  
Mengjie Chen ◽  
Xiaoqing Yu ◽  
Natapol Pornputtapong ◽  
Hao Chen ◽  
...  

Abstract Summary: In this article, we introduce a robust and efficient strategy for deriving global and allele-specific copy number alterations (CNA) from cancer whole exome sequencing data based on Log R ratios and B-allele frequencies. Applying the approach to the analysis of over 200 skin cancer samples, we demonstrate its utility for discovering distinct CNA events and for deriving ancillary information such as tumor purity. Availability and implementation: https://github.com/xfwang/CLOSE Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
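The two quantities this abstract builds on can be sketched from read counts. The code below uses common generic definitions, not the CLOSE implementation: the Log R ratio is taken as the log2 ratio of library-size-normalized tumor depth to matched normal depth per target, and the B-allele frequency as the fraction of reads carrying the alternate allele at a heterozygous site. The function names and the `pseudocount` parameter are assumptions.

```python
import numpy as np

def log_r_ratio(tumor_depth, normal_depth, pseudocount=1.0):
    """Per-target log2 ratio of tumor to normal depth.

    Depths are scaled by their totals so that differences in
    sequencing depth between the two libraries cancel out.
    The pseudocount avoids division by zero and log of zero.
    """
    t = np.asarray(tumor_depth, dtype=float) + pseudocount
    n = np.asarray(normal_depth, dtype=float) + pseudocount
    return np.log2((t / t.sum()) / (n / n.sum()))

def b_allele_frequency(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate (B) allele.

    Sites with no coverage are returned as NaN.
    """
    ref = np.asarray(ref_reads, dtype=float)
    alt = np.asarray(alt_reads, dtype=float)
    total = ref + alt
    return np.divide(alt, total, out=np.full_like(total, np.nan), where=total > 0)
```

In a diploid region without alterations, the Log R ratio is expected near 0 and the B-allele frequency at heterozygous sites near 0.5. Joint departures from these values are what allow allele-specific events and tumor purity to be inferred.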


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Xinping Fan ◽  
Guanghao Luo ◽  
Yu S. Huang

Abstract Background Copy number alterations (CNAs), due to their large impact on the genome, have been an important contributing factor to oncogenesis and metastasis. Detecting genomic alterations from the shallow-sequencing data of a low-purity tumor sample remains a challenging task. Results We introduce Accucopy, a method to infer total copy numbers (TCNs) and allele-specific copy numbers (ASCNs) from challenging low-purity and low-coverage tumor samples. Accucopy adopts many robust statistical techniques such as kernel smoothing of coverage differentiation information to discern signals from noise and combines ideas from time-series analysis and the signal-processing field to derive a range of estimates for the period in a histogram of coverage differentiation information. Statistical learning models such as the tiered Gaussian mixture model, the expectation–maximization algorithm, and sparse Bayesian learning were customized and built into the model. Accucopy is implemented in C++/Rust, packaged in a Docker image, and supports non-human samples; more at http://www.yfish.org/software/. Conclusions We describe Accucopy, a method that can predict both TCNs and ASCNs from low-coverage low-purity tumor sequencing data. Through comparative analyses in both simulated and real-sequencing samples, we demonstrate that Accucopy is more accurate than Sclust, ABSOLUTE, and Sequenza.


2020 ◽  
Vol 36 (12) ◽  
pp. 3890-3891
Author(s):  
Linjie Wu ◽  
Han Wang ◽  
Yuchao Xia ◽  
Ruibin Xi

Abstract Motivation Whole-genome sequencing (WGS) is widely used for copy number variation (CNV) detection. However, in most bacteria, the circular genome structure and high replication rate cause reads to be enriched near the replication origin. CNV detection based on read depth can be seriously affected by this replication bias. Results Using ∼200 bacterial WGS datasets, we show that the replication bias is widespread. We develop CNV-BAC (CNV-Bacteria), which properly normalizes the replication bias and other known biases in bacterial WGS data and accurately detects CNVs. Simulation and real data analysis show that CNV-BAC achieves the best performance in CNV detection compared with available algorithms. Availability and implementation CNV-BAC is available at https://github.com/XiDsLab/CNV-BAC. Supplementary information Supplementary data are available at Bioinformatics online.
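The replication bias described in this abstract can be illustrated with a simple detrending sketch. This is not the CNV-BAC normalization: it assumes that the origin position is known, that all bins have non-zero depth, and that log2 read depth falls off roughly linearly with circular distance from the origin, and it removes that trend with an ordinary least-squares fit. The function names are hypothetical.

```python
import numpy as np

def circular_distance(pos, origin, genome_len):
    """Shortest distance from each position to the origin on a circular genome."""
    d = np.abs(np.asarray(pos, dtype=float) - origin) % genome_len
    return np.minimum(d, genome_len - d)

def remove_replication_trend(pos, depth, origin, genome_len):
    """Return log2 depth residuals after removing a linear origin-distance trend.

    Assumes depth > 0 in every bin. Residuals near 0 indicate bins
    that follow the fitted replication gradient; sustained positive
    or negative residuals are candidate gains or losses.
    """
    dist = circular_distance(pos, origin, genome_len)
    log_depth = np.log2(np.asarray(depth, dtype=float))
    slope, intercept = np.polyfit(dist, log_depth, 1)   # degree-1 least-squares fit
    return log_depth - (slope * dist + intercept)
```

A plain least-squares fit is itself pulled by large CNVs, so a robust regression or an iterative fit that excludes outlying bins would be a natural refinement.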

