The Tangent copy-number inference pipeline for cancer genome analyses

Abstract Motivation: Somatic copy-number alterations (SCNAs) play an important role in cancer development. Systematic noise in sequencing and array data presents a significant challenge to the inference of SCNAs for cancer genome analyses. As part of The Cancer Genome Atlas (TCGA), the Broad Institute Genome Characterization Center developed the Tangent copy-number inference pipeline to generate copy-number profiles using single-nucleotide polymorphism (SNP) array and whole-exome sequencing (WES) data from over 10,000 pairs of tumors and matched normal samples. Here, we describe the Tangent pipeline, which begins with DNA sequencing data in the form of .bam files or raw SNP array probe-level intensity data, and ends with segmented copy-number calls to facilitate the identification of novel genes potentially targeted by SCNAs. We also describe a modification of Tangent, Pseudo-Tangent, which enables denoising through comparisons between tumor profiles when few normal samples are available. Results: Tangent normalization offers substantial signal-to-noise ratio (SNR) improvements compared to conventional normalization methods in both SNP array and WES analyses. The improvement in SNR is achieved primarily through noise reduction with minimal effect on signal. Pseudo-Tangent also reduces noise when few normal samples are available. Tangent and Pseudo-Tangent are broadly applicable and enable more accurate inference of SCNAs from DNA sequencing and array data. Availability and Implementation: Tangent is available at https://github.com/coyin/tangent and as a Docker image (https://hub.docker.com/r/coyin/tangent). Tangent is also the normalization method for the Copy Number pipeline in Genome Analysis Toolkit 4 (GATK4).

Integrative DNA copy number detection and genotyping from sequencing and array-based platforms

10.1101/172700 ◽ 2017 ◽ Cited By ~ 2

Author(s): Zilu Zhou ◽ Weixin Wang ◽ Li-San Wang ◽ Nancy Ruonan Zhang

Keyword(s): Copy Number ◽ Association Studies ◽ SNP Array ◽ Supplementary Information ◽ Detection Accuracy ◽ Sequencing Data ◽ Array Data ◽ Combining Data ◽ Allele Specific ◽ CNV Detection

Abstract Motivation: Copy number variations (CNVs) are gains and losses of DNA segments and have been associated with disease. Many large-scale genetic association studies are performing CNV analysis using whole exome sequencing (WES) and whole genome sequencing (WGS). In many of these studies, previous SNP-array data are available. An integrated cross-platform analysis is expected to improve resolution and accuracy, yet there is no tool for effectively combining data from sequencing and array platforms. The detection of CNVs using sequencing data alone can also be further improved by the utilization of allele-specific reads. Results: We propose a statistical framework, the integrated Copy Number Variation detection algorithm (iCNV), which can be applied to multiple study designs: WES only, WGS only, SNP array only, or any combination of SNP and sequencing data. iCNV applies platform-specific normalization, utilizes allele-specific reads from sequencing, and integrates matched NGS and SNP-array data using a Hidden Markov Model (HMM). We compare integrated two-platform CNV detection using iCNV to the naive intersection or union of platforms and show that iCNV increases sensitivity and robustness. We also assess the accuracy of iCNV on WGS data only, and show that the utilization of allele-specific reads improves CNV detection accuracy compared to existing methods. Availability: https://github.com/zhouzilu/ Supplementary information: Supplementary data are available at Bioinformatics online.

A systematic evaluation of copy number alterations detection methods on real SNP array and deep sequencing data

BMC Bioinformatics ◽ 10.1186/s12859-019-3266-7 ◽ 2019 ◽ Vol 20 (S25) ◽ Cited By ~ 1

Author(s): Fei Luo

Keyword(s): Copy Number ◽ Large Scale ◽ Detection Method ◽ SNP Array ◽ Detection Methods ◽ Copy Number Alterations ◽ Sequencing Data ◽ Array Data ◽ Matched Samples ◽ Single Tumor

Abstract Background: Copy number alterations (CNAs) are tightly associated with cancer, so detecting them accurately is one of the most important tasks in cancer genomics. A series of CNA detection methods have been proposed, and new ones are still being developed. Because of the complexity of CNAs in cancer, no CNA detection method has been accepted as the gold-standard caller. Several evaluation studies have attempted to characterize the performance of typical CNA detection methods. Limited by the scale of their evaluation data, these comparisons have not reached a consensus, and researchers remain unsure how to choose a suitable CNA caller for their analysis. A more comprehensive evaluation of typical CNA detection methods is therefore needed. Results: In this work, we use a large-scale real dataset from the CAGEKID consortium to evaluate a total of 12 typical CNA detection methods. These methods are the most widely used in cancer research and are consistently used as benchmarks for newly proposed CNA detection methods. The dataset comprises SNP array data on 94 samples and whole genome sequencing data on 10 samples. The evaluation covers the current scenarios of CNA detection: detection on SNP array data, on sequencing data with matched tumor and normal samples, and on sequencing data with a single tumor sample. First, the three SNP-based methods are ranked. The results of the best SNP-based method are then used as the benchmark to compare six matched-sample methods and three single-tumor-sample methods in terms of preprocessing, recall rate, Jaccard index and segmentation characteristics. Conclusions: Our survey reveals the strengths and weaknesses of the 12 typical methods. We explain, from a methodological standpoint, why the methods show specific characteristics. Finally, we present guiding principles for choosing a suitable CNA detection method under specific conditions. Some unsolved problems and expectations for upcoming CNA detection methods are also addressed.
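To make the recall rate and Jaccard index mentioned in this abstract concrete, the following is a minimal sketch of how a caller's segments might be scored against a benchmark on a single chromosome. It assumes a base-pair-level definition of both metrics and half-open `(start, end)` intervals; the abstract does not state which definition the study uses, and the function names here are hypothetical.

```python
def merge(intervals):
    """Merge overlapping half-open (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals):
    """Number of bases covered by a disjoint interval list."""
    return sum(end - start for start, end in intervals)


def overlap_length(a, b):
    """Number of bases shared by two sorted, disjoint interval lists."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def jaccard_and_recall(called, benchmark):
    """Base-pair Jaccard index and recall of called CNA segments (assumed definitions)."""
    called, benchmark = merge(called), merge(benchmark)
    shared = overlap_length(called, benchmark)
    union = total_length(called) + total_length(benchmark) - shared
    jaccard = shared / union if union else 0.0
    recall = shared / total_length(benchmark) if benchmark else 0.0
    return jaccard, recall


# Hypothetical segments on one chromosome, for illustration only.
called = [(100, 200), (300, 400)]
benchmark = [(150, 250), (300, 350)]
jaccard, recall = jaccard_and_recall(called, benchmark)
```

In this toy input the two call sets share 100 bases out of a 250-base union, so the Jaccard index is 0.4, while recall is 100 of the 150 benchmark bases. Jaccard penalizes over-calling as well as missed regions, whereas recall only measures how much of the benchmark is recovered.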

CaSNP: a database for interrogating copy number alterations of cancer genome from SNP array data

Nucleic Acids Research ◽ 10.1093/nar/gkq997 ◽ 2010 ◽ Vol 39 (suppl_1) ◽ pp. D968-D974 ◽ Cited By ~ 15

Author(s): Qingyi Cao ◽ Meng Zhou ◽ Xujun Wang ◽ Cliff A. Meyer ◽ Yong Zhang ◽ ...

Keyword(s): Copy Number ◽ SNP Array ◽ Cancer Genome ◽ Copy Number Alterations ◽ Array Data

Identification of an extracellular vesicle-related gene signature in the prediction of pancreatic cancer clinical prognosis

Bioscience Reports ◽ 10.1042/bsr20201087 ◽ 2020 ◽ Vol 40 (12)

Author(s): Dafeng Xu ◽ Yu Wang ◽ Kailun Zhou ◽ Jincai Wu ◽ Zhensheng Zhang ◽ ...

Keyword(s): Gene Expression ◽ Pancreatic Cancer ◽ Immune Cells ◽ Prognostic Value ◽ Tumor Tissue ◽ Clinical Stage ◽ Cancer Genome ◽ The Cancer Genome Atlas ◽ Sequencing Data ◽ Clinical Prognosis

Abstract Although extracellular vesicles (EVs) in body fluid have been considered to be ideal biomarkers for cancer diagnosis and prognosis, it is still difficult to distinguish EVs derived from tumor tissue from those derived from normal tissue. Therefore, the prognostic value of tumor-specific EVs was evaluated through related molecules in pancreatic tumor tissue. RNA sequencing data of pancreatic adenocarcinoma (PAAD) were acquired from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC). EV-related genes in pancreatic cancer were obtained from exoRBase. Protein–protein interaction (PPI) network analysis was used to identify modules related to clinical stage. CIBERSORT was used to assess the abundance of immune and non-immune cells in the tumor microenvironment. A total of 12 PPI modules were identified, and the 3-PPI-MOD was identified based on the randomForest package. The genes of this model are involved in DNA damage and repair and cell membrane-related pathways. The independent external verification cohorts showed that the 3-PPI-MOD can significantly classify patient prognosis. Moreover, compared with the model constructed by pure gene expression, the 3-PPI-MOD showed better prognostic value. The expression of genes in the 3-PPI-MOD had a significant positive correlation with immune cells. Genes related to the hypoxia pathway were significantly enriched in the high-risk tumors predicted by the 3-PPI-MOD. External databases were used to verify the gene expression in the 3-PPI-MOD. The 3-PPI-MOD had satisfactory predictive performance and could be used as a prognostic predictive biomarker for pancreatic cancer.

Assessing the performance of methods for copy number aberration detection from single-cell DNA sequencing data

PLoS Computational Biology ◽ 10.1371/journal.pcbi.1008012 ◽ 2020 ◽ Vol 16 (7) ◽ pp. e1008012 ◽ Cited By ~ 2

Author(s): Xian F. Mallory ◽ Mohammadamin Edrisi ◽ Nicholas Navin ◽ Luay Nakhleh

Keyword(s): DNA Sequencing ◽ Single Cell ◽ Copy Number ◽ Copy Number Aberration ◽ Sequencing Data ◽ Aberration Detection

Inferring clonal heterogeneity in cancer using SNP arrays and whole genome sequencing

Bioinformatics ◽ 10.1093/bioinformatics/btz057 ◽ 2019 ◽ Vol 35 (17) ◽ pp. 2924-2931

Author(s): Mark R Zucker ◽ Lynne V Abruzzo ◽ Carmen D Herling ◽ Lynn L Barron ◽ Michael J Keating ◽ ...

Keyword(s): Clinical Outcome ◽ SNP Array ◽ Treatment Strategies ◽ Lymphocytic Leukemia ◽ Response To Treatment ◽ Supplementary Information ◽ Sequencing Data ◽ Clonal Heterogeneity ◽ Array Data ◽ Copy Numbers

Abstract Motivation: Clonal heterogeneity is common in many types of cancer, including chronic lymphocytic leukemia (CLL). Previous research suggests that the presence of multiple distinct cancer clones is associated with clinical outcome. Detection of clonal heterogeneity from high-throughput data, such as sequencing or single nucleotide polymorphism (SNP) array data, is important for gaining a better understanding of cancer and may improve prediction of clinical outcome or response to treatment. Here, we present a new method, CloneSeeker, for inferring clonal heterogeneity from sequencing data, SNP array data, or both. Results: We generated simulated SNP array and sequencing data and applied CloneSeeker along with two other methods. We demonstrate that CloneSeeker is more accurate than existing algorithms at determining the number of clones, the distribution of cancer cells among clones, and the mutation and/or copy numbers belonging to each clone. Next, we applied CloneSeeker to SNP array data from samples of 258 previously untreated CLL patients to gain a better understanding of the characteristics of CLL tumors and to elucidate the relationship between clonal heterogeneity and clinical outcome. We found that a significant majority of CLL patients appear to have multiple clones distinguished by copy number alterations alone. We also found that the presence of multiple clones corresponded with significantly worse survival among CLL patients. These findings may prove useful for improving the accuracy of prognosis and the design of treatment strategies. Availability and implementation: Code available on R-Forge: https://r-forge.r-project.org/projects/CloneSeeker/ Supplementary information: Supplementary data are available at Bioinformatics online.

PERHAPS: Paired-End short Reads-based HAPlotyping from next-generation Sequencing data

Briefings in Bioinformatics ◽ 10.1093/bib/bbaa320 ◽ 2020

Author(s): Jie Huang ◽ Stefano Pallotti ◽ Qianling Zhou ◽ Marcus Kleber ◽ Xiaomeng Xin ◽ ...

Keyword(s): Next Generation Sequencing ◽ SNP Array ◽ Simple Approach ◽ Next Generation Sequencing Data ◽ Next Generation ◽ Sequencing Data ◽ Short Read ◽ Array Data ◽ Short Reads ◽ Generation Sequencing

Abstract The identification of rare haplotypes may greatly expand our knowledge of the genetic architecture of both complex and monogenic traits. To this aim, we developed PERHAPS (Paired-End short Reads-based HAPlotyping from next-generation Sequencing data), a new and simple approach to directly call haplotypes from short-read, paired-end Next Generation Sequencing (NGS) data. To benchmark this method, we considered the APOE classic polymorphism (*1/*2/*3/*4), since it represents one of the best examples of functional polymorphism arising from the haplotype combination of two Single Nucleotide Polymorphisms (SNPs). We leveraged the large Whole Exome Sequencing (WES) and SNP-array data obtained from the multi-ethnic UK BioBank (UKBB, N=48,855). By applying PERHAPS, which pieces together the paired-end reads according to their FASTQ labels, we extracted the haplotype data, along with their frequencies and the individual diplotypes. Concordance rates between diplotypes called directly from WES and those generated through statistical pre-phasing and imputation of SNP-array data are extremely high (>99%), whether the sample is stratified by SNP-array genotyping batch or by self-reported ethnic group. Hardy-Weinberg Equilibrium tests and the comparison of the obtained haplotype frequencies with those available from the 1000 Genomes Project further supported the reliability of PERHAPS. Notably, we were able to determine the existence of the rare APOE*1 haplotype in two unrelated African subjects from UKBB, supporting its presence at appreciable frequency (approximately 0.5%) in the African Yoruba population. Despite some acknowledged technical shortcomings, PERHAPS represents a novel and simple approach that will partly overcome the limitations in direct haplotype calling from short read-based sequencing.
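As an illustration of the Hardy-Weinberg Equilibrium check mentioned in this abstract, here is a minimal sketch of a chi-square test on genotype counts. It assumes a simplified biallelic locus with one degree of freedom; the abstract does not say which HWE test the authors used, and the APOE system has more than two haplotypes, so this is not their procedure. The function name and the counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions at a biallelic locus.

    Takes observed counts of the AA, AB and BB genotypes and returns
    (chi2, p_value) for one degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic locus) to avoid dividing by zero.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Chi-square survival function for 1 d.o.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts, for illustration only.
chi2, p_value = hwe_chi_square(25, 50, 25)
```

Counts that match the expected proportions exactly, as in this toy call, give a statistic of zero and a p-value of one; a large statistic with a small p-value would indicate departure from equilibrium, which in a genotyping context can point to calling errors.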

Toolbox for Mobile-Element Insertion Detection on Cancer Genomes

Cancer Informatics ◽ 10.4137/cin.s24657 ◽ 2015 ◽ Vol 14s1 ◽ pp. CIN.S24657

Author(s): Wan-Ping Lee ◽ Jiantao Wu ◽ Gabor T. Marth

Keyword(s): Human Genome ◽ Mobile Element ◽ Mobile Elements ◽ Cancer Genome ◽ The Cancer Genome Atlas ◽ Next Generation Sequencing Data ◽ Sequencing Data ◽ Element Insertion ◽ Human Genome Evolution ◽ Mobile Element Insertion

Mobile elements constitute more than 45% of the human genome as a result of repeated insertion events during human genome evolution. Although most mobile elements are fixed within the human population, some elements (including ALU, long interspersed element 1 (LINE-1, L1), and SVA) are still actively duplicating and may result in life-threatening human diseases such as cancer, motivating the need for accurate mobile-element insertion (MEI) detection tools. We developed a software package, TANGRAM, for MEI detection in next-generation sequencing data, currently serving as the primary MEI detection tool in the 1000 Genomes Project. TANGRAM takes advantage of valuable mapping information provided by our own MOSAIK mapper, and until recently required MOSAIK mappings as its input. In this study, we report a new feature that enables TANGRAM to be used on alignments generated by any mainstream short-read mapper, making it accessible to many genomics users. To demonstrate its utility for cancer genome analysis, we have applied TANGRAM to the TCGA (The Cancer Genome Atlas) mutation calling benchmark 4 dataset. TANGRAM is fast, accurate, easy to use, and open source at https://github.com/jiantao/Tangram.
