Pan-Cancer Analysis Reveals Technical Artifacts in TCGA Germline Variant Calls

2016 ◽  
Author(s):  
Alexandra R. Buckley ◽  
Kristopher A. Standish ◽  
Kunal Bhutani ◽  
Trey Ideker ◽  
Hannah Carter ◽  
...  

Abstract: The degree to which germline variation drives cancer development and shapes tumor phenotypes remains largely unexplored, possibly due to a lack of large-scale, publicly available germline data for a cancer cohort. Here we called germline variants on 9,618 cases from The Cancer Genome Atlas (TCGA) database representing 31 cancer types. We identified batch effects affecting loss of function (LOF) variant calls that can be traced back to differences in the way the sequence data were generated both within and across cancer types. Overall, LOF indel calls were more sensitive to technical artifacts than LOF Single Nucleotide Variant (SNV) calls. In particular, whole genome amplification of DNA prior to sequencing led to an artificially increased burden of LOF indel calls, which confounded association analyses relating germline variants to tumor type despite stringent indel filtering strategies. Due to the inherent noise we chose to remove all 614 amplified DNA samples, including all acute myeloid leukemia and virtually all ovarian cancer samples, from the final dataset. This study demonstrates how insufficient quality control can lead to false positive germline-tumor type associations and draws attention to the need to be sensitive to problems associated with a lack of uniformity in data generation in TCGA data.
Author Summary: Cancer research to date has largely focused on genetic aberrations specific to tumor tissue. In contrast, the degree to which germline, or inherited, variation contributes to tumorigenesis remains unclear, possibly due to a lack of accessible germline variant data. In this study we identify germline variants in 9,618 samples using raw germline exome data from The Cancer Genome Atlas (TCGA). There are substantial differences in the way exome sequence data were generated both across and within cancer types in TCGA. We observe that differences in sequence data generation introduced batch effects, or variation that is due to technical factors rather than true biological variation, in our variant data. Most notably, we observe that amplification of DNA prior to sequencing resulted in an excess of predicted damaging indel variants. We show how these batch effects can confound germline association analyses if not properly addressed. Our study highlights the difficulties of working with large public genomic datasets like TCGA, where samples are collected over time and across data centers, and particularly cautions against the use of amplified DNA samples for genetic association analyses.
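The kind of burden difference described in this abstract (more LOF indel calls in amplified samples than in unamplified ones) can be checked with a simple permutation test. The sketch below is illustrative only and is not the authors' method: the function name `permutation_pvalue`, the parameter choices, and the per-sample counts are all hypothetical, and the counts are made-up toy values, not TCGA data.

```python
import random
from statistics import mean


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """One-sided permutation test for mean(group_a) > mean(group_b).

    Hypothetical helper for illustration; returns the observed
    difference in means and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy, made-up per-sample LOF indel counts (assumption, not real data).
amplified = [14, 17, 15, 19, 16, 18]
unamplified = [9, 11, 10, 8, 12, 10]

observed_diff, p_value = permutation_pvalue(amplified, unamplified)
print(f"difference in mean burden: {observed_diff}, p ~ {p_value:.4f}")
```

A small p-value here would only indicate that burden differs by amplification status; if amplification status also tracks tumor type, as the abstract reports for acute myeloid leukemia and ovarian cancer, the same difference could surface as a spurious germline-tumor type association.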

mSystems ◽  
2018 ◽  
Vol 3 (5) ◽  
Author(s):  
Sara R. Selitsky ◽  
David Marron ◽  
Lisle E. Mose ◽  
Joel S. Parker ◽  
Dirk P. Dittmer

ABSTRACT: Epstein-Barr virus (EBV) is convincingly associated with gastric cancer, nasopharyngeal carcinoma, and certain lymphomas, but its role in other cancer types remains controversial. To test the hypothesis that there are additional cancer types with high prevalence of EBV, we determined EBV viral expression in all the Cancer Genome Atlas Project (TCGA) mRNA sequencing (mRNA-seq) samples (n = 10,396) from 32 different tumor types. We found that EBV was present in gastric adenocarcinoma and lymphoma, as expected, and was also present in >5% of samples in 10 additional tumor types. For most samples, EBV transcript levels were low, which suggests that EBV was likely present due to infected infiltrating B cells. In order to determine if there was a difference in the B-cell populations, we assembled B-cell receptors for each sample and found B-cell receptor abundance (P ≤ 1.4 × 10^−20) and diversity (P ≤ 8.3 × 10^−27) were significantly higher in EBV-positive samples. Moreover, diversity was independent of B-cell abundance, suggesting that the presence of EBV was associated with an increased and altered B-cell population.
IMPORTANCE: Around 20% of human cancers are associated with viruses. Epstein-Barr virus (EBV) contributes to gastric cancer, nasopharyngeal carcinoma, and certain lymphomas, but its role in other cancer types remains controversial. We assessed the prevalence of EBV in RNA-seq from 32 tumor types in the Cancer Genome Atlas Project (TCGA) and found EBV to be present in >5% of samples in 12 tumor types. EBV infects epithelial cells and B cells and in B cells causes proliferation. We hypothesized that the low expression of EBV in most of the tumor types was due to infiltration of B cells into the tumor. The increase in B-cell abundance and diversity in subjects where EBV was detected in the tumors strengthens this hypothesis. Overall, we found that EBV was associated with an increased and altered immune response. This result is not evidence of causality, but it suggests a potential novel biomarker for tumor immune status.


2014 ◽  
Vol 13s2 ◽  
pp. CIN.S13776
Author(s):  
Yanxun Xu ◽  
Yitan Zhu ◽  
Peter Müller ◽  
Riten Mitra ◽  
Yuan Ji

The Cancer Genome Atlas (TCGA) generates comprehensive genomic data for thousands of patients over more than 20 cancer types. TCGA data are typically whole-genome measurements of multiple genomic features, such as DNA copy numbers, DNA methylation, and gene expression, providing unique opportunities for investigating cancer mechanisms from multiple molecular and regulatory layers. We propose a Bayesian graphical model to systematically integrate multi-platform TCGA data for inference of the interactions between different genomic features either within a gene or between multiple genes. The presence or absence of edges in the graph indicates the presence or absence of conditional dependence between genomic features. The inference is restricted to genes within a known biological network, but can be extended to any set of genes. Applying the model to the same genes using patient samples in two different cancer types, we identify network components that are common as well as different between cancer types. The examples and codes are available at https://www.ma.utexas.edu/users/yxu/software.html.


2018 ◽  
Author(s):  
Roni Rasnic ◽  
Nadav Brandes ◽  
Or Zuk ◽  
Michal Linial

ABSTRACT
Background: In recent years, research on cancer predisposition germline variants has emerged as a prominent field. The identity of somatic mutations is based on a reliable mapping of the patient germline variants. In addition, the statistics of germline variant frequencies in healthy individuals and cancer patients are the basis for seeking candidates for cancer predisposition genes. The Cancer Genome Atlas (TCGA) is one of the main sources of such data, providing a diverse collection of molecular data including deep sequencing for more than 30 types of cancer from >10,000 patients.
Methods: Our hypothesis in this study is that whole exome sequences from healthy blood samples of cancer patients are not expected to show systematic differences among cancer types. To test this hypothesis, we analyzed common and rare germline variants across six cancer types, covering 2,241 samples from TCGA. In our analysis we accounted for inherent variables in the data including the different variant calling protocols, sequencing platforms, and ethnicity.
Results: We report on substantial batch effects in germline variants associated with cancer types. We attribute the effect to the specific sequencing centers that produced the data. Specifically, we measured 30% variability in the number of reported germline variants per sample across sequencing centers. The batch effect is further expressed in nucleotide composition and variant frequencies. Importantly, the batch effect causes substantial differences in germline variant distribution patterns across numerous genes, including prominent cancer predisposition genes such as BRCA1, RET, MAX, and KRAS. For most known cancer predisposition genes, we found a distinct batch-dependent difference in germline variants.
Conclusion: TCGA germline data are subject to strong batch effects, with substantial variability among TCGA sequencing centers. We claim that those batch effects are consequential for numerous TCGA pan-cancer studies. In particular, these effects may compromise the reliability of such studies and their power to detect new cancer predisposition genes. Furthermore, interpretation of pan-cancer analyses should be revisited in view of the source of the genomic data after accounting for the reported batch effects.


2020 ◽  
Author(s):  
Juliet Luft ◽  
Robert S. Young ◽  
Alison M. Meynert ◽  
Martin S. Taylor

Abstract
Background: The loss of genetic diversity in segments over a genome (loss-of-heterozygosity, LOH) is a common occurrence in many types of cancer. By analysing patterns of preferential allelic retention during LOH in approximately 10,000 cancer samples from The Cancer Genome Atlas (TCGA), we sought to systematically identify genetic polymorphisms currently segregating in the human population that are preferentially selected for or against during cancer development.
Results: Experimental batch effects and cross-sample contamination were found to be substantial confounders in this widely used and well studied dataset. To mitigate these we developed a generally applicable classifier (GenomeArtiFinder) to quantify contamination and other abnormalities. We provide these results as a resource to aid further analysis of TCGA whole exome sequencing data. In total, 1,678 pairs of samples (14.7%) were found to be contaminated or affected by systematic experimental error. After filtering, our analysis of LOH revealed an overall trend for biased retention of cancer-associated risk alleles previously identified by genome wide association studies. Analysis of predicted damaging germline variants identified highly significant oncogenic selection for recessive tumour suppressor alleles. These are enriched for biological pathways involved in genome maintenance and stability.
Conclusions: Our results identified predicted damaging germline variants in genes responsible for the repair of DNA strand breaks and homologous repair as the most common targets of allele biased LOH. This suggests a ratchet-like process where heterozygous germline mutations in these genes reduce the efficacy of DNA double-strand break repair, increasing the likelihood of a second hit at the locus removing the wild-type allele and triggering an oncogenic mutator phenotype.


2017 ◽  
pp. 1-13 ◽  
Author(s):  
Anshuman Panda ◽  
Anil Betigeri ◽  
Kalyanasundaram Subramanian ◽  
Jeffrey S. Ross ◽  
Dean C. Pavlick ◽  
...  

Purpose An association between mutational burden and response to immune checkpoint therapy has been documented in several cancer types. The potential for such a mutational burden threshold to predict response to immune checkpoint therapy was evaluated in several clinical datasets, where mutational burden was measured either by whole-exome sequencing or by using commercially available sequencing panels. Methods Whole-exome sequencing and RNA sequencing data of 33 solid cancer types from The Cancer Genome Atlas were analyzed to determine whether a robust immune checkpoint–activating mutation (iCAM) burden threshold associated with evidence of immune checkpoint activation exists in these cancers that may serve as a biomarker of response to immune checkpoint blockade therapy. Results We found that a robust iCAM threshold, associated with signatures of immune checkpoint activation, exists in eight of 33 solid cancers: melanoma, lung adenocarcinoma, colon adenocarcinoma, endometrial cancer, stomach adenocarcinoma, cervical cancer, estrogen receptor–positive/human epidermal growth factor receptor 2–negative breast cancer, and bladder-urothelial cancer. Tumors with a mutational burden higher than the threshold (iCAM positive) also had clear histologic evidence of lymphocytic infiltration. In published datasets of melanoma, lung adenocarcinoma, and colon cancer, patients with iCAM-positive tumors had significantly better response to immune checkpoint therapy compared with those with iCAM-negative tumors. Receiver operating characteristic analysis using The Cancer Genome Atlas predictions as the gold standard showed that iCAM-positive tumors are accurately identifiable using clinical sequencing assays, such as FoundationOne (Foundation Medicine, Cambridge, MA) or StrandAdvantage (Strand Life Sciences, Bangalore, India). 
Using the FoundationOne-derived threshold, an analysis of 113 melanoma tumors showed that patients with iCAM-positive disease have significantly better response to immune checkpoint therapy. iCAM-positive and iCAM-negative tumors have distinct mutation patterns and different immune microenvironments. Conclusion In eight solid cancers, a mutational burden threshold exists that may predict response to immune checkpoint blockade. This threshold is identifiable using available clinical sequencing assays.


2018 ◽  
Author(s):  
Jake Lever ◽  
Eric Y. Zhao ◽  
Jasleen Grewal ◽  
Martin R. Jones ◽  
Steven J. M. Jones

Abstract: Understanding a mutation in cancer requires knowledge of the different roles that genes play in cancer as drivers, oncogenes and tumor suppressors. We present CancerMine, a high-quality text-mined knowledgebase that catalogues over 856 genes as drivers, 2,421 as oncogenes and 2,037 as tumor suppressors in 426 cancer types. We compile 3,485 genes that are not in the IntOGen resource of drivers and complement the Cancer Gene Census with 3,136 new genes identified as oncogenes and tumor suppressors. CancerMine provides a method for gene-centric clustering of cancer types illustrating genetic similarities between cancer types of different organs and was validated against data from the Cancer Genome Atlas (TCGA) project. Finally, with 178 novel cancer gene mentions in publications each month, this resource will be updated monthly, pre-empting the need to manually curate the ever-increasing number of novel cancer associated genes. CancerMine is viewable through a web portal (http://bionlp.bcgsc.ca/cancermine/) and available for download (https://github.com/jakelever/cancermine).


2017 ◽  
Author(s):  
Xin Hu ◽  
Qianghu Wang ◽  
Floris Barthel ◽  
Ming Tang ◽  
Samirkumar Amin ◽  
...  

Fusion genes, particularly those involving kinases, have been demonstrated as drivers and are frequent therapeutic targets in cancer [1]. Here, we describe our results on detecting transcript fusions across 33 cancer types from The Cancer Genome Atlas (TCGA), totaling 9,966 cancer samples and 648 normal samples [2]. Preprocessing, including read alignment to both genome and transcriptome, and fusion detection were carried out using a uniform pipeline [3]. To validate the resultant fusions, we also called somatic structural variations for 561 cancers from whole genome sequencing data. A summary of the data used in this study is provided in Table S1. Our results can be accessed through our portal at http://www.tumorfusions.org.


2021 ◽  
Vol 5 (1) ◽  
Author(s):  
Sock Hoai Chan ◽  
Ying Ni ◽  
Shao-Tzu Li ◽  
Jing Xian Teo ◽  
Nur Diana Binte Ishak ◽  
...  

Abstract Background Fanconi anemia (FA) is a rare genetic disorder associated with hematological disorders and solid tumor predisposition. Owing to phenotypic heterogeneity, some patients remain undetected until adulthood, usually following cancer diagnoses. The uneven prevalence of FA cases with different underlying FA gene mutations worldwide suggests variable genetic distribution across populations. Here, we aim to assess the genetic spectrum of FA-associated genes across populations of varying ancestries and explore potential genotype–phenotype associations in cancer. Methods Carrier frequency and variant spectrum of potentially pathogenic germline variants in 17 FA genes (excluding BRCA1/FANCS, BRCA2/FANCD1, BRIP1/FANCJ, PALB2/FANCN, RAD51C/FANCO) were evaluated in 3523 Singaporeans and 7 populations encompassing Asian, European, African, and admixed ancestries from the Genome Aggregation Database. Germline and somatic variants of 17 FA genes in 7 cancer cohorts from The Cancer Genome Atlas were assessed to explore genotype–phenotype associations. Results Germline variants in FANCA were consistently more frequent in all populations. Similar trends in carrier frequency and variant spectrum were detected in Singaporeans and East Asians, both distinct from other ancestry groups, particularly in the lack of recurrent variants. Our exploration of The Cancer Genome Atlas dataset suggested higher germline and somatic mutation burden between FANCA and FANCC with head and neck and lung squamous cell carcinomas as well as FANCI and SLX4/FANCP with uterine cancer, but the analysis was insufficiently powered to detect any statistical significance. Conclusion Our findings highlight the diverse genetic spectrum of FA-associated genes across populations of varying ancestries, emphasizing the need to include all known FA-related genes for accurate molecular diagnosis of FA.


2021 ◽  
Vol 11 ◽  
Author(s):  
Yi-Hong Liu ◽  
Yu-Lian Chen ◽  
Ting-Yu Lai ◽  
Ying-Chieh Ko ◽  
Yu-Fu Chou ◽  
...  

Background: Partial epithelial-mesenchymal transition (p-EMT) is a distinct clinicopathological feature prevalent in oral cavity tumors of The Cancer Genome Atlas. Located at the invasion front, p-EMT cells require additional support from the tumor stroma for collective cell migration, including track clearing, extracellular matrix remodeling and immune evasion. The pathological roles of otherwise nonmalignant cancer-associated fibroblasts (CAFs) in cancer progression are emerging.
Methods: Gene set enrichment analysis was used to reveal differentially enriched genes and molecular pathways in OC3 and TW2.6 xenograft tissues, representing mesenchymal and p-EMT tumors, respectively. R packages of genomic data science were executed for statistical evaluations and data visualization. Immunohistochemistry and Alcian blue staining were conducted to validate the bioinformatic results. Univariate and multivariate Cox proportional hazards models were fitted to identify covariates significantly associated with overall survival in clinical datasets. Kaplan–Meier curves of estimated overall survival were compared for statistical difference using the log-rank test.
Results: Compared to mesenchymal OC3 cells, tumor stroma derived from p-EMT TW2.6 cells was significantly enriched in microvessel density, tumor-excluded macrophages, inflammatory CAFs, and extracellular hyaluronan deposition. By translating these results to clinical transcriptomic datasets of oral cancer specimens, including the Puram single-cell RNA-seq cohort comprising ~6000 cells, we identified the expression of stromal TGFBI and HYAL1 as independent poor and protective biomarkers, respectively, for 40 Taiwanese oral cancer tissues that were all derived from betel quid users. In The Cancer Genome Atlas, TGFBI was a poor marker not only for head and neck cancer but also for six additional cancer types, and HYAL1 was a good indicator for four tumor cohorts, suggesting common stromal effects existing in different cancer types.
Conclusions: As the tumor stroma coevolves with cancer progression, the cellular origins of molecular markers identified from conventional whole tissue mRNA-based analyses should be cautiously interpreted. By incorporating disease-matched xenograft tissue and single-cell RNA-seq results, we suggest that TGFBI and HYAL1, primarily expressed by stromal CAFs and endothelial cells, respectively, could serve as robust prognostic biomarkers for oral cancer control.


2021 ◽  
Vol 11 ◽  
Author(s):  
Luuk Harbers ◽  
Federico Agostini ◽  
Marcin Nicos ◽  
Dimitri Poddighe ◽  
Magda Bienko ◽  
...  

Somatic copy number alterations (SCNAs) are a pervasive trait of human cancers that contributes to tumorigenesis by affecting the dosage of multiple genes at the same time. In the past decade, The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) initiatives have generated and made publicly available SCNA genomic profiles from thousands of tumor samples across multiple cancer types. Here, we present a comprehensive analysis of 853,218 SCNAs across 10,729 tumor samples belonging to 32 cancer types using TCGA data. We then discuss current models for how SCNAs likely arise during carcinogenesis and how genomic SCNA profiles can inform clinical practice. Lastly, we highlight open questions in the field of cancer-associated SCNAs.

