Uniform Genomic Data Analysis in the NCI Genomic Data Commons

Uniform genomic data analysis in the NCI Genomic Data Commons

Nature Communications; DOI: 10.1038/s41467-021-21254-9; 2021; Vol 12 (1); Cited By ~ 2

Author(s): Zhenyu Zhang, Kyle Hernandez, Jeremiah Savage, Shenglai Li, Dan Miller, ...

Keyword(s): Methylation Status, Genomic Data, Data Repository, Sequencing Data, Data Types, Data Production, Number Variation, Data Portal, Data Commons, Human Genome Reference

Abstract. The goal of the National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is to provide the cancer research community with a data repository of uniformly processed genomic and associated clinical data that enables data sharing and collaborative analysis in support of precision medicine. The initial GDC dataset includes genomic, epigenomic, proteomic, clinical, and other data from the NCI TCGA and TARGET programs. Data production for the GDC started in June 2015 using an OpenStack-based private cloud. By June 2016, the GDC had analyzed more than 50,000 raw sequencing data inputs, as well as multiple other data types. Using the latest human genome reference build, GRCh38, the GDC generated a variety of data types, from aligned reads to somatic mutations, gene expression, miRNA expression, DNA methylation status, and copy number variation. In this paper, we describe the pipelines and workflows used to process and harmonize the data in the GDC. The generated data, as well as the original input files from TCGA and TARGET, are available for download and exploratory analysis at the GDC Data Portal and Legacy Archive (https://gdc.cancer.gov/).

Analyses of cancer data in the Genomic Data Commons Data Portal with new functionalities in the TCGAbiolinks R/Bioconductor package

DOI: 10.1101/350439; 2018

Author(s): Mohamed Mounir, Tiago C. Silva, Marta Lucchetta, Catharina Olsen, Gianluca Bontempi, ...

Keyword(s): Differential Expression, Differential Expression Analysis, Genomic Data, Tissue Expression, The Cancer Genome Atlas, Bioconductor Package, Cancer Data, Tumor Purity, Data Portal, Data Commons

Abstract. The advent of Next Generation Sequencing (NGS) technologies has opened new perspectives in deciphering the genetic mechanisms underlying complex diseases. Nowadays, the amount of genomic data is massive, and substantial efforts and new tools are required to unveil the information hidden in the data. The Genomic Data Commons (GDC) Data Portal is a large data collection platform that includes different genomic studies, including those from The Cancer Genome Atlas (TCGA) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiatives, accounting for more than 40 tumor types originating from nearly 30,000 patients. Such platforms, although very attractive, must make sure the stored data are easily accessible and adequately harmonized. Moreover, they focus primarily on storing the data in a single place, and they do not provide a comprehensive toolkit for analysis and interpretation of the data. To fulfill this urgent need, comprehensive but easily accessible computational methods for integrative analyses of genomic data, built on a robust statistical and theoretical framework, are needed. In this context, the R/Bioconductor package TCGAbiolinks was developed, offering a variety of bioinformatics functionalities. Here we introduce new features and enhancements of TCGAbiolinks in terms of i) more accurate and flexible pipelines for differential expression analyses, ii) different methods for tumor purity estimation and filtering, iii) integration of normal samples from the Genotype-Tissue Expression (GTEx) platform, and iv) support for other genomics datasets, here exemplified by the TARGET data. Evidence has shown that accounting for tumor purity is essential in the study of tumorigenesis, as purity can confound differential expression analysis. Hence, we implemented these filtering procedures in TCGAbiolinks. Moreover, a limitation of some of the TCGA datasets is the unavailability or paucity of corresponding normal samples. We thus integrated into TCGAbiolinks the possibility to use normal samples from the Genotype-Tissue Expression (GTEx) project, which is another large-scale repository cataloging gene expression from healthy individuals. The new functionalities are available in TCGAbiolinks v2.8 and higher, released in Bioconductor version 3.7.

TCGAbiolinksGUI: A graphical user interface to analyze cancer molecular and clinical data

F1000Research; DOI: 10.12688/f1000research.14197.1; 2018; Vol 7; pp. 439; Cited By ~ 6

Author(s): Tiago Chedraoui Silva, Antonio Colaprico, Catharina Olsen, Tathiane M Malta, Gianluca Bontempi, ...

Keyword(s): Data Analysis, User Interface, Graphical User Interface, Cancer Genomics, Genomic Data, Bioconductor Project, Video Tutorials, Data Portal, Advanced Knowledge, Data Commons

The GDC (Genomic Data Commons) data portal provides users with data from cancer genomics studies. Recently, we developed the R/Bioconductor TCGAbiolinks package, which allows users to search, download and prepare cancer genomics data for integrative data analysis. Using this package requires advanced knowledge of R, which limits the number of users. To overcome this obstacle and make the package accessible to a wider range of users, we developed a graphical user interface (GUI) using Shiny, available through the package TCGAbiolinksGUI. The TCGAbiolinksGUI package is freely available within the Bioconductor project at http://bioconductor.org/packages/TCGAbiolinksGUI/. Links to the GitHub repository, a demo version of the tool, a docker image and PDF/video tutorials are available from the TCGAbiolinksGUI site.

Author Correction: The NCI Genomic Data Commons

Nature Genetics; DOI: 10.1038/s41588-021-00883-2; 2021

Author(s): Allison P. Heath, Vincent Ferretti, Stuti Agrawal, Maksim An, James C. Angelakos, ...

Keyword(s): Genomic Data, Data Commons

A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data

International Journal of Genomics; DOI: 10.1155/2016/7983236; 2016; Vol 2016; pp. 1-16; Cited By ~ 16

Author(s): Jennifer D. Hintzsche, William A. Robinson, Aik Choon Tan

Keyword(s): Exome Sequencing, Whole Exome Sequencing, Sequencing Data, Disease Treatment, Computational Tools, Whole Exome, Data Production, Whole Exome Sequencing Data, Computationally Intensive, Generation Technology

Whole Exome Sequencing (WES) is the application of next-generation technology to determine the variations in the exome and is becoming a standard approach in studying genetic variants in diseases. Understanding the exomes of individuals at single-base resolution allows the identification of actionable mutations for disease treatment and management. WES technologies have shifted the bottleneck from experimental data production to computationally intensive informatics-based data analysis. Novel computational tools and methods have been developed to analyze and interpret WES data. Here, we review some of the current tools that are being used to analyze WES data. These tools range from the alignment of raw sequencing reads all the way to linking variants to actionable therapeutics. Strengths and weaknesses of each tool are discussed to help researchers make more informed decisions when selecting the best tools to analyze their WES data.

KNNCNV: A K-Nearest Neighbor Based Method for Detection of Copy Number Variations Using NGS Data

Frontiers in Cell and Developmental Biology; DOI: 10.3389/fcell.2021.796249; 2021; Vol 9

Author(s): Kun Xie, Kang Liu, Haque A K Alvi, Yuehui Chen, Shuzhen Wang, ...

Keyword(s): Copy Number, Nearest Neighbor, Human Cancer, Gaussian Mixture, Disease Diagnosis, Copy Number Variations, Sequencing Data, K Nearest Neighbor, Data Types, NGS Data

Copy number variation (CNV) is a well-known type of genomic mutation that is associated with the development of human cancers. Detecting CNVs in the human genome is a crucial step in the pipeline from mutation analysis to cancer diagnosis and treatment. Next-generation sequencing (NGS) data provide an unprecedented opportunity for CNV detection at base-level resolution, and many methods have been developed for CNV detection using NGS data. However, due to the intrinsic complexity of CNV structures and of NGS data itself, accurate detection of CNVs still faces many challenges. In this paper, we present an alternative method, called KNNCNV (K-Nearest Neighbor based CNV detection), for the detection of CNVs using NGS data. Compared to current methods, KNNCNV has several distinctive features: 1) it assigns an outlier score to each genome segment based solely on its first k nearest-neighbor distances, which is not only easy to extend to other data types but also improves the power of discovering CNVs, especially local CNVs that are likely to be masked by their surrounding regions; 2) it employs the variational Bayesian Gaussian mixture model (VBGMM) to transform these scores into a series of binary labels without a user-defined threshold. To evaluate the performance of KNNCNV, we conduct experiments on both simulated and real sequencing data and make comparisons with peer methods. The experimental results show that KNNCNV achieves better performance than the other methods in terms of F1-score.
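The scoring idea in feature 1) can be illustrated with a minimal sketch. It is not the KNNCNV implementation: it assumes each genome segment is summarized by a single read-depth value, and it takes the score to be the mean of the k smallest distances to other segments, which is only one plausible reading of "based solely on its first k nearest-neighbor distances". The function name `knn_outlier_scores` and the example depths are hypothetical, and the VBGMM labeling step of feature 2) is not shown.

```python
def knn_outlier_scores(depths, k=3):
    """Score each segment by the mean distance to its k nearest neighbours.

    Assumption: one read-depth value per segment; a larger score means the
    segment sits further from the bulk of the other segments.
    """
    if not 1 <= k < len(depths):
        raise ValueError("k must be between 1 and len(depths) - 1")
    scores = []
    for i, value in enumerate(depths):
        # Distances from this segment to every other segment, smallest first.
        distances = sorted(
            abs(value - other) for j, other in enumerate(depths) if j != i
        )
        scores.append(sum(distances[:k]) / k)
    return scores


# Hypothetical read depths: one segment (index 5) looks amplified.
depths = [100, 102, 98, 101, 99, 180, 100, 97]
scores = knn_outlier_scores(depths, k=3)
most_unusual = scores.index(max(scores))
```

In this toy input the amplified segment receives the largest score. Turning such scores into CNV calls would still need a labeling step like the mixture model the abstract describes.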

Insights into dispersed duplications and complex structural mutations from whole genome sequencing 706 families

DOI: 10.1101/2020.08.03.235358; 2020

Author(s): Christopher W. Whelan, Robert E. Handsaker, Giulio Genovese, Seva Kashin, Monkol Lek, ...

Keyword(s): Gene Expression, Copy Number Variation, Copy Number, De Novo, Whole Genome, Sequencing Data, Number Variation, Structural Mutations, Or Gene, Genomic Locations

Abstract. Two intriguing forms of genome structural variation (SV), dispersed duplications and de novo rearrangements of complex, multi-allelic loci, have long escaped genomic analysis. We describe a new way to find and characterize such variation by utilizing identity-by-descent (IBD) relationships between siblings together with high-precision measurements of segmental copy number. Analyzing whole-genome sequence data from 706 families, we find hundreds of “IBD-discordant” (IBDD) CNVs: loci at which siblings’ CNV measurements and IBD states are mathematically inconsistent. We found that commonly-IBDD CNVs identify dispersed duplications; we mapped 95 of these common dispersed duplications to their true genomic locations through family-based linkage and population linkage disequilibrium (LD), and found several to be in strong LD with genome-wide association (GWAS) signals for common diseases or gene expression variation at their revealed genomic locations. Other CNVs that were IBDD in a single family appear to involve de novo mutations in complex and multi-allelic loci; we identified 26 de novo structural mutations that had not been detected in earlier analyses of the same families by diverse SV analysis methods. These included a de novo mutation of the amylase gene locus and multiple de novo mutations at chromosome 15q14. Combining these complex mutations with more-conventional CNVs, we estimate that segmental mutations larger than 1 kb arise in about one per 22 human meioses. These methods are complementary to previous techniques in that they interrogate genomic regions that are home to segmental duplication, high CNV allele frequencies, and multi-allelic CNVs.

Author Summary. Copy number variation is an important form of genetic variation in which individuals differ in the number of copies of segments of their genomes. Certain aspects of copy number variation have traditionally been difficult to study using short-read sequencing data. For example, standard analyses often cannot tell whether the duplicated copies of a segment are located near the original copy or are dispersed to other regions of the genome. Another aspect that has been difficult to study is the detection of mutations in the copy number of DNA segments passed down from parents to their children, particularly when the mutations affect genome segments that already display common copy number variation in the population. We develop an analytical approach to solving these problems when sequencing data are available for all members of families with at least two children. This method is based on determining the number of parental haplotypes the two siblings share at each location in their genome, and using that information to determine the possible inheritance patterns that might explain the copy numbers we observe in each family member. We show that dispersed duplications and mutations can be identified by looking for copy number variants that do not follow these expected inheritance patterns. We use this approach to determine the location of 95 common duplications which are dispersed to distant regions of the genome, and demonstrate that these duplications are linked to genetic variants that affect disease risk or gene expression levels. We also identify a set of copy number mutations not detected by previous analyses of sequencing data from a large cohort of families, and show that repetitive and complex regions of the genome undergo frequent mutations in copy number.
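The consistency check described in the summary can be sketched as a brute-force enumeration. This is an illustration under simplifying assumptions, not the authors' method: copy numbers are treated as exact integers, each parent's diploid copy number is split across two haplotypes in every possible way, and the IBD state (0, 1 or 2) is taken to be the number of parental haplotypes the two siblings share at the locus. The function names `haplotype_splits` and `ibd_consistent` are hypothetical.

```python
from itertools import product


def haplotype_splits(total):
    """All ways to split a parent's diploid copy number across two haplotypes."""
    return [(a, total - a) for a in range(total + 1)]


def ibd_consistent(father_cn, mother_cn, sib1_cn, sib2_cn, ibd_state):
    """Return True if some inheritance pattern explains the observed copy numbers.

    A locus where this returns False would be a candidate "IBD-discordant" CNV
    in the sense of the abstract (under this sketch's simplifying assumptions).
    """
    for f_haps, m_haps in product(
        haplotype_splits(father_cn), haplotype_splits(mother_cn)
    ):
        # f1/m1: haplotype indices inherited by sibling 1; f2/m2: by sibling 2.
        for f1, m1, f2, m2 in product((0, 1), repeat=4):
            shared = (f1 == f2) + (m1 == m2)
            if shared != ibd_state:
                continue
            if (
                f_haps[f1] + m_haps[m1] == sib1_cn
                and f_haps[f2] + m_haps[m2] == sib2_cn
            ):
                return True
    return False


# Siblings sharing both parental haplotypes (IBD2) must have equal copy number.
discordant = not ibd_consistent(3, 2, 3, 2, ibd_state=2)
# The same copy numbers are explainable if the siblings share only one haplotype.
explained = ibd_consistent(3, 2, 3, 2, ibd_state=1)
```

Real data would require tolerance for measurement noise in both copy number and IBD calls, which this sketch ignores.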

Identification of prognostic values defined by copy number variation, mRNA and protein expression of LANCL2 and EGFR in IDH1/2-wild-type glioblastoma

DOI: 10.21203/rs.3.rs-276679/v1; 2021

Author(s): Hua-fu Zhao, Xiu-ming Zhou, Jing Wang, Fan-fan Chen, Chang-peng Wu, ...

Keyword(s): Protein Expression, Copy Number, Intracellular Localization, Methylation Status, Growth Factor Receptor, Copy Number Variations, Wild Type, Number Variation, Newly Diagnosed Glioblastoma, mRNA and Protein Expression

Abstract. Background: The epidermal growth factor receptor (EGFR) and lanthionine synthetase C-like 2 (LanCL2) genes are located in the same amplicon, and co-amplification of EGFR and LANCL2 is frequent in glioblastoma. However, the prognostic value of LANCL2 and EGFR co-amplification, and of their mRNA and protein expression, in glioblastoma remains unclear. Methods: This study analyzed the prognostic values of the copy number variations (CNVs), mRNA and protein expression of LANCL2 and EGFR in glioblastoma specimens from the TCGA database or our tumor banks. Results: Amplification of LANCL2 or EGFR, and their co-amplification, were frequent in glioblastoma in the TCGA database and our tumor banks. CNVs of LANCL2 or EGFR were significantly correlated with IDH1/2 mutation but not with MGMT promoter methylation status. LANCL2 or EGFR amplification, and their co-amplification, were significantly associated with reduced overall survival (OS) of glioblastoma patients, but not of IDH1/2-wild-type glioblastoma patients. mRNA and protein overexpression of LANCL2 and EGFR was also frequently found in glioblastoma. LANCL2, rather than EGFR, was overexpressed in relapsing glioblastoma compared with newly diagnosed glioblastoma. However, mRNA or protein expression of EGFR and LANCL2 was not significantly correlated with OS of glioblastoma patients. In addition, the intracellular localization of LanCL2, not EGFR, was associated with the grade of gliomas. Conclusions: Taken together, amplification and mRNA overexpression of LANCL2 and EGFR, and their co-amplification and co-expression, were frequent in glioblastoma patients. Our findings suggest that CNVs of LANCL2 and EGFR were independent diagnostic and prognostic biomarkers for histological glioblastoma patients, but not for IDH1/2-wild-type glioblastoma patients.

ILDR1 Is a Prognostic Biomarker and Associated With Immune Infiltration in Gastric Cancer

DOI: 10.21203/rs.3.rs-944596/v1; 2021

Author(s): Yanling Ma, WenBo Qi, BaoHong Gu, XueMei Li, ZhenYu Yin, ...

Keyword(s): Gastric Cancer, T Cells, Dendritic Cells, Cancer Patients, CD8 T Cells, Immune Cells, Copy Number, Sequencing Data, Number Variation, Gastric Cancer Patients

Abstract. Objective: To investigate the association between ILDR1 and prognosis and immune infiltration in gastric cancer. Methods: We analyzed the RNA sequencing data of 9736 tumor tissues and 8587 normal tissues in the TCGA and GTEx databases through the GEPIA2 platform. The expression of ILDR1 in gastric cancer and normal gastric mucosa tissues was assessed with GEPIA and TIMER. Clinical subgroup analysis was performed with Kaplan-Meier analysis. We analyzed the correlation between ILDR1 and VEGFA expression in gastric cancer using the gene sequencing data for gastric cancer in TCGA. We explored the relationship between ILDR1 methylation and the prognosis of gastric cancer patients through the MethSurv database. The correlation between ILDR1 and immune cells, and the correlation with copy number variation, were explored through the TIMER database. Results: Gastric cancer (GC) patients with high ILDR1 expression had lower progression-free survival (PFS) and overall survival (OS). High ILDR1 expression was significantly correlated with tumor grade. There was a negative correlation between ILDR1 expression and the abundance of CD8+ T cells, macrophages, and dendritic cells, among others. The methylation level of ILDR1 was associated with a good prognosis of gastric cancer. ILDR1 copy number variation was correlated with immune cells: ILDR1 arm-loss was associated with the infiltration of B cells, CD8+ T cells, CD4+ T cells, macrophages, neutrophils, and dendritic cells, and arm-duplication was associated with the infiltration of B cells, CD8+ T cells, CD4+ T cells, macrophages, neutrophils, and dendritic cells. Conclusion: Increased expression of ILDR1 is associated with poor prognosis in patients with gastric cancer. ILDR1 can be used as a novel predictive biomarker and may provide a new therapeutic target for gastric cancer patients.

Spatially constrained tumour growth affects the patterns of clonal selection and neutral drift in cancer genomic data

DOI: 10.1101/544536; 2019; Cited By ~ 3

Author(s): Kate Chkhaidze, Timon Heide, Benjamin Werner, Marc J. Williams, Weini Huang, ...

Keyword(s): Next Generation Sequencing, Tumour Growth, Evolutionary Dynamics, Clonal Selection, Genomic Data, Confounding Factors, Data Generation, Next Generation, Sequencing Data, Generation Sequencing

Abstract. Quantification of the effect of spatial tumour sampling on the patterns of mutations detected in next-generation sequencing data is largely lacking. Here we use a spatial stochastic cellular automaton model of tumour growth that accounts for somatic mutations, selection, drift and spatial constraints to simulate multi-region sequencing data derived from spatial sampling of a neoplasm. We show that the spatial structure of a solid cancer has a major impact on the detection of clonal selection and genetic drift from bulk sequencing data and single-cell sequencing data. Our results indicate that spatial constraints can introduce significant sampling biases when performing multi-region bulk sampling and that such bias becomes a major confounding factor for the measurement of the evolutionary dynamics of human tumours. We present a statistical inference framework that takes into account the spatial effects of a growing tumour and allows inferring the evolutionary dynamics from patient genomic data. Our analysis shows that measuring cancer evolution using next-generation sequencing while accounting for the numerous confounding factors requires a mechanistic model-based approach that captures the sources of noise in the data.

Summary. Sequencing the DNA of cancer cells from human tumours has become one of the main tools to study cancer biology. However, sequencing data are complex and often difficult to interpret. In particular, the way in which the tissue is sampled and the data are collected significantly affects the interpretation of the results. We argue that understanding cancer genomic data requires mathematical models and computer simulations that tell us what we expect the data to look like, with the aim of understanding the impact of confounding factors and biases in the data generation step. In this study, we develop a spatial simulation of tumour growth that also simulates the data generation process, and demonstrate that biases in the sampling step and current technological limitations severely impact the interpretation of the results. We then provide a statistical framework that can be used to overcome these biases and more robustly measure aspects of the biology of tumours from the data.
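To make the idea of a spatial stochastic cellular automaton concrete, here is a deliberately small sketch. It is not the model from the paper: it assumes a 2D square grid, neutral growth only (no selection), division only into an empty neighbouring site, and at most one new mutation per division. All names (`grow_tumour`, `sample_region`) and parameter values are hypothetical.

```python
import random


def grow_tumour(size=21, steps=200, mutation_rate=0.5, seed=0):
    """Grow a toy tumour on a size x size grid; returns {position: mutation set}."""
    rng = random.Random(seed)
    centre = size // 2
    grid = {(centre, centre): frozenset()}  # founder cell carries no mutations
    next_mutation = 0
    for _ in range(steps):
        x, y = rng.choice(sorted(grid))  # pick a random existing cell
        empty = [
            (x + dx, y + dy)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < size
            and 0 <= y + dy < size
            and (x + dx, y + dy) not in grid
        ]
        if not empty:
            continue  # spatial constraint: a surrounded cell cannot divide
        child = grid[(x, y)]
        if rng.random() < mutation_rate:
            child = child | {next_mutation}  # daughter gains one new mutation
            next_mutation += 1
        grid[rng.choice(empty)] = child
    return grid


def sample_region(grid, x0, y0, width):
    """Fraction of cells carrying each mutation in a square 'bulk' sample."""
    cells = [
        muts
        for (x, y), muts in grid.items()
        if x0 <= x < x0 + width and y0 <= y < y0 + width
    ]
    counts = {}
    for muts in cells:
        for m in muts:
            counts[m] = counts.get(m, 0) + 1
    return {m: c / len(cells) for m, c in counts.items()}


tumour = grow_tumour()
left = sample_region(tumour, 0, 0, 10)
right = sample_region(tumour, 11, 0, 10)
```

Comparing the mutation frequencies in `left` and `right` gives a feel for how two bulk samples of the same simulated tumour can disagree purely because of where they were taken.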
