Multiomic Integration of Public Oncology Databases in Bioconductor

2020 ◽  
pp. 958-971
Author(s):  
Marcel Ramos ◽  
Ludwig Geistlinger ◽  
Sehyun Oh ◽  
Lucas Schiffer ◽  
Rimsha Azhar ◽  
...  

PURPOSE Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from > 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases. METHODS We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special considerations for loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are provided through out-of-memory data representation to provide responsive loading times and analysis capabilities on machines with limited memory. RESULTS We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide integrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses. CONCLUSION These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples.

2020 ◽  
Author(s):  
Annika Tjuka ◽  
Robert Forkel ◽  
Johann-Mattis List

Psychologists and linguists have collected a great diversity of data for word and concept properties. In psychology, many studies accumulate norms and ratings such as word frequencies or age-of-acquisition, often for a large number of words. Linguistics, on the other hand, provides valuable insights into relations of word meanings. We present a collection of those data sets for norms, ratings, and relations that cover different languages: ‘NoRaRe.’ To enable a comparison between the diverse data types, we established workflows that facilitate the expansion of the database. A web application allows convenient access to the data (https://digling.org/norare/). Furthermore, a software API ensures consistent data curation by providing tests to validate the data sets. The NoRaRe collection is linked to the database curated by the Concepticon project (https://concepticon.clld.org), which offers a reference catalog of unified concept sets. The link between words in the data sets and the Concepticon concept sets makes a cross-linguistic comparison possible. In three case studies, we test the validity of our approach, the accuracy of our workflow, and the applicability of our database. The results indicate that the NoRaRe database can be applied to the study of word properties across multiple languages. The data can be used by psychologists and linguists to benefit from the knowledge rooted in both research disciplines.


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 1542 ◽  
Author(s):  
Tiago C. Silva ◽  
Antonio Colaprico ◽  
Catharina Olsen ◽  
Fulvio D'Angelo ◽  
Gianluca Bontempi ◽  
...  

Biotechnological advances in sequencing have led to an explosion of publicly available data via large international consortia such as The Cancer Genome Atlas (TCGA), The Encyclopedia of DNA Elements (ENCODE), and The NIH Roadmap Epigenomics Mapping Consortium (Roadmap). These projects have provided unprecedented opportunities to interrogate the epigenome of cultured cancer cell lines as well as normal and tumor tissues with high genomic resolution. The Bioconductor project offers more than 1,000 open-source software and statistical packages to analyze high-throughput genomic data. However, most packages are designed for specific data types (e.g. expression, epigenetics, genomics) and there is no one comprehensive tool that provides a complete integrative analysis of the resources and data provided by all three public projects. A need to create an integration of these different analyses was recently proposed. In this workflow, we provide a series of biologically focused integrative analyses of different molecular data. We describe how to download, process and prepare TCGA data and, by harnessing several key Bioconductor packages, how to extract biologically meaningful genomic and epigenomic data. Using Roadmap and ENCODE data, we provide a work plan to identify biologically relevant functional epigenomic elements associated with cancer. To illustrate our workflow, we analyzed two types of brain tumors: low-grade glioma (LGG) versus high-grade glioma (glioblastoma multiforme or GBM). This workflow introduces the following Bioconductor packages: AnnotationHub, ChIPSeeker, ComplexHeatmap, pathview, ELMER, GAIA, MINET, RTCGAToolbox, TCGAbiolinks.


2018 ◽  
Vol 17 ◽  
pp. 117693511877478 ◽  
Author(s):  
Jovan Cejovic ◽  
Jelena Radenkovic ◽  
Vladimir Mladenovic ◽  
Adam Stanojevic ◽  
Milica Miletic ◽  
...  

Increased efforts in cancer genomics research and bioinformatics are producing tremendous amounts of data. These data are diverse in origin, format, and content. As the amount of available sequencing data increases, technologies that make them discoverable and usable are critically needed. In response, we have developed a Semantic Web–based Data Browser, a tool allowing users to visually build and execute ontology-driven queries. This approach simplifies access to available data and improves the process of using them in analyses on the Seven Bridges Cancer Genomics Cloud (CGC; www.cancergenomicscloud.org). The Data Browser makes large data sets easily explorable and simplifies the retrieval of specific data of interest. Although initially implemented on top of The Cancer Genome Atlas (TCGA) data set, the Data Browser’s architecture allows for seamless integration of other data sets. By deploying it on the CGC, we have enabled remote researchers to access data and perform collaborative investigations.


2021 ◽  
Author(s):  
Lihong Huang ◽  
ruoling zheng ◽  
Huasong Gong ◽  
Yongchao Qiao

Abstract Although emerging cell- and animal-based evidence supports an association between nuclear factor kappa-B1 (NF-κB1) and cancers, there has been no pan-cancer analysis. Therefore, based on TCGA (The Cancer Genome Atlas) and GEO (Gene Expression Omnibus) data sets, we first studied the potential carcinogenic effect of NF-κB1 in 33 tumors. We found not only high expression of NF-κB1 in most tumors but also a close relationship between NF-κB1 expression and the prognosis of tumor patients. Enhanced phosphorylation of S893 was observed in several tumors, such as breast cancer, uterine corpus endometrial carcinoma, and lung adenocarcinoma. In thymoma, NF-κB1 expression was related to CD8+ T-cell infiltration levels, and tumor-associated fibroblast infiltration was also seen in other tumors, such as uterine corpus endometrial carcinoma and glioblastoma multiforme. In addition, the functional mechanism of NF-κB1 also involves protein processing and RNA metabolism. This pan-cancer study of NF-κB1 aims to provide a systematic and comprehensive understanding of the carcinogenic effect of NF-κB1 in different tumors.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Judith Abécassis ◽  
Fabien Reyal ◽  
Jean-Philippe Vert

Abstract Systematic DNA sequencing of cancer samples has highlighted the importance of two aspects of cancer genomics: intra-tumor heterogeneity (ITH) and mutational processes. These two aspects may not always be independent, as different mutational processes could be involved in different stages or regions of the tumor, but existing computational approaches to study them largely ignore this potential dependency. Here, we present CloneSig, a computational method to jointly infer ITH and mutational processes in a tumor from bulk-sequencing data. Extensive simulations show that CloneSig outperforms current methods for ITH inference and detection of mutational processes when the distribution of mutational signatures changes between clones. Applied to a large cohort of 8,951 tumors with whole-exome sequencing data from The Cancer Genome Atlas, and to a pan-cancer dataset of 2,632 whole-genome sequencing tumor samples from the Pan-Cancer Analysis of Whole Genomes initiative, CloneSig obtains results that are overall coherent with previous studies.


2017 ◽  
Author(s):  
Marcel Ramos ◽  
Lucas Schiffer ◽  
Angela Re ◽  
Rimsha Azhar ◽  
Azfar Basunia ◽  
...  

ABSTRACT Multi-omics experiments are increasingly commonplace in biomedical research, and add layers of complexity to experimental design, data integration, and analysis. R and Bioconductor provide a generic framework for statistical analysis and visualization, as well as specialized data classes for a variety of high-throughput data types, but methods are lacking for integrative analysis of multi-omics experiments. The MultiAssayExperiment software package, implemented in R and leveraging Bioconductor software and design principles, provides for the coordinated representation of, storage of, and operation on multiple diverse genomics data. We provide all of the multiple ‘omics data for each cancer tissue in The Cancer Genome Atlas (TCGA) as ready-to-analyze MultiAssayExperiment objects, and demonstrate in these and other datasets how the software simplifies data representation, statistical analysis, and visualization. The MultiAssayExperiment Bioconductor package reduces major obstacles to efficient, scalable and reproducible statistical analysis of multi-omics data and enhances data science applications of multiple omics datasets.


2018 ◽  
Author(s):  
Pamela Wu ◽  
Zachary J Heins ◽  
James T Muller ◽  
Adam A Abeshouse ◽  
Yichao Sun ◽  
...  

Summary The Clinical Proteomic Tumor Analysis Consortium (CPTAC) has produced extensive mass spectrometry-based proteomics data for selected breast, colon and ovarian tumors from The Cancer Genome Atlas (TCGA). We have incorporated the CPTAC proteomics data into the cBioPortal to support easy exploration and integrative analysis of these proteomic datasets in the context of the clinical and genomics data from the same tumors. cBioPortal is an open source platform for exploring, visualizing, and analyzing multi-dimensional cancer genomics and clinical data. The public instance of the cBioPortal (http://cbioportal.org/) hosts more than 100 cancer genomics studies including all of the data from TCGA. Its biologist-friendly interface provides many rich analysis features, including a graphical summary of gene-level data across multiple platforms, correlation analysis between genes or other data types, survival analysis, and network visualization. Here, we present the integration of the CPTAC mass spectrometry-based proteomics data into the cBioPortal, consisting of 77 breast, 95 colorectal, and 174 ovarian tumors that already have been profiled by TCGA for mutations, copy number alterations, gene expression, and DNA methylation. As a result, the CPTAC data can now be easily explored and analyzed in the cBioPortal in the context of clinical and genomics data. By integrating CPTAC data into cBioPortal, limitations of TCGA proteomics array data can be overcome while also providing a user-friendly web interface, a web API and an R client to query the mass spectrometry data together with genomic, epigenomic, and clinical data.


Author(s):  
Zhuohui Wei ◽  
Yue Zhang ◽  
Wanlin Weng ◽  
Jiazhou Chen ◽  
Hongmin Cai

Abstract The significance of pan-cancer categories has recently been recognized as widespread in cancer research. Pan-cancer analysis categorizes a cancer based on its molecular pathology rather than its organ of origin. The molecular similarities among multi-omics data found in different cancer types can play several roles in both biological processes and therapeutic developments. Therefore, an integrated analysis of various genomic data is frequently used to reveal novel genetic and molecular mechanisms. However, a variety of algorithms for multi-omics clustering have been proposed in different fields, and how these computational clustering methods compare in pan-cancer analysis remains unclear. To increase the utilization of current integrative methods in pan-cancer analysis, we first provide an overview of five popular computational integrative tools: similarity network fusion, integrative clustering of multiple genomic data types (iCluster), cancer integration via multi-kernel learning (CIMLR), perturbation clustering for data integration and disease subtyping (PINS) and low-rank clustering (LRACluster). Then, a priori interactions in multi-omics data were incorporated to detect prominent molecular patterns in pan-cancer data sets. Finally, we present comparative assessments of these methods, with discussion of key issues in applying these algorithms. We found that all five methods can identify distinct tumor compositions. The pan-cancer samples can be reclassified into several groups by different proportions. Interestingly, each method can classify the tumors into categories that are different from original cancer types or subtypes, especially for ovarian serous cystadenocarcinoma (OV) and breast invasive carcinoma (BRCA) tumors. In addition, all clusters of the five computational methods show notable prognostic values. Furthermore, both the 9 recurrent differential genes and the 15 common pathway characteristics were identified across all the methods. The results and discussion can help the community select appropriate integrative tools according to different research tasks or aims in pan-cancer analysis.


2018 ◽  
Author(s):  
Andrew M. Hudson ◽  
Natalie L. Stephenson ◽  
Cynthia Li ◽  
Eleanor Trotter ◽  
Adam J. Fletcher ◽  
...  

Abstract A major challenge in cancer genomics is identifying driver mutations from the large number of neutral passenger mutations within a given tumor. Here, we utilize motifs critical for kinase activity to functionally filter genomic data to identify driver mutations that would otherwise be lost within mutational noise. In the first step of our screen, we define a putative tumor-suppressing kinome by identifying kinases with truncation mutations occurring within or before the kinase domain. We aligned these kinase sequences and, utilizing data from the Cancer Cell Line Encyclopedia and The Cancer Genome Atlas databases, identified amino acids that represent predicted hotspots for loss-of-function (LOF) mutations. The functional consequences of new LOF mutations were validated and the top 15 hotspot LOF residues were used in a pan-cancer analysis to define the tumor-suppressing kinome. A ranked list revealed MAP2K7 as a candidate tumor suppressor in gastric cancer, despite the mutational frequency of MAP2K7 falling within the mutational noise for this cancer type. The majority of mutations in MAP2K7 abolished catalytic activity compared to the wild type kinase, consistent with a tumor suppressive role for MAP2K7 in gastric cancer. Furthermore, reactivation of the JNK pathway in gastric cancer cells harboring LOF mutations in MAP2K7 or JNK1 suppresses clonogenicity and growth in soft agar, demonstrating the functional importance of inactivating the JNK pathway in gastric cancer. In summary, our data highlight a broadly applicable strategy to identify functional cancer driver mutations, leading us to define the JNK pathway as tumor suppressive in gastric cancer. Summary A unique computational pan-cancer analysis pinpoints novel tumor-suppressing kinases, and highlights the power of functional genomics by defining the JNK pathway as tumor suppressive in gastric cancer.
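The first step of the screen described above (keeping kinases that carry a truncation mutation within or before the kinase domain) can be illustrated with a small filter. This is only a sketch under stated assumptions, not the authors' pipeline: the `Mutation` record, the `candidate_kinases` helper, the gene labels, and all coordinates are hypothetical placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    gene: str          # hypothetical gene label
    position: int      # 1-based amino-acid position (made-up values below)
    truncating: bool   # True for e.g. nonsense or frameshift calls


def candidate_kinases(mutations, kinase_domains):
    """Return genes with a truncating mutation at or before the domain end.

    kinase_domains maps gene -> (domain_start, domain_end); a truncation
    located before or inside the domain is assumed to disrupt it.
    """
    hits = set()
    for mutation in mutations:
        domain = kinase_domains.get(mutation.gene)
        if domain is None:
            continue  # no domain annotation available for this gene
        _domain_start, domain_end = domain
        if mutation.truncating and mutation.position <= domain_end:
            hits.add(mutation.gene)
    return sorted(hits)


# Hypothetical toy input; none of these values come from the study.
mutations = [
    Mutation("KINASE_A", 120, True),   # truncation inside the domain
    Mutation("KINASE_B", 40, True),    # truncation before the domain
    Mutation("KINASE_C", 900, True),   # truncation after the domain: ignored
    Mutation("KINASE_D", 150, False),  # non-truncating change: ignored
]
kinase_domains = {
    "KINASE_A": (100, 350),
    "KINASE_B": (60, 310),
    "KINASE_C": (200, 450),
    "KINASE_D": (100, 350),
}
candidates = candidate_kinases(mutations, kinase_domains)
print(candidates)
```

Comparing only against the end of the domain captures both the "within" and the "before" cases in one test, since a truncation anywhere upstream of the domain end removes at least part of it.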


2019 ◽  
Author(s):  
Margaret Linan ◽  
Junwen Wang ◽  
Valentin Dinu

Abstract We performed a comprehensive pan-cancer analysis in the Cancer Genomics Cloud of HTSeq-FPKM normalized protein-coding mRNA data from 17 cancer projects in The Cancer Genome Atlas: Adrenal Gland, Bile Duct, Bladder, Brain, Breast, Cervix, Colorectal, Esophagus, Head and Neck, Kidney, Liver, Lung, Pancreas, Prostate, Stomach, Thyroid and Uterus. The PoTRA algorithm was applied to the normalized protein-coding mRNA data and detected dysregulated pathways that can be implicated in the pathogenesis of these cancers. The PageRank algorithm was then applied to the PoTRA results to find the most influential dysregulated pathways among all 17 cancer types. Pathways in cancer is the most common dysregulated pathway, and the MAPK signaling pathway is the most influential (PageRank score = 0.2034), while the purine metabolism pathway is the most significantly dysregulated metabolic pathway.
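For readers unfamiliar with PageRank, the ranking step mentioned above can be sketched as a plain power iteration over a directed graph. The abstract does not say how the graph was built from the PoTRA results, so the edge list below is an invented toy example, and the `pagerank` function, its parameters, and the node labels are assumptions for illustration only; the resulting scores have no relation to the reported value.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a directed edge list (src, dst)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: an edge points from one pathway label to another.
toy_edges = [
    ("purine_metabolism", "mapk_signaling"),
    ("pi3k_akt_signaling", "mapk_signaling"),
    ("pathways_in_cancer", "mapk_signaling"),
    ("mapk_signaling", "pathways_in_cancer"),
]
scores = pagerank(toy_edges)
most_influential = max(scores, key=scores.get)
print(most_influential, round(scores[most_influential], 4))
```

In this toy graph the node with the most incoming links ends up with the highest score, which is the sense in which PageRank identifies an "influential" node.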

