Identifying Scientific Project-generated Data Citation from Full-text Articles: An Investigation of TCGA Data Citation

2017 ◽  
Vol 1 (2) ◽  
pp. 32-44
Author(s):  
Jiao Li ◽  
Si Zheng ◽  
Hongyu Kang ◽  
Zhen Hou ◽  
Qing Qian

Purpose: In the open science era, project-generated scientific data are typically shared by depositing them in an open, accessible database, while scientific publications are preserved in digital library archives. Identifying the data usage mentioned in the literature and associating it with its source remains challenging. Here, we investigated the data usage of a government-funded cancer genomics project, The Cancer Genome Atlas (TCGA), via a full-text literature analysis.
Design/methodology/approach: We focused on identifying articles that used the TCGA dataset and constructing linkages between those articles and the specific TCGA dataset. First, we collected 5,372 TCGA-related articles from PubMed Central (PMC). Second, we constructed a benchmark set of 25 full-text articles that truly used TCGA data in their studies and summarized the key features of the benchmark set. Third, we applied the key features to the remaining full-text articles collected from PMC.
Findings: The number of publications that use TCGA data has increased significantly since 2011, although the TCGA project was launched in 2005. Additionally, the studies that use TCGA data focused mainly on glioblastoma multiforme, lung cancer, and breast cancer, and data from the RNA-sequencing (RNA-seq) platform were the most preferred.
Research limitations: The current workflow for identifying articles that truly used TCGA data is labor-intensive. An automatic method is expected to improve performance.
Practical implications: This study will help cancer genomics researchers determine the latest advancements in cancer molecular therapy, and it will promote data sharing and data-intensive scientific discovery.
Originality/value: Few studies have investigated the usage of data generated by government-funded projects/programs since their launch. In this preliminary study, we extracted articles that use TCGA data from PMC, and we created a link between the full-text articles and the source data.
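The feature-based screening step described above can be sketched roughly as follows. This is only an illustration: the abstract does not list the key features the authors derived from their benchmark set, so the patterns below (a TCGA patient barcode, a data-acquisition phrase, and a small platform vocabulary) and the function names `tcga_usage_features` and `likely_uses_tcga` are hypothetical stand-ins.

```python
import re

# Hypothetical features; the abstract does not enumerate the real ones.
# TCGA patient barcodes have the form TCGA-<2 chars>-<4 chars>.
BARCODE = re.compile(r"\bTCGA-[A-Z0-9]{2}-[A-Z0-9]{4}\b")

# An acquisition verb followed, within the same sentence, by a TCGA mention.
USAGE_PHRASE = re.compile(
    r"\b(downloaded|obtained|retrieved|accessed)\b[^.]{0,80}"
    r"\b(TCGA|The Cancer Genome Atlas)\b",
    re.IGNORECASE,
)

# Assumed platform vocabulary, for linking an article to a data type.
PLATFORMS = ("RNA-seq", "RNAseq", "SNP array", "methylation", "miRNA-seq")


def tcga_usage_features(text):
    """Extract simple indicators of TCGA data usage from article full text."""
    lowered = text.lower()
    return {
        "barcodes": sorted(set(BARCODE.findall(text))),
        "usage_phrase": bool(USAGE_PHRASE.search(text)),
        "platforms": sorted({p for p in PLATFORMS if p.lower() in lowered}),
    }


def likely_uses_tcga(text):
    """Flag an article when it cites a barcode or describes acquiring TCGA data."""
    features = tcga_usage_features(text)
    return bool(features["barcodes"]) or features["usage_phrase"]
```

A rule set like this separates articles that merely mention TCGA from those that report obtaining its data, but, as the stated research limitations suggest, any such rules would still need manual validation against a benchmark set.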

2015 ◽  
Author(s):  
Sunho Park ◽  
Seung-Jun Kim ◽  
Donghyeon Yu ◽  
Samuel Pena-Llopis ◽  
Jianjiong Gao ◽  
...  

Identification of altered pathways that are clinically relevant across human cancers is a key challenge in cancer genomics. We developed a network-based algorithm to integrate somatic mutation data with gene networks and pathways, in order to identify pathways altered by somatic mutations across cancers. We applied our approach to The Cancer Genome Atlas (TCGA) dataset of somatic mutations in 4,790 cancer patients with 19 different types of malignancies. Our analysis identified cancer-type-specific altered pathways enriched with known cancer-relevant genes and drug targets. Consensus clustering using gene expression datasets that included 4,870 patients from TCGA and multiple independent cohorts confirmed that the altered pathways could be used to stratify patients into subgroups with significantly different clinical outcomes. Of particular significance, certain patient subpopulations with poor prognosis were identified because they had specific altered pathways for which there are available targeted therapies. These findings could be used to tailor and intensify therapy in these patients, for whom current therapy is suboptimal.


Cancers ◽  
2021 ◽  
Vol 13 (10) ◽  
pp. 2487
Author(s):  
Chao Gao ◽  
Guangxu Jin ◽  
Elizabeth Forbes ◽  
Lingegowda S. Mangala ◽  
Yingmei Wang ◽  
...  

IK is a mitotic factor that promotes cell cycle progression. Our previous investigation of 271 endometrial cancer (EC) samples from The Cancer Genome Atlas (TCGA) dataset showed that IK somatic mutations were enriched in a cluster of patients with high-grade and high-stage cancers, and this group had longer survival. This study provides insight into how IK somatic mutations contribute to EC pathophysiology. We analyzed the somatic mutational landscape of the IK gene in 547 EC patients using the expanded TCGA dataset. Co-immunoprecipitation and mass spectrometry were used to identify protein interactions. In vitro and in vivo experiments were used to evaluate IK’s role in EC. Patients with IK-inactivating mutations had longer survival during 10-year follow-up. Frameshift and stop-gain mutations were common and were associated with decreased IK expression. IK knockdown led to enrichment of G2/M phase cells, inactivation of DNA repair signaling mediated by heterodimerization of Ku80 and Ku70, and sensitization of EC cells to cisplatin treatment. IK/Ku80 mutations were accompanied by higher mutation rates and associated with significantly better overall survival. Inactivating mutations of the IK gene and loss of IK protein expression were associated with weakened Ku80/Ku70-mediated DNA repair, increased mutation burden, and better response to chemotherapy in patients with EC.


2014 ◽  
Vol 22 (2) ◽  
pp. 173-185 ◽  
Author(s):  
Eli Dart ◽  
Lauren Rotman ◽  
Brian Tierney ◽  
Mary Hester ◽  
Jason Zurawski

The ever-increasing scale of scientific data has become a significant challenge for researchers who rely on networks to interact with remote computing systems and transfer results to collaborators worldwide. Despite the availability of high-capacity connections, scientists struggle with inadequate cyberinfrastructure that cripples data transfer performance and impedes scientific progress. The Science DMZ paradigm comprises a proven set of network design patterns that collectively address these problems for scientists. We explain the Science DMZ model, including the network architecture, system configuration, cybersecurity, and performance tools that create an optimized network environment for science. We describe use cases from universities, supercomputing centers, and research laboratories, highlighting the effectiveness of the Science DMZ model in diverse operational settings. In all, the Science DMZ model is a solid platform that supports any science workflow and flexibly accommodates emerging network technologies. As a result, the Science DMZ vastly improves collaboration, accelerating scientific discovery.


2020 ◽  
Author(s):  
Xun Gu

Current cancer genomics databases have accumulated millions of somatic mutations that remain to be further explored, facilitating high-throughput analyses of the underlying mechanisms that may contribute to malignant initiation or progression. Because passenger mutations (unrelated to cancer) predominate, the challenge is to identify the somatic mutations that are cancer-driving. Under the notion that carcinogenesis is a form of somatic-cell evolution, we developed a two-component mixture model that enables the following analyses. (i) We formulated a quasi-likelihood approach to test whether the two-component model is significantly better than a single-component model, which can be used to predict new cancer genes. (ii) We implemented an empirical Bayesian method to calculate, for every site of a gene, the posterior probability that the site is cancer-driving, which can be used to predict new driver sites. (iii) We developed a computational procedure to calculate the somatic selection intensity at driver sites and passenger sites, respectively, as well as site-specific profiles for all sites. Using these newly developed methods, we comprehensively analyzed 294 known cancer genes based on The Cancer Genome Atlas (TCGA) database.
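The posterior calculation in step (ii) follows the usual Bayes rule for a two-component mixture. The sketch below illustrates only that rule, under assumptions not taken from the abstract: each site's mutation count is modeled as Poisson, with a low rate for passenger sites and a higher rate for driver sites, and the mixing proportion and rates are passed in directly. In an empirical Bayesian analysis these parameters would first be estimated from the data; the author's actual component distributions are not specified here.

```python
import math


def poisson_pmf(count, rate):
    """Probability of observing `count` events under a Poisson(rate) model."""
    return math.exp(-rate) * rate**count / math.factorial(count)


def driver_posterior(count, pi_driver, rate_passenger, rate_driver):
    """Posterior probability that a site is a driver given its mutation count.

    Assumed model (illustrative only): with probability pi_driver the site
    is a driver and its count is Poisson(rate_driver); otherwise it is a
    passenger and its count is Poisson(rate_passenger).
    """
    driver = pi_driver * poisson_pmf(count, rate_driver)
    passenger = (1.0 - pi_driver) * poisson_pmf(count, rate_passenger)
    return driver / (driver + passenger)


# Example: a rare driver class (5% of sites) with a tenfold higher rate.
for k in (0, 2, 5):
    print(k, round(driver_posterior(k, 0.05, 0.5, 5.0), 4))
```

Under these assumed parameters, sites with no recurrent mutations receive a posterior near zero, while heavily recurrent sites approach one, which is the behavior a driver-site ranking relies on.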


2021 ◽  
Vol 11 ◽  
Author(s):  
Guo Zu ◽  
Jiacheng Gao ◽  
Tingting Zhou

Background: The clinicopathological and prognostic significance of SRY-box transcription factor 9 (SOX9) expression in gastric cancer (GC) patients is still controversial. Our aim is to investigate the clinicopathological and prognostic value of SOX9 expression in GC patients.
Methods: A systematic literature search and meta-analysis were used to evaluate the clinicopathological significance and overall survival (OS) of SOX9 expression in GC patients. The Cancer Genome Atlas (TCGA) dataset was used to investigate the relationship between SOX9 expression and OS of stomach adenocarcinoma (STAD) patients.
Results: A total of 11 articles involving 3,060 GC patients were included. In GC patients, SOX9 expression was not associated with age [odds ratio (OR) = 0.743, 95% CI = 0.507–1.089, p = 0.128], sex (OR = 0.794, 95% CI = 0.605–1.042, p = 0.097), differentiation (OR = 0.728, 95% CI = 0.475–1.115, p = 0.144), or lymph node metastasis (OR = 1.031, 95% CI = 0.793–1.340, p = 0.820). SOX9 expression was associated with depth of invasion (OR = 0.348, 95% CI = 0.247–0.489, p = 0.000) and TNM stage (OR = 0.428, 95% CI = 0.308–0.595, p = 0.000). The 1-year OS (OR = 1.507, 95% CI = 1.167–1.945, p = 0.002), 3-year OS (OR = 1.482, 95% CI = 1.189–1.847, p = 0.000), and 5-year OS (OR = 1.487, 95% CI = 1.187–1.862, p = 0.001) were significantly shorter in GC patients with high SOX9 expression. TCGA analysis showed that SOX9 was upregulated in STAD patients compared with normal controls (p < 0.001), and that OS was poorer in STAD patients with high SOX9 expression than in those with low SOX9 expression, although the difference was not statistically significant (p = 0.31).
Conclusion: SOX9 expression was associated with the depth of tumor invasion, TNM stage, and poor OS of GC patients. SOX9 may be a potential prognostic factor for GC patients but needs further study.
Systematic Review Registration: PROSPERO, ID NUMBER 275712.
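As a reading aid for the pooled odds ratios above, a p-value can be approximately recovered from an OR and its 95% CI. The sketch assumes the interval is a Wald interval on the log-odds scale, which is common in meta-analysis software but is not stated in the abstract; the function name `p_from_or_ci` is illustrative.

```python
import math


def p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Approximate two-sided p-value from an odds ratio and its 95% CI.

    Assumes a Wald interval on the log scale, so the standard error of
    log(OR) is the log-width of the interval divided by 2 * z_crit.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Age: reported p = 0.128
print(round(p_from_or_ci(0.743, 0.507, 1.089), 3))
# Depth of invasion: reported p = 0.000, i.e. below 0.001 after rounding
print(p_from_or_ci(0.348, 0.247, 0.489) < 0.001)
```

Under that assumption, the age comparison reproduces the reported p = 0.128 to rounding, and the values printed as "p = 0.000" correspond to p < 0.001.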


2017 ◽  
Author(s):  
Federica Rosetta

Within the Open Science discussions, the current call for “reproducibility” comes from the rising awareness that results presented in research papers are not as easily reproducible as expected, and that some reproduction efforts have even contradicted the original results. In this context, transparency and openness are seen as key components to facilitate good scientific practices, as well as scientific discovery. As a result, many funding agencies now require the deposit of research data sets, institutions are improving training in the application of statistical methods, and journals are beginning to mandate a high level of detail on the methods and materials used. How can researchers be supported and encouraged to provide that level of transparency? An important component is the underlying research data, which is currently often only partly available within the article. At Elsevier we have therefore been working on journal data guidelines which clearly explain to researchers when and how they are expected to make their research data available. Simultaneously, we have also developed the corresponding infrastructure to make it as easy as possible for researchers to share their data in a way that is appropriate in their field. To ensure researchers get credit for the work they do on managing and sharing data, all our journals support data citation in line with the FORCE11 data citation principles, a key step toward addressing the lack of credit and incentives that emerged from the Open Data analysis (Open Data - the Researcher Perspective https://www.elsevier.com/about/open-science/research-data/open-data-report ) recently carried out by Elsevier together with CWTS. Finally, the presentation will also touch upon a number of initiatives to ensure the reproducibility of software, protocols and methods. With STAR methods, for instance, methods are submitted in a Structured, Transparent, Accessible Reporting format; this approach promotes rigor and robustness, and makes reporting easier for the author and replication easier for the reader.


2020 ◽  
Author(s):  
Shanmei Jiang ◽  
Yin He ◽  
Mengyuan Li ◽  
Xiaosheng Wang

Objectives: The cell cycle pathway regulating cell proliferation is overactivated in various cancers. Immune evasion is another important mechanism for tumor cell hyperproliferation. Nevertheless, the relationship between cell cycle and tumor immunity is still not fully understood.
Materials and Methods: Using the cancer genomics datasets for 10 cancer cohorts from The Cancer Genome Atlas (TCGA) program, we investigated the association between cell cycle activity (CCA) and anti-tumor immune signatures. We also explored the association between CCA and PD-L1 expression in these cancer cohorts. Moreover, we investigated the association between CCA and immunotherapy response in several cancer cohorts receiving immunotherapy.
Results: CCA likely exhibited positive associations with anti-tumor immune signatures (CD8+ T cell infiltration and immune cytolytic activity) in these cancer cohorts. The strong positive associations of CCA with DNA damage repair pathways and with tumor mutation load may explain the positive associations between CCA and anti-tumor immune signatures. Moreover, CCA displayed significant positive correlations with PD-L1 expression. Finally, we found that enhanced CCA tended to be associated with unfavorable clinical outcomes in the TCGA cancer cohorts, though no such association was observed in the cancer cohorts receiving immune checkpoint blockade therapy.
Conclusions: CCA has significant positive associations with both anti-tumor immune signatures and tumor immune-suppressive signatures in diverse cancer types. Our findings provide new insights into cancer biology and potential clinical implications for cancer immunotherapy.


2020 ◽  
pp. 958-971
Author(s):  
Marcel Ramos ◽  
Ludwig Geistlinger ◽  
Sehyun Oh ◽  
Lucas Schiffer ◽  
Rimsha Azhar ◽  
...  

PURPOSE: Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from > 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases.
METHODS: We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special considerations for loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are provided through out-of-memory data representation to provide responsive loading times and analysis capabilities on machines with limited memory.
RESULTS: We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide integrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses.
CONCLUSION: These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples.


2020 ◽  
Vol 21 (17) ◽  
pp. 6087
Author(s):  
Yunzhen Wei ◽  
Limeng Zhou ◽  
Yingzhang Huang ◽  
Dianjing Guo

Long noncoding RNA (lncRNA)/microRNA (miRNA)/mRNA triplets contribute to cancer biology. However, identifying significant triplets remains a major challenge for cancer research, and the dynamic changes among the factors of the triplets are poorly understood. Here, by integrating target information and expression datasets, we proposed a novel computational framework to identify triplets termed “lncRNA-perturbated triplets”. We applied the framework to five cancer datasets in The Cancer Genome Atlas (TCGA) project and identified 109 triplets. We showed that the paired miRNAs and mRNAs were widely perturbated by lncRNAs in different cancer types. LncRNA perturbators and lncRNA-perturbated mRNAs showed significantly higher evolutionary conservation than other lncRNAs and mRNAs. Importantly, the lncRNA-perturbated triplets exhibited high cancer specificity. The pan-cancer perturbator OIP5-AS1 had a higher expression level than the cancer-specific perturbators. These lncRNA perturbators were significantly enriched in known cancer-related pathways. Furthermore, among the 25 lncRNAs in the 109 triplets, lncRNA SNHG7 was identified as a stable potential biomarker in lung adenocarcinoma (LUAD) by combining the TCGA dataset and two independent GEO datasets. Results from cell transfection also indicated that overexpression of lncRNAs SNHG7 and TUG1 enhanced the expression of the corresponding mRNAs PNMA2 and CDC7 in LUAD. Our study provides a systematic dissection of lncRNA-perturbated triplets and facilitates our understanding of the molecular roles of lncRNAs in cancers.

