Data reuse and the open data citation advantage

2013 ◽  
Author(s):  
Heather Piwowar ◽  
Todd J Vision

BACKGROUND: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation boost”. Furthermore, little is known about patterns in data reuse over time and across datasets. METHOD AND RESULTS: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation boost varied with date of dataset deposition: the boost was clearest for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third-party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed had reused a dataset by year 2, 100 by year 4, and more than 150 by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. CONCLUSION: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation boost are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
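
The reported 9% boost (95% confidence interval: 5% to 13%) can be read as a multiplicative effect on citation counts. The sketch below is illustrative only and is not the authors' code: it assumes the regression was fitted on log-transformed citation counts, which the abstract does not state, and the helper names and the baseline of 100 citations are hypothetical. Under that assumption, a percent change p corresponds to a coefficient ln(1 + p/100) on the data-availability term.

```python
import math


def percent_to_log_coef(percent: float) -> float:
    """Coefficient on the log scale implied by a percent change (assumed log-linear model)."""
    return math.log(1.0 + percent / 100.0)


def log_coef_to_percent(beta: float) -> float:
    """Percent change in the outcome implied by a log-scale coefficient."""
    return (math.exp(beta) - 1.0) * 100.0


def expected_citations(baseline: float, beta: float) -> float:
    """Expected citations for a study with open data, given a baseline without it."""
    return baseline * math.exp(beta)


# Figures taken from the abstract: point estimate and 95% confidence bounds.
reported_percent = {"estimate": 9.0, "ci_low": 5.0, "ci_high": 13.0}
coefs = {name: percent_to_log_coef(p) for name, p in reported_percent.items()}

# Hypothetical baseline: a comparable study without public data cited 100 times.
BASELINE = 100.0
for name, beta in coefs.items():
    cites = expected_citations(BASELINE, beta)
    print(f"{name}: beta = {beta:.4f}, expected citations = {cites:.1f}")
```

Because the conversion is monotonic, the confidence bounds map directly from the percent scale to the log scale and back, so the interval can be reported on either scale without refitting.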


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e8682
Author(s):  
Yi-Shian Peng ◽  
Chia-Wei Tang ◽  
Yi-Yun Peng ◽  
Hung Chang ◽  
Chien-Lung Chen ◽  
...  

Background: Alzheimer’s disease (AD) is a prevalent, progressive human neurodegenerative disease whose cause remains unclear. Numerous anti-AD drugs based on the amyloid-β (Aβ) hypothesis of AD, initially considered highly promising, have failed recent late-phase tests. Natural aging (AG) is a major risk factor for AD. Here, we aim to gain insights into AD that may lead to novel therapeutic treatments by conducting meta-analyses of gene expression microarray data from AG and AD-affected brain. Methods: Five sets of gene expression microarray data from different regions of AD-affected brain (hereafter, ALZ when referring to data), and one set from AG, were analyzed using the methods of differentially expressed genes and differentially co-expressed gene pairs to identify putatively disrupted biological pathways and associated abnormal molecular contents. Results: Brain-region specificity among ALZ cases, and AG-ALZ differences in gene expression and in KEGG pathway disruption, were identified. Strong heterogeneity in AD signatures among the five brain regions was observed: HC/PC/SFG showed clear and pronounced AD signatures, MTG moderately so, and EC showed essentially none. There were stark differences between ALZ and AG. OXPHOS and Proteasome were the most disrupted pathways in HC/PC/SFG, while AG showed no OXPHOS disruption and relatively weak Proteasome disruption. Metabolism-related pathways, including TCA cycle and Pyruvate metabolism, were disrupted in ALZ but not in AG. Three pathways related to pathogenic infection were disrupted in ALZ. Many cancer- and signaling-related pathways were disrupted in AG but far less so in ALZ, and not at all in HC. We identified 54 “ALZ-only” differentially expressed genes, all down-regulated, which, when used to augment the gene list of the KEGG AD pathway, made it significantly more AD-specific.

