Data trajectories: tracking reuse of published data for transitive credit attribution

2016 ◽  
Vol 11 (1) ◽  
pp. 1-16 ◽  
Author(s):  
Paolo Missier

The ability to measure the use and impact of published data sets is key to the success of the open data/open science paradigm. A direct measure of impact would require tracking data (re)use in the wild, which is difficult to achieve. This is therefore commonly replaced by simpler metrics based on data download and citation counts. In this paper we describe a scenario where it is possible to track the trajectory of a dataset after its publication, and show how this enables the design of accurate models for ascribing credit to data originators. A Data Trajectory (DT) is a graph that encodes knowledge of how, by whom, and in which context data has been re-used, possibly after several generations. We provide a theoretical model of DTs that is grounded in the W3C PROV data model for provenance, and we show how DTs can be used to automatically propagate a fraction of the credit associated with transitively derived datasets, back to original data contributors. We also show this model of transitive credit in action by means of a Data Reuse Simulator. In the longer term, our ultimate hope is that credit models based on direct measures of data reuse will provide further incentives to data publication. We conclude by outlining a research agenda to address the hard questions of creating, collecting, and using DTs systematically across a large number of data reuse instances in the wild.
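The idea of propagating credit back along a chain of derived datasets can be illustrated with a minimal sketch. The abstract does not give the actual credit model, so everything here is an assumption made for illustration: the function name `propagate_credit`, the fixed fraction `alpha`, the equal split among parents, and the rule that a dataset keeps its own credit while passing a share upstream. The derivation graph is assumed to be acyclic.

```python
from collections import deque


def propagate_credit(derived_from, direct_credit, alpha=0.5):
    """Propagate credit from derived datasets back to their sources.

    derived_from:  dataset -> list of datasets it was derived from (assumed acyclic)
    direct_credit: dataset -> credit earned directly (e.g. from its own reuse)
    alpha:         hypothetical fraction of a dataset's total credit passed upstream,
                   split equally among its parents
    """
    nodes = set(direct_credit) | set(derived_from)
    for parents in derived_from.values():
        nodes.update(parents)

    # Count how many derived datasets still have to report to each node.
    children_left = {n: 0 for n in nodes}
    for parents in derived_from.values():
        for parent in parents:
            children_left[parent] += 1

    total = {n: float(direct_credit.get(n, 0.0)) for n in nodes}

    # Start from datasets nobody derived from, and work back towards the originals.
    ready = deque(n for n in nodes if children_left[n] == 0)
    while ready:
        node = ready.popleft()
        parents = derived_from.get(node, [])
        for parent in parents:
            total[parent] += alpha * total[node] / len(parents)
            children_left[parent] -= 1
            if children_left[parent] == 0:
                ready.append(parent)
    return total


# Hypothetical chain: D2 was derived from D1, which was derived from D0.
credit = propagate_credit(
    derived_from={"D1": ["D0"], "D2": ["D1"]},
    direct_credit={"D0": 0, "D1": 4, "D2": 8},
)
```

In this toy chain the original dataset `D0` earns credit only through its descendants, which is the transitive effect the abstract describes; a PROV-based model would derive the graph from recorded provenance rather than from a hand-written dictionary.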

2018 ◽  
Vol 6 (2) ◽  
pp. 144-156 ◽  
Author(s):  
Katherine Cook ◽  
Canan Çakirlar ◽  
Timothy Goddard ◽  
Robert Carl DeMuth ◽  
Joshua Wells

ABSTRACT Digital literacy has been cited as one of the primary challenges to ensuring data reuse and increasing the value placed on open science. Incorporating published data into classrooms and training is at the core of tackling this issue. This article presents case studies in teaching with different published data platforms, in three different countries (the Netherlands, Canada, and the United States), to students at different levels and with differing skill levels. In outlining their approaches, successes, and failures in teaching with open data, it is argued that collaboration with data publishers is critical to improving data reuse and education. Moreover, increased opportunities for digital skills training and scaffolding across program curriculum are necessary for managing the learning curve and teaching students the values of open science.


2017 ◽  
Author(s):  
Federica Rosetta

Within the Open Science discussions, the current call for “reproducibility” comes from the rising awareness that results as presented in research papers are not as easily reproducible as expected, and that some reproduction efforts have even contradicted the original results. In this context, transparency and openness are seen as key components to facilitate good scientific practices, as well as scientific discovery. As a result, many funding agencies now require the deposit of research data sets, institutions improve the training on the application of statistical methods, and journals begin to mandate a high level of detail on the methods and materials used. How can researchers be supported and encouraged to provide that level of transparency? An important component is the underlying research data, which is currently often only partly available within the article. At Elsevier we have therefore been working on journal data guidelines which clearly explain to researchers when and how they are expected to make their research data available. Simultaneously, we have also developed the corresponding infrastructure to make it as easy as possible for researchers to share their data in a way that is appropriate in their field. To ensure researchers get credit for the work they do on managing and sharing data, all our journals support data citation in line with the FORCE11 data citation principles – a key step in the direction of ensuring that we address the lack of credits and incentives which emerged from the Open Data analysis (Open Data - the Researcher Perspective https://www.elsevier.com/about/open-science/research-data/open-data-report ) recently carried out by Elsevier together with CWTS. Finally, the presentation will also touch upon a number of initiatives to ensure the reproducibility of software, protocols and methods. 
With STAR methods, for instance, methods are submitted in a Structured, Transparent, Accessible Reporting format; this approach promotes rigor and robustness, and makes reporting easier for the author and replication easier for the reader.


2001 ◽  
Vol 23 (2) ◽  
pp. 167 ◽  
Author(s):  
SD Whiting

There are few published studies of dive times of dugongs (Dugong dugon). Direct observations are problematic because D. dugon are shy and difficult to observe in the wild from boats without creating observer effects. Time Depth Recorders (TDRs) can record dive and surface times during dive behaviour, but there are no published data as yet for D. dugon using this technology. Although studies on dive times using TDRs result in larger data sets, their results are difficult to relate to particular behaviours such as foraging. This paper provides submergence and surface interval times for D. dugon obtained by direct observations in Darwin Harbour. Direct observations, although time consuming, can produce important information related to the ecology of D. dugon.


2013 ◽  
Author(s):  
Heather Piwowar ◽  
Todd J Vision

BACKGROUND: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation boost”. Furthermore, little is known about patterns in data reuse over time and across datasets. METHOD AND RESULTS: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation boost varied with date of dataset deposition: a citation boost was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. CONCLUSION: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation boost are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.


2015 ◽  
Vol 31 (4) ◽  
pp. 737-761 ◽  
Author(s):  
Matthias Templ

Abstract Scientific- or public-use files are typically produced by applying anonymisation methods to the original data. Anonymised data should have both low disclosure risk and high data utility. Data utility is often measured by comparing well-known estimates from original data and anonymised data, such as comparing their means, covariances or eigenvalues. However, it is a fact that not every estimate can be preserved. Therefore the aim is to preserve the most important estimates, that is, instead of calculating generally defined utility measures, evaluation on context/data dependent indicators is proposed. In this article we define such indicators and utility measures for the Structure of Earnings Survey (SES) microdata and proper guidelines for selecting indicators and models, and for evaluating the resulting estimates are given. For this purpose, hundreds of publications in journals and from national statistical agencies were reviewed to gain insight into how the SES data are used for research and which indicators are relevant for policy making. Besides the mathematical description of the indicators and a brief description of the most common models applied to SES, four different anonymisation procedures are applied and the resulting indicators and models are compared to those obtained from the unmodified data. The disclosure risk is reported and the data utility is evaluated for each of the anonymised data sets based on the most important indicators and a model which is often used in practice.
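Comparing an estimate before and after anonymisation can be sketched in a few lines. This is only an illustration under stated assumptions: the earnings values are made up, top-coding (capping values at a threshold) stands in for the four anonymisation procedures the article applies, and `relative_error` and `top_code` are hypothetical helper names, not the article's indicators.

```python
from statistics import mean, pvariance


def relative_error(original_estimate, anonymised_estimate):
    """Relative absolute difference between an estimate on original and anonymised data."""
    return abs(anonymised_estimate - original_estimate) / abs(original_estimate)


def top_code(values, cap):
    """Simple stand-in anonymisation: replace every value above `cap` with `cap`."""
    return [min(v, cap) for v in values]


# Hypothetical hourly earnings; the last record is an outlier at risk of disclosure.
earnings = [10.0, 12.0, 14.0, 16.0, 48.0]
anonymised = top_code(earnings, cap=20.0)

utility_loss = {
    "mean": relative_error(mean(earnings), mean(anonymised)),
    "variance": relative_error(pvariance(earnings), pvariance(anonymised)),
}
```

Here the variance is distorted far more than the mean, which shows in miniature why not every estimate can be preserved and why the choice of indicators depends on how the data are used.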


2021 ◽  
pp. 179-210
Author(s):  
Frank Miedema

Abstract Many initiatives addressing different types of problems of the practice of science and research have been described or cited in this book. Some were one-issue local actions, some took a broader approach at the national and some at EU level. Some stayed on, others faded after a few years. Many of the issues addressed by these movements and initiatives were part of the system of science and appeared to be systemically interdependent. This is how they converged and precipitated in the movement of Open Science, somewhere at the beginning of the second decade of this century. I discuss the major move that was made since 2015 in the EU to embrace the Open Science practice as the way science and research are being done in Europe. This elicited tensions that first and foremost relate to uncertainty regarding scholarly publishing, that is, how and where we publish open access, but also to what immediate sharing of data and results means in the daily practice of researchers, and how we value and give credit for papers and published data sets. It thus poses the question of how, if at all, we must compare incomparable academic work, how we get credit and build reputations in this new open practice of science. It is indeed believed that Open Science with its practice of responsible science will be a major contribution to address the dominant problems in science that we have analysed thus far, or at least will help to mitigate them. Open Science holds a promise to take science to the next phase as outlined in the previous chapters. That is not a romantic naive longing for the science that once was. It will be a truly novel but realistic way of doing scientific inquiry according to the pragmatic narrative pointed out. The Transition to Open Science, as can be anticipated from the analyses above, will not be trivial. 
The recent discussions have already shown that the transition to Open Science, even between EU member states, is a very different thing because of specific national, societal and academic contexts. I will conclude this chapter reporting some of my first-hand experiences, in Brussels and during visits to several EU member states in the course of a Mutual Learning Exercise, but also encounters in North America, South East Asia and South Africa where we in the past years have discussed Open Science. Although we know that science and scholarship have many forms and flavours and that, wherever you go, there is not one scientific community, for me discussing the Transition to Open Science in the past four years was really a Learning Exercise: an amazing, mostly encouraging, but many times quite shocking, even saddening adventure.


1994 ◽  
Vol 24 (3) ◽  
pp. 707-718 ◽  
Author(s):  
Paul Bebbington ◽  
Liz Kuipers

Synopsis We analysed aggregate data from 25 studies linking Expressed Emotion (EE) and schizophrenia. We had access to original data sets from 17 studies, and used published data from the remainder. This provided us with 1346 cases from around the world. The association of EE with relapse was overwhelming, and was maintained whatever the geographical location. The predictive capacity of EE was virtually identical in men and women. While high contact with a high EE relative increased the risk of relapse, the opposite was true in low EE households. Medication and EE were independently related to relapse, and thus EE status has no bearing on the decision to prescribe. Our findings were confirmed using log–linear analysis.


2018 ◽  
Author(s):  
Peter Branney ◽  
Kate Reid ◽  
Nollaig Frost ◽  
Susan Coan ◽  
Amy Mathieson ◽  
...  

To date, open science, and particularly open data, in Psychology, has focused on quantitative research. This paper aims to explore ethical and practical issues encountered by UK-based psychologists utilising open qualitative datasets. Semi-structured telephone interviews with eight qualitative psychologists were explored using a framework analysis. From the findings, we offer up a context-consent meta-framework as a resource to help in the design of studies sharing their data and/or studies using open data. We recommend ‘secondary’ studies conduct archaeologies of context and consent to examine if the data available is suitable for their research questions. In conclusion, this research is the first we know of in the study of ‘doing’ (or not doing) open science, which could be repeated to develop a longitudinal picture or complemented with additional approaches, such as observational studies of how context and consent are negotiated in pre-registered studies and open data.


2016 ◽  
Author(s):  
Bradly Alicea

ABSTRACT Participation in open data initiatives requires two semi-independent actions: the sharing of data produced by a researcher or group, and the consumption of shared data. Consumers of shared data range from people interested in validating the results of a given study to people who actively transform the available data. These data transformers are of particular interest because they add value to the shared data set through the discovery of new relationships and information which can in turn be shared with the same community. The complex and often reciprocal relationship between producers and consumers can be better understood using game theory, namely by using three variations of the Prisoners’ Dilemma (PD): a classical PD payoff matrix, a simulation of the PD n-person iterative model that tests three hypotheses, and an Ideological Game Theory (IGT) model used to formulate how sharing strategies might be implemented in a specific institutional culture. To motivate these analyses, data sharing is presented as a trade-off between economic and social payoffs. This is demonstrated as a series of payoff matrices describing situations ranging from ubiquitous acceptance of Open Science principles to a community standard of complete non-cooperation. Further context is provided through the IGT model, which allows for the modeling of cultural biases and beliefs that influence open science decision-making. A vision for building a CC-BY economy is then discussed using an approach called econosemantics, which complements the treatment of data sharing as a complex system of transactions enabled by social capital.
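A classical Prisoners’ Dilemma payoff matrix for data sharing can be written down as a small lookup table. The sketch below is illustrative only: the action names `share` and `withhold` and the payoff values (5, 3, 1, 0, the conventional textbook ordering) are assumptions, not the matrices used in the paper.

```python
ACTIONS = ("share", "withhold")

# (row player's action, column player's action) -> (row payoff, column payoff)
# Hypothetical values following the usual PD ordering: temptation > reward > punishment > sucker.
PAYOFFS = {
    ("share", "share"): (3, 3),        # both benefit from each other's data
    ("share", "withhold"): (0, 5),     # the sharer is exploited
    ("withhold", "share"): (5, 0),     # the free rider gains most
    ("withhold", "withhold"): (1, 1),  # complete non-cooperation
}


def best_response(opponent_action):
    """Return the action that maximises the row player's payoff against a fixed opponent."""
    return max(ACTIONS, key=lambda action: PAYOFFS[(action, opponent_action)][0])


# Withholding is the best reply to either opponent action...
dominant = {opponent: best_response(opponent) for opponent in ACTIONS}
# ...even though mutual sharing gives the community a higher combined payoff.
sharing_total = sum(PAYOFFS[("share", "share")])
withholding_total = sum(PAYOFFS[("withhold", "withhold")])
```

With these values, withholding dominates for each individual while mutual sharing is collectively better, which is the tension between individual and community payoffs that the abstract frames as a trade-off.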

