Data reuse and scholarly reward: understanding practice and building infrastructure

2013 ◽  
Author(s):  
Todd J Vision ◽  
Heather A Piwowar

Recently introduced funding agency policies seek to increase the availability of data from individual published studies for reuse by the research community at large. The success of such policies can be measured both by data input (“is useful data being made available?”) and research output (“are these data being reused by others?”). A key determinant of data input is the extent to which data producers receive adequate professional credit for making data available. One of us (HP) previously reported a large citation difference for published microarray studies with and without data available in a public repository. Analysis of a much larger sample, with more covariates, provides a more reliable estimate of this citation boost, as well as additional insights into patterns of reuse and how the availability of data affects publication impact. A more recent study tracking the reuse of 100 datasets from each of ten different primary data repositories reveals large variation in patterns of reuse and citation. Our findings (a) illuminate ways in which reuses of archived data tend to differ in purpose from the uses made by the original producers; (b) inform data archiving policy, such as how long data embargoes need to be in order to protect the proprietary interests of producers; and (c) allow us to answer the vexing question of what the return on investment is for data archiving. In conducting these studies, we have become aware of gaps in data citation practice and infrastructure that limit the extent to which researchers receive credit for their contributions. We describe early efforts to bake good data citation and usage tracking into cyberinfrastructure as part of DataONE, the Data Observation Network for Earth. Finally, we introduce total-impact, a tool that allows researchers to track the diverse impacts of all their research outputs, including data, and empowers them to be recognized for their scholarly work on their own terms.
Software and Data Availability: Research software and data: https://github.com/hpiwowar (CCZero for data where possible, MIT for code); Dryad: new BSD license: http://code.google.com/p/dryad; DataONE: Apache license: http://www.dataone.org/developer-resources; total-impact: MIT license: https://github.com/total-impact. This is an abstract that was submitted to the iEvoBio 2012 conference, held on July 10-11, 2012, in Ottawa, Canada.


10.2196/17687 ◽  
2020 ◽  
Vol 4 (8) ◽  
pp. e17687
Author(s):  
Kristina K Gagalova ◽  
M Angelica Leon Elizalde ◽  
Elodie Portales-Casamar ◽  
Matthias Görges

Background: Integrated data repositories (IDRs), also referred to as clinical data warehouses, are platforms used for the integration of several data sources through specialized analytical tools that facilitate data processing and analysis. IDRs offer several opportunities for clinical data reuse, and the number of institutions implementing an IDR has grown steadily in the past decade.
Objective: The architectural choices of major IDRs are highly diverse and determining their differences can be overwhelming. This review aims to explore the underlying models and common features of IDRs, provide a high-level overview for those entering the field, and propose a set of guiding principles for small- to medium-sized health institutions embarking on IDR implementation.
Methods: We reviewed manuscripts published in peer-reviewed scientific literature between 2008 and 2020, and selected those that specifically describe IDR architectures. Of 255 shortlisted articles, we found 34 articles describing 29 different architectures. The different IDRs were analyzed for common features and classified according to their data processing and integration solution choices.
Results: Despite common trends in the selection of standard terminologies and data models, the IDRs examined showed heterogeneity in the underlying architecture design. We identified 4 common architecture models that use different approaches for data processing and integration. These different approaches were driven by a variety of features such as data sources, whether the IDR was for a single institution or a collaborative project, the intended primary data user, and purpose (research-only or including clinical or operational decision making).
Conclusions: IDR implementations are diverse and complex undertakings, which benefit from being preceded by an evaluation of requirements and definition of scope in the early planning stage. Factors such as data source diversity and intended users of the IDR influence data flow and synchronization, both of which are crucial factors in IDR architecture planning.


2021 ◽  
Author(s):  
Iain Hrynaszkiewicz ◽  
James Harney ◽  
Lauren Cadwallader

PLOS has long supported Open Science. One of the ways in which we do so is via our stringent data availability policy established in 2014. Despite this policy, and more data sharing policies being introduced by other organizations, best practices for data sharing are adopted by a minority of researchers in their publications. Problems with effective research data sharing persist and these problems have been quantified by previous research as a lack of time, resources, incentives, and/or skills to share data. In this study we built on this research by investigating the importance of tasks associated with data sharing, and researchers’ satisfaction with their ability to complete these tasks. By investigating these factors we aimed to better understand opportunities for new or improved solutions for sharing data. In May-June 2020 we surveyed researchers from Europe and North America to rate tasks associated with data sharing on (i) their importance and (ii) their satisfaction with their ability to complete them. We received 728 completed and 667 partial responses. We calculated mean importance and satisfaction scores to highlight potential opportunities for new solutions and to compare different cohorts. Tasks relating to research impact, funder compliance, and credit had the highest importance scores. 52% of respondents reuse research data but the average satisfaction score for obtaining data for reuse was relatively low. Tasks associated with sharing data were rated somewhat important and respondents were reasonably well satisfied in their ability to accomplish them. Notably, this included tasks associated with best data sharing practice, such as use of data repositories.
However, the most common method for sharing data was in fact via supplemental files with articles, which is not considered to be best practice. We presume that researchers are unlikely to seek new solutions to a problem or task that they are satisfied in their ability to accomplish, even if many do not attempt this task. This implies there are few opportunities for new solutions or tools to meet these researcher needs. Publishers can likely meet these needs for data sharing by working to seamlessly integrate existing solutions that reduce the effort or behaviour change involved in some tasks, and focusing on advocacy and education around the benefits of sharing data. There may however be opportunities (unmet researcher needs) in relation to better supporting data reuse, which could be met in part by strengthening data sharing policies of journals and publishers, and improving the discoverability of data associated with published articles.
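The scoring step described in this abstract (mean importance and mean satisfaction per task) can be illustrated with a small sketch. The task names, the ratings, and the assumption of a numeric rating scale are invented for illustration; the abstract does not state the scale or any per-task values, and this is not the authors' analysis code.

```python
from statistics import mean

def mean_scores(responses):
    """Return {task: (mean importance, mean satisfaction)}.

    `responses` is an iterable of (task, importance, satisfaction) tuples.
    The numeric rating scale is an assumption made for this sketch.
    """
    by_task = {}
    for task, importance, satisfaction in responses:
        imp_list, sat_list = by_task.setdefault(task, ([], []))
        imp_list.append(importance)
        sat_list.append(satisfaction)
    return {task: (mean(imp), mean(sat)) for task, (imp, sat) in by_task.items()}

# Invented example ratings, not survey data.
example_responses = [
    ("obtain data for reuse", 5, 2),
    ("obtain data for reuse", 4, 3),
    ("deposit data in a repository", 3, 4),
]
example_scores = mean_scores(example_responses)
```

A task with a high mean importance and a low mean satisfaction would be the kind of gap the abstract calls an opportunity; comparing cohorts would amount to running the same function on each cohort's responses separately.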


2015 ◽  
Author(s):  
Joan Starr ◽  
Eleni Castro ◽  
Mercè Crosas ◽  
Michel Dumontier ◽  
Robert R. Downs ◽  
...  

Reproducibility and reusability of research results is an important concern in scientific communication and science policy. A foundational element of reproducibility and reusability is the open and persistently available presentation of research data. However, many common approaches for primary data publication in use today do not achieve sufficient long-term robustness, openness, accessibility or uniformity. Nor do they permit comprehensive exploitation by modern Web technologies. This has led to several authoritative studies recommending uniform direct citation of data archived in persistent repositories. Data are to be considered as first-class scholarly objects, and treated similarly in many ways to cited and archived scientific and scholarly literature. Here we briefly review the most current and widely agreed set of principle-based recommendations for scholarly data citation, the Joint Declaration of Data Citation Principles (JDDCP). We then present a framework for operationalizing the JDDCP; and a set of initial recommendations on identifier schemes, identifier resolution behavior, required metadata elements, and best practices for realizing programmatic machine actionability of cited data. The main target audience for the common implementation guidelines in this article consists of publishers, scholarly organizations, and persistent data repositories, including technical staff members in these organizations. But ordinary researchers can also benefit from these recommendations. The guidance provided here is intended to help achieve widespread, uniform human and machine accessibility of deposited data, in support of significantly improved verification, validation, reproducibility and re-use of scholarly/scientific data.
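As a rough illustration of what a machine-actionable data citation built from a persistent identifier and a few metadata elements might look like, the sketch below assembles a human-readable citation and a resolvable URL. The field names, the citation layout, the example DOI `10.1234/example`, and the choice of `https://doi.org/` as resolver are assumptions made for this sketch; they are not the required metadata elements or formats recommended in the article.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class DatasetCitation:
    """Hypothetical minimal metadata for citing a deposited dataset."""
    creators: Tuple[str, ...]
    year: int
    title: str
    repository: str
    identifier: str                 # assumed to be a DOI in this sketch
    version: Optional[str] = None

    def resolver_url(self) -> str:
        # Assumes DOI resolution through the doi.org proxy.
        return f"https://doi.org/{self.identifier}"

    def as_text(self) -> str:
        # One possible layout; journals and repositories differ.
        authors = "; ".join(self.creators)
        version = f" Version {self.version}." if self.version else ""
        return (f"{authors} ({self.year}). {self.title}.{version} "
                f"{self.repository}. {self.resolver_url()}")

# Invented example record.
example_citation = DatasetCitation(
    creators=("Doe, J.", "Roe, R."),
    year=2015,
    title="Example survey data",
    repository="Example Repository",
    identifier="10.1234/example",
    version="2",
)
```

Keeping the identifier separate from the display string is the point of the design: people read the formatted text, while software follows the resolver URL.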


2020 ◽  
Vol 3 (1) ◽  
Author(s):  
Stephanie D. Jurburg ◽  
Maximilian Konzack ◽  
Nico Eisenhauer ◽  
Anna Heintz-Buschart

As DNA sequencing has become more popular, the public genetic repositories where sequences are archived have experienced explosive growth. These repositories now hold invaluable collections of sequences, e.g., for microbial ecology, but whether these data are reusable has not been evaluated. We assessed the availability and state of 16S rRNA gene amplicon sequences archived in public genetic repositories (SRA, EBI, and DDBJ). We screened 26,927 publications in 17 microbiology journals, identifying 2,015 16S rRNA gene sequencing studies. Of these, 7.2% had not made their data public at the time of analysis. Among a subset of 635 studies sequencing the same gene region, 40.3% contained data that were not available or not reusable, and an additional 25.5% contained faults in data formatting or data labeling, creating obstacles for data reuse. Our study reveals gaps in data availability, identifies major contributors to data loss, and offers suggestions for improving data archiving practices.


2020 ◽  
Vol 9 ◽  
pp. 03-16
Author(s):  
Marc Garellek ◽  
Adrian Simpson ◽  
Timo B. Roettger ◽  
Daniel Recasens ◽  
Oliver Niebuhr ◽  
...  

It is not yet standard practice in phonetics to provide access to audio files along with submissions to journals. This is paradoxical in view of the importance of data for phonetic research: from audio signals to the whole range of data acquired in phonetic experiments. The phonetic sciences stand to gain greatly from data availability: what is at stake is no less than reproducibility and cumulative progress. We will argue that a collective turn to Open Science holds great promise for phonetics. First, simple reflections on why access to primary data matters are recapitulated and proposed as a basis for consensus. Next, possible drawbacks of data availability are addressed. Finally, we argue that data curation and archiving are to be recognized as part of the same activity that results in the publication of research papers, rather than attempting to build a parallel system to incentivize data archiving by itself.


Author(s):  
José Augusto Salim ◽  
Paula Zermoglio ◽  
Debora Drucker ◽  
Filipi Soares ◽  
Antonio Saraiva ◽  
...  

Human demands on resources such as food and energy are increasing through time while global challenges such as climate change and biodiversity loss are becoming more complex to overcome, as well as more widely acknowledged by societies and governments. Reports from initiatives like the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES) have demanded quick and reliable access to high-quality spatial and temporal data of species occurrences, their interspecific relations and the effects of the environment on biotic interactions. Mapping species interactions is crucial to understanding and conserving ecosystem functioning and all the services it can provide (Tylianakis et al. 2010, Slade et al. 2017). Detailed data has the potential to improve our knowledge about ecological and evolutionary processes guided by interspecific interactions, as well as to assist in planning and decision making for biodiversity conservation and restoration (Menz et al. 2011). Although a great effort has been made to successfully standardize and aggregate species occurrence data, a formal standard to support biotic interaction data sharing and interoperability is still lacking. Different biological interactions can be studied, such as predator-prey, host-parasite and pollinator-plant, and a variety of data practices and data representation procedures can be used. Plant-pollinator interactions are recognized in many sources from the scientific literature (Abrol 2012, Ollerton 2021) for their importance to ecosystem functioning and sustainable agriculture. Primary data about pollination are becoming increasingly available online and can be accessed from a great number of data repositories. While a vast quantity of data on interactions, and on pollination in particular, is available, data are not integrated among sources, largely because of a lack of appropriate standards.
We present a vocabulary of terms for sharing plant-pollinator interactions using one of the existing extensions to the Darwin Core standard (Wieczorek et al. 2012). In particular, the vocabulary is meant to be used for the term measurementType of the Extended Measurement Or Facts extension. The vocabulary was developed by a community of specialists in pollination biology and information science, including members of the TDWG Biological Interaction Data Interest Group, during almost four years of collaborative work. The vocabulary introduces 40 new terms, covering many aspects of plant-pollinator interactions, and can be used to capture information produced by studies with different approaches and scales. The plant-pollinator interactions vocabulary is mainly a set of terms that can be both understood by people and interpreted by machines. It consists of a defined set of terms and descriptive documents explaining how the vocabulary is to be used. The terms in the vocabulary are divided into six categories: Animal, Plants, Flower, Interaction, Reproductive Success and Nectar Dynamics. The categories are not formally part of the vocabulary; they are used only to organize the vocabulary and to facilitate understanding by humans. We expect that the plant-pollinator vocabulary will contribute to data aggregation from a variety of sources worldwide at higher levels than we have experienced, significantly amplify plant-pollinator data availability for global synthesis, and contribute to knowledge in conservation and sustainable use of biodiversity.
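To make the role of measurementType concrete, here is a minimal sketch of how observations attached to one occurrence could be laid out as Extended Measurement Or Facts rows. The column names are Darwin Core terms, but the measurementType values (`exampleVisitDuration`, `exampleFlowersVisited`), the identifiers, and the numbers are placeholders invented for this sketch; they are not terms from the vocabulary described above.

```python
import csv
import io

# Darwin Core terms used as column headers in this sketch.
EMOF_FIELDS = [
    "occurrenceID",
    "measurementID",
    "measurementType",
    "measurementValue",
    "measurementUnit",
]

def emof_rows(occurrence_id, measurements):
    """Build one row per (measurementType, value, unit) for a single occurrence."""
    rows = []
    for index, (mtype, value, unit) in enumerate(measurements, start=1):
        rows.append({
            "occurrenceID": occurrence_id,
            "measurementID": f"{occurrence_id}-m{index}",  # hypothetical ID scheme
            "measurementType": mtype,
            "measurementValue": value,
            "measurementUnit": unit,
        })
    return rows

def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=EMOF_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder measurement types, not actual vocabulary terms.
example_rows = emof_rows("occ-1", [
    ("exampleVisitDuration", 12, "s"),
    ("exampleFlowersVisited", 3, ""),
])
example_csv = to_csv(example_rows)
```

Because every measurement is a separate row keyed by occurrenceID, new kinds of observation can be added by introducing a new measurementType value rather than a new column, which is what lets a controlled vocabulary do the standardizing.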

