Roundtable: Challenges in repeatable experiments and reproducible research in data science

2021 ◽  
Vol 13 (2) ◽  
pp. 100-108
Author(s):  
K.V. Vorontsov ◽  
V.I. Iglovikov ◽  
V.V. Strijov ◽  
A.E. Ustuzhanin ◽  
A.S. Khritankov
Author(s):  
Yi Liu ◽  
Benjamin Elsworth ◽  
Pau Erola ◽  
Valeriia Haberland ◽  
Gibran Hemani ◽  
...  

Abstract. Motivation: The wealth of data resources on human phenotypes, risk factors, molecular traits and therapeutic interventions presents new opportunities for population health sciences. These opportunities are paralleled by a growing need for data integration, curation and mining to increase research efficiency, reduce mis-inference and ensure reproducible research. Results: We developed EpiGraphDB (https://epigraphdb.org/), a graph database containing an array of different biomedical and epidemiological relationships and an analytical platform to support their use in human population health data science. In addition, we present three case studies that illustrate the value of this platform. The first uses EpiGraphDB to evaluate potential pleiotropic relationships, addressing mis-inference in systematic causal analysis. In the second case study, we illustrate how protein–protein interaction data offer opportunities to identify new drug targets. The final case study integrates causal inference using Mendelian randomization with relationships mined from the biomedical literature to ‘triangulate’ evidence from different sources. Availability and implementation: The EpiGraphDB platform is openly available at https://epigraphdb.org. Code for replicating case study results is available at https://github.com/MRCIEU/epigraphdb as Jupyter notebooks using the API, and https://mrcieu.github.io/epigraphdb-r using the R package. Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Vicky Steeves

This is a self-archived version of an article published in Collaborative Librarianship. The content of this article is not different from what is in the journal (found here: http://digitalcommons.du.edu/collaborativelibrarianship/vol9/iss2/4 ). Recommended Citation: Steeves, Vicky (2017) "Reproducibility Librarianship," Collaborative Librarianship: Vol. 9: Iss. 2, Article 4. Available at: https://digitalcommons.du.edu/collaborativelibrarianship/vol9/iss2/4 Over the past few years, research reproducibility has been increasingly highlighted as a multifaceted challenge across many disciplines. There are socio-cultural obstacles as well as a constantly changing technical landscape that make replicating and reproducing research extremely difficult. Researchers face challenges in reproducing research across different operating systems and different versions of software, to name just a few of the many technical barriers. The prioritization of citation counts and journal prestige has undermined incentives to make research reproducible. While libraries have been building support around research data management and digital scholarship, reproducibility is an emerging area that has yet to be systematically addressed. To respond to this, New York University (NYU) created the position of Librarian for Research Data Management and Reproducibility (RDM & R), a dual appointment between the Center for Data Science (CDS) and the Division of Libraries. This report will outline the role of the RDM & R librarian, paying close attention to the collaboration between the CDS and Libraries to bring reproducible research practices into the norm.


Author(s):  
Christian González-Martel ◽  
José M. Cazorla-Artiles ◽  
Carlos J. Pérez-González

The increasing availability of open data resources provides opportunities for research and data science. Tools are needed that take full advantage of these new information resources. In this work we developed istacr, an R package that provides a collection of eurostat-like functions for consulting and downloading data, including functions to retrieve, download and manipulate the data sets available through the ISTAC BASE API of the Canary Institute of Statistics (ISTAC). In addition, a Shiny app was designed for responsive visualization of the data. This development responds to the growing demand for open data and for ecosystems dedicated to reproducible research in computational social science and digital humanities. For this reason, the package has been included in rOpenSpain, a project that aims to promote transparent research methods, mainly through the use of free software and open data in Spain.


2019 ◽  
Author(s):  
Louise J. Slater ◽  
Guillaume Thirel ◽  
Shaun Harrigan ◽  
Olivier Delaigue ◽  
Alexander Hurley ◽  
...  

Abstract. The open-source programming language R has gained a central place in the hydrological sciences over the last decade, driven by the availability of diverse hydro-meteorological data archives and the development of open-source computational tools. The growth of R's usage in hydrology is reflected in the number of newly published hydrological packages, the strengthening of online user communities, and the popularity of training courses and events. In this paper, we explore the benefits and advantages of R's usage in hydrology, such as: the democratization of data science and numerical literacy, the enhancement of reproducible research and open science, the access to statistical tools, the ease of connecting R to and from other languages, and the support provided by a growing community. This paper provides an overview of important packages at every step of the hydrological workflow, from the retrieval of hydro-meteorological data, to spatial analysis and cartography, hydrological modelling, statistics, and the design of static and dynamic visualizations, presentations and documents. We discuss some of the challenges that arise when using R in hydrology and useful tools to overcome them, including the use of hydrological libraries, documentation and vignettes (long-form guides that illustrate how to use packages); the role of Integrated Development Environments (IDEs); and the challenges of Big Data and parallel computing in hydrology. Last, this paper provides a roadmap for R's future within hydrology, with R packages as a driver of progress in the hydrological sciences, Application Programming Interfaces (APIs) providing new avenues for data acquisition and provision, enhanced teaching of hydrology in R, and the continued growth of the community via short courses and events.


2019 ◽  
Vol 23 (7) ◽  
pp. 2939-2963 ◽  
Author(s):  
Louise J. Slater ◽  
Guillaume Thirel ◽  
Shaun Harrigan ◽  
Olivier Delaigue ◽  
Alexander Hurley ◽  
...  

Abstract. The open-source programming language R has gained a central place in the hydrological sciences over the last decade, driven by the availability of diverse hydro-meteorological data archives and the development of open-source computational tools. The growth of R's usage in hydrology is reflected in the number of newly published hydrological packages, the strengthening of online user communities, and the popularity of training courses and events. In this paper, we explore the benefits and advantages of R's usage in hydrology, such as the democratization of data science and numerical literacy, the enhancement of reproducible research and open science, the access to statistical tools, the ease of connecting R to and from other languages, and the support provided by a growing community. This paper provides an overview of a typical hydrological workflow based on reproducible principles and packages for retrieval of hydro-meteorological data, spatial analysis, hydrological modelling, statistics, and the design of static and dynamic visualizations and documents. We discuss some of the challenges that arise when using R in hydrology and useful tools to overcome them, including the use of hydrological libraries, documentation, and vignettes (long-form guides that illustrate how to use packages); the role of integrated development environments (IDEs); and the challenges of big data and parallel computing in hydrology. Lastly, this paper provides a roadmap for R's future within hydrology, with R packages as a driver of progress in the hydrological sciences, application programming interfaces (APIs) providing new avenues for data acquisition and provision, enhanced teaching of hydrology in R, and the continued growth of the community via short courses and events.


2019 ◽  
Vol 1 (4) ◽  
pp. 381-392 ◽  
Author(s):  
Bei Yu ◽  
Xiao Hu

Reproducibility is a cornerstone of scientific research, and data science is no exception. In recent years, scientists have become concerned about the large number of irreproducible studies. Such a reproducibility crisis could severely undermine public trust in science and science-based public policy. Recent efforts to promote reproducible research have mainly focused on established scientists and much less on student training. In this study, we conducted action research with data science students to evaluate to what extent they are ready to communicate reproducible data analysis. The results show that although two-thirds of the students claimed they were able to reproduce results in peer reports, only one-third of the reports provided all the information necessary for replication. The actual replication results also included conflicting claims, and some lacked comparisons of original and replication results, indicating that students did not share a consistent understanding of what reproducibility means and how to report replication results. The findings suggest that more training is needed to help data science students communicate reproducible data analysis.


2020 ◽  
Author(s):  
Yi Liu ◽  
Benjamin Elsworth ◽  
Pau Erola ◽  
Valeriia Haberland ◽  
Gibran Hemani ◽  
...  

Abstract. Motivation: The wealth of data resources on human phenotypes, risk factors, molecular traits and therapeutic interventions presents new opportunities for population health sciences. These opportunities are paralleled by a growing need for data integration, curation and mining to increase research efficiency, reduce mis-inference and ensure reproducible research. Results: We developed EpiGraphDB (https://epigraphdb.org/), a graph database containing an array of different biomedical and epidemiological relationships and an analytical platform to support their use in human population health data science. In addition, we present three case studies that illustrate the value of this platform. The first uses EpiGraphDB to evaluate potential pleiotropic relationships, addressing mis-inference in systematic causal analysis. In the second case study we illustrate how protein-protein interaction data offer opportunities to identify new drug targets. The final case study integrates causal inference using Mendelian randomization with relationships mined from the biomedical literature to “triangulate” evidence from different sources. Availability: The EpiGraphDB platform is openly available at https://epigraphdb.org. Code for replicating case study results is available at https://github.com/MRCIEU/epigraphdb as Jupyter notebooks using the API, and https://mrcieu.github.io/epigraphdb-r using the R package.


Author(s):  
Charles Bouveyron ◽  
Gilles Celeux ◽  
T. Brendan Murphy ◽  
Adrian E. Raftery
