Reproducible Analyses in Education Research

2021 ◽  
Vol 45 (1) ◽  
pp. 195-222
Author(s):  
Brandon LeBeau ◽  
Scott Ellison ◽  
Ariel M. Aloe

A reproducible analysis is one in which an independent entity, using the same data and the same statistical code, would obtain the exact same result as the previous analyst. Reproducible analyses rely on script-based analyses and open data to aid in the reproduction of the analysis. A reproducible analysis does not ensure that the same results are obtained if another sample of data is collected, a property often referred to as replicability. Reproduction and replication of studies are discussed, as well as the overwhelming benefits of creating a reproducible analysis workflow. A tool is proposed to aid in the evaluation of studies by describing which elements of a study have a strong reproducible workflow and which could be improved. This tool is meant to serve as a discussion aid, not to rank studies or devalue those that are unable to share data or statistical code. Finally, reproducibility for qualitative studies is discussed, along with the unique challenges such studies face in adopting a reproducible analysis framework.
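As a minimal sketch of the script-based workflow described above (the file names, column name and seed are hypothetical, not the authors' materials), the entire analysis can live in one script that fixes its random seed, reads the openly shared data, and writes its result, so an independent analyst re-running it obtains the exact same output:

    # minimal_reproducible_analysis.py -- a hedged sketch, not the authors' code.
    # Assumes an openly shared CSV "open_data.csv" with a numeric column "score".
    import csv
    import random
    import statistics

    random.seed(2021)  # fix the seed so the resampling step is repeatable

    with open("open_data.csv", newline="") as f:
        scores = [float(row["score"]) for row in csv.DictReader(f)]

    # Bootstrap a 95% confidence interval for the mean; identical seed + data
    # + code implies the exact same interval for any independent analyst.
    boots = sorted(
        statistics.mean(random.choices(scores, k=len(scores)))
        for _ in range(10_000)
    )
    lo, hi = boots[249], boots[9749]

    with open("result.txt", "w") as out:
        out.write(f"mean={statistics.mean(scores):.4f} CI95=({lo:.4f}, {hi:.4f})\n")

Because the script, the data and the seed are all shared, "run the script" is the entire reproduction protocol.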


Electronics ◽  
2021 ◽  
Vol 10 (5) ◽  
pp. 621
Author(s):  
Giuseppe Psaila ◽  
Paolo Fosci

Internet technology and mobile technology have enabled the production and diffusion of massive data sets concerning almost every aspect of day-to-day life. Remarkable examples are social media and apps for volunteered information production, as well as Open Data portals on which public administrations publish authoritative and (often) geo-referenced data sets. In this context, JSON has become the most popular standard for representing and exchanging possibly geo-referenced data sets over the Internet. Analysts wishing to manage, integrate and cross-analyze such data sets need a framework that lets them access possibly remote storage systems for JSON data sets, retrieve and query data sets by means of a single query language (independent of the specific storage technology), and exploit possibly remote computational resources (such as cloud servers), all while working comfortably on the PCs in their offices, largely unaware of the real location of the resources. In this paper, we present the current state of the J-CO Framework, a platform-independent and analyst-oriented software framework for manipulating and cross-analyzing possibly geo-tagged JSON data sets. The paper presents the general approach behind the J-CO Framework, illustrating the query language through a simple yet non-trivial example of geographical cross-analysis. The paper also presents the novel features introduced by the re-engineered version of the execution engine and the most recent components, i.e., the storage service for large single JSON documents and the user interface that allows analysts to share data sets and computational resources with other analysts, possibly working in different parts of the world. Finally, the paper reports the results of an experimental campaign, which show that the execution engine performs more than satisfactorily, proving that our framework can actually be used by analysts to process JSON data sets.
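The J-CO query language itself is not reproduced here; the following Python sketch merely illustrates the kind of geographical cross-analysis it targets, joining two hypothetical GeoJSON-like data sets by testing points against each region's bounding box:

    # Hedged sketch of geographical cross-analysis over JSON data sets;
    # it does NOT use the J-CO query language. File names and fields are
    # hypothetical GeoJSON-like structures.
    import json

    with open("points.json") as f:    # e.g. geo-tagged social-media posts
        points = json.load(f)["features"]
    with open("regions.json") as f:   # e.g. areas from an Open Data portal
        regions = json.load(f)["features"]

    def bbox(ring):
        """Bounding box of a polygon ring given as [[lon, lat], ...]."""
        lons = [c[0] for c in ring]
        lats = [c[1] for c in ring]
        return min(lons), min(lats), max(lons), max(lats)

    # Cross-analysis: count points falling in each region's bounding box
    # (a real engine would test true polygon containment, not just the box).
    counts = {}
    for region in regions:
        w, s, e, n = bbox(region["geometry"]["coordinates"][0])
        name = region["properties"]["name"]
        counts[name] = sum(
            1 for p in points
            if w <= p["geometry"]["coordinates"][0] <= e
            and s <= p["geometry"]["coordinates"][1] <= n
        )

    print(json.dumps(counts, indent=2))

A dedicated query language makes exactly this kind of join declarative and storage-independent, which is the point of the framework.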


Author(s):  
M. A. Brovelli ◽  
D. Oxoli ◽  
M. A. Zurbarán

During the past years, Web 2.0 technologies have led to the emergence of platforms where users can share data related to their activities, which in some cases are then publicly released under open licenses. Popular categories include community platforms where users can upload GPS tracks collected during slow-travel activities (e.g., hiking, biking and horse riding) and platforms where users share their geolocated photos. However, due to the high heterogeneity of the information available on the Web, using only this user-generated content makes it an ambitious challenge to understand slow mobility flows and to detect the most visited locations in a region. Exploiting the data available on community sharing websites makes it possible to collect near real-time open data streams and enables rigorous spatio-temporal analysis. This work presents an approach for collecting, unifying and analysing pointwise geolocated open data available from different sources, with the aim of identifying the main locations and destinations of slow mobility activities. For this purpose, we collected pointwise open data from the Wikiloc platform, Twitter, Flickr and Foursquare. The analysis was confined to data uploaded in the Lombardy Region (Northern Italy), corresponding to millions of pointwise records. The collected data were processed with Free and Open Source Software (FOSS) and organized into a suitable database. This made it possible to run statistical analyses on the data distribution in both time and space, enabling the detection of users' slow mobility preferences as well as places of interest at a regional scale.
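As a minimal sketch of this kind of spatio-temporal aggregation (the record schema is hypothetical, and the actual work used FOSS tools and a database rather than a flat file), points can be binned into a regular grid whose densest cells approximate the most visited locations:

    # Hedged sketch: grid-based hotspot detection over pointwise records.
    # Assumes records like {"lon": 9.19, "lat": 45.46, "ts": "..."} harvested
    # from sources such as Wikiloc, Twitter, Flickr or Foursquare.
    import json
    from collections import Counter

    CELL = 0.01  # grid cell size in degrees (roughly 1 km at this latitude)

    with open("points.json") as f:
        records = json.load(f)

    cells = Counter(
        (round(r["lon"] / CELL) * CELL, round(r["lat"] / CELL) * CELL)
        for r in records
    )

    # The most frequently visited cells approximate places of interest.
    for (lon, lat), n in cells.most_common(10):
        print(f"cell ({lon:.2f}, {lat:.2f}): {n} points")

The same counting, grouped additionally by hour or month of the timestamp, yields the temporal side of the analysis.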


2019 ◽  
Author(s):  
Mithu Lucraft ◽  
Samuel Winthrop

Good data sharing can make research more productive and more likely to be cited, and can unlock innovation for the good of society. In 2019, a Springer Nature white paper (Lucraft et al. 2019), based on surveys of more than 11,000 researchers internationally, set out key challenges in data management and data sharing. We found that data sharing is increasing: more than 64% of researchers in a 2018 survey said they made their data openly available. The majority of researchers see data sharing as important: across three surveys, when asked about the importance of making data discoverable, researchers gave an average rating of 7.5 out of 10. Yet data sharing and planning are currently suboptimal: the majority of the research community is not yet managing or sharing data in ways that make it findable, accessible or reusable. Increasingly, funders and other expert groups (European Commission 2018) are emphasising the need for data that is FAIR (findable, accessible, interoperable and reusable). To move the needle on data sharing and to reap the benefits of more widely available open data, collaborative action is required. In this presentation, we will discuss the five measures we believe are needed to make data sharing the norm: clear policy, from funders, institutions, journals/publishers, and research communities themselves; better credit, to make data sharing worth a researcher's time; explicit funding, for data management and data sharing as well as data publishing; practical help, for organising data, finding appropriate repositories, and providing faster, easier routes to share data; and training and education, to answer researchers' common questions about data sharing and to help build skills and knowledge. We will draw on evidence and case studies from across Europe and beyond, as well as further feedback from our market research.


2021 ◽  
Author(s):  
Douglas F Porter ◽  
Raghav M Garg ◽  
Robin M Meyers ◽  
Weili Miao ◽  
Luca Ducoli ◽  
...  

The easyCLIP protocol describes a method for both standard CLIP library construction and the absolute quantification of RNA cross-linking rates, data that can usefully be combined to analyze RNA-protein interactions. Using these cross-linking metrics, significant interactions can be defined relative to a set of random non-RBPs. The original easyCLIP protocol did not use index reads, required custom sequencing primers, and did not have an easily reproducible analysis workflow. This short paper addresses these deficiencies. It also includes some additional technical experiments and investigates the use of alternative adapters. The results are intended to provide more options for easily performing and analyzing easyCLIP experiments.
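As an illustration of what index reads enable (this is not the easyCLIP pipeline; the index table and read layout are hypothetical), demultiplexing a pooled FASTQ file by a per-sample index can be sketched as:

    # Hedged sketch of demultiplexing reads by index sequence; NOT the
    # easyCLIP workflow, only an illustration of what indices buy you.
    import gzip

    INDEX = {"ACGTAC": "sample_A", "TGCAGT": "sample_B"}  # hypothetical

    handles = {s: open(f"{s}.fastq", "w") for s in INDEX.values()}
    with gzip.open("reads.fastq.gz", "rt") as f:
        while True:
            record = [f.readline() for _ in range(4)]  # FASTQ: 4 lines/read
            if not record[0]:
                break
            idx = record[1][:6]        # assume a 6-nt index starts the read
            sample = INDEX.get(idx)
            if sample:                 # discard reads with unknown indices
                handles[sample].writelines(record)
    for h in handles.values():
        h.close()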


2017 ◽  
Author(s):  
Michele B. Nuijten ◽  
Jeroen Borghuis ◽  
Coosje Lisabet Sterre Veldkamp ◽  
Linda Dominguez Alvarez ◽  
Marcel A. L. M. van Assen ◽  
...  

In this paper, we present three retrospective observational studies that investigate the relation between data sharing and statistical reporting inconsistencies. Previous research found that reluctance to share data was related to a higher prevalence of statistical errors, often in the direction of statistical significance (Wicherts, Bakker, & Molenaar, 2011). We therefore hypothesized that journal policies about data sharing and data sharing itself would reduce these inconsistencies. In Study 1, we compared the prevalence of reporting inconsistencies in two similar journals on decision making with different data sharing policies. In Study 2, we compared reporting inconsistencies in psychology articles published in PLOS journals (with a data sharing policy) and Frontiers in Psychology (without a stipulated data sharing policy). In Study 3, we looked at papers published in the journal Psychological Science to check whether papers with or without an Open Practice Badge differed in the prevalence of reporting errors. Overall, we found no relationship between data sharing and reporting inconsistencies. We did find that journal policies on data sharing are extremely effective in promoting data sharing. We argue that open data is essential in improving the quality of psychological science, and we discuss ways to detect and reduce reporting inconsistencies in the literature.
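Checks of this kind typically recompute a p value from the reported test statistic and degrees of freedom and compare it with the reported p (the approach popularised by the statcheck tool). A minimal sketch for t tests, with a hypothetical reported result:

    # Hedged sketch of a statcheck-style consistency test for t statistics:
    # recompute the two-sided p value from t and df and compare it with the
    # reported p. Requires scipy. Not the tooling used in these studies.
    import re
    from scipy import stats

    def check(sentence, tol=0.005):
        """Flag 't(df) = t, p = p' reports whose p does not match t and df."""
        m = re.search(r"t\((\d+)\)\s*=\s*([\d.-]+),\s*p\s*=\s*([\d.]+)", sentence)
        if not m:
            return None
        df, t, p_reported = int(m.group(1)), float(m.group(2)), float(m.group(3))
        p_recomputed = 2 * stats.t.sf(abs(t), df)  # two-sided p value
        return abs(p_recomputed - p_reported) > tol, p_recomputed

    # Hypothetical reported result: the recomputed p (about .036) matches.
    print(check("t(28) = 2.20, p = .036"))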


2020 ◽  
Vol 2 (4) ◽  
Author(s):  
Michelle Badri ◽  
Zachary D Kurtz ◽  
Richard Bonneau ◽  
Christian L Müller

Estimation of statistical associations in microbial genomic survey count data is fundamental to microbiome research. Experimental limitations, including count compositionality, low sample sizes and technical variability, obstruct standard application of association measures and require data normalization prior to statistical estimation. Here, we investigate the interplay between data normalization, microbial association estimation and available sample size by leveraging the large-scale American Gut Project (AGP) survey data. We analyze the statistical properties of two prominent linear association estimators, correlation and proportionality, under different sample scenarios and data normalization schemes, including RNA-seq analysis workflows and log-ratio transformations. We show that shrinkage estimation, a standard statistical regularization technique, can universally improve the quality of taxon–taxon association estimates for microbiome data. We find that large-scale association patterns in the AGP data can be grouped into five normalization-dependent classes. Using microbial association network construction and clustering as downstream data analysis examples, we show that variance-stabilizing and log-ratio approaches enable the most taxonomically and structurally coherent estimates. Taken together, the findings from our reproducible analysis workflow have important implications for microbiome studies in multiple stages of analysis, particularly when only small sample sizes are available.
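A minimal sketch of one such normalization-plus-shrinkage combination (a centred log-ratio transform followed by Ledoit-Wolf shrinkage, applied here to toy counts), not the authors' exact workflow:

    # Hedged sketch: centred log-ratio (CLR) normalization of a count matrix
    # followed by shrinkage estimation of the correlation matrix. One
    # combination in the spirit of the analyses above, on synthetic data.
    import numpy as np
    from sklearn.covariance import LedoitWolf

    rng = np.random.default_rng(0)
    counts = rng.poisson(5, size=(200, 50)) + 1  # 200 samples x 50 taxa,
                                                 # +1 pseudocount for the logs

    # CLR: log counts centred by each sample's mean log count, which removes
    # the compositional total from every sample.
    log_c = np.log(counts)
    clr = log_c - log_c.mean(axis=1, keepdims=True)

    # Ledoit-Wolf shrinkage covariance, rescaled to a correlation matrix.
    cov = LedoitWolf().fit(clr).covariance_
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)

    print(corr.shape, float(corr.max()), float(corr.min()))

Shrinkage pulls noisy off-diagonal estimates toward a structured target, which is why it helps most at small sample sizes.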


2020 ◽  
Vol 245 ◽  
pp. 06041
Author(s):  
Diego Rodríguez ◽  
Rokas Mačiulaitis ◽  
Jan Okraska ◽  
Tibor Šimko

We study the feasibility of running hybrid analysis pipelines in the REANA reproducible analysis platform. The REANA platform allows researchers to specify declarative computational workflow steps describing the analysis process and to execute the analysis workload on remote containerised compute clouds. We have designed an abstract job controller component that permits executing different parts of the analysis workflow on different compute backends, such as HTCondor, Kubernetes and SLURM. We have prototyped the designed solution, including the job execution, job monitoring, and input/output file-staging mechanisms between the various compute backends, and have tested the prototype using several particle physics model analyses. The present work introduces support for hybrid analysis workflows in the REANA reproducible analysis platform and paves the way towards studying the underlying performance advantages and challenges associated with hybrid analysis patterns in complex particle physics data analyses.
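The abstract job controller can be pictured as a simple dispatch layer; the following Python sketch is purely illustrative (the class and method names are hypothetical, and this is not REANA's actual implementation):

    # Hedged sketch of the abstract-job-controller pattern described above:
    # each workflow step declares a compute backend and the controller
    # dispatches it accordingly. Names are hypothetical, not REANA code.
    from dataclasses import dataclass

    @dataclass
    class Step:
        name: str
        command: str
        backend: str  # "kubernetes", "htcondor" or "slurm", as in the text

    class KubernetesBackend:
        def submit(self, step):
            print(f"[k8s]      launching container job {step.name}: {step.command}")

    class HTCondorBackend:
        def submit(self, step):
            print(f"[htcondor] queueing {step.name}: {step.command}")

    class SlurmBackend:
        def submit(self, step):
            print(f"[slurm]    batching {step.name}: {step.command}")

    BACKENDS = {"kubernetes": KubernetesBackend(),
                "htcondor": HTCondorBackend(),
                "slurm": SlurmBackend()}

    def run_hybrid(workflow):
        """Dispatch every step of a hybrid workflow to its declared backend."""
        for step in workflow:
            BACKENDS[step.backend].submit(step)  # file staging would go here

    run_hybrid([Step("skim",  "python skim.py",  "htcondor"),
                Step("fit",   "python fit.py",   "slurm"),
                Step("plots", "python plots.py", "kubernetes")])

Keeping the backend behind a single submit interface is what lets one declarative workflow span several schedulers.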


2020 ◽  
Vol 245 ◽  
pp. 08023
Author(s):  
Farid Ould-Saada

The ATLAS Collaboration at the Large Hadron Collider is releasing a new set of recorded and simulated data samples at a centre-of-mass energy of 13 TeV, collected in pp collisions at the LHC. This new dataset was designed after an in-depth review of the usage of the previous release of samples at 8 TeV. That review showed that capacity-building is one of the most important and abundant uses of public ATLAS samples. To fulfil the requirements of the community and at the same time attract new users and use cases, we developed real analysis software based on ROOT in two of the most popular programming languages: C++ and Python. These so-called analysis frameworks are complex enough to reproduce, with reasonable accuracy, the results (figures and final yields) of published ATLAS Collaboration physics papers, but still light enough to run on commodity hardware. With the computers that university students and regular classrooms typically have, students can explore LHC data with techniques similar to those used by current ATLAS analysers. We present the development path and the final result of these analysis frameworks, their products, and how they are distributed to end users inside and outside the ATLAS community.
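For a flavour of the Python side of such a framework, the sketch below reads one kinematic branch from an open-data ntuple with the uproot library and histograms it; the file name, tree name and branch name are assumptions to be checked against the ATLAS Open Data documentation, not guaranteed details of the release:

    # Hedged sketch: read an ATLAS Open Data ntuple in Python with uproot
    # and histogram a kinematic variable. The file name, tree name ("mini")
    # and branch name ("lep_pt") are assumptions; consult the ATLAS Open
    # Data documentation before running.
    import uproot
    import matplotlib.pyplot as plt

    tree = uproot.open("data_sample.root")["mini"]
    lep_pt = tree.arrays(["lep_pt"], library="np")["lep_pt"]

    # Flatten per-event lepton lists and convert MeV -> GeV before plotting.
    pts = [pt / 1000.0 for event in lep_pt for pt in event]

    plt.hist(pts, bins=50, range=(0, 200))
    plt.xlabel("Lepton pT [GeV]")
    plt.ylabel("Leptons / 4 GeV")
    plt.savefig("lep_pt.png")

A script of this size runs comfortably on classroom hardware, which is exactly the design target described above.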

