Packaging data analytical work reproducibly using R (and friends)

Author(s):  
Ben Marwick ◽  
Carl Boettiger ◽  
Lincoln Mullen

Computers are a central tool in the research process, enabling complex and large-scale data analysis. As computer-based research has increased in complexity, so have the challenges of ensuring that this research is reproducible. To address this challenge, we review the concept of the research compendium as a solution for providing a standard and easily recognisable way of organising the digital materials of a research project to enable other researchers to inspect, reproduce, and extend the research. We investigate how the structure and tooling of software packages of the R programming language are being used to produce research compendia in a variety of disciplines. We also describe how software engineering tools and services are being used by researchers to streamline working with research compendia. Using real-world examples, we show how researchers can improve the reproducibility of their work using research compendia based on R packages and related tools.
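A minimal sketch of what an R-package-shaped compendium can look like, scaffolded with the 'usethis' package (the package name "mypaper" and the analysis/ directory layout are illustrative conventions assumed here, not a fixed standard from the paper):

# Sketch: scaffold a research compendium as an R package with 'usethis'.
library(usethis)

create_package("mypaper")            # DESCRIPTION, NAMESPACE, R/ -- the package skeleton
dir.create("mypaper/analysis")       # manuscript and figures (illustrative convention)
dir.create("mypaper/analysis/data")  # raw data kept out of R/ (illustrative)

# Functions used by the analysis live in R/ and can be loaded and checked
# with the usual package tooling, e.g.:
# devtools::load_all("mypaper"); devtools::check("mypaper")

Because the compendium is a valid R package, standard tooling (R CMD check, devtools, continuous-integration services) can verify that the code actually runs, which is what makes this layout useful for reproducibility.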


2017 ◽  
Vol 9 (1) ◽  
pp. 65-78
Author(s):  
Konrad Grzanek

Abstract Dynamic typing in the R programming language can cause quality problems in the large-scale data-science and machine-learning projects for which the language is used. Following our efforts to provide a gradual typing library for Clojure, we present chR, a library that offers run-time type-related checks in R. The solution is not only a dynamic type checker; it also helps to systematize thinking about types in the language, while offering high expressiveness and full adherence to a functional programming style.
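The abstract does not spell out chR's API, so the following is only a generic illustration of the run-time type-check idea in base R; it is not chR's interface:

# Generic run-time type check, NOT the chR API: verify a value against a
# predicate on every call and fail loudly with a readable message.
check_type <- function(value, predicate, expected) {
  if (!predicate(value)) {
    stop(sprintf("type check failed: expected %s, got %s",
                 expected, paste(class(value), collapse = "/")),
         call. = FALSE)
  }
  invisible(value)
}

safe_mean <- function(x) {
  check_type(x, is.numeric, "numeric vector")  # checked at run time
  mean(x)
}

safe_mean(c(1, 2, 3))  # 2
# safe_mean("a")       # error: expected numeric vector, got character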


Author(s):  
Roger S. Bivand

Abstract Twenty years have passed since Bivand and Gebhardt (J Geogr Syst 2(3):307–317, 2000. 10.1007/PL00011460) indicated that there was a good match between the then nascent open-source R programming language and environment and the needs of researchers analysing spatial data. Recalling the development of classes for spatial data presented in book form in Bivand et al. (Applied spatial data analysis with R. Springer, New York, 2008, Applied spatial data analysis with R, 2nd edn. Springer, New York, 2013), it is important to present the progress now occurring in representation of spatial data, and possible consequences for spatial data handling and the statistical analysis of spatial data. Beyond this, it is imperative to discuss the relationships between R-spatial software and the larger open-source geospatial software community on whose work R packages crucially depend.
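As a concrete illustration of the newer simple-features representation discussed here, the 'sf' package (which wraps the open-source GDAL, GEOS and PROJ libraries on which R-spatial packages crucially depend) ships with a small example shapefile:

# Read a shapefile bundled with 'sf' into a simple-features data frame.
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
class(nc)              # "sf" "data.frame": geometries plus attributes
st_crs(nc)             # coordinate reference system, handled via PROJ
plot(st_geometry(nc))  # North Carolina county boundaries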


Author(s):  
Glenn-Peter Sætre ◽  
Mark Ravinet

Evolutionary genetics is the study of how genetic variation leads to evolutionary change. With the recent explosion in the availability of whole genome sequence data, vast quantities of genetic data are being generated at an ever-increasing pace with the result that programming has become an essential tool for researchers. Most importantly, a thorough understanding of evolutionary principles is essential for making sense of this genetic data. This up-to-date textbook covers all the major components of modern evolutionary genetics, carefully explaining fundamental processes such as mutation, natural selection, genetic drift, and speciation, together with their consequences. In addition to the text, study questions are provided to motivate the reader to think and reflect on the concepts in each chapter. Practical experience is essential when it comes to developing an understanding of how to use genetic data to analyze and address interesting questions in the life sciences and how to interpret results in meaningful ways. Throughout the book, a series of online, computer-based tutorials serves as an introduction to programming and analysis of evolutionary genetic data centered on the R programming language, which stands out as an ideal all-purpose platform to handle and analyze such data. The book and its online materials take full advantage of the authors’ own experience in working in a post-genomic revolution world, and introduce readers to the plethora of molecular and analytical methods that have only recently become available.
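As a flavour of the kind of exercise such R tutorials typically contain (this sketch is illustrative, not taken from the book's materials), genetic drift at a single biallelic locus can be simulated in a few lines by binomially resampling the allele frequency each generation:

# Wright-Fisher drift: p is resampled each generation from 2N gene copies.
drift <- function(p0 = 0.5, N = 50, generations = 100) {
  p <- numeric(generations)
  p[1] <- p0
  for (t in 2:generations) {
    p[t] <- rbinom(1, size = 2 * N, prob = p[t - 1]) / (2 * N)
  }
  p
}

set.seed(1)
matplot(replicate(10, drift()), type = "l", lty = 1,
        xlab = "Generation", ylab = "Allele frequency p")

Smaller N gives faster random fixation or loss of the allele, which is the core intuition behind drift.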


2014 ◽  
Vol 513-517 ◽  
pp. 1752-1755 ◽  
Author(s):  
Chun Liu ◽  
Kun Tan

For a safety-critical computer, large-scale data such as a database that must be transferred in a short time cannot be voted on directly. This paper proposes a database update algorithm for safety-critical computers based on a status vote, which votes on the database status instead of the database itself. The algorithm solves the problem of voting on too much data in a short time, and compares the database versions of different modules in real time. A Markov model is built to calculate the safety and reliability of the algorithm. The results show that the algorithm meets the update requirements of a safety-critical computer.
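The core idea, sketched in R for illustration (this is not the paper's implementation; the status strings are placeholders): each redundant module publishes a compact status, such as a version tag or checksum of its database, and the voter compares these statuses by majority instead of comparing the full data.

# Majority vote over database *status* values rather than the data itself.
status_vote <- function(statuses) {
  counts <- table(statuses)
  winner <- names(counts)[which.max(counts)]
  list(
    agreed = max(counts) > length(statuses) / 2,  # strict majority reached?
    status = winner,                              # the majority status
    faulty = which(statuses != winner)            # dissenting module indices
  )
}

# Example: module 3 reports a stale database version.
status_vote(c("v2.1-9f3a", "v2.1-9f3a", "v2.0-77c1"))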


Author(s):  
Polina Lemenkova

The main purpose of this article is to present the use of the R programming language in cartographic visualization, demonstrating machine learning methods in geographic education. Current trends in education technologies are largely influenced by the possibilities of distance learning, e-learning and self-learning. In view of this, the main tendencies in modern geographic education include active use of open-source GIS and publicly available free geospatial datasets that can be used by students for cartographic exercises, data visualization and mapping, both at intermediate and advanced levels. This paper contributes to the development of these methods and is fully based on datasets and tools available to every student: the R programming language and free open-source datasets. The case studies demonstrated in this paper show examples of both physical-geographic mapping (geomorphology) and socio-economic geography (regional mapping) that can be used in classes and in self-learning. The objectives of this research include geomorphological modelling of the terrain relief of Italy and regional mapping. The data include the SRTM90 DEM and datasets on the regional borders of Italy embedded in the R packages 'maps' and 'mapdata'. The modelling addresses the characteristics of slope, aspect, hillshade and elevation, visualized using the R packages 'raster' and 'tmap'. Regional mapping of Italy was made using the 'ggmap' package, with 'ggplot2' as a wrapper. The results present five thematic maps (slope, aspect, hillshade, elevation and regions of Italy) created in the R language. Traditionally used in statistical analysis, R is less well known as an excellent tool for geographic education. This paper contributes to the development of methods in geographic education by presenting new machine-learning-based methods of mapping.
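A condensed sketch of the terrain part of this workflow, using the 'raster' package functions that implement these operations (the package names come from the abstract; the input file name is a placeholder):

# Slope, aspect and hillshade from a DEM with the 'raster' package.
library(raster)

dem    <- raster("srtm90_italy.tif")     # placeholder: an SRTM90 DEM tile
slope  <- terrain(dem, opt = "slope")    # slope, in radians
aspect <- terrain(dem, opt = "aspect")   # aspect, in radians
hills  <- hillShade(slope, aspect, angle = 45, direction = 315)

plot(hills, col = grey(0:100 / 100), legend = FALSE, main = "Hillshade")
plot(dem, col = terrain.colors(25), alpha = 0.4, add = TRUE)  # drape elevation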


2021 ◽  
Author(s):  
Daniel Lüdecke ◽  
Indrajeet Patil ◽  
Mattan S. Ben-Shachar ◽  
Brenton M. Wiernik ◽  
Philip Waggoner ◽  
...  

The see package is embedded in the easystats ecosystem, a collection of R packages that operate in synergy to provide a consistent and intuitive syntax when working with statistical models in the R programming language (R Core Team, 2021). Most easystats packages return comprehensive numeric summaries of model parameters and performance. The see package complements these numeric summaries with a host of functions and tools to produce a range of publication-ready visualizations for model parameters, predictions, and performance diagnostics. As a core pillar of easystats, the see package helps users to utilize visualization for more informative, communicable, and well-rounded scientific reporting.
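Typical usage looks like the following sketch: another easystats package computes the numeric summary, and 'see' supplies the plot() method (the model here is an arbitrary example, not from the paper):

# 'parameters' computes the summary; 'see' provides the plot() method.
library(parameters)
library(see)

model <- lm(mpg ~ wt + cyl, data = mtcars)
plot(model_parameters(model))  # publication-ready coefficient plot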


2020 ◽  
Author(s):  
Maxime Meylan ◽  
Etienne Becht ◽  
Catherine Sautès-Fridman ◽  
Aurélien de Reyniès ◽  
Wolf H. Fridman ◽  
...  

Abstract
Summary: We previously reported MCP-counter and mMCP-counter, methods that allow precise estimation of the immune and stromal composition of human and murine samples from bulk transcriptomic data, but they were only distributed as R packages. Here, we report webMCP-counter, a user-friendly web interface to allow all users to use these methods, regardless of their proficiency in the R programming language.
Availability and Implementation: Freely available from http://134.157.229.105:3838/webMCP/. Website developed with the R package shiny. Source code available from GitHub: https://github.com/FPetitprez/webMCP-counter.
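Since the abstract notes the site was built with the R package shiny, a minimal shape for such a wrapper might look like the sketch below (the scoring function is a hypothetical stand-in, not the MCP-counter code):

# Minimal shiny app wrapping an R analysis behind a file upload.
library(shiny)

score_stub <- function(m) {                # hypothetical placeholder analysis
  data.frame(sample = colnames(m), score = colMeans(m))
}

ui <- fluidPage(
  fileInput("expr", "Bulk expression matrix (CSV)"),
  tableOutput("scores")
)

server <- function(input, output) {
  output$scores <- renderTable({
    req(input$expr)                        # wait for an upload
    expr <- read.csv(input$expr$datapath, row.names = 1)
    score_stub(expr)
  })
}

shinyApp(ui, server)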


2018 ◽  
Vol 2 ◽  
pp. e26060
Author(s):  
Pamela Soltis

Digitized natural history data are enabling a broad range of innovative studies of biodiversity. Large-scale data aggregators such as the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio) provide easy, global access to millions of specimen records contributed by thousands of collections. A developing community of eager users of specimen data – whether locality, image, trait, etc. – is perhaps unaware of the effort and resources required to curate specimens, digitize information, capture images, mobilize records, serve the data, and maintain the infrastructure (human and cyber) to support all of these activities. Tracking of specimen information throughout the research process is needed to provide appropriate attribution to the institutions and staff that have supplied and served the records. Such tracking may also allow for annotation and comment on particular records or collections by the global community. Detailed data tracking is also required for open, reproducible science. Despite growing recognition of the value of and need for thorough data tracking, both technical and sociological challenges continue to impede progress. In this talk, I will present a brief vision of how applying a DOI to each iteration of a data set in a typical research project could provide attribution to the provider, opportunity for comment and annotation of records, and the foundation for reproducible science based on natural history specimen records. Sociological change – such as journal requirements for data deposition of all iterations of a data set – can be accomplished through community meetings and workshops, along with editorial efforts, as was done for DNA sequence data two decades ago.

