Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science

2020 ◽  
Vol 16 (11) ◽  
pp. e1008316
Author(s):  
Daniel Nüst ◽  
Vanessa Sochat ◽  
Ben Marwick ◽  
Stephen J. Eglen ◽  
Tim Head ◽  
...  

Computational science has been greatly improved by the use of containers for packaging software and data dependencies. In a scholarly context, the main drivers for using these containers are transparency and support of reproducibility; in turn, a workflow’s reproducibility can be greatly affected by the choices that are made with respect to building containers. In many cases, the build process for the container’s image is created from instructions provided in a Dockerfile format. In support of this approach, we present a set of rules to help researchers write understandable Dockerfiles for typical data science workflows. By following the rules in this article, researchers can create containers suitable for sharing with fellow scientists, for including in scholarly communication such as education or scientific papers, and for effective and sustainable personal workflows.


Stats ◽  
2021 ◽  
Vol 4 (1) ◽  
pp. 71-85
Author(s):  
Hossein Hassani ◽  
Mohammad Reza Yeganegi ◽  
Xu Huang

Fusing nature with computational science has proved to be of paramount importance, and researchers have shown growing enthusiasm for inventing and developing nature-inspired algorithms to solve complex problems across subjects. Inevitably, these advancements have rapidly promoted the development of data science, where nature-inspired algorithms are changing the traditional way of data processing. This paper proposes a hybrid approach, namely SSA-GA, which incorporates the optimization merits of the genetic algorithm (GA) to advance Singular Spectrum Analysis (SSA). This approach further boosts the performance of SSA forecasting via better and more efficient grouping. Based on the performance of SSA-GA on 100 real time series across various subjects, the newly proposed SSA-GA approach proves to be computationally efficient and robust, with improved forecasting performance.


2021 ◽  
pp. 1-5
Author(s):  
Cosima Meyer

ABSTRACT This article introduces how to teach an interactive, one-semester-long statistics and programming class. The setting can also be applied to shorter and longer classes as well as introductory and advanced courses. I propose a project-based seminar that also encompasses elements of an inverted classroom. As a result of this combination, the seminar supports students’ learning progress and also creates engaging virtual classes. To demonstrate how to apply a project-based seminar setting to teaching statistics and programming classes, I use an introductory class on data wrangling and management with the statistical software program R. Students are guided through a typical data science workflow that requires data management and data wrangling and concludes with visualizing and presenting first research results during a simulated mini-conference.


2019 ◽  
Vol 43 (2) ◽  
pp. 256-264 ◽  
Author(s):  
Jane Cho

Purpose: Based on the data from Figshare repositories, the purpose of this paper is to analyze which research data are actively produced and shared in the interdisciplinary field of library and information science (LIS). Design/methodology/approach: Co-occurrence analysis was performed on keywords assigned to research data in the field of LIS, which were archived in the Figshare repository. By analyzing the keyword network using the pathfinder algorithm, the study identifies key areas where data production is actively conducted in LIS, and examines how these results differ from the conventional intellectual structure of LIS based on co-citation or bibliographic coupling analysis. Findings: Four major domains – Open Access, Scholarly Communication, Data Science and Informatics – and 15 sub-domains were created. The keywords with the highest global influence appeared as follows, in descending order: “open access,” “scholarly communication” and “altmetrics.” Originality/value: This is the first study to understand the key areas that actively produce and utilize data in the LIS field.


2020 ◽  
Vol 16 (10) ◽  
pp. e1008226
Author(s):  
Alise Ponsero ◽  
Ryan Bartelme ◽  
Gustavo de Oliveira Almeida ◽  
Alex Bigelow ◽  
Reetu Tuteja ◽  
...  

2019 ◽  
Vol 29 (Supplement_4) ◽  
Author(s):  
M Mariani ◽  
R Pastorino ◽  
W Ricciardi ◽  
S Boccia

Abstract. Background: Precision health aims to prevent and predict illness, maintaining health and quality of life for as long as possible, by drawing on new technological and data science tools to translate volumes of research and clinical data into information that citizens, patients and doctors can use. Objective: The ExACT consortium, funded by the Marie Curie Research and Innovation Staff Exchange (RISE) 2017 - Horizon 2020, aims to build a community of academic and non-academic institutions that generates high-quality, multidisciplinary collaboration by exchanging knowledge in research and training activities on precision health. Results: From 2019 to 2023, 74 secondments are foreseen; staff involved will be trained on precision health research topics unavailable at their home institutions. The research topics span 5 domains: integration of Big Data and digital solutions into healthcare systems; design and promotion of innovative citizen engagement models; education of healthcare professionals and leadership; HTA in precision health; and ethical-legal, social, organisational and policy issues surrounding precision health. Conclusions: Secondees will produce key reports, policy recommendations, scientific papers, and informative materials for citizens, fostering public-private interplay and the integration of precision health in EU health systems, and contributing to better health for EU citizens. Key messages: Once the secondees are back at their home institutions, they will use the competences acquired during the secondment to advance the research and transfer the knowledge to the home organization. Sharing knowledge, building synergies and expertise, and encouraging best practices among top-level institutions will stimulate the translational effort for implementing precision health in EU health systems.


2021 ◽  
pp. 014473942110049
Author(s):  
Michael Overton ◽  
Stephen Kleinschmit

Public administration is struggling to contend with a substantial shift in practice fueled by the accelerating adoption of information technology. New skills, competencies and pedagogies are required by the field to help overcome the data-skills gap. As a means to address these deficiencies, we introduce the Data Science Literacy Framework, a heuristic for incorporating data science principles into public administration programs. The framework suggests that data literacy is the dominant principle underlying a shift in professional practice, accentuated by an understanding of computational science, statistical methodology, and data-adjacent domain knowledge. A combination of new and existing skills meshed into public administration curriculums helps implement these principles and advance public administration education.


2021 ◽  
Vol 17 (2) ◽  
pp. e1008628
Author(s):  
Micaela S. Parker ◽  
Arlyn E. Burgess ◽  
Philip E. Bourne

Author(s):  
Anit N Roy ◽  
Jais Jose ◽  
Aswin Sunil ◽  
Neha Gautam ◽  
Deepa Nathalia ◽  
...  

The sudden spread of severe acute respiratory syndrome Covid-19 has led the world into a major crisis. It has affected every sector, for example industry, horticulture, public transportation, and the economy. To see how Covid-19 affected the globe, we conducted an investigation characterizing the effects of the pandemic worldwide using a Machine Learning (ML) method. Prediction is a typical data science exercise that helps the administration with function planning, objective setting, and anomaly detection. We propose an additive regression model with interpretable parameters that can be naturally balanced by experts with domain intuition about the time series. We focus on global data from 22 January 2020 to 26 April 2020, perform date-wise dynamic map visualization of the global Covid-19 expansion, and predict the spread of the virus across all countries and continents. The major advantages of this work include accurate analysis of country-wise as well as province/state-wise confirmed cases, recovered cases, and deaths, and prediction of the pandemic viral attack and how far it is expanding globally.

