scholarly journals DATABOOK : a standardised framework for dynamic documentation of algorithm design during Data Science projects

2021 ◽  
Vol 45 (2) ◽  
Author(s):  
Anna Nesvijevskaia

This paper proposes a standard documentation framework for Data Science projects, called Databook. It is a result of five years of action-research on multiple projects in several sectors of activity in France, and of a confrontation of standard theoretical Data Science processes, such as CRISP_DM, with the reality of the field. As a vector for knowledge sharing and capitalisation, the Databook has been identified as one of the main facilitators of Human Data Mediation. Transformed into an operational prototype of simple and minimalist documentation, it has since been tested then on about a hundred Data Science projects, has proven its benefits for the internal and external efficiency of Data Science projects, and can be turned into a more ambitious standard framework for data patrimony valorisation and data quality governance.


2021 ◽  
Vol 22 (S6) ◽  
Author(s):  
Yasmine Mansour ◽  
Annie Chateau ◽  
Anna-Sophie Fiston-Lavier

Abstract Background Meiotic recombination is a vital biological process playing an essential role in genome's structural and functional dynamics. Genomes exhibit highly various recombination profiles along chromosomes associated with several chromatin states. However, eu-heterochromatin boundaries are not available nor easily provided for non-model organisms, especially for newly sequenced ones. Hence, we miss accurate local recombination rates necessary to address evolutionary questions. Results Here, we propose an automated computational tool, based on the Marey maps method, allowing to identify heterochromatin boundaries along chromosomes and estimating local recombination rates. Our method, called BREC (heterochromatin Boundaries and RECombination rate estimates) is non-genome-specific, running even on non-model genomes as long as genetic and physical maps are available. BREC is based on pure statistics and is data-driven, implying that good input data quality remains a strong requirement. Therefore, a data pre-processing module (data quality control and cleaning) is provided. Experiments show that BREC handles different markers' density and distribution issues. Conclusions BREC's heterochromatin boundaries have been validated with cytological equivalents experimentally generated on the fruit fly Drosophila melanogaster genome, for which BREC returns congruent corresponding values. Also, BREC's recombination rates have been compared with previously reported estimates. Based on the promising results, we believe our tool has the potential to help bring data science into the service of genome biology and evolution. We introduce BREC within an R-package and a Shiny web-based user-friendly application yielding a fast, easy-to-use, and broadly accessible resource. The BREC R-package is available at the GitHub repository https://github.com/GenomeStructureOrganization.



2017 ◽  
Author(s):  
Amelia McNamara ◽  
Nicholas J Horton

Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically-updated dynamic data. This paper discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and the other from the ‘tidyverse.’ We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis.



2021 ◽  
Vol 251 ◽  
pp. 04010
Author(s):  
Thomas Britton ◽  
David Lawrence ◽  
Kishansingh Rajput

Data quality monitoring is critical to all experiments impacting the quality of any physics results. Traditionally, this is done through an alarm system, which detects low level faults, leaving higher level monitoring to human crews. Artificial Intelligence is beginning to find its way into scientific applications, but comes with difficulties, relying on the acquisition of new skill sets, either through education or acquisition, in data science. This paper will discuss the development and deployment of the Hydra monitoring system in production at Gluex. It will show how “off-the-shelf” technologies can be rapidly developed, as well as discuss what sociological hurdles must be overcome to successfully deploy such a system. Early results from production running of Hydra will also be shared as well as a future outlook for development of Hydra.



2018 ◽  
Vol 1 (1) ◽  
pp. 139-156 ◽  
Author(s):  
Wen-wen Tung ◽  
Ashrith Barthur ◽  
Matthew C. Bowers ◽  
Yuying Song ◽  
John Gerth ◽  
...  


Author(s):  
Vineet Raina ◽  
Srinath Krishnamurthy




2020 ◽  
Author(s):  
Yasmine Mansour ◽  
Annie Chateau ◽  
Anna-Sophie Fiston-Lavier

AbstractMotivationMeiotic recombination is a vital biological process playing an essential role in genomes structural and functional dynamics. Genomes exhibit highly various recombination profiles along chromosomes associated with several chromatin states. However, eu-heterochromatin boundaries are not available nor easily provided for non-model organisms, especially for newly sequenced ones. Hence, we miss accurate local recombination rates, necessary to address evolutionary questions.ResultsHere, we propose an automated computational tool, based on the Marey maps method, allowing to identify heterochromatin boundaries along chromosomes and estimating local recombination rates. Our method, called BREC (heterochromatin Boundaries and RECombination rate estimates) is non-genome-specific, running even on non-model genomes as long as genetic and physical maps are available. BREC is based on pure statistics and is data-driven, implying that good input data quality remains a strong requirement. Therefore, a data pre-processing module (data quality control and cleaning) is provided. Experiments show that BREC handles different markers density and distribution issues. BREC’s heterochromatin boundaries have been validated with cytological equivalents experimentally generated on the fruit fly Drosophila melanogaster genome, for which BREC returns congruent corresponding values. Also, BREC’s recombination rates have been compared with previously reported estimates. Based on the promising results, we believe our tool has the potential to help bring data science into the service of genome biology and evolution. We introduce BREC within an R-package and a Shiny web-based user-friendly application yielding a fast, easy-to-use, and broadly accessible resource.AvailabilityBREC R-package is available at the GitHub repository https://github.com/ymansour21/BREC.



Author(s):  
Timo Aho ◽  
Outi Sievi-Korte ◽  
Terhi Kilamo ◽  
Sezin Yaman ◽  
Tommi Mikkonen


Sign in / Sign up

Export Citation Format

Share Document