Approaches and criteria for provenance in biomedical data sets/workflows: a scoping review (Preprint)
BACKGROUND Provenance supports the understanding of data genesis and is a key factor in ensuring the trustworthiness of digital objects containing (sensitive) scientific data. Provenance information contributes to a better understanding of scientific results and fosters collaboration on existing data as well as data sharing. This encompasses defining comprehensive concepts and standards for transparency, traceability, reproducibility, validity, and quality assurance in clinical and scientific data workflows and research. OBJECTIVE The aim of this scoping review is to investigate approaches to and challenges of provenance tracking and to identify current knowledge gaps in the area. The review covers modeling aspects as well as metadata frameworks for capturing meaningful and usable provenance information during the creation, collection, and processing of (sensitive) scientific biomedical data. The review also examines quality aspects of provenance criteria. METHODS The scoping review will follow the methodological framework of Arksey and O'Malley. Relevant publications will be obtained by querying PubMed and Web of Science. All English-language articles published between 2006 and March 23, 2021, will be included. Database retrieval will be accompanied by a manual search for grey literature. Potential publications will then be exported into reference management software, and duplicates will be removed. Afterwards, the obtained set of papers will be transferred into a systematic review management tool. All publications will be screened, extracted, and analyzed: title and abstract screening will be carried out by 4 independent reviewers. A majority vote is required to deem an article eligible based on defined inclusion and exclusion criteria.
Full-text reading will be performed independently by 2 reviewers, and in the last step, key information will be extracted using a template that the reviewers have evaluated beforehand. If agreement cannot be reached, the conflict will be resolved by a domain expert. Charted data will be analyzed by categorizing and summarizing the individual data items based on the research questions. Tabular or graphical overviews will be provided where applicable. RESULTS Reporting follows the PRISMA extension for scoping reviews (PRISMA-ScR). Electronic database searches in PubMed and Web of Science resulted in 469 matches after deduplication. As of June 2021, the scoping review is in the full-text screening stage. Data extraction using the pretested charting template will follow. We expect the scoping review report to be completed by the end of 2021. CONCLUSIONS Information about the origin of healthcare data has a major impact on the quality and reusability of scientific results as well as on follow-up activities. This scoping review will provide information about current approaches, challenges, and knowledge gaps in provenance tracking in the biomedical sciences.
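The deduplication and title and abstract screening steps described in the methods can be illustrated with a minimal sketch. It is not the tooling used in the review, which relies on reference management software and a systematic review management tool. The function names, the record layout, and the title-based duplicate rule are hypothetical, and sending tied votes to a domain expert is an assumption, since the abstract does not state how a 2:2 split among the 4 reviewers is handled.

```python
def deduplicate(records):
    """Keep the first record for each normalized title.

    Assumption: two records are duplicates when their titles match after
    lowercasing and collapsing whitespace. Real tools typically also
    compare DOI, authors, and year.
    """
    seen = set()
    unique = []
    for record in records:
        key = " ".join(record["title"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def screen(votes):
    """Combine reviewer votes (True = eligible) by majority.

    Returns "include" or "exclude" on a strict majority. A tie returns
    "conflict" (assumption: such cases go to a domain expert).
    """
    yes = sum(1 for vote in votes if vote)
    no = len(votes) - yes
    if yes > no:
        return "include"
    if no > yes:
        return "exclude"
    return "conflict"


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    records = [
        {"title": "Provenance in clinical data workflows"},
        {"title": "provenance in  clinical data workflows"},
        {"title": "Metadata frameworks for biomedical data"},
    ]
    print(len(deduplicate(records)))
    print(screen([True, True, True, False]))
    print(screen([True, True, False, False]))
```

With an even number of reviewers, a strict majority cannot always be reached, which is why the sketch reports ties explicitly instead of resolving them silently.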