BACKGROUND
Data science offers an unparalleled opportunity to gain new insight into many aspects of human life, and recent advances have been made in healthcare. Using data science in digital health, however, raises significant challenges in data privacy, transparency, and trustworthiness. Recent regulations enforce the need for a clear legal basis for collecting, processing, and sharing data, for example the General Data Protection Regulation (GDPR) and the UK Data Protection Act (DPA) 2018. For healthcare providers, the legal basis for using the electronic health record (EHR) is strictly clinical care. Any other use of the data requires careful consideration of the legal context and direct patient consent. Identifiable personal and sensitive information must be sufficiently anonymized. Raw data are commonly anonymized for research purposes, with an assessment of re-identification risk and utility. Whilst healthcare organizations have internal policies for information governance, there is a significant lack of practical tools and intuitive guidance on the use of data for research and modelling. Off-the-shelf data anonymization tools are developed frequently, but their privacy-related functionalities are often not comparable across problem domains. Additionally, tools exist to measure the re-identification risk of anonymized data against its usefulness, but their efficacy can be unclear.
OBJECTIVE
In this systematic literature mapping (SLM) study, we aim to alleviate these issues by reviewing the landscape of data anonymization for digital healthcare.
METHODS
We use Google Scholar, Web of Science, Elsevier Scopus, and PubMed to retrieve academic studies published in English up to June 2020. Notably, grey literature is also used to initialize the search. We focus on review questions covering five bottom-up aspects: 1) basic anonymization operations; 2) privacy models; 3) re-identification risk and usability metrics; 4) off-the-shelf anonymization tools; 5) lawful basis for EHR data anonymization.
RESULTS
We identified 239 eligible studies, of which 60 articles provide general background; 16 papers cover seven basic anonymization operations; 104 studies cover 72 conventional and machine-learning-based privacy models; 4 and 19 papers, respectively, cover seven metrics for re-identification risk and 15 metrics for degree of usability; and 36 publications explore 20 data anonymization software tools. In addition, we evaluate the practical feasibility of anonymizing EHR data with reference to its usability for medical decision-making. Furthermore, we summarize the lawful basis to deliver guidance for practical EHR anonymization.
CONCLUSIONS
This SLM study indicates that anonymization of EHR data is theoretically achievable, yet it requires more research effort on practical implementations to balance privacy preservation and usability and thus to ensure more reliable healthcare applications.