data utility
Recently Published Documents

TOTAL DOCUMENTS: 138 (five years: 54)
H-INDEX: 13 (five years: 3)

2022 ◽  
Author(s):  
Paula Delgado-Santos ◽  
Giuseppe Stragapede ◽  
Ruben Tolosana ◽  
Richard Guest ◽  
Farzin Deravi ◽  
...  

The number of mobile devices, such as smartphones and smartwatches, has been relentlessly increasing, reaching almost 6.8 billion by 2022, and with it the amount of personal and sensitive data they capture. This survey overviews the state of the art of the personal and sensitive user attributes that can be extracted from mobile device sensors, emphasising critical aspects such as demographics, health and body features, and activity and behaviour recognition. In addition, we review popular metrics in the literature for quantifying the degree of privacy, and discuss powerful privacy methods that protect sensitive data while preserving its utility for analysis. Finally, open research questions are presented for further advancements in the field.
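As a concrete illustration of the kind of privacy metric the survey refers to, the sketch below computes k-anonymity, the size of the smallest group of records sharing the same quasi-identifier values, over a toy dataset; the attribute names and records are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k-anonymity level of a dataset: the size of the
    smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Toy example with hypothetical sensor-derived attributes.
records = [
    {"age_band": "20-30", "gender": "F", "activity": "walking"},
    {"age_band": "20-30", "gender": "F", "activity": "running"},
    {"age_band": "30-40", "gender": "M", "activity": "walking"},
]
print(k_anonymity(records, ["age_band", "gender"]))  # -> 1: the third record is unique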


Iproceedings ◽  
10.2196/35431 ◽  
2021 ◽  
Vol 6 (1) ◽  
pp. e35431
Author(s):  
Hyeon Ki Jeong ◽  
Christine Park ◽  
Ricardo Henao ◽  
Meenal Kheterpal

Background: In the era of increasing tools for automatic image analysis in dermatology, new machine learning models require high-quality image data sets. Facial image data are needed to develop models that evaluate attributes such as redness (acne and rosacea models), texture (wrinkle and aging models), pigmentation (melasma, seborrheic keratoses, aging, and postinflammatory hyperpigmentation), and skin lesions. Deidentifying facial images is critical for protecting patient anonymity. Traditionally, journals have required facial feature concealment, typically covering the eyes, but these guidelines are largely insufficient to meet the ethical and legal requirements of the Health Insurance Portability and Accountability Act for patient privacy. Currently, facial feature deidentification is a challenging task, given the lack of expert consensus and the lack of testing infrastructure for adequate automatic and manual facial image detection.
Objective: This study aimed to review the current literature on automatic facial deidentification algorithms and to assess their utility in dermatology use cases, defined by preservation of skin attributes (redness, texture, pigmentation, and lesions) and data utility.
Methods: We conducted a systematic search using a combination of headings and keywords to encompass the concepts of facial deidentification and privacy preservation. The MEDLINE (via PubMed), Embase (via Elsevier), and Web of Science (via Clarivate) databases were queried from inception to May 1, 2021. Studies with ineligible designs or outcomes were excluded during the screening and review process.
Results: A total of 18 studies, largely focusing on generative adversarial networks (GANs), were included in the final review, reporting various methodologies of facial deidentification algorithms for still and video images. GAN-based studies were included owing to the algorithms' capacity to generate high-quality, realistic images. Three human reviewers individually rated each study's methods for their utility in dermatology use cases, considering skin color or pigmentation preservation, texture preservation, data utility, and human detection. We found that most notable studies in the literature address facial feature and expression preservation while sacrificing skin color, texture, and pigmentation, which are critical features for dermatology-related data utility.
Conclusions: Overall, facial deidentification algorithms have made notable advances, such as disentanglement and face-swapping techniques that produce realistic faces while protecting privacy. However, such methods are sparse and not yet suitable for complete preservation of skin texture, color, and pigmentation quality in facial photographs. Building on the current advances in artificial intelligence for facial deidentification summarized herein, a novel approach is needed to ensure greater patient anonymity while increasing data access for automated image analysis in dermatology.
Conflicts of Interest: None declared.
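For contrast with the GAN-based methods the review covers, here is a minimal sketch of the naive detect-and-blur baseline that the review finds insufficient for dermatology, since blurring destroys exactly the skin attributes (redness, texture, pigmentation) the use cases require. It assumes OpenCV with its stock Haar cascade and a hypothetical input file face.jpg.

```python
import cv2

# Naive detect-and-blur deidentification: protects identity but
# obliterates skin redness, texture, and pigmentation in the blurred
# region, which is why the review favours attribute-preserving methods.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
image = cv2.imread("face.jpg")  # hypothetical input path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
    image[y:y + h, x:x + w] = cv2.GaussianBlur(image[y:y + h, x:x + w], (51, 51), 0)
cv2.imwrite("face_deidentified.jpg", image)
```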


2021 ◽  
Vol 11 (12) ◽  
pp. 3164-3173
Author(s):  
R. Indhumathi ◽  
S. Sathiya Devi

Data sharing is essential in present-day biomedical research. A large quantity of medical information is gathered for different objectives of analysis and study, and because of the scale of these collections, anonymity is essential: privacy must be preserved and leakage of patients' sensitive information prevented. Anonymization methods such as generalisation, suppression, and perturbation have been proposed to prevent information leakage, but they degrade the utility of the collected data; during data sanitization, utility is inevitably diminished. Privacy-preserving data publishing thus faces the central challenge of maintaining the trade-off between privacy and data utility. To address this issue, an efficient algorithm called Anonymization based on Improved Bucketization (AIB) is proposed, which increases the utility of published data while maintaining privacy. The bucketization technique is used in this paper together with a clustering method. The proposed work is divided into four stages: (i) vertical and horizontal partitioning, (ii) assigning a sensitive index to attributes in each cluster, (iii) verifying each cluster against the privacy threshold, and (iv) examining quasi-identifiers (QIs) for privacy breaches. To increase the utility of published data, the threshold value is determined based on the distribution of elements in each attribute, and the anonymization method is applied only to the specific QI element; as a result, data utility is improved. Finally, the evaluation results validated the design and demonstrated that it is effective in improving data utility.
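The abstract does not spell out the AIB algorithm itself; the sketch below only illustrates the underlying bucketization idea, grouping records into buckets that each contain enough distinct sensitive values before publication so that no bucket links its quasi-identifiers to a single sensitive value. The function name, threshold rule, and data are assumptions for illustration.

```python
def bucketize(records, sensitive_attr, min_distinct=2):
    """Greedy bucketization sketch: a bucket is published only once it
    contains at least `min_distinct` distinct sensitive values."""
    buckets, current, seen = [], [], set()
    for rec in records:
        current.append(rec)
        seen.add(rec[sensitive_attr])
        if len(seen) >= min_distinct:
            buckets.append(current)
            current, seen = [], set()
    if current and buckets:
        buckets[-1].extend(current)  # merge the residue rather than publish it unsafely
    return buckets

records = [
    {"zip": "635xx", "age": 34, "disease": "flu"},
    {"zip": "635xx", "age": 36, "disease": "flu"},
    {"zip": "636xx", "age": 41, "disease": "asthma"},
    {"zip": "636xx", "age": 45, "disease": "diabetes"},
]
for bucket in bucketize(records, "disease"):
    print([r["disease"] for r in bucket])
```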


2021 ◽  
Vol 1 (1) ◽  
Author(s):  
Stefanie James ◽  
Chris Harbron ◽  
Janice Branson ◽  
Mimmi Sundler

Abstract Synthetic data is a rapidly evolving field with growing interest from multiple industry stakeholders and European bodies. In particular, the pharmaceutical industry is starting to realise the value of synthetic data, which is being used more widely as a method to optimise data utility and sharing, ultimately as an innovative response to the growing demand for improved privacy. Synthetic data is data generated by simulation, based upon and mirroring the properties of an original dataset. Here, with supporting viewpoints from across the pharmaceutical industry, we explore use cases for synthetic data across seven key, related areas where it can optimise data utility and improve data privacy and protection. We also discuss the various methods that can be used to produce a synthetic dataset, and the metrics available to ensure the robust quality of generated synthetic datasets. Lastly, we discuss the potential merits, challenges, and future direction of synthetic data within the pharmaceutical industry, and the considerations for this privacy-enhancing technology.
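As a minimal, hypothetical illustration of the generate-then-check workflow discussed here (far simpler than the model-based generators the paper surveys), the sketch below samples a synthetic table from the independent marginal distributions of an original one and compares a basic utility metric; the column names and data are invented.

```python
import random

def synthesize(original, n):
    """Sample each column independently from its empirical marginal.
    This preserves per-column distributions but, unlike the model-based
    generators discussed in the paper, drops cross-column correlations."""
    columns = list(original[0].keys())
    return [{col: random.choice(original)[col] for col in columns} for _ in range(n)]

def mean(rows, col):
    return sum(r[col] for r in rows) / len(rows)

original = [{"age": a, "dose_mg": d} for a, d in
            [(34, 10), (51, 20), (47, 20), (29, 10), (62, 40)]]
synthetic = synthesize(original, 1000)
# Crude utility check: marginal means should roughly match.
print(mean(original, "age"), mean(synthetic, "age"))
```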


2021 ◽  
Vol 2022 (1) ◽  
pp. 481-500
Author(s):  
Xue Jiang ◽  
Xuebing Zhou ◽  
Jens Grossklags

Abstract Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information about individuals, direct collection can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-of-the-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data, not only because of the increase in computation and communication cost but also because of poor data utility. In this paper, we address the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the ideas of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. Combining a generative autoencoder, federated learning, and differential privacy, our framework privately learns the statistical distributions of local data and generates high-utility synthetic data on the server side without revealing users' private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and in generating high-utility synthetic data. With a local privacy guarantee ε = 8, machine learning models trained on synthetic data generated by the baseline algorithm suffer an accuracy loss of 10%–30%, whereas with our framework the accuracy loss is reduced to less than 3%, and at best to less than 1%. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.
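For readers unfamiliar with the LDP primitives that such baselines build on, below is the classical k-ary randomized response mechanism for a single categorical attribute. This is a generic textbook sketch, not the authors' DP-Fed-Wae framework, which replaces per-attribute perturbation with a federated generative autoencoder.

```python
import math
import random

def k_randomized_response(value, domain, epsilon):
    """k-ary randomized response: report the true value with probability
    e^eps / (e^eps + k - 1), otherwise a uniformly random other value.
    Satisfies epsilon-local differential privacy for a single report."""
    k = len(domain)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_true:
        return value
    return random.choice([v for v in domain if v != value])

domain = ["A", "B", "C", "D"]
reports = [k_randomized_response("A", domain, epsilon=1.0) for _ in range(10000)]
print(reports.count("A") / len(reports))  # biased toward the true value
```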


2021 ◽  
Author(s):  
Matthias Templ ◽  
Chifundo Kanjala ◽  
Inken Siems

BACKGROUND Sharing and anonymising data have become hot topics for individuals, organisations, and countries around the world. Open-access sharing of anonymised data containing sensitive information about individuals makes the most sense whenever the utility of the data can be preserved and the risk of disclosure can be kept below acceptable levels; in this case, researchers can use the data without access restrictions or limitations. OBJECTIVE The goal of this paper is to highlight solutions and requirements for sharing longitudinal health and surveillance event history data as open-access data. The challenges lie in the anonymisation of multiple event dates and of time-varying variables. A sequential approach that adds noise to event dates is proposed; this approach maintains the event order and preserves the average time between events. Additionally, a distance-based matching approach for estimating the disclosure risk under a nosy-neighbour attack is proposed. For key variables that change over time, such as educational level or occupation, we make two proposals: one based on limiting the intermediate statuses of a person (e.g. on education), and the other on achieving k-anonymity in subsets of the data. The proposed approaches were applied to the Karonga Health and Demographic Surveillance System (HDSS) core dataset, which contains longitudinal data from 1995 to the end of 2016 and includes 280,381 event records with time-varying socio-economic variables and demographic information on individuals. The proposed anonymisation strategy lowers the risk of disclosure to acceptable levels, thus allowing the data to be shared. METHODS Statistical disclosure control, k-anonymity, noise addition, disclosure risk measurement, event history data anonymisation, longitudinal data anonymisation, and assessment of data utility by visual comparisons. RESULTS An anonymised version of the event history data, including longitudinal information on individuals over time, with high data utility. CONCLUSIONS The proposed anonymisation of study participants in event history data, including static and time-varying status variables, applied here to longitudinal health and demographic surveillance system data, led to an anonymised dataset with very low disclosure risk and high data utility, ready to be shared with the public as an open-access dataset. Different levels of noise for event history dates were evaluated for disclosure risk and data utility; high utility was achieved even with the highest level of noise. Details matter for ensuring consistency and credibility. Most importantly, the sequential noise approach presented in this paper maintains the event order; not only is the event order preserved, but the time between events is also well maintained in comparison to the original data. We also proposed an anonymisation strategy to handle time-varying information on a person's educational and occupational level, year of death, year of birth, and number of events. This approach preserves data utility well while limiting the number of educational and occupational levels per person. Using distance-based neighbourhood matching, we simulated an attack under a nosy-neighbour scenario and under a worst-case scenario in which the attacker has full information on the original data. The disclosure risk was shown to be very low even assuming that the attacker's database and information are optimal.
The HDSS and medical science research communities in LMIC settings will be the primary beneficiaries of the results and methods presented in this article, but the results will be useful for anyone anonymising longitudinal datasets, possibly including time-varying information and event history data, for purposes of sharing. In other words, the proposed approaches can be applied to almost any event history data and, additionally, to event history data including static and/or status variables whose values change over time.
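The abstract does not give the sequential noise mechanism in full detail; the following is a minimal sketch of one plausible reading of it: perturb the gaps between consecutive events with zero-mean noise, clamp each gap so it stays positive (preserving event order), and rebuild the dates cumulatively (approximately preserving the average spacing). The parameter names and noise distribution are assumptions.

```python
import random
from datetime import date, timedelta

def perturb_event_dates(dates, noise_days=30, min_gap_days=1):
    """Sketch of a sequential noise approach: add zero-mean uniform noise
    to inter-event gaps, clamping so events never swap order. Zero-mean
    noise keeps the average time between events roughly unchanged."""
    dates = sorted(dates)
    out = [dates[0] + timedelta(days=random.randint(-noise_days, noise_days))]
    for prev, curr in zip(dates, dates[1:]):
        gap = (curr - prev).days + random.randint(-noise_days, noise_days)
        out.append(out[-1] + timedelta(days=max(min_gap_days, gap)))
    return out

events = [date(1996, 3, 1), date(1999, 7, 15), date(2004, 1, 2), date(2010, 6, 30)]
print(perturb_event_dates(events))  # same order, similar spacing, shifted dates
```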


Author(s):  
Hao Wang ◽  
Xiao Peng ◽  
Yihang Xiao ◽  
Zhengquan Xu ◽  
Xian Chen

Abstract Privacy-preserving methods that support data aggregation have attracted the attention of researchers in multidisciplinary fields. Among the advanced methods, differential privacy (DP) has become an influential privacy mechanism owing to its rigorous privacy guarantee and high data utility. But DP places no limit on the magnitude of the noise, leading to low utility. Recently, researchers have investigated how to preserve a rigorous privacy guarantee while limiting the relative error to a fixed bound. However, these schemes destroy statistical properties, including the mean, variance, and MSE, which are foundational elements for data aggregation and analysis. In this paper, we explore an optimal privacy-preserving solution, including novel definitions and implementing mechanisms, that maintains these statistical properties while satisfying DP with a fixed relative error bound. Experimental evaluation demonstrates that our mechanism outperforms current schemes in terms of security and utility for large numbers of queries.
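To make the tension the paper targets concrete, the sketch below naively clamps Laplace noise so that a released value stays within a fixed relative error bound. The clamping shrinks the noise variance and ties the noise distribution to the true value, which is precisely the kind of distortion of statistical properties (and erosion of the DP guarantee) that the authors set out to repair. This is an illustration of the problem, not the authors' mechanism.

```python
import math
import random

def laplace(scale):
    """Zero-mean Laplace sample via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def clamped_release(true_value, scale, rel_bound):
    """Release true_value + Laplace noise, clamped so the relative error
    never exceeds rel_bound. The clamp keeps the noise symmetric but
    shrinks its variance and makes its distribution value-dependent."""
    bound = abs(true_value) * rel_bound
    return true_value + max(-bound, min(bound, laplace(scale)))

samples = [clamped_release(100.0, scale=50.0, rel_bound=0.2) for _ in range(100_000)]
variance = sum((x - 100.0) ** 2 for x in samples) / len(samples)
print(variance, 2 * 50.0 ** 2)  # clamped variance falls far below the Laplace variance 2*scale^2
```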


2021 ◽  
Author(s):  
Fabien Viton ◽  
Clemence Mauger ◽  
Gilles Dequen ◽  
Jean-Luc Guerin ◽  
Gael Le Mahec
