Privacy-Preserving High-dimensional Data Collection with Federated Generative Autoencoder

Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information about individuals, direct collection can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-of-the-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data, not only because of the increased computation and communication cost but also because of poor data utility. In this paper, we aim to address the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the ideas of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. By combining a generative autoencoder, federated learning, and differential privacy, our framework is capable of privately learning the statistical distributions of local data and generating high-utility synthetic data on the server side without revealing users’ private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and in generating high-utility synthetic data. With a local privacy guarantee of ε = 8, the machine learning models trained with the synthetic data generated by the baseline algorithm incur an accuracy loss of 10% to 30%, whereas the accuracy loss is reduced to less than 3%, and at best to less than 1%, with our framework. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.
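
To make the LDP setting in the abstract above concrete, here is a minimal sketch of generalized (k-ary) randomized response, a standard LDP mechanism for a single categorical attribute. It is not the DP-Fed-Wae framework and not necessarily the baseline used in the paper; the function names and the demo distribution are assumptions made for illustration. The last lines show one reason utility can degrade in high dimensions: if a total budget is split evenly across d attributes (one common approach, assumed here), each attribute receives only ε/d.

```python
import math
import random


def grr_probabilities(epsilon, k):
    """Return (p, q): probability of reporting the true value, and of
    reporting each specific other value, under k-ary randomized response."""
    e = math.exp(epsilon)
    p = e / (e + k - 1)
    q = 1.0 / (e + k - 1)
    return p, q


def grr_perturb(value, k, epsilon, rng):
    """Perturb one categorical value in {0, ..., k-1} on the user's side."""
    p, _ = grr_probabilities(epsilon, k)
    if rng.random() < p:
        return value
    # Otherwise pick uniformly among the k-1 values different from `value`.
    other = rng.randrange(k - 1)
    return other if other < value else other + 1


def grr_estimate(reports, k, epsilon):
    """Unbiased server-side estimate of the category frequencies."""
    n = len(reports)
    p, q = grr_probabilities(epsilon, k)
    counts = [0] * k
    for r in reports:
        counts[r] += 1
    return [(c / n - q) / (p - q) for c in counts]


if __name__ == "__main__":
    rng = random.Random(0)
    k = 4
    # Hypothetical true distribution, chosen only for the demo.
    truth = rng.choices(range(k), weights=[0.5, 0.3, 0.15, 0.05], k=20000)
    reports = [grr_perturb(v, k, 8.0, rng) for v in truth]
    print("estimated frequencies:", grr_estimate(reports, k, 8.0))

    # Budget splitting: with 100 binary attributes, each gets epsilon / 100.
    p_single, _ = grr_probabilities(8.0, 2)
    p_split, _ = grr_probabilities(8.0 / 100, 2)
    print("P(truthful report), one attribute:", p_single)
    print("P(truthful report), budget split over 100:", p_split)
```

With the budget split over 100 attributes, the probability of a truthful report is only slightly above one half, so each report carries very little signal.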

Techniques and Challenges while Applying Machine Learning Algorithms in Privacy Preserving Fashion

Proceeding International Conference on Science and Engineering ◽

10.14421/icse.v3.600 ◽

2020 ◽

Vol 3 ◽

pp. xix-xix

Author(s):

Artrim Kjamilji

Keyword(s):

Machine Learning ◽

Private Information ◽

Cyber Security ◽

Credit Card ◽

Differential Privacy ◽

Homomorphic Encryption ◽

Privacy Preserving ◽

Machine Learning Algorithms ◽

Garbled Circuits ◽

Private Data

Nowadays many different entities collect data of the same nature, but in slightly different environments. For example, different hospitals collect data about their patients’ symptoms and corresponding diagnoses, different banks collect the transactions of their customers’ accounts, and multiple cyber-security companies collect log files and data about the corresponding attacks. It has been shown that if those entities merged their privately collected data into a single dataset and used it to train a machine learning (ML) model, they would often end up with a trained model that outperforms the human experts of the corresponding fields in prediction accuracy. However, there is a drawback. Due to privacy concerns, reinforced by laws and ethical considerations, no entity is willing to share its privately collected data with others. The same problem appears during classification over an already trained ML model. On the one hand, a user who has an unclassified query (record) does not want to share with the server that owns the trained model either the content of the query (which might contain private data such as a credit card number or IP address) or the final prediction (classification) of the query. On the other hand, the owner of the trained model does not want to leak any parameter of the trained model to the user. To overcome these shortcomings, several cryptographic and probabilistic techniques have been proposed during the last few years to enable both privacy-preserving training and privacy-preserving classification schemes. They include anonymization and k-anonymity, differential privacy, secure multiparty computation (MPC), federated learning, Private Information Retrieval (PIR), Oblivious Transfer (OT), garbled circuits, and homomorphic encryption, to name a few.
Theoretical analyses and experimental results show that the current privacy-preserving schemes are suitable for real-world deployment, while the accuracy of most of them differs little or not at all from that of schemes that do not preserve privacy.

Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data

Applied Sciences ◽

10.3390/app11020472 ◽

2021 ◽

Vol 11 (2) ◽

pp. 472

Author(s):

Hyeongmin Cho ◽

Sangkyun Lee

Keyword(s):

Machine Learning ◽

Data Quality ◽

Large Scale ◽

High Dimensional Data ◽

Quality Measures ◽

Training Data ◽

Measure Data ◽

High Dimensional ◽

Small Scale ◽

Class Separability

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process. We need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially when it comes to large-scale high-dimensional data, such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
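
As a rough illustration of how class separability might be estimated through random projections, the sketch below projects the data onto random one-dimensional directions and averages a Fisher-style ratio of between-class to within-class variance. This is not the measure proposed in the paper; the scoring rule, the function names, and the default of 20 projections are assumptions made for illustration, and the bootstrapping step mentioned in the abstract is omitted.

```python
import random
import statistics


def random_direction(dim, rng):
    """Draw a unit vector uniformly at random on the sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def fisher_ratio_1d(values_by_class):
    """Between-class variance divided by within-class variance for 1-D values."""
    all_values = [v for vals in values_by_class.values() for v in vals]
    n = len(all_values)
    grand_mean = statistics.fmean(all_values)
    between = 0.0
    within = 0.0
    for vals in values_by_class.values():
        class_mean = statistics.fmean(vals)
        between += len(vals) * (class_mean - grand_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in vals)
    between /= n
    within /= n
    return between / within if within > 0 else float("inf")


def projected_separability(points, labels, n_projections=20, seed=0):
    """Average the 1-D Fisher ratio over several random projections."""
    rng = random.Random(seed)
    dim = len(points[0])
    scores = []
    for _ in range(n_projections):
        w = random_direction(dim, rng)
        by_class = {}
        for x, y in zip(points, labels):
            projected = sum(a * b for a, b in zip(w, x))
            by_class.setdefault(y, []).append(projected)
        scores.append(fisher_ratio_1d(by_class))
    return statistics.fmean(scores)
```

Each projection costs time linear in the number of points and dimensions, which is the kind of saving that makes projection-based measures attractive for large, high-dimensional datasets.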

Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning

Electronic Commerce Research and Applications ◽

10.1016/j.elerap.2018.08.002 ◽

2018 ◽

Vol 31 ◽

pp. 24-39 ◽

Cited By ~ 48

Author(s):

Xiaojun Ma ◽

Jinglan Sha ◽

Dehua Wang ◽

Yuanbo Yu ◽

Qian Yang ◽

...

Keyword(s):

Machine Learning ◽

Data Cleaning ◽

High Dimensional Data ◽

P2p Network ◽

High Dimensional ◽

Loan Default

Scalable hierarchical clustering by composition rank vector encoding and tree structure

10.1101/2020.04.12.038026 ◽

2020 ◽

Author(s):

Xiao Lai ◽

Pu Tian

Keyword(s):

Machine Learning ◽

Hierarchical Clustering ◽

Clustering Algorithm ◽

High Dimensional Data ◽

Machine Learning Algorithms ◽

Tree Structure ◽

Supervised Machine Learning ◽

High Dimensional ◽

Rank Vector ◽

Nonlinear Correlations

Supervised machine learning, especially deep learning based on a wide variety of neural network architectures, has contributed tremendously to fields such as marketing, computer vision and natural language processing. However, the development of unsupervised machine learning algorithms has been a bottleneck of artificial intelligence. Clustering is a fundamental unsupervised task in many different subjects. Unfortunately, no present algorithm is satisfactory for clustering high-dimensional data with strong nonlinear correlations. In this work, we propose a simple and highly efficient hierarchical clustering algorithm based on encoding by composition rank vectors and tree structure, and demonstrate its utility by clustering protein structural domains. No record comparison, which is an expensive and essential step common to all present clustering algorithms, is involved. Consequently, the algorithm achieves hierarchical clustering with linear time and space complexity and is thus applicable to arbitrarily large datasets. The key factor in this algorithm is the definition of composition, which depends on the physical nature of the target data and therefore needs to be constructed case by case. Nonetheless, the algorithm is general and applicable to any high-dimensional data with strong nonlinear correlations. We hope this algorithm will inspire a rich research field of encoding-based clustering well beyond composition rank vector trees.

Machine Learning and High-Dimensional Data Analysis

Principles of Clinical Cancer Research ◽

10.1891/9781617052392.0017 ◽

2018 ◽

Author(s):

Sanjay Aneja ◽

James B. Yu

Keyword(s):

Machine Learning ◽

Data Analysis ◽

High Dimensional Data ◽

High Dimensional ◽

High Dimensional Data Analysis

Privacy preserving processing of high dimensional data classification based on sample selection and Singular Value Decomposition

2013 International Conference on Control, Automation, Robotics and Embedded Systems (CARE) ◽

10.1109/care.2013.6733775 ◽

2013 ◽

Author(s):

Priyank Jain ◽

Pratibha Tapashetti ◽

A.S. Umesh ◽

Sweta Sharma

Keyword(s):

Singular Value Decomposition ◽

Sample Selection ◽

High Dimensional Data ◽

Data Classification ◽

Privacy Preserving ◽

Singular Value ◽

High Dimensional ◽

Value Decomposition

Finding causative genes from high-dimensional data: an appraisal of statistical and machine learning approaches

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2015-0072 ◽

2016 ◽

Vol 15 (4) ◽

Author(s):

Chamont Wang ◽

Jana L. Gevertz

Keyword(s):

Machine Learning ◽

High Dimensional Data ◽

High Dimensional ◽

Learning Approaches

Mining High Utility Itemsets in Large High Dimensional Data

First International Workshop on Knowledge Discovery and Data Mining (WKDD 2008) ◽

10.1109/wkdd.2008.64 ◽

2008 ◽

Cited By ~ 6

Author(s):

Guangzhu Yu ◽

Keqing Li ◽

Shihuang Shao

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

High Utility ◽

High Utility Itemsets

Sparse Boosting Based Machine Learning Methods for High-Dimensional Data

10.5772/intechopen.100506 ◽

2021 ◽

Author(s):

Mu Yue

Keyword(s):

Machine Learning ◽

Parameter Estimation ◽

Variable Selection ◽

Survival Data ◽

High Dimensional Data ◽

High Dimensional ◽

Learning Methods ◽

Require Time ◽

Machine Learning Methods ◽

Boosting Method

In high-dimensional data, penalized regression is often used for variable selection and parameter estimation. However, these methods typically require time-consuming cross-validation to select tuning parameters and retain more false positives under high dimensionality. This chapter discusses sparse boosting based machine learning methods for the following high-dimensional problems. First, a sparse boosting method to select important biomarkers is studied for right-censored survival data with high-dimensional biomarkers. Then, a two-step sparse boosting method to carry out variable selection and model-based prediction is studied for high-dimensional longitudinal observations measured repeatedly over time. Finally, a multi-step sparse boosting method to identify patient subgroups that exhibit different treatment effects is studied for high-dimensional dense longitudinal observations. This chapter addresses the problem of how to improve the accuracy and computation speed of variable selection and parameter estimation in high-dimensional data. It aims to expand the application scope of sparse boosting and to develop new methods of high-dimensional survival analysis, longitudinal data analysis, and subgroup analysis, which have great application prospects.
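
For readers unfamiliar with boosting as a variable-selection tool, the sketch below shows plain componentwise L2 boosting for a linear model: at each step, the single predictor that best fits the current residuals is selected and its coefficient is nudged by a small step. This is the generic building block only, not the sparse boosting variants studied in the chapter (which, among other things, use different selection and stopping criteria); the function name, the step size, and the assumption of no intercept are illustrative choices.

```python
def componentwise_l2_boost(X, y, n_steps=200, step=0.1):
    """Componentwise L2 boosting for y ~ X @ coef (no intercept).

    X is a list of rows, y a list of targets. Returns the coefficient list;
    predictors that were never selected keep a coefficient of exactly 0.
    """
    p = len(X[0])
    coef = [0.0] * p
    residual = list(y)
    cols = [[row[j] for row in X] for j in range(p)]
    col_sq = [sum(v * v for v in col) for col in cols]

    for _ in range(n_steps):
        best_j, best_beta, best_gain = None, 0.0, 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            dot = sum(c * r for c, r in zip(cols[j], residual))
            # Reduction in residual sum of squares from a full
            # least-squares fit of predictor j to the residuals.
            gain = dot * dot / col_sq[j]
            if gain > best_gain:
                best_j, best_beta, best_gain = j, dot / col_sq[j], gain
        if best_j is None:
            break  # residuals are orthogonal to every predictor
        # Shrunken update: move only a fraction `step` toward the fit.
        coef[best_j] += step * best_beta
        residual = [r - step * best_beta * c
                    for r, c in zip(residual, cols[best_j])]
    return coef
```

Stopping after a small number of steps leaves most coefficients at zero, which is what makes boosting usable for variable selection; how to choose the stopping point is exactly where the more refined criteria come in.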

Human and machine learning pipelines for responsible clinical prediction using high-dimensional data

10.21203/rs.3.pex-1655/v1 ◽

2021 ◽

Author(s):

Herdiantri Sufriyana ◽

Yu Wei Wu ◽

Emily Chia-Yu Su

Keyword(s):

Machine Learning ◽

High Dimensional Data ◽

Model Development ◽

Healthcare Providers ◽

Predictive Performance ◽

High Dimensional ◽

Clinical Prediction ◽

Data Collection And Analysis ◽

Medical Histories ◽

Feature Discovery

This protocol aims to develop, validate, and deploy a prediction model using high-dimensional data by both human and machine learning. It is intended for clinical prediction by healthcare providers, including but not limited to those using medical histories from electronic health records. The protocol applies diverse approaches to improve both predictive performance and interpretability while maintaining the generalizability of model evaluation. However, some steps require expensive computational capacity; without it, they will take longer. The key stages consist of the design of data collection and analysis, feature discovery and quality control, and model development, validation, and deployment.
