High-Dimensional Crowdsourced Data Distribution Estimation with Local Privacy

Author(s):  
Xuebin Ren ◽  
Chia-Mu Yu ◽  
Weiren Yu ◽  
Shusen Yang ◽  
Xinyu Yang ◽  
...  
2021 ◽  
pp. 1-12
Author(s):  
Jian Zheng ◽  
Jianfeng Wang ◽  
Yanping Chen ◽  
Shuping Chen ◽  
Jingjin Chen ◽  
...  

Neural networks can approximate data because they contain many compact non-linear layers. In high-dimensional space, the curse of dimensionality makes the data distribution sparse, so the data cannot provide sufficient information, and the approximation task becomes even harder. To address this issue, we use the Lipschitz condition to derive two deviations: the deviation of neural networks trained using high-dimensional functions, and the deviation of high-dimensional functions approximating data. The purpose is to improve the ability of neural networks to approximate high-dimensional space. Experimental results show that neural networks trained using high-dimensional functions outperform those trained using data in their capability to approximate data in high-dimensional space. We find that neural networks trained using high-dimensional functions are more suitable for high-dimensional space than those trained using data, so there is no need to retain sufficient data for neural network training. Our findings suggest that in high-dimensional space, tuning the hidden layers of neural networks has little positive effect on improving the precision of data approximation.
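
The contrast drawn here can be made concrete with a toy experiment. The sketch below is purely illustrative (the target function, sample sizes, and network widths are assumptions, not the paper's setup): it fits one MLP on a sparse fixed sample and another on abundant samples drawn from a known Lipschitz function, then compares their test error in a moderately high-dimensional space.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
d = 30  # input dimension

def f(X):
    # A smooth (hence Lipschitz on the unit cube) target function.
    return np.sin(X).sum(axis=1)

# (a) Trained using data: only a sparse, fixed sample is available.
X_small = rng.uniform(-1, 1, size=(150, d))
net_data = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                        random_state=0).fit(X_small, f(X_small))

# (b) Trained using the function: fresh samples can be drawn at will.
X_big = rng.uniform(-1, 1, size=(6000, d))
net_func = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                        random_state=0).fit(X_big, f(X_big))

X_test = rng.uniform(-1, 1, size=(2000, d))
y_test = f(X_test)
for name, net in [("fixed data", net_data), ("function samples", net_func)]:
    mse = np.mean((net.predict(X_test) - y_test) ** 2)
    print(f"trained on {name}: test MSE = {mse:.4f}")
```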


Author(s):  
Liang Xiong

When we are faced with data, one common task is to learn the correspondence relationship between different data sets. More concretely, by learning data correspondence, samples that share similar intrinsic parameters, which are often hard to estimate directly, can be discovered. For example, given some face image data, an alignment algorithm is able to find images of two different persons with similar poses or expressions. We call this technique the alignment of data. Besides its usage in data analysis and visualization, this problem also has wide potential applications in various fields. For instance, in facial expression recognition, one may have a set of standard labeled images with known expressions, such as happiness, sadness, surprise, anger and fear, of a particular person. Then we can recognize the expressions of another person just by aligning his/her facial images to the standard image set. Its application can also be found directly in pose estimation. One can refer to (Ham, Lee & Saul, 2005) for more details. Although intuitive, this alignment problem can be very difficult without any prior assumptions. Usually, the samples are distributed in high-dimensional observation spaces, and the relation between features and samples' intrinsic parameters can be too complex to be modeled explicitly. Therefore, some hypotheses about the data distribution are made. In recent years, the manifold assumption of data distribution has become very popular in the fields of data mining and machine learning. Researchers have realized that in many applications the samples of interest are actually confined to particular subspaces embedded in the high-dimensional feature space (Seung & Lee, 2000; Roweis & Saul, 2000). Intuitively, the manifold assumption means that certain groups of samples lie in a non-linear low-dimensional subspace embedded in the observation space. This assumption has been verified to play an important role in human perception (Seung & Lee, 2000), and many effective algorithms have been developed under it. Under the manifold assumption, structural information of the data can be utilized to facilitate the alignment.
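
As a concrete (and much simplified) illustration of alignment under the manifold assumption, the sketch below generates two synthetic data sets driven by the same 1-D intrinsic parameter, embeds each separately with Isomap, and uses a few labeled correspondences, in the semi-supervised spirit of (Ham, Lee & Saul, 2005), to map one embedding onto the other. All data, sizes, and the choice of Isomap are assumptions for illustration, not any particular published algorithm.

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 1, 300))  # shared intrinsic parameter

# Two different "observation spaces" (e.g., two persons' face manifolds).
X = np.c_[np.cos(3 * t), np.sin(3 * t), t] + 0.01 * rng.normal(size=(300, 3))
Y = np.c_[t ** 2, np.cos(2 * t), np.sin(2 * t)] + 0.01 * rng.normal(size=(300, 3))

# Embed each data set separately onto its 1-D intrinsic coordinate.
ex = Isomap(n_components=1).fit_transform(X).ravel()
ey = Isomap(n_components=1).fit_transform(Y).ravel()

# A few labeled correspondences fix the sign/scale ambiguity between the
# two embeddings via a 1-D least-squares affine fit.
anchors = [0, 100, 200, 299]
a, b = np.polyfit(ex[anchors], ey[anchors], 1)

# Align: for each x-sample, find the y-sample closest in the shared space.
match = np.abs(ey[None, :] - (a * ex + b)[:, None]).argmin(axis=1)
print("mean intrinsic-parameter error of matches:",
      np.abs(t - t[match]).mean())
```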


Author(s):  
Di Chen ◽  
Carla P. Gomes

Citizen science projects are successful at gathering rich datasets for various applications. However, the data collected by citizen scientists are often biased — in particular, aligned more with the citizens’ preferences than with scientific objectives. We propose the Shift Compensation Network (SCN), an end-to-end learning scheme which learns the shift from the scientific objectives to the biased data while compensating for the shift by re-weighting the training data. Applied to bird observational data from the citizen science project eBird, we demonstrate how SCN quantifies the data distribution shift and outperforms supervised learning models that do not address the data bias. Compared with competing models in the context of covariate shift, we further demonstrate the advantage of SCN in both its effectiveness and its capability of handling massive high-dimensional data.
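
SCN itself is an end-to-end network, but the re-weighting idea it builds on can be sketched with a standard discriminator baseline for covariate shift (this is not the authors' model; the data and model choices below are illustrative): a classifier separates biased training inputs from target-distribution inputs, and the implied density ratio re-weights the training loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Biased (crowdsourced-style) training inputs vs. target-distribution inputs.
X_train = rng.normal(loc=-1.0, size=(2000, 5))   # over-sampled region
X_target = rng.normal(loc=0.0, size=(2000, 5))   # scientific objective

# Discriminator: label 0 = training sample, 1 = target sample.
disc = LogisticRegression().fit(
    np.vstack([X_train, X_target]),
    np.r_[np.zeros(len(X_train)), np.ones(len(X_target))])
p = disc.predict_proba(X_train)[:, 1]
weights = p / (1 - p)   # estimated density ratio p_target(x) / p_train(x)

# Downstream supervised model trained with the compensating weights.
y_train = (X_train.sum(axis=1) > 0).astype(int)  # toy labels
model = LogisticRegression().fit(X_train, y_train, sample_weight=weights)
print("mean compensation weight:", weights.mean().round(3))
```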


2021 ◽  
Vol 11 (3) ◽  
Author(s):  
Ryan McKenna ◽  
Gerome Miklau ◽  
Daniel Sheldon

We propose a general approach for differentially private synthetic data generation that consists of three steps: (1) select a collection of low-dimensional marginals, (2) measure those marginals with a noise addition mechanism, and (3) generate synthetic data that preserves the measured marginals well. Central to this approach is Private-PGM, a post-processing method that is used to estimate a high-dimensional data distribution from noisy measurements of its marginals. We present two mechanisms, NIST-MST and MST, that are instances of this general approach. NIST-MST was the winning mechanism in the 2018 NIST differential privacy synthetic data competition, and MST is a new mechanism that can work in more general settings while still performing comparably to NIST-MST. We believe our general approach should be of broad interest and can be adopted in future mechanisms for synthetic data generation.
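
Steps (1) and (2) of this recipe are easy to sketch; step (3), the Private-PGM estimation, is the substantial part and is omitted here. In the illustrative snippet below (toy data; the marginal selection and noise scale are assumptions, not the calibrated mechanisms from the paper), all two-way marginals of a small categorical data set are measured with Gaussian noise.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, cols, levels = 5000, ["A", "B", "C"], 4
data = rng.integers(0, levels, size=(n, len(cols)))  # toy categorical data

sigma = 10.0  # Gaussian-mechanism scale; calibrated to (eps, delta) in practice

def noisy_marginal(idx):
    """Histogram of the projection onto the columns in idx, plus noise."""
    proj = data[:, idx]
    flat = np.ravel_multi_index(proj.T, dims=(levels,) * len(idx))
    hist = np.bincount(flat, minlength=levels ** len(idx)).astype(float)
    return hist + rng.normal(scale=sigma, size=hist.size)

# (1) select a collection of 2-way marginals (here: all pairs);
# (2) measure each one with the noise-addition mechanism.
measurements = {pair: noisy_marginal(list(pair))
                for pair in itertools.combinations(range(len(cols)), 2)}
for pair, hist in measurements.items():
    print(cols[pair[0]], cols[pair[1]], hist.round(1))
```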


Sensors ◽  
2020 ◽  
Vol 20 (5) ◽  
pp. 1443
Author(s):  
Marius Laska ◽  
Jörg Blankenbach ◽  
Ralf Klamma

The accuracy of fingerprinting-based indoor localization correlates with the quality and up-to-dateness of the collected training data. Perpetual crowdsourced data collection reduces manual labeling effort and provides a fresh database. However, decentralized collection comes at the cost of heterogeneous data that causes performance degradation. In settings with imperfect data, area localization can provide higher positioning guarantees than exact position estimation. Existing area localization solutions employ a static segmentation into areas that is independent of the available training data. This approach is not applicable to crowdsourced data collection, which features an unbalanced spatial training data distribution that evolves over time. A segmentation is required that utilizes the existing training data distribution and adapts once new data is accumulated. We propose an algorithm for data-aware floor plan segmentation and a selection metric that balances expressiveness (information gain) and performance (correctly classified examples) of area classifiers. We utilize supervised machine learning, in particular deep learning, to train the area classifiers. We demonstrate how to regularly provide an area localization model that adapts its prediction space to the accumulating training data. The resulting models are shown to provide higher reliability compared to models that pinpoint the exact position.
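
A selection metric of the kind described might be sketched as follows (the weighting, the entropy normalization, and the classifier are assumptions for illustration, not the paper's exact formulation): each candidate segmentation is scored by combining the entropy of its area-label distribution (expressiveness) with cross-validated classifier accuracy (performance).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def segmentation_score(fingerprints, area_labels, alpha=0.5):
    """Score one candidate floor-plan segmentation."""
    counts = np.bincount(area_labels)
    p = counts[counts > 0] / counts.sum()
    entropy = -(p * np.log2(p)).sum()                 # expressiveness
    norm_entropy = entropy / np.log2(max(len(p), 2))  # scale to [0, 1]
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
    accuracy = cross_val_score(clf, fingerprints, area_labels, cv=3).mean()
    return alpha * norm_entropy + (1 - alpha) * accuracy

# Candidate segmentations would each be scored and the best one retained.
```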


2019 ◽  
Vol 2 (1) ◽  
pp. 31 ◽  
Author(s):  
Ruth Ema Febrita ◽  
Wayan Firdaus Mahmudy ◽  
Aji Prasetya Wibawa

As the population grows and the economy develops, housing becomes one of the basic needs of every family. Therefore, housing investment has promising value in the future. This research implements the Self-Organizing Map (SOM) algorithm to cluster house data, providing several house groups based on various features. K-means is used as the baseline for the proposed approach. SOM achieves a higher silhouette coefficient (0.4367) than the baseline (0.236). Thus, this method outperforms k-means in terms of visualizing high-dimensional data clusters. It is also better at cluster formation and at regulating the data distribution.
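
A minimal reproduction of this kind of comparison is sketched below using the third-party minisom package and scikit-learn (the paper does not name its implementation; the data, grid size, and cluster counts are illustrative assumptions). Both clusterings are scored with the silhouette coefficient.

```python
import numpy as np
from minisom import MiniSom
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for house feature vectors.
X, _ = make_blobs(n_samples=400, n_features=8, centers=5, random_state=4)

som = MiniSom(3, 3, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=4)
som.train_random(X, 1000)
# Each sample's cluster is the flattened index of its winning SOM node.
som_labels = np.array([w[0] * 3 + w[1] for w in (som.winner(x) for x in X)])

km_labels = KMeans(n_clusters=9, n_init=10, random_state=4).fit_predict(X)

print("SOM silhouette:    ", silhouette_score(X, som_labels))
print("k-means silhouette:", silhouette_score(X, km_labels))
```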


Author(s):  
CHUN-GUANG LI ◽  
JUN GUO ◽  
BO XIAO

In this paper, a novel method to estimate the intrinsic dimensionality of a high-dimensional data set is proposed. Based on neighborhood information, our method calculates the non-negative locally linear reconstruction coefficients from its neighbors for each data point, and the number of dominant positive reconstruction coefficients is regarded as a faithful guide to the intrinsic dimensionality of the data set. The proposed method requires no parametric assumption on the data distribution and is easy to implement in the general framework of manifold learning. Experimental results on several synthesized data sets and real data sets have shown the benefits of the proposed method.
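
A loose sketch of this idea follows (the neighborhood size, dominance threshold, and median aggregation are assumptions, not the paper's exact rule): each point is reconstructed from its nearest neighbors by non-negative least squares, and the number of dominant positive coefficients is tallied.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors

def estimate_intrinsic_dim(X, k=12, tau=0.05):
    """Median count of dominant non-negative reconstruction coefficients."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    counts = []
    for i, neigh in enumerate(idx[:, 1:]):   # drop the point itself
        w, _ = nnls(X[neigh].T, X[i])        # non-negative coefficients
        total = w.sum()
        counts.append(int((w > tau * total).sum()) if total > 0 else 0)
    return int(np.median(counts))

# A 2-D manifold (swiss roll) embedded in 3-D observation space.
X, _ = make_swiss_roll(n_samples=1500, random_state=5)
print("estimated intrinsic dimensionality:", estimate_intrinsic_dim(X))
```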

