Privacy-Preserving Data Sharing in High Dimensional Regression and Classification Settings

Author(s):  
Stephen E. Fienberg ◽  
Jiashun Jin

We focus on the problem of multi-party data sharing in high dimensional data settings where the number of measured features (or the dimension) p is frequently much larger than the number of subjects (or the sample size) n, the so-called p >> n scenario that has been the focus of much recent statistical research. Here, we consider data sharing for two interconnected problems in high dimensional data analysis, namely feature selection and classification. We characterize the notions of "cautious", "regular", and "generous" data sharing in terms of their privacy-preserving implications for the parties and their share of data, with a focus on "feature privacy" rather than "sample privacy", though a violation of the former may lead to a violation of the latter. We evaluate the data sharing methods using phase diagrams from the statistical literature on multiplicity and Higher Criticism thresholding. In the two-dimensional phase space calibrated by signal sparsity and signal strength, a phase diagram is a partition of the phase space into three distinct regions, where we have no (feature) privacy violation, relatively rare privacy violations, and an overwhelming amount of privacy violation, respectively.

2021 ◽  
Vol 256 ◽  
pp. 02038
Author(s):  
Xin Ji ◽  
Haifeng Zhang ◽  
Jianfang Li ◽  
Xiaolong Zhao ◽  
Shouchao Li ◽  
...  

To improve the prediction accuracy of high-dimensional time series, a multivariate time series prediction method for high-dimensional data based on deep reinforcement learning is proposed. Deep reinforcement learning is used to determine the time delay of each variable and to mine the data characteristics. According to the principle of maximum conditional entropy, the embedding dimension of the phase space is expanded, and a multivariate time series model of high-dimensional data is constructed, so that the conversion of reconstructed coordinates from low-dimensional to high-dimensional remains relatively stable. The strong independence and low redundancy of the final reconstructed phase space yield an effective model input vector for multivariate time series forecasting. Numerical experiments on classical multivariable chaotic time series show that the proposed method achieves better forecasting performance, which demonstrates its effectiveness.
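
As a rough illustration of the phase-space reconstruction step described above, the following Python sketch builds delay-embedded input vectors from a multivariate series. It uses plain delay-coordinate embedding and takes the per-variable delays and embedding dimensions as given inputs. It does not implement the deep reinforcement learning or maximum-conditional-entropy selection from the abstract, and the function name `delay_embed` and its parameters are hypothetical.

```python
def delay_embed(series, delays, dims):
    """Build delay-coordinate vectors from a multivariate time series.

    series: list of equal-length lists, one per variable.
    delays: time delay (in samples) for each variable; assumed given.
    dims:   embedding dimension for each variable; assumed given.
    """
    if not (len(series) == len(delays) == len(dims)):
        raise ValueError("series, delays and dims must have the same length")
    length = len(series[0])
    if any(len(x) != length for x in series):
        raise ValueError("all variables must have the same number of samples")
    if any(tau < 1 for tau in delays) or any(m < 1 for m in dims):
        raise ValueError("delays and dims must be positive integers")

    # First time index at which every variable has a full delayed history.
    start = max((m - 1) * tau for tau, m in zip(delays, dims))

    vectors = []
    for t in range(start, length):
        row = []
        for x, tau, m in zip(series, delays, dims):
            # Coordinates x(t), x(t - tau), ..., x(t - (m - 1) * tau).
            row.extend(x[t - k * tau] for k in range(m))
        vectors.append(row)
    return vectors


# Toy example with two variables, delays 1 and 2, two coordinates each.
toy = [[0, 1, 2, 3, 4, 5], [10, 11, 12, 13, 14, 15]]
embedded = delay_embed(toy, delays=[1, 2], dims=[2, 2])
```

Each row of `embedded` concatenates the delayed coordinates of all variables at one time step and could serve as an input vector for a forecasting model.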


Author(s):  
David Donoho ◽  
Jiashun Jin

We consider two-class linear classification in a high-dimensional, small-sample-size setting. Only a small fraction of the features are useful, these being unknown to us, and each useful feature contributes weakly to the classification decision. This was called the rare/weak (RW) model in our previous study (Donoho, D. & Jin, J. 2008 Proc. Natl Acad. Sci. USA 105, 14790–14795). We select features by thresholding feature Z-scores. The threshold is set by higher criticism (HC). For $1 \le i \le N$, let $\pi_i$ denote the p-value associated with the i-th Z-score and $\pi_{(i)}$ denote the i-th order statistic of the collection of p-values. The HC threshold (HCT) is the order statistic of the Z-score corresponding to the index i maximizing $(i/N - \pi_{(i)}) / \sqrt{(i/N)(1 - i/N)}$. The ideal threshold optimizes the classification error. In that previous study, we showed that HCT was numerically close to the ideal threshold. We formalize an asymptotic framework for studying the RW model, considering a sequence of problems with increasingly many features and relatively fewer observations. We show that, along this sequence, the limiting performance of ideal HCT is essentially just as good as the limiting performance of ideal thresholding. Our results describe a two-dimensional phase space, a two-dimensional diagram with coordinates quantifying ‘rare’ and ‘weak’ in the RW model. The phase space can be partitioned into two regions: one where ideal threshold classification is successful, and one where the features are so weak and so rare that it must fail. Surprisingly, the regions where ideal HCT succeeds and fails make exactly the same partition of the phase diagram. Other threshold methods, such as false (feature) discovery rate (FDR) threshold selection, are successful in a substantially smaller region of the phase space than either HCT or ideal thresholding. The FDR and local FDR of the ideal and HC threshold selectors have surprising phase diagrams, which are also described.
Results showing the asymptotic equivalence of HCT with ideal HCT can be found in a forthcoming paper (Donoho, D. & Jin, J. In preparation).
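
The HC threshold rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. It assumes two-sided normal p-values for the Z-scores and restricts the search to the first `max_fraction` of the ordered p-values, which keeps the denominator away from zero at i = N. The names `hc_threshold`, `select_features`, and `max_fraction` are hypothetical.

```python
import math
from statistics import NormalDist


def hc_threshold(z_scores, max_fraction=0.5):
    """Return (threshold on |Z|, maximal HC value) for a list of Z-scores."""
    n = len(z_scores)
    if n < 2:
        raise ValueError("need at least two Z-scores")
    std_normal = NormalDist()

    # Sorting |Z| in decreasing order sorts the p-values in increasing order.
    abs_z = sorted((abs(z) for z in z_scores), reverse=True)
    # Assumption: two-sided p-values under a standard normal null.
    p_sorted = [2.0 * (1.0 - std_normal.cdf(a)) for a in abs_z]

    # Assumption: search only i <= max_fraction * N, and never i = N.
    last = max(1, min(n - 1, int(max_fraction * n)))
    best_hc, best_i = -math.inf, 1
    for i in range(1, last + 1):
        frac = i / n
        hc = (frac - p_sorted[i - 1]) / math.sqrt(frac * (1.0 - frac))
        if hc > best_hc:
            best_hc, best_i = hc, i

    # The threshold is the |Z| order statistic at the maximizing index.
    return abs_z[best_i - 1], best_hc


def select_features(z_scores, threshold):
    """Indices of features whose |Z| meets or exceeds the threshold."""
    return [j for j, z in enumerate(z_scores) if abs(z) >= threshold]


# Toy example: two strong features among eight near-null ones.
z = [0.1, -0.2, 0.3, 5.0, -4.5, 0.05, -0.4, 0.6, 0.15, -0.25]
threshold, hc_value = hc_threshold(z)
selected = select_features(z, threshold)
```

In this toy example the HC objective peaks at the second-largest |Z|, so the threshold is 4.5 and the two strong features (indices 3 and 4) are selected.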

