Privacy Preserving Clustering for Distributed Homogeneous Gene Expression Data Sets

Author(s):  
Xin Li

In this paper, the authors present a new approach to performing principal component analysis (PCA)-based gene clustering on genomic data distributed across multiple sites (horizontal partitions) with privacy protection. This approach allows data providers to collaborate to identify gene profiles from a global viewpoint while protecting the sensitive genomic data from possible privacy leaks. The authors developed a framework for privacy-preserving PCA-based gene clustering that includes two types of participants: data providers and a trusted central site. Within this mechanism, distributed horizontal partitions of genomic data can be clustered globally with privacy preservation. Compared with the centralized scenario, the result generated from distributed partitions achieves 100% accuracy using this approach. An experiment on a real genomic data set shows that the proposed framework produces exactly the same cluster formation as the centralized data set.
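To make the aggregation step concrete, here is a minimal Python sketch of how exact global PCA over horizontal partitions can be computed from local statistics alone, so no raw records ever leave a data provider. The function names and the k-means clustering step are illustrative assumptions of this sketch, not the paper's published protocol.

```python
import numpy as np
from sklearn.cluster import KMeans

def local_statistics(X):
    """Each data provider computes sufficient statistics locally;
    only these aggregates leave the site, never the raw records."""
    n = X.shape[0]
    s = X.sum(axis=0)      # per-feature sums
    ss = X.T @ X           # scatter matrix
    return n, s, ss

def global_pca(stats, k):
    """The trusted central site combines the local statistics into
    the exact global covariance matrix and extracts the top k PCs."""
    n = sum(st[0] for st in stats)
    s = sum(st[1] for st in stats)
    ss = sum(st[2] for st in stats)
    mean = s / n
    cov = (ss - n * np.outer(mean, mean)) / (n - 1)
    vals, vecs = np.linalg.eigh(cov)
    return mean, vecs[:, np.argsort(vals)[::-1][:k]]

# Demo: synthetic expression data horizontally split across 3 sites.
rng = np.random.default_rng(0)
sites = [rng.normal(size=(40, 10)) for _ in range(3)]
mean, pcs = global_pca([local_statistics(X) for X in sites], k=3)

# Each site projects its own rows; the central site clusters them.
projected = np.vstack([(X - mean) @ pcs for X in sites])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(projected)
```

Because the global covariance is reconstructed exactly from the local sums and scatter matrices, the principal components match the centralized ones, which is consistent with the 100% agreement reported above.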


Author(s):  
Xin Li

In this paper, we present approaches to performing principal component analysis (PCA) clustering on distributed heterogeneous genomic datasets with privacy protection. The approaches allow data providers to collaborate to identify gene profiles from a global viewpoint while protecting the sensitive genomic data from possible privacy leaks. We further develop a framework for privacy-preserving PCA-based gene clustering that includes two types of participants: data providers and a trusted central site (TCS). Two methodologies are employed: Collective PCA (C-PCA) and Repeating PCA (R-PCA). C-PCA requires local sites to transmit a sample of the original data to the TCS and can be applied to any heterogeneous datasets. R-PCA requires that all local sites have the same or a similar number of columns, but releases no original data. Experiments on five independent genomic datasets show that both C-PCA and R-PCA maintain very good accuracy compared with the centralized scenario.
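The R-PCA idea can be illustrated with a short sketch: each site releases only its top local principal directions, and the TCS runs PCA again on the stacked directions. This is a common realization of repeating PCA under the same-column-count assumption, not necessarily the authors' exact protocol.

```python
import numpy as np

def local_pcs(X, k):
    """Each site releases only its top-k local principal directions
    (no original data crosses the site boundary)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k]

def repeating_pca(local_bases, k):
    """The TCS stacks the local direction vectors and repeats PCA
    (here via SVD) to form an approximate global basis."""
    _, _, vt = np.linalg.svd(np.vstack(local_bases), full_matrices=False)
    return vt[:k]

# All sites must share the same (or a similar) column count.
rng = np.random.default_rng(1)
sites = [rng.normal(size=(50, 12)) for _ in range(4)]
global_basis = repeating_pca([local_pcs(X, 4) for X in sites], k=4)
```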


2007
Vol 56 (6)
pp. 75-83
Author(s):
X. Flores
J. Comas
I.R. Roda
L. Jiménez
K.V. Gernaey

The main objective of this paper is to present the application of selected multivariable statistical techniques to the analysis of plant-wide wastewater treatment plant (WWTP) control strategies. In this study, cluster analysis (CA), principal component analysis/factor analysis (PCA/FA) and discriminant analysis (DA) are applied to the evaluation matrix data set obtained by simulating several control strategies on the plant-wide IWA Benchmark Simulation Model No. 2 (BSM2). These techniques make it possible i) to determine natural groups, or clusters, of control strategies with similar behaviour, ii) to find and interpret hidden, complex and causal relation features in the data set, and iii) to identify important discriminant variables within the groups found by the cluster analysis. This study illustrates the usefulness of multivariable statistical techniques for both analysis and interpretation of complex multicriteria data sets and enables an improved use of information for effective evaluation of control strategies.
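A brief scikit-learn sketch of the three-step pipeline, with a synthetic stand-in for the BSM2 evaluation matrix (rows: control strategies; columns: evaluation criteria); the dimensions and cluster count are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Rows: simulated control strategies; columns: evaluation criteria
# (effluent quality, energy, violations, ...). Synthetic stand-in.
rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(30, 8)))

# i) CA: natural groups of strategies with similar behaviour.
groups = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# ii) PCA/FA: expose the dominant relation features in the data set.
scores = PCA(n_components=2).fit_transform(X)

# iii) DA: criteria that best discriminate the groups found by CA.
lda = LinearDiscriminantAnalysis().fit(X, groups)
key_criteria = np.argsort(-np.abs(lda.coef_).sum(axis=0))
```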


Author(s):  
L Mohana Tirumala
S. Srinivasa Rao

Privacy preservation in data mining and publishing plays a major role in today's networked world. It is important to preserve the privacy of the vital information in a data set, which can be achieved through a k-anonymization solution for classification. Along with privacy preservation via anonymization, producing optimized data sets in a cost-effective way is of equal importance. In this paper, a Top-Down Refinement algorithm is proposed that yields optimal results in a cost-effective manner. Bayesian classification is also employed to predict class-membership probabilities for data tuples whose associated class labels are unknown.
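As a hedged illustration, the sketch below shows the k-anonymity invariant that a top-down refinement procedure must preserve while specializing values, followed by Bayesian classification on the anonymized quasi-identifiers; the toy table and integer encodings are assumptions of this sketch, not the paper's data.

```python
from collections import Counter
import numpy as np
from sklearn.naive_bayes import CategoricalNB

def is_k_anonymous(table, quasi_ids, k):
    """k-anonymity check: every quasi-identifier combination must
    occur at least k times (the invariant top-down refinement must
    preserve while specializing generalized values)."""
    combos = Counter(tuple(row[i] for i in quasi_ids) for row in table)
    return all(c >= k for c in combos.values())

# Toy generalized table: columns 0-1 are quasi-identifiers (encoded
# as small integers after generalization), column 2 is the class.
table = [(0, 1, 0), (0, 1, 1), (0, 1, 0), (1, 0, 1), (1, 0, 1), (1, 0, 0)]
assert is_k_anonymous(table, quasi_ids=(0, 1), k=3)

# Bayesian classification on the anonymized quasi-identifiers.
X = np.array([row[:2] for row in table])
y = np.array([row[2] for row in table])
clf = CategoricalNB().fit(X, y)
probs = clf.predict_proba([[0, 1]])  # class-membership probabilities
```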


Author(s):  
Andrew J. Connolly
Jacob T. VanderPlas
Alexander Gray
...

With the dramatic increase in data available from a new generation of astronomical telescopes and instruments, many analyses must address the complexity as well as the size of the data set. This chapter deals with how we can learn which measurements, properties, or combinations thereof carry the most information within a data set. It describes techniques related to concepts discussed earlier in the context of Gaussian distributions, density estimation, and information content. The chapter begins with an exploration of the problems posed by high-dimensional data. It then describes the data sets used in this chapter and introduces perhaps the most important and widely used dimensionality reduction technique, principal component analysis (PCA). The remainder of the chapter discusses several alternative techniques that address some of the weaknesses of PCA.
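A minimal PCA example in the spirit of the chapter: choose the number of components by the cumulative explained-variance ratio. The synthetic "spectra" array and the 99% threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a high-dimensional astronomical data set,
# e.g. N spectra with many flux measurements each.
rng = np.random.default_rng(3)
spectra = rng.normal(size=(500, 1000))

pca = PCA().fit(spectra)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.99)) + 1  # 99% of variance
reduced = PCA(n_components=n_keep).fit_transform(spectra)
```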


BMC Genomics
2020
Vol 21 (1)
Author(s):
Da Xu
Jialin Zhang
Hanxiao Xu
Yusen Zhang
Wei Chen
...  

Abstract
Background: The small number of samples and the curse of dimensionality hamper the application of deep learning techniques to disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from satisfactory due to their limited use of unsupervised learning methods. To enhance interpretability and overcome these problems, we developed a novel feature selection algorithm. Meanwhile, complex genomic data pose great challenges for the identification of biomarkers and therapeutic targets, and current feature selection methods in this field suffer from low sensitivity and specificity.
Results: In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS, which simultaneously performs feature selection and model learning for genomic data analysis. The experimental results demonstrated that MCBFS is robust and effective when compared with seven benchmark and six state-of-the-art supervised methods on eight data sets. The visualization results and statistical tests showed that MCBFS can capture the informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW that uses gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for the diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, a network recognition ensemble algorithm, and a feature selection wrapper. McbfsNW has been applied to lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that the identified biomarkers attain higher prediction performance on the independent LUAD data set, and we also constructed a drug-target network that may be useful for LUAD therapy.
Conclusions: The proposed feature selection method is robust and effective for gene selection, classification, and visualization. The McbfsNW framework is practical and helpful for the identification of biomarkers and targets in genomic data. We believe that the same methods and principles are extensible and applicable to other kinds of data sets.
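MCBFS itself is not reproduced here; as a hedged stand-in, the sketch below shows the generic clustering-based selection pattern the paper builds on: group correlated genes, then keep one representative gene per cluster. The distance measure, linkage, and representative rule are assumptions of this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_select(X, n_clusters):
    """Generic clustering-based gene selection (a simplified stand-in
    for the multi-scale MCBFS algorithm): group correlated genes,
    then keep the most variable gene from each cluster."""
    corr = np.corrcoef(X.T)                    # gene-gene correlation
    dist = 1.0 - np.abs(corr)                  # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    keep = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        keep.append(members[np.argmax(X[:, members].var(axis=0))])
    return sorted(keep)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 200))                 # samples x genes
selected = cluster_select(X, n_clusters=20)
```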


2018
Vol 17
pp. 117693511877108
Author(s):
Min Wang
Steven M Kornblau
Kevin R Coombes

Principal component analysis (PCA) is one of the most common techniques in the analysis of biological data sets, but applying PCA raises 2 challenges. First, one must determine the number of significant principal components (PCs). Second, because each PC is a linear combination of genes, it rarely has a biological interpretation. Existing methods to determine the number of PCs are either subjective or computationally intensive. We review several methods and describe a new R package, PCDimension, that implements additional methods, the most important being an algorithm that extends and automates a graphical Bayesian method. Using simulations, we compared the methods. Our newly automated procedure is competitive with the best methods when considering both accuracy and speed, and is the most accurate when the number of objects is small compared with the number of attributes. We applied the method to a proteomics data set from patients with acute myeloid leukemia. Proteins in the apoptosis pathway could be explained using 6 PCs. By clustering the proteins in PC space, we were able to replace the PCs with 6 “biological components,” 3 of which could be immediately interpreted from the current literature. We expect this approach combining PCA with clustering to be widely applicable.
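PCDimension is an R package; as a simpler, hedged illustration in Python, the broken-stick rule below is one classical way to pick the number of significant PCs (it is not the Bayesian procedure the package automates).

```python
import numpy as np
from sklearn.decomposition import PCA

def broken_stick_dimension(X):
    """Keep the leading PCs whose variance fraction exceeds the
    broken-stick expectation b_k = (1/p) * sum_{j=k..p} 1/j."""
    ratios = PCA().fit(X).explained_variance_ratio_
    p = len(ratios)
    bstick = np.array([np.sum(1.0 / np.arange(k, p + 1)) / p
                       for k in range(1, p + 1)])
    d = 0
    while d < p and ratios[d] > bstick[d]:
        d += 1
    return d

rng = np.random.default_rng(5)
signal = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 40))
X = signal + 0.1 * rng.normal(size=(100, 40))
print(broken_stick_dimension(X))  # expect 3 for this toy data
```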


2013
Vol 13 (6)
pp. 3133-3147
Author(s):
Y. L. Roberts
P. Pilewskie
B. C. Kindel
D. R. Feldman
W. D. Collins

Abstract. The Climate Absolute Radiance and Refractivity Observatory (CLARREO) is a climate observation system designed to monitor the Earth's climate with unprecedented absolute radiometric accuracy and SI traceability. Climate Observation System Simulation Experiments (OSSEs) have been generated to simulate CLARREO hyperspectral shortwave imager measurements and help define the measurement characteristics needed for CLARREO to achieve its objectives. To evaluate how well the OSSE-simulated reflectance spectra reproduce the Earth's climate variability at the beginning of the 21st century, we compared the variability of the OSSE reflectance spectra to that of the reflectance spectra measured by the Scanning Imaging Absorption Spectrometer for Atmospheric Chartography (SCIAMACHY). Principal component analysis (PCA) is a multivariate decomposition technique used to represent and study the variability of hyperspectral radiation measurements. Using PCA, between 99.7% and 99.9% of the total variance of the OSSE and SCIAMACHY data sets can be explained by subspaces defined by six principal components (PCs). To quantify how much information is shared between the simulated and observed data sets, we spectrally decomposed the intersection of the two data set subspaces. The results from four cases in 2004 showed that the two data sets share eight (January and October) and seven (April and July) dimensions, which correspond to about 99.9% of the total SCIAMACHY variance for each month. The spectral nature of these shared spaces, understood by examining the transformed eigenvectors calculated from the subspace intersections, exhibits physical characteristics similar to the original PCs calculated from each data set, such as water vapor absorption, vegetation reflectance, and cloud reflectance.
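The shared-dimension analysis can be approximated with principal angles between the two PC subspaces; SciPy's subspace_angles makes this a few lines. The 10-degree threshold and the synthetic data are assumptions of this sketch, and the paper's full spectral decomposition of the intersection is not reproduced here.

```python
import numpy as np
from scipy.linalg import subspace_angles
from sklearn.decomposition import PCA

def shared_dimensions(X_a, X_b, n_pcs, tol_deg=10.0):
    """Count near-coincident directions between the PC subspaces of
    two hyperspectral data sets via principal angles."""
    U = PCA(n_components=n_pcs).fit(X_a).components_.T
    V = PCA(n_components=n_pcs).fit(X_b).components_.T
    angles = np.degrees(subspace_angles(U, V))
    return int(np.sum(angles < tol_deg))

# Two synthetic data sets built from the same 6 spectral signals.
rng = np.random.default_rng(6)
common = rng.normal(size=(6, 120))
A = rng.normal(size=(300, 6)) @ common + 0.01 * rng.normal(size=(300, 120))
B = rng.normal(size=(300, 6)) @ common + 0.01 * rng.normal(size=(300, 120))
print(shared_dimensions(A, B, n_pcs=6))        # expect ~6
```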


Author(s):  
Quanming Yao
Xiawei Guo
James Kwok
Weiwei Tu
Yuqiang Chen
...  

To meet the standard of differential privacy, noise is usually added to the original data, which inevitably deteriorates the predictive performance of subsequent learning algorithms. In this paper, motivated by the success of ensemble learning at improving predictive performance, we propose to enhance privacy-preserving logistic regression by stacking. We show that this can be done with either sample-based or feature-based partitioning. However, we prove that when the privacy budgets are the same, feature-based partitioning requires fewer samples than sample-based partitioning, and thus is likely to have better empirical performance. As transfer learning is difficult to integrate with a differential privacy guarantee, we further combine the proposed method with hypothesis transfer learning to address the problem of learning across different organizations. Finally, we not only demonstrate the effectiveness of our method on two benchmark data sets, MNIST and NEWS20, but also apply it to a real application of cross-organizational diabetes prediction on the RUIJIN data set, where privacy is of significant concern.
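A hedged sketch of the feature-based stacking pattern: disjoint feature blocks each train a noisy base model, and a meta-learner stacks their outputs. The Laplace output perturbation below is only a placeholder; a real deployment must calibrate the noise to the model's sensitivity to obtain a genuine differential privacy guarantee, and the paper's mechanism differs in detail.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def private_base_learner(X, y, epsilon, sensitivity=1.0, rng=None):
    """Illustrative output perturbation: fit logistic regression, then
    add Laplace noise to the weights. The sensitivity constant here is
    a placeholder, not a calibrated privacy proof."""
    if rng is None:
        rng = np.random.default_rng()
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    clf.coef_ = clf.coef_ + rng.laplace(0.0, sensitivity / epsilon,
                                        clf.coef_.shape)
    return clf

def stack_by_features(X, y, n_blocks, epsilon):
    """Feature-based partitioning: each base learner sees a disjoint
    block of features; a meta-learner stacks their probabilities."""
    rng = np.random.default_rng(7)
    blocks = np.array_split(np.arange(X.shape[1]), n_blocks)
    bases = [private_base_learner(X[:, b], y, epsilon, rng=rng)
             for b in blocks]
    meta_X = np.column_stack([m.predict_proba(X[:, b])[:, 1]
                              for m, b in zip(bases, blocks)])
    meta = LogisticRegression(max_iter=1000).fit(meta_X, y)
    return blocks, bases, meta

rng = np.random.default_rng(8)
X = rng.normal(size=(400, 20))
y = (X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=400) > 0).astype(int)
blocks, bases, meta = stack_by_features(X, y, n_blocks=4, epsilon=1.0)
```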

