Privacy Preserving Clustering for Distributed Homogeneous Gene Expression Data Sets

Author(s):  
Xin Li

In this paper, the authors present a new approach to performing principal component analysis (PCA)-based gene clustering on genomic data distributed across multiple sites (horizontal partitions) with privacy protection. The approach allows data providers to collaborate to identify gene profiles from a global viewpoint while protecting the sensitive genomic data from possible privacy leaks. The authors develop a framework for privacy-preserving PCA-based gene clustering that includes two types of participants: data providers and a trusted central site. Within this mechanism, distributed horizontal partitions of genomic data can be clustered globally with privacy preservation. Compared with the centralized scenario, the result generated from the distributed partitions achieves 100% accuracy: an experiment on a real genomic data set shows that the proposed framework produces exactly the same cluster formation as the centralized data set.
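The exactness claim is plausible because, for horizontally partitioned data (same columns, different rows), the global covariance matrix can be reconstructed from per-site aggregates alone. The sketch below illustrates one standard construction of this kind, with made-up data; it is not necessarily the authors' exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical horizontal partitions: each site holds some samples (rows)
# over the same genes (columns).
parts = [rng.normal(size=(n, 5)) for n in (30, 20, 50)]

# Each site releases only aggregate statistics, never raw rows:
# its row count, column sums, and Gram matrix X^T X.
stats = [(X.shape[0], X.sum(axis=0), X.T @ X) for X in parts]

# The trusted central site reconstructs the global covariance exactly.
n = sum(s[0] for s in stats)
total = sum(s[1] for s in stats)
gram = sum(s[2] for s in stats)
mean = total / n
cov = (gram - n * np.outer(mean, mean)) / (n - 1)

# Sanity check against the centralized computation on pooled raw data.
centralized = np.cov(np.vstack(parts), rowvar=False)
assert np.allclose(cov, centralized)

# PCA on this covariance yields the same principal axes as PCA on the
# pooled data, so any downstream clustering is identical.
eigvals, eigvecs = np.linalg.eigh(cov)
```

Because the reconstruction is exact, the 100% agreement with the centralized result follows directly.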


Author(s):  
Xin Li

In this paper, we present approaches to performing principal component analysis (PCA) clustering on distributed heterogeneous genomic datasets with privacy protection. The approaches allow data providers to collaborate to identify gene profiles from a global viewpoint while protecting the sensitive genomic data from possible privacy leaks. We further develop a framework for privacy-preserving PCA-based gene clustering that includes two types of participants: data providers and a trusted central site (TCS). Two methodologies are employed: Collective PCA (C-PCA) and Repeating PCA (R-PCA). C-PCA requires local sites to transmit a sample of the original data to the TCS and can be applied to any heterogeneous datasets. R-PCA requires that all local sites have the same or a similar number of columns, but releases no original data. Experiments on five independent genomic datasets show that both C-PCA and R-PCA maintain very good accuracy compared with the centralized scenario.
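The abstract does not give the details of R-PCA, but the general pattern it names (each site reduces its own data locally and releases only low-dimensional projections, which the central site pools and re-analyzes) can be sketched as follows. All names and data here are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical sites with the same number of columns (as R-PCA requires);
# each reduces its own rows locally and releases only the projections.
sites = [rng.normal(size=(n, 8)) for n in (40, 60)]

def local_pca_projection(X, k):
    """Project X onto its top-k local principal axes; raw rows stay on-site."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

# Each site sends its k-dimensional projection to the central site,
# which pools them and repeats PCA on the combined low-dim data.
k = 3
projections = [local_pca_projection(X, k) for X in sites]
pooled = np.vstack(projections)
pooled_c = pooled - pooled.mean(axis=0)
_, s, _ = np.linalg.svd(pooled_c, full_matrices=False)
```

The privacy appeal of this pattern is that only k-dimensional projections leave each site; the trade-off, unlike the exact horizontal-partition case, is that pooled results only approximate the centralized analysis.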



2007
Vol 56 (6)
pp. 75-83
Author(s):
X. Flores
J. Comas
I.R. Roda
L. Jiménez
K.V. Gernaey

The main objective of this paper is to present the application of selected multivariate statistical techniques to the analysis of plant-wide wastewater treatment plant (WWTP) control strategies. In this study, cluster analysis (CA), principal component analysis/factor analysis (PCA/FA) and discriminant analysis (DA) are applied to the evaluation matrix data set obtained by simulating several control strategies on the plant-wide IWA Benchmark Simulation Model No 2 (BSM2). These techniques make it possible to (i) determine natural groups or clusters of control strategies with similar behaviour, (ii) find and interpret hidden, complex and causal relationships in the data set, and (iii) identify important discriminant variables within the groups found by cluster analysis. This study illustrates the usefulness of multivariate statistical techniques for both analysis and interpretation of complex multicriteria data sets and allows improved use of information for effective evaluation of control strategies.
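The core of such an evaluation-matrix analysis (strategies as rows, performance criteria as columns, standardized and projected onto principal components before grouping) can be sketched in a few lines. The data and the crude grouping step here are illustrative assumptions, not the BSM2 study itself.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical evaluation matrix: rows = control strategies,
# columns = performance criteria (effluent quality, cost, ...),
# with two artificially distinct groups of strategies.
scores = np.vstack([rng.normal(0, 1, (10, 6)), rng.normal(4, 1, (10, 6))])

# Standardize criteria so that no single unit or scale dominates.
Z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# PCA via SVD: principal-component scores summarize each strategy.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
pcs = U * s                      # strategy coordinates in PC space
explained = s**2 / np.sum(s**2)  # variance explained per component

# Crude cluster step on PC1: the strategies split into two natural groups.
labels = (pcs[:, 0] > np.median(pcs[:, 0])).astype(int)
```

A real study would use proper cluster analysis and then discriminant analysis on the resulting groups, but the standardize-project-group pipeline is the same.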


Author(s):
L Mohana Tirumala
S. Srinivasa Rao

Privacy preservation in data mining and publishing plays a major role in today's networked world. It is important to preserve the privacy of the vital information in a data set. This can be achieved by a k-anonymization solution for classification. Along with privacy preservation through anonymization, producing optimized data sets in a cost-effective manner is of equal importance. In this paper, a Top-Down Refinement algorithm is proposed that yields optimal results in a cost-effective manner. Bayesian classification is also proposed to predict class-membership probabilities for a data tuple whose class label is unknown.
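The k-anonymity property underlying this work is easy to state concretely: every combination of quasi-identifier values must occur in at least k records. The check below, with hypothetical records and field names, shows the constraint that a top-down refinement step must keep satisfied as it specializes values.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """Check that every quasi-identifier combination occurs >= k times."""
    combos = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in combos.values())

# Hypothetical records after generalizing age and ZIP code; top-down
# refinement would specialize these values only while k-anonymity holds.
records = [
    {"age": "30-39", "zip": "480**", "disease": "flu"},
    {"age": "30-39", "zip": "480**", "disease": "cold"},
    {"age": "40-49", "zip": "481**", "disease": "flu"},
    {"age": "40-49", "zip": "481**", "disease": "flu"},
]
assert is_k_anonymous(records, ("age", "zip"), k=2)
assert not is_k_anonymous(records, ("age", "zip"), k=3)
```

The sensitive attribute ("disease") is left untouched; only the quasi-identifiers are generalized or refined.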


Author(s):
Andrew J. Connolly
Jacob T. VanderPlas
Alexander Gray
...

With the dramatic increase in data available from a new generation of astronomical telescopes and instruments, many analyses must address the question of the complexity as well as size of the data set. This chapter deals with how we can learn which measurements, properties, or combinations thereof carry the most information within a data set. It describes techniques that are related to concepts discussed when describing Gaussian distributions, density estimation, and the concepts of information content. The chapter begins with an exploration of the problems posed by high-dimensional data. It then describes the data sets used in this chapter, and introduces perhaps the most important and widely used dimensionality reduction technique, principal component analysis (PCA). The remainder of the chapter discusses several alternative techniques which address some of the weaknesses of PCA.
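PCA is the chapter's central tool, and its value on high-dimensional data is simple to demonstrate: when many measured features are driven by a few underlying degrees of freedom, the leading components capture nearly all the variance. The data below are synthetic, built for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical high-dimensional data: 200 objects, 50 measured features,
# but only 3 underlying degrees of freedom plus a little noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

# The first 3 components capture nearly all the variance, so the
# 50-dimensional data can be reduced to 3 dimensions with little loss.
reduced = Xc @ Vt[:3].T
assert explained[:3].sum() > 0.99
```

The chapter's alternative techniques address cases where this linear picture fails, e.g. when the data lie on a curved low-dimensional manifold.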


BMC Genomics
2020
Vol 21 (1)
Author(s):
Da Xu
Jialin Zhang
Hanxiao Xu
Yusen Zhang
Wei Chen
...

Abstract
Background: The small number of samples and the curse of dimensionality hamper the application of deep learning techniques to disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from satisfactory due to their reliance on unsupervised learning. To enhance interpretability and overcome these problems, we developed a novel feature selection algorithm. Meanwhile, complex genomic data pose great challenges for the identification of biomarkers and therapeutic targets, and some current feature selection methods in this field suffer from low sensitivity and specificity.
Results: In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS, which simultaneously performs feature selection and model learning for genomic data analysis. Experimental results demonstrate that MCBFS is robust and effective when compared with seven benchmark and six state-of-the-art supervised methods on eight data sets. Visualization results and statistical tests show that MCBFS captures informative genes and improves the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW that uses gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for the diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, a network recognition ensemble algorithm, and a feature selection wrapper. McbfsNW has been applied to lung adenocarcinoma (LUAD) data sets. Preliminary results demonstrate that the identified biomarkers attain higher prediction accuracy on the independent LUAD data set, and we also constructed a drug-target network that may be useful for LUAD therapy.
Conclusions: The proposed feature selection method is robust and effective for gene selection, classification, and visualization. The McbfsNW framework is practical and helpful for the identification of biomarkers and targets in genomic data. We believe the same methods and principles are extensible and applicable to other kinds of data sets.
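The generic idea behind clustering-based feature selection, i.e. group correlated features and keep one representative per group, can be sketched briefly. This toy version is an illustration of the family of methods, not the MCBFS algorithm itself; the data and the correlation threshold are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical expression matrix: 60 samples x 9 genes, where the genes
# come in three tightly correlated groups (a stand-in for real structure).
base = rng.normal(size=(60, 3))
X = np.repeat(base, 3, axis=1) + 0.1 * rng.normal(size=(60, 9))

# Cluster features by correlation, then keep one representative per
# cluster (here, the feature with the highest variance).
corr = np.corrcoef(X, rowvar=False)
clusters = [np.flatnonzero(np.abs(corr[i]) > 0.8) for i in (0, 3, 6)]
selected = [c[np.argmax(X[:, c].var(axis=0))] for c in clusters]
```

Reducing 9 redundant genes to 3 representatives preserves most of the signal while shrinking the dimensionality, which is the motivation shared by far more sophisticated methods such as MCBFS.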


2018
Vol 14 (4)
pp. 20-37
Author(s):
Yinglei Song
Yongzhong Li
Junfeng Qu

This article develops a new approach for supervised dimensionality reduction. This approach considers both global and local structures of a labelled data set and maximizes a new objective that includes the effects from both of them. The objective can be approximately optimized by solving an eigenvalue problem. The approach is evaluated based on a few benchmark data sets and image databases. Its performance is also compared with a few other existing approaches for dimensionality reduction. Testing results show that, on average, this new approach can achieve more accurate results for dimensionality reduction than existing approaches.
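The abstract does not give the new objective, but the pattern it describes, a supervised criterion whose maximization reduces to an eigenvalue problem, is classically exemplified by Fisher's linear discriminant, which trades off between-class scatter against within-class scatter. The sketch below shows that classical baseline on synthetic labelled data; it is not the article's own objective.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical labelled data: two classes in 5 dimensions, separated
# along the first axis.
X0 = rng.normal(0, 1, (50, 5))
X1 = rng.normal(0, 1, (50, 5)) + np.array([3.0, 0, 0, 0, 0])
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Between-class scatter Sb (separation) vs within-class scatter Sw
# (compactness); maximizing their ratio is an eigenvalue problem.
mean = X.mean(axis=0)
Sw = np.zeros((5, 5))
Sb = np.zeros((5, 5))
for c in (0, 1):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    d = (mc - mean).reshape(-1, 1)
    Sb += len(Xc) * (d @ d.T)

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])  # top direction
```

Projecting onto `w` separates the two classes well, which is exactly what such eigenvalue-based supervised reductions aim for; the article's contribution is a richer objective that also accounts for local structure.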


2018
Vol 17
pp. 117693511877108
Author(s):
Min Wang
Steven M Kornblau
Kevin R Coombes

Principal component analysis (PCA) is one of the most common techniques in the analysis of biological data sets, but applying PCA raises 2 challenges. First, one must determine the number of significant principal components (PCs). Second, because each PC is a linear combination of genes, it rarely has a biological interpretation. Existing methods to determine the number of PCs are either subjective or computationally extensive. We review several methods and describe a new R package, PCDimension, that implements additional methods, the most important being an algorithm that extends and automates a graphical Bayesian method. Using simulations, we compared the methods. Our newly automated procedure is competitive with the best methods when considering both accuracy and speed and is the most accurate when the number of objects is small compared with the number of attributes. We applied the method to a proteomics data set from patients with acute myeloid leukemia. Proteins in the apoptosis pathway could be explained using 6 PCs. By clustering the proteins in PC space, we were able to replace the PCs by 6 “biological components,” 3 of which could be immediately interpreted from the current literature. We expect this approach combining PCA with clustering to be widely applicable.
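One objective criterion of the kind this paper surveys is the broken-stick rule: keep a PC only while its share of the variance exceeds what a randomly broken stick of the same length would give. The sketch below applies it to a hypothetical eigenvalue spectrum; it is one classical method among those reviewed, not the PCDimension package's automated Bayesian procedure.

```python
import numpy as np

# Hypothetical PCA eigenvalue spectrum: two strong components, then noise.
eigvals = np.array([6.0, 3.0, 0.4, 0.3, 0.2, 0.1])
proportions = eigvals / eigvals.sum()

# Broken-stick expectations: the i-th largest piece of a unit stick
# broken at p-1 uniform random points has expected length
# (1/p) * sum_{j=i}^{p} 1/j.
p = len(proportions)
broken_stick = np.array([np.sum(1.0 / np.arange(i + 1, p + 1)) / p
                         for i in range(p)])

# Count leading PCs whose variance share beats the broken-stick value.
n_significant = int(np.argmax(proportions <= broken_stick))
```

Here the first two components (60% and 30% of the variance) beat their broken-stick expectations and the third does not, so two PCs are retained.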


1988
Vol 32 (17)
pp. 1183-1187
Author(s):
J. G. Kreifeldt
S. H. Levine
M. C. Chuang

Sensory modalities exhibit a characteristic known as Weber's ratio, which states that when two stimuli are compared for a difference: (1) there is some minimal nonzero difference that can be detected, and (2) this minimal difference is a nearly constant proportion of the magnitude of the stimuli. Both would, in a typical measurement context, appear to be system defects. We have found through simulation that these are apparently the characteristics required of a system designed to extract an adequate amount of information from an incomplete observation data set, according to a new approach to measurement.
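The two properties can be made concrete with a minimal model: a just-noticeable difference (JND) that is a fixed fraction of stimulus magnitude. The Weber fraction below is an arbitrary illustrative value, not one from the paper.

```python
# Weber's law: the just-noticeable difference grows in proportion to
# stimulus magnitude. Minimal sketch with an assumed Weber fraction.
WEBER_FRACTION = 0.05  # hypothetical constant for some sensory modality

def just_noticeable_difference(magnitude):
    """Smallest detectable change for a stimulus of the given magnitude."""
    return WEBER_FRACTION * magnitude

def distinguishable(a, b):
    """Two stimuli can be told apart only if they differ by >= one JND."""
    return abs(a - b) >= just_noticeable_difference(min(a, b))

assert distinguishable(100, 106)        # 6 >= 5: above one JND
assert not distinguishable(100, 103)    # 3 < 5: below one JND
assert not distinguishable(1000, 1030)  # same absolute gap, larger JND
```

The last two cases show both "defects" at once: a nonzero detection floor, and a floor that scales with magnitude rather than staying fixed.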

