Federated-PCA on Vertical-Partitioned Data

2020 ◽ 
Author(s):  
Yiu-ming Cheung ◽  
Feng Yu

In the cross-silo federated learning setting, data partitioned by features, known as vertical (i.e., feature-wise) federated learning (Yang et al. 2019), applies to multiple datasets that share the same sample ID space but different feature spaces. Image datasets can likewise be partitioned by labels. To improve the model performance of the isolated parties given feature-wise (or label-wise) results, the most effective strategy is to federate the model results of the parties; however, allowing the participating parties to share these results without violating their data privacy is non-trivial. In this paper, within the framework of principal component analysis (PCA), we propose a Federated-PCA machine learning approach in which PCA reduces the dimensionality of each party's sample data and extracts the principal component features, improving the efficiency of subsequent training without revealing any party's original data. The federated mechanism helps the parties build a common benefit strategy, and all parties participate with equal status. Comparative experiments between the federated results of the isolated parties and the result obtained on the unpartitioned data show that the two settings perform similarly, and that the proposed method effectively improves the training performance of most participating parties.
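
A minimal sketch of the vertical-partitioning idea, assuming each party runs PCA locally on its own feature block and shares only the resulting low-dimensional scores; the party sizes and component counts below are hypothetical, not the authors' protocol:

```python
# Minimal sketch of PCA on vertically partitioned data (illustrative only).
# Each party holds different features for the same samples, runs PCA locally,
# and shares only its low-dimensional projection, never the raw feature values.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples = 200
# Hypothetical vertical partition: party A holds 30 features, party B holds 20.
X_party_a = rng.normal(size=(n_samples, 30))
X_party_b = rng.normal(size=(n_samples, 20))

def local_pca_projection(X, n_components):
    """Each party reduces its own feature block and shares only the scores."""
    pca = PCA(n_components=n_components)
    return pca.fit_transform(X)

# Parties share reduced representations (scores), aligned by sample ID.
Z_a = local_pca_projection(X_party_a, n_components=5)
Z_b = local_pca_projection(X_party_b, n_components=5)

# "Federated" feature matrix used for subsequent joint training.
Z_federated = np.hstack([Z_a, Z_b])
print(Z_federated.shape)  # (200, 10)
```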


2020 ◽  
Author(s):  
Yannis Pantazis ◽  
Christos Tselas ◽  
Kleanthi Lakiotaki ◽  
Vincenzo Lagani ◽  
Ioannis Tsamardinos

High-throughput technologies such as microarrays and RNA-sequencing (RNA-seq) allow transcriptomic profiles to be quantified precisely, generating datasets that are inevitably high-dimensional. In this work, we investigate whether the whole human transcriptome can be represented in a compressed, low-dimensional latent space without losing relevant information. We thus constructed low-dimensional latent feature spaces of the human genome by utilizing three dimensionality reduction approaches and a diverse set of curated datasets. We applied standard Principal Component Analysis (PCA), kernel PCA and Autoencoder Neural Networks to 1360 datasets from four different measurement technologies. The latent feature spaces are tested for their ability to (a) reconstruct the original data and (b) improve predictive performance on validation datasets not used during the creation of the feature space. While linear techniques show better reconstruction performance, nonlinear approaches, particularly neural-based models, seem able to capture non-additive interaction effects and thus enjoy stronger predictive capabilities. Our results show that low-dimensional representations of the human transcriptome can be achieved by integrating hundreds of datasets, despite the limited sample size of each dataset and the biological/technological heterogeneity across studies. The created space is two to three orders of magnitude smaller than the raw data, capturing a large portion of the original data variability and ultimately reducing computational time for downstream analyses.
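
As a rough illustration of how such latent spaces can be checked for reconstruction quality, the following sketch compares linear PCA and kernel PCA on a synthetic samples-by-genes matrix; the data and dimensions are placeholders, not the curated datasets used in the study:

```python
# Illustrative sketch: compare the reconstruction error of linear PCA and
# kernel PCA on a synthetic "expression" matrix with a planted low-rank signal.
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1000))                                # samples x genes
X += rng.normal(size=(300, 10)) @ rng.normal(size=(10, 1000))   # low-rank signal

pca = PCA(n_components=10).fit(X)
X_hat_pca = pca.inverse_transform(pca.transform(X))

kpca = KernelPCA(n_components=10, kernel="rbf", fit_inverse_transform=True).fit(X)
X_hat_kpca = kpca.inverse_transform(kpca.transform(X))

def rel_error(X, X_hat):
    # Relative Frobenius-norm reconstruction error.
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

print("PCA reconstruction error:       ", rel_error(X, X_hat_pca))
print("kernel PCA reconstruction error:", rel_error(X, X_hat_kpca))
```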


2013 ◽  
Vol 756-759 ◽  
pp. 3450-3454
Author(s):  
Feng Qiao ◽  
Hao Ming Zhao ◽  
Feng Zhang ◽  
Qing Ma

Traditional Principal Component Analysis (PCA) has shortcomings when applied to fault detection and diagnosis. This paper presents a fault diagnosis method based on conventional PCA enhanced by wavelet denoising. The proposed method employs wavelet denoising to preprocess the signals, which preserves sufficient information from the original data, and then establishes the PCA model. Based on the SPE and T² statistics, abnormal situations can be detected, and the location of the fault can be identified via contribution plots. Finally, simulation studies with Matlab verify the correctness and effectiveness of the proposed method and demonstrate its advantages over conventional PCA.
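
The following is a hedged sketch of the general monitoring scheme described above: wavelet-denoise the signals, fit PCA on normal-operation data, and flag abnormal samples via Hotelling's T² and SPE (Q) statistics. The wavelet choice, threshold rule, and dimensions are assumptions rather than the paper's settings:

```python
# Sketch: wavelet denoising followed by PCA-based T^2 / SPE monitoring.
import numpy as np
import pywt
from sklearn.decomposition import PCA

def wavelet_denoise(signal, wavelet="db4", level=3):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745            # noise estimate
    thr = sigma * np.sqrt(2 * np.log(len(signal)))             # universal threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 64))                           # normal-operation data
X_train = np.array([wavelet_denoise(x) for x in X_train])

pca = PCA(n_components=5).fit(X_train)

def t2_and_spe(x):
    x = wavelet_denoise(x)
    scores = pca.transform(x.reshape(1, -1))[0]
    x_hat = pca.inverse_transform(scores.reshape(1, -1))[0]
    t2 = np.sum(scores**2 / pca.explained_variance_)           # Hotelling's T^2
    spe = np.sum((x - x_hat) ** 2)                             # squared prediction error (Q)
    return t2, spe

print(t2_and_spe(rng.normal(size=64)))
```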


2021 ◽  
Vol 11 (2) ◽  
pp. 911-923
Author(s):  
Chuanjun Han ◽  
Yang Yue

At present, oil companies are committed to applying mathematical and data-science methods to the study of oilfield data. However, for some old oil wells, aging equipment, complex environments and backward management mean that the authenticity and accuracy of the data collected by the equipment cannot be guaranteed. Motivated by the practical engineering needs of old wells, this paper proposes a method based on principal component analysis, cluster analysis and regression analysis to mine and analyze the polished rod load data of old oil wells and thereby judge the working conditions of the wells. Applied in several operation areas of certain oilfields, the findings of this study help reveal the working-condition information hidden in oilfield "big data". The PCA step reduces the complexity of the original data, the regression equation calculates the polished rod load more accurately, and the prediction model effectively judges the working conditions of old oil wells on site.
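
A minimal sketch of the workflow outlined above (PCA, then clustering, then per-cluster regression); the features, cluster count, and regression model are hypothetical stand-ins, not the paper's choices:

```python
# Sketch: PCA -> clustering -> per-cluster regression on synthetic well data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 12))               # hypothetical well-monitoring features
y = X[:, 0] * 2.0 + rng.normal(size=500)     # stand-in for polished rod load

# 1) PCA reduces the complexity of the raw measurements.
Z = PCA(n_components=4).fit_transform(X)

# 2) Clustering groups records into candidate working conditions.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)

# 3) A regression model per cluster estimates the polished rod load.
for k in range(3):
    mask = labels == k
    reg = LinearRegression().fit(Z[mask], y[mask])
    print(f"cluster {k}: R^2 = {reg.score(Z[mask], y[mask]):.2f}")
```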


2021 ◽  
Vol 11 (3) ◽  
pp. 359
Author(s):  
Katharina Hogrefe ◽  
Georg Goldenberg ◽  
Ralf Glindemann ◽  
Madleen Klonowski ◽  
Wolfram Ziegler

Assessment of semantic processing capacities often relies on verbal tasks which are, however, sensitive to impairments at several language processing levels. Especially for persons with aphasia there is a strong need for a tool that measures semantic processing skills independent of verbal abilities. Furthermore, in order to assess a patient’s potential for using alternative means of communication in cases of severe aphasia, semantic processing should be assessed in different nonverbal conditions. The Nonverbal Semantics Test (NVST) is a tool that captures semantic processing capacities through three tasks—Semantic Sorting, Drawing, and Pantomime. The main aim of the current study was to investigate the relationship between the NVST and measures of standard neurolinguistic assessment. Fifty-one persons with aphasia caused by left hemisphere brain damage were administered the NVST as well as the Aachen Aphasia Test (AAT). A principal component analysis (PCA) was conducted across all AAT and NVST subtests. The analysis resulted in a two-factor model that captured 69% of the variance of the original data, with all linguistic tasks loading high on one factor and the NVST subtests loading high on the other. These findings suggest that nonverbal tasks assessing semantic processing capacities should be administered alongside standard neurolinguistic aphasia tests.
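
For illustration only, a two-component PCA over a set of subtest scores can be inspected as follows; the synthetic scores below merely mimic one verbal and one nonverbal factor and are not the AAT/NVST data:

```python
# Sketch: two-component PCA over subtest scores, with explained variance and loadings.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
verbal = rng.normal(size=(51, 1)) + rng.normal(scale=0.3, size=(51, 4))     # 4 "linguistic" subtests
nonverbal = rng.normal(size=(51, 1)) + rng.normal(scale=0.3, size=(51, 3))  # 3 "nonverbal" subtests
scores = np.hstack([verbal, nonverbal])

pca = PCA(n_components=2).fit(scores)
print("variance explained:", pca.explained_variance_ratio_.sum())
print("loadings (components x subtests):\n", np.round(pca.components_, 2))
```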


2021 ◽  
pp. 000370282098784
Author(s):  
James Renwick Beattie ◽  
Francis Esmonde-White

Spectroscopy rapidly captures a large amount of data that is not directly interpretable. Principal Components Analysis (PCA) is widely used to simplify complex spectral datasets into comprehensible information by identifying recurring patterns in the data with minimal loss of information. The linear algebra underpinning PCA is not well understood by many applied analytical scientists and spectroscopists who use it, and the meaning of the features identified through PCA is often unclear. This manuscript traces the journey of the spectra themselves through the operations behind PCA, with each step illustrated by simulated spectra. PCA relies solely on the information within the spectra; consequently, the mathematical model depends on the nature of the data itself. The direct links between model and spectra allow a concrete spectroscopic explanation of PCA, such as the scores representing 'concentration' or 'weights'. The principal components (loadings) are by definition hidden, repeated and uncorrelated spectral shapes that combine linearly to generate the observed spectra. They can be visualized as subtraction spectra between extreme differences within the dataset. Each PC is shown to be a successive refinement of the estimated spectra, improving the fit between the PC-reconstructed data and the original data. Understanding the data-led development of a PCA model shows how to interpret the application-specific chemical meaning of the PCA loadings and how to analyze the scores. A critical benefit of PCA is its simplicity and the succinctness of its description of a dataset, making it powerful and flexible.
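
The successive-refinement view described above can be sketched as follows: simulated spectra are built as linear mixtures of two peak shapes, and reconstruction error is tracked as principal components are added one at a time (the simulation parameters are arbitrary):

```python
# Sketch: each additional principal component refines the estimate of every spectrum.
import numpy as np
from sklearn.decomposition import PCA

x = np.linspace(0, 100, 500)
peak = lambda c, w: np.exp(-((x - c) ** 2) / (2 * w**2))

rng = np.random.default_rng(5)
conc = rng.uniform(size=(60, 2))                        # "concentrations" (score-like weights)
spectra = conc @ np.vstack([peak(30, 5), peak(65, 8)])  # linear mixtures of two shapes
spectra += rng.normal(scale=0.01, size=spectra.shape)   # measurement noise

pca = PCA(n_components=5).fit(spectra)
for k in range(1, 6):
    scores = pca.transform(spectra)[:, :k]
    recon = scores @ pca.components_[:k] + pca.mean_
    err = np.linalg.norm(spectra - recon) / np.linalg.norm(spectra)
    print(f"{k} PC(s): relative reconstruction error = {err:.4f}")
```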


2021 ◽  
Vol 11 (5) ◽  
pp. 2166
Author(s):  
Van Bui ◽  
Tung Lam Pham ◽  
Huy Nguyen ◽  
Yeong Min Jang

In the last decade, predictive maintenance has attracted much attention in industrial factories owing to the wide use of the Internet of Things and artificial intelligence algorithms for data management. However, in the early phases, when abnormal or faulty machines rarely appear in a factory, only limited sets of machine fault samples are available. With limited fault samples, it is difficult to train a fault classification model because of the imbalance of the input data, so data augmentation is required to increase the accuracy of the learning model. However, there are few methods for generating such data and evaluating its suitability for analysis. In this paper, we introduce a method that uses a generative adversarial network to augment fault signals and enrich the dataset. The enhanced dataset can increase the accuracy of the machine fault detection model during training. We also performed fault detection using a variety of preprocessing approaches and classification models to evaluate the similarity between the generated data and authentic data. The generated fault data are highly similar to the original data and significantly improve the accuracy of the model: fault detection accuracy reaches 99.41% when 20% of the original fault data are included and 93.1% when only generated data are used. Based on this, we conclude that the generated data can be mixed with the original data to improve model performance.
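
A highly simplified sketch of GAN-based signal augmentation, assuming fully connected generator and discriminator networks and placeholder 1-D fault signals; the architecture and training schedule are illustrative assumptions, not the paper's configuration (requires PyTorch):

```python
# Sketch: a tiny GAN that generates 1-D "fault signals" for data augmentation.
import torch
import torch.nn as nn

signal_len, noise_dim = 128, 32

G = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(),
                  nn.Linear(128, signal_len), nn.Tanh())
D = nn.Sequential(nn.Linear(signal_len, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_signals = torch.randn(256, signal_len)   # placeholder for scarce fault samples

for step in range(200):
    # --- discriminator step: distinguish real from generated signals ---
    z = torch.randn(64, noise_dim)
    fake = G(z).detach()
    real = real_signals[torch.randint(0, 256, (64,))]
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- generator step: try to fool the discriminator ---
    z = torch.randn(64, noise_dim)
    loss_g = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Generated signals can then be mixed with the real fault samples for training.
augmented = torch.cat([real_signals, G(torch.randn(256, noise_dim)).detach()])
```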


Author(s):  
Dhamanpreet Kaur ◽  
Matthew Sobiesk ◽  
Shubham Patil ◽  
Jin Liu ◽  
Puran Bhagat ◽  
...  

Objective: This study seeks to develop a fully automated method of generating synthetic data from a real dataset that could be employed by medical organizations to distribute health data to researchers, reducing the need for access to real data. We hypothesize that the application of Bayesian networks will improve upon the predominant existing method, medBGAN, in handling the complexity and dimensionality of healthcare data.
Materials and Methods: We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data.
Results: Our Bayesian network model outperformed or equaled medBGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules.
Discussion: Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types in comparison to existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools.
Conclusion: We conclude that the application of Bayesian networks is a promising option for generating realistic synthetic health data that preserves the features of the original data without compromising data privacy.
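
As a conceptual sketch of sampling synthetic records from a Bayesian network, the toy example below hand-specifies a three-variable network (age group → diagnosis → medication) with invented probabilities; the study instead learns the structure and parameters from the real data:

```python
# Sketch: forward sampling of synthetic records from a tiny discrete Bayesian network.
import numpy as np

rng = np.random.default_rng(6)

p_age = np.array([0.3, 0.5, 0.2])                    # P(age_group)
p_dx_given_age = np.array([[0.9, 0.1],               # P(diagnosis | age_group)
                           [0.7, 0.3],
                           [0.4, 0.6]])
p_rx_given_dx = np.array([[0.95, 0.05],              # P(medication | diagnosis)
                          [0.20, 0.80]])

def sample_record():
    # Sample each variable given its parents, following the network's topological order.
    age = rng.choice(3, p=p_age)
    dx = rng.choice(2, p=p_dx_given_age[age])
    rx = rng.choice(2, p=p_rx_given_dx[dx])
    return age, dx, rx

synthetic = np.array([sample_record() for _ in range(1000)])
print(synthetic[:5])
```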


2021 ◽  
Vol 13 (6) ◽  
pp. 1205
Author(s):  
Caidan Zhao ◽  
Gege Luo ◽  
Yilin Wang ◽  
Caiyun Chen ◽  
Zhiqiang Wu

A micro-Doppler signature (m-DS) based on the rotation of drone blades is an effective way to detect and identify small drones. Deep-learning-based recognition algorithms can achieve higher recognition performance, but they need a large amount of sample data to train models. In addition to the hovering state, the signal samples of small unmanned aerial vehicles (UAVs) should also cover flight dynamics such as vertical, pitch, forward and backward, roll, lateral, and yaw motion. However, it is difficult to collect all dynamic UAV signal samples under actual flight conditions, and these dynamic flight characteristics shift the original features, degrading the performance of the recognizer. In this paper, we propose a small-UAV m-DS recognition algorithm based on dynamic feature enhancement. We extract combined principal component analysis and discrete wavelet transform (PCA-DWT) time–frequency characteristics and texture features of the UAV's micro-Doppler signal and use a dynamic attribute-guided augmentation (DAGA) algorithm to expand the feature domain for model training, yielding an adaptive, accurate, and efficient multiclass recognition model in complex environments. Once the trained model is stable, the average recognition accuracy reaches 98% during dynamic flight.
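
A rough sketch of a combined PCA-DWT feature extraction step of the kind named above; the wavelet, decomposition level, component count, and placeholder signals are assumptions for illustration:

```python
# Sketch: build a combined PCA + DWT feature vector per micro-Doppler signal.
import numpy as np
import pywt
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
signals = rng.normal(size=(200, 1024))     # placeholder micro-Doppler time series

def dwt_features(sig, wavelet="db4", level=4):
    # Simple per-band energy features from the wavelet decomposition.
    coeffs = pywt.wavedec(sig, wavelet, level=level)
    return np.array([np.sum(c**2) for c in coeffs])

dwt_feats = np.array([dwt_features(s) for s in signals])
pca_feats = PCA(n_components=10).fit_transform(signals)

# Combined PCA-DWT feature vector per signal, ready for a recognition model.
features = np.hstack([pca_feats, dwt_feats])
print(features.shape)
```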

