Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study

Background Machine learning (ML) is now widely deployed in our everyday lives. Building robust ML models requires a massive amount of data for training. Traditional ML algorithms require training data centralization, which raises privacy and data governance issues. Federated learning (FL) is an approach to overcome this issue. We focused on applying FL on vertically partitioned data, in which an individual’s record is scattered among different sites. Objective The aim of this study was to perform FL on vertically partitioned data to achieve performance comparable to that of centralized models without exposing the raw data. Methods We used three different datasets (Adult income, Schwannoma, and eICU datasets) and vertically divided each dataset into different pieces. Following the vertical division of data, overcomplete autoencoder-based model training was performed for each site. Following training, each site’s data were transformed into latent data, which were aggregated for training. A tabular neural network model with categorical embedding was used for training. A centrally based model was used as a baseline model, which was compared to that of FL in terms of accuracy and area under the receiver operating characteristic curve (AUROC). Results The autoencoder-based network successfully transformed the original data into latent representations with no domain knowledge applied. These altered data were different from the original data in terms of the feature space and data distributions, indicating appropriate data security. The loss of performance was minimal when using an overcomplete autoencoder; accuracy loss was 1.2%, 8.89%, and 1.23%, and AUROC loss was 1.1%, 0%, and 1.12% in the Adult income, Schwannoma, and eICU dataset, respectively. Conclusions We proposed an autoencoder-based ML model for vertically incomplete data. Since our model is based on unsupervised learning, no domain-specific knowledge is required in individual sites. Under the circumstances where direct data sharing is not available, our approach may be a practical solution enabling both data protection and building a robust model.

Download Full-text

Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study (Preprint)

10.2196/preprints.26598 ◽

2020 ◽

Author(s):

Dongchul Cha ◽

MinDong Sung ◽

Yu-Rang Park

Keyword(s):

Domain Knowledge ◽

Characteristic Curve ◽

Feature Space ◽

Original Data ◽

Training Data ◽

Data Governance ◽

Domain Specific Knowledge ◽

Partitioned Data ◽

Vertically Partitioned Data ◽

Latent Representations

BACKGROUND Machine learning (ML) is now widely deployed in our everyday lives. Building robust ML models requires a massive amount of data for training. Traditional ML algorithms require training data centralization, which raises privacy and data governance issues. Federated learning (FL) is an approach to overcome this issue. We focused on applying FL on vertically partitioned data, in which an individual’s record is scattered among different sites. OBJECTIVE The aim of this study was to perform FL on vertically partitioned data to achieve performance comparable to that of centralized models without exposing the raw data. METHODS We used three different datasets (Adult income, Schwannoma, and eICU datasets) and vertically divided each dataset into different pieces. Following the vertical division of data, overcomplete autoencoder-based model training was performed for each site. Following training, each site’s data were transformed into latent data, which were aggregated for training. A tabular neural network model with categorical embedding was used for training. A centrally based model was used as a baseline model, which was compared to that of FL in terms of accuracy and area under the receiver operating characteristic curve (AUROC). RESULTS The autoencoder-based network successfully transformed the original data into latent representations with no domain knowledge applied. These altered data were different from the original data in terms of the feature space and data distributions, indicating appropriate data security. The loss of performance was minimal when using an overcomplete autoencoder; accuracy loss was 1.2%, 8.89%, and 1.23%, and AUROC loss was 1.1%, 0%, and 1.12% in the Adult income, Schwannoma, and eICU dataset, respectively. CONCLUSIONS We proposed an autoencoder-based ML model for vertically incomplete data. Since our model is based on unsupervised learning, no domain-specific knowledge is required in individual sites. Under the circumstances where direct data sharing is not available, our approach may be a practical solution enabling both data protection and building a robust model.

Download Full-text

A MULTIPLE CLASSIFIER SYSTEM USING AMBIGUITY REJECTION FOR CLUSTERING-CLASSIFICATION COOPERATION

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s021848850000054x ◽

2000 ◽

Vol 08 (06) ◽

pp. 747-762 ◽

Cited By ~ 2

Author(s):

VEYIS GUNES ◽

MICHEL MENARD ◽

PIERRE LOONIS

Keyword(s):

Fuzzy Clustering ◽

Feature Space ◽

Original Data ◽

Point Of View ◽

Training Data ◽

Data Set ◽

Classifier Selection ◽

Fuzzy Clustering Method ◽

Data Clusters ◽

Multiple Classifier

This article aims at showing that supervised and unsupervised learnings are not competitive, but complementary methods. We propose to use a fuzzy clustering method with ambiguity rejection to guide the supervised learning performed by bayesian classifiers. This method detects ambiguous or mixed areas of a learning set. The problem is seen from the multi-decision point of view (i.e. several classification modules). Each classification module is specialized on a particular region of the feature space. These regions are obtained by fuzzy clustering and constitute the original data set by union. A classifier is associated with each cluster. The training set for each classifier is then defined on the cluster and its associated ambiguous clusters. The overall system is parallel, since different classifiers work with their own training data clusters. The algorithm makes possible the adaptive classifier selection in the sense that the fuzzy clustering with ambiguity rejection gives adapted training data regions of the feature space. The decision making is the fusion of outputs from the most adapted classifiers.

Download Full-text

VERTIcal Grid lOgistic regression (VERTIGO)

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocv146 ◽

2015 ◽

Vol 23 (3) ◽

pp. 570-579 ◽

Cited By ~ 16

Author(s):

Yong Li ◽

Xiaoqian Jiang ◽

Shuang Wang ◽

Hongkai Xiong ◽

Lucila Ohno-Machado

Keyword(s):

Logistic Regression ◽

Data Analysis ◽

Characteristic Curve ◽

Computational Cost ◽

Data Sets ◽

Classification Problems ◽

Number Of Patients ◽

Vertical Grid ◽

Partitioned Data ◽

Vertically Partitioned Data

Objective To develop an accurate logistic regression (LR) algorithm to support federated data analysis of vertically partitioned distributed data sets. Material and Methods We propose a novel technique that solves the binary LR problem by dual optimization to obtain a global solution for vertically partitioned data. We evaluated this new method, VERTIcal Grid lOgistic regression (VERTIGO), in artificial and real-world medical classification problems in terms of the area under the receiver operating characteristic curve, calibration, and computational complexity. We assumed that the institutions could “align” patient records (through patient identifiers or hashed “privacy-protecting” identifiers), and also that they both had access to the values for the dependent variable in the LR model (eg, that if the model predicts death, both institutions would have the same information about death). Results The solution derived by VERTIGO has the same estimated parameters as the solution derived by applying classical LR. The same is true for discrimination and calibration over both simulated and real data sets. In addition, the computational cost of VERTIGO is not prohibitive in practice. Discussion There is a technical challenge in scaling up federated LR for vertically partitioned data. When the number of patients m is large, our algorithm has to invert a large Hessian matrix. This is an expensive operation of time complexity O(m3) that may require large amounts of memory for storage and exchange of information. The algorithm may also not work well when the number of observations in each class is highly imbalanced. Conclusion The proposed VERTIGO algorithm can generate accurate global models to support federated data analysis of vertically partitioned data.

Download Full-text

Siamese Reconstruction Network: Accurate Image Reconstruction from Human Brain Activity by Learning to Compare

Applied Sciences ◽

10.3390/app9224749 ◽

2019 ◽

Vol 9 (22) ◽

pp. 4749

Author(s):

Lingyun Jiang ◽

Kai Qiao ◽

Linyuan Wang ◽

Chi Zhang ◽

Jian Chen ◽

...

Keyword(s):

Deep Learning ◽

Human Brain ◽

Brain Activity ◽

Feature Space ◽

Training Data ◽

Reconstruction Method ◽

Learning Method ◽

Training Samples ◽

Visual Reconstruction ◽

Relationship Of

Decoding human brain activities, especially reconstructing human visual stimuli via functional magnetic resonance imaging (fMRI), has gained increasing attention in recent years. However, the high dimensionality and small quantity of fMRI data impose restrictions on satisfactory reconstruction, especially for the reconstruction method with deep learning requiring huge amounts of labelled samples. When compared with the deep learning method, humans can recognize a new image because our human visual system is naturally capable of extracting features from any object and comparing them. Inspired by this visual mechanism, we introduced the mechanism of comparison into deep learning method to realize better visual reconstruction by making full use of each sample and the relationship of the sample pair by learning to compare. In this way, we proposed a Siamese reconstruction network (SRN) method. By using the SRN, we improved upon the satisfying results on two fMRI recording datasets, providing 72.5% accuracy on the digit dataset and 44.6% accuracy on the character dataset. Essentially, this manner can increase the training data about from n samples to 2n sample pairs, which takes full advantage of the limited quantity of training samples. The SRN learns to converge sample pairs of the same class or disperse sample pairs of different class in feature space.

Download Full-text

bCNN-Methylpred: Feature-Based Prediction of RNA Sequence Modification Using Branch Convolutional Neural Network

Genes ◽

10.3390/genes12081155 ◽

2021 ◽

Vol 12 (8) ◽

pp. 1155

Author(s):

Naeem Islam ◽

Jaebyung Park

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Domain Knowledge ◽

Feature Space ◽

Experimental Methods ◽

Machine Learning Techniques ◽

Rna Modification ◽

Biological Processes ◽

Rna Sequence ◽

Functional Mechanisms

RNA modification is vital to various cellular and biological processes. Among the existing RNA modifications, N6-methyladenosine (m6A) is considered the most important modification owing to its involvement in many biological processes. The prediction of m6A sites is crucial because it can provide a better understanding of their functional mechanisms. In this regard, although experimental methods are useful, they are time consuming. Previously, researchers have attempted to predict m6A sites using computational methods to overcome the limitations of experimental methods. Some of these approaches are based on classical machine-learning techniques that rely on handcrafted features and require domain knowledge, whereas other methods are based on deep learning. However, both methods lack robustness and yield low accuracy. Hence, we develop a branch-based convolutional neural network and a novel RNA sequence representation. The proposed network automatically extracts features from each branch of the designated inputs. Subsequently, these features are concatenated in the feature space to predict the m6A sites. Finally, we conduct experiments using four different species. The proposed approach outperforms existing state-of-the-art methods, achieving accuracies of 94.91%, 94.28%, 88.46%, and 94.8% for the H. sapiens, M. musculus, S. cerevisiae, and A. thaliana datasets, respectively.

Download Full-text

MULFE: Multi-Label Learning via Label-Specific Feature Space Ensemble

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3451392 ◽

2021 ◽

Vol 16 (1) ◽

pp. 1-24

Author(s):

Yaojin Lin ◽

Qinghua Hu ◽

Jinghua Liu ◽

Xingquan Zhu ◽

Xindong Wu

Keyword(s):

Empirical Studies ◽

Feature Space ◽

Training Data ◽

Data Sets ◽

Learning Framework ◽

Feature Spaces ◽

Public Data ◽

Margin Distribution ◽

Label Correlations ◽

Label Correlation

In multi-label learning, label correlations commonly exist in the data. Such correlation not only provides useful information, but also imposes significant challenges for multi-label learning. Recently, label-specific feature embedding has been proposed to explore label-specific features from the training data, and uses feature highly customized to the multi-label set for learning. While such feature embedding methods have demonstrated good performance, the creation of the feature embedding space is only based on a single label, without considering label correlations in the data. In this article, we propose to combine multiple label-specific feature spaces, using label correlation, for multi-label learning. The proposed algorithm, mu lti- l abel-specific f eature space e nsemble (MULFE), takes consideration label-specific features, label correlation, and weighted ensemble principle to form a learning framework. By conducting clustering analysis on each label’s negative and positive instances, MULFE first creates features customized to each label. After that, MULFE utilizes the label correlation to optimize the margin distribution of the base classifiers which are induced by the related label-specific feature spaces. By combining multiple label-specific features, label correlation based weighting, and ensemble learning, MULFE achieves maximum margin multi-label classification goal through the underlying optimization framework. Empirical studies on 10 public data sets manifest the effectiveness of MULFE.

Download Full-text

Experiments of Image Classification Using Dissimilarity Spaces Built with Siamese Networks

Sensors ◽

10.3390/s21051573 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1573

Author(s):

Loris Nanni ◽

Giovanni Minchio ◽

Sheryl Brahnam ◽

Gianluca Maguolo ◽

Alessandra Lumini

Keyword(s):

Vector Space ◽

Image Classification ◽

Ad Hoc ◽

Feature Space ◽

Medical Data ◽

Training Data ◽

Data Sets ◽

Large Set ◽

Clustering Methods ◽

Siamese Networks

Traditionally, classifiers are trained to predict patterns within a feature space. The image classification system presented here trains classifiers to predict patterns within a vector space by combining the dissimilarity spaces generated by a large set of Siamese Neural Networks (SNNs). A set of centroids from the patterns in the training data sets is calculated with supervised k-means clustering. The centroids are used to generate the dissimilarity space via the Siamese networks. The vector space descriptors are extracted by projecting patterns onto the similarity spaces, and SVMs classify an image by its dissimilarity vector. The versatility of the proposed approach in image classification is demonstrated by evaluating the system on different types of images across two domains: two medical data sets and two animal audio data sets with vocalizations represented as images (spectrograms). Results show that the proposed system’s performance competes competitively against the best-performing methods in the literature, obtaining state-of-the-art performance on one of the medical data sets, and does so without ad-hoc optimization of the clustering methods on the tested data sets.

Download Full-text

Fast and Secure Back-Propagation Learning Using Vertically Partitioned Data with IoT

2019 Seventh International Symposium on Computing and Networking Workshops (CANDARW) ◽

10.1109/candarw.2019.00085 ◽

2019 ◽

Author(s):

Hirofumi Miyajima ◽

Hiromi Miyajima ◽

Norio Shiratori

Keyword(s):

Back Propagation ◽

Partitioned Data ◽

Vertically Partitioned Data

Download Full-text

A review: preprocessing techniques and data augmentation for sentiment analysis

Computational Social Networks ◽

10.1186/s40649-020-00080-x ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Huu-Thanh Duong ◽

Tram-Anh Nguyen-Thi

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Supervised Learning ◽

Data Augmentation ◽

Original Data ◽

Training Data ◽

Unseen Data ◽

Augmentation Techniques ◽

User Intervention

AbstractIn literature, the machine learning-based studies of sentiment analysis are usually supervised learning which must have pre-labeled datasets to be large enough in certain domains. Obviously, this task is tedious, expensive and time-consuming to build, and hard to handle unseen data. This paper has approached semi-supervised learning for Vietnamese sentiment analysis which has limited datasets. We have summarized many preprocessing techniques which were performed to clean and normalize data, negation handling, intensification handling to improve the performances. Moreover, data augmentation techniques, which generate new data from the original data to enrich training data without user intervention, have also been presented. In experiments, we have performed various aspects and obtained competitive results which may motivate the next propositions.

Download Full-text

Evaluating Grayware Characteristics and Risks

Journal of Computer Networks and Communications ◽

10.1155/2011/569829 ◽

2011 ◽

Vol 2011 ◽

pp. 1-28 ◽

Cited By ~ 1

Author(s):

Zhongqiang Chen ◽

Zhanyan Liang ◽

Yuan Zhang ◽

Zhongrong Chen

Keyword(s):

Information Gain ◽

Feature Space ◽

Training Data ◽

Support Vector ◽

Learning Models ◽

Generalization Capability ◽

Self Organizing Maps ◽

Defense Strategies ◽

Security Applications ◽

Vector Machines

Grayware encyclopedias collect known species to provide information for incident analysis, however, the lack of categorization and generalization capability renders them ineffective in the development of defense strategies against clustered strains. A grayware categorization framework is therefore proposed here to not only classify grayware according to diverse taxonomic features but also facilitate evaluations on grayware risk to cyberspace. Armed with Support Vector Machines, the framework builds learning models based on training data extracted automatically from grayware encyclopedias and visualizes categorization results with Self-Organizing Maps. The features used in learning models are selected with information gain and the high dimensionality of feature space is reduced by word stemming and stopword removal process. The grayware categorizations on diversified features reveal that grayware typically attempts to improve its penetration rate by resorting to multiple installation mechanisms and reduced code footprints. The framework also shows that grayware evades detection by attacking victims' security applications and resists being removed by enhancing its clotting capability with infected hosts. Our analysis further points out that species in categoriesSpywareandAdwarecontinue to dominate the grayware landscape and impose extremely critical threats to the Internet ecosystem.

Download Full-text