VERTIcal Grid lOgistic regression (VERTIGO)

Objective To develop an accurate logistic regression (LR) algorithm to support federated data analysis of vertically partitioned distributed data sets. Material and Methods We propose a novel technique that solves the binary LR problem by dual optimization to obtain a global solution for vertically partitioned data. We evaluated this new method, VERTIcal Grid lOgistic regression (VERTIGO), in artificial and real-world medical classification problems in terms of the area under the receiver operating characteristic curve, calibration, and computational complexity. We assumed that the institutions could “align” patient records (through patient identifiers or hashed “privacy-protecting” identifiers), and also that they both had access to the values for the dependent variable in the LR model (eg, that if the model predicts death, both institutions would have the same information about death). Results The solution derived by VERTIGO has the same estimated parameters as the solution derived by applying classical LR. The same is true for discrimination and calibration over both simulated and real data sets. In addition, the computational cost of VERTIGO is not prohibitive in practice. Discussion There is a technical challenge in scaling up federated LR for vertically partitioned data. When the number of patients m is large, our algorithm has to invert a large Hessian matrix. This is an expensive operation of time complexity O(m3) that may require large amounts of memory for storage and exchange of information. The algorithm may also not work well when the number of observations in each class is highly imbalanced. Conclusion The proposed VERTIGO algorithm can generate accurate global models to support federated data analysis of vertically partitioned data.

Download Full-text

Generalized random rotation perturbation for vertically partitioned data sets

2009 IEEE Symposium on Computational Intelligence and Data Mining ◽

10.1109/cidm.2009.4938644 ◽

2009 ◽

Cited By ~ 3

Author(s):

Zhenmin Lin ◽

Jie Wang ◽

Lian Liu ◽

Jun Zhang

Keyword(s):

Data Sets ◽

Random Rotation ◽

Partitioned Data ◽

Vertically Partitioned Data

Download Full-text

A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems

mBio ◽

10.1128/mbio.00434-20 ◽

2020 ◽

Vol 11 (3) ◽

Cited By ~ 9

Author(s):

Begüm D. Topçuoğlu ◽

Nicholas A. Lesniak ◽

Mack T. Ruffin ◽

Jenna Wiens ◽

Patrick D. Schloss

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Sequence Data ◽

Characteristic Curve ◽

Predictive Performance ◽

Model Complexity ◽

Support Vector ◽

Classification Problems ◽

Microbial Biomarkers

ABSTRACT Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods call the validity of these models into question. Furthermore, there appears to be a preference by many researchers to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs) (n = 490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, a decision tree, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an area under the receiver operating characteristic curve (AUROC) of 0.695 (interquartile range [IQR], 0.651 to 0.739) but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 (IQR, 0.625 to 0.735), trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability. IMPORTANCE Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely overoptimistic. Moreover, there is a trend toward using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step toward developing more-reproducible ML practices in applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.

Download Full-text

Accuracy, Robustness and Scalability of Dimensionality Reduction Methods for Single Cell RNAseq Analysis

10.1101/641142 ◽

2019 ◽

Cited By ~ 2

Author(s):

Shiquan Sun ◽

Jiaqiang Zhu ◽

Ying Ma ◽

Xiang Zhou

Keyword(s):

Data Analysis ◽

Dimensionality Reduction ◽

Single Cell ◽

Comprehensive Evaluation ◽

Computational Cost ◽

Noise Removal ◽

Data Sets ◽

Vast Number ◽

Cell Clustering ◽

Reduction Methods

ABSTRACTBackgroundDimensionality reduction (DR) is an indispensable analytic component for many areas of single cell RNA sequencing (scRNAseq) data analysis. Proper DR can allow for effective noise removal and facilitate many downstream analyses that include cell clustering and lineage reconstruction. Unfortunately, despite the critical importance of DR in scRNAseq analysis and the vast number of DR methods developed for scRNAseq studies, however, few comprehensive comparison studies have been performed to evaluate the effectiveness of different DR methods in scRNAseq.ResultsHere, we aim to fill this critical knowledge gap by providing a comparative evaluation of a variety of commonly used DR methods for scRNAseq studies. Specifically, we compared 18 different DR methods on 30 publicly available scRNAseq data sets that cover a range of sequencing techniques and sample sizes. We evaluated the performance of different DR methods for neighborhood preserving in terms of their ability to recover features of the original expression matrix, and for cell clustering and lineage reconstruction in terms of their accuracy and robustness. We also evaluated the computational scalability of different DR methods by recording their computational cost.ConclusionsBased on the comprehensive evaluation results, we provide important guidelines for choosing DR methods for scRNAseq data analysis. We also provide all analysis scripts used in the present study atwww.xzlab.org/reproduce.html. Together, we hope that our results will serve as an important practical reference for practitioners to choose DR methods in the field of scRNAseq analysis.

Download Full-text

Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study

JMIR Medical Informatics ◽

10.2196/26598 ◽

2021 ◽

Vol 9 (6) ◽

pp. e26598

Author(s):

Dongchul Cha ◽

MinDong Sung ◽

Yu-Rang Park

Keyword(s):

Domain Knowledge ◽

Characteristic Curve ◽

Feature Space ◽

Original Data ◽

Training Data ◽

Data Governance ◽

Domain Specific Knowledge ◽

Partitioned Data ◽

Vertically Partitioned Data ◽

Latent Representations

Background Machine learning (ML) is now widely deployed in our everyday lives. Building robust ML models requires a massive amount of data for training. Traditional ML algorithms require training data centralization, which raises privacy and data governance issues. Federated learning (FL) is an approach to overcome this issue. We focused on applying FL on vertically partitioned data, in which an individual’s record is scattered among different sites. Objective The aim of this study was to perform FL on vertically partitioned data to achieve performance comparable to that of centralized models without exposing the raw data. Methods We used three different datasets (Adult income, Schwannoma, and eICU datasets) and vertically divided each dataset into different pieces. Following the vertical division of data, overcomplete autoencoder-based model training was performed for each site. Following training, each site’s data were transformed into latent data, which were aggregated for training. A tabular neural network model with categorical embedding was used for training. A centrally based model was used as a baseline model, which was compared to that of FL in terms of accuracy and area under the receiver operating characteristic curve (AUROC). Results The autoencoder-based network successfully transformed the original data into latent representations with no domain knowledge applied. These altered data were different from the original data in terms of the feature space and data distributions, indicating appropriate data security. The loss of performance was minimal when using an overcomplete autoencoder; accuracy loss was 1.2%, 8.89%, and 1.23%, and AUROC loss was 1.1%, 0%, and 1.12% in the Adult income, Schwannoma, and eICU dataset, respectively. Conclusions We proposed an autoencoder-based ML model for vertically incomplete data. Since our model is based on unsupervised learning, no domain-specific knowledge is required in individual sites. Under the circumstances where direct data sharing is not available, our approach may be a practical solution enabling both data protection and building a robust model.

Download Full-text

Assessing Global Covid-19 Cases Data through Compositional Data Analysis(CoDa)

10.1101/2020.12.17.20248424 ◽

2020 ◽

Author(s):

Luis P.V. Braga ◽

Dina Feigenbaum

Keyword(s):

Data Analysis ◽

Compositional Data ◽

Compositional Data Analysis ◽

Discrete Groups ◽

Data Sets ◽

Cumulative Number ◽

Governmental Agencies ◽

Global Pandemic ◽

Number Of Patients ◽

Log Ratio

AbstractBackgroundCovid-19 cases data pose an enormous challenge to any analysis. The evaluation of such a global pandemic requires matching reports that follow different procedures and even overcoming some countries’ censorship that restricts publications.MethodsThis work proposes a methodology that could assist future studies. Compositional Data Analysis (CoDa) is proposed as the proper approach as Covid-19 cases data is compositional in nature. Under this methodology, for each country three attributes were selected: cumulative number of deaths (D); cumulative number of recovered patients(R); present number of patients (A).ResultsAfter the operation called closure, with c=1, a ternary diagram and Log-Ratio plots, as well as, compositional statistics are presented. Cluster analysis is then applied, splitting the countries into discrete groups.ConclusionsThis methodology can also be applied to other data sets such as countries, cities, provinces or districts in order to help authorities and governmental agencies to improve their actions to fight against a pandemic.

Download Full-text

Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study (Preprint)

10.2196/preprints.26598 ◽

2020 ◽

Author(s):

Dongchul Cha ◽

MinDong Sung ◽

Yu-Rang Park

Keyword(s):

Domain Knowledge ◽

Characteristic Curve ◽

Feature Space ◽

Original Data ◽

Training Data ◽

Data Governance ◽

Domain Specific Knowledge ◽

Partitioned Data ◽

Vertically Partitioned Data ◽

Latent Representations

BACKGROUND Machine learning (ML) is now widely deployed in our everyday lives. Building robust ML models requires a massive amount of data for training. Traditional ML algorithms require training data centralization, which raises privacy and data governance issues. Federated learning (FL) is an approach to overcome this issue. We focused on applying FL on vertically partitioned data, in which an individual’s record is scattered among different sites. OBJECTIVE The aim of this study was to perform FL on vertically partitioned data to achieve performance comparable to that of centralized models without exposing the raw data. METHODS We used three different datasets (Adult income, Schwannoma, and eICU datasets) and vertically divided each dataset into different pieces. Following the vertical division of data, overcomplete autoencoder-based model training was performed for each site. Following training, each site’s data were transformed into latent data, which were aggregated for training. A tabular neural network model with categorical embedding was used for training. A centrally based model was used as a baseline model, which was compared to that of FL in terms of accuracy and area under the receiver operating characteristic curve (AUROC). RESULTS The autoencoder-based network successfully transformed the original data into latent representations with no domain knowledge applied. These altered data were different from the original data in terms of the feature space and data distributions, indicating appropriate data security. The loss of performance was minimal when using an overcomplete autoencoder; accuracy loss was 1.2%, 8.89%, and 1.23%, and AUROC loss was 1.1%, 0%, and 1.12% in the Adult income, Schwannoma, and eICU dataset, respectively. CONCLUSIONS We proposed an autoencoder-based ML model for vertically incomplete data. Since our model is based on unsupervised learning, no domain-specific knowledge is required in individual sites. Under the circumstances where direct data sharing is not available, our approach may be a practical solution enabling both data protection and building a robust model.

Download Full-text

Fuzzy C-Means com Método Wrapper Com Baixo Custo Computacional de Seleção de Atributos

10.21528/cbic2021-87 ◽

2021 ◽

Author(s):

Gabriel Marcondes Santos ◽

Emmanuel Tavares Ferreira Affonso ◽

Alisson Marques Silva ◽

Gray Farias Moita

Keyword(s):

Computational Cost ◽

Attribute Selection ◽

Classification Model ◽

Classification Algorithms ◽

Data Sets ◽

Features Selection ◽

Classification Problems ◽

Original Algorithm ◽

Fuzzy C Means ◽

Irrelevant Attributes

Nowadays the Computational Intelligence (IC) algorithms have shown a lot of efficiency in pattern classification and recognition processes. However, some databases may contain irrelevant attributes that may be detrimental to the learning of the classification model. In order to detect and exclude input attributes with little representativeness in the data sets presented to the classification algorithms, the Features Selection (FS) methods are commonly used. The goal of features selection methods is to minimize the number of input attributes processed by a classifier in order to improve its assertiveness. In this way, this work aims to analyze solutions to classification problems with three different classification algorithms. The first approach used for classification is the unsupervised Fuzzy C-Means (FCM) algorithm, the second approach is a supervised version of FCM and the third approach is a variation of supervised FCM with features selection. The method of features selection incorporated in FCM is called the Mean Ratio Feature Selection (MRFS), and was developed with the objective of being a method with low computational cost, without need for complex mathematical equations and can be easily incorporated into any classifier. For the experiments, the three versions of the unsupervised FCM, supervised FCM and FCM with attribute selection were performed with the aim of verifying whether there would be a significant improvement between the variations of the FCM. The results of the experiments showed that FCM with MRFS is promising, with results superior to the original algorithm and also to its supervised version.

Download Full-text