optimalFlow: optimal transport approach to flow cytometry gating and population matching

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Eustasio del Barrio ◽  
Hristo Inouzhe ◽  
Jean-Michel Loubes ◽  
Carlos Matrán ◽  
Agustín Mayo-Íscar

Abstract Background: Data obtained from flow cytometry present pronounced variability due to biological and technical reasons. Biological variability is a well-known phenomenon produced by measurements on different individuals, with different characteristics such as illness, age, sex, etc. The use of different settings for measurement, the variation of the conditions during experiments and the different types of flow cytometers are some of the technical causes of variability. This mixture of sources of variability makes the use of supervised machine learning for identification of cell populations difficult. The present work is conceived as a combination of strategies to facilitate the task of supervised gating. Results: We propose optimalFlowTemplates, based on a similarity distance and Wasserstein barycenters, which clusters cytometries and produces prototype cytometries for the different groups. We show that supervised learning, restricted to the new groups, performs better than the same techniques applied to the whole collection. We also present optimalFlowClassification, which uses a database of gated cytometries and optimalFlowTemplates to assign cell types to a new cytometry. We show that this procedure can outperform state-of-the-art techniques in the proposed datasets. Our code is freely available as optimalFlow, a Bioconductor R package at https://bioconductor.org/packages/optimalFlow. Conclusions: optimalFlowTemplates + optimalFlowClassification addresses the problem of using supervised learning while accounting for biological and technical variability. Our methodology provides a robust automated gating workflow that handles the intrinsic variability of flow cytometry data well. Our main innovation is the methodology itself and the optimal transport techniques that we apply to flow cytometry analysis.
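As a minimal illustration of the optimal transport notions named in this abstract, the sketch below computes a 2-Wasserstein distance and a Wasserstein barycenter for equally sized one-dimensional samples, where the optimal coupling simply matches sorted values. It is not the optimalFlow implementation, which is an R package working with multivariate cell populations; the function names `wasserstein2_1d` and `barycenter_1d` and the toy marker intensities are assumptions made for illustration.

```python
import math


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equally sized 1D samples.

    In one dimension the optimal transport plan pairs the sorted values,
    so no linear program is needed.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("this sketch assumes non-empty samples of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(xs))


def barycenter_1d(samples):
    """Equal-weight 2-Wasserstein barycenter of equally sized 1D samples.

    In one dimension the barycenter is obtained by averaging the order
    statistics (that is, the empirical quantiles) across samples.
    """
    size = len(samples[0])
    if size == 0 or any(len(s) != size for s in samples):
        raise ValueError("this sketch assumes non-empty samples of equal size")
    ordered = [sorted(s) for s in samples]
    return [sum(column) / len(ordered) for column in zip(*ordered)]


# Hypothetical intensities of one marker in two cytometries.
sample_a = [3.0, 0.0, 2.0, 1.0]
sample_b = [1.0, 2.0, 3.0, 4.0]

distance = wasserstein2_1d(sample_a, sample_b)
prototype = barycenter_1d([sample_a, sample_b])
print(distance, prototype)
```

A pairwise distance of this kind could feed a clustering of cytometries, and the barycenter plays the role of a prototype for a group; the multivariate case needs a general optimal transport solver rather than sorting.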

2020 ◽  
Author(s):  
Etienne Becht ◽  
Daniel Tolstrup ◽  
Charles-Antoine Dutertre ◽  
Florent Ginhoux ◽  
Evan W. Newell ◽  
...  

Abstract Modern immunologic research increasingly requires high-dimensional analyses in order to understand the complex milieu of cell types that comprise the tissue microenvironments of disease. To achieve this, we developed Infinity Flow, which combines hundreds of overlapping flow cytometry panels using machine learning to enable the simultaneous analysis of the co-expression patterns of hundreds of surface-expressed proteins across millions of individual cells. In this study, we demonstrate that this approach allows the comprehensive analysis of the cellular constituency of the steady-state murine lung and the identification of novel cellular heterogeneity in the lungs of melanoma metastasis-bearing mice. We show that, by using supervised machine learning, Infinity Flow enhances the accuracy and depth of clustering or dimensionality reduction algorithms. Infinity Flow is a highly scalable, low-cost and accessible solution to single-cell proteomics in complex tissues.


2018 ◽  
Author(s):  
Daniel Commenges ◽  
Chariff Alkhassim ◽  
Raphael Gottardo ◽  
Boris Hejblum ◽  
Rodolphe Thiébaut

Abstract Motivation: Flow cytometry is a powerful technology that allows the high-throughput quantification of dozens of surface and intracellular proteins at the single-cell level. It has become the most widely used technology for immunophenotyping of cells over the past three decades. With the increasing complexity of cytometry experiments (more cells and more markers), traditional manual flow cytometry data analysis has become untenable because of its subjectivity and time-consuming nature. Results: We present a new unsupervised algorithm called “cytometree” to perform automated population discovery (aka gating) in flow cytometry. cytometree is based on the construction of a binary tree, the nodes of which are subpopulations of cells. At each node, the marker distributions are modeled by mixtures of normal distributions. Node splitting is done according to a normalized difference of Akaike information criteria (AIC) between the two models. Post-processing of the tree structure and derived populations allows us to complete the annotation of the derived populations. The algorithm is shown to perform better than the state-of-the-art unsupervised algorithms previously proposed on panels introduced by the Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP I) project. The algorithm is also applied to a T-cell panel proposed by the Human Immunology Project Consortium (HIPC) program; it also outperforms the best available unsupervised open-source algorithm while requiring the shortest computation time. Availability: An R package named “cytometree” is available on CRAN. Supplementary information: Supplementary data are available.
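The AIC-based splitting criterion described in this abstract can be sketched roughly as follows for a single marker: fit one normal and a two-component normal mixture, then compare their AIC values. This is not the cytometree code (which is an R package). The EM fit, its initialisation, the division of the AIC difference by the sample size, and all function names are assumptions made for illustration, since the abstract does not state the exact normalization.

```python
import math


def normal_logpdf(x, mean, var):
    """Log-density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a short list of values."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def aic_one_normal(data, var_floor=1e-6):
    """AIC of a single normal fitted by maximum likelihood (2 parameters)."""
    n = len(data)
    mean = sum(data) / n
    var = max(sum((x - mean) ** 2 for x in data) / n, var_floor)
    loglik = sum(normal_logpdf(x, mean, var) for x in data)
    return 2 * 2 - 2.0 * loglik


def aic_two_normals(data, iterations=100, var_floor=1e-6):
    """AIC of a two-component normal mixture fitted by EM (5 parameters)."""
    n = len(data)
    ordered = sorted(data)
    halves = [ordered[: n // 2], ordered[n // 2:]]
    # Initialise each component from one half of the sorted data.
    means = [sum(h) / len(h) for h in halves]
    variances = [max(sum((x - m) ** 2 for x in h) / len(h), var_floor)
                 for h, m in zip(halves, means)]
    weights = [0.5, 0.5]

    def component_logs(x):
        return [math.log(weights[k]) + normal_logpdf(x, means[k], variances[k])
                for k in range(2)]

    for _ in range(iterations):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            logs = component_logs(x)
            norm = log_sum_exp(logs)
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk,
                var_floor)

    loglik = sum(log_sum_exp(component_logs(x)) for x in data)
    return 2 * 5 - 2.0 * loglik


def normalized_aic_difference(data):
    """AIC gain of the mixture over one normal, per observation (assumed scaling)."""
    return (aic_one_normal(data) - aic_two_normals(data)) / len(data)


# Hypothetical marker values with two well-separated modes.
offsets = [0.0, 0.1, -0.1, 0.2, -0.2, 0.05, -0.05, 0.15]
bimodal = offsets + [5.0 + o for o in offsets]
score = normalized_aic_difference(bimodal)
print(score)
```

A positive score means the two-component model is preferred even after the AIC penalty for its extra parameters, which is the situation in which a node would be a candidate for splitting on that marker.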


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e14138-e14138
Author(s):  
Beung-Chul AHN ◽  
Kyoung Ho Pyo ◽  
Dongmin Jung ◽  
Chun-Feng Xin ◽  
Chang Gon Kim ◽  
...  

e14138 Background: Immune checkpoint inhibitors have become a breakthrough therapy for various types of cancer. However, given an overall response rate of around 20% in clinical trials, accurate prediction of aPD-1 response for individual patients has not been established. PD-L1 expression or the presence of tumor-infiltrating lymphocytes may be used as indicators of response, but both are limited. We developed models using machine learning methods to predict the aPD-1 response. Methods: A total of 126 advanced NSCLC patients treated with aPD-1 were enrolled. Their clinical characteristics, treatment outcomes, and adverse events were collected. The full clinical data set (n = 126), consisting of 15 variables, was divided into two subsets: a discovery set (n = 63) and a test set (n = 63). Thirteen supervised learning algorithms, including support vector machine and regularized regression (lasso, ridge, elastic net), were applied to the discovery set for model development and to the test set for validation. Each model was evaluated according to the ROC curve and a cross-validation method. The same methods were applied to the subset that had additional flow cytometry data (n = 40). Results: The median age was 64 and 69.8% were male. Adenocarcinoma was predominant (69.8%) and twenty patients (15.1%) were driver mutation positive. The clinical data set (n = 126) demonstrated that ridge regression (AUC: 0.79) was the best model for prediction. Of the 15 clinical variables, tumor burden, age, ECOG PS and PD-L1 were the most important based on the random forest algorithm. When we merged the clinical and flow cytometry data, the ridge regression model (AUC: 0.82) showed better performance than when using clinical data only. Among the 52 variables of the merged set, the most important immune markers were as follows: CD3+CD8+CD25+/Teff-CD28, CD3+CD8+CD25-/Teff-Ki-67, and CD3+CD8+CD25+/Teff-NY-ESO/Teff-PD-1, which indicate an activated tumor-specific T cell subset.
Conclusions: Our machine-learning-based model is useful for predicting aPD-1 responses. After further validation in an independent patient cohort, the supervised-learning-based non-invasive predictive score can be established to predict aPD-1 response.
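For readers unfamiliar with the AUC values quoted in this abstract, the following sketch shows one standard way to compute the area under the ROC curve: as the fraction of (responder, non-responder) pairs in which the responder receives the higher predicted score, with ties counted as one half. The function name `roc_auc`, the 0/1 label coding, and the toy scores are assumptions made for illustration and are unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a positive case (e.g. responder), 0 for a negative case.
    scores: predicted scores, higher meaning more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive correctly ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions for four patients.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported values of 0.79 and 0.82.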


2021 ◽  
Vol 460 ◽  
pp. 109743
Author(s):  
Oluwafemi D. Olusoji ◽  
Jurg W. Spaak ◽  
Mark Holmes ◽  
Thomas Neyens ◽  
Marc Aerts ◽  
...  

Blood ◽  
2013 ◽  
Vol 122 (21) ◽  
pp. 2864-2864
Author(s):  
Jens Rueter ◽  
Vivek Philip ◽  
Krishna Karuturi ◽  
Zaher Oueida ◽  
Margaret Chavaree ◽  
...  

Abstract Introduction: Recent developments of novel immunotherapeutic drugs have shown promising results for patients with hematologic malignancies; however, an unmet need for accurate and specific biomarkers persists. To address this need, we developed a novel integrative analysis procedure for the automated analysis of multidimensional flow cytometry data obtained from the peripheral blood of patients with chronic lymphocytic leukemia (CLL). State-of-the-art flow cytometry analysis is accomplished by manual sequential segmentation, or gating, of cell populations based on similarities in fluorescence and light scatter characteristics through visualization of the data in one- or two-dimensional plots. This approach has a number of limitations, including the subjective nature of the gating and the inability to fully utilize the high-dimensional data. Recent efforts have produced sophisticated computational methods that overcome many of these limitations; however, these newer computational methods have not been rigorously tested in a clinical context and have focused on the rigorous and automated analysis of samples from individual patients, with substantially less effort towards the analysis of patient populations. The ultimate goal of our analysis is to develop computational approaches that will enable the identification of subsets of patients with distinct immunological markers. Methods: We developed a novel analysis framework that facilitates automated identification of both common cell types and patient population subgroups, based on post-processing of individual sample analysis with the FLOCK program. FLOCK identifies clusters of putatively similar cells in an individual sample by multidimensional clustering of the fluorescence marker and light-scattering measurements. We developed a rigorous hierarchical clustering approach to identify common “cell signatures” across multiple patients. The cell signatures were then mapped back onto the individual patient samples and used in a second clustering that identified patient subgroups based on similar abundances of specific cell types. Results: We used our analytic framework to analyze multidimensional flow cytometry data (26 cell surface markers in 4 different antibody cocktails) from peripheral blood specimens of a heterogeneous group of 55 CLL patients and 13 healthy controls. Our analysis revealed distinct differences between controls and CLL patients. Analyzing the non-malignant peripheral blood cell types, we were furthermore able to differentiate between distinct clinical subpopulations of patients (e.g. distinguish treatment-naïve patients from those that had previously undergone chemotherapy). Conclusion/Discussion: Using a novel integrative analysis procedure to analyze complex flow cytometry data of the peripheral blood from CLL patients, we are able to identify distinct cell type distributions. We propose that this information is a marker for the overall health/disease status of the corresponding patient, and could ultimately be used for diagnosis, prognosis, and selection of optimal treatment. In the context of multiple novel treatment options for CLL patients, such a tool will be crucial for defining individual patient prognosis, and defining an accurately matched treatment plan. Disclosures: No relevant conflicts of interest to declare.


2015 ◽  
Vol 23 (2) ◽  
Author(s):  
H. Thomas Banks ◽  
Dustin F. Kapraun ◽  
Kathryn G. Link ◽  
W. Clayton Thompson ◽  
Cristina Peligero ◽  
...  

Abstract In this article we assess variability in cell proliferation dynamics observed for CD4+ and CD8+ T cells collected from two healthy donors. We review a recently developed class of models that incorporates the so-called “cyton model” for cell numbers into a conservation-based PDE model for cell population dynamics and describe a statistical model that relates CFSE-based flow cytometry data to such models. A parameter estimation scheme is summarized and then applied to a large body of data to assess experimental variability (variation in parameter estimates as identical experiments are replicated) and biological variability (differences in parameter estimates obtained for different donors and cell types) in the context of these models. Variability in the data obtained from replicated experiments is also discussed. The results of this study indicate that many of the cyton model parameters for describing cell proliferation can be reliably estimated using our approach; however, they also show that substantial changes to our mathematical model and/or experimental procedures may be required to ensure identifiability of the remaining cell proliferation parameters.


2019 ◽  
Author(s):  
Alice Yue ◽  
Cedric Chauve ◽  
Maxwell Libbrecht ◽  
Ryan R. Brinkman

Abstract We introduce a new cell population score called SpecEnr (specific enrichment) and describe a method that discovers robust and accurate candidate biomarkers from flow cytometry data. Our approach identifies a new class of candidate biomarkers we define as driver cell populations, whose abundance is associated with a sample class (e.g. disease), but not as a result of a change in a related population. We show that the driver cell populations we find are also easily interpretable using a lattice-based visualization tool. Our method is implemented in the R package flowGraph, freely available on GitHub (github.com/aya49/flowGraph), and will be available on Bioconductor.


Author(s):  
Anjali Sifar ◽  
Nisheeth Srivastava

Supervised learning operates on the premise that labels unambiguously represent ground truth. This premise is reasonable in domains wherein a high degree of consensus is easily possible for any given data record, e.g. in agreeing on whether an image contains an elephant or not. However, there are several domains wherein people disagree with each other on the appropriate label to assign to a record, e.g. whether a tweet is toxic. We argue that data labeling must be understood as a process with some degree of domain-dependent noise and that any claims of predictive prowess must be sensitive to the degree of this noise. We present a method for quantifying labeling noise in a particular domain wherein people are seen to disagree with their own past selves on the appropriate label to assign to a record: choices under prospect uncertainty. Our results indicate that 'state-of-the-art' choice models of decisions from description, by failing to consider the intrinsic variability of human choice behavior, find themselves in the odd position of predicting humans' choices better than the same humans' own previous choices for the same problem. We conclude with observations on how the predicament we empirically demonstrate in our work could be handled in the practice of supervised learning.


2018 ◽  
Vol 6 (7) ◽  
pp. e01164 ◽  
Author(s):  
Tyler William Smith ◽  
Paul Kron ◽  
Sara L. Martin

Computers ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 95
Author(s):  
Md. Kamrul Hossain ◽  
Md. Mokammel Haque ◽  
M. Ali Akber Dewan

This paper presents a comparative analysis of four semi-supervised machine learning (SSML) algorithms for detecting malicious nodes in an optical burst switching (OBS) network. The SSML approaches include a modified version of K-means clustering, a Gaussian mixture model (GMM), a classical self-training (ST) model, and a modified self-training (MST) model. All four approaches work in a semi-supervised fashion, while the MST uses an ensemble of classifiers for the final decision making. SSML approaches are particularly useful when only a limited amount of labeled data is available for training and validation of the classification model. Manual labeling of a large dataset is complex and time consuming, and it is even harder for OBS network data. SSML can leverage the unlabeled data to make better predictions than would be possible using a smaller set of labeled data alone. We evaluated the performance of the four SSML approaches for two-class (Behaving, Not-behaving), three-class (Behaving, Not-behaving, and Potentially Not-behaving), and four-class (No-Block, Block, NB-wait and NB-No-Block) classification using precision, recall, and F1 score. In the case of the two-class classification, the K-means and GMM-based approaches performed better than the others. In the case of the three-class classification, the K-means and the classical ST approaches performed better than the others. In the case of the four-class classification, the MST showed the best performance. Finally, the SSML approaches were compared with two supervised learning (SL) based approaches. The comparison showed that the SSML-based approaches outperform the SL-based approaches when only a smaller labeled data set is available to train the classification models.
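The three evaluation metrics named in this abstract can be computed per class as in the sketch below. The function name `precision_recall_f1` and the toy predictions are assumptions made for illustration; only the class names come from the abstract, and the numbers are not results from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 score for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Return 0.0 when a ratio is undefined (no predicted or no true positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical two-class labels for five network nodes.
y_true = ["Behaving", "Not-behaving", "Not-behaving", "Behaving", "Not-behaving"]
y_pred = ["Behaving", "Not-behaving", "Behaving", "Behaving", "Behaving"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, "Not-behaving")
print(precision, recall, f1)
```

For the three-class and four-class settings, the same function can be called once per class and the results averaged, which is one common (but here assumed) way of summarising multi-class performance.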

