Diagnostic Evidence GAuge of Single cells (DEGAS): A transfer learning framework to infer impressions of cellular and patient phenotypes between patients and single cells

2020 ◽  
Author(s):  
Travis S. Johnson ◽  
Christina Y. Yu ◽  
Zhi Huang ◽  
Siwen Xu ◽  
Tongxin Wang ◽  
...  

Abstract: With the rapid advance of single cell sequencing techniques, single cell molecular data are quickly accumulating. However, a sound approach to properly integrate single cell data with the large amount of existing patient-level disease data is lacking. To address this need, we proposed DEGAS (Diagnostic Evidence GAuge of Single cells), a novel deep transfer-learning framework which allows cellular and clinical information, including cell types, disease risk, and patient subtypes, to be cross-mapped between single cell and patient data, provided they share at least one common type of molecular data. We call such transferable information “impressions”, which are generated by the deep learning models learned in the DEGAS framework. Using eight datasets from a wide range of diseases including Glioblastoma Multiforme (GBM), Alzheimer’s Disease (AD), and Multiple Myeloma (MM), we demonstrate the feasibility and broad applications of DEGAS in cross-mapping clinical and cellular information across disparate single cell and patient level transcriptomic datasets. Specifically, we correctly mapped clinically known GBM patient subtypes onto single cell data. We also identified previously known neuron loss from AD brains, then mapped the “impression” of AD risk to single cell data. Furthermore, we discovered novel differences in excitatory and inhibitory neuron loss in AD data. From the exploratory MM data, we identified differences in the malignancy of different CD138+ cellular subtypes based on “impressions” of relapse information transferred from MM patients. Through this work, we demonstrated that DEGAS is a powerful framework to cross-infer cellular and patient-level characteristics, which not only unites single cell and patient level transcriptomic data by identifying their latent links using the deep learning approach, but can also prioritize both patient subtypes and cellular subtypes for precision medicine.

Blood ◽  
2019 ◽  
Vol 134 (Supplement_1) ◽  
pp. 3075-3075
Author(s):  
Travis S Johnson ◽  
Christina Y Yu ◽  
Chuanpeng Dong ◽  
Tongxin Wang ◽  
Mohammad Issam Abu Zaid ◽  
...  

Background: Clonal heterogeneity is a known issue in multiple myeloma (MM) and the emergence of drug resistant clones is responsible for the incurability of the disease. Multiple studies of bulk CD138+ bone marrow samples have attempted to stratify MM patients into smaller, more distinct, patient risk groups based on molecular phenotypes. Recently, single cell RNA sequencing (scRNA-seq) technology has been applied in MM to identify cell clones. This leads to a new question: can we classify patients with scRNA-seq data guided by previously defined subtypes, and how do the single cell results correspond with the classification? Methods: We developed a novel, deep transfer learning framework to predict MM patient subtypes in patients with scRNA-seq based on patient classifications from microarray data. While the problem of scRNA-seq batch corrections has been intensively studied using transfer learning, there has been less work on similar comparisons between scRNA-seq and patient-level data. To address this issue, we utilized domain adaptation, a specific transfer learning approach, to combine scRNA-seq profiles and patient-level microarray data using a multitask learning framework. Figure 1 illustrates our computational framework. Its aim is to classify both cells and patients (with scRNA-seq data) according to patient level classifications derived from previous gene expression profiling studies for MM. Specifically, we adopted the 10-subtype classifications derived from microarray data (1). Patients with scRNA-seq were summarized into a single vector by averaging gene counts across all the cells. Gene expression profiling data (including scRNA-seq and microarray) for MM patients from multiple studies were input into the transfer learning network consisting of 5 hidden layers. The last hidden layer was used to calculate the maximum mean discrepancy (MMD) between the patients from scRNA-seq and microarray to integrate the datasets. 
The datasets in this study are summarized in Table 1. Two microarray datasets (GSE19784, GSE2658) and one scRNA-seq dataset (GSE117156) were obtained from NCBI Gene Expression Omnibus. IUSM data were locally generated. One microarray and one scRNA-seq dataset were used in training and testing. GSE19784 was split into 80% training and 20% testing. GSE117156, due to the smaller sample size (11 patients), was split into 90% training and 10% testing. We ran 20 rounds of random cross validation using TensorFlow on a GTX1080 GPU. The expression profiles of patients and single cells from all datasets (GSE19784, GSE117156, GSE2658, IUSM) were input into the trained model after each round of cross validation to produce low-dimensional representations and predictions for each training, testing, and validation sample. Results: We found that our model was able to identify signals in the data based on expression profiles from patient-level and single cell data. The patient classification labels can be consistently reproduced in a held-out test set of patients as well as in a validation cohort of microarray data from 559 MM patients (GSE2658) and scRNA-seq from 4 MM patients from IUSM (Figure 2). These results show that the model can learn the subtypes across multiple datasets and platforms. The 4 IUSM patients tended to cluster similarly to their individual CD138+ cells after training, while GSE2658 patients still maintained some separation between MM subtype clusters (Figure 3). The single cells from our cohort of 4 patients did not necessarily classify to the same subtype as their patient. Conclusions: We found that a domain adaptive classifier can be trained across scRNA-seq and bulk gene expression profiling data from MM patients to integrate data and transfer knowledge. These models showed that single cells within a patient do not necessarily match the patient level molecular characteristics. Not surprisingly, similar results have been found in other cancer types (2). 
As our novel framework is further refined and more patients are sequenced, we expect more unique insights into both inter- and intra-tumor MM heterogeneity. References: 1. Broyl A, Hose D, Lokhorst H, et al. Gene expression profiling for molecular classification of multiple myeloma in newly diagnosed patients. Blood. 2010;116(14):2543-2553. 2. Patel AP, Tirosh I, Trombetta JJ, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014;344(6190):1396-1401. Disclosures: Abonour: Celgene: Consultancy, Research Funding; BMS: Consultancy; Takeda: Consultancy, Research Funding; Janssen: Consultancy, Research Funding. Roodman: Amgen: Membership on an entity's Board of Directors or advisory committees.
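Two steps from the Methods above, averaging a patient's cells into one vector and measuring the maximum mean discrepancy (MMD) between two groups of samples, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Gaussian (RBF) kernel, the `gamma` value, the biased estimator and all function names are assumptions made for illustration, and plain lists stand in for the last-hidden-layer activations of the network.

```python
import math


def pseudo_bulk(cells):
    """Average per-gene counts across all cells of one patient."""
    n_cells = len(cells)
    n_genes = len(cells[0])
    return [sum(cell[g] for cell in cells) / n_cells for g in range(n_genes)]


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel between two vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two sets of vectors."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, gamma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical toy data: two cells with two genes each, and two small
# groups of embeddings standing in for scRNA-seq and microarray patients.
patient_vector = pseudo_bulk([[1, 2], [3, 4]])
sc_embeddings = [[0.0, 0.0], [0.0, 1.0]]
array_embeddings = [[5.0, 5.0], [5.0, 6.0]]
gap = mmd_squared(sc_embeddings, array_embeddings)
```

In a domain-adaptation setting, a quantity like `gap` would be added to the classification loss so that training pulls the two groups of embeddings together; how the loss terms are weighted is not stated in the abstract.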


Genes ◽  
2019 ◽  
Vol 10 (7) ◽  
pp. 531 ◽  
Author(s):  
Zhang ◽  
Luo ◽  
Zhong ◽  
Choi ◽  
Ma ◽  
...  

Advances in single-cell RNA sequencing (scRNA-Seq) have allowed for comprehensive analyses of single cell data. However, current analyses of scRNA-Seq data usually start from unsupervised clustering or visualization. These methods ignore the prior knowledge of transcriptomes and of the probable structures of the data. Moreover, cell identification heavily relies on subjective and inaccurate human inspection afterwards. To address these analytical challenges, we developed the Semi-supervised Category Identification and Assignment (SCINA) algorithm, a semi-supervised model, for analyses of scRNA-Seq and flow cytometry/CyTOF data, and other data of similar format, by automatically exploiting previously established gene signatures using an expectation–maximization (EM) algorithm. We applied SCINA on a wide range of datasets, and showed its accuracy, stableness and efficiency exceeded most popular unsupervised approaches. SCINA discovered an intermediate stage of oligodendrocyte from mouse brain scRNA-Seq data. SCINA also detected immune cell population shifting in Stk4 knock-out mouse cytometry data. Finally, SCINA identified a new kidney tumor clade with similarity to FH-deficient tumors from bulk tumor data. Overall, SCINA provides both methodological advances and biological insights from perspectives different from traditional analytical methods.


2019 ◽  
Author(s):  
Ze Zhang ◽  
M.S. Danni Luo ◽  
Xue Zhong ◽  
Jin Huk Choi ◽  
Yuanqing Ma ◽  
...  

Abstract: Advances in single-cell RNA sequencing (scRNA-Seq) have allowed for comprehensive analyses of single cell data. However, current analyses of scRNA-Seq data usually start from unsupervised clustering or visualization. These methods ignore the prior knowledge of transcriptomes and of the probable structures of the data. Moreover, cell identification heavily relies on subjective and inaccurate human inspection afterwards. We reversed this paradigm and developed SCINA, a semi-supervised model, for analyses of scRNA-Seq and flow cytometry/CyTOF data, and other data of similar format, by automatically exploiting previously established gene signatures using an expectation-maximization (EM) algorithm. We applied SCINA on a wide range of datasets, and showed its accuracy, stableness and efficiency exceeded most popular unsupervised approaches. Notably, SCINA discovered an intermediate stage of oligodendrocyte from mouse brain scRNA-Seq data. SCINA also detected immune cell population shifting in Stk4 knock-out mouse cytometry data. Finally, SCINA identified a new kidney tumor clade with similarity to FH-deficient tumors from bulk tumor data. Overall, SCINA provides both methodological advances and biological insights from perspectives different from traditional analytical methods.


Micromachines ◽  
2018 ◽  
Vol 9 (8) ◽  
pp. 367 ◽  
Author(s):  
Yuguang Liu ◽  
Dirk Schulze-Makuch ◽  
Jean-Pierre de Vera ◽  
Charles Cockell ◽  
Thomas Leya ◽  
...  

Single-cell sequencing is a powerful technology that provides the capability of analyzing a single cell within a population. This technology is mostly coupled with microfluidic systems for controlled cell manipulation and precise fluid handling to shed light on the genomes of a wide range of cells. So far, single-cell sequencing has been focused mostly on human cells due to the ease of lysing the cells for genome amplification. The major challenges that bacterial species pose to genome amplification from single cells include the rigid bacterial cell walls and the need for an effective lysis protocol compatible with microfluidic platforms. In this work, we present a lysis protocol that can be used to extract genomic DNA from both gram-positive and gram-negative species without interfering with the amplification chemistry. Corynebacterium glutamicum was chosen as a typical gram-positive model and Nostoc sp. as a gram-negative model due to major challenges reported in previous studies. Our protocol is based on thermal and chemical lysis. We consider 80% of single-cell replicates that lead to >5 ng DNA after amplification as successful attempts. The protocol was directly applied to Gloeocapsa sp. and the single cells of the eukaryotic Sphaerocystis sp. and achieved a 100% success rate.


2021 ◽  
Vol 2021 ◽  
pp. 1-7
Author(s):  
Juncai Li ◽  
Xiaofei Jiang

Molecular property prediction is an essential task in drug discovery. Most computational approaches with deep learning techniques focus either on designing novel molecular representations or on combining several advanced models. However, researchers pay less attention to the potential benefits of massive unlabeled molecular data (e.g., ZINC). This task becomes increasingly challenging owing to the limited scale of labeled data. Motivated by recent advances in pretrained models for natural language processing, we note that a drug molecule can, to some extent, be viewed as a language. In this paper, we investigate how to develop the pretrained model BERT to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with a pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale prediction BERT model is pretrained to generate the embedding of molecular substructures, by using four million unlabeled drug SMILES (i.e., ZINC 15 and ChEMBL 27). Then, the pretrained BERT model can be fine-tuned on various molecular property prediction tasks. To examine the performance of our proposed Mol-BERT, we conduct several experiments on 4 widely used molecular datasets. In comparison to the traditional and state-of-the-art baselines, the results illustrate that our proposed Mol-BERT can outperform the current sequence-based methods and achieve at least 2% improvement on ROC-AUC score on the Tox21, SIDER, and ClinTox datasets.


2021 ◽  
Vol 17 (2) ◽  
pp. e1008767
Author(s):  
Zutan Li ◽  
Hangjin Jiang ◽  
Lingpeng Kong ◽  
Yuanyuan Chen ◽  
Kun Lang ◽  
...  

N6-methyladenine (6mA) is an important DNA modification form associated with a wide range of biological processes. Accurately identifying 6mA sites on a genomic scale is crucial for understanding 6mA’s biological functions. However, the existing experimental techniques for detecting 6mA sites are not cost-effective, which creates a great need for new computational methods for this problem. In this paper, we developed, without requiring any prior knowledge of 6mA or manually crafted sequence features, a deep learning framework named Deep6mA to identify DNA 6mA sites, and its performance is superior to other DNA 6mA prediction tools. Specifically, the 5-fold cross-validation on a benchmark dataset of rice gives the sensitivity and specificity of Deep6mA as 92.96% and 95.06%, respectively, and the overall prediction accuracy is 94%. Importantly, we find that the sequences with 6mA sites share similar patterns across different species. The model trained with rice data predicts the 6mA sites of three other species well: Arabidopsis thaliana, Fragaria vesca and Rosa chinensis, with a prediction accuracy over 90%. In addition, we find that (1) 6mA tends to occur at GAGG motifs, which means the sequence near the 6mA site may be conserved; (2) 6mA is enriched in the TATA box of the promoter, which may be the main source of its regulating downstream gene expression.
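The three performance figures quoted above are related by a simple identity: overall accuracy is the average of sensitivity and specificity, weighted by the share of positive and negative examples. The short Python sketch below works through that arithmetic. The 50/50 class balance is an assumption (the abstract does not state the composition of the rice benchmark), and the function names are illustrative only.

```python
def sensitivity(tp, fn):
    """Fraction of true 6mA sites that are predicted as 6mA."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-6mA sites that are predicted as non-6mA."""
    return tn / (tn + fp)


def accuracy_from_rates(sens, spec, positive_fraction):
    """Overall accuracy as a class-weighted mean of sensitivity and specificity."""
    return sens * positive_fraction + spec * (1.0 - positive_fraction)


# Reported rates, combined under the ASSUMPTION of a balanced benchmark.
balanced_accuracy = accuracy_from_rates(0.9296, 0.9506, positive_fraction=0.5)
```

Under that assumption `balanced_accuracy` comes out at roughly 0.94, in line with the reported overall accuracy of 94%; an unbalanced benchmark would shift the value towards whichever rate belongs to the larger class.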


Blood ◽  
2015 ◽  
Vol 126 (23) ◽  
pp. 4090-4090
Author(s):  
Alison R Moliterno ◽  
Donna Marie Williams ◽  
Jonathan M. Gerber ◽  
Michael A McDevitt ◽  
Ophelia Rogers ◽  
...  

Abstract Introduction: Essential thrombocytosis (ET), polycythemia vera (PV), and myelofibrosis (MF; post ETMF, post PVMF and primary MF) share the JAK2V617F mutation, but differ with regard to clinical phenotype, rate of disease progression, and risk of transformation. Variation in the JAK2V617F neutrophil allele burden does not account for these observed differences in clinical behavior or natural history. We therefore investigated the JAK2V617F burden and JAK2 genotype composition in the hematopoietic stem cell (HSC) population of MPN patients. Methods: We studied 47 JAK2V617F-positive MPN patients during 51 distinct disease phases. Circulating CD34+ cells were flow-sorted based on the stem cell markers CD34, CD38 and aldehyde dehydrogenase (ALDH). CD34+ CD38- ALDH+ HSC were sorted into 96 well plates and single cell JAK2 genotypes (average 40 single cells genotyped/patient with >1000 total single cells genotyped) were obtained using a nested PCR assay. Additional genomic lesions and chromosomal copy number variation were investigated in the sorted, single cell fractions in informative patients by FISH or multiplex single cell PCR. Distribution of JAK2V617F stem cell genotypes were correlated with disease phenotype, neutrophil JAK2V617F allele burden, splenomegaly, white cell count, chemotherapy requirement and disease evolution. Results: In all MPN cases, regardless of disease class, the JAK2V617F mutation was detected in the CD34+ CD38- ALDH+ fraction - the same population in which normal HSC reside. All ET and MF patients, and the majority of PV patients, had three JAK2 genotypes coexisting in their respective HSC populations. ET was characterized by a high percentage of JAK2WT stem cells (>75%) despite the concomitant presence of JAK2V617F homozygous clones and disease durations >15 years. Importantly, in the ET patients where JAK2WT clones fell to less than 50%, a PV phase followed. 
MF was characterized by a relatively low percentage of JAK2WT stem cells (median 24%), regardless of disease duration. PV had the most variable JAK2 genotypes with a wide range of JAK2WT stem cells (4%-92%) and a wide range of JAK2V617F homozygous stem cells (2-100%), and in 5/16 PV cases, only JAK2WT and JAK2V617F homozygous stem cells were identified. PV patients with JAK2V617F homozygous clones comprising more than 50% of their stem cells, regardless of disease duration, had higher white cell counts, higher neutrophil allele burdens, larger spleens and higher prevalence of chemotherapy compared to PV patients who had less than 50% JAK2V617F homozygous HSCs. The percentage of JAK2V617F homozygous HSC did not correlate with disease duration: some PV patients with a disease duration of >18 years had less than 10% JAK2V617F homozygous HSC. A JAK2V617F-positive PV patient with a high JAK2V617F HSC burden and a high neutrophil JAK2V617F burden transformed to a JAK2V617F-negative chronic myelomonocytic leukemia (CMMoL); at the time of HSC analysis, the neutrophil JAK2V617F allele burden was 0% (previously 90%) and the HSC JAK2V617F homozygous percentage fell to 3% (previously 60%). While this patient's CMMoL was molecularly undefined, lesions identified in other JAK2V617F-positive patients (including mutations of ASXL1, TET2, deletion of 5q, 7q and 11q, trisomy 8 and 9) were also found in the CD34+ CD38- ALDH+ HSCs using single cell techniques, sometimes coexistent with JAK2V617F-positive HSC, and sometimes in JAK2WT HSC. Conclusion: Driver and progression lesions in the JAK2V617F-positive MPN are acquired at the primitive HSC level. Despite decades of disease, the HSC pool in the MPN is mosaic for acquired lesions and also retains JAK2WT clones. Dominance of a particular JAK2 genotype at the primitive HSC level is variable, and distinguishes ET, where JAK2WT stem cells outnumber JAK2V617F-positive HSC, from MF, where JAK2WT HSC are the minority. 
PV is the most variable of the three MPN with regard to JAK2 genotype mosaicism. The allelic burden of HSC JAK2V617F in PV correlates with clinical disease burden. However, neither time nor JAK2V617F genotype determines the HSC burden in ET and PV, indicating that an undefined factor is a modifier of this important disease-defining process. Understanding the biology of HSC JAK2V617F homozygous clonal dominance may define an exploitable target to control disease burden, and to mitigate disease progression and evolution. Disclosures: Moliterno: Incyte: Membership on an entity's Board of Directors or advisory committees. Spivak: Incyte: Membership on an entity's Board of Directors or advisory committees.


Author(s):  
Nathalie Nève ◽  
James K. Lingwood ◽  
Shelley R. Winn ◽  
Derek C. Tretheway ◽  
Sean S. Kohles

Interfacing a novel micron-resolution particle image velocimetry and dual optical tweezers system (μPIVOT) with microfluidics facilitates the exposure of an individual biologic cell to a wide range of static and dynamic mechanical stress conditions. Single cells can be manipulated in a sequence of mechanical stresses (hydrostatic pressure variations, tension or compression, as well as shear and extensional fluid induced stresses) while measuring cellular deformation. The unique multimodal load states enable a new realm of single cell biomechanical studies.


2019 ◽  
Author(s):  
Shuoguo Wang ◽  
Constance Brett ◽  
Mohan Bolisetty ◽  
Ryan Golhar ◽  
Isaac Neuhaus ◽  
...  

Abstract. Motivation: Thanks to technological advances made in the last few years, we are now able to study transcriptomes from thousands of single cells. These advances have been applied widely to study various aspects of biology. Nevertheless, comprehending and inferring meaningful biological insights from these large datasets is still a challenge. Although tools are being developed to deal with the data complexity and data volume, we do not yet have effective visualization and comparative analysis tools to realize the full value of these datasets. Results: In order to address this gap, we implemented a single cell data visualization portal called Single Cell Viewer (SCV). SCV is an R shiny application that offers users rich visualization and exploratory data analysis options for single cell datasets. Availability: Source code for the application is available online at GitHub (http://www.github.com/neuhausi/single-cell-viewer) and there is a hosted exploration application using the same example dataset as this publication at http://periscopeapps.org/


2021 ◽  
Author(s):  
Nathanael Andrews ◽  
Martin Enge

Abstract: CIM-seq is a tool for deconvoluting RNA-seq data from cell multiplets (clusters of two or more cells) in order to identify physically interacting cells in a given tissue. The method requires two RNA-seq data sets from the same tissue: one of single cells to be used as a reference, and one of cell multiplets to be deconvoluted. CIM-seq is compatible with both droplet-based sequencing methods, such as Chromium Single Cell 3′ Kits from 10x Genomics, and plate-based methods, such as Smart-seq2. The pipeline consists of three parts: 1) dissociation of the target tissue, FACS sorting of single cells and multiplets, and conventional scRNA-seq; 2) feature selection and clustering of cell types in the single cell data set, generating a blueprint of transcriptional profiles in the given tissue; 3) computational deconvolution of multiplets through maximum likelihood estimation (MLE) to determine the most likely cell type constituents of each multiplet.
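The third step, choosing the most likely cell type constituents of a multiplet, can be sketched as a small maximum likelihood search. The Python example below is a simplified illustration and not the CIM-seq implementation: it assumes Poisson-distributed counts, assumes a multiplet's expected profile is the sum of its constituents' reference profiles, enumerates combinations of at most two cell types, and uses hypothetical names and toy data throughout.

```python
import math
from itertools import combinations


def poisson_log_likelihood(observed, expected):
    """Log-likelihood of observed counts under independent Poisson rates."""
    total = 0.0
    for k, lam in zip(observed, expected):
        lam = max(lam, 1e-9)  # guard against log(0)
        total += k * math.log(lam) - lam - math.lgamma(k + 1)
    return total


def deconvolve(multiplet, reference, max_cells=2):
    """Return the (cell types, log-likelihood) pair that best explains a multiplet."""
    best = None
    n_genes = len(multiplet)
    for size in range(1, max_cells + 1):
        for combo in combinations(sorted(reference), size):
            # ASSUMPTION: constituent profiles simply add up in a multiplet.
            expected = [sum(reference[t][g] for t in combo) for g in range(n_genes)]
            score = poisson_log_likelihood(multiplet, expected)
            if best is None or score > best[1]:
                best = (combo, score)
    return best


# Hypothetical reference: mean expression of three marker genes per cell type,
# as would be derived from the clustered single cell data set in step 2.
reference = {
    "A": [10.0, 0.5, 0.5],
    "B": [0.5, 10.0, 0.5],
    "C": [0.5, 0.5, 10.0],
}
doublet_types, _ = deconvolve([10, 11, 1], reference)
singlet_types, _ = deconvolve([9, 0, 1], reference)
```

With these toy profiles the first call selects types A and B together and the second selects A alone. A real data set has many more genes and cell types, so exhaustive enumeration would likely need to be replaced by an optimizer over mixing fractions.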

