Self-distillation contrastive learning enables clustering-free signature extraction and mapping to multimodal single-cell atlas of multimillion scale

Author(s):  
Meng Yang ◽  
Yueyuxiao Yang ◽  
Haiping Huang ◽  
Chenxi Xie ◽  
Huanming Yang ◽  
...  

Abstract Massively generated single-cell multi-omics datasets are revolutionizing biological studies of heterogeneous tissues and organisms, which necessitate powerful computational methods to unleash the full potential of these tremendous data. Here, we present Concerto, which stands for self-distillation contrastive learning of cell representations, a self-supervised representation learning framework optimized with an asymmetric teacher-student configuration to analyze single-cell multi-omics datasets, with scalability up to building a 10 million-cell reference within 1.5 hours and querying 10k cells within 8 seconds. Concerto leverages a dropout layer as minimal data augmentation to learn meaningful cell representations in a contrastive manner. The teacher module uses an attention mechanism to aggregate contextualized gene embeddings within the cellular context, while the student module uses a simpler dense structure with discrete input. The learned task-agnostic representations can be adapted to a broad range of single-cell computation tasks: 1) via supervised fine-tuning, Concerto enables automatic cell classification as well as novel cell-type discovery; 2) attention weights provide model interpretability by automatically extracting specific molecular signatures at single-cell resolution without the need for clustering; 3) via source-aware training, Concerto supports efficient data integration by projecting all cells across multiple batches into a joint embedding space; 4) via batch-aware inference or unsupervised fine-tuning, Concerto enables mapping query cells onto a reference and accurately transferring annotations. Concerto can flexibly extend to multi-omics datasets simply through a cross-modality summation operation to obtain unified cell embeddings. Using examples from human peripheral blood, human thymus, human pancreas, and a mouse tissue atlas, Concerto shows superior performance when benchmarked against other top-performing methods. We also demonstrate that Concerto recapitulates detailed COVID-19 disease variation through query-to-reference mapping. Concerto can operate on all genes and represents a fully data-driven approach with minimal prior distribution assumptions, eliminating the need for PCA-like or autoencoder-like dimensionality reduction, which significantly reforms the current best practice. Concerto is a simple, straightforward, robust, and scalable framework that offers a brand new perspective for deriving cell representations and can effectively satisfy the emerging paradigm of query-to-reference mapping in the era of atlas-level single-cell multimodal analysis.
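
The idea of using dropout as the only data augmentation in a contrastive objective can be illustrated with a minimal sketch. This is not the Concerto implementation: the abstract describes an asymmetric attention-based teacher and a dense student, whereas the toy below passes each cell twice through one shared linear encoder with independent dropout masks and scores the two views with an InfoNCE-style loss. The encoder, the loss form, the temperature, the dropout rate, and all names are assumptions chosen for illustration.

```python
import math
import random


def dropout(vec, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def encode(vec, weights, rate, rng):
    """Toy encoder (assumed): dropout on the input, a linear map, L2 normalization."""
    dropped = dropout(vec, rate, rng)
    out = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
    norm = math.sqrt(sum(o * o for o in out)) or 1.0
    return [o / norm for o in out]


def info_nce(view_a, view_b, temperature=0.1):
    """Mean cross-entropy of matching cell i in view_a to cell i in view_b."""
    total = 0.0
    for i, a in enumerate(view_a):
        logits = [sum(p * q for p, q in zip(a, b)) / temperature for b in view_b]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


# Hypothetical toy data: 8 cells, 50 genes, 4-dimensional embeddings.
rng = random.Random(0)
n_cells, n_genes, n_dims = 8, 50, 4
cells = [[rng.random() for _ in range(n_genes)] for _ in range(n_cells)]
weights = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_dims)]

# Two stochastic passes over the same cells yield the two "views".
view_a = [encode(c, weights, 0.2, rng) for c in cells]
view_b = [encode(c, weights, 0.2, rng) for c in cells]
loss = info_nce(view_a, view_b)
print(f"contrastive loss: {loss:.4f}")
```

In a real training loop the loss would be minimized with respect to the encoder weights, pulling the two dropout views of the same cell together and pushing views of different cells apart.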

2021 ◽  
Author(s):  
Hongyu Shen ◽  
Layne C. Price ◽  
Taha Bahadori ◽  
Franziska Seeger

Abstract While protein sequence data is an emerging application domain for machine learning methods, small modifications to protein sequences can result in difficult-to-predict changes to the protein’s function. Consequently, protein machine learning models typically do not use randomized data augmentation procedures analogous to those used in computer vision or natural language, e.g., cropping or synonym substitution. In this paper, we empirically explore a set of simple string manipulations, which we use to augment protein sequence data when fine-tuning semi-supervised protein models. We provide 276 different comparisons to the Tasks Assessing Protein Embeddings (TAPE) baseline models, with Transformer-based models and training datasets that vary from the baseline methods only in the data augmentations and representation learning procedure. For each TAPE validation task, we demonstrate improvements to the baseline scores when the learned protein representation is fixed between tasks. We also show that contrastive learning fine-tuning methods typically outperform masked-token prediction in these models, with increasing amounts of data augmentation generally improving performance for contrastive learning protein methods. We find the most consistent results across TAPE tasks when using domain-motivated transformations, such as amino acid replacement, as well as restricting the Transformer attention to randomly sampled sub-regions of the protein sequence. In rarer cases, we even find that information-destroying augmentations, such as randomly shuffling entire protein sequences, can improve downstream performance.
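
The kinds of string manipulations named in the abstract above can be sketched in a few lines of Python. This is an illustrative approximation, not the authors' code: the replacement step here draws uniformly from the 20 standard amino acids rather than from any domain-motivated substitution table, the sub-region step simply crops a contiguous window instead of restricting Transformer attention, and the function names, probabilities, and example sequence are hypothetical.

```python
import random

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def replace_residues(seq, prob, rng):
    """Replace each residue with a uniformly random amino acid with probability `prob`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < prob else aa for aa in seq)


def random_subsequence(seq, length, rng):
    """Return a contiguous window of `length` residues (the whole sequence if shorter)."""
    if len(seq) <= length:
        return seq
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]


def shuffle_sequence(seq, rng):
    """Randomly permute all residues: composition is kept, order is destroyed."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


rng = random.Random(0)
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical example sequence
print(replace_residues(sequence, 0.1, rng))
print(random_subsequence(sequence, 12, rng))
print(shuffle_sequence(sequence, rng))
```

In a contrastive setup, two independently augmented copies of the same sequence would serve as a positive pair.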


2021 ◽  
Author(s):  
Wenchuan Wang ◽  
Fan Yang ◽  
Yuan Fang ◽  
Duyu Tang ◽  
Junzhou Huang ◽  
...  

Abstract Reliable cell type annotation is a prerequisite for downstream analysis of single-cell RNA sequencing data. Existing annotation algorithms typically suffer from improper handling of batch effect, lack of curated marker gene lists, or difficulty in leveraging the latent gene-gene interaction information. Inspired by large-scale pretrained language models, we present a pretrained deep neural network-based model, scBERT (single-cell Bidirectional Encoder Representations from Transformers), to overcome the above challenges. scBERT follows the state-of-the-art paradigm of pre-training and fine-tuning in the deep learning field. In the first phase, scBERT obtains a general understanding of gene-gene interaction by being pre-trained on huge amounts of unlabeled scRNA-seq data. The pre-trained scBERT can then be used for the cell annotation task of unseen and user-specific scRNA-seq data through supervised fine-tuning. Extensive and rigorous benchmark studies validate the superior performance of scBERT on various tasks, including cell type annotation, novel cell type discovery, as well as investigation of gene-gene interactions. Thus, scBERT offers better generalization and interpretability than existing annotation tools.


2020 ◽  
Vol 11 (1) ◽  
pp. 20190122 ◽  
Author(s):  
N. Getty ◽  
T. Brettin ◽  
D. Jin ◽  
R. Stevens ◽  
F. Xia

Deep learning is increasingly used in medical imaging, improving many steps of the processing chain, from acquisition to segmentation and anomaly detection to outcome prediction. Yet significant challenges remain: (i) image-based diagnosis depends on the spatial relationships between local patterns, something convolution and pooling often do not capture adequately; (ii) data augmentation, the de facto method for learning three-dimensional pose invariance, requires exponentially many points to achieve robust improvement; (iii) labelled medical images are much less abundant than unlabelled ones, especially for heterogeneous pathological cases; and (iv) scanning technologies such as magnetic resonance imaging can be slow and costly, generally without online learning abilities to focus on regions of clinical interest. To address these challenges, novel algorithmic and hardware approaches are needed for deep learning to reach its full potential in medical imaging.


Diagnostics ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. 1052
Author(s):  
Leang Sim Nguon ◽  
Kangwon Seo ◽  
Jung-Hyun Lim ◽  
Tae-Jun Song ◽  
Sung-Hyun Cho ◽  
...  

Mucinous cystic neoplasms (MCN) and serous cystic neoplasms (SCN) account for a large portion of solitary pancreatic cystic neoplasms (PCN). In this study we implemented a convolutional neural network (CNN) model using ResNet50 to differentiate between MCN and SCN. The training data were collected retrospectively from 59 MCN and 49 SCN patients from two different hospitals. Data augmentation was used to enhance the size and quality of training datasets. Fine-tuning training approaches were utilized by adopting the pre-trained model from transfer learning while training selected layers. Testing of the network was conducted by varying the endoscopic ultrasonography (EUS) image sizes and positions to evaluate the network performance for differentiation. The proposed network model achieved up to 82.75% accuracy and a 0.88 (95% CI: 0.817–0.930) area under curve (AUC) score. The performance of the implemented deep learning networks in decision-making using only EUS images is comparable to that of traditional manual decision-making using EUS images along with supporting clinical information. Gradient-weighted class activation mapping (Grad-CAM) confirmed that the network model learned the features from the cyst region accurately. This study proves the feasibility of diagnosing MCN and SCN using a deep learning network model. Further improvement using more datasets is needed.


Author(s):  
Leon Hetzel ◽  
David S. Fischer ◽  
Stephan Günnemann ◽  
Fabian J. Theis

Gene Therapy ◽  
2021 ◽  
Author(s):  
A. S. Mathew ◽  
C. M. Gorick ◽  
R. J. Price

Abstract Gene delivery via focused ultrasound (FUS) mediated blood-brain barrier (BBB) opening is a disruptive therapeutic modality. Unlocking its full potential will require an understanding of how FUS parameters (e.g., peak-negative pressure (PNP)) affect transfected cell populations. Following plasmid (mRuby) delivery across the BBB with 1 MHz FUS, we used single-cell RNA-sequencing to ascertain that distributions of transfected cell types were highly dependent on PNP. Cells of the BBB (i.e., endothelial cells, pericytes, and astrocytes) were enriched at 0.2 MPa PNP, while transfection of cells distal to the BBB (i.e., neurons, oligodendrocytes, and microglia) was augmented at 0.4 MPa PNP. PNP-dependent differential gene expression was observed for multiple cell types. Cell stress genes were upregulated proportional to PNP, independent of cell type. Our results underscore how FUS may be tuned to bias transfection toward specific brain cell types in vivo and predict how those cells will respond to transfection.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Shengquan Chen ◽  
Guanao Yan ◽  
Wenyu Zhang ◽  
Jinzhao Li ◽  
Rui Jiang ◽  
...  

Abstract The recent advancements in single-cell technologies, including single-cell chromatin accessibility sequencing (scCAS), have enabled profiling the epigenetic landscapes for thousands of individual cells. However, the characteristics of scCAS data, including high dimensionality, high degree of sparsity and high technical variation, make the computational analysis challenging. Reference-guided approaches, which utilize the information in existing datasets, may facilitate the analysis of scCAS data. Here, we present RA3 (Reference-guided Approach for the Analysis of single-cell chromatin Accessibility data), which utilizes the information in massive existing bulk chromatin accessibility and annotated scCAS data. RA3 simultaneously models (1) the shared biological variation among scCAS data and the reference data, and (2) the unique biological variation in scCAS data that identifies distinct subpopulations. We show that RA3 achieves superior performance when used on several scCAS datasets, and on references constructed using various approaches. Altogether, these analyses demonstrate the wide applicability of RA3 in analyzing scCAS data.


2021 ◽  
Vol 7 ◽  
pp. e571
Author(s):  
Nurdan Ayse Saran ◽  
Murat Saran ◽  
Fatih Nar

In the last decade, deep learning has been applied to a wide range of problems with tremendous success. This success mainly comes from large data availability, increased computational power, and theoretical improvements in the training phase. As the dataset grows, the real world is better represented, making it possible to develop a model that can generalize. However, creating a labeled dataset is expensive, time-consuming, and in some domains challenging if not infeasible. Therefore, researchers have proposed data augmentation methods to increase dataset size and variety by creating variations of the existing data. For image data, variations can be obtained by applying color or spatial transformations, either alone or in combination. Such color transformations perform linear or nonlinear operations on the entire image or on patches to create variations of the original image. The current color-based augmentation methods are usually based on image processing methods that apply color transformations such as equalizing, solarizing, and posterizing. Nevertheless, these color-based data augmentation methods are not guaranteed to create plausible variations of the image. This paper proposes a novel distribution-preserving data augmentation method that creates plausible image variations by shifting pixel colors to another point in the image color distribution. We achieved this by defining a regularized density decreasing direction to create paths from the original pixels’ color to the distribution tails. The proposed method provides superior performance compared to existing data augmentation methods, as shown in a transfer learning scenario on the UC Merced Land-use, Intel Image Classification, and Oxford-IIIT Pet datasets for classification and segmentation tasks.

