scholarly journals LAmbDA: label ambiguous domain adaptation dataset integration reduces batch effects and improves subtype detection

2019 ◽  
Vol 35 (22) ◽  
pp. 4696-4706 ◽  
Author(s):  
Travis S Johnson ◽  
Tongxin Wang ◽  
Zhi Huang ◽  
Christina Y Yu ◽  
Yi Wu ◽  
...  

Abstract Motivation Rapid advances in single cell RNA sequencing (scRNA-seq) have produced higher-resolution cellular subtypes in multiple tissues and species. Methods are increasingly needed across datasets and species to (i) remove systematic biases, (ii) model multiple datasets with ambiguous labels and (iii) classify cells and map cell type labels. However, most methods only address one of these problems on broad cell types or simulated data using a single model type. It is also important to address higher-resolution cellular subtypes, subtype labels from multiple datasets, models trained on multiple datasets simultaneously and generalizability beyond a single model type. Results We developed a species- and dataset-independent transfer learning framework (LAmbDA) to train models on multiple datasets (even from different species) and applied our framework on simulated, pancreas and brain scRNA-seq experiments. These models mapped corresponding cell types between datasets with inconsistent cell subtype labels while simultaneously reducing batch effects. We achieved high accuracy in labeling cellular subtypes (weighted accuracy simulated 1 datasets: 90%; simulated 2 datasets: 94%; pancreas datasets: 88% and brain datasets: 66%) using LAmbDA Feedforward 1 Layer Neural Network with bagging. This method achieved higher weighted accuracy in labeling cellular subtypes than two other state-of-the-art methods, scmap and CaSTLe in brain (66% versus 60% and 32%). Furthermore, it achieved better performance in correctly predicting ambiguous cellular subtype labels across datasets in 88% of test cases compared with CaSTLe (63%), scmap (50%) and MetaNeighbor (50%). LAmbDA is model- and dataset-independent and generalizable to diverse data types representing an advance in biocomputing. Availability and implementation github.com/tsteelejohnson91/LAmbDA Supplementary information Supplementary data are available at Bioinformatics online.

2019 ◽  
Author(s):  
Travis S Johnson ◽  
Zhi Huang ◽  
Christina Y Yu ◽  
Tongxin Wang ◽  
Yi Wu ◽  
...  

AbstractMotivationRapid advances in single cell RNA sequencing have produced more granular subtypes of cells in multiple tissues from different species. There exists a need to develop rigorous methods that can i) model multiple datasets with ambiguous labels across species and studies and ii) remove systematic biases across datasets and species.ResultsWe developed a species- and dataset-independent transfer learning framework (LAmbDA) to train models on multiple datasets and applied our framework on scRNA-seq experiments. These models mapped corresponding cell types between datasets with inconsistent labels while simultaneously reducing batch effects. We achieved high accuracy in labeling cellular subtypes (weighted accuracy pancreas: 91%, brain: 78%) using LAmbDA Random Forest. LAmbDA Feedforward 1 Layer Neural Network achieved higher weighted accuracy in labeling cellular subtypes than CaSTLe or MetaNeighbor in brain (48%, 32%, 20% respectively). Furthermore, LAmbDA Feedforward 1 Layer Neural Network was the only method to correctly predict ambiguous cellular subtype labels in both pancreas and brain compared to CaSTLe and MetaNeighbor. LAmbDA is model- and dataset-independent and generalizable to diverse data types representing an advance in biocomputing.Availability: github.com/tsteelejohnson91/LAmbDAContact:[email protected], [email protected]


2022 ◽  
Author(s):  
Chenfei Wang ◽  
Pengfei Ren ◽  
Xiaoying Shi ◽  
Xin Dong ◽  
Zhiguang Yu ◽  
...  

Abstract The rapid accumulation of single-cell RNA-seq data has provided rich resources to characterize various human cell types. Cell type annotation is the critical step in analyzing single-cell RNA-seq data. However, accurate cell type annotation based on public references is challenging due to the inconsistent annotations, batch effects, and poor characterization of rare cell types. Here, we introduce SELINA (single cELl identity NAvigator), an integrative annotation transferring framework for automatic cell type annotation. SELINA optimizes the annotation for minority cell types by synthetic minority over-sampling, removes batch effects among reference datasets using a multiple-adversarial domain adaptation network (MADA), and fits the query data with reference data using an autoencoder. Finally, SELINA affords a comprehensive and uniform reference atlas with 1.7 million cells covering 230 major human cell types. We demonstrated the robustness and superiority of SELINA in most human tissues compared to existing methods. SELINA provided a one-stop solution for human single- cell RNA-seq data annotation with the potential to extend for other species.


Author(s):  
Musu Yuan ◽  
Liang Chen ◽  
Minghua Deng

Abstract Motivation Single-cell RNA-seq (scRNA-seq) has been widely used to resolve cellular heterogeneity. After collecting scRNA-seq data, the natural next step is to integrate the accumulated data to achieve a common ontology of cell types and states. Thus, an effective and efficient cell-type identification method is urgently needed. Meanwhile, high quality reference data remain a necessity for precise annotation. However, such tailored reference data are always lacking in practice. To address this, we aggregated multiple datasets into a meta-dataset on which annotation is conducted. Existing supervised or semi-supervised annotation methods suffer from batch effects caused by different sequencing platforms, the effect of which increases in severity with multiple reference datasets. Results Herein, a robust deep learning based single-cell Multiple Reference Annotator (scMRA) is introduced. In scMRA, a knowledge graph is constructed to represent the characteristics of cell types in different datasets, and a graphic convolutional network (GCN) serves as a discriminator based on this graph. scMRA keeps intra-cell-type closeness and the relative position of cell types across datasets. scMRA is remarkably powerful at transferring knowledge from multiple reference datasets, to the unlabeled target domain, thereby gaining an advantage over other state-of-the-art annotation methods in multi-reference data experiments. Furthermore, scMRA can remove batch effects. To the best of our knowledge, this is the first attempt to use multiple insufficient reference datasets to annotate target data, and it is, comparatively, the best annotation method for multiple scRNA-seq datasets. Availability An implementation of scMRA is available from https://github.com/ddb-qiwang/scMRA-torch Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Nimrod Rappoport ◽  
Roy Safra ◽  
Ron Shamir

AbstractRecent advances in experimental biology allow creation of datasets where several genome-wide data types (called omics) are measured per sample. Integrative analysis of multi-omic datasets in general, and clustering of samples in such datasets specifically, can improve our understanding of biological processes and discover different disease subtypes. In this work we present Monet (Multi Omic clustering by Non-Exhaustive Types), which presents a unique approach to multi-omic clustering. Monet discovers modules of similar samples, such that each module is allowed to have a clustering structure for only a subset of the omics. This approach differs from most extant multi-omic clustering algorithms, which assume a common structure across all omics, and from several recent algorithms that model distinct cluster structures using Bayesian statistics. We tested Monet extensively on simulated data, on an image dataset, and on ten multi-omic cancer datasets from TCGA. Our analysis shows that Monet compares favorably with other multi-omic clustering methods. We demonstrate Monet’s biological and clinical relevance by analyzing its results for Ovarian Serous Cystadenocarcinoma. We also show that Monet is robust to missing data, can cluster genes in multi-omic dataset, and reveal modules of cell types in single-cell multi-omic data. Our work shows that Monet is a valuable tool that can provide complementary results to those provided by extant algorithms for multi-omic analysis.


2021 ◽  
Author(s):  
Xiang Zhou ◽  
Hua Chai ◽  
Yuansong Zeng ◽  
Huiying Zhao ◽  
Ching-Hsing Luo ◽  
...  

AbstractMotivationIn single cell analyses, cell types are conventionally identified based on known marker gene expressions. Such approaches are time-consuming and irreproducible. Therefore, many new supervised methods have been developed to identify cell types for target datasets using the rapid accumulation of public datasets. However, these approaches are sensitive to batch effects or biological variations since the data distributions are different in cross-platforms or species predictions.ResultsWe developed scAdapt, a virtual adversarial domain adaptation network to transfer cell labels between datasets with batch effects. scAdapt used both the labeled source and unlabeled target data to train an enhanced classifier, and aligned the labeled source centroid and pseudo-labeled target centroid to generate a joint embedding. We demonstrate that scAdapt outperforms existing methods for classification in simulated, cross-platforms, cross-species, and spatial transcriptomic datasets. Further quantitative evaluations and visualizations for the aligned embeddings confirm the superiority in cell mixing and preserving discriminative cluster structure present in the original datasets.Availabilityhttps://github.com/zhoux85/[email protected] or [email protected]


2019 ◽  
Author(s):  
Natalie Sauerwald ◽  
Carl Kingsford

AbstractIn many applications, a consistently high measurement across many samples can indicate particularly meaningful or useful information for quality control or biological interpretation. Identification of these strong features among many others can be challenging especially when the samples cannot be expected to have the same distribution or range of values. We present a general method called conserved feature discovery (CFD) for identifying features with consistently strong signals across multiple conditions or samples. Given any real-valued data, CFD requires no parameters, makes no assumptions on the shape of the underlying sample distributions, and is robust to differences across these distributions.We show that with high probability CFD identifies all true positives and no false positives under certain assumptions on the median and variance distributions of the feature measurements. Using simulated data, we show that CFD is tolerant to a small percentage of poor quality samples and robust to false positives. Applying CFD to RNA sequencing data from the Human Body Map project and GTEx, we identify housekeeping genes as highly expressed genes across tissue types and compare to housekeeping gene lists from previous methods. CFD is consistent between the Human Body Map and GTEx data sets, and identifies lists of genes enriched for basic cellular processes as expected. The framework can be easily adapted for many data types and desired feature properties.AvailabilityCode for CFD and scripts to reproduce the figures and analysis in this work are available at https://github.com/Kingsford-Group/cfd.Supplementary informationSupplementary data are available at https://github.com/Kingsford-Group/cfd.


2020 ◽  
Vol 29 (10) ◽  
pp. 2851-2864
Author(s):  
Manuel Ugidos ◽  
Sonia Tarazona ◽  
José M Prats-Montalbán ◽  
Alberto Ferrer ◽  
Ana Conesa

Diversity of omic technologies has expanded in the last years together with the number of omic data integration strategies. However, multiomic data generation is costly, and many research groups cannot afford research projects where many different omic techniques are generated, at least at the same time. As most researchers share their data in public repositories, different omic datasets of the same biological system obtained at different labs can be combined to construct a multiomic study. However, data obtained at different labs or moments in time are typically subjected to batch effects that need to be removed for successful data integration. While there are methods to correct batch effects on the same data types obtained in different studies, they cannot be applied to correct lab or batch effects across omics. This impairs multiomic meta-analysis. Fortunately, in many cases, at least one omics platform—i.e. gene expression— is repeatedly measured across labs, together with the additional omic modalities that are specific to each study. This creates an opportunity for batch analysis. We have developed MultiBaC (multiomic Multiomics Batch-effect Correction correction), a strategy to correct batch effects from multiomic datasets distributed across different labs or data acquisition events. Our strategy is based on the existence of at least one shared data type which allows data prediction across omics. We validate this approach both on simulated data and on a case where the multiomic design is fully shared by two labs, hence batch effect correction within the same omic modality using traditional methods can be compared with the MultiBaC correction across data types. Finally, we apply MultiBaC to a true multiomic data integration problem to show that we are able to improve the detection of meaningful biological effects.


Author(s):  
Siddharth Annaldasula ◽  
Martyna Gajos ◽  
Andreas Mayer

Abstract Summary Despite the continuous discovery of new transcript isoforms, fueled by the recent increase in accessibility and accuracy of long-read RNA sequencing data, functional differences between isoforms originating from the same gene often remain obscure. To address this issue and enable researchers to assess potential functional consequences of transcript isoform variation on the proteome, we developed IsoTV. IsoTV is a versatile pipeline to process, predict, and visualize the functional features of translated transcript isoforms. Attributes such as gene and isoform expression, transcript composition, and functional features are summarized in an easy-to-interpret visualization. IsoTV is able to analyze a variety of data types from all eukaryotic organisms, including short- and long-read RNA-seq data. Using Oxford Nanopore long read data, we demonstrate that IsoTV facilitates the understanding of potential protein isoform function in different cancer cell types. Availability IsoTV is available at https://github.molgen.mpg.de/MayerGroup/IsoTV, with the corresponding documentation at https://isotv.readthedocs.io/. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Alma Andersson ◽  
Joakim Lundeberg

Abstract Motivation Collection of spatial signals in large numbers has become a routine task in multiple omics-fields, but parsing of these rich datasets still pose certain challenges. In whole or near-full transcriptome spatial techniques, spurious expression profiles are intermixed with those exhibiting an organized structure. To distinguish profiles with spatial patterns from the background noise, a metric that enables quantification of spatial structure is desirable. Current methods designed for similar purposes tend to be built around a framework of statistical hypothesis testing, hence we were compelled to explore a fundamentally different strategy. Results We propose an unexplored approach to analyze spatial transcriptomics data, simulating diffusion of individual transcripts to extract genes with spatial patterns. The method performed as expected when presented with synthetic data. When applied to real data, it identified genes with distinct spatial profiles, involved in key biological processes or characteristic for certain cell types. Compared to existing methods, ours seemed to be less informed by the genes’ expression levels and showed better time performance when run with multiple cores. Availabilityand implementation Open-source Python package with a command line interface (CLI), freely available at https://github.com/almaan/sepal under an MIT licence. A mirror of the GitHub repository can be found at Zenodo, doi: 10.5281/zenodo.4573237. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Hanyu Zhang ◽  
Ruoyi Cai ◽  
James Dai ◽  
Wei Sun

AbstractWe introduce a new computational method named EMeth to estimate cell type proportions using DNA methylation data. EMeth is a reference-based method that requires cell type-specific DNA methylation data from relevant cell types. EMeth improves on the existing reference-based methods by detecting the CpGs whose DNA methylation are inconsistent with the deconvolution model and reducing their contributions to cell type decomposition. Another novel feature of EMeth is that it allows a cell type with known proportions but unknown reference and estimates its methylation. This is motivated by the case of studying methylation in tumor cells while bulk tumor samples include tumor cells as well as other cell types such as infiltrating immune cells, and tumor cell proportion can be estimated by copy number data. We demonstrate that EMeth delivers more accurate estimates of cell type proportions than several other methods using simulated data and in silico mixtures. Applications in cancer studies show that the proportions of T regulatory cells estimated by DNA methylation have expected associations with mutation load and survival time, while the estimates from gene expression miss such associations.


Sign in / Sign up

Export Citation Format

Share Document