Deep learning of immune cell differentiation

Alexandra Maslova; Ricardo N. Ramirez; Ke Ma; Hugo Schmutz; Chendi Wang; Curtis Fox; Bernard Ng; Christophe Benoist; Sara Mostafavi;

doi:10.1073/pnas.2011795117

Deep learning of immune cell differentiation

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2011795117 ◽

2020 ◽

Vol 117 (41) ◽

pp. 25655-25666 ◽

Cited By ~ 2

Author(s):

Alexandra Maslova ◽

Ricardo N. Ramirez ◽

Ke Ma ◽

Hugo Schmutz ◽

Chendi Wang ◽

...

Keyword(s):

Deep Learning ◽

Cell Differentiation ◽

Dna Sequences ◽

Immune Cell ◽

Cell Types ◽

Regulatory Elements ◽

Chromatin Accessibility ◽

Regulatory Sequence ◽

B Lineage ◽

Additional Support

Although we know many sequence-specific transcription factors (TFs), how the DNA sequence of cis-regulatory elements is decoded and orchestrated on the genome scale to determine immune cell differentiation is beyond our grasp. Leveraging a granular atlas of chromatin accessibility across 81 immune cell types, we asked if a convolutional neural network (CNN) could learn to infer cell type-specific chromatin accessibility solely from regulatory DNA sequences. With a tailored architecture and an ensemble approach to CNN parameter interpretation, we show that our trained network (“AI-TAC”) does so by rediscovering ab initio the binding motifs for known regulators and some unknown ones. Motifs whose importance is learned virtually as functionally important overlap strikingly well with positions determined by chromatin immunoprecipitation for several TFs. AI-TAC establishes a hierarchy of TFs and their interactions that drives lineage specification and also identifies stage-specific interactions, like Pax5/Ebf1 vs. Pax5/Prdm1, or the role of different NF-κB dimers in different cell types. AI-TAC assigns Spi1/Cebp and Pax5/Ebf1 as the drivers necessary for myeloid and B lineage fates, respectively, but no factors seemed as dominantly required for T cell differentiation, which may represent a fall-back pathway. Mouse-trained AI-TAC can parse human DNA, revealing a strikingly similar ranking of influential TFs and providing additional support that AI-TAC is a generalizable regulatory sequence decoder. Thus, deep learning can reveal the regulatory syntax predictive of the full differentiative complexity of the immune system.
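As a rough illustration of how a convolutional filter can act as a motif detector on raw DNA, the sketch below one-hot encodes a sequence and slides a single filter along it. This is not the AI-TAC architecture: the filter weights, the toy motif "GGAA", and the names `one_hot`, `scan_filter`, and `TOY_FILTER` are assumptions made for illustration only.

```python
from typing import List

BASES = "ACGT"


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as a len(seq) x 4 matrix.

    Bases outside ACGT (for example N) become an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def scan_filter(encoded: List[List[float]], weights: List[List[float]]) -> List[float]:
    """Slide a (width x 4) filter along the sequence without padding.

    Returns one ReLU-rectified score per start position, which is what a
    single first-layer convolutional filter would compute.
    """
    width = len(weights)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = sum(
            encoded[start + i][j] * weights[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(max(0.0, total))
    return scores


# Hypothetical hand-written filter that rewards the toy motif "GGAA".
# A trained network would learn such weights from data instead.
TOY_FILTER = [
    [-1.0, -1.0, 1.0, -1.0],  # position 1 prefers G
    [-1.0, -1.0, 1.0, -1.0],  # position 2 prefers G
    [1.0, -1.0, -1.0, -1.0],  # position 3 prefers A
    [1.0, -1.0, -1.0, -1.0],  # position 4 prefers A
]

activations = scan_filter(one_hot("TTGGAATC"), TOY_FILTER)
best_position = max(range(len(activations)), key=activations.__getitem__)
print(activations, best_position)  # the filter fires only where "GGAA" starts (index 2)
```

Reading a learned filter back as a position weight matrix and comparing it with known motifs is one common way to interpret such models; the abstract describes an ensemble approach to parameter interpretation, whose details are not given here.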


CATaDa reveals global remodelling of chromatin accessibility during stem cell differentiation in vivo

eLife ◽

10.7554/elife.32341 ◽

2018 ◽

Vol 7 ◽

Cited By ~ 29

Author(s):

Gabriel N Aughey ◽

Alicia Estacio Gomez ◽

Jamie Thomson ◽

Hang Yin ◽

Tony D Southall

Keyword(s):

Stem Cells ◽

Stem Cell ◽

Cell Differentiation ◽

Ectopic Expression ◽

Stem Cell Differentiation ◽

Cell Types ◽

Regulatory Elements ◽

Chromatin Accessibility ◽

Global Changes

During development eukaryotic gene expression is coordinated by dynamic changes in chromatin structure. Measurements of accessible chromatin are used extensively to identify genomic regulatory elements. Whilst chromatin landscapes of pluripotent stem cells are well characterised, chromatin accessibility changes in the development of somatic lineages are not well defined. Here we show that cell-specific chromatin accessibility data can be produced via ectopic expression of E. coli Dam methylase in vivo, without the requirement for cell-sorting (CATaDa). We have profiled chromatin accessibility in individual cell-types of Drosophila neural and midgut lineages. Functional cell-type-specific enhancers were identified, as well as novel motifs enriched at different stages of development. Finally, we show global changes in the accessibility of chromatin between stem-cells and their differentiated progeny. Our results demonstrate the dynamic nature of chromatin accessibility in somatic tissues during stem cell differentiation and provide a novel approach to understanding gene regulatory mechanisms underlying development.


Unbiased integration of single cell multi-omics data

10.1101/2020.12.11.422014 ◽

2020 ◽

Author(s):

Jinzhuang Dou ◽

Shaoheng Liang ◽

Vakul Mohanty ◽

Xuesen Cheng ◽

Sangbae Kim ◽

...

Keyword(s):

Single Cell ◽

Immune Cell ◽

Cell Types ◽

Regulatory Elements ◽

Chromatin Accessibility ◽

Mouse Retina ◽

Transcriptomics Data ◽

Mouse Brain Cortex ◽

Broad Variety ◽

Data Matrices

Acquiring accurate single-cell multiomics profiles often requires performing unbiased in silico integration of data matrices generated by different single-cell technologies from the same biological sample. However, both the rows and the columns can represent different entities in different data matrices, making such integration a computational challenge that has only been solved approximately by existing approaches. Here, we present bindSC, a single-cell data integration tool that realizes simultaneous alignment of the rows and the columns between data matrices without making approximations. Using datasets produced by multiomics technologies as gold standard, we show that bindSC generates accurate multimodal co-embeddings that are substantially more accurate than those generated by existing approaches. Particularly, bindSC effectively integrated single cell RNA sequencing (scRNA-seq) and single cell chromatin accessibility sequencing (scATAC-seq) data towards discovering key regulatory elements in cancer cell-lines and mouse cells. It achieved accurate integration of both common and rare cell types (<0.25% abundance) in a novel mouse retina cell atlas generated using the 10x Genomics Multiome ATAC+RNA kit. Further, it achieves unbiased integration of scRNA-seq and 10x Visium spatial transcriptomics data derived from mouse brain cortex samples. Lastly, it demonstrated efficacy in delineating immune cell types via integrating single-cell RNA and protein data. Thus, bindSC, available at https://github.com/KChen-lab/bindSC, can be applied in a broad variety of contexts to accelerate discovery of complex cellular and biological identities and associated molecular underpinnings in diseases and developing organisms.


Unbiased integration of single cell multi-omics data

10.21203/rs.3.rs-126986/v1 ◽

2020 ◽

Author(s):

Jinzhuang Dou ◽

Shaoheng Liang ◽

Vakul Mohanty ◽

Xuesen Cheng ◽

Sangbae Kim ◽

...

Keyword(s):

Single Cell ◽

Immune Cell ◽

Cell Types ◽

Regulatory Elements ◽

Chromatin Accessibility ◽

Mouse Retina ◽

Transcriptomics Data ◽

Mouse Brain Cortex ◽

Broad Variety ◽

Data Matrices

Acquiring accurate single-cell multiomics profiles often requires performing unbiased in silico integration of data matrices generated by different single-cell technologies from the same biological sample. However, both the rows and the columns can represent different entities in different data matrices, making such integration a computational challenge that has only been solved approximately by existing approaches. Here, we present bindSC, a single-cell data integration tool that realizes simultaneous alignment of the rows and the columns between data matrices without making approximations. Using datasets produced by multiomics technologies as gold standard, we show that bindSC generates accurate multimodal co-embeddings that are substantially more accurate than those generated by existing approaches. Particularly, bindSC effectively integrated single cell RNA sequencing (scRNA-seq) and single cell chromatin accessibility sequencing (scATAC-seq) data towards discovering key regulatory elements in cancer cell-lines and mouse cells. It achieved accurate integration of both common and rare cell types (<0.25% abundance) in a novel mouse retina cell atlas generated using the 10x Genomics Multiome ATAC+RNA kit. Further, it achieves unbiased integration of scRNA-seq and 10x Visium spatial transcriptomics data derived from mouse brain cortex samples. Lastly, it demonstrated efficacy in delineating immune cell types via integrating single-cell RNA and protein data. Thus, bindSC, available at https://github.com/KChen-lab/bindSC, can be applied in a broad variety of contexts to accelerate discovery of complex cellular and biological identities and associated molecular underpinnings in diseases and developing organisms.


Incorporating Gene Expression in Genome-wide Prediction of Chromatin Accessibility via Deep Learning

10.1101/610642 ◽

2019 ◽

Author(s):

Qiao Liu ◽

Wing Hung Wong ◽

Rui Jiang

Keyword(s):

Deep Learning ◽

Human Genome ◽

Cell Types ◽

Regulatory Elements ◽

Chromatin Accessibility ◽

Superior Performance ◽

Specific Cell ◽

Cell Type ◽

Transcriptome Profile ◽

Cell Type Specific

Regulatory elements (REs) in the human genome are major sites of non-coding transcription which lack adequate interpretation. Although computational approaches have been complementing high-throughput biological experiments towards the annotation of the human genome, it remains a big challenge to systematically and accurately characterize REs in the context of a specific cell type. To address this problem, we proposed DeepCAGE, a deep learning framework that incorporates the transcriptome profile of human transcription factors (TFs) for accurately predicting the activities of cell type-specific REs. Our approach automatically learns the regulatory code of the input DNA sequence combined with cell type-specific TF expression. In a series of systematic comparisons with existing methods, we show the superior performance of our model in not only the classification of accessible regions, but also the regression of DNase-seq signals. A typical usage scenario for our method is to predict the activities of REs in novel cell types, especially where chromatin accessibility data are not available. To sum up, our study provides insight into complex regulatory mechanisms by integrating the transcriptome profile of human TFs.


CoRE-ATAC: A deep learning model for the functional classification of regulatory elements from single cell and bulk ATAC-seq data

10.1101/2020.06.22.165183 ◽

2020 ◽

Author(s):

Asa Thibodeau ◽

Shubham Khetan ◽

Alper Eroglu ◽

Ryan Tewhey ◽

Michael L. Stitzel ◽

...

Keyword(s):

Gene Expression ◽

Deep Learning ◽

Immune Cells ◽

Cell Types ◽

Regulatory Elements ◽

Chromatin Accessibility ◽

Regulatory Function ◽

Regulate Gene Expression ◽

Regulatory Functions ◽

Regulate Gene

Cis-Regulatory elements (cis-REs) include promoters, enhancers, and insulators that regulate gene expression programs via binding of transcription factors. ATAC-seq technology effectively identifies active cis-REs in a given cell type (including from single cells) by mapping accessible chromatin at base-pair resolution. However, these maps are not immediately useful for inferring specific functions of cis-REs. For this purpose, we developed a deep learning framework (CoRE-ATAC) with novel data encoders that integrate DNA sequence (reference or personal genotypes) with ATAC-seq cut sites and read pileups. CoRE-ATAC was trained on 4 cell types (n=6 samples/replicates) and accurately predicted known cis-RE functions from 7 cell types (n=40 samples) that were not used in model training (mean average precision=0.80). CoRE-ATAC enhancer predictions from 19 human islet samples coincided with genetically modulated gain/loss of enhancer activity, which was confirmed by massively parallel reporter assays (MPRAs). Finally, CoRE-ATAC effectively inferred cis-RE function from aggregate single nucleus ATAC-seq (snATAC) data from human blood-derived immune cells that overlapped with known functional annotations in sorted immune cells, which established the efficacy of these models to study cis-RE functions of rare cells without the need for cell sorting. ATAC-seq maps from primary human cells reveal individual- and cell-specific variation in cis-RE activity. CoRE-ATAC increases the functional resolution of these maps, a critical step for studying regulatory disruptions behind diseases.

Author Summary

Non-coding DNA sequences serve different functional roles to regulate gene expression. For these sequences to be active, they must be accessible for proteins and other factors to bind in order to carry out a specific regulatory function. Even so, mutations within these sequences or other regulatory events may modulate their activity or regulatory function. It is therefore critical that we identify these non-coding sequences and their specific regulatory function to fully understand how specific genes are regulated. Current sequencing technologies allow us to identify accessible sequences via chromatin accessibility maps from low cell numbers, enabling the study of clinical samples. However, determining the functional role associated with these sequences remains a challenge. Towards this goal, we harnessed the power of deep learning to unravel the intricacies of chromatin accessibility maps to infer their associated gene regulatory functions. We demonstrate that our method, CoRE-ATAC, can infer regulatory functions in diverse cell types, captures activity differences modulated by genetic mutations, and can be applied to accessibility maps of single cell clusters to infer regulatory functions of rare cell populations. These inferences will further our understanding of how genes are regulated and enable the study of these mechanisms as they relate to disease.


Spatial-ATAC-seq: spatially resolved chromatin accessibility profiling of tissues at genome scale and cellular level

10.1101/2021.06.06.447244 ◽

2021 ◽

Author(s):

Yanxiang Deng ◽

Marek Bartosovic ◽

Sai Ma ◽

Di Zhang ◽

Yang Liu ◽

...

Keyword(s):

Immune Cell ◽

System Development ◽

Local Environment ◽

Cell Types ◽

Chromatin Accessibility ◽

Cellular Level ◽

Human Tonsil ◽

Spatially Resolved ◽

Fate Decision ◽

Genome Scale

Cellular function in tissue is dependent upon the local environment, requiring new methods for spatial mapping of biomolecules and cells in the tissue context. The emergence of spatial transcriptomics has enabled genome-scale gene expression mapping, but capturing spatial epigenetic information of tissue at cellular level and genome scale has remained elusive. Here we report on spatial-ATAC-seq: spatially resolved chromatin accessibility profiling of tissue sections via next-generation sequencing by combining in situ Tn5 transposition chemistry and microfluidic deterministic barcoding. Spatial chromatin accessibility profiling of mouse embryos delineated tissue region-specific epigenetic landscapes and identified gene regulators implicated in central nervous system development. Mapping the accessible genome in human tonsil tissue with 20 μm pixel size revealed spatially distinct organization of immune cell types and states in lymphoid follicles and extrafollicular zones. This technology takes spatial biology to a new realm by enabling spatially resolved epigenomics to improve our understanding of cell identity, state, and fate decision in relation to epigenetic underpinnings in development and disease.


SeqEnhDL: sequence-based classification of cell type-specific enhancers using deep learning models

10.1101/2020.05.13.093997 ◽

2020 ◽

Author(s):

Yupeng Wang ◽

Rosario B. Jaime-Lara ◽

Abhrarup Roy ◽

Ying Sun ◽

Xinyue Liu ◽

...

Keyword(s):

Neural Network ◽

Deep Learning ◽

Dna Sequences ◽

Cell Types ◽

Learning Models ◽

Cell Type ◽

Coding Sequences ◽

Sequence Features ◽

Cell Type Specific ◽

Different Cell Types

We propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, sequential k-mer (k=5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences were used as features for deep learning models. Three deep learning models were implemented, including multi-layer perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers including gkm-SVM and DanQ, with regard to distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL is able to directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue-specificity can be accurately identified according to their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL.
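One plausible reading of "sequential k-mer fold changes" is sketched below: k-mer frequencies are estimated in an enhancer set and in a background set, their ratio gives a fold change per k-mer, and a sequence is then represented by the fold changes of its k-mers in order. This is not taken from the SeqEnhDL code; the pseudocount, the neutral default of 1.0 for unseen k-mers, the toy sequences, and all function names are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List


def kmer_frequencies(seqs: Iterable[str], k: int) -> Dict[str, float]:
    """Relative frequency of every overlapping k-mer across a set of sequences."""
    counts: Counter = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def fold_change_table(
    enhancers: Iterable[str],
    background: Iterable[str],
    k: int,
    pseudo: float = 1e-6,  # assumed pseudocount to avoid division by zero
) -> Dict[str, float]:
    """Fold change of each k-mer's frequency in enhancers relative to background."""
    fg = kmer_frequencies(enhancers, k)
    bg = kmer_frequencies(background, k)
    return {
        kmer: (fg.get(kmer, 0.0) + pseudo) / (bg.get(kmer, 0.0) + pseudo)
        for kmer in set(fg) | set(bg)
    }


def sequential_features(seq: str, table: Dict[str, float], k: int) -> List[float]:
    """Fold change of each k-mer in order along the sequence.

    K-mers absent from the table get 1.0 (no enrichment), an assumed default.
    """
    seq = seq.upper()
    return [table.get(seq[i:i + k], 1.0) for i in range(len(seq) - k + 1)]


# Toy data with k=2 for readability; the abstract uses k = 5, 7, 9 and 11.
table = fold_change_table(["GGAAGGAA"], ["ACGTACGT"], k=2)
features = sequential_features("GGAC", table, k=2)
print(features)  # one value each for the k-mers GG, GA, AC
```

Feature vectors of this kind keep positional order, which is what allows sequence models such as a CNN or RNN to be applied to them.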


Expression of the K-fgf proto-oncogene is controlled by 3' regulatory elements which are specific for embryonal carcinoma cells

Molecular and Cellular Biology ◽

10.1128/mcb.10.6.2475-2484.1990 ◽

1990 ◽

Vol 10 (6) ◽

pp. 2475-2484

Author(s):

A M Curatola ◽

C Basilico

Keyword(s):

Dna Sequences ◽

Embryonal Carcinoma ◽

Protein S ◽

Cell Types ◽

Regulatory Elements ◽

Developmentally Regulated ◽

Cis Acting ◽

Dna Elements ◽

Ec Cells ◽

Cat Expression

Expression of the K-fgf/hst proto-oncogene appears to be restricted to cells in the early stages of development, such as embryonal carcinoma (EC) cells. When EC cells are induced to differentiate, K-fgf expression is drastically repressed. To identify cis-acting DNA elements responsible for this type of regulation, we constructed a plasmid in which cat gene expression was driven by about 1 kilobase of upstream K-fgf human DNA sequences, including the putative promoter, and transfected it into undifferentiated F9 EC cells or HeLa cells as prototypes of cells which express or do not express, respectively, the K-fgf proto-oncogene. This plasmid was essentially inactive in both cell types, and the addition of more than 8 kilobases of DNA sequences upstream of the K-fgf promoter did not lead to any increase in chloramphenicol acetyltransferase (CAT) expression. On the other hand, when we inserted in this plasmid DNA sequences which are 3' of the human K-fgf coding sequences, we could detect a significant stimulation of CAT activity. Analysis of these sequences led to the identification of enhancerlike DNA elements which are part of the 3' noncoding region of K-fgf exon 3 and promote CAT expression only in undifferentiated mouse F9 or human NT2/D1 EC cells, but not in HeLa, 3T3, or differentiated F9 cells, therefore mimicking the physiological expression of the K-fgf proto-oncogene. Similar elements are also present in the 3' region of the murine K-fgf proto-oncogene, in a region showing high homology to the human K-fgf sequences. These regulatory elements can promote CAT expression from heterologous promoters in an EC-specific manner, suggesting that they interact with a specific cellular transacting protein(s) whose expression is developmentally regulated.


CoRE-ATAC: A deep learning model for the functional classification of regulatory elements from single cell and bulk ATAC-seq data

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009670 ◽

2021 ◽

Vol 17 (12) ◽

pp. e1009670

Author(s):

Asa Thibodeau ◽

Shubham Khetan ◽

Alper Eroglu ◽

Ryan Tewhey ◽

Michael L. Stitzel ◽

...

Keyword(s):

Deep Learning ◽

Immune Cells ◽

Single Cells ◽

Cell Types ◽

Regulatory Elements ◽

Enhancer Activity ◽

Specific Variation ◽

Single Nucleus ◽

Model Training ◽

Gain Loss

Cis-Regulatory elements (cis-REs) include promoters, enhancers, and insulators that regulate gene expression programs via binding of transcription factors. ATAC-seq technology effectively identifies active cis-REs in a given cell type (including from single cells) by mapping accessible chromatin at base-pair resolution. However, these maps are not immediately useful for inferring specific functions of cis-REs. For this purpose, we developed a deep learning framework (CoRE-ATAC) with novel data encoders that integrate DNA sequence (reference or personal genotypes) with ATAC-seq cut sites and read pileups. CoRE-ATAC was trained on 4 cell types (n = 6 samples/replicates) and accurately predicted known cis-RE functions from 7 cell types (n = 40 samples) that were not used in model training (mean average precision = 0.80, mean F1 score = 0.70). CoRE-ATAC enhancer predictions from 19 human islet samples coincided with genetically modulated gain/loss of enhancer activity, which was confirmed by massively parallel reporter assays (MPRAs). Finally, CoRE-ATAC effectively inferred cis-RE function from aggregate single nucleus ATAC-seq (snATAC) data from human blood-derived immune cells that overlapped with known functional annotations in sorted immune cells, which established the efficacy of these models to study cis-RE functions of rare cells without the need for cell sorting. ATAC-seq maps from primary human cells reveal individual- and cell-specific variation in cis-RE activity. CoRE-ATAC increases the functional resolution of these maps, a critical step for studying regulatory disruptions behind diseases.


SeqEnhDL: sequence-based classification of cell type-specific enhancers using deep learning models

10.21203/rs.3.rs-94396/v1 ◽

2020 ◽

Author(s):

Yupeng Wang ◽

Rosario Jaime-Lara ◽

Abhrarup Roy ◽

Ying Sun ◽

Xinyue Liu ◽

...

Keyword(s):

Neural Network ◽

Deep Learning ◽

Cell Types ◽

Regulatory Elements ◽

Learning Models ◽

Cell Type ◽

Coding Sequences ◽

Sequence Features ◽

A Genome ◽

Cell Type Specific

Objective: Computational identification of cell type-specific regulatory elements on a genome-wide scale is very challenging. Results: We propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, sequential k-mer (k=5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences were used as features for deep learning models. Three deep learning models were implemented, including multi-layer perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers including gkm-SVM and DanQ, with regard to distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL is able to directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue-specificity can be accurately identified according to their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL.
