A pitfall for machine learning methods aiming to predict across cell types

2019 ◽  
Author(s):  
Jacob Schreiber ◽  
Ritambhara Singh ◽  
Jeffrey Bilmes ◽  
William Stafford Noble

Abstract Machine learning models used to predict phenomena such as gene expression, enhancer activity, transcription factor binding, or chromatin conformation are most useful when they can generalize to make accurate predictions across cell types. In this situation, a natural strategy is to train the model on experimental data from some cell types and evaluate performance on one or more held-out cell types. In this work, we show that when the training set contains examples derived from the same genomic loci across multiple cell types, the resulting model can be susceptible to a particular form of bias related to memorizing the average activity associated with each genomic locus. Consequently, the trained model may appear to perform well when evaluated on the genomic loci that it was trained on but tends to perform poorly on loci that it was not trained on. We demonstrate this phenomenon by using epigenomic measurements and nucleotide sequence to predict gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data and computing resources become available, future projects will increasingly risk suffering from this issue.
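One way to picture the pitfall described in this abstract is to score a trivial baseline that predicts, for each locus, the average activity seen in the training cell types. The sketch below is not the authors' code: the synthetic data, the noise level, and the function names `average_activity_baseline` and `mean_squared_error` are assumptions made for illustration. It only shows that such a baseline can look accurate on a held-out cell type when the same loci appear in training and test.

```python
import numpy as np


def average_activity_baseline(train_activity):
    """Return the per-locus mean activity across training cell types.

    train_activity has shape (n_train_cell_types, n_loci).
    """
    return np.asarray(train_activity, dtype=float).mean(axis=0)


def mean_squared_error(y_true, y_pred):
    """Plain mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Synthetic data (an assumption): each locus has its own mean activity,
# and each cell type adds a small amount of independent noise.
rng = np.random.default_rng(0)
n_cell_types, n_loci = 5, 1000
locus_mean = rng.normal(size=n_loci)
activity = locus_mean + 0.3 * rng.normal(size=(n_cell_types, n_loci))

# Hold out the last cell type; the loci themselves are shared with training.
train, held_out = activity[:-1], activity[-1]

# Baseline 1: per-locus average over the training cell types.
per_locus_prediction = average_activity_baseline(train)
per_locus_mse = mean_squared_error(held_out, per_locus_prediction)

# Baseline 2: one global average that ignores locus identity.
global_prediction = np.full(n_loci, train.mean())
global_mse = mean_squared_error(held_out, global_prediction)

print(f"per-locus average baseline MSE: {per_locus_mse:.3f}")
print(f"global average baseline MSE:    {global_mse:.3f}")
```

Under these assumptions, a model evaluated on a held-out cell type would need to be compared against the per-locus average baseline, not only against a global one, before its cross-cell-type accuracy is taken at face value.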

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Jacob Schreiber ◽  
Ritambhara Singh ◽  
Jeffrey Bilmes ◽  
William Stafford Noble

Abstract Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.


1992 ◽  
Vol 12 (3) ◽  
pp. 1202-1208
Author(s):  
R A Graves ◽  
P Tontonoz ◽  
B M Spiegelman

The molecular basis of adipocyte-specific gene expression is not well understood. We have previously identified a 518-bp enhancer from the adipocyte P2 gene that stimulates adipose-specific gene expression in both cultured cells and transgenic mice. In this analysis of the enhancer, we have defined and characterized a 122-bp DNA fragment that directs differentiation-dependent gene expression in cultured preadipocytes and adipocytes. Several cis-acting elements have been identified and shown by mutational analysis to be important for full enhancer activity. One pair of sequences, ARE2 and ARE4, binds a nuclear factor (ARF2) present in extracts derived from many cell types. Multiple copies of these elements stimulate gene expression from a minimal promoter in preadipocytes, adipocytes, and several other cultured cell lines. A second pair of elements, ARE6 and ARE7, binds a separate factor (ARF6) that is detected only in nuclear extracts derived from adipocytes. The ability of multimers of ARE6 or ARE7 to stimulate promoter activity is strictly adipocyte specific. Mutations in the ARE6 sequence greatly reduce the activity of the 518-bp enhancer. These data demonstrate that several cis- and trans-acting components contribute to the activity of the adipocyte P2 enhancer and suggest that ARF6, a novel differentiation-dependent factor, may be a key regulator of adipogenic gene expression.


2020 ◽  
Author(s):  
Yi-An Tung ◽  
Wen-Tse Yang ◽  
Tsung-Ting Hsieh ◽  
Yu-Chuan Chang ◽  
June-Tai Wu ◽  
...  

Abstract Enhancers are one class of the regulatory elements that have been shown to act as key components to assist promoters in modulating the gene expression in living cells. At present, the number of enhancers as well as their activities in different cell types are still largely unclear. Previous studies have shown that enhancer activities are associated with various functional data, such as histone modifications, sequence motifs, and chromatin accessibilities. In this study, we utilized DNase data to build a deep learning model for predicting the H3K27ac peaks as the active enhancers in a target cell type. We propose joint training of multiple cell types to boost the model performance in predicting the enhancer activities of an unstudied cell type. The results demonstrated that by incorporating more datasets across different cell types, the complex regulatory patterns could be captured by deep learning models and the prediction accuracy can be largely improved. The analyses conducted in this study demonstrated that the cell type-specific enhancer activity can be predicted by joint learning of multiple cell type data using only DNase data and the primitive sequences as the input features. This reveals the importance of cross-cell type learning, and the constructed model can be applied to investigate potential active enhancers of a novel cell type which does not have the H3K27ac modification data yet. Availability: The accuEnhancer package can be freely accessed at: https://github.com/callsobing/accuEnhancer


2009 ◽  
Vol 191 (23) ◽  
pp. 7243-7252 ◽  
Author(s):  
M. Carolina Pilonieta ◽  
Kimberly D. Erickson ◽  
Robert K. Ernst ◽  
Corrella S. Detweiler

ABSTRACT Antimicrobial peptides (AMPs) kill or prevent the growth of microbes. AMPs are made by virtually all single and multicellular organisms and are encountered by bacteria in diverse environments, including within a host. Bacteria use sensor-kinase systems to respond to AMPs or damage caused by AMPs. Salmonella enterica deploys at least three different sensor-kinase systems to modify gene expression in the presence of AMPs: PhoP-PhoQ, PmrA-PmrB, and RcsB-RcsC-RcsD. The ydeI gene is regulated by the RcsB-RcsC-RcsD pathway and encodes a 14-kDa predicted oligosaccharide/oligonucleotide binding-fold (OB-fold) protein important for polymyxin B resistance in broth and also for virulence in mice. We report here that ydeI is additionally regulated by the PhoP-PhoQ and PmrA-PmrB sensor-kinase systems, which confer resistance to cationic AMPs by modifying lipopolysaccharide (LPS). ydeI, however, is not important for known LPS modifications. Two independent biochemical methods found that YdeI copurifies with OmpD/NmpC, a member of the trimeric β-barrel outer membrane general porin family. Genetic analysis indicates that ompD contributes to polymyxin B resistance, and both ydeI and ompD are important for resistance to cathelicidin antimicrobial peptide, a mouse AMP produced by multiple cell types and expressed in the gut. YdeI localizes to the periplasm, where it could interact with OmpD. A second predicted periplasmic OB-fold protein, YgiW, and OmpF, another general porin, also contribute to polymyxin B resistance. Collectively, the data suggest that periplasmic OB-fold proteins can interact with porins to increase bacterial resistance to AMPs.


2019 ◽  
Vol 28 (19) ◽  
pp. 3293-3300 ◽  
Author(s):  
Lucy M McGowan ◽  
George Davey Smith ◽  
Tom R Gaunt ◽  
Tom G Richardson

Abstract Immune-mediated diseases (IMDs) arise when tolerance is lost and chronic inflammation is targeted towards healthy tissues. Despite their growing prevalence, therapies to treat IMDs are lacking. Cytokines and their receptors orchestrate inflammatory responses by regulating elaborate signalling networks across multiple cell types making it challenging to pinpoint therapeutically relevant drivers of IMDs. We developed an analytical framework that integrates Mendelian randomization (MR) and multiple-trait colocalization (moloc) analyses to highlight putative cell-specific drivers of IMDs. MR evaluated causal associations between the levels of 10 circulating cytokines and 9 IMDs within human populations. Subsequently, we undertook moloc analyses to assess whether IMD trait, cytokine protein and corresponding gene expression are driven by a shared causal variant. Moreover, we leveraged gene expression data from three separate cell types (monocytes, neutrophils and T cells) to discern whether associations may be attributed to cell type-specific drivers of disease. MR analyses supported a causal role for IL-18 in inflammatory bowel disease (IBD) (P = 1.17 × 10⁻⁴) and eczema/dermatitis (P = 2.81 × 10⁻³), as well as associations between IL-2rα and IL-6R with several other IMDs. Moloc strengthened evidence of a causal association for these results, as well as providing evidence of a monocyte and neutrophil-driven role for IL-18 in IBD pathogenesis. In contrast, IL-2rα and IL-6R associations were found to be T cell specific. Our analytical pipeline can help to elucidate putative molecular pathways in the pathogeneses of IMDs, which could be applied to other disease contexts.


2019 ◽  
Author(s):  
Esther Liu ◽  
Behram Radmanesh ◽  
Byungha H. Chung ◽  
Michael D. Donnan ◽  
Dan Yi ◽  
...  

ABSTRACT Background: DNA variants in APOL1 associate with kidney disease, but the pathophysiological mechanisms remain incompletely understood. Model organisms lack the APOL1 gene, limiting the degree to which disease states can be recapitulated. Here we present single-cell RNA sequencing (scRNA-seq) of genome-edited human kidney organoids as a platform for profiling effects of APOL1 risk variants in diverse nephron cell types. Methods: We performed footprint-free CRISPR-Cas9 genome editing of human induced pluripotent stem cells (iPSCs) to knock in APOL1 high-risk G1 variants at the native genomic locus. iPSCs were differentiated into kidney organoids, treated with vehicle, IFN-γ, or the combination of IFN-γ and tunicamycin, and analyzed with scRNA-seq to profile cell-specific changes in differential gene expression patterns, compared to isogenic G0 controls. Results: Both G0 and G1 iPSCs differentiated into kidney organoids containing nephron-like structures with glomerular epithelial cells, proximal tubules, distal tubules, and endothelial cells. Organoids expressed detectable APOL1 only after exposure to IFN-γ. scRNA-seq revealed cell type-specific differences in G1 organoid response to APOL1 induction. Additional stress of tunicamycin exposure led to increased glomerular epithelial cell dedifferentiation in G1 organoids. Conclusions: Single-cell transcriptomic profiling of human genome-edited kidney organoids expressing APOL1 risk variants provides a novel platform for studying the pathophysiology of APOL1-mediated kidney disease.
SIGNIFICANCE STATEMENT Gaps persist in our mechanistic understanding of APOL1-mediated kidney disease. The authors apply genome-edited human kidney organoids, combined with single-cell transcriptomics, to profile APOL1 risk variants at the native genomic locus in different cell types. This approach captures interferon-mediated induction of APOL1 gene expression and reveals cellular dedifferentiation after a secondary insult of endoplasmic reticulum stress. This system provides a human cellular platform to interrogate complex mechanisms and human-specific regulators underlying APOL1-mediated kidney disease.


2020 ◽  
Author(s):  
William C.W. Chen ◽  
Leonid Gaidukov ◽  
Ming-Ru Wu ◽  
Jicong Cao ◽  
Gigi C.G. Choi ◽  
...  

Precise, scalable, and sustainable control of genetic and cellular activities in mammalian cells is key to developing precision therapeutics and smart biomanufacturing. We created a highly tunable, modular, versatile CRISPR-based synthetic transcription system for the programmable control of gene expression and cellular phenotypes in mammalian cells. Genetic circuits consisting of well-characterized libraries of guide RNAs, binding motifs of synthetic operators, transcriptional activators, and additional genetic regulatory elements expressed mammalian genes in a highly predictable and tunable manner. We demonstrated the programmable control of reporter genes episomally and chromosomally, with up to 25-fold more EF1α promoter activity, in multiple cell types. We used these circuits to program secretion of human monoclonal antibodies and to control T cell effector function marked by interferon-γ production. Antibody titers and interferon-γ concentrations were significantly correlated with synthetic promoter strengths, providing a platform for programming gene expression and cellular function for biological, biomanufacturing, and biomedical applications.


2020 ◽  
Author(s):  
Bharat Panwar ◽  
Benjamin J. Schmiedel ◽  
Shu Liang ◽  
Brandie White ◽  
Enrique Rodriguez ◽  
...  

ABSTRACT Systemic lupus erythematosus (SLE) is an incurable autoimmune disease that disproportionately affects women and may lead to damage in multiple organs. The marked heterogeneity in its clinical manifestations is a major obstacle in finding targeted treatments, and the involvement of multiple immune cell types further increases this complexity. Thus, identifying molecular subtypes that best correlate with disease heterogeneity and severity, as well as deducing molecular cross-talk among major immune cell types that leads to disease progression, are critical steps in the development of more informed therapies for SLE. Here we profile and analyze gene expression of six major circulating immune cell types from patients with well-characterized SLE (classical monocytes (n=64), T cells (n=24), neutrophils (n=24), B cells (n=20), conventional (n=20) and plasmacytoid (n=22) dendritic cells) and from healthy control subjects. Our results show that the interferon (IFN) response signature was the major molecular feature that classified SLE patients into two distinct groups: IFN-signature negative (IFNneg) and positive (IFNpos). We show that the gene expression signature of IFN response was consistent across (i) all immune cell types, (ii) all single cells profiled from three IFNpos donors using single-cell RNA-seq, and (iii) longitudinal samples of the same patient. For a better understanding of molecular differences of IFNpos versus IFNneg patients, we combined differential gene expression analysis with differential Weighted Gene Co-expression Network Analysis (WGCNA), which revealed a relatively small list of genes from classical monocytes including two known immune modulators, one the target of an approved therapeutic for SLE (TNFSF13B/BAFF: belimumab) and one itself a therapeutic for Rheumatoid Arthritis (IL1RN: anakinra). 
For a more integrative understanding of the cross-talk among different cell types and to identify potentially novel gene or pathway connections, we also developed a novel gene co-expression analysis method for joint analysis of multiple cell types named integrated WGCNA (iWGCNA). This method revealed an interesting cross-talk between T and B cells highlighted by a significant enrichment in the expression of known markers of T follicular helper cells (Tfh), which also correlate with disease severity in the context of IFNpos patients. Interestingly, higher expression of BAFF from all myeloid cells also shows a strong correlation with enrichment in the expression of genes in T cells that may mark circulating Tfh cells or related memory cell populations. These cell types have been shown to promote B cell class-switching and antibody production, which are well-characterized in SLE patients. In summary, we generated a large-scale gene expression dataset from sorted immune cell populations and present a novel computational approach to analyze such data in an integrative fashion in the context of an autoimmune disease. Our results reveal the power of a hypothesis-free and data-driven approach to discover drug targets and reveal novel cross-talk among multiple immune cell types specific to a subset of SLE patients. This approach is immediately useful for studying autoimmune diseases and is applicable in other contexts where gene expression profiling is possible from multiple cell types within the same tissue compartment.

