Application of topic models to a compendium of ChIP-Seq datasets uncovers recurrent transcriptional regulatory modules

2020 ◽  
Vol 36 (8) ◽  
pp. 2352-2358
Author(s):  
Guodong Yang ◽  
Aiqun Ma ◽  
Zhaohui S Qin ◽  
Li Chen

Abstract Motivation The availability of thousands of genome-wide chromatin immunoprecipitation sequencing (ChIP-Seq) datasets across hundreds of transcription factors (TFs) and cell lines provides an unprecedented opportunity to jointly analyze large-scale TF binding in vivo, making it possible to discover potential interaction and cooperation among different TFs. Interacting and cooperating TFs can form a transcriptional regulatory module (TRM) (e.g. co-binding TFs), which helps decipher combinatorial regulatory mechanisms. Results We develop a computational method, tfLDA, that applies state-of-the-art topic models to multiple ChIP-Seq datasets to decipher the combinatorial binding events of multiple TFs. tfLDA learns high-order combinatorial binding patterns of TFs from multiple ChIP-Seq profiles, and interprets and visualizes these patterns. We apply tfLDA to two cell lines with a rich collection of TFs and identify combinatorial binding patterns that show well-known TRMs and related TF co-binding events. Availability and implementation The R package tfLDA is freely available at https://github.com/lichen-lab/tfLDA. Supplementary information Supplementary data are available at Bioinformatics online.
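One way to picture the topic-model framing is to treat each genomic region as a "document" and each TF with a peak in that region as a "word". The Python sketch below builds such a region-by-TF count table from toy peak lists. The fixed-width binning, the `bin_size` value, the tuple format and all function names are illustrative assumptions and are not taken from the tfLDA package.

```python
from collections import defaultdict

def region_tf_counts(peaks, bin_size=1000):
    """Count, per fixed-width genomic bin, how many peaks each TF has.

    peaks: iterable of (tf_name, chrom, start, end) tuples
           (a hypothetical toy format, not the tfLDA input format).
    Returns {(chrom, bin_index): {tf_name: count}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for tf, chrom, start, end in peaks:
        # Assign each peak to the bin that contains its midpoint.
        midpoint = (start + end) // 2
        counts[(chrom, midpoint // bin_size)][tf] += 1
    return {region: dict(tfs) for region, tfs in counts.items()}

# Toy peaks: two co-bound regions on chr1 and a single peak on chr2.
toy_peaks = [
    ("CTCF", "chr1", 100, 300),
    ("RAD21", "chr1", 150, 350),
    ("MYC", "chr1", 5200, 5400),
    ("MAX", "chr1", 5250, 5450),
    ("CTCF", "chr2", 900, 1000),
]
docs = region_tf_counts(toy_peaks)
```

Under this framing, each entry of `docs` could be handed to a standard LDA implementation as one bag of words, and the fitted topics would be read as candidate groups of co-binding TFs.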

Author(s):  
Yang Lin ◽  
Xiaoyong Pan ◽  
Hong-Bin Shen

Abstract Motivation Long non-coding RNAs (lncRNAs) are generally expressed in a tissue-specific way, and subcellular localizations of lncRNAs depend on the tissues or cell lines in which they are expressed. Previous computational methods for predicting subcellular localizations of lncRNAs do not take this characteristic into account; they train a unified machine learning model for pooled lncRNAs from all available cell lines. It is important to develop a cell-line-specific computational method to predict lncRNA locations in different cell lines. Results In this study, we present an updated cell-line-specific predictor, lncLocator 2.0, which trains an end-to-end deep model per cell line for predicting lncRNA subcellular localization from sequences. We first construct benchmark datasets of lncRNA subcellular localizations for 15 cell lines. Then we learn word embeddings using natural language models, and these learned embeddings are fed into a convolutional neural network, a long short-term memory network and a multilayer perceptron to classify subcellular localizations. lncLocator 2.0 achieves varying effectiveness for different cell lines and demonstrates the necessity of training cell-line-specific models. Furthermore, we adopt Integrated Gradients to explain the proposed model in lncLocator 2.0 and find some potential patterns that determine the subcellular localizations of lncRNAs, suggesting that the subcellular localization of lncRNAs is linked to some specific nucleotides. Availability lncLocator 2.0 is available at www.csbio.sjtu.edu.cn/bioinf/lncLocator2 and the source code can be found at https://github.com/Yang-J-LIN/lncLocator2. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 36 (8) ◽  
pp. 2466-2473 ◽  
Author(s):  
Jiao Sun ◽  
Jae-Woong Chang ◽  
Teng Zhang ◽  
Jeongsik Yong ◽  
Rui Kuang ◽  
...  

Abstract Motivation Accurate estimation of transcript isoform abundance is critical for downstream transcriptome analyses and can lead to precise molecular mechanisms for understanding complex human diseases, like cancer. Simplex mRNA Sequencing (RNA-Seq) based isoform quantification approaches are facing the challenges of inherent sampling bias and unidentifiable read origins. A large-scale experiment shows that the consistency between RNA-Seq and other mRNA quantification platforms is relatively low at the isoform level compared to the gene level. In this project, we developed a platform-integrated model for transcript quantification (IntMTQ) to improve the performance of RNA-Seq on isoform expression estimation. IntMTQ, which benefits from the mRNA expressions reported by the other platforms, provides more precise RNA-Seq-based isoform quantification and leads to more accurate molecular signatures for disease phenotype prediction. Results In the experiments to assess the quality of isoform expression estimated by IntMTQ, we designed three tasks for clustering and classification of 46 cancer cell lines with four different mRNA quantification platforms, including newly developed NanoString’s nCounter technology. The results demonstrate that the isoform expressions learned by IntMTQ consistently provide more and better molecular features for downstream analyses compared with five baseline algorithms which consider RNA-Seq data only. An independent RT-qPCR experiment on seven genes in twelve cancer cell lines showed that the IntMTQ improved overall transcript quantification. The platform-integrated algorithms could be applied to large-scale cancer studies, such as The Cancer Genome Atlas (TCGA), with both RNA-Seq and array-based platforms available. Availability and implementation Source code is available at: https://github.com/CompbioLabUcf/IntMTQ. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Zachary B Abrams ◽  
Dwayne G Tally ◽  
Lynne V Abruzzo ◽  
Kevin R Coombes

Abstract Summary Cytogenetics data, or karyotypes, are among the most common clinically used forms of genetic data. Karyotypes are stored as standardized text strings using the International System for Human Cytogenomic Nomenclature (ISCN). Historically, these data have not been used in large-scale computational analyses due to limitations in the ISCN text format and structure. Recently developed computational tools such as CytoGPS have enabled large-scale computational analyses of karyotypes. To further enable such analyses, we have now developed RCytoGPS, an R package that takes JSON files generated from CytoGPS.org and converts them into objects in R. This conversion facilitates the analysis and visualizations of karyotype data. In effect this tool streamlines the process of performing large-scale karyotype analyses, thus advancing the field of computational cytogenetic pathology. Availability and Implementation Freely available at https://CRAN.R-project.org/package=RCytoGPS. The code for the underlying CytoGPS software can be found at https://github.com/i2-wustl/CytoGPS. Supplementary information There is no supplementary data.


Blood ◽  
2021 ◽  
Vol 138 (Supplement 1) ◽  
pp. 266-266
Author(s):  
Shan Lin ◽  
Clement Larrue ◽  
Nastassja K. Scheidegger ◽  
Bo Kyung A. Seong ◽  
Neekesh V Dharia ◽  
...  

Abstract First-generation, large-scale functional genomic screens have revealed hundreds of potential genetic vulnerabilities in acute myeloid leukemia (AML), a devastating hematologic malignancy with poor overall survival. Because these large-scale genetic screens were primarily performed in vitro in established AML cell lines, their translational relevance has been debated. Therefore, we established a protocol for CRISPR screening in orthotopic xenograft models of human AML, including patient-derived-xenograft (PDX) models that are tractable for CRISPR-editing. We first defined experimental conditions necessary for an optimal in vivo screen via barcoding experiments. We determined that sub-lethal irradiation was necessary for improved barcode representation in bone marrow and to reduce mouse-to-mouse variation. Moreover, it was critical to combine samples from multiple mice to achieve complete in vivo library representation. Next, using the Broad DepMap and other publicly available functional genomic screen data, we identified 200 genes that were stronger dependencies in AML cell lines compared to all other cancer types screened. Using this list, we created a secondary library and performed parallel in vivo and in vitro screens using the MV4-11 and U937 cell lines and a PDX model. In vitro and in vivo hits were surprisingly well correlated, although a modest number of targets did not score well in vivo. Notably, dependencies identified across AML cell line models were strongly recapitulated in the PDX model, validating the application of AML cell lines for dependency discovery. Our in vivo screens nominated the mitochondria-localized RING-type ubiquitin E3 ligase MARCH5 as a potential therapeutic target in AML. Using CRISPR, we first validated this in vitro dependency on MARCH5 and determined that MARCH5 is a critical guardian to prevent apoptosis in AML. MARCH5 depletion activates the canonical mitochondrial apoptosis pathway in a BAX/BAK1-dependent manner. 
Multiple genome-wide screens revealed that a dependency on MARCH5 is strongly correlated with a dependency on MCL1, but not other anti-apoptotic BCL2-family members, across the AML cell lines in the screen. As observed with MCL1 inhibition, MARCH5 depletion sensitized AML cells to venetoclax, a BCL2-specific inhibitor FDA-approved in combination with a hypomethylating agent for the treatment of older adults with AML. Importantly, MARCH5 depletion diminished the venetoclax resistance induced by MCL1 overexpression but not that caused by BCLXL overexpression. Altogether, these results suggest that MARCH5 is required for maintaining MCL1 activity specifically. Since there are no small molecule inhibitors directed against MARCH5, we deployed a dTAG system as an approximation of pharmacological inhibition. This approach uses a hetero-bifunctional small molecule that binds the FKBP12 F36V-fused MARCH5 and the E3 ligase VHL, leading to the ubiquitination and proteasome-mediated degradation of the fusion protein. dTAG-MARCH5 cells were established via deleting endogenous MARCH5 by CRISPR and expressing exogenous FKBP-tagged MARCH5 protein. MARCH5 degradation with the dTAG molecule dTAG V-1 markedly impaired cell growth in vitro. Additionally, we demonstrated the utility of dTAG system in vivo using a PDX model. The combination treatment of dTAG V-1 and venetoclax elicited a much stronger anti-leukemic effect compared to the treatment with only venetoclax or dTAG V-1, further highlighting MARCH5 as a promising synergistic target for enhancing the efficacy of venetoclax in AML. Taken together, our in vivo screening approach, coupled with CRISPR-competent PDX models and dTAG-directed protein degradation, constitute a useful platform for prioritizing AML targets emerging from in vitro screens to serve as the starting point for therapy development. Disclosures Dharia: Genentech: Current Employment. Piccioni: Merck Research Laboratories: Current Employment. 
Stegmaier: Bristol Myers Squibb: Consultancy; KronosBio: Consultancy; AstraZeneca: Consultancy; Auron Therapeutics: Consultancy, Current equity holder in publicly-traded company; Novartis: Research Funding.


2019 ◽  
Author(s):  
L Cao ◽  
C Clish ◽  
FB Hu ◽  
MA Martínez-González ◽  
C Razquin ◽  
...  

Abstract Motivation Large-scale untargeted metabolomics experiments lead to detection of thousands of novel metabolic features as well as false positive artifacts. With the incorporation of pooled QC samples and corresponding bioinformatics algorithms, those measurement artifacts can be well quality controlled. However, it is impracticable for all studies to apply such an experimental design. Results We introduce a post-alignment quality control method called genuMet, which is based solely on the injection order of biological samples to identify potential false metabolic features. In terms of the missing pattern of metabolic signals, genuMet can reach over 95% true negative rate and 85% true positive rate with suitable parameters, compared with the algorithm utilizing pooled QC samples. genuMet makes it possible for studies without pooled QC samples to reduce false metabolic signals and perform robust statistical analysis. Availability and implementation genuMet is implemented in an R package and available at https://github.com/liucaomics/genuMet under the GPL-v2 license. Contact Liming Liang: [email protected] Supplementary information Supplementary data are available at ….


Blood ◽  
2019 ◽  
Vol 134 (Supplement_1) ◽  
pp. 280-280
Author(s):  
Maxim Pimkin ◽  
Juliana Xavier Ferrucio ◽  
Neekesh Vijay Dharia ◽  
Taku Harada ◽  
Andrew Kossenkov ◽  
...  

Core transcriptional regulatory circuitries (CRCs) are tightly integrated networks of master transcription factors (TFs) that establish and maintain lineage-specific programs of gene expression. We hypothesized that divergent CRCs establish distinct subtypes of acute myeloid leukemia (AML). CRCs are defined as sets of master TFs that are marked by superenhancers (SEs) and bind to their own genes and those of the other core TFs, forming feed-forward auto-regulatory loops. We have performed large-scale H3K27ac ChIP-seq experiments to map the SE landscape in 20 AML cell lines, 3 normal hematopoietic tissues and 50 patient-derived xenograft (PDX) models of human AML. These experiments highlighted a core set of SE-marked, highly expressed TF genes shared by all the examined AML subtypes, corresponding to a putative pan-AML CRC. Importantly, a significant majority (>70%) of these transcription factors correspond to AML-specific genetic dependencies in the Project Achilles genome-scale RNAi and CRISPR-Cas9 screening efforts undertaken by colleagues at the Broad Institute, confirming the specific reliance of AML on these TFs for survival. We reasoned that CRCs can be predicted by integrating the epigenomic and functional dependency datasets. Indeed, intersecting SE-marked TF genes with preferential AML dependencies resulted in 32 candidate TFs. We have validated these TFs as AML dependencies in a low-throughput system with lentiviral delivery of Cas9 and specific gRNAs. ChIP-seq experiments with antibodies against 23 of these candidates (GATA2, PU1, IRF8, GSE1, GFI1, MEIS1, LYL1, CEBPA, MEF2D, MEF2C, IKZF1, ZEB2, FLI1, ETV6, ELF2, MAX, RUNX1, MYB, IRF2BP2, LMO2, SP1, ZMYND8, MYC) resulted in a nearly 100% validation rate. They demonstrated that CRC TFs tend to co-occupy DNA and bind their own and each other's promoters and SEs, suggesting that CRC members function in higher-order chromatin complexes and establish reciprocal feed-forward regulatory loops. 
Analysis of TF co-binding revealed 237,636 unique binding sites, with most occupied by at least two CRC TFs. Specific combinatorial patterns of TF binding appear to be associated with promoters, enhancers and super-enhancers. Importantly, in addition to the pan-AML CRCs, highly specific dependencies restricted only to a subset of AML cell lines can be accurately predicted from examination of divergent (subtype-specific) AML CRCs, lending support to our hypothesis that context-specific vulnerabilities can be robustly inferred from a systematic study of TF circuits. Specifically, we identified MEF2D and IRF8 as TFs that are selectively marked by SEs in a subset of AML cell lines and PDXs, most notably in samples carrying an MLL rearrangement. At the same time, these genes are strong preferential dependencies in a subset of AML cell lines, most of which also carry an MLL fusion. This suggests specific roles of IRF8 and MEF2D in MLL-induced leukemogenesis. Interestingly, functional dependency scores for these two TFs show an extremely high degree of correlation, indicating tightly integrated functions. While IRF8 is a known regulator of macrophage/dendritic cell function, MEF2D has no recognized roles in hematopoiesis. We have validated MEF2D as a dependency in a low-throughput CRISPR-Cas9 drop out experiment in an MLL-rearranged cell line. At the same time, we observed no functional effect of MEF2D knock out in a human CD34+ cell colony forming assay. This confirms context-specific transcriptional addiction to MEF2D induced by an MLL fusion and suggests a potential "Achilles heel" for leukemia-specific therapy with little or no detrimental effects on normal hematopoiesis. In summary, our data allow us to draw the following conclusions: 1) Intersection of lineage-restricted gene dependencies with SE profiling permits highly specific discovery of CRCs. 2) Transcriptional control in AML is orchestrated by a large CRC of >30 essential TFs. 
3) Divergent CRCs are diagnostic of cancer- and context-specific transcriptional addiction. 4) AML CRC is a highly integrated network of co-binding TFs that orchestrate both promoter- and enhancer-centric regulation. Disclosures Lin: Syros Pharmaceuticals: Equity Ownership, Patents & Royalties. Stegmaier: Novartis: Research Funding; Rigel Pharmaceuticals: Consultancy. Orkin: Syros: Consultancy; Novartis: Consultancy.


2019 ◽  
Author(s):  
Zachary B. Abrams ◽  
Caitlin E. Coombes ◽  
Suli Li ◽  
Kevin R. Coombes

Abstract Summary Unsupervised data analysis in many scientific disciplines is based on calculating distances between observations and finding ways to visualize those distances. These kinds of unsupervised analyses help researchers uncover patterns in large-scale data sets. However, researchers can select from a vast number of different distance metrics, each designed to highlight different aspects of different data types. There are also numerous visualization methods with their own strengths and weaknesses. To help researchers perform unsupervised analyses, we developed the Mercator R package. Mercator enables users to see important patterns in their data by generating multiple visualizations using different standard algorithms, making it particularly easy to compare and contrast the results arising from different metrics. By allowing users to select the distance metric that best fits their needs, Mercator helps researchers perform unsupervised analyses that use pattern identification through computation and visual inspection. Availability and Implementation Mercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html). Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
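To illustrate why the choice of distance metric matters, the Python sketch below computes pairwise distance matrices for the same toy binary data under three common metrics. It is a minimal illustration only: the function names and the toy data are assumptions, and it does not reproduce the Mercator R package or its API.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard(a, b):
    """Jaccard distance for binary vectors: 1 - |intersection| / |union|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either

def distance_matrix(rows, metric):
    """Full pairwise distance matrix for a list of vectors."""
    return [[metric(r, s) for s in rows] for r in rows]

# Hypothetical binary feature vectors for three samples.
samples = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 0, 1]]
metrics = [("euclidean", euclidean), ("manhattan", manhattan), ("jaccard", jaccard)]
matrices = {name: distance_matrix(samples, fn) for name, fn in metrics}
```

Comparing the entries of `matrices` side by side shows how the same pair of samples can look close under one metric and far apart under another, which is the kind of comparison the abstract describes.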


Author(s):  
Grégoire Versmée ◽  
Laura Versmée ◽  
Mikaël Dusenne ◽  
Niloofar Jalali ◽  
Paul Avillach

Abstract Summary Based on the Genomic Data Sharing Policy issued in August 2007, the National Institutes of Health (NIH) has supported several repositories such as the database of Genotypes and Phenotypes (dbGaP). dbGaP is an online repository that provides access to large-scale genetic and phenotypic datasets with more than 1,000 studies. However, navigating the website and understanding the relationship between the studies are not easy tasks. Moreover, the decryption of the files is a complex procedure. In this study we propose the dbgap2x R package that covers a broad range of functions for searching dbGaP studies, exploring the characteristics of a study and easily decrypting the files from dbGaP. Availability and implementation dbgap2x is an R package with the code available at https://github.com/gversmee/dbgap2x. A containerized version including the package, a Jupyter server and with a Notebook example is available at https://hub.docker.com/r/gversmee/dbgap2x. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 35 (11) ◽  
pp. 1901-1906 ◽  
Author(s):  
Mary D Fortune ◽  
Chris Wallace

Abstract Motivation Methods for analysis of GWAS summary statistics have encouraged data sharing and democratized the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some ‘truth’ is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study. Results We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method produces very similar output to that from simulating individual genotypes, with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis. Availability and implementation Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS. Supplementary information Supplementary data are available at Bioinformatics online.
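The idea of simulating random variates about expected values can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the simGWAS implementation: it assumes the SNPs are independent (no linkage disequilibrium between them), that each Z score has unit variance, and that the expected Z scores are already known. The `expected_z` values and function names are hypothetical.

```python
import random

def simulate_z_scores(expected_z, n_reps, seed=0):
    """Draw n_reps vectors of Z scores, each Z_i ~ Normal(expected_z[i], 1).

    Simplifying assumption: SNPs are treated as independent, so no
    correlation between Z scores is modelled.
    """
    rng = random.Random(seed)
    return [[rng.gauss(mu, 1.0) for mu in expected_z] for _ in range(n_reps)]

# Hypothetical expected Z scores: the third SNP is causal, the rest are null.
expected_z = [0.0, 0.0, 4.5, 0.0]
sims = simulate_z_scores(expected_z, n_reps=2000)

# Average simulated Z score per SNP across replicates.
mean_z = [sum(rep[i] for rep in sims) / len(sims) for i in range(len(expected_z))]
```

The cost of each replicate depends only on the number of SNPs, not on the number of individuals in the study, which is consistent with the abstract's point that simulation can be conducted independently of sample size.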


2020 ◽  
Vol 36 (9) ◽  
pp. 2862-2871
Author(s):  
Chiung-Ting Wu ◽  
Yizhi Wang ◽  
Yinxue Wang ◽  
Timothy Ebbels ◽  
Ibrahim Karaman ◽  
...  

Abstract Motivation Liquid chromatography–mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention times (RT) of the same compound in different samples, and these must be subsequently corrected (aligned) during data processing. Classic alignment methods such as in the popular XCMS package often assume a single time-warping function for each sample. Thus, the potentially varying RT drift for compounds with different masses in a sample is neglected in these methods. Moreover, the systematic change in RT drift across run order is often not considered by alignment algorithms. Therefore, these methods cannot effectively correct all misalignments. For a large-scale experiment involving many samples, the existence of misalignment becomes inevitable and concerning. Results Here, we describe an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges on warping functions of neighboring samples. Validated with both realistic synthetic data and internal quality control samples, ncGTW applied to two large-scale metabolomics LC-MS datasets identifies many misaligned features and successfully realigns them. These features would otherwise be discarded or uncorrected using existing methods. The ncGTW software tool is developed currently as a plug-in to detect and realign misaligned features present in standard XCMS output. Availability and implementation An R package of ncGTW is freely available at Bioconductor and https://github.com/ChiungTingWu/ncGTW. A detailed user’s manual and a vignette are provided within the package. 
Supplementary information Supplementary data are available at Bioinformatics online.
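For readers unfamiliar with time warping, the Python sketch below implements classic pairwise dynamic time warping (DTW) on two toy elution profiles. It is background illustration only and is not ncGTW: it aligns a single pair of profiles and has none of the compound-specific warping functions or neighbor-wise constraints described in the abstract. The profiles and names are made up.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D intensity profiles."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

reference = [0, 0, 1, 5, 1, 0, 0]
shifted = [0, 0, 0, 1, 5, 1, 0]  # same peak, eluting one scan later

# Scan-by-scan comparison without any warping.
pointwise = sum(abs(x - y) for x, y in zip(reference, shifted))
warped = dtw_distance(reference, shifted)
```

Here `pointwise` is large because the peak is offset by one scan, while `warped` is zero because the warping path absorbs the retention-time shift.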

