Dictionary learning for transcriptomics data reveals type-specific gene modules in a multi-class setting

2020 ◽  
Vol 62 (3-4) ◽  
pp. 119-134
Author(s):  
Mona Rams ◽  
Tim Conrad

Abstract Extracting information from large biological datasets is a challenging task, due to the large data size, high-dimensionality, noise, and errors in the data. Gene expression data contains information about which gene products have been formed by a cell, thus representing which genes have been read to activate a particular biological process. Understanding which of these gene products can be related to which processes can, for example, give insights into how diseases evolve and might give hints about how to fight them. The Next Generation RNA-sequencing method emerged over a decade ago and is nowadays state-of-the-art in the field of gene expression analyses. However, analyzing these large, complex datasets is still a challenging task. Many of the existing methods do not take into account the underlying structure of the data. In this paper, we present a new approach for RNA-sequencing data analysis based on dictionary learning. Dictionary learning is a sparsity-enforcing method that has been widely used in many fields, such as image processing, pattern classification, and signal denoising. We show how, for RNA-sequencing data, the atoms in the dictionary matrix can be interpreted as modules of genes that either capture patterns specific to different types or represent modules that are reused across different scenarios. We evaluate our approach on four large datasets with samples from multiple types. A Gene Ontology term analysis, a standard tool for understanding the functions of genes, shows that the gene sets found are in agreement with the biological context of the sample types. Further, we find that the sparse representations of samples using the dictionary can be used to identify type-specific differences.
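
For readers unfamiliar with the technique, dictionary learning is commonly posed as the optimization problem sketched below. This is the textbook formulation and not necessarily the exact objective used in the paper; the symbols X, D, R, λ, p, n and K are notation assumed here for illustration only.

```latex
\min_{D,\,R}\; \tfrac{1}{2}\,\lVert X - D R \rVert_F^2
  \;+\; \lambda \sum_{i=1}^{n} \lVert r_i \rVert_1
\quad \text{subject to} \quad \lVert d_k \rVert_2 \le 1,\; k = 1,\dots,K
```

Under this reading, X (p genes by n samples) is the expression matrix, each column d_k of D (p by K) is an atom that could be interpreted as a gene module, each column r_i of R (K by n) is the sparse representation of one sample, and λ controls how strongly sparsity is enforced.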

2021 ◽  
Vol 36 (Supplement_1) ◽  
Author(s):  
Floranne Boulogne ◽  
Laura Claus ◽  
Henry Wiersma ◽  
Roy Oelen ◽  
Floor Schukking ◽  
...  

Abstract Background and Aims Genetic testing in patients with suspected hereditary kidney disease does not always reveal the genetic cause for the patient's disorder. Potentially pathogenic variants can reside in genes that are not known to be involved in kidney disease, which makes it difficult to prioritize and interpret the relevance of these variants. As such, there is a clear need for methods that predict the phenotypic consequences of gene expression in a way that is as unbiased as possible. To help identify candidate genes we have developed KidneyNetwork, in which tissue-specific expression is utilized to predict kidney-specific gene functions. Method We combined gene co-expression in 878 publicly available kidney RNA-sequencing samples with the co-expression of a multi-tissue RNA-sequencing dataset of 31,499 samples to build KidneyNetwork. The expression patterns were used to predict which genes have a kidney-related function, and which (disease) phenotypes might be caused when these genes are mutated. By integrating the information from the HPO database, in which known phenotypic consequences of disease genes are annotated, with the gene co-expression network we obtained prediction scores for each gene per HPO term. As proof of principle, we applied KidneyNetwork to prioritize variants in exome-sequencing data from 13 kidney disease patients without a genetic diagnosis. Results We assessed the prediction performance of KidneyNetwork by comparing it to GeneNetwork, a multi-tissue co-expression network we previously developed. In KidneyNetwork, we observe a significantly improved prediction accuracy of kidney-related HPO-terms, as well as an increase in the total number of significantly predicted kidney-related HPO-terms (figure 1). To examine its clinical utility, we applied KidneyNetwork to 13 patients with a suspected hereditary kidney disease without a genetic diagnosis. 
Based on the HPO terms “Renal cyst” and “Hepatic cysts”, combined with a list of potentially damaging variants in one of the undiagnosed patients with mild ADPKD/PCLD, we identified ALG6 as a new candidate gene. ALG6 bears a high resemblance to other genes implicated in this phenotype in recent years. Through the 100,000 Genomes Project and collaborators we identified three additional patients with kidney and/or liver cysts carrying a suspected deleterious variant in ALG6. Conclusion We present KidneyNetwork, a kidney-specific co-expression network that accurately predicts which genes have kidney-specific functions and may result in kidney disease. Gene-phenotype associations of genes unknown for kidney-related phenotypes can be predicted by KidneyNetwork. We show the added value of KidneyNetwork by applying it to exome sequencing data of kidney disease patients without a molecular diagnosis and consequently we propose ALG6 as a promising candidate gene. KidneyNetwork can be applied to clinically unsolved kidney disease cases, but it can also be used by researchers to gain insight into individual genes to better understand kidney physiology and pathophysiology. Acknowledgments This research was made possible through access to the data and findings generated by the 100,000 Genomes Project; http://www.genomicsengland.co.uk.


Blood ◽  
2019 ◽  
Vol 134 (Supplement_1) ◽  
pp. 3766-3766
Author(s):  
Mark Wunderlich ◽  
Jing Chen ◽  
Eric O'Brien ◽  
Nicole Manning ◽  
Christina Sexton ◽  
...  

Therapies for pediatric acute myeloid leukemia (AML) remain unsatisfactory and generally do not incorporate molecularly-targeted agents aside from FLT3 inhibitors outside of the relapse setting. Patient-derived xenograft (PDX) models of AML are increasingly accessible for the preclinical evaluation of targeted therapies, though the degree to which these systems recapitulate the disease state as found in patients has not been well defined for AML. Gene expression profiling of patient blasts has been successfully used to discriminate distinct subtypes of AML, to uncover subtype-specific vulnerabilities, and to predict response to therapy and outcomes. We sought to systematically examine PDX models of pediatric AML for their ability to replicate global gene expression patterns and preserve mutational signatures found in patients. In addition, we conducted in-depth bioinformatic analyses of samples with cryptic CBFA2T3-GLIS2 fusion generated by the inv(16)(p13.3q24.3) for identification of potential novel targeted therapies. We performed detailed analyses of RNA sequencing data from a diverse series of 24 pediatric AML PDX models established from samples obtained from patients with relapse and refractory disease. Initially we compared our PDX data against 49 selected relapse and refractory patient sample data files found in the NCI TARGET dataset of pediatric AML. When applying unsupervised hierarchical clustering to the PDX samples, we found that clustering was associated with MLL status. Clustering of the combined sets of samples by MLL status showed integration of samples according to mutation profile, regardless of data source (PDX or patient). The expression levels of all detectable transcripts were highly conserved between PDX and patient MLL-r samples. Separate analysis of each dataset yielded MLL-specific gene lists that included a subset of overlapping genes which may point to a unique relapse and refractory pediatric MLL-r signature. 
This list contains several interesting new targets for further study. A subset of 12 PDX models were compared directly to the matched patient sample from which they were established. This analysis revealed strong similarity, with each PDX most closely related to its matched patient sample, suggesting retention of sample-specific gene expression in immune deficient mice. We set up our PDX models in NSG mice with transgenic expression of human myelo-supportive cytokines SCF, GM-CSF, and IL-3 in order to promote the most efficient and robust engraftment of precious patient material. In order to detect any skewing effects due to the host mouse strain, we compared NSGS PDX RNA sequencing data to 10 matched NSG PDX models. This comparison revealed consistent differences in only 9 transcripts, which were almost entirely related to increased JAK/STAT signaling and macrophage activation pathways in NSGS mice relative to NSG mice. Interestingly, during this analysis we observed a distinct PCA-driven clustering of a pair of PDX samples with previously clinically unidentified driver mutations. Reanalysis of the RNA sequencing data revealed evidence of a cryptic GLIS2 rearrangement (found in ~1% of pediatric AML cases) as the driver mutation, which was subsequently confirmed by RT-PCR in both samples. The unique CBFA2T3/GLIS2 RNA signature was mined to guide the composition of a focused 75-molecule in vitro drug screen against ex vivo PDX samples with an emphasis on the SHH, WNT, and BCL2 pathways. This screen identified the Wnt-C59 PORCN inhibitor as having specific activity against CBFA2T3/GLIS2+ AMLs. Further testing of C-59 in combinatorial studies revealed enhanced effects with the addition of the BCL2 inhibitor, venetoclax. In vivo experiments are currently underway to determine the pre-clinical efficacy of this novel combination. In summary, we found highly significant fidelity of gene expression in PDX models of relapse and refractory pediatric AML. 
Analysis of this dataset has led to several insights, including potential targeted therapies, highlighting how this system could be a valuable tool for discovery of novel targeted therapies, especially for very rare, distinct subtypes of disease. Disclosures Perentesis: Kurome Therapeutics: Consultancy.


Development ◽  
2020 ◽  
Vol 147 (24) ◽  
pp. dev193854
Author(s):  
Chad Cockrum ◽  
Kiyomi R. Kaneshiro ◽  
Andreas Rechtsteiner ◽  
Tomoko M. Tabuchi ◽  
Susan Strome

ABSTRACT Transcriptomic approaches have provided a growing set of powerful tools with which to study genome-wide patterns of gene expression. Rapidly evolving technologies enable analysis of transcript abundance data from particular tissues and even single cells. This Primer discusses methods that can be used to collect and profile RNAs from specific tissues or cells, process and analyze high-throughput RNA-sequencing data, and define sets of genes that accurately represent a category, such as tissue-enriched or tissue-specific gene expression.


Author(s):  
Rebecca Elyanow ◽  
Ron Zeira ◽  
Max Land ◽  
Benjamin J. Raphael

Abstract Tumors are highly heterogeneous, consisting of cell populations with both transcriptional and genetic diversity. These diverse cell populations are spatially organized within a tumor, creating a distinct tumor microenvironment. A new technology called spatial transcriptomics can measure spatial patterns of gene expression within a tissue by sequencing RNA transcripts from a grid of spots, each containing a small number of cells. In tumor cells, these gene expression patterns represent the combined contribution of regulatory mechanisms, which alter the rate at which a gene is transcribed, and genetic diversity, particularly copy number aberrations (CNAs), which alter the number of copies of a gene in the genome. CNAs are common in tumors and often promote cancer growth through upregulation of oncogenes or downregulation of tumor-suppressor genes. We introduce a new method, STARCH (Spatial Transcriptomics Algorithm Reconstructing Copy-number Heterogeneity), to infer CNAs from spatial transcriptomics data. STARCH overcomes challenges in inferring CNAs from RNA-sequencing data by leveraging the observation that cells located nearby in a tumor are likely to share similar CNAs. We find that STARCH outperforms existing methods that infer CNAs from RNA-sequencing data without incorporating spatial information.


2020 ◽  
Author(s):  
Kwangbom Choi ◽  
Hao He ◽  
Daniel M. Gatti ◽  
Vivek M. Philip ◽  
Narayanan Raghupathy ◽  
...  

Abstract Multi-parent populations (MPPs), genetically segregating model systems derived from two or more inbred founder strains, are widely used in biomedical and agricultural research. Gene expression profiling by direct RNA sequencing (RNA-Seq) is commonly applied to MPPs to investigate gene expression regulation and to identify candidate genes. In genetically diverse populations, including most MPPs, quantification of gene expression is improved when the RNA-Seq reads are aligned to individualized transcriptomes that incorporate known polymorphic loci. However, the process of constructing and analyzing individual genomes can be computationally demanding and error-prone. We propose a new approach, genome reconstruction by RNA-Seq (GBRS), that relies on simultaneous alignment of RNA-Seq reads to the founder strain transcriptomes. GBRS can reconstruct the diploid genome of each individual and quantify both total and allele-specific gene expression. We demonstrate that GBRS performs as well as methods that rely on high-density genotyping arrays to reconstruct the founder haplotype mosaic of MPP individuals. Using GBRS in addition to other genotyping methods provides quality control for detecting sample mix-ups and improves power to detect expression quantitative trait loci. GBRS software is freely available at https://github.com/churchill-lab/gbrs.


Author(s):  
Anju Karki ◽  
Noah E Berlow ◽  
Jin-Ah Kim ◽  
Esther Hulleman ◽  
Qianqian Liu ◽  
...  

Abstract Background Diffuse intrinsic pontine glioma (DIPG) is a devastating pediatric cancer with unmet clinical need. DIPG is invasive in nature: tumor cells interweave into the nerve fiber tracts of the pons, making the tumor unresectable. Accordingly, novel approaches to combating the disease are of utmost importance, and receptor-driven cell invasion in the context of DIPG is an under-researched area. Here we investigated the impact on cell invasion mediated by PLEXINB1, PLEXINB2, platelet-derived growth factor receptor (PDGFR)α, PDGFRβ, epidermal growth factor receptor (EGFR), activin receptor 1 (ACVR1), chemokine receptor 4 (CXCR4) and NOTCH1. Methods We used previously published RNA-sequencing data to measure gene expression of selected receptors in DIPG tumor tissue versus matched normal tissue controls (n=18). We assessed protein expression of the corresponding genes using DIPG cell culture models. Then, we performed cell viability and cell invasion assays of DIPG cells stimulated with chemoattractants/ligands. Results RNA-sequencing data showed increased gene expression of receptor genes such as PLEXINB2, PDGFRα, EGFR, ACVR1, CXCR4 and NOTCH1 in DIPG tumors compared to the control tissues. Representative DIPG cell lines demonstrated correspondingly increased protein expression levels of these genes. Cell viability assays showed minimal effects of growth factors/chemokines on tumor cell growth in most instances. Recombinant SEMA4C, SEMA4D, PDGF-AA, PDGF-BB, ACVA, CXCL12 and DLL4 ligand stimulation altered invasion in DIPG cells. Conclusions We show that no single growth factor-ligand pair universally induces DIPG cell invasion. However, our results reveal a potential to create a composite of cytokines or anti-cytokines to modulate DIPG cell invasion.


2020 ◽  
Author(s):  
Benedict Hew ◽  
Qiao Wen Tan ◽  
William Goh ◽  
Jonathan Wei Xiong Ng ◽  
Kenny Koh ◽  
...  

Abstract Bacterial resistance to antibiotics is a growing problem that is projected to cause more deaths than cancer in 2050. Consequently, novel antibiotics are urgently needed. Since more than half of the available antibiotics target the bacterial ribosomes, proteins that are involved in protein synthesis are prime targets for the development of novel antibiotics. However, experimental identification of these potential antibiotic target proteins can be labor-intensive and challenging, as these proteins are likely to be poorly characterized and specific to few bacteria. In order to identify these novel proteins, we established a Large-Scale Transcriptomic Analysis Pipeline in Crowd (LSTrAP-Crowd), where 285 individuals processed 26 terabytes of RNA-sequencing data of the 17 most notorious bacterial pathogens. In total, the crowd processed 26,269 RNA-seq experiments and used the data to construct gene co-expression networks, which were used to identify more than a hundred uncharacterized genes that were transcriptionally associated with protein synthesis. We provide the identity of these genes together with the processed gene expression data. The data can be used to identify other vulnerabilities or bacteria, while our approach demonstrates how the processing of gene expression data can be easily crowdsourced.
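
To illustrate what a gene co-expression network is, the following minimal sketch links two genes when their expression profiles across samples are strongly correlated. It assumes Pearson correlation with a fixed cutoff, which is one common choice and not necessarily what the authors' pipeline does; the gene names, the toy values, the 0.9 threshold, and the helper names `pearson` and `coexpression_edges` are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant profile has no defined correlation; treat it as unlinked.
        return 0.0
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene_a, gene_b, r) for pairs with |r| at or above the threshold."""
    edges = []
    for gene_a, gene_b in combinations(sorted(expression), 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if abs(r) >= threshold:
            edges.append((gene_a, gene_b, r))
    return edges


# Toy data (hypothetical): each gene maps to its expression in four samples.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(toy_expression)
```

In this toy input only geneA and geneB rise and fall together, so they form the single edge of the network. In such a network, an uncharacterized gene connected to many genes of known function (for example, protein synthesis) becomes a candidate for sharing that function.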

