NCMP-04. BRAIN-UMAP: THE GENETIC INTERSECTION BETWEEN NEUROSCIENCE, NEUROLOGY, PSYCHIATRY, AND ONCOLOGY

Abstract Whole transcriptome sequencing (RNA-seq) is an important tool for understanding the genetic mechanisms underlying complex human diseases. Several ground-breaking projects have uniformly processed RNA-seq data from publicly available studies to enable cross-comparison. One noteworthy example is the recount2 pipeline, which in 2017 reprocessed ~70,000 samples from the Sequence Read Archive (SRA), The Cancer Genome Atlas (TCGA), and Genotype-Tissue Expression (GTEx). This vast dataset also includes gene expression data for GTEx-defined brain regions, neurological and psychiatric disorders (such as Parkinson’s, Alzheimer’s, and Huntington’s) and gliomas (from sources such as TCGA and the Chinese Glioma Genome Atlas (CGGA)). We apply uniform manifold approximation and projection (UMAP), a non-linear dimension reduction tool, to bulk gene expression data from brain-related diseases to build a BRAIN-UMAP, which allows gene expression profiles to be visualized across datasets. This UMAP shows that while gliomas form a distinct cluster, the neurological and psychiatric diseases are similar to GTEx-defined normal brain regions, which exhibit tissue-specific profiles and patterns. Incorporating gliomas from various publicly available datasets also makes it possible to observe unique clustering of particular subtypes, which can increase our genetic understanding of the disease. We also present a resource where researchers interested in mechanisms can easily compare and contrast the expression of a given gene and/or pathway of interest across various diseases, gliomas, and normal brain regions. Our current study, focusing on brain-related diseases, offers insight into what may be possible for the broader neuroscientific community if we continually reprocess newly available brain-related RNA-seq samples using recount2. Additionally, if we build similar uniform processing pipelines for other kinds of next-generation sequencing data, we would be able to use multi-omic sequencing data to find novel associations between biological entities and increase our mechanistic knowledge of the disease.
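The gene-level comparison across diseases, gliomas, and normal brain regions described in this abstract can be illustrated with a minimal sketch. It is not the BRAIN-UMAP resource itself: the sample layout, the group labels, the gene name `GENE_A`, the expression values, and the choice of a median of log2(expression + 1) are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import median


def summarize_gene_by_group(samples, gene):
    """Return {group: median log2(expression + 1)} for one gene.

    `samples` is a list of dicts with a "group" label and an
    "expression" mapping of gene name -> non-negative value.
    This layout is hypothetical, not the format used by recount2.
    """
    values = defaultdict(list)
    for sample in samples:
        expr = sample["expression"].get(gene)
        if expr is None:
            continue  # gene not measured in this sample
        values[sample["group"]].append(math.log2(expr + 1.0))
    return {group: median(vals) for group, vals in values.items()}


# Toy values, invented for illustration only.
toy_samples = [
    {"group": "glioma", "expression": {"GENE_A": 63.0}},
    {"group": "glioma", "expression": {"GENE_A": 15.0}},
    {"group": "normal cortex", "expression": {"GENE_A": 3.0}},
    {"group": "normal cortex", "expression": {"GENE_A": 1.0}},
]

summary = summarize_gene_by_group(toy_samples, "GENE_A")
```

Because every sample is assumed to come from one uniform processing pipeline, the per-group medians can be compared directly; with differently processed datasets that comparison would not be meaningful.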

Mining The Cancer Genome Atlas gene expression data for lineage markers in distinguishing bladder urothelial carcinoma and prostate adenocarcinoma

Scientific Reports, 2021, Vol 11 (1). DOI: 10.1038/s41598-021-85993-x

Author(s): Ewe Seng Ch’ng

Keyword(s): Gene Expression, Gene Expression Data, The Cancer Genome Atlas, Relative Importance, Expression Data, Gene Expressions, Urothelial Carcinomas, Cancer Genome Atlas, Lineage Markers, Genome Atlas

Abstract Distinguishing bladder urothelial carcinomas from prostate adenocarcinomas among poorly differentiated carcinomas derived from the bladder neck requires a panel of lineage markers. Publicly available gene expression data from The Cancer Genome Atlas (TCGA) provide an avenue to examine the utility of these markers. This study aimed to verify the expression of urothelial and prostate lineage markers in the respective carcinomas and to determine the relative importance of these markers in making this distinction. Gene expression values for these markers were downloaded from the TCGA Pan-Cancer database for bladder and prostate carcinomas. Differential gene expression of these markers was analyzed. Standard linear discriminant analyses were applied to establish the relative importance of these markers in lineage determination and to construct the model that best makes the distinction. This study shows that all urothelial lineage genes except the gene for uroplakin III were significantly expressed in bladder urothelial carcinomas (p < 0.001). In descending order of importance for distinguishing them from prostate adenocarcinomas, genes for uroplakin II, S100P, GATA3 and thrombomodulin had high discriminant loadings (> 0.3). All prostate lineage genes were significantly expressed in prostate adenocarcinomas (p < 0.001). In descending order of importance for distinguishing them from bladder urothelial carcinomas, genes for NKX3.1, prostate-specific antigen (PSA), prostate-specific acid phosphatase, prostein, and prostate-specific membrane antigen had high discriminant loadings (> 0.3). A combination of gene expression values for uroplakin II, S100P, NKX3.1 and PSA approached 100% accuracy in tumor classification in both the training and validation sets. Mining gene expression data thus shows that a combination of four lineage markers helps distinguish between bladder urothelial carcinomas and prostate adenocarcinomas.
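As a rough illustration of the two-class linear discriminant analysis mentioned in this abstract, the sketch below fits Fisher's discriminant on two features using only the Python standard library. It is not the study's model: the two features stand in for one urothelial and one prostate marker, and every expression value is invented for illustration.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def pooled_cov_2x2(class_a, class_b):
    """Pooled within-class covariance matrix for two features."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (class_a, class_b):
        m = mean_vec(rows)
        for r in rows:
            d = (r[0] - m[0], r[1] - m[1])
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(class_a) + len(class_b) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]


def fit_lda(class_a, class_b):
    """Return (weights, threshold); scores above threshold mean class A."""
    ma, mb = mean_vec(class_a), mean_vec(class_b)
    s = pooled_cov_2x2(class_a, class_b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = (ma[0] - mb[0], ma[1] - mb[1])
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Equal priors assumed: cut at the projection of the midpoint.
    mid = ((ma[0] + mb[0]) / 2, (ma[1] + mb[1]) / 2)
    return w, w[0] * mid[0] + w[1] * mid[1]


def predict(w, threshold, x, labels=("bladder", "prostate")):
    score = w[0] * x[0] + w[1] * x[1]
    return labels[0] if score > threshold else labels[1]


# Hypothetical (urothelial marker, prostate marker) expression pairs.
bladder = [(8.0, 1.0), (7.0, 2.0), (9.0, 1.5), (8.5, 0.5)]
prostate = [(1.0, 9.0), (2.0, 8.0), (1.5, 7.5), (0.5, 8.5)]

weights, cutoff = fit_lda(bladder, prostate)
calls = [predict(weights, cutoff, x) for x in bladder + prostate]
```

The sign and size of each weight indicate how a marker pushes a sample toward one class. The discriminant loadings reported in the abstract are a related but different quantity, which this sketch does not compute.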

LSTrAP-Crowd: Prediction of novel components of bacterial ribosomes with crowd-sourced analysis of RNA sequencing data

2020. DOI: 10.1101/2020.04.20.005249

Author(s): Benedict Hew, Qiao Wen Tan, William Goh, Jonathan Wei Xiong Ng, Kenny Koh, ...

Keyword(s): Gene Expression, Protein Synthesis, Rna Sequencing, Gene Expression Data, Large Scale, Bacterial Resistance, Expression Data, Sequencing Data, Novel Proteins, Novel Antibiotics

Abstract Bacterial resistance to antibiotics is a growing problem that is projected to cause more deaths than cancer in 2050. Consequently, novel antibiotics are urgently needed. Since more than half of the available antibiotics target the bacterial ribosome, proteins involved in protein synthesis are prime targets for the development of novel antibiotics. However, experimental identification of these potential antibiotic target proteins can be labor-intensive and challenging, as these proteins are likely to be poorly characterized and specific to few bacteria. To identify these novel proteins, we established a Large-Scale Transcriptomic Analysis Pipeline in Crowd (LSTrAP-Crowd), in which 285 individuals processed 26 terabytes of RNA-sequencing data from the 17 most notorious bacterial pathogens. In total, the crowd processed 26,269 RNA-seq experiments and used the data to construct gene co-expression networks, which were used to identify more than a hundred uncharacterized genes that were transcriptionally associated with protein synthesis. We provide the identity of these genes together with the processed gene expression data. The data can be used to identify other vulnerabilities or bacteria, while our approach demonstrates how the processing of gene expression data can be easily crowdsourced.
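The idea of using a co-expression network to link uncharacterized genes to protein synthesis can be sketched as follows. This is not the LSTrAP-Crowd pipeline: the Pearson correlation measure, the 0.9 threshold, the gene names, and the expression values are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sx * sy)


def coexpression_edges(expr, threshold=0.9):
    """Connect gene pairs whose profiles correlate at or above threshold."""
    return {
        (g1, g2)
        for g1, g2 in combinations(sorted(expr), 2)
        if pearson(expr[g1], expr[g2]) >= threshold
    }


def candidates_by_association(edges, known_genes):
    """Genes outside `known_genes` that share an edge with a known gene."""
    found = set()
    for g1, g2 in edges:
        if g1 in known_genes and g2 not in known_genes:
            found.add(g2)
        elif g2 in known_genes and g1 not in known_genes:
            found.add(g1)
    return found


# Hypothetical expression profiles across five experiments.
toy_expr = {
    "ribosomal_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ribosomal_2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "unknown_1": [1.1, 2.0, 3.2, 3.9, 5.1],
    "unknown_2": [5.0, 1.0, 4.0, 2.0, 3.0],
}

edges = coexpression_edges(toy_expr)
candidates = candidates_by_association(edges, {"ribosomal_1", "ribosomal_2"})
```

In this toy data only `unknown_1` tracks the ribosomal genes, so it is the single candidate returned; a real analysis would need many more experiments and some control for spurious correlation.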

Shrinkage of dispersion parameters in the binomial family, with application to differential exon skipping

2014. DOI: 10.1101/012823

Author(s): Sean Ruddy, Marla Johnson, Elizabeth Purdom

Keyword(s): Gene Expression, Gene Expression Data, Empirical Bayes, Simulated Data, Exon Skipping, Expression Data, Weighted Likelihood, Sequencing Data, Dispersion Parameters, Per Gene

The prevalence of sequencing experiments in genomics has led to an increased use of methods for count data in analyzing high-throughput genomic data. Shrinkage methods remain important for improving the performance of these statistical methods. A common example is gene expression data, where the counts per gene are often modeled as some form of over-dispersed Poisson. In this case, shrinkage estimates of the per-gene dispersion parameter have led to improved estimation of dispersion when the number of samples is small. We address a different count setting introduced by the use of sequencing data: comparing differential proportional usage via an over-dispersed binomial model. This is motivated by our interest in testing for differential exon skipping in mRNA-Seq experiments. We introduce a novel method that models the dispersion based on the double binomial distribution proposed by Efron (1986). Our method (WEB-Seq) is an empirical Bayes strategy that produces a shrunken estimate of dispersion and effectively detects differential proportional usage, and it has close ties to the weighted-likelihood strategy of edgeR developed for gene expression data (Robinson and Smyth, 2007; Robinson et al., 2010). We analyze its behavior on simulated and real data sets and show that our method is fast and powerful and gives accurate control of the FDR compared with alternative approaches. We provide an implementation of our methods in the R package DoubleExpSeq, available on CRAN.
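The general idea of shrinking noisy per-feature dispersion estimates toward a shared value can be shown with a deliberately simple sketch. It is not WEB-Seq and does not use the double binomial model or the DoubleExpSeq API: the fixed `prior_weight`, the use of the plain mean as the common value, the exon names, and the raw estimates are all assumptions for illustration.

```python
def shrink_dispersions(raw, n_samples, prior_weight):
    """Shrink each raw dispersion toward the mean of all raw estimates.

    The weight on the common value is prior_weight / (prior_weight + n_samples),
    so fewer samples (less information per feature) means more shrinkage.
    """
    common = sum(raw.values()) / len(raw)
    w = prior_weight / (prior_weight + n_samples)
    return {name: w * common + (1.0 - w) * value for name, value in raw.items()}


# Hypothetical raw per-exon dispersion estimates from three samples.
raw_dispersion = {"exon_1": 0.10, "exon_2": 0.50, "exon_3": 0.30}

shrunk = shrink_dispersions(raw_dispersion, n_samples=3, prior_weight=3.0)
barely_shrunk = shrink_dispersions(raw_dispersion, n_samples=300, prior_weight=3.0)
```

With three samples and a prior weight of three, each estimate moves halfway toward the common value of 0.30; with 300 samples the estimates are left almost unchanged. An empirical Bayes method would estimate the amount of shrinkage from the data rather than fix it by hand.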

The gene expression data of Mycobacterium tuberculosis based on Affymetrix gene chips provide insight into regulatory and hypothetical genes

BMC Microbiology, 2007, Vol 7 (1), pp. 37. DOI: 10.1186/1471-2180-7-37. Cited By ~ 15

Author(s): Li M Fu, Casey S Fu-Liu

Keyword(s): Gene Expression, Mycobacterium Tuberculosis, Gene Expression Data, Expression Data, Gene Chips, Hypothetical Genes, Insight Into

Inferring TF activities and activity regulators from gene expression data with constraints from TF perturbation data

2020. DOI: 10.1101/2020.05.25.108654

Author(s): Cynthia Ma, Michael R. Brent

Keyword(s): Gene Expression, Gene Expression Data, Large Scale, Target Genes, Activity Levels, Expression Data, Performance Constraints, Perturbation Data, Carry Over, Insight Into

Abstract Background: The activity of a transcription factor (TF) in a sample of cells is the extent to which it is exerting its regulatory potential. Many methods of inferring TF activity from gene expression data have been described, but due to the lack of appropriate large-scale datasets, systematic and objective validation has not been possible until now. Results: Using a new dataset, we systematically evaluate and optimize the approach to TF activity inference in which a gene expression matrix is factored into a condition-independent matrix of control strengths and a condition-dependent matrix of TF activity levels. These approaches require a TF network map, which specifies the target genes of each TF, as input. We evaluate different approaches to building the network map and deriving constraints on the matrices. We find that such constraints are essential for good performance. Constraints can be obtained from expression data in which the activities of individual TFs have been perturbed, and we find that such data are both necessary and sufficient for obtaining good performance. Remaining uncertainty about whether a TF activates or represses a target is a major source of error. To a considerable extent, control strengths inferred using expression data from one growth condition carry over to other conditions. As a result, the control strength matrices derived here can be used for other applications. Finally, we apply these methods to gain insight into the upstream factors that regulate the activities of four yeast TFs: Gcr2, Gln3, Gcn4, and Msn2. Evaluation code and data are available at https://github.com/BrentLab/TFA-evaluation. Conclusions: When a high-quality network map, constraints, and perturbation-response data are available, inferring TF activity levels by factoring gene expression matrices is effective. Furthermore, it provides insight into regulators of TF activity.
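The factorization described in this abstract, expression ≈ control strengths × TF activities with the network map fixing which entries may be nonzero, can be made concrete with a small sketch. It is not the authors' optimization or the code in their repository: it shows only the forward model and the recovery of activities by least squares when control strengths are held fixed, for exactly two TFs, with all numbers invented for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def apply_network_map(strengths, network_map):
    """Zero out control strengths for gene-TF pairs absent from the map."""
    return [[s if allowed else 0.0 for s, allowed in zip(srow, mrow)]
            for srow, mrow in zip(strengths, network_map)]


def solve_2x2(m, v):
    """Solve m @ x = v for a 2x2 matrix m by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(v[0] * m[1][1] - m[0][1] * v[1]) / det,
            (m[0][0] * v[1] - v[0] * m[1][0]) / det]


def infer_activities(expression, control):
    """Least-squares TF activities (2 TFs x conditions) for fixed control."""
    ct = transpose(control)
    gram = matmul(ct, control)    # 2 x 2 normal-equation matrix
    rhs = matmul(ct, expression)  # 2 x conditions
    cols = [solve_2x2(gram, [rhs[0][j], rhs[1][j]])
            for j in range(len(rhs[0]))]
    return transpose(cols)


# Rows are genes, columns are TFs; True means the TF targets the gene.
network_map = [[True, False], [False, True], [True, True]]
raw_strengths = [[2.0, 0.7], [0.3, -1.0], [1.0, 1.0]]
control = apply_network_map(raw_strengths, network_map)

true_activity = [[1.0, 2.0], [3.0, 0.5]]       # TFs x conditions
expression = matmul(control, true_activity)   # genes x conditions
inferred = infer_activities(expression, control)
```

Here the negative strength for the second gene encodes repression, which is where a wrong sign in the map would flip the inferred activity. Estimating both matrices at once, as the abstract describes, requires an iterative or constrained optimizer that this sketch leaves out.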
