Large-scale labeling and assessment of sex bias in publicly available expression data

Background: Women are at more than 1.5-fold higher risk for clinically relevant adverse drug events. While this higher prevalence is partially due to gender-related effects, biological sex differences likely also impact drug response. Publicly available gene expression databases provide a unique opportunity for examining drug response at a cellular level. However, missingness and heterogeneity of metadata prevent large-scale identification of drug exposure studies and limit assessments of sex bias. To address this, we trained organism-specific models to infer sample sex from gene expression data, and used entity normalization to map metadata cell line and drug mentions to existing ontologies. Using this method, we inferred sex labels for 450,371 human and 245,107 mouse microarray and RNA-seq samples from refine.bio. Results: Overall, we find a slight female bias (52.1%) in human samples and a male bias (62.5%) in mouse samples; this corresponds to a majority of mixed sex studies in humans and single sex studies in mice, split between female-only and male-only (25.8% vs. 18.9% in human and 21.6% vs. 31.1% in mouse, respectively). In drug studies, we find limited evidence for sex-sampling bias overall; however, specific categories of drugs, including human cancer and mouse nervous system drugs, are enriched in female-only and male-only studies, respectively. We leverage our expression-based sex labels to further examine the complexity of cell line sex and assess the frequency of metadata sex label misannotations (2–5%). Conclusions: Our results demonstrate limited overall sex bias, while highlighting high bias in specific subfields and underscoring the importance of including sex labels to better understand the underlying biology. We make our inferred and normalized labels, along with flags for misannotated samples, publicly available to catalyze the routine use of sex as a study variable in future analyses.
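The study-level breakdown described in this abstract (female-only, male-only, and mixed sex studies, plus flags for misannotated samples) can be illustrated with a minimal sketch. It assumes that per-sample sex labels have already been inferred; the function names, the tuple layout, and the example identifiers are hypothetical and are not taken from the authors' released code or data.

```python
from collections import defaultdict


def classify_studies(sample_labels):
    """Classify each study by the sexes of its samples.

    sample_labels: list of (study_id, sex) pairs, sex is "female" or "male".
    Returns a dict mapping study_id to "female-only", "male-only", or "mixed".
    """
    sexes_by_study = defaultdict(set)
    for study_id, sex in sample_labels:
        sexes_by_study[study_id].add(sex)

    result = {}
    for study_id, sexes in sexes_by_study.items():
        if sexes == {"female"}:
            result[study_id] = "female-only"
        elif sexes == {"male"}:
            result[study_id] = "male-only"
        else:
            result[study_id] = "mixed"
    return result


def female_fraction(sample_labels):
    """Fraction of samples labeled female across all studies."""
    sexes = [sex for _, sex in sample_labels]
    return sum(1 for sex in sexes if sex == "female") / len(sexes)


def flag_misannotated(metadata_sex, inferred_sex):
    """Return sample ids whose metadata label disagrees with the inferred label.

    Both arguments are dicts mapping sample_id to "female" or "male".
    Samples without a metadata label are skipped (nothing to compare).
    """
    return [
        sample_id
        for sample_id, inferred in inferred_sex.items()
        if sample_id in metadata_sex and metadata_sex[sample_id] != inferred
    ]


# Hypothetical example data, for illustration only.
labels = [
    ("S1", "female"), ("S1", "female"),
    ("S2", "male"),
    ("S3", "female"), ("S3", "male"),
]
print(classify_studies(labels))
print(female_fraction(labels))
```

Comparing labels only where metadata exists reflects the missingness the abstract describes: a sample with no annotated sex cannot be misannotated, only unlabeled.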


Large-Scale Labeling and Assessment of Sex Bias in Publicly Available Expression Data

10.1101/2020.10.26.356287 ◽

2020 ◽

Author(s):

Emily Flynn ◽

Annie Chang ◽

Russ B. Altman

Keyword(s):

Gene Expression ◽

Cell Line ◽

Large Scale ◽

Drug Response ◽

Drug Exposure ◽

Adverse Drug Events ◽

Human Cancer ◽

Sampling Bias ◽

Sex Bias ◽

Expression Data

Women are at more than 1.5-fold higher risk for clinically relevant adverse drug events. While this higher prevalence is partially due to gender-related effects, biological sex differences likely also impact drug response. Publicly available gene expression databases provide a unique opportunity for examining drug response at a cellular level. However, missingness and heterogeneity of metadata prevent large-scale identification of drug exposure studies and limit assessments of sex bias. To address this, we trained organism-specific models to infer sample sex from gene expression data, and used entity normalization to map metadata cell line and drug mentions to existing ontologies. Using this method, we infer sex labels for 450,371 human and 245,107 mouse microarray and RNA-seq samples from refine.bio. Overall, we find a slight female bias (52.1%) in human samples and a male bias (62.5%) in mouse samples; this corresponds to a majority of single sex studies, split between female-only and male-only (33.3% vs. 18.4% in human and 31.0% vs. 30.4% in mouse, respectively). In drug studies, we find limited evidence for sex-sampling bias overall; however, specific categories of drugs, including human cancer and mouse nervous system drugs, are enriched in female-only and male-only studies, respectively. Our expression-based sex labels allow us to further examine the complexity of cell line sex and assess the frequency of metadata sex label misannotations (2–5%). We make our inferred and normalized labels, along with flags for misannotated samples, publicly available to catalyze the routine use of sex as a study variable in future analyses.


Graph Convolutional Network for Drug Response Prediction Using Gene Expression Data

Mathematics ◽

10.3390/math9070772 ◽

2021 ◽

Vol 9 (7) ◽

pp. 772

Author(s):

Seonghun Kim ◽

Seockhun Bae ◽

Yinhua Piao ◽

Kyuri Jo

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Drug Response ◽

Response Prediction ◽

Biological Data ◽

Expression Data ◽

Convolutional Network ◽

Essential Information ◽

Protein Protein Interaction

Genomic profiles of cancer patients such as gene expression have become a major source to predict responses to drugs in the era of personalized medicine. As large-scale drug screening data with cancer cell lines are available, a number of computational methods have been developed for drug response prediction. However, few methods incorporate both gene expression data and the biological network, which can harbor essential information about the underlying process of the drug response. We proposed an analysis framework called DrugGCN for prediction of Drug response using a Graph Convolutional Network (GCN). DrugGCN first generates a gene graph by combining a Protein-Protein Interaction (PPI) network and gene expression data with feature selection of drug-related genes, and the GCN model detects the local features such as subnetworks of genes that contribute to the drug response by localized filtering. We demonstrated the effectiveness of DrugGCN using biological data showing its high prediction accuracy among the competing methods.
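To make the idea of graph convolution over a gene graph more concrete, here is a minimal sketch of a single propagation step. It uses the common first-order rule ReLU(Â X W) with a symmetrically normalized adjacency matrix, which is an assumption made for illustration: the abstract says only that DrugGCN applies "localized filtering", so this may differ from the filter the authors use. The function names and the toy graph are hypothetical.

```python
import math


def normalized_adjacency(edges, n):
    """Return D^{-1/2} (A + I) D^{-1/2} as a dense list of lists.

    edges: list of undirected (i, j) gene pairs, e.g. from a PPI network.
    n: number of genes (nodes).
    """
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0  # self-loop so each gene keeps its own signal
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    degree = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def graph_conv(a_hat, x, w):
    """One graph convolution step: ReLU(a_hat @ x @ w).

    x: n x f_in node features (e.g. expression values per gene).
    w: f_in x f_out weight matrix.
    """
    n, f_in, f_out = len(x), len(x[0]), len(w[0])
    # Mix each gene's features with those of its graph neighbours.
    ax = [
        [sum(a_hat[i][k] * x[k][f] for k in range(n)) for f in range(f_in)]
        for i in range(n)
    ]
    # Apply the weights, then the ReLU non-linearity.
    return [
        [max(0.0, sum(ax[i][f] * w[f][o] for f in range(f_in))) for o in range(f_out)]
        for i in range(n)
    ]


# Toy graph: genes 0 and 1 interact, gene 2 is isolated.
a_hat = normalized_adjacency([(0, 1)], 3)
print(graph_conv(a_hat, [[2.0], [4.0], [6.0]], [[1.0]]))
```

In this toy case the two interacting genes end up with the average of their expression values, while the isolated gene is unchanged, which is the sense in which the filter is local to a gene's neighbourhood.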


Addition of Drug-Response Specific Micro-RNAs to the International Prognostic Index Improves Prognostic Stratification of GCB-DLBCL Patients Treated with R-CHOP

Blood ◽

10.1182/blood-2019-122351 ◽

2019 ◽

Vol 134 (Supplement_1) ◽

pp. 1623-1623 ◽

Cited By ~ 1

Author(s):

Karen Dybkær ◽

Hanne Due ◽

Rasmus Froberg Brøndum ◽

Ken H. Young ◽

Martin Bøgsted

Keyword(s):

Disease Progression ◽

Cell Line ◽

Cell Lines ◽

Mirna Expression ◽

Drug Response ◽

Drug Exposure ◽

International Prognostic Index ◽

Prognostic Index ◽

Expression Data ◽

Prognostic Stratification

Background: Approximately 40% of patients with diffuse large B-cell lymphoma (DLBCL) suffer from primary refractory disease or treatment-induced immuno-chemotherapy resistance, demonstrating that standard treatment regimens are not sufficient to cure all patients. Early detection of resistance is of great importance, and defining microRNA (miRNA) involvement in resistance could be useful to guide treatment selection and help monitor treatment administration while sparing patients from ineffective, but still toxic, therapy. Concept and Aims: With information on drug-response specific miRNAs, we hypothesized that multi-miRNA panels can improve the robustness of individual clinical markers and serve as a prognostic classifier predicting disease progression in DLBCL patients. Methods: Fifteen DLBCL cell lines were tested for sensitivity towards rituximab (R), cyclophosphamide (C), doxorubicin (H), and vincristine (O). Cell line specific seeding concentrations were used to ensure exponential growth, each cell line was subjected to 16 concentrations in serial 2-fold dilutions, and the number of metabolically active cells was evaluated after 48 hours of drug exposure using the MTS assay. For each drug, we ranked the cell lines according to their sensitivity and categorized them as sensitive, intermediate responsive, or resistant. Baseline miRNA expression data were obtained for each cell line in the untreated condition. Differential miRNA expression analysis between sensitive and resistant cell lines, selecting probes with a log fold change larger than 2, identified 43 miRNAs associated with response to compounds of the R-CHOP regimen. Using the Affymetrix HG-U133+2 platform, expression levels of the miRNA precursors were assessed in 701 diagnostic DLBCL biopsies, and miRNA-panel classifiers were built using multiple Cox regression or random survival forest.
Results: The generated prognostic miRNA-panel classifiers were tested for predictive accuracy and were subsequently evaluated by Brier scores and time-varying area under the ROC curve (tAUC). Progression-free survival (PFS) was chosen as the outcome, since it is a treatment evaluation parameter as close as possible to the time of drug exposure and the tested miRNAs were all associated directly with drug-specific response. Furthermore, overall survival (OS) was used for verification of findings. Comparison of analyses conducted for the respective cohorts (all DLBCL, ABC, and GCB patients) showed the lowest prediction errors for all models within the GCB subclass, where a multivariate Cox miRNA-panel model including miR-146a, miR-155, miR-21, miR-34a, and the miR-23a~miR-27a~miR-24-2 cluster performed best and successfully stratified GCB-DLBCL patients into high and low risk of disease progression. In addition, combining the miRNA panel with the international prognostic index (IPI) substantially increased prognostic performance in GCB-classified patients, indicating a prognostic signal from the response-specific miRNAs independent of IPI. Conclusion: We found, as proof of concept, that adding gene expression data detecting drug-response specific miRNAs to the clinically established IPI improved the prognostic stratification of GCB-DLBCL patients treated with R-CHOP. Disclosures: No relevant conflicts of interest to declare.
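Two steps of the Methods lend themselves to a short worked sketch: building a series of 16 concentrations by serial 2-fold dilution, and selecting probes by a log fold change cutoff of 2. The sketch below rests on assumptions the abstract does not state: that the fold change is a base-2 logarithm, that the cutoff applies to its absolute value, and that expression values are on a linear scale. The starting concentration, function names, and probe names are hypothetical.

```python
import math


def twofold_dilutions(top_concentration, n_steps=16):
    """Concentrations for a serial 2-fold dilution, highest first."""
    return [top_concentration / (2 ** step) for step in range(n_steps)]


def select_probes(mean_expression, threshold=2.0):
    """Select probes whose |log2 fold change| exceeds the threshold.

    mean_expression: dict mapping probe name to a pair
    (mean in sensitive lines, mean in resistant lines), linear scale.
    Assumption: base-2 logarithm and absolute value; the abstract only
    says "log fold change larger than 2".
    """
    selected = []
    for probe, (sensitive, resistant) in mean_expression.items():
        log_fc = math.log2(resistant / sensitive)
        if abs(log_fc) > threshold:
            selected.append(probe)
    return selected


# Hypothetical values, for illustration only.
print(twofold_dilutions(100.0, n_steps=4))
print(select_probes({"probe_a": (1.0, 8.0), "probe_b": (1.0, 2.0), "probe_c": (16.0, 1.0)}))
```

With 16 steps, the lowest concentration is 2^15 (32,768) times smaller than the highest, so a single series spans more than four orders of magnitude.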


Bridging the gap between cancer cell line models and tumours using gene expression data

British Journal of Cancer ◽

10.1038/s41416-021-01359-0 ◽

2021 ◽

Author(s):

Javad Noorbakhsh ◽

Francisca Vazquez ◽

James M. McFarland

Keyword(s):

Gene Expression ◽

Cell Line ◽

Cancer Cell ◽

Gene Expression Data ◽

Cancer Cell Line ◽

Expression Data


GENE DISCOVERY METHODS FROM LARGE-SCALE GENE EXPRESSION DATA

Quantum Bio-Informatics III ◽

10.1142/9789814304061_0040 ◽

2010 ◽

Author(s):

AKIFUMI SHIMIZU ◽

KENTARO YANO

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Gene Discovery ◽

Expression Data


LSTrAP-Crowd: Prediction of novel components of bacterial ribosomes with crowd-sourced analysis of RNA sequencing data

10.1101/2020.04.20.005249 ◽

2020 ◽

Author(s):

Benedict Hew ◽

Qiao Wen Tan ◽

William Goh ◽

Jonathan Wei Xiong Ng ◽

Kenny Koh ◽

...

Keyword(s):

Gene Expression ◽

Protein Synthesis ◽

Rna Sequencing ◽

Gene Expression Data ◽

Large Scale ◽

Bacterial Resistance ◽

Expression Data ◽

Sequencing Data ◽

Novel Proteins ◽

Novel Antibiotics

Bacterial resistance to antibiotics is a growing problem that is projected to cause more deaths than cancer in 2050. Consequently, novel antibiotics are urgently needed. Since more than half of the available antibiotics target the bacterial ribosomes, proteins that are involved in protein synthesis are prime targets for the development of novel antibiotics. However, experimental identification of these potential antibiotic target proteins can be labor-intensive and challenging, as these proteins are likely to be poorly characterized and specific to few bacteria. In order to identify these novel proteins, we established a Large-Scale Transcriptomic Analysis Pipeline in Crowd (LSTrAP-Crowd), where 285 individuals processed 26 terabytes of RNA-sequencing data of the 17 most notorious bacterial pathogens. In total, the crowd processed 26,269 RNA-seq experiments and used the data to construct gene co-expression networks, which were used to identify more than a hundred uncharacterized genes that were transcriptionally associated with protein synthesis. We provide the identity of these genes together with the processed gene expression data. The data can be used to identify other vulnerabilities of bacteria, while our approach demonstrates how the processing of gene expression data can be easily crowdsourced.


SOMDE: A scalable method for identifying spatially variable genes with self-organizing map

10.1101/2020.12.10.419549 ◽

2020 ◽

Author(s):

Minsheng Hao ◽

Kui Hua ◽

Xuegong Zhang

Keyword(s):

Gene Expression ◽

Large Scale ◽

Expression Patterns ◽

Self Organizing Map ◽

Expression Data ◽

Spatial Expression ◽

Variable Expression ◽

Sequencing Technologies ◽

Physical Context ◽

Variable Genes

Recent developments in spatial transcriptomic sequencing technologies provide powerful tools for understanding cells in the physical context of tissue micro-environments. A fundamental task in spatial gene expression analysis is to identify genes with spatially variable expression patterns, or spatially variable genes (SVgenes). Several computational methods have been developed for this task, but their high computational complexity limits their scalability to the latest and future large-scale spatial expression data. We present SOMDE, an efficient method for identifying SVgenes in large-scale spatial expression data. SOMDE uses a self-organizing map (SOM) to cluster neighboring cells into nodes, and then uses a Gaussian Process to fit the node-level spatial gene expression to identify SVgenes. Experiments show that SOMDE is about 5-50 times faster than existing methods with comparable results. The adjustable resolution of SOMDE makes it the only method that can give results in ~5 minutes on large datasets of more than 20,000 sequencing sites. SOMDE is available as a Python package on PyPI at https://pypi.org/project/somde.
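The speed-up described here comes from fitting the model on node-level data rather than on every sequencing site. The sketch below illustrates that aggregation step only, and it deliberately swaps the self-organizing map for plain square-grid binning, which is a simpler stand-in and not the SOMDE algorithm. It does not use the somde package; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def aggregate_to_nodes(spots, node_size):
    """Group neighbouring spots into square grid nodes.

    spots: list of (x, y, expression) tuples for one gene.
    node_size: side length of each grid cell; larger values give fewer,
    coarser nodes (a stand-in for an adjustable resolution).
    Returns a dict mapping (column, row) to (spot count, mean expression).
    """
    buckets = defaultdict(list)
    for x, y, value in spots:
        key = (int(x // node_size), int(y // node_size))
        buckets[key].append(value)
    return {
        key: (len(values), sum(values) / len(values))
        for key, values in buckets.items()
    }


# Hypothetical spots, for illustration only.
spots = [(0.5, 0.5, 1.0), (1.5, 0.2, 3.0), (2.5, 2.5, 10.0)]
print(aggregate_to_nodes(spots, node_size=2))
```

A downstream spatial model would then be fitted on the node coordinates and node-level means, so its cost depends on the number of nodes rather than the number of sites.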
