A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes

Abstract While there are >2 million publicly-available human microarray gene-expression profiles, these profiles were measured using a variety of platforms that each cover a pre-defined, limited set of genes. Therefore, key to reanalyzing and integrating this massive data collection are methods that can computationally reconstitute the complete transcriptome in partially-measured microarray samples by imputing the expression of unmeasured genes. Current state-of-the-art imputation methods are tailored to samples from a specific platform and rely on gene-gene relationships regardless of the biological context of the target sample. We show that sparse regression models that capture sample-sample relationships (termed SampleLASSO), built on-the-fly for each new target sample to be imputed, outperform models based on fixed gene relationships. Extensive evaluation involving three machine learning algorithms (LASSO, k-nearest-neighbors, and deep-neural-networks), two gene subsets (GPL96–570 and LINCS), and multiple imputation tasks (within and across microarray/RNA-seq datasets) establishes that SampleLASSO is the most accurate model. Additionally, we demonstrate the biological interpretability of this method by showing that, for imputing a target sample from a certain tissue, SampleLASSO automatically leverages training samples from the same tissue. Thus, SampleLASSO is a simple, yet powerful and flexible approach for harmonizing large-scale gene-expression data.
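The sample-wise regression idea described in this abstract can be illustrated with a minimal sketch. It assumes a plain coordinate-descent LASSO written from scratch, with rows as genes and columns as training samples, so that each training sample acts as a feature and each measured gene as a training example. The function names, the `alpha` value, and the toy matrices are hypothetical; the abstract does not state the authors' preprocessing, solver, or hyperparameters.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (the LASSO proximal step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, alpha, n_iter=200):
    """Fit y ~ X @ w + b with an L1 penalty by cyclic coordinate descent.

    Minimizes (1 / (2n)) * ||y - Xw - b||^2 + alpha * ||w||_1.
    X is a list of n rows with p columns each; y has length n.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [v - y_mean for v in y]  # residual when all weights are zero
    col_sq = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant column carries no information
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                w[j] = new_w
    b = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, b


def impute_sample(train_measured, train_unmeasured, target_measured, alpha=0.01):
    """Impute the unmeasured genes of one target sample (hypothetical helper).

    train_measured:   measured genes x training samples
    train_unmeasured: unmeasured genes x training samples
    target_measured:  the target sample's values for the measured genes
    """
    w, b = lasso_cd(train_measured, target_measured, alpha)
    imputed = [b + sum(wj * xj for wj, xj in zip(w, row)) for row in train_unmeasured]
    return imputed, w


if __name__ == "__main__":
    # Toy data: the target sample closely resembles training sample 0.
    measured = [[1.0, 5.0, 4.0], [2.0, 3.0, 3.5], [3.0, 1.0, 0.5],
                [4.0, 2.0, 2.5], [5.0, 4.0, 5.0], [6.0, 0.0, 1.0]]
    unmeasured = [[7.0, 1.0, 2.0], [0.5, 6.0, 5.0]]
    target = [1.1, 2.1, 3.1, 4.1, 5.1, 6.1]
    values, weights = impute_sample(measured, unmeasured, target)
    print("imputed:", values)
    print("weights per training sample:", weights)
```

Because a new model is fit for every target sample, the non-zero weights show which training samples were used. In this toy example only the training sample that resembles the target receives a non-zero weight, which is the kind of behavior the abstract describes for same-tissue samples.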


A Flexible, Interpretable, and Accurate Approach for Imputing the Expression of Unmeasured Genes

10.1101/2020.03.30.016675 ◽

2020 ◽

Author(s):

Christopher A Mancuso ◽

Jacob L Canfield ◽

Deepak Singla ◽

Arjun Krishnan

Keyword(s):

Gene Expression ◽

Large Scale ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Machine Learning Algorithms ◽

K Nearest Neighbors ◽

Microarray Gene Expression ◽

Extensive Evaluation ◽

Training Samples ◽

Target Sample


A Gene-Based Machine Learning Classifier Associated to the Colorectal Adenoma—Carcinoma Sequence

Biomedicines ◽

10.3390/biomedicines9121937 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1937

Author(s):

Antonio Lacalamita ◽

Emanuele Piccinno ◽

Viviana Scalavino ◽

Roberto Bellotti ◽

Gianluigi Giannelli ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Selection Procedure ◽

Normal Mucosa ◽

Gene Expression Omnibus ◽

Machine Learning Algorithms ◽

Microarray Gene Expression ◽

Potential Biomarker

Colorectal cancer (CRC) carcinogenesis is generally the result of the sequential mutation and deletion of various genes; this is known as the normal mucosa–adenoma–carcinoma sequence. The aim of this study was to develop a predictor-classifier for the adenoma–carcinoma sequence using microarray gene expression profiles of primary CRC, adenoma, and normal colon epithelial tissues. Four gene expression profiles from the Gene Expression Omnibus database, containing 465 samples (105 normal, 155 adenoma, and 205 CRC), were preprocessed to identify differentially expressed genes (DEGs) between adenoma tissue and primary CRC. The feature selection procedure, using the sequential Boruta algorithm and Stepwise Regression, determined 56 highly important genes. K-means clustering showed that, using the selected 56 DEGs, the three groups were clearly separated. The classification was performed with machine learning algorithms such as Linear Model (LM), Random Forest (RF), k-Nearest Neighbors (k-NN), and Artificial Neural Network (ANN). The best classification method in terms of accuracy (88.06 ± 0.70) and AUC (92.04 ± 0.47) was k-NN. To confirm the relevance of the predictive models, we applied the four models to a validation cohort: the k-NN model remained the best-performing model, with 91.11% accuracy. Among the 56 DEGs, we identified 17 genes with an ascending or descending trend through the normal mucosa–adenoma–carcinoma sequence. Moreover, using the survival information of the TCGA database, we selected six DEGs related to patient prognosis (SCARA5, PKIB, CWH43, TEX11, METTL7A, and VEGFA). The six-gene-based classifier described in the current study could be used as a potential biomarker for the early diagnosis of CRC.


LncGSEA: a versatile tool to infer lncRNA associated pathways from large-scale cancer transcriptome sequencing data

BMC Genomics ◽

10.1186/s12864-021-07900-y ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yanan Ren ◽

Ting-You Wang ◽

Leah C. Anderton ◽

Qi Cao ◽

Rendong Yang

Keyword(s):

Gene Expression ◽

Large Scale ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Clinical Samples ◽

Sequencing Data ◽

Multiple Cancer ◽

Regulatory Pathways ◽

Cancer Transcriptome ◽

Versatile Tool

Abstract Background: Long non-coding RNAs (lncRNAs) are a growing focus in cancer research. Deciphering pathways influenced by lncRNAs is important to understand their role in cancer. Although knock-down or overexpression of lncRNAs followed by gene expression profiling in cancer cell lines are established approaches to address this problem, these experimental data are not available for a majority of the annotated lncRNAs. Results: As a surrogate, we present lncGSEA, a convenient tool to predict lncRNA-associated pathways through Gene Set Enrichment Analysis of gene expression profiles from large-scale cancer patient samples. We demonstrate that lncGSEA is able to recapitulate lncRNA-associated pathways supported by literature and experimental validations in multiple cancer types. Conclusions: LncGSEA allows researchers to infer lncRNA regulatory pathways directly from clinical samples in oncology. LncGSEA is written in R, and is freely accessible at https://github.com/ylab-hi/lncGSEA.
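To make the Gene Set Enrichment Analysis step in this abstract more concrete, here is a minimal Python sketch of the classic unweighted running-sum enrichment score. It assumes the genes have already been ranked by their association with the lncRNA of interest (for example by correlation of expression across patient samples). The function name and the toy gene labels are hypothetical, and the sketch does not reproduce lncGSEA itself, which is an R package whose internals are not described in the abstract.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA-style enrichment score for one gene set.

    Walk down the ranked list, stepping up by 1/n_hit at each gene in the
    set and down by 1/n_miss at each gene outside it, and return the
    running-sum value with the largest absolute deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    # Hypothetical ranking: genes most positively associated come first.
    ranking = ["a", "b", "c", "d", "e", "f"]
    print(enrichment_score(ranking, {"a", "b"}))  # set concentrated at the top
    print(enrichment_score(ranking, {"e", "f"}))  # set concentrated at the bottom
```

A score near +1 indicates that the pathway's genes cluster at the top of the ranking and a score near -1 that they cluster at the bottom. Real analyses usually weight the steps by the ranking statistic and assess significance by permutation, neither of which is shown here.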


Direct comparison of microarray gene expression profiles between non-amplification and a modified cDNA amplification procedure applicable for needle biopsy tissues

Cancer Detection and Prevention ◽

10.1016/s0361-090x(03)00105-3 ◽

2003 ◽

Vol 27 (5) ◽

pp. 405-411 ◽

Cited By ~ 11

Author(s):

Yiwei Li ◽

Shadan Ali ◽

Philip A Philip ◽

Fazlul H Sarkar

Keyword(s):

Gene Expression ◽

Needle Biopsy ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Microarray Gene Expression ◽

Microarray Gene ◽

Amplification Procedure


Analysis of blood-based gene expression in idiopathic Parkinson disease

Neurology ◽

10.1212/wnl.0000000000004516 ◽

2017 ◽

Vol 89 (16) ◽

pp. 1676-1683 ◽

Cited By ~ 36

Author(s):

Ron Shamir ◽

Christine Klein ◽

David Amar ◽

Eva-Juliane Vollstedt ◽

Michael Bonin ◽

...

Keyword(s):

Gene Expression ◽

Parkinson Disease ◽

Gene Networks ◽

Large Scale ◽

Expression Profiles ◽

Area Under The Curve ◽

Gene Expression Profiles ◽

Gene Signature ◽

Gene Profiles ◽

Independent Test

Objective: To examine whether gene expression analysis of a large-scale Parkinson disease (PD) patient cohort produces a robust blood-based PD gene signature compared to previous studies that have used relatively small cohorts (≤220 samples). Methods: Whole-blood gene expression profiles were collected from a total of 523 individuals. After preprocessing, the data contained 486 gene profiles (n = 205 PD, n = 233 controls, n = 48 other neurodegenerative diseases) that were partitioned into training, validation, and independent test cohorts to identify and validate a gene signature. Batch-effect reduction and cross-validation were performed to ensure signature reliability. Finally, functional and pathway enrichment analyses were applied to the signature to identify PD-associated gene networks. Results: A gene signature of 100 probes that mapped to 87 genes, corresponding to 64 upregulated and 23 downregulated genes differentiating between patients with idiopathic PD and controls, was identified with the training cohort and successfully replicated in both an independent validation cohort (area under the curve [AUC] = 0.79, p = 7.13E–6) and a subsequent independent test cohort (AUC = 0.74, p = 4.2E–4). Network analysis of the signature revealed gene enrichment in pathways, including metabolism, oxidation, and ubiquitination/proteasomal activity, and misregulation of mitochondria-localized genes, including downregulation of COX4I1, ATP5A1, and VDAC3. Conclusions: We present a large-scale study of PD gene expression profiling. This work identifies a reliable blood-based PD signature and highlights the importance of large-scale patient cohorts in developing potential PD biomarkers.
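The AUC values reported in this abstract can be read as the probability that a randomly chosen patient receives a higher signature score than a randomly chosen control. The short sketch below computes that quantity directly from pairwise comparisons, with ties counted as one half. The function name and the toy scores are hypothetical, and this is only an illustration of the metric, not the study's scoring or validation pipeline.

```python
def auc_from_scores(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparisons.

    Counts the fraction of (case, control) pairs in which the case has
    the higher score; tied pairs contribute one half.
    """
    if not case_scores or not control_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Hypothetical signature scores for three cases and three controls.
    print(auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

An AUC of 0.5 corresponds to chance-level separation and 1.0 to perfect separation, so the reported values of 0.79 and 0.74 sit between the two.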


Discovering Distinct Patterns in Gene Expression Profiles

Journal of Integrative Bioinformatics ◽

10.1515/jib-2008-105 ◽

2008 ◽

Vol 5 (2) ◽

Cited By ~ 1

Author(s):

Li Teng ◽

Laiwan Chan

Keyword(s):

Gene Expression ◽

Large Scale ◽

Expression Profiles ◽

Expression Patterns ◽

Gene Expression Profiles ◽

Clustering Methods ◽

Gene Expressions ◽

Real Gene ◽

Large Scale Dataset ◽

Coexpressed Genes

Summary: Traditional analysis of gene expression profiles uses clustering to find groups of coexpressed genes that have similar expression patterns. However, clustering is time-consuming and can be difficult for very large-scale datasets. We propose the idea of Discovering Distinct Patterns (DDP) in gene expression profiles. Because the patterns shown by gene expression reveal the underlying regulatory mechanisms, it is important to find all the different patterns present in a dataset when there is little prior knowledge; doing so is also a helpful starting point for further analysis. We propose an algorithm for DDP that iteratively picks out pairs of gene expression patterns with the largest dissimilarities. This method can also be used as a preprocessing step to initialize the centers for clustering methods such as K-means. Experiments on both synthetic and real gene expression datasets show that our method is efficient and very effective at finding distinct patterns that have gene functional significance.
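One possible reading of the pair-picking idea in this summary is a greedy farthest-pair selection, sketched below. The abstract does not specify the dissimilarity measure, how selected profiles are handled afterwards, or the stopping rule, so Euclidean distance, removal of each chosen pair, and a fixed number of pairs are all assumptions. The function name and the toy profiles are hypothetical.

```python
from itertools import combinations
from math import dist


def pick_distinct_patterns(profiles, n_pairs):
    """Greedily select the most dissimilar pairs of expression profiles.

    profiles: dict mapping a gene name to its expression vector.
    Each round picks the remaining pair with the largest Euclidean
    distance (an assumed dissimilarity) and removes both genes.
    """
    remaining = dict(profiles)
    chosen = []
    for _ in range(n_pairs):
        if len(remaining) < 2:
            break
        first, second = max(
            combinations(sorted(remaining), 2),
            key=lambda pair: dist(remaining[pair[0]], remaining[pair[1]]),
        )
        chosen.extend([first, second])
        del remaining[first], remaining[second]
    return chosen


if __name__ == "__main__":
    toy = {
        "g1": [0.0, 0.0, 0.0],
        "g2": [0.1, 0.0, 0.0],
        "g3": [5.0, 5.0, 5.0],
        "g4": [1.0, 1.0, 1.0],
    }
    print(pick_distinct_patterns(toy, n_pairs=2))
```

The selected profiles could then serve as initial centers for K-means, as the summary suggests. Note that this naive version compares all pairs in every round, which costs quadratic time per round and would not scale to the very large datasets the summary has in mind.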


CFTR ΔF508 mutation has minimal effect on the gene expression profile of differentiated human airway epithelia

AJP Lung Cellular and Molecular Physiology ◽

10.1152/ajplung.00065.2005 ◽

2005 ◽

Vol 289 (4) ◽

pp. L545-L553 ◽

Cited By ~ 29

Author(s):

Joseph Zabner ◽

Todd E. Scheetz ◽

Hakeem G. Almabrazi ◽

Thomas L. Casavant ◽

Jian Huang ◽

...

Keyword(s):

Gene Expression ◽

Cystic Fibrosis ◽

Large Scale ◽

Expression Profiles ◽

Expression Patterns ◽

Primary Cultures ◽

Gene Expression Profiles ◽

Filter Method ◽

Tissue Destruction ◽

Airway Epithelia

Cystic fibrosis (CF) is caused by mutations in the cystic fibrosis transmembrane conductance regulator (CFTR), an epithelial chloride channel regulated by phosphorylation. Most of the disease-associated morbidity is the consequence of chronic lung infection with progressive tissue destruction. As an approach to investigate the cellular effects of CFTR mutations, we used large-scale microarray hybridization to contrast the gene expression profiles of well-differentiated primary cultures of human CF and non-CF airway epithelia grown under resting culture conditions. We surveyed the expression profiles for 10 non-CF and 10 ΔF508 homozygote samples. Of the 22,283 genes represented on the Affymetrix U133A GeneChip, we found evidence of significant changes in expression in 24 genes by two-sample t-test (P < 0.00001). A second, three-filter method of comparative analysis found no significant differences between the groups. The levels of CFTR mRNA were comparable in both groups. There were no significant differences in the gene expression patterns between male and female CF specimens. There were 18 genes with significant increases and 6 genes with decreases in CF relative to non-CF samples. Although the function of many of the differentially expressed genes is unknown, one transcript that was elevated in CF, the KCl cotransporter (KCC4), is a candidate for further study. Overall, the results indicate that CFTR dysfunction has little direct impact on airway epithelial gene expression in samples grown under these conditions.
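The per-gene comparison described in this abstract can be sketched as a two-sample t statistic, together with the arithmetic behind a strict threshold when many genes are tested. The abstract does not say whether a pooled-variance or unequal-variance test was used, so the pooled (Student) form below is an assumption, and the function names are hypothetical. Converting the statistic to a P value requires the t distribution, which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Return (t, degrees_of_freedom) for a pooled-variance two-sample t-test."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


def expected_false_positives(n_tests, p_threshold):
    """Expected number of tests passing the threshold if every null is true."""
    return n_tests * p_threshold


if __name__ == "__main__":
    # Toy expression values for one gene in two small groups.
    print(pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
    # Gene count and threshold taken from the abstract above.
    print(expected_false_positives(22283, 0.00001))
```

With 10 samples per group the statistic has 18 degrees of freedom. At P < 0.00001 across 22,283 genes, fewer than one false positive would be expected by chance under the assumption that no gene differs and the tests are well calibrated.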


Dynamical consequences of regional heterogeneity in the brain’s transcriptional landscape

10.1101/2020.10.28.359943 ◽

2020 ◽

Cited By ~ 1

Author(s):

Gustavo Deco ◽

Kevin Aquino ◽

Aurina Arnatkevičiūtė ◽

Stuart Oldham ◽

Kristina Sabaroedin ◽

...

Keyword(s):

Gene Expression ◽

Large Scale ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Global Gene Expression ◽

Brain Regions ◽

Biophysical Model ◽

Neuronal Dynamics ◽

Regional Heterogeneity ◽

Magnetic Resonance Imaging (MRI)

Abstract Brain regions vary in their molecular and cellular composition, but how this heterogeneity shapes neuronal dynamics is unclear. Here, we investigate the dynamical consequences of regional heterogeneity using a biophysical model of whole-brain functional magnetic resonance imaging (MRI) dynamics in humans. We show that models in which transcriptional variations in excitatory and inhibitory receptor (E:I) gene expression constrain regional heterogeneity more accurately reproduce the spatiotemporal structure of empirical functional connectivity estimates than do models constrained by global gene expression profiles and MRI-derived estimates of myeloarchitecture. We further show that regional heterogeneity is essential for yielding both ignition-like dynamics, which are thought to support conscious processing, and a wide variance of regional activity timescales, which supports a broad dynamical range. We thus identify a key role for E:I heterogeneity in generating complex neuronal dynamics and demonstrate the viability of using transcriptional data to constrain models of large-scale brain function.


Computational analysis of microarray gene expression profiles of lung cancer

Biopolymers and Cell ◽

10.7124/bc.00090f ◽

2016 ◽

Vol 32 (1) ◽

pp. 70-79 ◽

Cited By ~ 14

Author(s):

S. A. Babichev ◽

A. I. Kornelyuk ◽

V. I. Lytvynenko ◽

V. V. Osypenko

Keyword(s):

Gene Expression ◽

Lung Cancer ◽

Computational Analysis ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Microarray Gene Expression ◽

Microarray Gene


Sex differences in viral entry protein expression and host transcript responses to SARS-CoV-2

10.21203/rs.3.rs-100914/v2 ◽

2020 ◽

Author(s):

Mengying Sun ◽

Rama Shankar ◽

Meehyun Ko ◽

Christopher Daniel Chang ◽

Shan-Ju Yeh ◽

...

Keyword(s):

Gene Expression ◽

Sex Differences ◽

Viral Entry ◽

Large Scale ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Scale Analysis ◽

Transcriptional Responses ◽

Deconvolution Analysis ◽

Large Scale Analysis

Abstract Epidemiological studies suggest that men exhibit a higher mortality rate from COVID-19 than women, yet the underlying biology is largely unknown. Here, we seek to delineate sex differences in the gene expression of viral entry proteins ACE2 and TMPRSS2, and host transcriptional responses to SARS-CoV-2 through large-scale analysis of genomic and clinical data. We first compiled 220,000 human gene expression profiles from three databases and completed the meta-information through machine learning and manual annotation. Large-scale analysis of these profiles indicated that male samples show higher expression levels of ACE2 and TMPRSS2 than female samples, especially in the older group (>60 years) and in the kidney. Subsequent analysis of 6,031 COVID-19 patients at Mount Sinai Health System revealed that men have significantly higher creatinine levels, an indicator of impaired kidney function. Further analysis of 782 COVID-19 patient gene expression profiles taken from upper airway and blood suggested that men and women present distinct expression changes. Computational deconvolution analysis of these profiles revealed that male COVID-19 patients have enriched kidney-specific mesangial cells in blood compared to healthy patients. Together, this study suggests that biological differences in the kidney between sexes may contribute to sex disparity in COVID-19.
