Heterogeneous Gene Expression Cross-Evaluation of Robust Biomarkers Using Machine Learning Techniques Applied to Lung Cancer

Background: Gene expression analysis is one of the most promising pillars for understanding and uncovering the mechanisms underlying the development and spread of cancer. In this sense, Next Generation Sequencing technologies, such as RNA-Seq, currently lead the market due to their precision and cost. Nevertheless, there is still an enormous amount of unanalyzed data obtained from older technologies, such as microarrays, which could still be useful for extracting relevant knowledge. Methods: This research describes and implements a complete machine learning methodology to cross-evaluate the compatibility between the RNA-Seq and microarray technologies. To show a real application of the designed pipeline, a lung cancer case study is addressed by considering two detected subtypes: adenocarcinoma and squamous cell carcinoma. The transcriptomic datasets considered for the study were obtained from the public repositories NCBI/GEO, ArrayExpress and GDC-Portal. From them, several gene experiments were carried out with the aim of finding gene signatures for these lung cancer subtypes, linked to both transcriptomic technologies. With these differentially expressed genes (DEGs) selected, intelligent predictive models capable of classifying new samples belonging to these cancer subtypes were developed. Results: The predictive models built using one technology are capable of discerning samples from a different technology. The classification results are evaluated in terms of accuracy, F1-score and ROC curves along with AUC. Finally, the biological information of the gene sets obtained and their relationship with lung cancer is reviewed, finding strong biological evidence linking them to the disease. Conclusion: The method is capable of finding strong gene signatures that are also independent of the transcriptomic technology used to develop the analysis. In addition, the article highlights the potential of using heterogeneous transcriptomic data to increase the number of samples available for the studies, increasing the statistical significance of the results.
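The cross-technology evaluation described in this abstract (train on one platform, test on the other, then report accuracy and F1-score) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the nearest-centroid classifier, the per-platform z-scoring, the two-gene toy data and all function names are assumptions made for the example.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each gene (column) within one platform's samples."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against zero variance
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def fit_centroids(rows, labels):
    """Return the mean expression profile of each class."""
    centroids = {}
    for lab in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lab]
        centroids[lab] = [mean(c) for c in zip(*members)]
    return centroids


def predict(centroids, rows):
    """Assign each sample to the class with the closest centroid."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda lab: dist2(centroids[lab], r)) for r in rows]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: two genes, microarray-like training values and
# RNA-Seq-like test values on a very different numeric scale.
array_x = [[8.0, 5.0], [8.5, 5.2], [5.1, 8.3], [5.0, 8.0]]
array_y = ["adenocarcinoma", "adenocarcinoma", "squamous", "squamous"]
rnaseq_x = [[120.0, 30.0], [25.0, 140.0], [110.0, 20.0], [30.0, 150.0]]
rnaseq_y = ["adenocarcinoma", "squamous", "adenocarcinoma", "squamous"]

# Standardizing within each platform puts both on a comparable scale.
model = fit_centroids(zscore_columns(array_x), array_y)
predicted = predict(model, zscore_columns(rnaseq_x))
acc = accuracy(rnaseq_y, predicted)
f1 = f1_score(rnaseq_y, predicted, positive="adenocarcinoma")
```

Standardizing each platform separately is one simple way to make the scales comparable; the abstract does not state which normalization the study applied.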

Analysis of gene expression profiles of lung cancer subtypes with machine learning algorithms

Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease ◽ 10.1016/j.bbadis.2020.165822 ◽ 2020 ◽ Vol 1866 (8) ◽ pp. 165822 ◽ Cited By ~ 2

Author(s): Fei Yuan ◽ Lin Lu ◽ Quan Zou

Keyword(s): Gene Expression ◽ Machine Learning ◽ Lung Cancer ◽ Expression Profiles ◽ Learning Algorithms ◽ Gene Expression Profiles ◽ Machine Learning Algorithms ◽ Cancer Subtypes

A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation

Frontiers in Genetics ◽ 10.3389/fgene.2018.00313 ◽ 2018 ◽ Vol 9 ◽ Cited By ~ 6

Author(s): Adam McDermaid ◽ Xin Chen ◽ Yiran Zhang ◽ Cankun Wang ◽ Shaopeng Gu ◽ ...

Keyword(s): Gene Expression ◽ Machine Learning ◽ Uncertainty Analysis ◽ RNA-Seq ◽ Read Alignment ◽ New Machine

A machine learning texture model for classifying lung cancer subtypes using preliminary bronchoscopic findings

Medical Physics ◽ 10.1002/mp.13241 ◽ 2018 ◽ Vol 45 (12) ◽ pp. 5509-5514 ◽ Cited By ~ 2

Author(s): Po‐Hao Feng ◽ Yin‐Tzu Lin ◽ Chung‐Ming Lo

Keyword(s): Machine Learning ◽ Lung Cancer ◽ Cancer Subtypes ◽ Texture Model

Gene Expression Analysis for Early Lung Cancer Prediction Using Machine Learning Techniques: An Eco-Genomics Approach

IEEE Access ◽ 10.1109/access.2018.2886604 ◽ 2019 ◽ Vol 7 ◽ pp. 4232-4238 ◽ Cited By ~ 5

Author(s): Jayadeep Pati

Keyword(s): Gene Expression ◽ Machine Learning ◽ Lung Cancer ◽ Expression Analysis ◽ Gene Expression Analysis ◽ Machine Learning Techniques ◽ Cancer Prediction ◽ Early Lung Cancer ◽ Learning Techniques

77 Prevalence of secondary immunotherapeutic targets in the absence of established immune biomarkers in solid tumors

Journal for ImmunoTherapy of Cancer ◽ 10.1136/jitc-2021-sitc2021.077 ◽ 2021 ◽ Vol 9 (Suppl 3) ◽ pp. A86-A86

Author(s): Paul DePietro ◽ Mary Nesline ◽ Yong Hee Lee ◽ RJ Seager ◽ Erik Van Roey ◽ ...

Keyword(s): Gene Expression ◽ Lung Cancer ◽ Reference Population ◽ List Type ◽ Tumor Type ◽ Genomic Profiling ◽ RNA-Seq ◽ Immune Biomarkers ◽ Cancer Types ◽ Immune Related Genes

Background: Immune checkpoint inhibitor-based therapies have achieved impressive success in the treatment of several cancer types. Predictive immune biomarkers, including PD-L1, MSI and TMB, are well established as surrogate markers for immune evasion and tumor-specific neoantigens across many tumors. Positive detection across cancer types varies, but overall ~50% of patients test negative for these primary immune markers [1]. In this study, we investigated the prevalence of secondary immune biomarkers outside of PD-L1, TMB and MSI. Methods: Comprehensive genomic and immune profiling, including PD-L1 IHC, TMB, MSI and gene expression of 395 immune related genes, was performed on 6078 FFPE tumors representing 34 cancer types, predominantly composed of lung cancer (36.7%), colorectal cancer (11.9%) and breast cancer (8.5%). Expression levels by RNA-seq of 36 genes targeted by immunotherapies in solid tumor clinical trials, identified as secondary immune biomarkers, were ranked against a reference population. Genes with a rank value ≥75th percentile were considered high and values were associated with PD-L1 (positive ≥1%), MSI (MSI-H or MSS) and TMB (high ≥10 Mut/Mb) status. Additionally, secondary immune biomarker status was segmented by tumor type and cancer immune cycle roles. Results: In total, 41.0% of cases were PD-L1+, 6.4% TMB+, and 0.1% MSI-H. 12.6% of cases were positive for >2 of these markers while 39.9% were triple negative (PD-L1-/TMB-/MSS). Of the PD-L1-/TMB-/MSS cases, 89.1% were high for at least one secondary immune biomarker, with 69.3% having ≥3 markers. PD-L1-/TMB-/MSS tumor types with ≥50% prevalence of high secondary immune biomarkers included brain, prostate, kidney, sarcoma, gallbladder, breast, colorectal, and liver cancer. High expression of cancer testis antigen secondary immune biomarkers (e.g., NY-ESO-1, LAGE-1A, MAGE-A4) was most commonly observed in bladder, ovarian, sarcoma, liver, and prostate cancer (≥15%). Tumors demonstrating T-cell priming (e.g., CD40, OX40, CD137), trafficking (e.g., TGFB1, TLR9, TNF) and/or recognition (e.g., CTLA4, LAG3, TIGIT) secondary immune biomarkers were most represented by kidney, gallbladder, and sarcoma (≥40%), with melanoma, esophageal, head & neck, cervical, stomach, and lung cancer least represented (≥15%). Conclusions: Our studies show comprehensive tumor profiling that includes gene expression can detect secondary immune biomarkers targeted by investigational therapies in ~90% of PD-L1-/TMB-/MSS cases. While genomic profiling could also provide therapeutic choices for a percentage of these patients, detection of secondary immune biomarkers by RNA-seq provides additional options for patients without a clear therapeutic path as determined by PD-L1 testing and genomic profiling alone. Reference: [1] Huang RSP, Haberberger J, Severson E, et al. A pan-cancer analysis of PD-L1 immunohistochemistry and gene amplification, tumor mutation burden and microsatellite instability in 48,782 cases. Mod Pathol 2021;34:252–263.
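The ranking step in the Methods (expression ranked against a reference population, with a rank at or above the 75th percentile called high) can be sketched as below. This is an illustrative assumption rather than the study's implementation: the abstract does not define how the percentile rank is computed, so the "share of reference values less than or equal to the sample value" definition, the function names and the toy numbers are all hypothetical.

```python
from bisect import bisect_right


def percentile_rank(value, reference):
    """Percent of reference values that are less than or equal to `value`."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


def high_markers(sample, reference_by_gene, threshold=75.0):
    """Genes whose expression ranks at or above `threshold` in the reference."""
    return sorted(
        gene
        for gene, value in sample.items()
        if percentile_rank(value, reference_by_gene[gene]) >= threshold
    )


# Hypothetical expression values; gene names are taken from the abstract.
reference_by_gene = {
    "LAG3": [1, 2, 3, 4, 5, 6, 7, 8],
    "TIGIT": [10, 20, 30, 40],
}
sample = {"LAG3": 6.5, "TIGIT": 15}

lag3_rank = percentile_rank(sample["LAG3"], reference_by_gene["LAG3"])
called_high = high_markers(sample, reference_by_gene)
```

In this toy case LAG3 sits exactly on the 75th percentile and is called high, while TIGIT falls at the 25th percentile and is not.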

Machine learning applied to whole-blood RNA-sequencing data uncovers distinct subsets of patients with systemic lupus erythematosus

10.1101/647719 ◽ 2019

Author(s): William A Figgett ◽ Katherine Monaghan ◽ Milica Ng ◽ Monther Alhamdoosh ◽ Eugene Maraskovsky ◽ ...

Keyword(s): Gene Expression ◽ Machine Learning ◽ Systemic Lupus Erythematosus ◽ Clinical Trials ◽ Lupus Erythematosus ◽ Whole Blood ◽ RNA-Seq ◽ Systemic Lupus ◽ Disease Heterogeneity ◽ Healthy Donors

Objective: Systemic lupus erythematosus (SLE) is a heterogeneous autoimmune disease that is difficult to treat. There is currently no optimal stratification of patients with SLE, and thus responses to available treatments are unpredictable. Here, we developed a new stratification scheme for patients with SLE, based on the whole-blood transcriptomes of patients with SLE. Methods: We applied machine learning approaches to RNA-sequencing (RNA-seq) datasets to stratify patients with SLE into four distinct clusters based on their gene expression profiles. A meta-analysis on two recently published whole-blood RNA-seq datasets was carried out and an additional similar dataset of 30 patients with SLE and 29 healthy donors was contributed in this research; 141 patients with SLE and 51 healthy donors were analysed in total. Results: Examination of SLE clusters, as opposed to unstratified SLE patients, revealed underappreciated differences in the pattern of expression of disease-related genes relative to clinical presentation. Moreover, gene signatures correlated to flare activity were successfully identified. Conclusion: Given that disease heterogeneity has confounded research studies and clinical trials, our approach addresses current unmet medical needs and provides a greater understanding of SLE heterogeneity in humans. Stratification of patients based on gene expression signatures may be a valuable strategy to harness disease heterogeneity and identify patient populations that may be at an increased risk of disease symptoms. Further, this approach can be used to understand the variability in responsiveness to therapeutics, thereby improving the design of clinical trials and advancing personalised therapy.

A Pathway-Based Strategy to Identify Biomarkers for Lung Cancer Diagnosis and Prognosis

Evolutionary Bioinformatics ◽ 10.1177/1176934319838494 ◽ 2019 ◽ Vol 15 ◽ pp. 117693431983849 ◽ Cited By ~ 2

Author(s): Mengying Sheng ◽ Xueying Xie ◽ Jun Wang ◽ Wanjun Gu

Keyword(s): Gene Expression ◽ Lung Cancer ◽ Cancer Diagnosis ◽ Large Scale ◽ Gene Signature ◽ Multiple Sources ◽ Cancer Subtypes ◽ Cancer Management ◽ Lung Cancer Diagnosis ◽ Gene Expression Signatures

Current research has identified several potential biomarkers for lung cancer diagnosis or prognosis. However, most of these biomarkers are derived from a relatively small number of samples using algorithms at the gene level. Hence, the gene expression signatures discovered in these studies show little overlap. In this study, we proposed a new strategy to identify biomarkers from multiple datasets at the pathway level. We integrated the genome-wide expression data of lung cancer tissues from 13 published studies and applied our strategy to identify lung cancer diagnostic and prognostic biomarkers. We identified a 32-gene signature that differentiates lung adenocarcinomas from other lung cancer subtypes. We also discovered a 43-gene signature that can predict the outcome of human lung cancers. We tested their performance in several independent cohorts, which confirmed their robust prognostic and diagnostic power. Furthermore, we showed that the proposed gene expression signatures were independent of several traditional clinical indicators in lung cancer management. Our results suggest that the pathway-based strategy is useful for identifying transcriptomic biomarkers from large-scale gene expression datasets collected from multiple sources.

Predicting Complete Remission of Acute Myeloid Leukemia: Machine Learning Applied to Gene Expression

Cancer Informatics ◽ 10.1177/1176935119835544 ◽ 2019 ◽ Vol 18 ◽ pp. 117693511983554 ◽ Cited By ~ 5

Author(s): Ophir Gal ◽ Noam Auslander ◽ Yu Fan ◽ Daoud Meerzaman

Keyword(s): Gene Expression ◽ Machine Learning ◽ Acute Myeloid Leukemia ◽ Complete Remission ◽ Myeloid Leukemia ◽ Statistical Significance ◽ Expression Patterns ◽ Pathway Enrichment Analysis ◽ Data Set ◽ Acute Myeloid

Machine learning (ML) is a useful tool for advancing our understanding of the patterns and significance of biomedical data. Given the growing trend toward applying ML techniques in precision medicine, here we present an ML technique which predicts the likelihood of complete remission (CR) in patients diagnosed with acute myeloid leukemia (AML). In this study, we explored the question of whether ML algorithms designed to analyze gene-expression patterns obtained through RNA sequencing (RNA-seq) can be used to accurately predict the likelihood of CR in pediatric AML patients who have received induction therapy. We employed tests of statistical significance to determine which genes were differentially expressed in the samples derived from patients who achieved CR after 2 courses of treatment and the samples taken from patients who did not benefit. We tuned classifier hyperparameters to optimize performance and used multiple methods to guide our feature selection as well as our assessment of algorithm performance. To identify the model which performed best within the context of this study, we plotted receiver operating characteristic (ROC) curves. Using the top 75 genes from the k-nearest neighbors algorithm (K-NN) model (K = 27) yielded the best area-under-the-curve (AUC) score that we obtained: 0.84. When we finally tested the previously unseen test data set, the top 50 genes yielded the best AUC = 0.81. Pathway enrichment analysis for these 50 genes showed that the guanosine diphosphate fucose (GDP-fucose) biosynthesis pathway is the most significant with an adjusted P value = .0092, which may suggest the vital role of N-glycosylation in AML.
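The AUC values reported here summarize how well a model's scores rank patients who achieved CR above those who did not. A minimal sketch of that computation follows, using the pairwise (Mann-Whitney) formulation of ROC AUC; the labels and scores are made-up illustrative values, not data from the study, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = complete remission, 0 = no remission.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of samples, which is fine for small cohorts; library implementations typically use a sort-based equivalent.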

Leveraging TCGA gene expression data to build predictive models for cancer drug response

BMC Bioinformatics ◽ 10.1186/s12859-020-03690-4 ◽ 2020 ◽ Vol 21 (S14) ◽ Cited By ~ 3

Author(s): Evan A. Clayton ◽ Toyya A. Pujol ◽ John F. McDonald ◽ Peng Qiu

Keyword(s): Gene Expression ◽ Machine Learning ◽ Gene Expression Data ◽ Predictive Models ◽ Drug Response ◽ Cancer Drug ◽ Expression Data ◽ Classification Methods ◽ Clustering And Classification ◽ Machine Learning Models

Background: Machine learning has been utilized to predict cancer drug response from multi-omics data generated from sensitivities of cancer cell lines to different therapeutic compounds. Here, we build machine learning models using gene expression data from patients’ primary tumor tissues to predict whether a patient will respond positively or negatively to two chemotherapeutics: 5-Fluorouracil and Gemcitabine. Results: We focused on 5-Fluorouracil and Gemcitabine because, based on our exclusion criteria, they provide the largest numbers of patients within TCGA. Normalized gene expression data were clustered and used as the input features for the study. We used matching clinical trial data to ascertain the response of these patients via multiple classification methods. Multiple clustering and classification methods were compared for prediction accuracy of drug response. Clara and random forest were found to be the best clustering and classification methods, respectively. The results show our models predict with up to 86% accuracy, despite the study’s limitation of sample size. We also found the genes most informative for predicting drug response were enriched in well-known cancer signaling pathways and highlighted their potential significance in chemotherapy prognosis. Conclusions: Primary tumor gene expression is a good predictor of cancer drug response. Investment in larger datasets containing both patient gene expression and drug response is needed to support future work on machine learning models. Ultimately, such predictive models may aid oncologists in making critical treatment decisions.

ECMarker: interpretable machine learning model identifies gene expression biomarkers predicting clinical outcomes and reveals molecular mechanisms of human disease in early stages

Bioinformatics ◽ 10.1093/bioinformatics/btaa935 ◽ 2020

Author(s): Ting Jin ◽ Nam D Nguyen ◽ Flaminia Talos ◽ Daifeng Wang

Keyword(s): Gene Expression ◽ Machine Learning ◽ Lung Cancer ◽ Cancer Patients ◽ Human Disease ◽ Gene Networks ◽ Regulatory Mechanisms ◽ Disease Development ◽ Lung Cancer Patients ◽ Machine Learning Model

Motivation: Gene expression and regulation, a key molecular mechanism driving human disease development, remains elusive, especially at early stages. Integrating the increasing amount of population-level genomic data and understanding gene regulatory mechanisms in disease development are still challenging. Machine learning has emerged to solve this, but many machine learning methods were typically limited to building an accurate prediction model as a ‘black box’, barely providing biological and clinical interpretability from the box. Results: To address these challenges, we developed an interpretable and scalable machine learning model, ECMarker, to predict gene expression biomarkers for disease phenotypes and simultaneously reveal underlying regulatory mechanisms. Particularly, ECMarker is built on the integration of semi- and discriminative-restricted Boltzmann machines, a neural network model for classification allowing lateral connections at the input gene layer. This interpretable model is scalable without needing any prior feature selection and enables directly modeling and prioritizing genes and revealing potential gene networks (from lateral connections) for the phenotypes. With application to the gene expression data of non-small-cell lung cancer patients, we found that ECMarker not only achieved a relatively high accuracy for predicting cancer stages but also identified the biomarker genes and gene networks implying the regulatory mechanisms in lung cancer development. In addition, ECMarker demonstrates clinical interpretability as its prioritized biomarker genes can predict survival rates of early lung cancer patients (P-value < 0.005). Finally, we identified a number of drugs currently in clinical use for late stages or other cancers with effects on these early lung cancer biomarkers, suggesting potential novel candidates for early cancer medicine. Availability and implementation: ECMarker is open source as a general-purpose tool at https://github.com/daifengwanglab/ECMarker. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
