Classification models for Invasive Ductal Carcinoma Progression, based on gene expression data-trained supervised machine learning

2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Shikha Roy ◽  
Rakesh Kumar ◽  
Vaibhav Mittal ◽  
Dinesh Gupta
2019 ◽  
Author(s):  
Shikha Roy ◽  
Rakesh Kumar ◽  
Vaibhav Mittal ◽  
Dinesh Gupta

Abstract Early detection of breast cancer and correct determination of its stage are important for prognosis and for providing appropriate personalized clinical treatment to breast cancer patients. However, despite considerable efforts and progress, there is still a need to identify the specific genomic factors responsible for, or accompanying, the stages of Invasive Ductal Carcinoma (IDC) progression, which can aid correct determination of cancer stage. We have developed two-class machine-learning classification models to differentiate the early and late stages of invasive ductal carcinoma. The prediction models are trained with RNA-seq gene expression profiles representing different IDC stages of 610 patients, obtained from The Cancer Genome Atlas (TCGA). Different supervised learning algorithms were trained and evaluated, with model learning enriched by different feature selection methods. We also developed a machine-learning classifier trained on the same datasets, with the training sets reduced to data corresponding to IDC driver genes. Based on these two classifiers, we have developed a web server, Duct-BRCA-CSP, to distinguish early-stage from late-stage IDC based on input RNA-seq gene expression profiles. Our analysis also enables deeper insights into the stage-dependent molecular events accompanying breast ductal carcinoma progression. The server is publicly available at http://bioinfo.icgeb.res.in/duct-BRCA-CSP.
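
The general idea of pairing feature selection with a two-class classifier can be sketched as follows. This is not the authors' pipeline: the abstract does not name the algorithms or selection methods used, so the sketch substitutes a simple filter score (difference of class means over pooled standard deviation) and a nearest-centroid classifier, and runs on synthetic data. All function names, the gene count, and the shift of 5.0 are assumptions made for illustration.

```python
import math
import random
import statistics

def score_gene(values, labels):
    """Absolute difference of class means divided by the pooled standard deviation."""
    early = [v for v, lab in zip(values, labels) if lab == "early"]
    late = [v for v, lab in zip(values, labels) if lab == "late"]
    pooled_sd = math.sqrt((statistics.variance(early) + statistics.variance(late)) / 2)
    return abs(statistics.mean(early) - statistics.mean(late)) / pooled_sd

def select_top_genes(X, labels, k):
    """Return the column indices of the k highest-scoring genes."""
    n_genes = len(X[0])
    scores = [score_gene([row[g] for row in X], labels) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]

def fit_centroids(X, labels, genes):
    """Mean expression of the selected genes for each class."""
    centroids = {}
    for lab in set(labels):
        rows = [row for row, l in zip(X, labels) if l == lab]
        centroids[lab] = [statistics.mean([r[g] for r in rows]) for g in genes]
    return centroids

def predict(row, centroids, genes):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((row[g] - c) ** 2 for g, c in zip(genes, centroid))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))

# Synthetic stand-in for expression profiles: 40 samples, 30 genes.
rng = random.Random(0)
labels = ["early"] * 20 + ["late"] * 20
X = []
for lab in labels:
    row = [rng.gauss(0.0, 1.0) for _ in range(30)]
    if lab == "late":
        for g in (0, 1, 2):  # three hypothetical informative genes
            row[g] += 5.0
    X.append(row)

genes = select_top_genes(X, labels, k=3)
centroids = fit_centroids(X, labels, genes)
accuracy = sum(predict(r, centroids, genes) == l for r, l in zip(X, labels)) / len(X)
print(sorted(genes), accuracy)
```

The accuracy printed here is measured on the training samples, and the genes were selected using all samples, so it is optimistic. A real evaluation would repeat the selection step inside each cross-validation fold.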


2020 ◽  
Vol 6 (6) ◽  
Author(s):  
Ali Farzane ◽  
Maryam Akbarzadeh ◽  
Reza Ferdousi ◽  
Mohammadreza Rashidi ◽  
Reza Safdari

Objectives: In this study, we aimed to identify putative biomarkers for the identification and characterization of cancer stem cells (CSCs) in liver cancer. Methods: We applied a supervised machine learning method, XGBoost, to data from 13 GEO data series to classify samples using gene expression data. Results: Across the 376 samples (129 CSC and 247 non-CSC cases), XGBoost displayed high performance in classifying the data. XGBoost feature importance scores and SHAP (SHapley Additive exPlanations) values were used to interpret the results and analyze the importance of individual genes. We confirmed that expression levels of a 10-gene set (PTGER3, AURKB, C15orf40, IDI2, OR8D1, NACA2, SERPINB6, L1CAM, SMC1A, and RASGRF1) were predictive. The results showed that these 10 genes can detect CSCs robustly, with accuracy, sensitivity, and specificity of 97%, 100%, and 95%, respectively. Conclusions: We suggest that the ten-gene set may be used as a biomarker set for detecting and characterizing CSCs using gene expression data.
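
As a worked check on how the three reported figures relate to one another, the sketch below computes accuracy, sensitivity, and specificity from confusion-matrix counts. The counts are hypothetical: they assume, purely for illustration, that the metrics were computed over all 376 samples, which the abstract does not state.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    return accuracy, sensitivity, specificity

# Hypothetical counts chosen to match the class sizes in the abstract
# (129 CSCs, 247 non-CSCs); they are not taken from the paper.
acc, sens, spec = classification_metrics(tp=129, fn=0, tn=235, fp=12)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Under these assumed counts, the three values round to the 97%, 100%, and 95% quoted above, which shows the reported figures are mutually consistent for classes of this size.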


Cell Cycle ◽  
2018 ◽  
Vol 17 (4) ◽  
pp. 486-491 ◽  
Author(s):  
Nicolas Borisov ◽  
Victor Tkachev ◽  
Maria Suntsova ◽  
Olga Kovalchuk ◽  
Alex Zhavoronkov ◽  
...  

BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Yuanyuan Li ◽  
David M. Umbach ◽  
Adrienna Bingham ◽  
Qi-Jing Li ◽  
Yuan Zhuang ◽  
...  

Abstract Background Tumor purity is the percent of cancer cells present in a sample of tumor tissue. The non-cancerous cells (immune cells, fibroblasts, etc.) have an important role in tumor biology. The ability to determine tumor purity is important to understand the roles of cancerous and non-cancerous cells in a tumor. Methods We applied a supervised machine learning method, XGBoost, to data from 33 TCGA tumor types to predict tumor purity using RNA-seq gene expression data. Results Across the 33 tumor types, the median correlation between observed and predicted tumor-purity ranged from 0.75 to 0.87 with small root mean square errors, suggesting that tumor purity can be accurately predicted using expression data. We further confirmed that expression levels of a ten-gene set (CSF2RB, RHOH, C1S, CCDC69, CCL22, CYTIP, POU2AF1, FGR, CCL21, and IL7R) were predictive of tumor purity regardless of tumor type. We tested whether our set of ten genes could accurately predict tumor purity of a TCGA-independent data set. We showed that expression levels from our set of ten genes were highly correlated (ρ = 0.88) with the actual observed tumor purity. Conclusions Our analyses suggested that the ten-gene set may serve as a biomarker for tumor purity prediction using gene expression data.
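
The two evaluation measures named in this abstract, correlation between observed and predicted purity and root mean square error, can be computed as in the minimal sketch below. The abstract does not say which correlation coefficient was used, so Pearson's is an assumption here, and the purity values are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Made-up purity fractions, not data from the study.
observed = [0.35, 0.50, 0.62, 0.78, 0.90]
predicted = [0.40, 0.47, 0.66, 0.74, 0.88]

print(pearson_r(observed, predicted), rmse(observed, predicted))
```

The two measures are complementary: correlation captures whether predictions rank and track the observed values, while RMSE captures how far off they are in absolute purity units.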


2020 ◽  
Vol 21 (S14) ◽  
Author(s):  
Evan A. Clayton ◽  
Toyya A. Pujol ◽  
John F. McDonald ◽  
Peng Qiu

Abstract Background Machine learning has been utilized to predict cancer drug response from multi-omics data generated from sensitivities of cancer cell lines to different therapeutic compounds. Here, we build machine learning models using gene expression data from patients’ primary tumor tissues to predict whether a patient will respond positively or negatively to two chemotherapeutics: 5-Fluorouracil and Gemcitabine. Results We focused on 5-Fluorouracil and Gemcitabine because, based on our exclusion criteria, they provide the largest numbers of patients within TCGA. Normalized gene expression data were clustered and used as the input features for the study. We used matching clinical trial data to ascertain the response of these patients via multiple classification methods. Multiple clustering and classification methods were compared for prediction accuracy of drug response. Clara and random forest were found to be the best clustering and classification methods, respectively. The results show that our models predict with up to 86% accuracy, despite the study’s limited sample size. We also found that the genes most informative for predicting drug response were enriched in well-known cancer signaling pathways, and we highlighted their potential significance in chemotherapy prognosis. Conclusions Primary tumor gene expression is a good predictor of cancer drug response. Investment in larger datasets containing both patient gene expression and drug response is needed to support future work on machine learning models. Ultimately, such predictive models may aid oncologists in making critical treatment decisions.


2019 ◽  
Vol 15 (2) ◽  
pp. e1006826 ◽  
Author(s):  
David G. P. van IJzendoorn ◽  
Karoly Szuhai ◽  
Inge H. Briaire-de Bruijn ◽  
Marie Kostine ◽  
Marieke L. Kuijjer ◽  
...  

2019 ◽  
Vol 3 (s1) ◽  
pp. 2-2
Author(s):  
Megan C Hollister ◽  
Jeffrey D. Blume

OBJECTIVES/SPECIFIC AIMS: To examine and compare the claims in Bzdok, Altman, and Krzywinski under a broader set of conditions by using unbiased methods of comparison. To explore how to accurately use various machine learning and traditional statistical methods in large-scale translational research by estimating their accuracy statistics, and then to identify the methods with the best performance characteristics. METHODS/STUDY POPULATION: We conducted a simulation study of microarray gene expression data. We maintained the original structure proposed by Bzdok, Altman, and Krzywinski. The structure for gene expression data includes a total of 40 genes from 20 people, of whom 10 are phenotype positive and 10 are phenotype negative. In order to find a statistical difference, 25% of the genes were set to be dysregulated across phenotype. This dysregulation forced the positive and negative phenotypes to have different mean population expressions. Additional variance was included to simulate genetic variation across the population. We also allowed for within-person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in the simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values. RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels via a machine learning algorithm outperform classical methods. Almost uniformly, the machine learning methods did not yield improved accuracy statistics, and they depend heavily on the a priori chosen number of dysregulated genes. DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this and their additional complexity, machine learning approaches would not be preferable. Of all the approaches, the second-generation p-value appears to offer significant benefit for the cost of a priori defining a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.
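
Two of the classical methods compared in this abstract, Bonferroni and Benjamini-Hochberg adjusted p-values, can be sketched in a few lines. This is a textbook implementation of the two adjustments, not the authors' simulation code, and the p-values are made up for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjustment, returned in the original order of the inputs."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values for four hypothetical genes.
pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two, while Benjamini-Hochberg controls the false discovery rate, so its adjusted values are never larger than the Bonferroni ones.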

