Feature Selection for Breast Cancer Classification by Integrating Somatic Mutation and Gene Expression

2021 ◽  
Vol 12 ◽  
Author(s):  
Qin Jiang ◽  
Min Jin

Exploring the molecular mechanisms of breast cancer is essential for the early prediction, diagnosis, and treatment of cancer patients. The large scale of data obtained from high-throughput sequencing technology makes it difficult to identify the driver mutations and a minimal optimal set of genes that are critical to the classification of cancer. In this study, we propose a novel method that requires no prior information to identify mutated genes associated with breast cancer. The somatic mutation data are processed into a mutation matrix, from which the mutation frequency of each gene is obtained. By setting a reasonable threshold on the mutation frequency, a mutated gene set is filtered from the mutation matrix. The gene expression data are used to generate a gene expression matrix, onto which the mutated gene set is mapped to construct a co-expression profile. In the feature selection stage, we propose a staged feature selection algorithm that uses fold change and false discovery rate to select differentially expressed genes, mutual information to remove irrelevant and redundant features, and an embedded method based on a gradient boosting decision tree with Bayesian optimization to obtain an optimal model. In the evaluation stage, we propose a weighted metric that modifies the traditional accuracy to address the sample imbalance problem. We apply the proposed method to The Cancer Genome Atlas breast cancer data and identify a mutated gene set, among which the implicated genes are oncogenes or tumor suppressors previously reported to be associated with carcinogenesis. For comparison with the integrative network, we also run the optimal model on gene expression alone and on the gold standard PAM50. The results show that the integrative network outperforms gene expression alone and PAM50 on the average of most metrics, which indicates that integrating multiple data sources is effective and can discover mutated genes associated with breast cancer.
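The mutation-frequency filter and the imbalance-aware evaluation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy matrix, and the 0.3 threshold are hypothetical, and balanced accuracy (the mean of per-class recall) stands in for the paper's weighted metric, whose exact formula the abstract does not give.

```python
from collections import Counter


def filter_mutated_genes(mutation_matrix, min_frequency):
    """Keep genes whose mutation frequency across samples is >= min_frequency.

    mutation_matrix maps gene name -> list of 0/1 flags, one per sample.
    Returns a dict of gene name -> mutation frequency.
    """
    selected = {}
    for gene, flags in mutation_matrix.items():
        frequency = sum(flags) / len(flags)
        if frequency >= min_frequency:
            selected[gene] = frequency
    return selected


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally.

    Assumption: used here as a stand-in for the paper's weighted metric.
    """
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data (made up): three genes, five samples.
toy_matrix = {
    "TP53": [1, 1, 0, 1, 0],
    "PIK3CA": [1, 0, 0, 1, 0],
    "GENE_X": [0, 0, 0, 0, 1],
}
mutated_gene_set = filter_mutated_genes(toy_matrix, min_frequency=0.3)

# A classifier that always predicts the majority class looks good on plain
# accuracy (4 of 5 correct) but not on the balanced variant.
imbalanced_score = balanced_accuracy([1, 1, 1, 1, 0], [1, 1, 1, 1, 1])
```

In this toy case the always-positive predictor has a plain accuracy of 0.8, while the balanced variant penalizes the missed minority class.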

2020 ◽  
Author(s):  
Nageswara Rao Eluri

Gene selection is a fundamental process in bioinformatics, because cancer classification accuracy depends on the selected genes, which give the classification problem its biological relevance. Accurate classification of diverse tumor types is in high demand for cancer diagnosis. However, existing cancer classification methods are mostly clinically based, so their diagnostic capability is limited. Gene expression data now address significant problems in cancer diagnosis and have allowed researchers to introduce many ways of diagnosing cancer appropriately and effectively. This paper develops a cancer data classification model using gene expression data. Initially, five benchmark gene expression datasets, i.e., “Colon cancer, diffuse B-cell Lymphoma, Leukaemia, Wisconsin Diagnostic Breast Cancer and Wisconsin Breast Cancer Data”, are collected for the experiment. The proposed classification model involves three main phases: “(a) Feature extraction, (b) Optimal Feature Selection, and (c) Classification”. After data pre-processing, feature extraction is performed on the collected gene expression data using first-order and second-order statistical measures. To reduce the length of the feature vectors, optimal feature selection is performed using a new meta-heuristic algorithm termed the Quantum Inspired Immune Clone Optimization Algorithm (QICO). Once the relevant features are selected, classification is performed by a deep learning model, the Recurrent Neural Network (RNN). Moreover, the number of hidden neurons of the RNN is optimized by the same QICO. Optimal feature selection and classification together select the most suitable features and thus maximize classification accuracy. Finally, the experimental analysis reveals that the proposed QICO-based feature selection outperforms other heuristic-based feature selection methods and that the optimized RNN outperforms other machine learning algorithms.


2018 ◽  
Vol 21 (2) ◽  
pp. 74-83
Author(s):  
Tzu-Hung Hsiao ◽  
Yu-Chiao Chiu ◽  
Yu-Heng Chen ◽  
Yu-Ching Hsu ◽  
Hung-I Harry Chen ◽  
...  

Aim and Objective: The number of anticancer drugs currently available is limited, and some of them have low treatment response rates. Moreover, developing a new drug for cancer therapy is labor intensive and sometimes cost prohibitive. Therefore, “repositioning” of known cancer treatment compounds can shorten development time and potentially increase the response rate of cancer therapy. This study proposes a systems biology method for identifying new compound candidates for cancer treatment in two separate procedures. Materials and Methods: First, a “gene set–compound” network was constructed by conducting gene set enrichment analysis on the expression profile of responses to a compound. Second, survival analyses were applied to gene expression profiles derived from four breast cancer patient cohorts to identify gene sets that are associated with cancer survival. A “cancer–functional gene set–compound” network was then constructed, and candidate anticancer compounds were identified. Using breast cancer as an example, 162 breast cancer survival-associated gene sets and 172 putative compounds were obtained. Results: We demonstrated how to utilize the clinical relevance of previous studies through gene sets and then connect it to candidate compounds by using gene expression data from the Connectivity Map. Specifically, we chose a gene set derived from a stem cell study to demonstrate its association with breast cancer prognosis and discussed six new compounds that can increase the expression of the gene set after treatment. Conclusion: Our method can effectively identify compounds with the potential to be “repositioned” for cancer treatment according to their active mechanisms and their association with patients’ survival time.
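The two-procedure design described above amounts to joining two relations: gene sets linked to compounds, and gene sets linked to survival. A minimal Python sketch of that join follows. All names and data here are hypothetical placeholders, and the enrichment and survival analyses that would produce the two inputs are assumed to have been run elsewhere.

```python
def candidate_compounds(gene_set_to_compounds, survival_gene_sets):
    """Map each compound to the survival-associated gene sets it is linked to.

    gene_set_to_compounds: gene set name -> compounds whose expression
        response is enriched for that gene set (procedure 1).
    survival_gene_sets: gene set names associated with survival (procedure 2).
    """
    candidates = {}
    for gene_set in survival_gene_sets:
        for compound in gene_set_to_compounds.get(gene_set, ()):
            candidates.setdefault(compound, set()).add(gene_set)
    return candidates


# Hypothetical inputs for illustration only.
gene_set_to_compounds = {
    "STEM_CELL_SET": ["compound_A", "compound_B"],
    "HYPOXIA_SET": ["compound_B"],
    "UNRELATED_SET": ["compound_C"],
}
survival_gene_sets = {"STEM_CELL_SET", "HYPOXIA_SET"}

network = candidate_compounds(gene_set_to_compounds, survival_gene_sets)
```

Compounds linked only to gene sets without a survival association (here `compound_C`) drop out of the resulting network.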


2021 ◽  
pp. 1063293X2110160
Author(s):  
Dinesh Morkonda Gunasekaran ◽  
Prabha Dhandayudam

Breast cancer is now commonly diagnosed in women. Feature selection is an important step in constructing a classification framework. We propose a Multi filter union (MFU) feature selection method for breast cancer datasets. The method selects important features by taking the union of the features chosen by a random forest algorithm and a Logistic regression (LG) algorithm. Performance is evaluated using the optimal feature subset of the selected dataset. Experiments are run on the Wisconsin Diagnostic Breast Cancer dataset and then on a real dataset from a women's health care center. The proposed approach shows high performance and efficiency compared with existing feature selection algorithms.
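The union step of such a multi-filter method can be sketched as follows. This is a minimal illustration and not the authors' implementation: the score tables are made-up numbers standing in for random forest importances and absolute logistic regression coefficients, and taking the top `k` features from each filter is an assumed selection rule.

```python
def top_k(scores, k):
    """Return the k feature names with the highest scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {name for name, _ in ranked[:k]}


def multi_filter_union(score_tables, k):
    """Union of the top-k features chosen by each filter."""
    selected = set()
    for scores in score_tables:
        selected |= top_k(scores, k)
    return selected


# Hypothetical per-feature scores from two filters.
rf_importance = {"radius": 0.50, "texture": 0.30, "area": 0.15, "symmetry": 0.05}
lr_abs_coef = {"radius": 0.90, "texture": 0.10, "area": 0.20, "symmetry": 0.70}

selected_features = multi_filter_union([rf_importance, lr_abs_coef], k=2)
```

A feature survives if either filter ranks it highly, so the union is at least as large as each individual top-`k` set.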


2020 ◽  
Author(s):  
Yang Liu ◽  
Qian Du ◽  
Dan Sun ◽  
Ruiying Han ◽  
Mengmeng Teng ◽  
...  

Abstract Background: SQSTM1 (Sequestosome 1, p62) is degraded by activated autophagy and is involved in the progression of various types of cancer. However, the prognostic role and underlying regulation mechanism of SQSTM1 in the progression and development of breast cancer remain unclear. Methods: In this study, 1336 samples with available mRNA data from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database and 27 formalin-fixed, paraffin-embedded (FFPE) tissue samples from the First Affiliated Hospital of Xi’an Jiaotong University were collected to evaluate SQSTM1 expression at the mRNA and protein levels. Kaplan–Meier and Cox regression analyses were used to assess prognostic value in three independent breast cancer datasets. The Tumor Immune Estimation Resource (TIMER) database and Gene Set Variation Analysis (GSVA) were used to explore the relationship between SQSTM1 mRNA expression and immune infiltration level in breast cancer. Dysregulation mechanisms of SQSTM1 were also explored using multiple datasets, including copy number variation (CNV), somatic mutation, epigenetic alterations, and other transcriptional and post-transcriptional regulation. Finally, Gene Set Enrichment Analysis (GSEA) was performed to elucidate the functional role of SQSTM1 in breast cancer. Results: The mRNA and protein levels of SQSTM1 were significantly elevated in breast cancer, and the receiver operating characteristic (ROC) curve showed that p62 may act as a diagnostic biomarker. Lower expression of SQSTM1 predicted better outcome across multiple datasets. SQSTM1 was also found to correlate with immune infiltrates in breast cancer. Moreover, CNV and methylation of SQSTM1 DNA were correlated with SQSTM1 dysregulation and acted as prognostic factors for breast cancer patients. However, the somatic mutation status of SQSTM1 did not show any prognostic relevance. We also identified diverse transcription factors that directly bind to SQSTM1 DNA and miRNAs that may regulate SQSTM1 mRNA. Finally, functional enrichment analysis revealed that SQSTM1 is related to cell signal transduction, oxidative stress and autophagy in breast cancer. Conclusion: Our findings revealed that overexpression of SQSTM1 is significantly associated with poor survival and immune infiltration in breast cancer. In addition, SQSTM1 plays a key role in the progression of breast cancer and might be a promising biomarker for the diagnosis and personalized treatment of breast cancer patients.
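For readers unfamiliar with the Kaplan–Meier analysis mentioned in the Methods, the product-limit estimator can be sketched in plain Python. This is a generic textbook illustration rather than the study's pipeline. The function name and toy follow-up data are hypothetical, and in practice one curve would be computed per expression group (for example high versus low SQSTM1) and the curves compared.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate of the survival function.

    times: follow-up time for each patient.
    events: 1 if the event (e.g. death) was observed at that time,
            0 if the patient was censored.
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == current_time:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((current_time, survival))
        # Both deaths and censored patients leave the risk set.
        at_risk -= leaving
    return curve


# Toy cohort (made up): the patient at time 2 is censored.
curve = kaplan_meier(times=[1, 2, 3, 4], events=[1, 0, 1, 1])
```

Censored patients do not lower the curve themselves, but they shrink the risk set, so later events cause larger drops.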


2019 ◽  
Vol 8 (2S11) ◽  
pp. 2353-2355 ◽  

Human health is more important than anything else in the world, and one should take care of it. Among various diseases, cancer is one of the most terrible and deadly, so it is necessary to predict it at an early stage. In this paper, different feature selection methods are used for feature extraction, together with different classification methods, to identify breast cancer. Breast cancer data are taken from the UCI repository and processed using the WEKA tool, and the proposed techniques are applied to classify the data accurately. This study shows that a data mining approach is suitable for predicting breast cancer.


Cancers ◽  
2020 ◽  
Vol 12 (6) ◽  
pp. 1559
Author(s):  
Jiande Wu ◽  
Tarun Karthik Kumar Mamidi ◽  
Lu Zhang ◽  
Chindo Hicks

Background: The recent surge of next generation sequencing of breast cancer genomes has enabled development of comprehensive catalogues of somatic mutations and expanded the molecular classification of subtypes of breast cancer. However, somatic mutations and gene expression data have not been leveraged and integrated with epigenomic data to unravel the genomic-epigenomic interaction landscape of triple negative breast cancer (TNBC) and non-triple negative breast cancer (non-TNBC). Methods: We performed integrative data analysis combining somatic mutation, epigenomic and gene expression data from The Cancer Genome Atlas (TCGA) to unravel the possible oncogenic interactions between genomic and epigenomic variation in TNBC and non-TNBC. We hypothesized that within breast cancers, there are differences in somatic mutation, DNA methylation and gene expression signatures between TNBC and non-TNBC. We further hypothesized that genomic and epigenomic alterations affect gene regulatory networks and signaling pathways driving the two types of breast cancer. Results: The investigation revealed somatic mutation, epigenomic and gene expression signatures unique to TNBC and non-TNBC and signatures distinguishing the two types of breast cancer. In addition, the investigation revealed molecular networks and signaling pathways enriched for somatic mutations and epigenomic changes unique to each type of breast cancer. The most significant pathways for TNBC were: retinal biosynthesis, BAG2, LXR/RXR, EIF2 and P2Y purinergic receptor signaling pathways. The most significant pathways for non-TNBC were: UVB-induced MAPK, PCP, Apelin endothelial, endoplasmic reticulum stress and mechanisms of viral exit from host signaling pathways. Conclusion: The investigation revealed integrated genomic, epigenomic and gene expression signatures and signaling pathways unique to TNBC and non-TNBC, and a gene signature distinguishing the two types of breast cancer. The study demonstrates that integrative analysis of multi-omics data is a powerful approach for unravelling the genomic-epigenomic interaction landscape in TNBC and non-TNBC.


2019 ◽  
Vol 12 (4) ◽  
pp. 317-328 ◽  
Author(s):  
Rajalakshmi Krishnamurthi ◽  
Niyati Aggrawal ◽  
Lokendra Sharma ◽  
Diva Srivastava ◽  
Shivangi Sharma

Background: Breast cancer is one of the most common forms of cancer among women and the leading cause of death among them. Countries like the United States, England and Canada report a high number of breast cancer patients every year, and this number is continuously increasing due to detection at later stages. Hence, it is very important to create awareness among women and to develop algorithms that help detect malignant cancer. Several research studies have been conducted to analyze breast cancer data. Objective: This paper presents an effective method for predicting breast cancer and its stage, and also analyzes the performance of different supervised learning techniques, such as the Random Classifier and the Chi-square test, used for prediction. The paper focuses on three important aspects: feature selection, the corresponding data visualisation, and finally making a prediction call with different machine learning models. Methods: The dataset used for this work is the breast cancer Wisconsin data taken from the UCI library. The dataset is used to show which of its 32 features are important and how this can be established using data visualisation. After feature selection, different machine learning models are applied. Conclusion: The machine learning models involved are Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Random Forest, Principal Component Analysis (PCA), and Neural Network using Perceptron (NNP). This has been done to check which type of model is better under what conditions. At different stages, several charts have been plotted and eliminated based on relative comparison. Results show that the Random Tree classifier combined with the Chi-square test is an efficient choice.
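As a rough illustration of chi-square feature selection, the Pearson chi-square statistic between a categorical feature and the class label can be computed from their contingency table. The sketch below is an assumption-laden toy, not the paper's code: the function name and data are hypothetical, and the features are assumed to be already discretised (the Wisconsin features are continuous and would need binning or a different scoring variant).

```python
def chi_square_score(feature, labels):
    """Pearson chi-square statistic between a categorical feature and labels.

    feature and labels are equal-length lists of category values.
    A larger score means the feature is less independent of the class.
    """
    n = len(labels)
    score = 0.0
    for f_val in set(feature):
        for c_val in set(labels):
            observed = sum(
                1 for f, c in zip(feature, labels) if f == f_val and c == c_val
            )
            # Expected count under independence of feature and class.
            expected = feature.count(f_val) * labels.count(c_val) / n
            score += (observed - expected) ** 2 / expected
    return score


# Toy data (made up): 1 = malignant, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "informative": [1, 1, 1, 0, 0, 0],  # tracks the label exactly
    "noisy": [1, 0, 1, 0, 1, 0],        # only weakly related to the label
}

scores = {name: chi_square_score(values, labels) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Features would then be kept in ranking order up to some cut-off before a classifier such as a random forest is trained on the reduced set.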

