Prediction of Breast Cancer Metastasis by Gene Expression Profiles: A Comparison of Metagenes and Single Genes

2012 ◽  
Vol 11 ◽  
pp. CIN.S10375 ◽  
Author(s):  
Mark Burton ◽  
Mads Thomassen ◽  
Qihua Tan ◽  
Torben A. Kruse

Background The widespread use of microarrays in cancer research has led to the development of many predictive and prognostic gene expression profiles. However, the diversity of microarray platforms has made it difficult to fully validate such profiles and their related gene lists across studies, and their classification accuracies have rarely been validated in multiple independent datasets. Frequently, although the individual genes in such lists do not match, genes with the same function appear across the lists. The development of such lists does not take into account that genes can be grouped together as metagenes (MGs) based on common characteristics such as pathways, regulation, or genomic location. Such MGs might be used as features in building a predictive model applicable for classifying independent data. There is, therefore, a need to systematically compare the independent validation of gene lists or classifiers based on metagene (MG) or single gene (SG) features.
Methods In this study we compared the performance of metagene- and single gene-based feature sets and classifiers, using random forest and two support vector machines for classifier building. Performance within the same dataset, validation performance of feature sets, and validation performance of entire classifiers in strictly independent datasets were assessed by 10 times repeated 10-fold cross validation, leave-one-out cross validation, and one-fold validation, respectively. To test the significance of the performance difference between MG- and SG-features/classifiers, we used a repeated down-sampled binomial test approach.
Results MG- and SG-feature sets are transferable and perform well for training and testing prediction of metastasis outcome in strictly independent data sets, both between different and within similar microarray platforms, while classifiers performed more poorly when validated in strictly independent datasets. The study showed that MG- and SG-feature sets perform equally well in classifying independent data. Furthermore, SG-classifiers significantly outperformed MG-classifiers when validation was conducted between datasets using similar platforms, while no significant performance difference was found when validation was performed between different platforms.
Conclusion Prediction of metastasis outcome in lymph node–negative patients by MG- and SG-classifiers showed that SG-classifiers performed significantly better than MG-classifiers when validated in independent data based on the same microarray platform as used for developing the classifier. However, the MG- and SG-classifiers had similar performance when classifier validation was conducted in independent data based on a different microarray platform. The latter was also true when only sets of MG- and SG-features were validated in independent datasets, both between and within similar and different platforms.
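
The idea of grouping genes into metagenes can be illustrated with a minimal sketch. The abstract does not say how member genes are combined, so the per-sample averaging used here is an assumption, and the function name `metagene_features`, the gene names, and the group names are all hypothetical. Skipping genes that are absent from a dataset is one possible way to handle platforms that measure different genes; it is not taken from the study.

```python
from statistics import mean

def metagene_features(expression, groups):
    """Collapse single-gene (SG) values into metagene (MG) values.

    expression: dict mapping gene name -> list of per-sample expression values
    groups:     dict mapping metagene name -> list of member gene names
    Averaging is an assumption of this sketch. Genes missing from
    `expression` are skipped; a metagene with no measured member is dropped.
    """
    features = {}
    for name, members in groups.items():
        present = [expression[g] for g in members if g in expression]
        if not present:
            continue
        # zip(*present) iterates over samples; each tuple holds one
        # sample's values for all measured member genes.
        features[name] = [mean(sample_values) for sample_values in zip(*present)]
    return features

# Hypothetical toy data: three genes, two samples, two pathway-based groups.
expression = {"GENE_A": [1.0, 3.0], "GENE_B": [3.0, 5.0], "GENE_C": [10.0, 20.0]}
groups = {"pathway_1": ["GENE_A", "GENE_B", "GENE_X"], "pathway_2": ["GENE_Y"]}
mg = metagene_features(expression, groups)
print(mg)
```

The resulting dictionary can be used as a feature table in place of the single-gene values when training a classifier.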

2004 ◽  
Vol 3 (1) ◽  
pp. 1-19 ◽  
Author(s):  
Minhui Paik ◽  
Yuhong Yang

Various discriminant methods have been applied for classification of tumors based on gene expression profiles, among which the nearest neighbor (NN) method has been reported to perform relatively well. Usually cross-validation (CV) is used to select the neighbor size as well as the number of variables for the NN method. However, CV can perform poorly when there is considerable uncertainty in choosing the best candidate classifier. As an alternative to selecting a single “winner,” we propose a weighting method to combine the multiple NN rules. Four gene expression data sets are used to compare its performance with CV methods. The results show that when the CV selection is unstable, the combined classifier performs much better.
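
A rough sketch of combining several NN rules instead of picking one neighbor size is given below. The abstract does not describe how the weights are derived, so here they are simply supplied by the caller. The function names, the Euclidean distance, and the toy data are assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from math import dist

def knn_vote(train_x, train_y, query, k):
    """Fraction of the k nearest training points (Euclidean) in each class."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    counts = Counter(train_y[i] for i in order[:k])
    return {label: n / k for label, n in counts.items()}

def combined_knn_predict(train_x, train_y, query, weights):
    """Weighted combination of k-NN rules.

    weights: dict mapping neighbor size k -> non-negative weight.
    How the weights are chosen is outside this sketch (caller-supplied).
    """
    total = sum(weights.values())
    scores = Counter()
    for k, w in weights.items():
        for label, frac in knn_vote(train_x, train_y, query, k).items():
            scores[label] += (w / total) * frac
    # The class with the largest weighted vote wins.
    return max(scores, key=scores.get)

# Hypothetical two-gene toy data.
train_x = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_y = ["normal", "normal", "tumor", "tumor", "tumor"]
weights = {1: 0.5, 3: 0.3, 5: 0.2}
print(combined_knn_predict(train_x, train_y, (0.2, 0.1), weights))
```

With these toy weights the 5-NN rule alone would vote "tumor" for the query, but the combined rule follows the 1-NN and 3-NN rules, which carry more weight.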


Author(s):  
Bong-Hyun Kim ◽  
Kijin Yu ◽  
Peter C W Lee

Abstract Motivation Cancer classification based on gene expression profiles has provided insight into the causes of cancer and cancer treatment. Recently, machine learning-based approaches have been attempted in downstream cancer analysis to address the large differences in gene expression values, as determined by single-cell RNA sequencing (scRNA-seq). Results We designed cancer classifiers that can identify 21 types of cancers and normal tissues based on bulk RNA-seq as well as scRNA-seq data. Training was performed with 7398 cancer samples and 640 normal samples from 21 tumors and normal tissues in TCGA based on the 300 most significant genes expressed in each cancer. Then, we compared neural network (NN), support vector machine (SVM), k-nearest neighbors (kNN) and random forest (RF) methods. The NN performed consistently better than the other methods. We further applied our approach to scRNA-seq data transformed by kNN smoothing and found that our model successfully classified cancer types and normal samples. Availability and implementation Cancer classification by neural network. Supplementary information Supplementary data are available at Bioinformatics online.


2011 ◽  
Vol 10 ◽  
pp. CIN.S7789 ◽  
Author(s):  
Hiroshi Matsumoto ◽  
Yoshikuni Yakabe ◽  
Fumiyo Saito ◽  
Koichi Saito ◽  
Kayo Sumida ◽  
...  

We have previously shown that the hepatic gene expression profiles of carcinogens in 28-day toxicity tests clustered into three major groups (Group-1 to 3). Here, we developed a new prediction method for Group-1 carcinogens, which consist mainly of genotoxic rat hepatocarcinogens. The prediction formula was generated by a support vector machine using 5 selected genes as the predictive genes, and a prediction score was introduced to judge carcinogenicity. It correctly predicted the carcinogenicity of all 17 Group-1 chemicals and 22 of 24 non-carcinogens regardless of genotoxicity. In the dose-response study, the prediction score changed from negative to positive as the dose increased, indicating that the characteristic gene expression profile emerged over a range of carcinogen-specific doses. We conclude that the prediction formula can quantitatively predict the carcinogenicity of Group-1 carcinogens. The same method may be applied to other groups of carcinogens to build a total system for prediction of carcinogenicity.


2021 ◽  
Vol 12 ◽  
Author(s):  
Dongfang Jia ◽  
Cheng Chen ◽  
Chen Chen ◽  
Fangfang Chen ◽  
Ningrui Zhang ◽  
...  

Understanding the molecular mechanism of breast cancer (BC) can provide in-depth insight into BC pathology. This study reviewed existing technologies for diagnosing BC, such as mammography, ultrasound, magnetic resonance imaging (MRI), computed tomography (CT), and positron emission tomography (PET), and summarized the disadvantages of existing cancer diagnostic methods. The purpose of this article is to use gene expression profiles from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) to classify BC samples and normal samples. The method proposed in this article overcomes some of the shortcomings of traditional diagnostic methods and can diagnose BC more rapidly, with high sensitivity and without radiation exposure. This study first selected the genes most relevant to cancer through weighted gene co-expression network analysis (WGCNA) and differential expression analysis (DEA). Then it used the protein–protein interaction (PPI) network to screen 23 hub genes. Finally, it used the support vector machine (SVM), decision tree (DT), Bayesian network (BN), artificial neural network (ANN), and the convolutional neural networks CNN-LeNet and CNN-AlexNet to process the expression levels of the 23 hub genes. For gene expression profiles, the ANN model had the best performance in the classification of cancer samples. The ten-run average accuracy is 97.36% (±0.34%), the F1 value is 0.8535 (±0.0260), the sensitivity is 98.32% (±0.32%), the specificity is 89.59% (±3.53%) and the AUC is 0.99. In summary, this method effectively classifies cancer samples and normal samples and provides reasonable new ideas for the early diagnosis of cancer in the future.
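
The reported accuracy, sensitivity, specificity and F1 value follow from the four confusion-matrix counts. A minimal sketch using the standard definitions is shown below. The function name `binary_metrics` and the toy labels are hypothetical, and the abstract does not state which class was treated as positive for each figure. AUC is omitted because it needs classifier scores, not hard labels.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Standard binary classification metrics from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall on positives
    specificity = tn / (tn + fp) if tn + fp else 0.0   # recall on negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical labels: 1 = cancer sample, 0 = normal sample.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because F1 depends on which class is treated as positive, its value can differ markedly from sensitivity when the classes are imbalanced.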


Blood ◽  
2004 ◽  
Vol 104 (11) ◽  
pp. 993-993
Author(s):  
Wolfgang Kern ◽  
Alexander Kohlmann ◽  
Claudia Schoch ◽  
Martin Dugas ◽  
Sylvia Merk ◽  
...  

Abstract Diagnosis and classification of acute lymphoblastic leukemias (ALL) and their distinction from biphenotypic acute leukemias (BAL) and acute myeloid leukemias with minimal differentiation (AML M0) is largely based on immunophenotyping. The EGIL classification, adopted by the WHO classification, defines 4 different subtypes of both B-precursor and T-precursor ALL as well as detailed criteria for BAL. Specific cytogenetic features useful for classification are found in some cases only. We analyzed gene expression profiles in 173 such patients (Pro-B-ALL n=25, c-ALL/Pre-B-ALL n=65 (with t(9;22) n=35, without t(9;22) n=30), mature B-ALL n=13, Pro-T-ALL n=6, Pre-T-ALL n=13, cortical T-ALL n=20, BAL (myeloid and T-lineage) n=17, AML M0 n=14). All cases were assessed by cytomorphology, immunophenotyping, cytogenetics, and molecular genetics. All cases with Pro-B-ALL had t(4;11)/MLL-AF4, all cases with mature B-ALL had t(8;14). Samples were hybridized to both U133A and U133B microarrays (Affymetrix). The top 300 differentially expressed genes were identified for each group in comparison to all other groups and to individual other groups and were used for classification by various Support Vector Machines (SVM) with 10-fold cross validation (CV). Prediction accuracy for discriminating T- from B-precursor ALL was 100%. Accordingly, principal component analysis (PCA) yielded a complete separation of both groups. PCA of B-precursor ALL cases showed distinct clusters for Pro-B-ALL, c-ALL/Pre-B-ALL, and mature B-ALL; however, c-ALL/Pre-B-ALL with t(9;22) were not completely discriminated from those without. Accordingly, classifying B-precursor ALL with SVM resulted in an 87.4% accuracy. Pre-T-ALL cases clustered distinctly from cortical T-ALL with the exception of two cases. The other Pre-T-ALLs clustered together with Pro-T-ALL. Analyzing T-precursor ALL with SVM and 10-fold CV resulted in an accuracy of only 56.4%. 
Including BAL and AML M0 into these analyses revealed significant overlaps between samples from these entities and T-ALL cases in PCA; prediction accuracy using SVM and 10-fold CV was 79.8%. This accuracy was confirmed applying 100 runs of SVM with 2/3 of samples being randomly selected as training set and 1/3 as test set which resulted in a median accuracy of 77.2% (range, 67.5% to 85.1%). A 100% prediction accuracy was achieved in Pro-B-ALL and mature B-ALL. Misclassifications were: c-ALL/Pre-B-ALL with t(9;22) as c-ALL/Pre-B-ALL without t(9;22) (6/35) and vice versa (6/30). Of the 13 Pre-T-ALL cases 4 were classified as BAL and 3 as cortical T-ALL. Of the 6 Pro-T-ALL cases 2 were classified as AML M0, 3 as BAL, and 1 as Pre-T-ALL. Of the 17 BAL cases 2 were classified as AML M0, 1 as c-ALL/Pre-B-ALL, 2 as Pre-T-ALL, and 1 as Pro-T-ALL. These analyses confirm that gene expression profiles allow the identification of Pro-B-ALL with t(4;11) and mature B-ALL with t(8;14) but do not unequivocally identify the presence of t(9;22) in c-ALL/Pre-B-ALL. Cortical T-ALL are characterized by a specific gene expression profile which is, however, shared by few cases currently diagnosed as Pre-T-ALL. Thus, diagnostic criteria (surface expression of CD1a only) should be optimized. The same applies to diagnostic criteria for more immature T-ALL, BAL, and AML M0. Loss of 5q is frequently observed in all of these latter entities and may be a future diagnostic marker superseding flow cytometry.
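
The confirmation step described above (100 runs, 2/3 of samples randomly selected for training and 1/3 for testing, reported as median accuracy and range) can be sketched as follows. A nearest-centroid classifier stands in for the SVM so the example needs only the standard library. That substitution, the function names, the fixed seed, and the toy data are all assumptions of this sketch, and the splits are not stratified by class.

```python
import random
from math import dist
from statistics import mean, median

def nearest_centroid_fit(xs, ys):
    """Per-class mean vector (stand-in for the SVM used in the abstract)."""
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    return min(centroids, key=lambda label: dist(centroids[label], x))

def repeated_holdout(xs, ys, runs=100, train_fraction=2 / 3, seed=0):
    """Median, minimum and maximum test accuracy over random splits."""
    rng = random.Random(seed)
    n_train = round(len(xs) * train_fraction)
    accuracies = []
    for _ in range(runs):
        idx = list(range(len(xs)))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        model = nearest_centroid_fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(nearest_centroid_predict(model, xs[i]) == ys[i] for i in test)
        accuracies.append(hits / len(test))
    return median(accuracies), min(accuracies), max(accuracies)

# Hypothetical, well-separated toy data with two classes of six samples each.
xs = [(0.1 * i, 0.0) for i in range(6)] + [(5.0 + 0.1 * i, 5.0) for i in range(6)]
ys = ["B"] * 6 + ["T"] * 6
median_acc, low, high = repeated_holdout(xs, ys)
print(median_acc, low, high)
```

Reporting the range alongside the median shows how much the estimate depends on which samples happen to fall into the training set.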


Blood ◽  
2007 ◽  
Vol 110 (11) ◽  
pp. 2606-2606
Author(s):  
N.A. Johnson ◽  
T. Nayar ◽  
S.S. Dave ◽  
G. Wright ◽  
A. Rosenwald ◽  
...  

Abstract Background: FL is a common NHL that has a broad spectrum of clinical outcomes. Over time some pts will transform to an aggressive histology (TLy) associated with inferior survival. In 2004, the LLMPP constructed a model that was predictive of overall survival (OS) based on the gene expression profiles (GEP) of 191 specimens taken from pts with untreated FL. The genes associated with survival were derived from the non-neoplastic immune response (IR) cells. However, the risk of developing TLy was not addressed in this study. Thus we re-analyzed the GEP with updated clinical data. Our goal was to validate our previous model with extended follow-up and to create a model that would predict the risk of developing TLy. Methods: 170 of 191 previously untreated FL pts had updated clinical information but only 142 had transformation outcome. Transformation was defined as biopsy proven DLBCL or clinically based on the presence of at least one of the following: hypercalcemia, a sudden rise in LDH to more than twice baseline, unusual extranodal growth or rapid discordant nodal growth. Raw CEL files from Affymetrix U133A arrays were pre-processed and normalized using Bioconductor’s GCRMA package. Models were developed using the SignS package (http://signs/bioinfo.cnio.es/), with 10 times cross-validation. All gene lists produced in these analyses were then re-tested for association with outcome using Bioconductor’s Globaltest package. Over Representation Analysis of signature components was performed using Dchip. Results: The median OS of these patients was 8 yrs. A new 7-component survival model (85 genes) was developed that was significantly associated with survival (p=2.9×10^-13). In Globaltest, these gene lists were associated with survival at a level of p=2.6×10^-5. The previous model using IR-1 and IR-2 signatures was associated with survival at a level of p=2.6×10^-4. 
Although there is little overlap between the 2 models, the new model confirms the prognostic importance of IR genes and extracellular matrix genes. Interestingly, one component containing 10 genes on chromosome 6q was associated with a superior survival (p<1×10^-7). 27% developed TLy over a median follow-up time of 11.2 yrs (69% biopsy proven). Our transformation model included 53 genes divided into 3 components (p=0.001). The Globaltest analysis for association of these genes with transformation was significant (p=0.018). 54 genes overlapped between the survival genes and transformation genes that were present in >1 cross validation run. These were significantly enriched in genes important in immune response, such as T cell and macrophage activation. Conclusion: Our survival model is stable and confirms the importance of key genes involved in the immune response and lymph node remodeling. It also introduces new genes that are potentially important for survival. Our transformation model may shed light on the mechanisms involved in the progression of FL to DLBCL but it is less stable and less reliable than our survival model at predicting outcome.


2013 ◽  
Vol 850-851 ◽  
pp. 1238-1242
Author(s):  
Tao Chen

Tumor gene expression profiles have few samples relative to their high dimensionality. This paper proposes a classification algorithm based on neighborhood rough sets to improve classification accuracy. It first applies a feature filtering method, the Kruskal-Wallis rank sum test, to select a set of top-ranked related genes, and then applies a neighborhood rough set to these genes to generate an informative gene subset. Finally, an SVM is used to classify the GEP data set. The experimental results indicate that this method can effectively improve classification accuracy and that it generalizes better.
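
The first step of this pipeline, ranking genes with the Kruskal-Wallis rank sum test, can be sketched as follows. Genes are ranked by the H statistic alone, without the tie correction and without converting H to a p-value; both simplifications are assumptions of this sketch, as are the function names and toy data. The neighborhood rough set reduction and the SVM are not shown.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(values, labels):
    """H = 12 / (N (N + 1)) * sum(R_c^2 / n_c) - 3 (N + 1), no tie correction."""
    n = len(values)
    rank_sums, sizes = {}, {}
    for rank, label in zip(average_ranks(values), labels):
        rank_sums[label] = rank_sums.get(label, 0.0) + rank
        sizes[label] = sizes.get(label, 0) + 1
    spread = sum(rank_sums[c] ** 2 / sizes[c] for c in sizes)
    return 12.0 / (n * (n + 1)) * spread - 3 * (n + 1)

def top_genes(expression, labels, top_n):
    """Names of the top_n genes with the largest H statistic."""
    scores = {gene: kruskal_wallis_h(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical toy data: three genes measured in two classes of three samples.
labels = ["a", "a", "a", "b", "b", "b"]
expression = {
    "GENE_1": [1, 2, 3, 4, 5, 6],   # classes fully separated
    "GENE_2": [1, 4, 2, 5, 3, 6],   # partly overlapping
    "GENE_3": [3, 3, 3, 3, 3, 3],   # uninformative
}
selected = top_genes(expression, labels, 2)
print(selected)
```

A larger H means the class rank sums differ more than expected by chance, so the filter keeps the genes whose expression best separates the classes.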


10.29007/3nzw ◽  
2019 ◽  
Author(s):  
Wageesha Rasanjana ◽  
Sandun Rajapaksa ◽  
Indika Perera ◽  
Dulani Meedeniya

Prostate cancer is one of the most common cancers among men around the world. Due to its high heterogeneity, many of the studies carried out to identify the molecular-level causes of the cancer have been only partially successful. Among the techniques used in cancer studies, gene expression profiling is one of the most widely used. Gene expression profiles reveal information about the functionality of genes in different body tissues under different conditions. In order to identify cancer-decisive genes, differential gene expression analysis is carried out using statistical and machine learning methodologies. It helps to extract information about genes that have significant expression differences between healthy tissues and cancerous tissues. In this paper, we discuss a comprehensive supervised classification approach using Support Vector Machine (SVM) models to investigate differentially expressed Y-chromosome genes in prostate cancer. Eight SVM models, tuned to an average accuracy of 98.3%, were used for the analysis. Genes such as CD99 (MIC2), ASMTL, DDX3Y and TXLNGY emerged as the best candidates. Some of our results support existing findings, while others suggest novel candidate genes for prostate cancer.

