Distinguishing three subtypes of hematopoietic cells based on gene expression profiles using a support vector machine

Abstract. Motivation: Cancer classification based on gene expression profiles has provided insight into the causes of cancer and cancer treatment. Recently, machine learning-based approaches have been attempted in downstream cancer analysis to address the large differences in gene expression values, as determined by single-cell RNA sequencing (scRNA-seq). Results: We designed cancer classifiers that can identify 21 types of cancers and normal tissues based on bulk RNA-seq as well as scRNA-seq data. Training was performed with 7398 cancer samples and 640 normal samples from 21 tumors and normal tissues in TCGA, based on the 300 most significant genes expressed in each cancer. We then compared neural network (NN), support vector machine (SVM), k-nearest neighbors (kNN) and random forest (RF) methods. The NN performed consistently better than the other methods. We further applied our approach to scRNA-seq data transformed by kNN smoothing and found that our model successfully classified cancer types and normal samples. Availability and implementation: Cancer classification by neural network. Supplementary information: Supplementary data are available at Bioinformatics online.
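
As a rough illustration of one of the four classifier families compared in this abstract, the sketch below implements a plain k-nearest neighbors (kNN) majority vote in standard-library Python. It is not the authors' code: the two-gene expression vectors, the labels "normal" and "tumor", the function name knn_predict and the choice of Euclidean distance with k = 3 are all assumptions made for the example, and the sketch does not cover kNN smoothing of scRNA-seq data.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (expression vector, class label)


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 3) -> str:
    """Label a query profile by majority vote among its k nearest neighbors."""
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training samples")
    # Sort training samples by Euclidean distance to the query, keep the k closest.
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up two-gene expression profiles, for illustration only.
train_set: List[Sample] = [
    ([0.1, 0.2], "normal"), ([0.0, 0.3], "normal"), ([0.2, 0.1], "normal"),
    ([2.0, 2.1], "tumor"), ([2.2, 1.9], "tumor"), ([1.9, 2.0], "tumor"),
]
prediction_a = knn_predict(train_set, [0.15, 0.15])
prediction_b = knn_predict(train_set, [2.1, 2.0])
```

In a real comparison the feature vector would hold the selected genes rather than two values, and each method would be scored on held-out samples.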


New Short Term Prediction Method for Chemical Carcinogenicity by Hepatic Transcript Profiling following 28-Day Toxicity Tests in Rats

Cancer Informatics ◽

10.4137/cin.s7789 ◽

2011 ◽

Vol 10 ◽

pp. CIN.S7789 ◽

Cited By ~ 4

Author(s):

Hiroshi Matsumoto ◽

Yoshikuni Yakabe ◽

Fumiyo Saito ◽

Koichi Saito ◽

Kayo Sumida ◽

...

Keyword(s):

Gene Expression ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Prediction Method ◽

Predictive Score ◽

Toxicity Tests ◽

Support Vector ◽

Total System ◽

Prediction Formula ◽

Group 1

We previously showed that the hepatic gene expression profiles of carcinogens in 28-day toxicity tests cluster into three major groups (Group-1 to Group-3). Here, we developed a new prediction method for Group-1 carcinogens, which consist mainly of genotoxic rat hepatocarcinogens. The prediction formula was generated by a support vector machine using 5 selected genes as the predictive genes, and a predictive score was introduced to judge carcinogenicity. It correctly predicted the carcinogenicity of all 17 Group-1 chemicals and 22 of 24 non-carcinogens, regardless of genotoxicity. In the dose-response study, the predictive score changed from negative to positive as the dose increased, indicating that the characteristic gene expression profile emerged over a range of carcinogen-specific doses. We conclude that the prediction formula can quantitatively predict the carcinogenicity of Group-1 carcinogens. The same method may be applied to other groups of carcinogens to build a total system for prediction of carcinogenicity.
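
To make the idea of a signed predictive score concrete, here is a minimal sketch that assumes the prediction formula is a linear SVM decision function, score = w . x + b, over the 5 predictive genes. The abstract does not state the kernel, the gene identities or the coefficients, so WEIGHTS, BIAS and the dose profiles below are hypothetical values chosen only to show a score moving from negative to positive with dose.

```python
from typing import Dict, Sequence

# Hypothetical weights and bias for a 5-gene linear decision function.
# These numbers are illustrative only; they are not taken from the study.
WEIGHTS = [0.8, -0.5, 1.2, 0.3, -0.9]
BIAS = -1.0


def predictive_score(expression: Sequence[float]) -> float:
    """Return w . x + b for one sample's 5-gene expression vector."""
    if len(expression) != len(WEIGHTS):
        raise ValueError("expected one expression value per predictive gene")
    return sum(w * x for w, x in zip(WEIGHTS, expression)) + BIAS


def is_predicted_carcinogen(expression: Sequence[float]) -> bool:
    """Read a positive score as a Group-1 carcinogen call."""
    return predictive_score(expression) > 0.0


# Made-up expression log-ratios at increasing doses of one hypothetical chemical.
dose_profiles: Dict[str, Sequence[float]] = {
    "low": [0.1, 0.0, 0.2, 0.1, 0.0],
    "mid": [0.5, -0.2, 0.6, 0.2, -0.3],
    "high": [1.0, -0.6, 1.1, 0.4, -0.8],
}
scores = {dose: predictive_score(x) for dose, x in dose_profiles.items()}
```

With these invented numbers the low dose scores below zero and the higher doses score above zero, which mirrors the sign change described for the dose-response study.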


Prediction of Breast Cancer Metastasis by Gene Expression Profiles: A Comparison of Metagenes and Single Genes

Cancer Informatics ◽

10.4137/cin.s10375 ◽

2012 ◽

Vol 11 ◽

pp. CIN.S10375 ◽

Cited By ~ 3

Author(s):

Mark Burton ◽

Mads Thomassen ◽

Qihua Tan ◽

Torben A. Kruse

Keyword(s):

Gene Expression ◽

Cross Validation ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Microarray Platform ◽

Support Vector ◽

Independent Data ◽

Performance Difference ◽

Feature Sets ◽

Prediction Of Metastasis

Background: The popularity of microarray applications in cancer research has led to the development of predictive or prognostic gene expression profiles. However, the diversity of microarray platforms has made full validation of such profiles and their related gene lists across studies difficult, and classification accuracies are rarely validated in multiple independent datasets. Frequently, while the individual genes in such lists may not match, genes with the same function are included across the lists. Development of such lists does not take into account the fact that genes can be grouped together as metagenes (MGs) based on common characteristics such as pathways, regulation, or genomic location. Such MGs might be used as features in building a predictive model applicable for classifying independent data. There is therefore a need to systematically compare independent validation of gene lists or classifiers based on metagene (MG) or individual gene (SG) features. Methods: In this study we compared the performance of metagene- and single gene-based feature sets and classifiers, using random forest and two support vector machines for classifier building. The performance within the same dataset, feature set validation performance, and validation performance of entire classifiers in strictly independent datasets were assessed by 10 times repeated 10-fold cross validation, leave-one-out cross validation, and one-fold validation, respectively. To test the significance of the performance difference between MG- and SG-features/classifiers, we used a repeated down-sampled binomial test approach. Results: MG- and SG-feature sets are transferable and perform well for training and testing prediction of metastasis outcome in strictly independent datasets, both between different and within similar microarray platforms, while classifiers had a poorer performance when validated in strictly independent datasets.
The study showed that MG- and SG-feature sets perform equally well in classifying independent data. Furthermore, SG-classifiers significantly outperformed MG-classifiers when validation was conducted between datasets using similar platforms, while no significant performance difference was found when validation was performed between different platforms. Conclusion: Prediction of metastasis outcome in lymph node–negative patients by MG- and SG-classifiers showed that SG-classifiers performed significantly better than MG-classifiers when validated in independent data based on the same microarray platform as used for developing the classifier. However, the MG- and SG-classifiers had similar performance when conducting classifier validation in independent data based on a different microarray platform. The latter was also true when only validating sets of MG- and SG-features in independent datasets, both between and within similar and different platforms.
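
The resampling scheme named in the Methods, 10 times repeated 10-fold cross validation, can be sketched as an index generator. This is a generic, unstratified version written for illustration: the function name repeated_kfold, the seed handling and the sample count of 25 are assumptions, and the study's own implementation may differ (for example by stratifying on outcome).

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(n_samples: int, k: int = 10, repeats: int = 10,
                   seed: int = 0) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (repeat, train_idx, test_idx) for repeated k-fold cross validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    rng = random.Random(seed)
    for rep in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for each repeat
        folds = [order[i::k] for i in range(k)]  # k folds of near-equal size
        for held_out in range(k):
            test_idx = sorted(folds[held_out])
            train_idx = sorted(i for j, fold in enumerate(folds)
                               if j != held_out for i in fold)
            yield rep, train_idx, test_idx


# 10 repeats x 10 folds on a hypothetical dataset of 25 samples.
splits = list(repeated_kfold(n_samples=25, k=10, repeats=10))
```

Calling repeated_kfold(n, k=n, repeats=1) holds out one sample at a time, which corresponds to leave-one-out cross validation.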


CLASSIFICATION OF MULTIPLE CANCER TYPES USING FUZZY SUPPORT VECTOR MACHINES AND OUTLIER DETECTION METHODS

Biomedical Engineering Applications Basis and Communications ◽

10.4015/s1016237205000457 ◽

2005 ◽

Vol 17 (06) ◽

pp. 300-308 ◽

Cited By ~ 3

Author(s):

LI-YEH CHUANG ◽

CHENG-HONG YANG ◽

LI-CHENG JIN

Keyword(s):

Support Vector Machine ◽

Outlier Detection ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Detection Methods ◽

Support Vector ◽

Fuzzy Support Vector Machine ◽

Multiple Cancer ◽

Flexible Architecture ◽

Cancer Types

The support vector machine (SVM) is a new learning method and has shown comparable or better results than neural networks in some applications. In this paper, we applied the SVM to classify multiple cancer types by gene expression profiles and exploited some strategies of the SVM method, including fuzzy logic and statistical theories. Using the proposed strategies and outlier detection methods, the FSVM (fuzzy support vector machine) can achieve comparable or better performance than other methods, and provides a more flexible architecture to discriminate between SRBCT and non-SRBCT samples.


Breast Cancer Case Identification Based on Deep Learning and Bioinformatics Analysis

Frontiers in Genetics ◽

10.3389/fgene.2021.628136 ◽

2021 ◽

Vol 12 ◽

Author(s):

Dongfang Jia ◽

Cheng Chen ◽

Chen Chen ◽

Fangfang Chen ◽

Ningrui Zhang ◽

...

Keyword(s):

Breast Cancer ◽

Neural Network ◽

Gene Expression ◽

Expression Profiles ◽

Differential Expression Analysis ◽

Gene Expression Profiles ◽

Diagnostic Methods ◽

The Cancer Genome Atlas ◽

Support Vector ◽

Hub Genes

Mastering the molecular mechanism of breast cancer (BC) can provide an in-depth understanding of BC pathology. This study explored existing technologies for diagnosing BC, such as mammography, ultrasound, magnetic resonance imaging (MRI), computed tomography (CT), and positron emission tomography (PET), and summarized the disadvantages of existing cancer diagnosis methods. The purpose of this article is to use gene expression profiles from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) to classify BC samples and normal samples. The method proposed in this article overcomes some of the shortcomings of traditional diagnostic methods and can diagnose BC more rapidly, with high sensitivity and without radiation exposure. This study first selected the genes most relevant to cancer through weighted gene co-expression network analysis (WGCNA) and differential expression analysis (DEA). Then it used the protein–protein interaction (PPI) network to screen 23 hub genes. Finally, it used the support vector machine (SVM), decision tree (DT), Bayesian network (BN), artificial neural network (ANN), and the convolutional neural networks CNN-LeNet and CNN-AlexNet to process the expression levels of the 23 hub genes. For gene expression profiles, the ANN model has the best performance in the classification of cancer samples. The average accuracy over ten runs is 97.36% (±0.34%), the F1 value is 0.8535 (±0.0260), the sensitivity is 98.32% (±0.32%), the specificity is 89.59% (±3.53%) and the AUC is 0.99. In summary, this method effectively classifies cancer samples and normal samples and provides reasonable new ideas for the early diagnosis of cancer in the future.
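
The metrics reported above (accuracy, F1, sensitivity, specificity) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard formulas; the counts passed in are invented, and treating cancer as the positive class is an assumption, since the abstract does not say which class the F1 value refers to.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute common binary classification metrics from confusion-matrix counts."""
    if min(tp + fn, tn + fp, tp + fp) == 0:
        raise ValueError("each metric needs a non-zero denominator")
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts: 100 cancer samples (positive) and 50 normal samples.
metrics = binary_metrics(tp=90, fp=10, tn=40, fn=10)
```

Because F1 ignores true negatives, its value changes if the normal class is taken as positive instead, so it should be read together with the class definition used.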


Ensemble of Support Vector Machines to Improve the Cancer Class Prediction Based on the Gene Expression Profiles

Advances in Soft Computing - Innovations in Hybrid Intelligent Systems ◽

10.1007/978-3-540-74972-1_51 ◽

2007 ◽

pp. 393-400 ◽

Cited By ~ 4

Author(s):

Ángela Blanco ◽

Manuel Martín-Merino ◽

Javier De Las Rivas

Keyword(s):

Gene Expression ◽

Support Vector Machines ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Support Vector ◽

Class Prediction ◽

Vector Machines


Gene Expression Profiling in Adult Acute Lymphoblastic Leukemia, Biphenotypic Acute Leukemia, and Acute Myeloid Leukemia M0: Confirmation of Immunophenotypic and Cytogenetic Diagnostic Findings.

Blood ◽

10.1182/blood.v104.11.993.993 ◽

2004 ◽

Vol 104 (11) ◽

pp. 993-993

Author(s):

Wolfgang Kern ◽

Alexander Kohlmann ◽

Claudia Schoch ◽

Martin Dugas ◽

Sylvia Merk ◽

...

Keyword(s):

Gene Expression ◽

Diagnostic Criteria ◽

Prediction Accuracy ◽

Expression Profiles ◽

Lymphoblastic Leukemia ◽

Gene Expression Profiles ◽

Support Vector ◽

Specific Gene ◽

Acute Leukemias ◽

Acute Myeloid

Abstract Diagnosis and classification of acute lymphoblastic leukemias (ALL) and their distinction from biphenotypic acute leukemias (BAL) and acute myeloid leukemias with minimal differentiation (AML M0) is largely based on immunophenotyping. The EGIL classification, adopted by the WHO classification, defines 4 different subtypes of both B-precursor and T-precursor ALL as well as detailed criteria for BAL. Specific cytogenetic features useful for classification are found in some cases only. We analyzed gene expression profiles in 173 such patients (Pro-B-ALL n=25, c-ALL/Pre-B-ALL n=65 (with t(9;22) n=35, without t(9;22) n=30), mature B-ALL n=13, Pro-T-ALL n=6, Pre-T-ALL n=13, cortical T-ALL n=20, BAL (myeloid and T-lineage) n=17, AML M0 n=14). All cases were assessed by cytomorphology, immunophenotyping, cytogenetics, and molecular genetics. All cases with Pro-B-ALL had t(4;11)/MLL-AF4, and all cases with mature B-ALL had t(8;14). Samples were hybridized to both U133A and U133B microarrays (Affymetrix). The top 300 differentially expressed genes were identified for each group in comparison to all other groups and to individual other groups, and were used for classification by various Support Vector Machines (SVM) with 10-fold cross validation (CV). Prediction accuracy for discriminating T- from B-precursor ALL was 100%. Accordingly, principal component analysis (PCA) yielded a complete separation of both groups. PCA of B-precursor ALL cases showed distinct clusters for Pro-B-ALL, c-ALL/Pre-B-ALL, and mature B-ALL; however, c-ALL/Pre-B-ALL with t(9;22) were not completely discriminated from those without. Accordingly, classifying B-precursor ALL with SVM resulted in an 87.4% accuracy. Pre-T-ALL cases clustered distinctly from cortical T-ALL, with the exception of two cases. The other Pre-T-ALLs clustered together with Pro-T-ALL. Analyzing T-precursor ALL with SVM and 10-fold CV resulted in an accuracy of only 56.4%.
Including BAL and AML M0 in these analyses revealed significant overlaps between samples from these entities and T-ALL cases in PCA; prediction accuracy using SVM and 10-fold CV was 79.8%. This accuracy was confirmed by applying 100 runs of SVM with 2/3 of samples randomly selected as the training set and 1/3 as the test set, which resulted in a median accuracy of 77.2% (range, 67.5% to 85.1%). A 100% prediction accuracy was achieved in Pro-B-ALL and mature B-ALL. Misclassifications were: c-ALL/Pre-B-ALL with t(9;22) as c-ALL/Pre-B-ALL without t(9;22) (6/35) and vice versa (6/30). Of the 13 Pre-T-ALL cases, 4 were classified as BAL and 3 as cortical T-ALL. Of the 6 Pro-T-ALL cases, 2 were classified as AML M0, 3 as BAL, and 1 as Pre-T-ALL. Of the 17 BAL cases, 2 were classified as AML M0, 1 as c-ALL/Pre-B-ALL, 2 as Pre-T-ALL, and 1 as Pro-T-ALL. These analyses confirm that gene expression profiles allow the identification of Pro-B-ALL with t(4;11) and mature B-ALL with t(8;14) but do not unequivocally identify the presence of t(9;22) in c-ALL/Pre-B-ALL. Cortical T-ALL are characterized by a specific gene expression profile which is, however, shared by a few cases currently diagnosed as Pre-T-ALL. Thus, diagnostic criteria (surface expression of CD1a only) should be optimized. The same applies to diagnostic criteria for more immature T-ALL, BAL, and AML M0. Loss of 5q is frequently observed in all of these latter entities and may be a future diagnostic marker superseding flow cytometry.
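
The misclassification counts listed above can be turned into per-class recall with simple arithmetic. The sketch below does this under two assumptions that the abstract does not confirm: that the listed errors are exhaustive for each class shown, and that they all come from the same classification run. Classes whose errors are not itemized (cortical T-ALL and AML M0) are left out.

```python
from typing import Dict

# Class sizes and misclassification counts as listed in the abstract.
class_sizes: Dict[str, int] = {
    "c-ALL/Pre-B-ALL with t(9;22)": 35,
    "c-ALL/Pre-B-ALL without t(9;22)": 30,
    "Pre-T-ALL": 13,
    "Pro-T-ALL": 6,
    "BAL": 17,
}
misclassified_as: Dict[str, Dict[str, int]] = {
    "c-ALL/Pre-B-ALL with t(9;22)": {"c-ALL/Pre-B-ALL without t(9;22)": 6},
    "c-ALL/Pre-B-ALL without t(9;22)": {"c-ALL/Pre-B-ALL with t(9;22)": 6},
    "Pre-T-ALL": {"BAL": 4, "cortical T-ALL": 3},
    "Pro-T-ALL": {"AML M0": 2, "BAL": 3, "Pre-T-ALL": 1},
    "BAL": {"AML M0": 2, "c-ALL/Pre-B-ALL": 1, "Pre-T-ALL": 2, "Pro-T-ALL": 1},
}


def per_class_recall(sizes: Dict[str, int],
                     errors: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Fraction of each class assigned its own label (correct / class size)."""
    recall = {}
    for label, n in sizes.items():
        wrong = sum(errors.get(label, {}).values())
        recall[label] = (n - wrong) / n
    return recall


recall = per_class_recall(class_sizes, misclassified_as)
```

Read this way, all six Pro-T-ALL cases and 7 of the 13 Pre-T-ALL cases were assigned to other classes, which is in line with the abstract's call to revisit the diagnostic criteria for immature T-ALL.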


Classification Algorithm on Gene Expression Profiles of Tumor Using Neighborhood Rough Set and Support Vector Machine

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.850-851.1238 ◽

2013 ◽

Vol 850-851 ◽

pp. 1238-1242

Author(s):

Tao Chen

Keyword(s):

Gene Expression ◽

Rough Set ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Classification Algorithm ◽

Support Vector ◽

Data Set ◽

Filtering Method ◽

Neighborhood Rough Set ◽

Feature Filtering

Gene expression profiles of tumors have a limited number of samples in comparison to the high dimensionality of the samples; this paper proposes a classification algorithm based on the neighborhood rough set to improve classification accuracy. The paper first applies a feature filtering method, the Kruskal-Wallis rank sum test, to select a set of top-ranked related genes, and then applies the neighborhood rough set to these genes to generate an informative gene subset. Finally, an SVM is used to classify the GEP data set. The experimental results indicate that this method can effectively improve classification accuracy and that it generalizes better.
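
The first step of this pipeline, filtering genes with the Kruskal-Wallis rank sum test, can be sketched by ranking genes on the H statistic. The version below is a simplified illustration: it uses average ranks for ties but omits the tie-correction factor, the gene names and expression values are made up, and the neighborhood rough set reduction and SVM steps are not shown.

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(values: Sequence[float], labels: Sequence[str]) -> float:
    """H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), no tie correction."""
    n = len(values)
    rank_sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for rank, label in zip(average_ranks(values), labels):
        rank_sums[label] = rank_sums.get(label, 0.0) + rank
        counts[label] = counts.get(label, 0) + 1
    total = sum(rank_sums[c] ** 2 / counts[c] for c in rank_sums)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


def top_genes(expression: Dict[str, Sequence[float]],
              labels: Sequence[str], k: int) -> List[str]:
    """Return the k genes with the largest H statistic across classes."""
    ranked = sorted(expression,
                    key=lambda g: kruskal_wallis_h(expression[g], labels),
                    reverse=True)
    return ranked[:k]


# Made-up data: six samples from two classes, three hypothetical genes.
sample_labels = ["A", "A", "A", "B", "B", "B"]
toy_expression = {
    "gene_x": [1, 2, 3, 4, 5, 6],   # classes fully separated
    "gene_y": [1, 4, 2, 5, 3, 6],   # partial separation
    "gene_z": [1, 6, 3, 4, 2, 5],   # almost no separation
}
selected = top_genes(toy_expression, sample_labels, k=2)
```

A larger H means the gene's ranks differ more between classes, so keeping the top-ranked genes shrinks the feature space before the later reduction and classification steps.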
