On the use of topological features of metabolic networks for the classification of cancer samples

2021 ◽  
Vol 22 ◽  
Author(s):  
Jeaneth Machicao ◽  
Francesco Craighero ◽  
Davide Maspero ◽  
Fabrizio Angaroni ◽  
Chiara Damiani ◽  
...  

Background: The increasing availability of omics data collected from patients affected by severe pathologies, such as cancer, is fostering the development of data science methods for their analysis. Introduction: The combination of data integration and machine learning approaches can provide new powerful instruments to tackle the complexity of cancer development and deliver effective diagnostic and prognostic strategies. Methods: We explore the possibility of exploiting the topological properties of sample-specific metabolic networks as features in a supervised classification task. Such networks are obtained by projecting transcriptomic data from RNA-seq experiments on genome-wide metabolic models to define weighted networks modeling the overall metabolic activity of a given sample. Results: We show the classification results on a labeled breast cancer dataset from the TCGA database, including 210 samples (cancer vs. normal). In particular, we investigate how the performance is affected by a threshold-based pruning of the networks by comparing Artificial Neural Networks, Support Vector Machines and Random Forests. Interestingly, the best classification performance is achieved within a small threshold range for all methods, suggesting that it might represent an effective choice to recover useful information while filtering out noise from data. Overall, the best accuracy is achieved with SVMs, which exhibit performances similar to those obtained when gene expression profiles are used as features. Conclusion: These findings demonstrate that the topological properties of sample-specific metabolic networks are effective in classifying cancer and normal samples, suggesting that useful information can be extracted from a relatively limited number of features.
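
The threshold-based pruning and feature extraction described in this abstract can be illustrated with a minimal sketch. The abstract does not say which topological properties the authors compute, so the descriptors below (node count, edge count, mean degree, mean strength, density) are assumptions chosen for illustration. The function names and the toy edge weights are likewise hypothetical.

```python
def prune_edges(edges, threshold):
    """Keep only the edges whose weight is at least `threshold`."""
    return {pair: w for pair, w in edges.items() if w >= threshold}


def topological_features(edges):
    """Compute a few simple descriptors of an undirected weighted graph.

    `edges` maps (node_a, node_b) pairs to weights; self-loops are assumed absent.
    """
    degree = {}
    strength = {}
    for (u, v), w in edges.items():
        for node in (u, v):
            degree[node] = degree.get(node, 0) + 1
            strength[node] = strength.get(node, 0.0) + w
    n = len(degree)
    m = len(edges)
    if n == 0:
        return {"nodes": 0, "edges": 0, "mean_degree": 0.0,
                "mean_strength": 0.0, "density": 0.0}
    return {
        "nodes": n,
        "edges": m,
        "mean_degree": sum(degree.values()) / n,
        "mean_strength": sum(strength.values()) / n,
        # Fraction of possible undirected edges that are present.
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
    }


# Hypothetical sample-specific network: edge weights stand in for metabolic activity.
sample_network = {("A", "B"): 0.9, ("B", "C"): 0.2, ("C", "D"): 0.6, ("A", "D"): 0.05}
features = topological_features(prune_edges(sample_network, threshold=0.1))
```

Each sample would yield one such feature vector, which could then be passed to any of the classifiers being compared. Repeating the computation over a range of thresholds mirrors the pruning analysis the abstract reports.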

2019 ◽  
Vol 8 (4) ◽  
pp. 4879-4881

Breast cancer is one of the most dreadful diseases and a potential cause of death in women. Every year, the death rate due to breast cancer increases drastically. Data mining techniques such as classification offer an effective way to categorize data, which is especially useful in the medical field, where diagnosis and analysis rely on these techniques. The Wisconsin Breast Cancer dataset is used to compare SVM, Logistic Regression, Naïve Bayes and Random Forest. The main objective is to determine the efficiency of the algorithms by evaluating their correctness in classifying data in terms of accuracy and time consumption. Based on the results of the experiments performed, the Random Forest algorithm shows the highest accuracy (99.76%) with the lowest error rate. The ANACONDA Data Science Platform is used to execute all the experiments in a simulated environment.


2020 ◽  
Vol 14 ◽  

Breast Cancer (BC) is among the most common and leading causes of death in women throughout the world. Recently, classification and data analysis tools have become widely used in the medical field for diagnosis, prognosis and decision making, to help lower the risk of people dying or suffering from diseases. Advanced machine learning methods have given hope to patients, as they have helped doctors detect potentially fatal diseases such as Breast Cancer early and provide accurate outcomes. However, the results depend heavily on the techniques used for feature selection and classification, which determine the strength of the resulting machine learning model. In this paper, a performance comparison is conducted using four classifiers, namely Multilayer Perceptron (MLP), Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Random Forest, on the Wisconsin Breast Cancer dataset to identify the most effective predictors. The main goal is to apply the best machine learning classification methods to predict Breast Cancer as benign or malignant, evaluated using metrics such as accuracy, f-measure, precision and recall. Experimental results show that Random Forest achieves the highest accuracy of 99.26% on this dataset and feature set, while SVM and KNN show 97.78% and 97.04% accuracy, respectively. MLP shows the lowest accuracy, 94.07%. All the experiments are conducted using RStudio as the data mining platform.
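
The four evaluation terms named in this abstract (accuracy, precision, recall and f-measure) follow directly from the confusion counts of a binary classifier. The sketch below shows the standard definitions. The function name and the toy labels are hypothetical and are not taken from the paper, which ran its experiments in RStudio.

```python
def binary_metrics(y_true, y_pred, positive="malignant"):
    """Return accuracy, precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical labels for five samples, for illustration only.
truth = ["malignant", "malignant", "benign", "benign", "benign"]
predicted = ["malignant", "benign", "benign", "benign", "malignant"]
scores = binary_metrics(truth, predicted)
```

Reporting precision and recall alongside accuracy matters for diagnostic data, because a classifier can reach high accuracy on an imbalanced dataset while still missing many malignant cases.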


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Pooja Rani ◽  
Rajneesh Kumar ◽  
Anurag Jain

Purpose: Decision support systems developed using machine learning classifiers have become a valuable tool in predicting various diseases. However, the performance of these systems is adversely affected by the missing values in medical datasets. Imputation methods are used to predict these missing values. In this paper, a new imputation method called hybrid imputation optimized by the classifier (HIOC) is proposed to predict missing values efficiently. Design/methodology/approach: The proposed HIOC is developed by using a classifier to combine multivariate imputation by chained equations (MICE), K nearest neighbor (KNN), mean and mode imputation methods in an optimum way. The performance of HIOC has been compared to that of the MICE, KNN, and mean and mode methods. Four classifiers, namely support vector machine (SVM), naive Bayes (NB), random forest (RF) and decision tree (DT), have been used to evaluate the performance of the imputation methods. Findings: The results show that HIOC performed efficiently even with a high rate of missing values. It reduced root mean square error (RMSE) by up to 17.32% in the heart disease dataset and 34.73% in the breast cancer dataset. Correct prediction of missing values improved the accuracy of the classifiers in predicting diseases. It increased classification accuracy by up to 18.61% in the heart disease dataset and 6.20% in the breast cancer dataset. Originality/value: The proposed HIOC is a new hybrid imputation method that can efficiently predict missing values in any medical dataset.
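
As a rough illustration of how an imputation method can be scored with RMSE, the sketch below hides known values, fills them by mean imputation (one of the baseline methods named in the abstract), and measures the error on the hidden positions only. It is not the HIOC method, whose details the abstract does not give. The function names and the toy column are hypothetical.

```python
from math import sqrt


def mean_impute(column):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [x for x in column if x is not None]
    fill = sum(observed) / len(observed)
    return [fill if x is None else x for x in column]


def rmse_on_missing(true_column, masked_column, imputed_column):
    """Root mean square error, computed only where values were masked out."""
    errors = [
        (t - i) ** 2
        for t, m, i in zip(true_column, masked_column, imputed_column)
        if m is None
    ]
    return sqrt(sum(errors) / len(errors))


# Hypothetical feature column with two values deliberately hidden.
true_values = [1.0, 2.0, 3.0, 4.0]
masked = [1.0, None, 3.0, None]
imputed = mean_impute(masked)
error = rmse_on_missing(true_values, masked, imputed)
```

Masking values whose ground truth is known is what makes an RMSE comparison between imputation methods possible. The same harness could score KNN or MICE imputations by swapping out `mean_impute`.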


Author(s):  
Bong-Hyun Kim ◽  
Kijin Yu ◽  
Peter C W Lee

Motivation: Cancer classification based on gene expression profiles has provided insight on the causes of cancer and cancer treatment. Recently, machine learning-based approaches have been attempted in downstream cancer analysis to address the large differences in gene expression values, as determined by single-cell RNA sequencing (scRNA-seq). Results: We designed cancer classifiers that can identify 21 types of cancers and normal tissues based on bulk RNA-seq as well as scRNA-seq data. Training was performed with 7398 cancer samples and 640 normal samples from 21 tumors and normal tissues in TCGA based on the 300 most significant genes expressed in each cancer. Then, we compared neural network (NN), support vector machine (SVM), k-nearest neighbors (kNN) and random forest (RF) methods. The NN performed consistently better than other methods. We further applied our approach to scRNA-seq transformed by kNN smoothing and found that our model successfully classified cancer types and normal samples. Availability and implementation: Cancer classification by neural network. Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 11 ◽  
Author(s):  
Long Li ◽  
Jing Ye ◽  
Houhua Li ◽  
Qianqian Shi

Primula vulgaris exhibits a wide range of flower colors and is a valuable ornamental plant. The combination of flavonols/anthocyanins and carotenoids provides various colorations ranging from yellow to violet-blue. However, the complex metabolic networks and molecular mechanisms underlying the different flower colors of P. vulgaris remain unclear. Based on comprehensive analysis of morphological anatomy, metabolites, and gene expression in different-colored flowers of P. vulgaris, the mechanisms relating color-determining compounds to gene expression profiles were revealed. In the case of P. vulgaris flower color, hirsutin, rosinin, petunidin-, and cyanidin-type anthocyanins and the copigment herbacetin contributed to the blue coloration, whereas peonidin-, cyanidin-, and delphinidin-type anthocyanins showed high accumulation levels in pink flowers. Blue and pink color formation occurred mainly via the regulation of F3′5′H (c53168), AOMT (c47583, c44905), and 3GT (c50034). Yellow coloration was mainly due to gossypetin and carotenoid, which were regulated by F3H (c43100), F3 1 (c53714), 3GT (c53907) as well as many carotenoid biosynthetic pathway-related genes. Co-expression network and transient expression analysis suggested a potential direct link between flavonoid and carotenoid biosynthetic pathways through MYB transcription factor regulation. This work reveals that transcription changes influence physiological and biochemical characteristics and subsequently result in flower coloration in P. vulgaris.


2011 ◽  
Vol 10 ◽  
pp. CIN.S7789 ◽  
Author(s):  
Hiroshi Matsumoto ◽  
Yoshikuni Yakabe ◽  
Fumiyo Saito ◽  
Koichi Saito ◽  
Kayo Sumida ◽  
...  

We have previously shown that the hepatic gene expression profiles of carcinogens in 28-day toxicity tests clustered into three major groups (Group-1 to 3). Here, we developed a new prediction method for Group-1 carcinogens, which consist mainly of genotoxic rat hepatocarcinogens. The prediction formula was generated by a support vector machine using 5 selected genes as the predictive genes, and a predictive score was introduced to judge carcinogenicity. It correctly predicted the carcinogenicity of all 17 Group-1 chemicals and 22 of 24 non-carcinogens regardless of genotoxicity. In the dose-response study, the prediction score changed from negative to positive as the dose increased, indicating that the characteristic gene expression profile emerged over a range of carcinogen-specific doses. We conclude that the prediction formula can quantitatively predict the carcinogenicity of Group-1 carcinogens. The same method may be applied to other groups of carcinogens to build a total system for prediction of carcinogenicity.


2012 ◽  
Vol 11 ◽  
pp. CIN.S10375 ◽  
Author(s):  
Mark Burton ◽  
Mads Thomassen ◽  
Qihua Tan ◽  
Torben A. Kruse

Background: The popularity of a large number of microarray applications has, in cancer research, led to the development of predictive or prognostic gene expression profiles. However, the diversity of microarray platforms has made the full validation of such profiles and their related gene lists across studies difficult and, at the level of classification accuracies, rarely validated in multiple independent datasets. Frequently, while the individual genes between such lists may not match, genes with the same function are included across such gene lists. Development of such lists does not take into account the fact that genes can be grouped together as metagenes (MGs) based on common characteristics such as pathways, regulation, or genomic location. Such MGs might be used as features in building a predictive model applicable for classifying independent data. It is, therefore, demanding to systematically compare independent validation of gene lists or classifiers based on metagene or individual gene (SG) features. Methods: In this study we compared the performance of either metagene- or single gene-based feature sets and classifiers using random forest and two support vector machines for classifier building. The performance within the same dataset, feature set validation performance, and validation performance of entire classifiers in strictly independent datasets were assessed by 10 times repeated 10-fold cross validation, leave-one-out cross validation, and one-fold validation, respectively. To test the significance of the performance difference between MG- and SG-features/classifiers, we used a repeated down-sampled binomial test approach. Results: MG- and SG-feature sets are transferable and perform well for training and testing prediction of metastasis outcome in strictly independent data sets, both between different and within similar microarray platforms, while classifiers had a poorer performance when validated in strictly independent datasets. The study showed that MG- and SG-feature sets perform equally well in classifying independent data. Furthermore, SG-classifiers significantly outperformed MG-classifiers when validation was conducted between datasets using similar platforms, while no significant performance difference was found when validation was performed between different platforms. Conclusion: Prediction of metastasis outcome in lymph node–negative patients by MG- and SG-classifiers showed that SG-classifiers performed significantly better than MG-classifiers when validated in independent data based on the same microarray platform as used for developing the classifier. However, the MG- and SG-classifiers had similar performance when conducting classifier validation in independent data based on a different microarray platform. The latter was also true when only validating sets of MG- and SG-features in independent datasets, both between and within similar and different platforms.
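
The "10 times repeated 10-fold cross validation" scheme mentioned in the Methods can be sketched as an index generator. This is a generic illustration of the resampling procedure, not the authors' code. The function name, the seed handling and the small sample counts in the usage line are assumptions.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV.

    Each repeat reshuffles the samples, so every sample is tested once per repeat.
    """
    rng = random.Random(seed)
    for repeat in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [indices[i::k] for i in range(k)]
        for fold_id, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != fold_id for i in fold]
            yield repeat, fold_id, train, test


# Small illustrative run: 25 samples, 5 folds, 2 repeats.
splits = list(repeated_kfold(25, k=5, repeats=2))
```

Averaging a classifier's score over all `k * repeats` splits reduces the dependence of the estimate on any single random partition, which is the usual reason for repeating k-fold cross validation. Leave-one-out cross validation corresponds to `k = n_samples` with a single repeat.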


2005 ◽  
Vol 14 (04) ◽  
pp. 641-660 ◽  
Author(s):  
MICHIHIRO KURAMOCHI ◽  
GEORGE KARYPIS

As various genome sequencing projects have already been completed or are near completion, genome researchers are shifting their focus to functional genomics. Functional genomics represents the next phase, which expands the biological investigation to studying the functionality of genes of a single organism as well as studying and correlating the functionality of genes across many different organisms. Recently developed methods for monitoring genome-wide mRNA expression changes hold the promise of allowing us to inexpensively gain insights into the function of unknown genes. In this paper we focus on evaluating the feasibility of using supervised machine learning methods for determining the function of genes based solely on their expression profiles. We experimentally evaluate the performance of traditional classification algorithms such as support vector machines and k-nearest neighbors on the yeast genome, and present new approaches for classification that improve the overall recall with moderate reductions in precision. Our experiments show that the accuracies achieved for different classes vary dramatically. In analyzing these results we show that the achieved accuracy is highly dependent on whether or not the genes of that class were significantly active during the various experimental conditions, suggesting that gene expression profiles can become a viable alternative to sequence similarity searches provided that the genes are observed under a wide range of experimental conditions.


2005 ◽  
Vol 17 (06) ◽  
pp. 300-308 ◽  
Author(s):  
LI-YEH CHUANG ◽  
CHENG-HONG YANG ◽  
LI-CHENG JIN

The support vector machine (SVM) is a new learning method and has shown comparable or better results than neural networks in some applications. In this paper, we applied SVM to classify multiple cancer types by gene expression profiles and exploited several strategies of the SVM method, including fuzzy logic and statistical theories. Using the proposed strategies and outlier detection methods, the FSVM (fuzzy support vector machine) can achieve comparable or better performance than other methods, and provides a more flexible architecture to discriminate between SRBCT and non-SRBCT samples.

