Increasing prediction accuracy of pathogenic staging by sample augmentation with a GAN

Accurate prediction of cancer stage is important in that it enables more appropriate treatment for patients with cancer. Many measures or methods have been proposed for more accurate prediction of cancer stage, but recently, machine learning, especially deep learning-based methods have been receiving increasing attention, mostly owing to their good prediction accuracy in many applications. Machine learning methods can be applied to high throughput DNA mutation or RNA expression data to predict cancer stage. However, because the number of genes or markers generally exceeds 10,000, a considerable number of data samples is required to guarantee high prediction accuracy. To solve this problem of a small number of clinical samples, we used a Generative Adversarial Networks (GANs) to augment the samples. Because GANs are not effective with whole genes, we first selected significant genes using DNA mutation data and random forest feature ranking. Next, RNA expression data for selected genes were expanded using GANs. We compared the classification accuracies using original dataset and expanded datasets generated by proposed and existing methods, using random forest, Deep Neural Networks (DNNs), and 1-Dimensional Convolutional Neural Networks (1DCNN). When using the 1DCNN, the F1 score of GAN5 (a 5-fold increase in data) was improved by 39% in relation to the original data. Moreover, the results using only 30% of the data were better than those using all of the data. Our attempt is the first to use GAN for augmentation using numeric data for both DNA and RNA. The augmented datasets obtained using the proposed method demonstrated significantly increased classification accuracy for most cases. By using GAN and 1DCNN in the prediction of cancer stage, we confirmed that good results can be obtained even with small amounts of samples, and it is expected that a great deal of the cost and time required to obtain clinical samples will be reduced. The proposed sample augmentation method could also be applied for other purposes, such as prognostic prediction or cancer classification.

Download Full-text

Personal Health Information Inference Using Machine Learning on RNA Expression Data from Patients With Cancer: Algorithm Validation Study

Journal of Medical Internet Research ◽

10.2196/18387 ◽

2020 ◽

Vol 22 (8) ◽

pp. e18387

Author(s):

Solbi Kweon ◽

Jeong Hoon Lee ◽

Younghee Lee ◽

Yu Rang Park

Keyword(s):

Machine Learning ◽

Predictive Accuracy ◽

Personal Information ◽

Genomic Data ◽

Support Vector ◽

Rna Expression ◽

Cancer Type ◽

Cancer Stage ◽

Expression Data ◽

Cancer Types

Background As the need for sharing genomic data grows, privacy issues and concerns, such as the ethics surrounding data sharing and disclosure of personal information, are raised. Objective The main purpose of this study was to verify whether genomic data is sufficient to predict a patient's personal information. Methods RNA expression data and matched patient personal information were collected from 9538 patients in The Cancer Genome Atlas program. Five personal information variables (age, gender, race, cancer type, and cancer stage) were recorded for each patient. Four different machine learning algorithms (support vector machine, decision tree, random forest, and artificial neural network) were used to determine whether a patient's personal information could be accurately predicted from RNA expression data. Performance measurement of the prediction models was based on the accuracy and area under the receiver operating characteristic curve. We selected five cancer types (breast carcinoma, kidney renal clear cell carcinoma, head and neck squamous cell carcinoma, low-grade glioma, and lung adenocarcinoma) with large samples sizes to verify whether predictive accuracy would differ between them. We also validated the efficacy of our four machine learning models in analyzing normal samples from 593 cancer patients. Results In most samples, personal information with high genetic relevance, such as gender and cancer type, could be predicted from RNA expression data alone. The prediction accuracies for gender and cancer type, which were the best models, were 0.93-0.99 and 0.78-0.94, respectively. Other aspects of personal information, such as age, race, and cancer stage, were difficult to predict from RNA expression data, with accuracies ranging from 0.0026-0.29, 0.76-0.96, and 0.45-0.79, respectively. Among the tested machine learning methods, the highest predictive accuracy was obtained using the support vector machine algorithm (mean accuracy 0.77), while the lowest accuracy was obtained using the random forest method (mean accuracy 0.65). Gender and race were predicted more accurately than other variables in the samples. On average, the accuracy of cancer stage prediction ranged between 0.71-0.67, while the age prediction accuracy ranged between 0.18-0.23 for the five cancer types. Conclusions We attempted to predict patient information using RNA expression data. We found that some identifiers could be predicted, but most others could not. This study showed that personal information available from RNA expression data is limited and this information cannot be used to identify specific patients.

Download Full-text

Personal Health Information Inference Using Machine Learning on RNA Expression Data from Patients With Cancer: Algorithm Validation Study (Preprint)

10.2196/preprints.18387 ◽

2020 ◽

Author(s):

Solbi Kweon ◽

Jeong Hoon Lee ◽

Younghee Lee ◽

Yu Rang Park

Keyword(s):

Machine Learning ◽

Predictive Accuracy ◽

Personal Information ◽

Genomic Data ◽

Support Vector ◽

Rna Expression ◽

Cancer Type ◽

Cancer Stage ◽

Expression Data ◽

Cancer Types

BACKGROUND As the need for sharing genomic data grows, privacy issues and concerns, such as the ethics surrounding data sharing and disclosure of personal information, are raised. OBJECTIVE The main purpose of this study was to verify whether genomic data is sufficient to predict a patient's personal information. METHODS RNA expression data and matched patient personal information were collected from 9538 patients in The Cancer Genome Atlas program. Five personal information variables (age, gender, race, cancer type, and cancer stage) were recorded for each patient. Four different machine learning algorithms (support vector machine, decision tree, random forest, and artificial neural network) were used to determine whether a patient's personal information could be accurately predicted from RNA expression data. Performance measurement of the prediction models was based on the accuracy and area under the receiver operating characteristic curve. We selected five cancer types (breast carcinoma, kidney renal clear cell carcinoma, head and neck squamous cell carcinoma, low-grade glioma, and lung adenocarcinoma) with large samples sizes to verify whether predictive accuracy would differ between them. We also validated the efficacy of our four machine learning models in analyzing normal samples from 593 cancer patients. RESULTS In most samples, personal information with high genetic relevance, such as gender and cancer type, could be predicted from RNA expression data alone. The prediction accuracies for gender and cancer type, which were the best models, were 0.93-0.99 and 0.78-0.94, respectively. Other aspects of personal information, such as age, race, and cancer stage, were difficult to predict from RNA expression data, with accuracies ranging from 0.0026-0.29, 0.76-0.96, and 0.45-0.79, respectively. Among the tested machine learning methods, the highest predictive accuracy was obtained using the support vector machine algorithm (mean accuracy 0.77), while the lowest accuracy was obtained using the random forest method (mean accuracy 0.65). Gender and race were predicted more accurately than other variables in the samples. On average, the accuracy of cancer stage prediction ranged between 0.71-0.67, while the age prediction accuracy ranged between 0.18-0.23 for the five cancer types. CONCLUSIONS We attempted to predict patient information using RNA expression data. We found that some identifiers could be predicted, but most others could not. This study showed that personal information available from RNA expression data is limited and this information cannot be used to identify specific patients.

Download Full-text

Application of Machine Learning Techniques in the Study of the Relevance of Environmental Factors in Prediction of Tropospheric Ozone

Soft Computing Methods for Practical Environment Solutions ◽

10.4018/978-1-61520-893-7.ch017 ◽

2010 ◽

pp. 278-292

Author(s):

Juan Gómez-Sanchis ◽

Emilio Soria-Olivas ◽

Marcelino Martinez-Sober ◽

Jose Blasco ◽

Juan Guerrero ◽

...

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Ozone Concentration ◽

Tropospheric Ozone ◽

Prediction Accuracy ◽

Linear Models ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Atmospheric Phenomena ◽

Atmospheric Concentrations

This work presents a new approach for one of the main problems in the analysis of atmospheric phenomena, the prediction of atmospheric concentrations of different elements. The proposed methodology is more efficient than other classical approaches and is used in this work to predict tropospheric ozone concentration. The relevance of this problem stems from the fact that excessive ozone concentrations may cause several problems related to public health. Previous research by the authors of this work has shown that the classical approach to this problem (linear models) does not achieve satisfactory results in tropospheric ozone concentration prediction. The authors’ approach is based on Machine Learning (ML) techniques, which include algorithms related to neural networks, fuzzy systems and advanced statistical techniques for data processing. In this work, the authors focus on one of the main ML techniques, namely, neural networks. These models demonstrate their suitability for this problem both in terms of prediction accuracy and information extraction.

Download Full-text

On the effects of pseudorandom and quantum-random number generators in soft computing

Soft Computing ◽

10.1007/s00500-019-04450-0 ◽

2019 ◽

Vol 24 (12) ◽

pp. 9243-9256

Author(s):

Jordan J. Bird ◽

Anikó Ekárt ◽

Diego R. Faria

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Random Forest ◽

Soft Computing ◽

Convolutional Neural Networks ◽

Random Number ◽

Random Tree ◽

Random Numbers ◽

Random Number Generators ◽

Eeg Classification

Abstract In this work, we argue that the implications of pseudorandom and quantum-random number generators (PRNG and QRNG) inexplicably affect the performances and behaviours of various machine learning models that require a random input. These implications are yet to be explored in soft computing until this work. We use a CPU and a QPU to generate random numbers for multiple machine learning techniques. Random numbers are employed in the random initial weight distributions of dense and convolutional neural networks, in which results show a profound difference in learning patterns for the two. In 50 dense neural networks (25 PRNG/25 QRNG), QRNG increases over PRNG for accent classification at + 0.1%, and QRNG exceeded PRNG for mental state EEG classification by + 2.82%. In 50 convolutional neural networks (25 PRNG/25 QRNG), the MNIST and CIFAR-10 problems are benchmarked, and in MNIST the QRNG experiences a higher starting accuracy than the PRNG but ultimately only exceeds it by 0.02%. In CIFAR-10, the QRNG outperforms PRNG by + 0.92%. The n-random split of a Random Tree is enhanced towards and new Quantum Random Tree (QRT) model, which has differing classification abilities to its classical counterpart, 200 trees are trained and compared (100 PRNG/100 QRNG). Using the accent and EEG classification data sets, a QRT seemed inferior to a RT as it performed on average worse by − 0.12%. This pattern is also seen in the EEG classification problem, where a QRT performs worse than a RT by − 0.28%. Finally, the QRT is ensembled into a Quantum Random Forest (QRF), which also has a noticeable effect when compared to the standard Random Forest (RF). Ten to 100 ensembles of trees are benchmarked for the accent and EEG classification problems. In accent classification, the best RF (100 RT) outperforms the best QRF (100 QRF) by 0.14% accuracy. In EEG classification, the best RF (100 RT) outperforms the best QRF (100 QRT) by 0.08% but is extremely more complex, requiring twice the amount of trees in committee. All differences are observed to be situationally positive or negative and thus are likely data dependent in their observed functional behaviour.

Download Full-text

3145 An Evaluation of Machine Learning and Traditional Statistical Methods for Discovery in Large-Scale Translational Data

Journal of Clinical and Translational Science ◽

10.1017/cts.2019.8 ◽

2019 ◽

Vol 3 (s1) ◽

pp. 2-2

Author(s):

Megan C Hollister ◽

Jeffrey D. Blume

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Random Forest ◽

Gene Expression Data ◽

Large Scale ◽

Second Generation ◽

A Priori ◽

Expression Data ◽

P Values ◽

Machine Learning Methods

OBJECTIVES/SPECIFIC AIMS: To examine and compare the claims in Bzdok, Altman, and Brzywinski under a broader set of conditions by using unbiased methods of comparison. To explore how to accurately use various machine learning and traditional statistical methods in large-scale translational research by estimating their accuracy statistics. Then we will identify the methods with the best performance characteristics. METHODS/STUDY POPULATION: We conducted a simulation study with a microarray of gene expression data. We maintained the original structure proposed by Bzdok, Altman, and Brzywinski. The structure for gene expression data includes a total of 40 genes from 20 people, in which 10 people are phenotype positive and 10 are phenotype negative. In order to find a statistical difference 25% of the genes were set to be dysregulated across phenotype. This dysregulation forced the positive and negative phenotypes to have different mean population expressions. Additional variance was included to simulate genetic variation across the population. We also allowed for within person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values. RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels via a machine learning algorithm outperform classical methods. Almost uniformly, the machine learning methods did not yield improved accuracy statistics and they depend heavily on the a priori chosen number of dysregulated genes. DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this and their additional complexity, machine learning approaches would not be preferable. Of all the approaches the second-generation p-value appears to offer significant benefit for the cost of a priori defining a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.

Download Full-text

A novel dimension reduction algorithm based on weighted kernel principal analysis for gene expression data

PLoS ONE ◽

10.1371/journal.pone.0258326 ◽

2021 ◽

Vol 16 (10) ◽

pp. e0258326

Author(s):

Wen Bo Liu ◽

Sheng Nan Liang ◽

Xi Wen Qin

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Random Forest ◽

Dimension Reduction ◽

Principal Component ◽

Classification Performance ◽

Kernel Functions ◽

Reduction Algorithm ◽

Expression Data ◽

Weighted Kernel

Gene expression data has the characteristics of high dimensionality and a small sample size and contains a large number of redundant genes unrelated to a disease. The direct application of machine learning to classify this type of data will not only incur a great time cost but will also sometimes fail to improved classification performance. To counter this problem, this paper proposes a dimension-reduction algorithm based on weighted kernel principal component analysis (WKPCA), constructs kernel function weights according to kernel matrix eigenvalues, and combines multiple kernel functions to reduce the feature dimensions. To further improve the dimensional reduction efficiency of WKPCA, t-class kernel functions are constructed, and corresponding theoretical proofs are given. Moreover, the cumulative optimal performance rate is constructed to measure the overall performance of WKPCA combined with machine learning algorithms. Naive Bayes, K-nearest neighbour, random forest, iterative random forest and support vector machine approaches are used in classifiers to analyse 6 real gene expression dataset. Compared with the all-variable model, linear principal component dimension reduction and single kernel function dimension reduction, the results show that the classification performance of the 5 machine learning methods mentioned above can be improved effectively by WKPCA dimension reduction.

Download Full-text

Comparison of the Validity and Generalizability of Machine Learning Algorithms for the Prediction of Energy Expenditure: Validation Study

JMIR mhealth and uhealth ◽

10.2196/23938 ◽

2021 ◽

Vol 9 (8) ◽

pp. e23938

Author(s):

Ruairi O'Driscoll ◽

Jake Turicchi ◽

Mark Hopkins ◽

Cristiana Duarte ◽

Graham W Horgan ◽

...

Keyword(s):

Physical Activity ◽

Machine Learning ◽

Neural Networks ◽

Random Forest ◽

Energy Expenditure ◽

Superior Performance ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Out Of Sample

Background Accurate solutions for the estimation of physical activity and energy expenditure at scale are needed for a range of medical and health research fields. Machine learning techniques show promise in research-grade accelerometers, and some evidence indicates that these techniques can be applied to more scalable commercial devices. Objective This study aims to test the validity and out-of-sample generalizability of algorithms for the prediction of energy expenditure in several wearables (ie, Fitbit Charge 2, ActiGraph GT3-x, SenseWear Armband Mini, and Polar H7) using two laboratory data sets comprising different activities. Methods Two laboratory studies (study 1: n=59, age 44.4 years, weight 75.7 kg; study 2: n=30, age=31.9 years, weight=70.6 kg), in which adult participants performed a sequential lab-based activity protocol consisting of resting, household, ambulatory, and nonambulatory tasks, were combined in this study. In both studies, accelerometer and physiological data were collected from the wearables alongside energy expenditure using indirect calorimetry. Three regression algorithms were used to predict metabolic equivalents (METs; ie, random forest, gradient boosting, and neural networks), and five classification algorithms (ie, k-nearest neighbor, support vector machine, random forest, gradient boosting, and neural networks) were used for physical activity intensity classification as sedentary, light, or moderate to vigorous. Algorithms were evaluated using leave-one-subject-out cross-validations and out-of-sample validations. Results The root mean square error (RMSE) was lowest for gradient boosting applied to SenseWear and Polar H7 data (0.91 METs), and in the classification task, gradient boost applied to SenseWear and Polar H7 was the most accurate (85.5%). Fitbit models achieved an RMSE of 1.36 METs and 78.2% accuracy for classification. Errors tended to increase in out-of-sample validations with the SenseWear neural network achieving RMSE values of 1.22 METs in the regression tasks and the SenseWear gradient boost and random forest achieving an accuracy of 80% in classification tasks. Conclusions Algorithms trained on combined data sets demonstrated high predictive accuracy, with a tendency for superior performance of random forests and gradient boosting for most but not all wearable devices. Predictions were poorer in the between-study validations, which creates uncertainty regarding the generalizability of the tested algorithms.

Download Full-text

Identifying New Potential Biomarkers in Adrenocortical Tumors Based on mRNA Expression Data Using Machine Learning

Cancers ◽

10.3390/cancers13184671 ◽

2021 ◽

Vol 13 (18) ◽

pp. 4671

Author(s):

André Marquardt ◽

Laura-Sophie Landwehr ◽

Cristina L. Ronchi ◽

Guido di Dalmazi ◽

Anna Riester ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Good Prognosis ◽

Marker Genes ◽

Expression Data ◽

Poor Survival ◽

Significant Differential Expression ◽

Mrna Expression Data ◽

Adrenocortical Tumors ◽

Unbiased Manner

Adrenocortical carcinoma (ACC) is a rare disease, associated with poor survival. Several “multiple-omics” studies characterizing ACC on a molecular level identified two different clusters correlating with patient survival (C1A and C1B). We here used the publicly available transcriptome data from the TCGA-ACC dataset (n = 79), applying machine learning (ML) methods to classify the ACC based on expression pattern in an unbiased manner. UMAP (uniform manifold approximation and projection)-based clustering resulted in two distinct groups, ACC-UMAP1 and ACC-UMAP2, that largely overlap with clusters C1B and C1A, respectively. However, subsequent use of random-forest-based learning revealed a set of new possible marker genes showing significant differential expression in the described clusters (e.g., SOAT1, EIF2A1). For validation purposes, we used a secondary dataset based on a previous study from our group, consisting of 4 normal adrenal glands and 52 benign and 7 malignant tumor samples. The results largely confirmed those obtained for the TCGA-ACC cohort. In addition, the ENSAT dataset showed a correlation between benign adrenocortical tumors and the good prognosis ACC cluster ACC-UMAP1/C1B. In conclusion, the use of ML approaches re-identified and redefined known prognostic ACC subgroups. On the other hand, the subsequent use of random-forest-based learning identified new possible prognostic marker genes for ACC.

Download Full-text

Comparing Machine Learning Models and Hybrid Geostatistical Methods Using Environmental and Soil Covariates for Soil pH Prediction

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9040276 ◽

2020 ◽

Vol 9 (4) ◽

pp. 276

Author(s):

Panagiotis Tziachris ◽

Vassilis Aschonitis ◽

Theocharis Chatzistathis ◽

Maria Papadopoulou ◽

Ioannis (John) D. Doukas

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Random Forests ◽

Soil Ph ◽

Prediction Accuracy ◽

Gradient Boosting ◽

Soil Parameters ◽

Geostatistical Methods ◽

Significant Difference ◽

Hyperparameter Selection

In the current paper we assess different machine learning (ML) models and hybrid geostatistical methods in the prediction of soil pH using digital elevation model derivates (environmental covariates) and co-located soil parameters (soil covariates). The study was located in the area of Grevena, Greece, where 266 disturbed soil samples were collected from randomly selected locations and analyzed in the laboratory of the Soil and Water Resources Institute. The different models that were assessed were random forests (RF), random forests kriging (RFK), gradient boosting (GB), gradient boosting kriging (GBK), neural networks (NN), and neural networks kriging (NNK) and finally, multiple linear regression (MLR), ordinary kriging (OK), and regression kriging (RK) that although they are not ML models, they were used for comparison reasons. Both the GB and RF models presented the best results in the study, with NN a close second. The introduction of OK to the ML models’ residuals did not have a major impact. Classical geostatistical or hybrid geostatistical methods without ML (OK, MLR, and RK) exhibited worse prediction accuracy compared to the models that included ML. Furthermore, different implementations (methods and packages) of the same ML models were also assessed. Regarding RF and GB, the different implementations that were applied (ranger-ranger, randomForest-rf, xgboost-xgbTree, xgboost-xgbDART) led to similar results, whereas in NN, the differences between the implementations used (nnet-nnet and nnet-avNNet) were more distinct. Finally, ML models tuned through a random search optimization method were compared with the same ML models with their default values. The results showed that the predictions were improved by the optimization process only where the ML algorithms demanded a large number of hyperparameters that needed tuning and there was a significant difference between the default values and the optimized ones, like in the case of GB and NN, but not in RF. In general, the current study concluded that although RF and GB presented approximately the same prediction accuracy, RF had more consistent results, regardless of different packages, different hyperparameter selection methods, or even the inclusion of OK in the ML models’ residuals.

Download Full-text

Prediction of Breast Cancer Using Machine Learning

Recent Advances in Computer Science and Communications ◽

10.2174/2213275912666190617160834 ◽

2020 ◽

Vol 13 (5) ◽

pp. 901-908

Author(s):

Somil Jain ◽

Puneet Kumar

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Prediction Accuracy ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Classification Algorithms ◽

Breast Cancer Dataset

Background:: Breast cancer is one of the diseases which cause number of deaths ever year across the globe, early detection and diagnosis of such type of disease is a challenging task in order to reduce the number of deaths. Now a days various techniques of machine learning and data mining are used for medical diagnosis which has proven there metal by which prediction can be done for the chronic diseases like cancer which can save the life’s of the patients suffering from such type of disease. The major concern of this study is to find the prediction accuracy of the classification algorithms like Support Vector Machine, J48, Naïve Bayes and Random Forest and to suggest the best algorithm. Objective:: The objective of this study is to assess the prediction accuracy of the classification algorithms in terms of efficiency and effectiveness. Methods: This paper provides a detailed analysis of the classification algorithms like Support Vector Machine, J48, Naïve Bayes and Random Forest in terms of their prediction accuracy by applying 10 fold cross validation technique on the Wisconsin Diagnostic Breast Cancer dataset using WEKA open source tool. Results:: The result of this study states that Support Vector Machine has achieved the highest prediction accuracy of 97.89 % with low error rate of 0.14%. Conclusion:: This paper provides a clear view over the performance of the classification algorithms in terms of their predicting ability which provides a helping hand to the medical practitioners to diagnose the chronic disease like breast cancer effectively.

Download Full-text