scholarly journals Personal Health Information Inference Using Machine Learning on RNA Expression Data from Patients With Cancer: Algorithm Validation Study

10.2196/18387 ◽  
2020 ◽  
Vol 22 (8) ◽  
pp. e18387
Author(s):  
Solbi Kweon ◽  
Jeong Hoon Lee ◽  
Younghee Lee ◽  
Yu Rang Park

Background As the need for sharing genomic data grows, privacy issues and concerns, such as the ethics surrounding data sharing and disclosure of personal information, are raised. Objective The main purpose of this study was to verify whether genomic data is sufficient to predict a patient's personal information. Methods RNA expression data and matched patient personal information were collected from 9538 patients in The Cancer Genome Atlas program. Five personal information variables (age, gender, race, cancer type, and cancer stage) were recorded for each patient. Four different machine learning algorithms (support vector machine, decision tree, random forest, and artificial neural network) were used to determine whether a patient's personal information could be accurately predicted from RNA expression data. Performance measurement of the prediction models was based on the accuracy and area under the receiver operating characteristic curve. We selected five cancer types (breast carcinoma, kidney renal clear cell carcinoma, head and neck squamous cell carcinoma, low-grade glioma, and lung adenocarcinoma) with large samples sizes to verify whether predictive accuracy would differ between them. We also validated the efficacy of our four machine learning models in analyzing normal samples from 593 cancer patients. Results In most samples, personal information with high genetic relevance, such as gender and cancer type, could be predicted from RNA expression data alone. The prediction accuracies for gender and cancer type, which were the best models, were 0.93-0.99 and 0.78-0.94, respectively. Other aspects of personal information, such as age, race, and cancer stage, were difficult to predict from RNA expression data, with accuracies ranging from 0.0026-0.29, 0.76-0.96, and 0.45-0.79, respectively. Among the tested machine learning methods, the highest predictive accuracy was obtained using the support vector machine algorithm (mean accuracy 0.77), while the lowest accuracy was obtained using the random forest method (mean accuracy 0.65). Gender and race were predicted more accurately than other variables in the samples. On average, the accuracy of cancer stage prediction ranged between 0.71-0.67, while the age prediction accuracy ranged between 0.18-0.23 for the five cancer types. Conclusions We attempted to predict patient information using RNA expression data. We found that some identifiers could be predicted, but most others could not. This study showed that personal information available from RNA expression data is limited and this information cannot be used to identify specific patients.

2020 ◽  
Author(s):  
Solbi Kweon ◽  
Jeong Hoon Lee ◽  
Younghee Lee ◽  
Yu Rang Park

BACKGROUND As the need for sharing genomic data grows, privacy issues and concerns, such as the ethics surrounding data sharing and disclosure of personal information, are raised. OBJECTIVE The main purpose of this study was to verify whether genomic data is sufficient to predict a patient's personal information. METHODS RNA expression data and matched patient personal information were collected from 9538 patients in The Cancer Genome Atlas program. Five personal information variables (age, gender, race, cancer type, and cancer stage) were recorded for each patient. Four different machine learning algorithms (support vector machine, decision tree, random forest, and artificial neural network) were used to determine whether a patient's personal information could be accurately predicted from RNA expression data. Performance measurement of the prediction models was based on the accuracy and area under the receiver operating characteristic curve. We selected five cancer types (breast carcinoma, kidney renal clear cell carcinoma, head and neck squamous cell carcinoma, low-grade glioma, and lung adenocarcinoma) with large samples sizes to verify whether predictive accuracy would differ between them. We also validated the efficacy of our four machine learning models in analyzing normal samples from 593 cancer patients. RESULTS In most samples, personal information with high genetic relevance, such as gender and cancer type, could be predicted from RNA expression data alone. The prediction accuracies for gender and cancer type, which were the best models, were 0.93-0.99 and 0.78-0.94, respectively. Other aspects of personal information, such as age, race, and cancer stage, were difficult to predict from RNA expression data, with accuracies ranging from 0.0026-0.29, 0.76-0.96, and 0.45-0.79, respectively. Among the tested machine learning methods, the highest predictive accuracy was obtained using the support vector machine algorithm (mean accuracy 0.77), while the lowest accuracy was obtained using the random forest method (mean accuracy 0.65). Gender and race were predicted more accurately than other variables in the samples. On average, the accuracy of cancer stage prediction ranged between 0.71-0.67, while the age prediction accuracy ranged between 0.18-0.23 for the five cancer types. CONCLUSIONS We attempted to predict patient information using RNA expression data. We found that some identifiers could be predicted, but most others could not. This study showed that personal information available from RNA expression data is limited and this information cannot be used to identify specific patients.


Cancers ◽  
2021 ◽  
Vol 13 (15) ◽  
pp. 3768
Author(s):  
Vijayachitra Modhukur ◽  
Shakshi Sharma ◽  
Mainak Mondal ◽  
Ankita Lawarde ◽  
Keiu Kask ◽  
...  

Metastatic cancers account for up to 90% of cancer-related deaths. The clear differentiation of metastatic cancers from primary cancers is crucial for cancer type identification and developing targeted treatment for each cancer type. DNA methylation patterns are suggested to be an intriguing target for cancer prediction and are also considered to be an important mediator for the transition to metastatic cancer. In the present study, we used 24 cancer types and 9303 methylome samples downloaded from publicly available data repositories, including The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO). We constructed machine learning classifiers to discriminate metastatic, primary, and non-cancerous methylome samples. We applied support vector machines (SVM), Naive Bayes (NB), extreme gradient boosting (XGBoost), and random forest (RF) machine learning models to classify the cancer types based on their tissue of origin. RF outperformed the other classifiers, with an average accuracy of 99%. Moreover, we applied local interpretable model-agnostic explanations (LIME) to explain important methylation biomarkers to classify cancer types.


PLoS ONE ◽  
2021 ◽  
Vol 16 (4) ◽  
pp. e0250458
Author(s):  
ChangHyuk Kwon ◽  
Sangjin Park ◽  
Soohyun Ko ◽  
Jaegyoon Ahn

Accurate prediction of cancer stage is important in that it enables more appropriate treatment for patients with cancer. Many measures or methods have been proposed for more accurate prediction of cancer stage, but recently, machine learning, especially deep learning-based methods have been receiving increasing attention, mostly owing to their good prediction accuracy in many applications. Machine learning methods can be applied to high throughput DNA mutation or RNA expression data to predict cancer stage. However, because the number of genes or markers generally exceeds 10,000, a considerable number of data samples is required to guarantee high prediction accuracy. To solve this problem of a small number of clinical samples, we used a Generative Adversarial Networks (GANs) to augment the samples. Because GANs are not effective with whole genes, we first selected significant genes using DNA mutation data and random forest feature ranking. Next, RNA expression data for selected genes were expanded using GANs. We compared the classification accuracies using original dataset and expanded datasets generated by proposed and existing methods, using random forest, Deep Neural Networks (DNNs), and 1-Dimensional Convolutional Neural Networks (1DCNN). When using the 1DCNN, the F1 score of GAN5 (a 5-fold increase in data) was improved by 39% in relation to the original data. Moreover, the results using only 30% of the data were better than those using all of the data. Our attempt is the first to use GAN for augmentation using numeric data for both DNA and RNA. The augmented datasets obtained using the proposed method demonstrated significantly increased classification accuracy for most cases. By using GAN and 1DCNN in the prediction of cancer stage, we confirmed that good results can be obtained even with small amounts of samples, and it is expected that a great deal of the cost and time required to obtain clinical samples will be reduced. The proposed sample augmentation method could also be applied for other purposes, such as prognostic prediction or cancer classification.


Cancers ◽  
2019 ◽  
Vol 11 (10) ◽  
pp. 1562 ◽  
Author(s):  
Maurizio Polano ◽  
Marco Chierici ◽  
Michele Dal Bo ◽  
Davide Gentilini ◽  
Federica Di Cintio ◽  
...  

Immunotherapy by using immune checkpoint inhibitors (ICI) has dramatically improved the treatment options in various cancers, increasing survival rates for treated patients. Nevertheless, there are heterogeneous response rates to ICI among different cancer types, and even in the context of patients affected by a specific cancer. Thus, it becomes crucial to identify factors that predict the response to immunotherapeutic approaches. A comprehensive investigation of the mutational and immunological aspects of the tumor can be useful to obtain a robust prediction. By performing a pan-cancer analysis on gene expression data from the Cancer Genome Atlas (TCGA, 8055 cases and 29 cancer types), we set up and validated a machine learning approach to predict the potential for positive response to ICI. Support vector machines (SVM) and extreme gradient boosting (XGboost) models were developed with a 10×5-fold cross-validation schema on 80% of TCGA cases to predict ICI responsiveness defined by a score combining tumor mutational burden and TGF- β signaling. On the remaining 20% validation subset, our SVM model scored 0.88 accuracy and 0.27 Matthews Correlation Coefficient. The proposed machine learning approach could be useful to predict the putative response to ICI treatment by expression data of primary tumors.


2021 ◽  
Vol 11 (2) ◽  
pp. 61
Author(s):  
Jiande Wu ◽  
Chindo Hicks

Background: Breast cancer is a heterogeneous disease defined by molecular types and subtypes. Advances in genomic research have enabled use of precision medicine in clinical management of breast cancer. A critical unmet medical need is distinguishing triple negative breast cancer, the most aggressive and lethal form of breast cancer, from non-triple negative breast cancer. Here we propose use of a machine learning (ML) approach for classification of triple negative breast cancer and non-triple negative breast cancer patients using gene expression data. Methods: We performed analysis of RNA-Sequence data from 110 triple negative and 992 non-triple negative breast cancer tumor samples from The Cancer Genome Atlas to select the features (genes) used in the development and validation of the classification models. We evaluated four different classification models including Support Vector Machines, K-nearest neighbor, Naïve Bayes and Decision tree using features selected at different threshold levels to train the models for classifying the two types of breast cancer. For performance evaluation and validation, the proposed methods were applied to independent gene expression datasets. Results: Among the four ML algorithms evaluated, the Support Vector Machine algorithm was able to classify breast cancer more accurately into triple negative and non-triple negative breast cancer and had less misclassification errors than the other three algorithms evaluated. Conclusions: The prediction results show that ML algorithms are efficient and can be used for classification of breast cancer into triple negative and non-triple negative breast cancer types.


2021 ◽  
Author(s):  
Xiaotong Zhu ◽  
Jinhui Jeanne Huang

<p>Remote sensing monitoring has the characteristics of wide monitoring range, celerity, low cost for long-term dynamic monitoring of water environment. With the flourish of artificial intelligence, machine learning has enabled remote sensing inversion of seawater quality to achieve higher prediction accuracy. However, due to the physicochemical property of the water quality parameters, the performance of algorithms differs a lot. In order to improve the predictive accuracy of seawater quality parameters, we proposed a technical framework to identify the optimal machine learning algorithms using Sentinel-2 satellite and in-situ seawater sample data. In the study, we select three algorithms, i.e. support vector regression (SVR), XGBoost and deep learning (DL), and four seawater quality parameters, i.e. dissolved oxygen (DO), total dissolved solids (TDS), turbidity(TUR) and chlorophyll-a (Chla). The results show that SVR is a more precise algorithm to inverse DO (R<sup>2</sup> = 0.81). XGBoost has the best accuracy for Chla and Tur inversion (R<sup>2</sup> = 0.75 and 0.78 respectively) while DL performs better in TDS (R<sup>2</sup> =0.789). Overall, this research provides a theoretical support for high precision remote sensing inversion of offshore seawater quality parameters based on machine learning.</p>


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Emmanuel Adinyira ◽  
Emmanuel Akoi-Gyebi Adjei ◽  
Kofi Agyekum ◽  
Frank Desmond Kofi Fugar

PurposeKnowledge of the effect of various cash-flow factors on expected project profit is important to effectively manage productivity on construction projects. This study was conducted to develop and test the sensitivity of a Machine Learning Support Vector Regression Algorithm (SVRA) to predict construction project profit in Ghana.Design/methodology/approachThe study relied on data from 150 institutional projects executed within the past five years (2014–2018) in developing the model. Eighty percent (80%) of the data from the 150 projects was used at hyperparameter selection and final training phases of the model development and the remaining 20% for model testing. Using MATLAB for Support Vector Regression, the parameters available for tuning were the epsilon values, the kernel scale, the box constraint and standardisations. The sensitivity index was computed to determine the degree to which the independent variables impact the dependent variable.FindingsThe developed model's predictions perfectly fitted the data and explained all the variability of the response data around its mean. Average predictive accuracy of 73.66% was achieved with all the variables on the different projects in validation. The developed SVR model was sensitive to labour and loan.Originality/valueThe developed SVRA combines variation, defective works and labour with other financial constraints, which have been the variables used in previous studies. It will aid contractors in predicting profit on completion at commencement and also provide information on the effect of changes to cash-flow factors on profit.


2022 ◽  
Author(s):  
Malvika Sudhakar ◽  
Raghunathan Rengaswamy ◽  
Karthik Raman

The progression of tumorigenesis starts with a few mutational and structural driver events in the cell. Various cohort-based computational tools exist to identify driver genes but require a large number of samples to produce reliable results. Many studies use different methods to identify driver mutations/genes from mutations that have no impact on tumour progression; however, a small fraction of patients show no mutational events in any known driver genes. Current unsupervised methods map somatic and expression data onto a network to identify the perturbation in the network. Our method is the first machine learning model to classify genes as tumour suppressor gene (TSG), oncogene (OG) or neutral, thus assigning the functional impact of the gene in the patient. In this study, we develop a multi-omic approach, PIVOT (Personalised Identification of driVer OGs and TSGs), to train on experimentally or computationally validated mutational and structural driver events. Given the lack of any gold standards for the identification of personalised driver genes, we label the data using four strategies and, based on classification metrics, show gene-based labelling strategies perform best. We build different models using SNV, RNA, and multi-omic features to be used based on the data available. Our models trained on multi-omic data improved predictions compared to mutation and expression data, achieving an accuracy >0.99 for BRCA, LUAD and COAD datasets. We show network and expression-based features contribute the most to PIVOT. Our predictions on BRCA, COAD and LUAD cancer types reveal commonly altered genes such as TP53, and PIK3CA, which are predicted drivers for multiple cancer types. Along with known driver genes, our models also identify new driver genes such as PRKCA, SOX9 and PSMD4. Our multi-omic model labels both CNV and mutations with a more considerable contribution by CNV alterations. While predicting labels for genes mutated in multiple samples, we also label rare driver events occurring in as few as one sample. We also identify genes with dual roles within the same cancer type. Overall, PIVOT labels personalised driver genes as TSGs and OGs and also identifies rare driver genes. PIVOT is available at https://github.com/RamanLab/PIVOT.


2016 ◽  
Vol 24 (1) ◽  
pp. 54-65 ◽  
Author(s):  
Stefano Parodi ◽  
Chiara Manneschi ◽  
Damiano Verda ◽  
Enrico Ferrari ◽  
Marco Muselli

This study evaluates the performance of a set of machine learning techniques in predicting the prognosis of Hodgkin’s lymphoma using clinical factors and gene expression data. Analysed samples from 130 Hodgkin’s lymphoma patients included a small set of clinical variables and more than 54,000 gene features. Machine learning classifiers included three black-box algorithms ( k-nearest neighbour, Artificial Neural Network, and Support Vector Machine) and two methods based on intelligible rules (Decision Tree and the innovative Logic Learning Machine method). Support Vector Machine clearly outperformed any of the other methods. Among the two rule-based algorithms, Logic Learning Machine performed better and identified a set of simple intelligible rules based on a combination of clinical variables and gene expressions. Decision Tree identified a non-coding gene ( XIST) involved in the early phases of X chromosome inactivation that was overexpressed in females and in non-relapsed patients. XIST expression might be responsible for the better prognosis of female Hodgkin’s lymphoma patients.


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 3072-3072
Author(s):  
Habte Aragaw Yimer ◽  
Wai Hong Wilson Tang ◽  
Mohan K. Tummala ◽  
Spencer Shao ◽  
Gina G. Chung ◽  
...  

3072 Background: The Circulating Cell-free Genome Atlas study (CCGA; NCT02889978) previously demonstrated that a blood-based multi-cancer early detection (MCED) test utilizing cell-free DNA (cfDNA) sequencing in combination with machine learning could detect cancer signals across multiple cancer types and predict cancer signal origin. Cancer classes were defined within the CCGA study for sensitivity reporting. Separately, cancer types defined by the American Joint Committee on Cancer (AJCC) criteria, which outline unique staging requirements and reflect a distinct combination of anatomic site, histology and other biologic features, were assigned to each cancer participant using the same source data for primary site of origin and histologic type. Here, we report CCGA ‘cancer class’ designation and AJCC ‘cancer type’ assignment within the third and final CCGA3 validation substudy to better characterize the diversity of tumors across which a cancer signal could be detected with the MCED test that is nearing clinical availability. Methods: CCGA is a prospective, multicenter, case-control, observational study with longitudinal follow-up (overall population N = 15,254). Plasma cfDNA from evaluable samples was analyzed using a targeted methylation bisulfite sequencing assay and a machine learning approach, and test performance, including sensitivity, was assessed. For sensitivity reporting, CCGA cancer classes were assigned to cancer participants using a combination of the type of primary cancer reported by the site and tumor characteristics abstracted from the site pathology reports by GRAIL pathologists. Each cancer participant also was separately assigned an AJCC cancer type based on the same source data using AJCC staging manual (8th edition) classifications. Results: A total of 4077 participants comprised the independent validation set with confirmed status (cancer: n = 2823; non-cancer: n = 1254 with non-cancer status confirmed at year-one follow-up). Sensitivity was reported for 24 cancer classes (sample sizes ranged from 10 to 524 participants), as well as an “other” cancer class (59 participants). According to AJCC classification, the MCED test was found to detect cancer signals across 50+ AJCC cancer types, including some types not present in the training set; some cancer types had limited representation. Conclusions: This MCED test that is nearing clinical availability and was evaluated in the third CCGA substudy detected cancer signals across 50+ AJCC cancer types. Reporting CCGA cancer classes and AJCC cancer types demonstrates the ability of the MCED test to detect cancer signals across a set of diverse cancer types representing a wide range of biologic characteristics, including cancer types that the classifier has not been trained on, and supports its use on a population-wide scale. Clinical trial information: NCT02889978.


Sign in / Sign up

Export Citation Format

Share Document