A Machine Learning Approach for Tracing Tumor Original Sites With Gene Expression Profiles

Author(s):  
Xin Liang ◽  
Wen Zhu ◽  
Bo Liao ◽  
Bo Wang ◽  
Jialiang Yang ◽  
...  

In some carcinomas, one or more metastatic sites appear with an unknown origin. Identifying whether tumor tissue is primary or metastatic is crucial for physicians to develop precise treatment plans. When the primary site is unknown, it is challenging to design a specific plan, and such patients usually receive broad-spectrum chemotherapy yet still have a poor prognosis. Machine learning has been widely used and has already shown significant advantages in clinical practice. In this study, we classify and predict a large number of tumor samples with uncertain origins by applying the random forest and naive Bayes algorithms. We use precision, recall, and other measures to evaluate the performance of our approach. The results showed that the prediction accuracy of this method was 90.4% for 7,713 samples and 80% for 20 metastatic tumor samples. In addition, 10-fold cross-validation was used to evaluate classification accuracy, which reached 91%.
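The 10-fold cross-validation mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: a simple nearest-centroid classifier is swapped in for the random forest and naive Bayes models so that the example stays dependency-free, and the function names, site labels, and toy expression values are all hypothetical. The sketch also assumes there are at least as many samples as folds.

```python
import random
from collections import defaultdict

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Stand-in classifier (assumption): mean expression vector per site."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}

def predict(centroids, row):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(center, row))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

def cross_val_accuracy(X, y, k=10, seed=0):
    """Average held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(centroids, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k

# Toy data: two hypothetical sites with well-separated expression levels.
X = [[1.0 + 0.01 * i, 2.0] for i in range(20)] + \
    [[5.0 + 0.01 * i, 6.0] for i in range(20)]
y = ["lung"] * 20 + ["liver"] * 20
accuracy = cross_val_accuracy(X, y, k=10)
print(accuracy)
```

Each sample is held out exactly once, so the averaged score reflects performance on data the model did not see during fitting.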

PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5285 ◽  
Author(s):  
Mei Sze Tan ◽  
Siow-Wee Chang ◽  
Phaik Leng Cheah ◽  
Hwa Jen Yap

Although most cervical cancer cases are reported to be closely related to Human Papillomavirus (HPV) infection, there is a need to study genes that are differentially expressed in the eventual development of cervical cancer following HPV infection. In this study, we proposed an integrative machine learning approach to analyse multiple gene expression profiles in cervical cancer in order to identify a set of genetic markers that are associated with, and may eventually aid in, the diagnosis or prognosis of cervical cancers. The proposed integrative analysis is composed of three steps: (i) gene expression analysis of individual datasets; (ii) meta-analysis of multiple datasets; and (iii) feature selection and machine learning analysis. As a result, 21 gene expressions were identified through the integrative machine learning analysis, which included seven supervised methods and one unsupervised method. A functional analysis with GSEA (Gene Set Enrichment Analysis) was performed on the selected 21-gene expression set and showed significant enrichment in a potential nine-gene expression signature, namely PEG3, SPON1, BTD and RPLP2 (upregulated genes) and PRDX3, COPB2, LSM3, SLC5A3 and AS1B (downregulated genes).


2020 ◽  
Author(s):  
Haoyu Ruan ◽  
Yihang Zhou ◽  
Jie Shen ◽  
Yue Zhai ◽  
Ying Xu ◽  
...  

Metastatic lung cancer accounts for about half of brain metastases (BM). The development of leptomeningeal metastases (LM) is becoming increasingly common, and the prognosis is still poor despite advances in systemic and local approaches. Cytology analysis of the cerebrospinal fluid (CSF) remains the diagnostic gold standard. Although several previous studies performed in CSF have offered great promise for the diagnostics and therapeutics of LM, a comprehensive characterization of circulating tumor cells (CTCs) in CSF is still lacking. To fill this critical gap for lung adenocarcinoma LM (LUAD-LM), we analyzed the transcriptomes of 1,375 cells from 5 LUAD-LM patient samples and 3 control samples using single-cell RNA sequencing technology. We defined CSF-CTCs based on abundant expression of epithelial markers and genes of lung origin, as well as the enrichment of metabolic pathways and cell adhesion molecules, which are crucial for the survival and metastasis of tumor cells. Elevated expression of CEACAM6 and SCGB3A2 was discovered in CSF-CTCs, which could serve as candidate biomarkers of LUAD-LM. We identified substantial heterogeneity in CSF-CTCs among LUAD-LM patients and, within patients, among individual cells. Cell-cycle gene expression profiles and the proportion of CTCs displaying mesenchymal and cancer stem cell properties also vary among patients. In addition, CSF-CTC transcriptome profiling identified one LM case as cancer of unknown primary site (CUP). Our results will shed light on the mechanism of LUAD-LM and provide a new direction for diagnostic testing of LUAD-LM and CUP cases from CSF samples.


2021 ◽  
Author(s):  
Nathaniel T Hawkins ◽  
Marc Maldaver ◽  
Anna Yannakopoulos ◽  
Lindsay A Guare ◽  
Arjun Krishnan

There are currently >1.3 million human –omics samples that are publicly available. However, this valuable resource remains acutely underused because discovering samples, say from a particular tissue of interest, from this ever-growing data collection is still a significant challenge. The major impediment is that sample attributes such as tissue/cell-type of origin are routinely described using non-standard, varied terminologies written in unstructured natural language. Here, we propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for –omics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample text descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms in a structured ontology. Our approach significantly and substantially outperforms an advanced text annotation method (MetaSRA) that uses graph-based reasoning and a baseline method (TAGGER) that annotates text based on exact string matching. Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text, as does the ability of these models to classify tissue-associated biological processes and diseases based on their descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the –omics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto.
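The general idea described above (turn free-text sample descriptions into numerical vectors, then feed those vectors to a supervised classifier) can be sketched as follows. This is an illustration only and does not use the txt2onto code: plain bag-of-words counts and a cosine nearest-profile rule are assumptions standing in for whatever text representation and classifier NLP-ML actually uses, and the sample descriptions and tissue labels are invented.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase free text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def vectorize(text):
    """Bag-of-words term counts as a sparse numerical representation."""
    return Counter(tokenize(text))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(descriptions, labels):
    """Sum the vectors of each tissue label into one profile per label."""
    profiles = defaultdict(Counter)
    for text, label in zip(descriptions, labels):
        profiles[label].update(vectorize(text))
    return dict(profiles)

def predict(profiles, text):
    """Return the tissue label whose profile is most similar to the text."""
    vec = vectorize(text)
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))

# Hypothetical sample metadata and tissue annotations.
descriptions = [
    "hepatocyte culture from adult liver biopsy",
    "liver tissue, hepatic lobe section",
    "cortex neurons from brain tissue",
    "brain hippocampus, neuronal sample",
]
labels = ["liver", "liver", "brain", "brain"]
profiles = train(descriptions, labels)
print(predict(profiles, "primary hepatocyte sample from liver"))
```

Because the classifier sees only text, the same code path applies regardless of which experiment type produced the sample, which is the property the abstract highlights.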


2018 ◽  
Author(s):  
William F. Flynn ◽  
Sandeep Namburi ◽  
Carolyn A. Paisie ◽  
Honey V. Reddi ◽  
Sheng Li ◽  
...  

Background: It is estimated by the American Cancer Society that approximately 5% of all metastatic tumors have no defined primary site (tissue) of origin and are classified as cancers of unknown primary (CUPs). The current standard of care for CUP patients depends on immunohistochemistry (IHC) based approaches to identify the primary site. The addition of post-mortem evaluation to IHC based tests helps to reveal the identity of the primary site for only 25% of the CUPs, emphasizing the acute need for better methods of determining the site of origin. CUP patients are therefore given generic chemotherapeutic agents, resulting in poor prognosis. When the tissue of origin is known, patients can be given site-specific therapy with significant improvement in clinical outcome. Similarly, identifying the primary site of origin of metastatic cancer is of great importance for designing treatment. Identification of the primary site of origin is an important first step but may not be sufficient information for optimal treatment of the patient. Recent studies, primarily from The Cancer Genome Atlas (TCGA) project, and others, have revealed molecular subtypes in several cancer types with distinct clinical outcomes. The molecular subtype captures the fundamental mechanisms driving the cancer and provides information that is essential for the optimal treatment of a cancer. Thus, along with primary site of origin, molecular subtype of a tumor is emerging as a criterion for personalized medicine and patient entry into clinical trials. However, there is no comprehensive toolset available for precise identification of tissue of origin or molecular subtype for precision medicine and translational research. Methods and Findings: We posited that metastatic tumors will harbor the gene expression profiles of the primary site of origin of the cancer. Therefore, we decided to learn the molecular characteristics of the primary tumors using the large number of cancer genome profiles available from the TCGA project. Our predictors were trained for 33 cancer types and for the 11 cancers where there are established molecular subtypes. We estimated the accuracy of several machine learning models using cross-validation methods. Extensive testing using independent test sets revealed that the predictors had a median sensitivity and specificity of 97.2% and 99.9% respectively without losing classification of any tumor. Subtype classifiers achieved median sensitivity of 87.7% and specificity of 94.5% via cross-validation and presented median sensitivity of 79.6% and specificity of 94.6% in two external datasets of 1,999 total samples. Importantly, these external data show that our classifiers can robustly predict the primary site of origin from external microarray data, metastatic cancer data, and patient-derived xenograft (PDX) data. Conclusion: We have demonstrated the utility of gene expression profiles to solve the important clinical challenge of identifying the primary site of origin and the molecular subtype of cancers based on machine learning algorithms. We show, for the first time to our knowledge, that our pan-cancer classifiers can predict multiple cancers’ primary site of origin from metastatic samples. The predictors will be made available as open source software, freely available for academic non-commercial use.
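Median sensitivity and specificity across cancer types, as reported above, are typically computed one class at a time in a one-vs-rest fashion. The following sketch shows that arithmetic under stated assumptions: the abstract does not say exactly how the authors aggregated their metrics, the function names are hypothetical, and the labels and predictions are invented toy values. It assumes at least two classes, each present in the true labels.

```python
from statistics import median

def one_vs_rest_rates(y_true, y_pred, positive):
    """Sensitivity and specificity treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # fraction of true cases that were found
    specificity = tn / (tn + fp)  # fraction of other cases correctly excluded
    return sensitivity, specificity

def median_rates(y_true, y_pred):
    """Median of per-class sensitivity and specificity."""
    rates = [one_vs_rest_rates(y_true, y_pred, c) for c in sorted(set(y_true))]
    return median(r[0] for r in rates), median(r[1] for r in rates)

# Toy predictions for three cancer-type labels (invented values).
y_true = ["BRCA"] * 4 + ["LUAD"] * 4 + ["COAD"] * 2
y_pred = ["BRCA", "BRCA", "BRCA", "LUAD",
          "LUAD", "LUAD", "LUAD", "LUAD",
          "COAD", "BRCA"]
median_sensitivity, median_specificity = median_rates(y_true, y_pred)
print(median_sensitivity, median_specificity)
```

Reporting the median rather than the mean keeps a single poorly classified cancer type from dominating the summary.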


2022 ◽  
Vol 02 ◽  
Author(s):  
Sergey Shityakov ◽  
Jane Pei-Chen Chang ◽  
Ching-Fang Sun ◽  
David Ta-Wei Guu ◽  
Thomas Dandekar ◽  
...  

Background: Omega-3 polyunsaturated fatty acids (PUFAs), such as eicosapentaenoic (EPA) and docosahexaenoic (DHA) acids, have beneficial effects on human health, but their effect on gene expression in elderly individuals (age ≥ 65) is largely unknown. To examine this, gene expression profiles were analyzed in healthy subjects (n = 96) at baseline and after 26 weeks of supplementation with EPA+DHA to determine the up-regulated and down-regulated differentially expressed genes (DEGs) triggered by PUFAs. Protein-protein interaction (PPI) networks were constructed by mapping these DEGs to a human interactome and linking them to specific pathways. Objective: This study aimed to implement supervised machine learning models and protein-protein interaction network analysis of gene expression profiles induced by PUFAs. Methods: The transcriptional profile of GSE12375 was obtained from the Gene Expression Omnibus database, which is based on the Affymetrix NuGO array. The probe cell intensity data were converted into gene expression values, and background correction was performed by the multi-array average algorithm. The LIMMA (Linear Models for Microarray Data) algorithm was implemented to identify relevant DEGs at baseline and after 26 weeks of supplementation with a p-value < 0.05. The DAVID web server was used to identify and construct the enriched KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways. Finally, machine learning (ML) models, including logistic regression, naïve Bayes, and deep neural networks, were built for the analyzed DEGs associated with the specific pathways. Results: The results revealed that up-regulated DEGs were associated with neurotrophin/MAPK signaling, whereas the down-regulated DEGs were linked to cancer, acute myeloid leukemia, and long-term depression pathways. Additionally, ML approaches were able to distinguish the EPA/DHA-treated and control groups, with logistic regression performing best. Conclusion: Overall, this study highlights the pivotal changes in DEGs induced by PUFAs and provides the rationale for the implementation of ML algorithms as predictive models for this type of biomedical data.


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. 3083-3083 ◽  
Author(s):  
Jim Abraham ◽  
Amy B. Heimberger ◽  
Zoran Gatalica ◽  
Wolfgang Michael Korn ◽  
David Spetzler

Background: The diagnosis of a malignancy is typically informed by clinical presentation and tumor tissue features including cell morphology, immunohistochemistry, cytogenetics, and molecular markers. However, in approximately 5-10% of cancers, ambiguity is high enough that no tissue of origin can be determined and the specimen is labeled as a Cancer of Occult/Unknown Primary (CUP). Lack of reliable classification of a tumor poses a significant treatment dilemma for the oncologist, leading to inappropriate and/or delayed treatment. Methods: 40,000 tumor patients with NGS data were used to construct a multiple parameter lineage-specific classification system using an advanced machine learning approach. The dataset for each classifier was split 50% for training and the other 50% for testing. The training task for each classifier was to identify the cases that were similar to the cases it was trained on against a backdrop of randomly selected cases of other histological origins. Results: Tumor lineage classifiers predicted the correct classifications where the primary site was known with accuracies ranging between 85% and 95%. When applied to CUP cases (n = 500), an unequivocal result could be obtained 100% of the time. Conclusions: Lineage predictors can render a histologic diagnosis to CUP cases that can inform treatment and potentially improve outcomes.
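The per-lineage setup described in the Methods (target-lineage cases against a backdrop of randomly selected cases of other origins, split 50/50 into training and testing) could be assembled roughly as below. This is a sketch under stated assumptions, not the authors' system: the abstract does not say how many background cases were drawn, so matching their number to the target cases is a guess, and the function name, case identifiers, and lineage labels are hypothetical.

```python
import random

def build_lineage_dataset(cases, lineage, seed=0):
    """Build one classifier's dataset from (case_id, lineage_label) pairs.

    Returns (train, test), each a list of (case_id, is_target) pairs.
    """
    rng = random.Random(seed)
    positives = [cid for cid, lab in cases if lab == lineage]
    others = [cid for cid, lab in cases if lab != lineage]
    # Backdrop of randomly selected cases of other origins. Matching the
    # number of positives is an assumption; the source gives no ratio.
    background = rng.sample(others, min(len(positives), len(others)))
    labeled = [(cid, 1) for cid in positives] + [(cid, 0) for cid in background]
    rng.shuffle(labeled)
    half = len(labeled) // 2  # 50% for training, the rest for testing
    return labeled[:half], labeled[half:]

# Hypothetical cohort of 30 cases spread over three lineages.
lineage_names = ["colon", "breast", "lung"]
cases = [(f"case{i}", lineage_names[i % 3]) for i in range(30)]
train_set, test_set = build_lineage_dataset(cases, "colon")
print(len(train_set), len(test_set))
```

One such dataset would be built per lineage, giving each classifier its own binary task.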

