A Methodological Framework to Discover Pharmacogenomic Interactions Based on Random Forests

Genes ◽  
2021 ◽  
Vol 12 (6) ◽  
pp. 933
Author(s):  
Salvatore Fasola ◽  
Giovanna Cilluffo ◽  
Laura Montalbano ◽  
Velia Malizia ◽  
Giuliana Ferrante ◽  
...  

The identification of genomic alterations in tumor tissues, including somatic mutations, deletions, and gene amplifications, produces large amounts of data, which can be correlated with a diversity of therapeutic responses. We aimed to provide a methodological framework for discovering pharmacogenomic interactions based on Random Forests. We matched two databases from the Cancer Cell Line Encyclopedia (CCLE) project and the Genomics of Drug Sensitivity in Cancer (GDSC) project. For a total of 648 shared cell lines, we considered 48,270 gene alterations from CCLE as input features and the area under the dose-response curve (AUC) for 265 drugs from GDSC as the outcomes. A three-step reduction to 501 alterations was performed by selecting known driver genes and excluding very frequent/infrequent alterations and redundant ones. For each model, we used the concordance correlation coefficient (CCC) to assess predictive performance and permutation importance to assess the contribution of each alteration. In a reasonable computational time (56 min), we identified 12 compounds whose response was at least fairly sensitive (CCC > 0.20) to the alteration profiles. Some differences were found in the sets of influential alterations, providing clues for discovering significant drug-gene interactions. The proposed methodological framework can be helpful for mining pharmacogenomic interactions.
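
The pipeline described above (a Random Forest per drug, scored with the CCC and interrogated with permutation importance) can be sketched on toy data. Everything below is illustrative: the feature matrix, effect sizes, and model settings are invented stand-ins for the real CCLE/GDSC inputs, not the authors' code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-ins: 200 cell lines x 30 binary gene alterations, with a
# drug-response AUC driven by alterations 0 and 3 (assumed for the demo).
X = rng.integers(0, 2, size=(200, 30)).astype(float)
y = 0.6 - 0.15 * X[:, 0] - 0.10 * X[:, 3] + rng.normal(0, 0.03, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

def ccc(a, b):
    """Lin's concordance correlation coefficient."""
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    return 2 * cov / (a.var() + b.var() + (a.mean() - b.mean()) ** 2)

print(f"CCC on held-out cell lines: {ccc(y_te, rf.predict(X_te)):.2f}")

# Permutation importance flags the alterations the drug response is
# sensitive to, mirroring the paper's second assessment step.
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
top = np.argsort(imp.importances_mean)[::-1][:2]
print("most influential alterations:", top)
```

In the real framework this loop would run once per drug, flagging a compound as "fairly sensitive" when its held-out CCC clears the chosen threshold.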

Author(s):  
Kazutaka Uchida ◽  
Junichi Kouno ◽  
Shinichi Yoshimura ◽  
Norito Kinjo ◽  
Fumihiro Sakakibara ◽  
...  

Abstract
In conjunction with recent advancements in machine learning (ML), such technologies have been applied in various fields owing to their high predictive performance. We aimed to develop a prehospital stroke scale using ML. We conducted a multi-center retrospective and prospective cohort study. The training cohort comprised eight centers in Japan from June 2015 to March 2018, and the test cohort comprised 13 centers from April 2019 to March 2020. We used three different ML algorithms (logistic regression, random forests, and XGBoost) to develop the models. The main outcomes were large vessel occlusion (LVO), intracranial hemorrhage (ICH), subarachnoid hemorrhage (SAH), and cerebral infarction (CI) other than LVO. The predictive abilities were validated in the test cohort with accuracy, positive predictive value, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and F score. The training cohort included 3178 patients with 337 LVO, 487 ICH, 131 SAH, and 676 CI cases, and the test cohort included 3127 patients with 183 LVO, 372 ICH, 90 SAH, and 577 CI cases. The overall accuracies were 0.65, and the positive predictive values, sensitivities, specificities, AUCs, and F scores were stable in the test cohort. The classification abilities were also fair for all ML models. The AUCs for LVO of logistic regression, random forests, and XGBoost were 0.89, 0.89, and 0.88, respectively, in the test cohort, and these values were higher than those of previously reported prediction models for LVO. The ML models developed to predict the probability and type of stroke at the prehospital stage had superior predictive abilities.
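
The three-model comparison with test-cohort AUC can be sketched as below. The data are synthetic, and `GradientBoostingClassifier` stands in for XGBoost to keep the sketch dependency-light; nothing here reproduces the study's actual features or cohorts.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for prehospital observations; a rare positive class
# mimics an outcome such as LVO (imbalance ratio is assumed).
X, y = make_classification(n_samples=2000, n_features=15, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                       # train on the training cohort
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {aucs[name]:.2f}")    # validate on the test cohort
```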


2020 ◽  
Vol 2020 ◽  
pp. 1-8
Author(s):  
Yi Luo ◽  
Zhenzhen Zhang ◽  
Jianfan Liu ◽  
Linqing Li ◽  
Xuezheng Xu ◽  
...  

Melanoma is a human skin malignancy with high invasiveness and poor prognosis. The limited understanding of genomic alterations in melanomas in China impedes diagnosis and the selection of therapeutic strategies. We conducted comprehensive genomic profiling of melanomas in 39 primary and metastatic formalin-fixed paraffin-embedded (FFPE) samples from 27 patients in China, based on an NGS panel of 223 genes. No significant difference in gene alterations was found between primary and metastatic melanomas. The status of germline mutations, CNVs, and somatic mutations in our cohort was quite different from that reported in Western populations. We further delineated the mutation patterns of 4 molecular subgroups (BRAF, RAS, NF1, and Triple-WT) of melanoma in our cohort. BRAF mutations were more frequently identified in melanomas without chronic sun-induced damage (non-CSD), while RAS mutations were more likely observed in acral melanomas. The NF1 and Triple-WT subgroups were unbiased between melanomas arising in non-CSD and acral skin. BRAF, RAS, and NF1 mutations were significantly associated with lymph node metastasis or presence of ulceration, implying that these cancer driver genes are independent prognostic factors. In summary, our results suggest that the mutational profiles of malignant melanomas in China differ significantly from those in Western countries, and that both gene mutation and amplification play an important role in the development and progression of melanomas.


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Heli Julkunen ◽  
Anna Cichonska ◽  
Prson Gautam ◽  
Sandor Szedmak ◽  
Jane Douat ◽  
...  

Abstract
We present comboFM, a machine learning framework for predicting the responses of drug combinations in pre-clinical studies, such as those based on cell lines or patient-derived cells. comboFM models the cell context-specific drug interactions through higher-order tensors, and efficiently learns latent factors of the tensor using powerful factorization machines. The approach enables comboFM to leverage information from previous experiments performed on similar drugs and cells when predicting responses of new combinations in previously untested cells, thereby achieving highly accurate predictions despite sparsely populated data tensors. We demonstrate the high predictive performance of comboFM in various prediction scenarios using data from cancer cell line pharmacogenomic screens. Subsequent experimental validation of a set of previously untested drug combinations further supports the practical and robust applicability of comboFM. For instance, we confirm a novel synergy between the anaplastic lymphoma kinase (ALK) inhibitor crizotinib and the proteasome inhibitor bortezomib in lymphoma cells. Overall, our results demonstrate that comboFM provides an effective means for systematic pre-screening of drug combinations to support precision oncology applications.
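
The latent-factor machinery behind comboFM can be illustrated with a minimal second-order factorization machine (the real model uses higher-order FMs over drug-drug-cell tensors). The feature layout, dimensions, and weights below are invented for the demo; the point is how shared latent factors let one-hot-encoded (drug, drug, cell) inputs borrow strength across related entities.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine score.
    w0: bias, w: linear weights (d,), V: latent factors (d, k)."""
    linear = w0 + w @ x
    # Efficient O(d*k) rewriting of the pairwise interaction sum
    # sum_{i<j} <v_i, v_j> x_i x_j.
    s = V.T @ x                       # (k,) factor-wise weighted sums
    s2 = (V ** 2).T @ (x ** 2)        # (k,) squared-term corrections
    return linear + 0.5 * np.sum(s ** 2 - s2)

rng = np.random.default_rng(0)
d, k = 9, 3                           # e.g. 3 drugs x 2 slots + 3 cell lines
w0, w, V = 0.1, rng.normal(size=d), rng.normal(size=(d, k))

x = np.zeros(d)
x[[0, 4, 7]] = 1.0                    # one-hot: drug 0 + drug 1 + cell line 1
print(f"predicted combination response: {fm_predict(x, w0, w, V):.3f}")
```

Because a drug's latent vector is shared across every combination it appears in, predictions remain possible for (drug, drug, cell) triples never measured together, which is the sparse-tensor setting the abstract describes.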


2020 ◽  
Author(s):  
Hossein Foroozand ◽  
Steven V. Weijs

Machine learning is a fast-growing branch of data-driven modelling, and its main objective is to use computational methods to become more accurate at predicting outcomes without being explicitly programmed. In this field, one way to improve model predictions is to use a large collection of models (called an ensemble) instead of a single one. Each model is trained on slightly different samples of the original data, and their predictions are averaged. This is called bootstrap aggregating, or bagging, and is widely applied. A recurring question in previous works is how to choose the ensemble size of training data sets for tuning the weights in machine learning. The computational cost of ensemble-based methods scales with the size of the ensemble, but excessively reducing the ensemble size comes at the cost of reduced predictive performance. The choice of ensemble size has often been determined by the size of the input data and the available computational power, which can become a limiting factor for larger datasets and the training of complex models. In this research, our hypothesis is that if an ensemble of artificial neural network (ANN) models, or any other machine learning technique, uses the most informative ensemble members for training rather than all bootstrapped ensemble members, it could reduce the computational time substantially without negatively affecting simulation performance.
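
The bagging procedure and the cost/accuracy trade-off of ensemble size can be shown in a few lines. This is a generic sketch on an invented 1-D regression task with simple line fits as ensemble members, not the authors' ANN setup or their member-selection method.

```python
import numpy as np

rng = np.random.default_rng(0)
# Noisy 1-D task: true signal y = 2x, observed with Gaussian noise.
x = np.linspace(0, 1, 100)
y = 2 * x + rng.normal(0, 0.5, x.size)

def bagged_fit(n_members):
    """Bagging: fit each member to a bootstrap resample, average predictions."""
    preds = []
    for _ in range(n_members):
        idx = rng.integers(0, x.size, x.size)        # bootstrap resample
        slope, intercept = np.polyfit(x[idx], y[idx], 1)
        preds.append(slope * x + intercept)
    return np.mean(preds, axis=0)

# Larger ensembles cost proportionally more compute; the question raised
# above is how small the ensemble can be before accuracy degrades.
for m in (1, 10, 100):
    err = np.mean((bagged_fit(m) - 2 * x) ** 2)
    print(f"{m:3d} members: MSE vs. true signal = {err:.4f}")
```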


2020 ◽  
Author(s):  
Theano Iliopoulou ◽  
Demetris Koutsoyiannis

Trends are customarily identified in rainfall data in the framework of explanatory modelling. However, little insight has been gained by this type of analysis with respect to performance in foresight. In this work, we examine the out-of-sample predictive performance of linear trends through extensive investigation of 60 of the longest daily rainfall records available worldwide. We devise a systematic methodological framework in which linear trends are compared to simpler mean models, based on their performance in predicting climatic-scale (30-year) annual rainfall indices, i.e. maxima, totals, wet-day average, and probability dry, from long-term daily records. Parallel experiments on synthetic time series are performed in order to provide theoretical insight into the results, and the role of parsimony in predictive modelling is discussed. In line with the empirical findings, it is shown that, prediction-wise, simple is preferable to trendy.
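
The trend-versus-mean comparison can be sketched on a synthetic annual series. The record length, split, and rainfall statistics below are invented; the point is the experimental design: fit both models in-sample, then score each on the next 30-year climatic value.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 120-year annual rainfall record, stationary by construction
# (mean 800 mm, sd 150 mm are assumed values for the demo).
years = np.arange(120)
rain = rng.normal(800, 150, 120)

train_y, train_r = years[:90], rain[:90]
target = rain[90:].mean()                    # out-of-sample 30-year climatic value

mean_pred = train_r.mean()                   # "mean model": one parameter
slope, intercept = np.polyfit(train_y, train_r, 1)
trend_pred = slope * years[90:].mean() + intercept  # linear trend extrapolated

print(f"mean model error:  {abs(mean_pred - target):.1f} mm")
print(f"trend model error: {abs(trend_pred - target):.1f} mm")
```

On a stationary series like this one, the trend model's extra parameter mostly extrapolates noise, which is the parsimony argument the abstract alludes to.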


2013 ◽  
Vol 6 (6) ◽  
pp. 496-505 ◽  
Author(s):  
Stacey J. Winham ◽  
Robert R. Freimuth ◽  
Joanna M. Biernacka


2020 ◽  
Author(s):  
Yannis Pantazis ◽  
Christos Tselas ◽  
Kleanthi Lakiotaki ◽  
Vincenzo Lagani ◽  
Ioannis Tsamardinos

Abstract
High-throughput technologies such as microarrays and RNA-sequencing (RNA-seq) allow precise quantification of transcriptomic profiles, generating datasets that are inevitably high-dimensional. In this work, we investigate whether the whole human transcriptome can be represented in a compressed, low-dimensional latent space without losing relevant information. We thus constructed low-dimensional latent feature spaces of the human genome by utilizing three dimensionality reduction approaches and a diverse set of curated datasets. We applied standard Principal Component Analysis (PCA), kernel PCA, and autoencoder neural networks to 1360 datasets from four different measurement technologies. The latent feature spaces were tested for their ability to (a) reconstruct the original data and (b) improve predictive performance on validation datasets not used during the creation of the feature space. While linear techniques show better reconstruction performance, nonlinear approaches, particularly neural-based models, seem able to capture non-additive interaction effects and thus enjoy stronger predictive capabilities. Our results show that low-dimensional representations of the human transcriptome can be achieved by integrating hundreds of datasets, despite the limited sample size of each dataset and the biological/technological heterogeneity across studies. The created space is two to three orders of magnitude smaller than the raw data, capturing a large portion of the original data variability and eventually reducing computational time for downstream analyses.
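
The linear end of the comparison, PCA reconstruction of a high-dimensional expression matrix, can be sketched via SVD. The matrix below is a toy stand-in (invented sample counts, gene counts, and latent dimensionality), not the integrated compendium from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "expression matrix": 300 samples x 1000 genes generated from 10
# latent programs plus noise (all dimensions assumed for the demo).
Z = rng.normal(size=(300, 10))
W = rng.normal(size=(10, 1000))
X = Z @ W + rng.normal(0, 0.1, (300, 1000))
Xc = X - X.mean(axis=0)                       # center before PCA

# PCA via SVD: keep k components, reconstruct, measure retained variance.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
for k in (2, 10, 50):
    Xk = (U[:, :k] * s[:k]) @ Vt[:k]
    retained = 1 - np.sum((Xc - Xk) ** 2) / np.sum(Xc ** 2)
    print(f"k={k:3d}: {retained:.1%} of variance retained")
```

A latent space two to three orders of magnitude smaller than the raw data corresponds here to keeping k in the tens for a matrix with thousands of gene columns.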


Author(s):  
Joo Sang Lee ◽  
Nishanth Ulhas Nair ◽  
Lesley Chapman ◽  
Sanju Sinha ◽  
Kun Wang ◽  
...  

Abstract
Precision oncology has made significant advances in the last few years, mainly by targeting actionable mutations in cancer driver genes. However, the proportion of patients whose tumors can be targeted therapeutically remains limited. Recent studies have begun to explore the benefit of analyzing tumor transcriptomics data to guide patient treatment, raising the need for new approaches to accomplish that systematically. Here we show that computationally derived genetic interactions can successfully predict patient response. Assembling a broad repertoire of 32 datasets spanning more than 1,500 patients and including both tumor transcriptomics and response data, we predicted the response in 17 out of 21 targeted and 8 out of 11 checkpoint therapy datasets across 8 different cancer types with considerable accuracy, without ever training on these datasets. Analyzing the recently published multi-arm WINTHER trial, we show that the fraction of patients benefiting from transcriptomic-based treatments could potentially be increased markedly, from 15% to about 85%, by targeting synthetic lethal vulnerabilities in their tumors. In summary, this is the first computational approach to obtain considerable predictive performance across many different targeted and immunotherapy datasets, providing a promising new way of guiding cancer treatment based on the tumor transcriptomics of cancer patients.


2021 ◽  
Author(s):  
Ilkin Bayramli ◽  
Victor Castro ◽  
Yuval Barak-Corren ◽  
Emily Masden ◽  
Matthew Nock ◽  
...  

Background. Suicide is one of the leading causes of death worldwide, yet clinicians find it difficult to reliably identify individuals at high risk for suicide. Algorithmic approaches for suicide risk detection have been developed in recent years, mostly based on data from electronic health records (EHRs). These models typically do not optimally exploit the valuable temporal information inherent in such longitudinal data. Methods. We propose a temporally enhanced variant of the Random Forest model, Omni-Temporal Balanced Random Forests (OT-BRFs), that incorporates temporal information in every tree within the forest. We develop and validate this model using longitudinal EHRs and clinician notes from the Mass General Brigham Health System recorded between 1998 and 2018, and compare its performance to a baseline Naive Bayes classifier and two standard versions of Balanced Random Forests. Results. Temporal variables were found to be associated with suicide risk. RF models were more accurate than Naive Bayes classifiers at predicting suicide risk in advance (AUC = 0.824 vs. 0.754, respectively). The OT-BRF model performed best among all RF approaches (0.339 sensitivity at 95% specificity, compared to 0.290 and 0.286 for the other two RF models). Temporal variables were assigned high importance by the models that incorporated them. Discussion. We demonstrate that temporal variables have an important role to play in suicide risk detection, and that requiring their inclusion in all random forest trees leads to increased predictive performance. Integrating temporal information into risk prediction models helps the models interpret patient data in temporal context, improving predictive performance.
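
The headline metric above, sensitivity at a fixed 95% specificity, is computed by thresholding risk scores at the controls' 95th percentile. The sketch below uses invented score distributions and class sizes, purely to show the calculation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy risk scores: 950 controls and 50 cases, with cases shifted upward
# (the shift of 1.5 sd is an assumption for the demo).
controls = rng.normal(0.0, 1.0, 950)
cases = rng.normal(1.5, 1.0, 50)

# Threshold at the controls' 95th percentile, so ~5% of controls are
# flagged, i.e. specificity is pinned near 0.95.
threshold = np.quantile(controls, 0.95)
sensitivity = np.mean(cases >= threshold)
specificity = np.mean(controls < threshold)
print(f"threshold={threshold:.2f}  sensitivity={sensitivity:.3f}  "
      f"specificity={specificity:.3f}")
```

Comparing models at a fixed, clinically tolerable false-positive rate, rather than by overall AUC alone, is what makes the 0.339 vs. 0.290/0.286 contrast in the Results interpretable.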


2021 ◽  
Author(s):  
Nikolaos Lykoskoufis ◽  
Evarist Planet ◽  
Halit Ongen ◽  
Didier Trono ◽  
Emmanouil T Dermitzakis

Abstract
Transposable elements (TEs) are interspersed repeats that make up more than half of the human genome, and TE-embedded regulatory sequences are increasingly recognized as major components of the human regulome. Perturbations of this system can contribute to tumorigenesis, but the impact of TEs on gene expression in cancer cells remains to be fully assessed. Here, we analyzed 275 normal colon and 276 colorectal cancer (CRC) samples from the SYSCOL colorectal cancer cohort and discovered 10,111 and 5,152 TE expression quantitative trait loci (eQTLs) in normal and tumor tissues, respectively. Amongst the latter, 376 were exclusive to CRC, likely driven by changes in methylation patterns. We found that transcription factors are more enriched in tumor-specific TE-eQTLs than in shared TE-eQTLs, indicating that TEs are more specifically regulated in tumor than in normal tissue. Using Bayesian networks to assess the causal relationships between eQTL variants, TEs, and genes, we identified 1,758 TEs that are mediators of genetic effects, altering the expression of 1,626 nearby genes significantly more in tumor than in normal tissue, of which 51 are cancer driver genes. We show that tumor-specific TE-eQTLs trigger the driver capability of TEs, subsequently impacting the expression of nearby genes. Collectively, our results highlight a global profile of a new class of cancer drivers, thereby enhancing our understanding of tumorigenesis and providing potential new candidate mechanisms for therapeutic target development.

