Sparse network-based regularization for the analysis of patientomics high-dimensional survival data

2018
Author(s):
André Veríssimo
Eunice Carrasquinha
Marta B. Lopes
Arlindo L. Oliveira
Marie-France Sagot
...

Abstract
The data made available by modern sequencing technologies represent a major challenge in oncological survival analysis, as the increasing amount of molecular data hampers the generation of models that are both accurate and interpretable. To tackle this problem, this work evaluates the introduction of graph centrality measures in classical sparse survival models such as the elastic net. We explore the use of network information as part of the regularization applied to the inverse problem, obtained both from external knowledge on the features evaluated and from the data themselves. A sparse solution is obtained by promoting either features that are isolated in the network or, alternatively, hubs, i.e., features that are highly connected within the network. We show that introducing the degree information of the features when inferring survival models consistently improves the predictive performance of the model in breast invasive carcinoma (BRCA) transcriptomic TCGA data while enhancing model interpretability. Preliminary clinical validation is performed using the Cancer Hallmarks Analytics Tool API and the String database. These case studies are included in the recently released glmSparseNet R package, a flexible tool to explore the potential of sparse network-based regularizers in generalized linear models for the analysis of omics data.
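To make the degree-based regularization idea above concrete, the sketch below weights each feature's elastic net penalty by its degree in a co-expression network through glmnet's penalty.factor argument. The network construction, the degree-to-weight mapping and the simulated data are illustrative assumptions; the glmSparseNet package provides dedicated helpers for this rather than the manual weighting shown here.

```r
## Minimal sketch (not the glmSparseNet API): degree-weighted elastic net for
## right-censored survival data via glmnet's penalty.factor argument.
library(glmnet)

set.seed(1)
n <- 100; p <- 200
x <- matrix(rnorm(n * p), n, p)                  # e.g. gene expression
y <- cbind(time   = rexp(n, rate = 0.1),         # survival time
           status = rbinom(n, 1, 0.7))           # 1 = event, 0 = censored

## Simple co-expression network and node degree (illustrative assumption).
adj <- abs(cor(x)) > 0.5
diag(adj) <- FALSE
degree <- rowSums(adj)

## "Hub" penalty: highly connected features are penalized less (promoted);
## inverting these weights would instead promote isolated ("orphan") features.
w_hub <- 1 / (1 + degree)

fit  <- cv.glmnet(x, y, family = "cox", alpha = 0.7, penalty.factor = w_hub)
beta <- as.numeric(coef(fit, s = "lambda.min"))
sum(beta != 0)   # number of features selected under the network-weighted penalty
```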

2018
Author(s):
Julián Candia
John S. Tsang

Abstract
Background: Regularized generalized linear models (GLMs) are popular regression methods in bioinformatics, particularly useful in scenarios with fewer observations than parameters/features or when many of the features are correlated. In both ridge and lasso regularization, feature shrinkage is controlled by a penalty parameter λ. The elastic net introduces a mixing parameter α to tune the shrinkage continuously from ridge to lasso. Selecting α objectively and determining which features contributed significantly to prediction after model fitting remain practical challenges given the paucity of available software to evaluate performance and statistical significance.
Results: eNetXplorer builds on top of glmnet to address the above issues for linear (Gaussian), binomial (logistic), and multinomial GLMs. It provides new functionalities to empower practical applications by using a cross-validation framework that assesses the predictive performance and statistical significance of a family of elastic net models (as α is varied) and of the corresponding features that contribute to prediction. The user can select which quality metrics to use to quantify the concordance between predicted and observed values, with defaults provided for each GLM. Statistical significance for each model (as defined by α) is determined by comparison to a set of null models generated by random permutations of the response; the same permutation-based approach is used to evaluate the significance of individual features. In the analysis of large and complex biological datasets, such as transcriptomic and proteomic data, eNetXplorer provides summary statistics, output tables, and visualizations to help assess which subset(s) of features have predictive value for a set of response measurements, and to what extent those subset(s) of features can be expanded or reduced via regularization.
Conclusions: This package presents a framework and software for exploratory data analysis and visualization. By making regularized GLMs more accessible and interpretable, eNetXplorer guides the process of generating hypotheses based on features significantly associated with biological phenotypes of interest, e.g., to identify biomarkers for therapeutic responsiveness. eNetXplorer is also generally applicable to any research area that may benefit from predictive modeling and feature identification using regularized GLMs.
Availability and implementation: The package is available under the GPL-3 license at the CRAN repository, https://CRAN.R-project.org/package=eNetXplorer
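As a rough illustration of the permutation-based significance assessment described above, the sketch below compares cross-validated elastic net models fit to the true response against null models fit to permuted responses over a small α grid. The data, metric and grid are illustrative choices, and the loop is a simplified stand-in for what eNetXplorer does with out-of-fold predictions and per-feature statistics.

```r
## Sketch of permutation-based model significance over an alpha grid
## (not the eNetXplorer API; simulated data, small grid for brevity).
library(glmnet)

set.seed(1)
n <- 80; p <- 150
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x[, 1:5] %*% runif(5, 1, 2)) + rnorm(n)

## Proxy performance metric: correlation between fitted and observed values
## at the cross-validated lambda (eNetXplorer uses out-of-fold predictions).
perf <- function(x, y, alpha) {
  fit <- cv.glmnet(x, y, alpha = alpha, family = "gaussian")
  cor(as.numeric(predict(fit, x, s = "lambda.min")), y)
}

alphas <- c(0.1, 0.5, 0.9)
n_perm <- 20                                    # small for illustration
for (a in alphas) {
  obs  <- perf(x, y, a)
  null <- replicate(n_perm, perf(x, sample(y), a))
  cat(sprintf("alpha = %.1f  perf = %.2f  empirical p = %.3f\n",
              a, obs, (sum(null >= obs) + 1) / (n_perm + 1)))
}
```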


2020
pp. 1471082X2096715
Author(s):
Roger S. Bivand
Virgilio Gómez-Rubio

Zhou and Hanson (2015, Nonparametric Bayesian Inference in Biostatistics, pages 215–46, Cham: Springer; 2018, Journal of the American Statistical Association, 113, 571–81; 2020, spBayesSurv: Bayesian Modeling and Analysis of Spatially Correlated Survival Data, R package version 1.1.4) and Zhou et al. (2020, Journal of Statistical Software, 92, 1–33) present methods for estimating spatial survival models using areal data. This article applies their methods to a dataset recording New Orleans business decisions to re-open after Hurricane Katrina; the data were included in LeSage et al. (2011b, Journal of the Royal Statistical Society: Series A (Statistics in Society), 174, 1007–27). In two articles (LeSage et al., 2011a, Significance, 8, 160–63; 2011b, Journal of the Royal Statistical Society: Series A (Statistics in Society), 174, 1007–27), spatial probit models are used to model spatial dependence in this dataset, with decisions to re-open aggregated to the first 90, 180 and 360 days. We re-cast the problem as one of examining the time-to-event records in the data, right-censored because observations ceased before 175 businesses had re-opened; we omit businesses already re-opened when observations began on Day 41. We are interested in checking whether the conclusions about the covariates drawn from aspatial and spatial probit models are modified when applying survival and spatial survival models estimated using MCMC and INLA. In general, we find that the same covariates are associated with re-opening decisions in both modelling approaches. We do, however, find that data collected from three streets differ substantially, and that the streets are probably better handled separately or that the street effect should be included explicitly.
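The survival re-casting described above can be sketched in a few lines with the survival package: each business contributes a time from the start of observation to re-opening, right-censored at the end of the observation window. The variable names, data and the street-level frailty term (used here as a crude stand-in for the spatial effect) are illustrative; the article itself fits spatial survival models with spBayesSurv and INLA.

```r
## Illustrative re-casting of re-opening decisions as right-censored
## time-to-event data, with aspatial baseline fits (simulated data).
library(survival)

set.seed(1)
biz <- data.frame(
  days_to_reopen = sample(41:400, 120, replace = TRUE),
  flood_depth    = runif(120, 0, 3),
  street         = factor(sample(c("A", "B", "C"), 120, replace = TRUE))
)
end_of_obs <- 300                                            # last day observed
biz$status <- as.integer(biz$days_to_reopen <= end_of_obs)   # 1 = re-opened
biz$time   <- pmin(biz$days_to_reopen, end_of_obs)           # censor at window end

## Aspatial baselines: a Weibull AFT model and a Cox model with a street-level
## frailty as a simple proxy for the street/spatial effect discussed above.
aft <- survreg(Surv(time, status) ~ flood_depth, data = biz, dist = "weibull")
cox <- coxph(Surv(time, status) ~ flood_depth + frailty(street), data = biz)
summary(cox)
```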


2019
Vol 39 (7)
pp. 867-878
Author(s):
Benjamin Kearns
Matt D. Stevenson
Kostas Triantafyllopoulos
Andrea Manca

Background. Parametric modeling of survival data is important, and reimbursement decisions may depend on the selected distribution. Accurate predictions require models that are sufficiently flexible to adequately describe the temporal evolution of the hazard function. A rich class of models is available within the framework of generalized linear models (GLMs) and its extensions, but these models are rarely applied to survival data. This article describes the theoretical properties of these more flexible models and compares their performance to standard survival models in a reproducible case study.
Methods. We describe how survival data may be analyzed with GLMs and their extensions: fractional polynomials, spline models, generalized additive models, generalized linear mixed (frailty) models, and dynamic survival models. For each, we compare the strengths and limitations of the approach. For the case study, we compare within-sample fit, the plausibility of extrapolations, and extrapolation performance based on data splitting.
Results. Viewing standard survival models as GLMs shows that many impose a restrictive assumption of linearity. For the case study, GLMs provided better within-sample fit and more plausible extrapolations. However, they did not improve extrapolation performance. We also provide guidance to aid in choosing between the different approaches based on GLMs and their extensions.
Conclusions. The use of GLMs for parametric survival analysis can outperform standard parametric survival models, although the improvements were modest in our case study. This approach is currently seldom used. We provide guidance on both implementing these models and choosing between them. The reproducible case study will help to increase uptake of these models.
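One concrete route from survival data to a GLM, of the kind discussed above, is the piecewise-exponential representation: split follow-up at chosen cut points and fit a Poisson GLM with a log person-time offset. The cut points and example dataset below are illustrative choices.

```r
## Piecewise-exponential (Poisson-offset) GLM for survival data; replacing
## factor(interval) with a spline of time yields the more flexible GAM-type
## hazards discussed in the article.
library(survival)

lung$death <- as.integer(lung$status == 2)          # 1 = died, 0 = censored
cuts  <- c(90, 180, 365, 730)                       # interval boundaries (days)
lung2 <- survSplit(Surv(time, death) ~ age + sex, data = lung,
                   cut = cuts, episode = "interval")
lung2$exposure <- lung2$time - lung2$tstart          # person-time in each interval

fit <- glm(death ~ factor(interval) + age + sex + offset(log(exposure)),
           family = poisson, data = lung2)
summary(fit)
```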


2021
Author(s):
Kaiqiao Li
Sijie Yao
Zhenyu Zhang
Biwei Cao
Christopher Wilson
...  

Motivation: Gradient boosting decision tree (GBDT) is a powerful ensemble machine learning method that has the potential to accelerate biomarker discovery from high-dimensional molecular data. Recent algorithmic advances, such as Extreme Gradient Boosting (XGB) and Light Gradient Boosting (LGB), have rendered GBDT training more efficient, scalable and accurate. These modern techniques, however, have not yet been widely adopted in biomarker discovery based on patient survival data, which are key clinical outcomes or endpoints in cancer studies.
Results: In this paper, we present a new R package, Xsurv, as an integrated solution that applies two modern GBDT training frameworks, namely XGB and LGB, to the modeling of censored survival outcomes. Based on a comprehensive set of simulations, we benchmark the new approaches against traditional methods, including the stepwise Cox regression model and the original gradient boosting function implemented in the package gbm. We also demonstrate the application of Xsurv in analyzing a melanoma methylation dataset. Together, these results suggest that Xsurv is a useful and computationally viable tool for screening a large number of prognostic candidate biomarkers, which may facilitate cancer translational and clinical research.
Availability: Xsurv is freely available as an R package at: https://github.com/topycyao/Xsurv
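Since the Xsurv interface itself is not shown in the abstract, the sketch below illustrates the underlying XGBoost Cox objective it builds on, using plain xgboost with simulated data; for "survival:cox", right-censored observations are encoded as negative times in the label. All names and settings here are illustrative assumptions rather than the Xsurv API.

```r
## Plain-xgboost sketch of gradient-boosted Cox regression for censored
## survival outcomes (not the Xsurv interface; simulated data).
library(xgboost)

set.seed(1)
n <- 300; p <- 50
x <- matrix(rnorm(n * p), n, p)                    # e.g. methylation features
time   <- rexp(n, rate = exp(0.5 * x[, 1]))
status <- rbinom(n, 1, 0.7)                        # 1 = event, 0 = censored
label  <- ifelse(status == 1, time, -time)         # negative = right-censored

dtrain <- xgb.DMatrix(data = x, label = label)
params <- list(objective = "survival:cox", eval_metric = "cox-nloglik",
               eta = 0.05, max_depth = 3)
fit <- xgb.train(params = params, data = dtrain, nrounds = 200)

## Feature importance as a simple screen for prognostic candidate biomarkers.
head(xgb.importance(model = fit))
```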


2020
Vol 10 (1)
Author(s):
Javier Fernández-López
M. Teresa Telleria
Margarita Dueñas
Mara Laguna-Castro
Klaus Schliep
...  

Abstract
The use of different sources of evidence has been recommended for species delimitation analyses aiming to resolve taxonomic issues. In this study, we use a maximum likelihood framework to combine morphological and molecular traits to study the case of Xylodon australis (Hymenochaetales, Basidiomycota) using the locate.yeti function from the phytools R package. Xylodon australis has been considered a single species distributed across Australia, New Zealand and Patagonia. Multi-locus phylogenetic analyses were conducted to unmask the actual diversity under the X. australis name as well as its kinship relations with its relatives. To assess the taxonomic position of each clade, the locate.yeti function was used to place the X. australis type material, for which no molecular data were available, in the molecular phylogeny using continuous morphological traits. Two different species were distinguished under the X. australis name, one from Australia–New Zealand and the other from Patagonia. In addition, a close relationship with Xylodon lenis, a species from Southeast Asia, was confirmed for the Patagonian clade. We discuss the implications of our results for the biogeographical history of this genus, and we evaluate the potential of this method for use with historical collections for which molecular data are not available.
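A hedged sketch of the placement step described above follows: locate.yeti() from phytools attaches a taxon lacking molecular data to a molecular phylogeny using continuous traits. The tree, the trait matrix and the exact argument usage are illustrative and should be checked against the phytools documentation.

```r
## Sketch of placing a sequence-less type specimen into a molecular phylogeny
## from continuous (morphological) traits with phytools::locate.yeti
## (simulated tree and traits; argument usage is an assumption to verify).
library(phytools)

set.seed(1)
tree   <- pbtree(n = 12)                 # stand-in for the molecular phylogeny
traits <- fastBM(tree, nsim = 4)         # 4 continuous characters for the tips

## Add measurements for a "type" specimen with no sequence data; its row name
## does not occur among the tree's tip labels, which is how the taxon to be
## placed is identified.
traits <- rbind(traits, type = colMeans(traits) + rnorm(4, sd = 0.1))

placed <- locate.yeti(tree, traits, method = "ML")
plot(placed)
```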


2019
Author(s):
Donald Salami
Carla Alexandra Sousa
Maria do Rosário Oliveira Martins
César Capinha

Abstract
The geographical spread of dengue is a global public health concern. It is largely mediated by the importation of dengue from endemic to non-endemic areas via the increasing connectivity of the global air transport network. The dynamic nature and intrinsic heterogeneity of the air transport network make it challenging to predict dengue importation.
Here, we explore the capabilities of state-of-the-art machine learning algorithms to predict dengue importation. We trained four machine learning classification algorithms using six years of historical dengue importation data for 21 countries in Europe, together with connectivity indices mediating importation and air transport network centrality measures. Predictive performance of the classifiers was evaluated using the area under the receiver operating characteristic curve, sensitivity, and specificity measures. Finally, we applied practical model-agnostic methods to provide an in-depth explanation of our optimal model’s predictions on a global and local scale.
Our best-performing model achieved high predictive accuracy, with an area under the receiver operating characteristic score of 0.94 and a maximized sensitivity score of 0.88. The predictor variables identified as most important were the source country’s dengue incidence rate, population size, and volume of air passengers. Network centrality measures, describing the positioning of European countries within the air travel network, were also influential to the predictions.
We demonstrated the high predictive performance of a machine learning model in predicting dengue importation and the utility of model-agnostic methods for offering a comprehensive understanding of the reasons behind the predictions. Similar approaches can be utilized in the development of an operational early warning surveillance system for dengue importation.
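To make the evaluation step above concrete, the sketch below fits a simple importation classifier on simulated data and scores it with the area under the receiver operating characteristic curve plus sensitivity and specificity at the Youden-optimal threshold, using pROC; the predictors, model and data are illustrative stand-ins for the study's feature set and algorithms.

```r
## Illustrative classifier evaluation with AUC, sensitivity and specificity
## (simulated data; the study itself compares several ML algorithms).
library(pROC)

set.seed(1)
n <- 500
dat <- data.frame(
  incidence_source  = rexp(n, 1),          # dengue incidence in source country
  air_passengers    = rlnorm(n, 8, 1),     # passenger volume on the route
  degree_centrality = runif(n)             # destination's network centrality
)
lin <- 0.8 * dat$incidence_source + 0.3 * as.numeric(scale(log(dat$air_passengers)))
dat$imported <- rbinom(n, 1, plogis(lin - 1))

fit  <- glm(imported ~ ., family = binomial, data = dat)
prob <- predict(fit, type = "response")

roc_obj <- roc(dat$imported, prob, quiet = TRUE)
auc(roc_obj)
coords(roc_obj, x = "best",                 # Youden-optimal threshold
       ret = c("threshold", "sensitivity", "specificity"))
```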


2020
Author(s):
Ben J. Brintz
Benjamin Haaland
Joel Howard
Dennis L. Chao
Joshua L. Proctor
...  

Abstract
Traditional clinical prediction models focus on parameters of the individual patient. For infectious diseases, sources external to the patient, including characteristics of prior patients and seasonal factors, may improve predictive performance. We describe the development of a predictive model that integrates multiple sources of data in a principled statistical framework using a post-test odds formulation. Our method enables electronic real-time updating and flexibility, such that components can be included or excluded according to data availability. We apply this method to the prediction of the etiology of pediatric diarrhea, where “pre-test” epidemiologic data may be highly informative. Diarrhea has a high burden in low-resource settings, and antibiotics are often over-prescribed. We demonstrate that our integrative method outperforms traditional prediction in accurately identifying cases with a viral etiology, and show that its clinical application, especially when used with an additional diagnostic test, could result in a 61% reduction in inappropriately prescribed antibiotics.
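The post-test odds formulation mentioned above can be written out in a few lines: pre-test (epidemiologic) odds are multiplied by the likelihood ratio contributed by each additional information source, assuming conditional independence. The numbers below are purely illustrative.

```r
## Worked sketch of the post-test odds update (illustrative numbers only).
post_test_prob <- function(pre_test_prob, likelihood_ratios) {
  pre_odds  <- pre_test_prob / (1 - pre_test_prob)
  post_odds <- pre_odds * prod(likelihood_ratios)   # assumes conditional independence
  post_odds / (1 + post_odds)
}

## e.g. a seasonal pre-test probability of a viral etiology of 0.45, updated by
## a clinical-presentation component (LR = 2.1) and a diagnostic test (LR = 4.0)
post_test_prob(0.45, c(2.1, 4.0))   # ~0.87
```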

