Sparse network-based regularization for the analysis of patientomics high-dimensional survival data

2018
Author(s):
André Veríssimo
Eunice Carrasquinha
Marta B. Lopes
Arlindo L. Oliveira
Marie-France Sagot
...

Abstract
The data made available by modern sequencing technologies represent a major challenge in oncological survival analysis, as the increasing amount of molecular data hampers the generation of models that are both accurate and interpretable. To tackle this problem, this work evaluates the introduction of graph centrality measures in classical sparse survival models such as the elastic net. We explore the use of network information as part of the regularization applied to the inverse problem, obtained both from external knowledge on the features evaluated and from the data themselves. A sparse solution is obtained by promoting either features that are isolated in the network or, alternatively, hubs, i.e., features that are highly connected within the network. We show that introducing the degree information of the features when inferring survival models consistently improves the predictive performance of the model in breast invasive carcinoma (BRCA) transcriptomic TCGA data while enhancing model interpretability. Preliminary clinical validation is performed using the Cancer Hallmarks Analytics Tool API and the String database. These case studies are included in the recently released glmSparseNet R package, a flexible tool to explore the potential of sparse network-based regularizers in generalized linear models for the analysis of omics data.
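To make the degree-based regularization idea above concrete, the sketch below weights each feature's elastic net penalty by its degree in a co-expression network through glmnet's penalty.factor argument. The network construction, the degree-to-weight mapping and the simulated data are illustrative assumptions; the glmSparseNet package provides dedicated helpers for this rather than the manual weighting shown here.

```r
## Minimal sketch (not the glmSparseNet API): degree-weighted elastic net for
## right-censored survival data via glmnet's penalty.factor argument.
library(glmnet)

set.seed(1)
n <- 100; p <- 200
x <- matrix(rnorm(n * p), n, p)                  # e.g. gene expression
y <- cbind(time   = rexp(n, rate = 0.1),         # survival time
           status = rbinom(n, 1, 0.7))           # 1 = event, 0 = censored

## Simple co-expression network and node degree (illustrative assumption).
adj <- abs(cor(x)) > 0.5
diag(adj) <- FALSE
degree <- rowSums(adj)

## "Hub" penalty: highly connected features are penalized less (promoted);
## inverting these weights would instead promote isolated ("orphan") features.
w_hub <- 1 / (1 + degree)

fit  <- cv.glmnet(x, y, family = "cox", alpha = 0.7, penalty.factor = w_hub)
beta <- as.numeric(coef(fit, s = "lambda.min"))
sum(beta != 0)   # number of features selected under the network-weighted penalty
```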

2018
Author(s):
Julián Candia
John S. Tsang

Abstract
Background: Regularized generalized linear models (GLMs) are popular regression methods in bioinformatics, particularly useful in scenarios with fewer observations than parameters/features or when many of the features are correlated. In both ridge and lasso regularization, feature shrinkage is controlled by a penalty parameter λ. The elastic net introduces a mixing parameter α to tune the shrinkage continuously from ridge to lasso. Selecting α objectively and determining which features contributed significantly to prediction after model fitting remain practical challenges given the paucity of available software to evaluate performance and statistical significance.
Results: eNetXplorer builds on top of glmnet to address the above issues for linear (Gaussian), binomial (logistic), and multinomial GLMs. It provides new functionalities to empower practical applications by using a cross-validation framework that assesses the predictive performance and statistical significance of a family of elastic net models (as α is varied) and of the corresponding features that contribute to prediction. The user can select which quality metrics to use to quantify the concordance between predicted and observed values, with defaults provided for each GLM. Statistical significance for each model (as defined by α) is determined by comparison to a set of null models generated by random permutations of the response; the same permutation-based approach is used to evaluate the significance of individual features. In the analysis of large and complex biological datasets, such as transcriptomic and proteomic data, eNetXplorer provides summary statistics, output tables, and visualizations to help assess which subset(s) of features have predictive value for a set of response measurements, and to what extent those subset(s) of features can be expanded or reduced via regularization.
Conclusions: This package presents a framework and software for exploratory data analysis and visualization. By making regularized GLMs more accessible and interpretable, eNetXplorer guides the process of generating hypotheses based on features significantly associated with biological phenotypes of interest, e.g., to identify biomarkers for therapeutic responsiveness. eNetXplorer is also generally applicable to any research area that may benefit from predictive modeling and feature identification using regularized GLMs.
Availability and implementation: The package is available under the GPL-3 license at the CRAN repository, https://CRAN.R-project.org/package=eNetXplorer
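As a rough illustration of the permutation-based significance assessment described above, the sketch below compares cross-validated elastic net models fit to the true response against null models fit to permuted responses over a small α grid. The data, metric and grid are illustrative choices, and the loop is a simplified stand-in for what eNetXplorer does with out-of-fold predictions and per-feature statistics.

```r
## Sketch of permutation-based model significance over an alpha grid
## (not the eNetXplorer API; simulated data, small grid for brevity).
library(glmnet)

set.seed(1)
n <- 80; p <- 150
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x[, 1:5] %*% runif(5, 1, 2)) + rnorm(n)

## Proxy performance metric: correlation between fitted and observed values
## at the cross-validated lambda (eNetXplorer uses out-of-fold predictions).
perf <- function(x, y, alpha) {
  fit <- cv.glmnet(x, y, alpha = alpha, family = "gaussian")
  cor(as.numeric(predict(fit, x, s = "lambda.min")), y)
}

alphas <- c(0.1, 0.5, 0.9)
n_perm <- 20                                    # small for illustration
for (a in alphas) {
  obs  <- perf(x, y, a)
  null <- replicate(n_perm, perf(x, sample(y), a))
  cat(sprintf("alpha = %.1f  perf = %.2f  empirical p = %.3f\n",
              a, obs, (sum(null >= obs) + 1) / (n_perm + 1)))
}
```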


2020
pp. 1471082X2096715
Author(s):
Roger S. Bivand
Virgilio Gómez-Rubio

Zhou and Hanson (2015, Nonparametric Bayesian Inference in Biostatistics, pages 215–46, Cham: Springer; 2018, Journal of the American Statistical Association, 113, 571–81; 2020, spBayesSurv: Bayesian Modeling and Analysis of Spatially Correlated Survival Data, R package version 1.1.4) and Zhou et al. (2020, Journal of Statistical Software, 92, 1–33) present methods for estimating spatial survival models using areal data. This article applies their methods to a dataset recording New Orleans business decisions to re-open after Hurricane Katrina; the data were included in LeSage et al. (2011b, Journal of the Royal Statistical Society: Series A (Statistics in Society), 174, 1007–27). In two articles (LeSage et al., 2011a, Significance, 8, 160–63; 2011b, Journal of the Royal Statistical Society: Series A (Statistics in Society), 174, 1007–27), spatial probit models are used to model spatial dependence in this dataset, with decisions to re-open aggregated to the first 90, 180 and 360 days. We re-cast the problem as one of examining the time-to-event records in the data, right-censored because observations ceased before 175 businesses had re-opened; we omit businesses already re-opened when observations began on Day 41. We are interested in checking whether the conclusions about the covariates drawn from aspatial and spatial probit models are modified when applying survival and spatial survival models estimated using MCMC and INLA. In general, we find that the same covariates are associated with re-opening decisions in both modelling approaches. We do, however, find that data collected from three streets differ substantially, and that the streets are probably better handled separately or that the street effect should be included explicitly.
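The survival re-casting described above can be sketched in a few lines with the survival package: each business contributes a time from the start of observation to re-opening, right-censored at the end of the observation window. The variable names, data and the street-level frailty term (used here as a crude stand-in for the spatial effect) are illustrative; the article itself fits spatial survival models with spBayesSurv and INLA.

```r
## Illustrative re-casting of re-opening decisions as right-censored
## time-to-event data, with aspatial baseline fits (simulated data).
library(survival)

set.seed(1)
biz <- data.frame(
  days_to_reopen = sample(41:400, 120, replace = TRUE),
  flood_depth    = runif(120, 0, 3),
  street         = factor(sample(c("A", "B", "C"), 120, replace = TRUE))
)
end_of_obs <- 300                                            # last day observed
biz$status <- as.integer(biz$days_to_reopen <= end_of_obs)   # 1 = re-opened
biz$time   <- pmin(biz$days_to_reopen, end_of_obs)           # censor at window end

## Aspatial baselines: a Weibull AFT model and a Cox model with a street-level
## frailty as a simple proxy for the street/spatial effect discussed above.
aft <- survreg(Surv(time, status) ~ flood_depth, data = biz, dist = "weibull")
cox <- coxph(Surv(time, status) ~ flood_depth + frailty(street), data = biz)
summary(cox)
```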


2019
Vol 39 (7)
pp. 867-878
Author(s):
Benjamin Kearns
Matt D. Stevenson
Kostas Triantafyllopoulos
Andrea Manca

Background. Parametric modeling of survival data is important, and reimbursement decisions may depend on the selected distribution. Accurate predictions require models that are sufficiently flexible to adequately describe the temporal evolution of the hazard function. A rich class of models is available within the framework of generalized linear models (GLMs) and its extensions, but these models are rarely applied to survival data. This article describes the theoretical properties of these more flexible models and compares their performance to standard survival models in a reproducible case study.
Methods. We describe how survival data may be analyzed with GLMs and their extensions: fractional polynomials, spline models, generalized additive models, generalized linear mixed (frailty) models, and dynamic survival models. For each, we compare the strengths and limitations of the approach. For the case study, we compare within-sample fit, the plausibility of extrapolations, and extrapolation performance based on data splitting.
Results. Viewing standard survival models as GLMs shows that many impose a restrictive assumption of linearity. For the case study, GLMs provided better within-sample fit and more plausible extrapolations. However, they did not improve extrapolation performance. We also provide guidance to aid in choosing between the different approaches based on GLMs and their extensions.
Conclusions. The use of GLMs for parametric survival analysis can outperform standard parametric survival models, although the improvements were modest in our case study. This approach is currently seldom used. We provide guidance on both implementing these models and choosing between them. The reproducible case study will help to increase uptake of these models.
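One concrete route from survival data to a GLM, of the kind discussed above, is the piecewise-exponential representation: split follow-up at chosen cut points and fit a Poisson GLM with a log person-time offset. The cut points and example dataset below are illustrative choices.

```r
## Piecewise-exponential (Poisson-offset) GLM for survival data; replacing
## factor(interval) with a spline of time yields the more flexible GAM-type
## hazards discussed in the article.
library(survival)

lung$death <- as.integer(lung$status == 2)          # 1 = died, 0 = censored
cuts  <- c(90, 180, 365, 730)                       # interval boundaries (days)
lung2 <- survSplit(Surv(time, death) ~ age + sex, data = lung,
                   cut = cuts, episode = "interval")
lung2$exposure <- lung2$time - lung2$tstart          # person-time in each interval

fit <- glm(death ~ factor(interval) + age + sex + offset(log(exposure)),
           family = poisson, data = lung2)
summary(fit)
```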


2021
Author(s):
Kaiqiao Li
Sijie Yao
Zhenyu Zhang
Biwei Cao
Christopher Wilson
...  

Motivation: Gradient boosting decision tree (GBDT) is a powerful ensemble machine learning method that has the potential to accelerate biomarker discovery from high-dimensional molecular data. Recent algorithmic advances, such as Extreme Gradient Boosting (XGB) and Light Gradient Boosting (LGB), have rendered GBDT training more efficient, scalable and accurate. These modern techniques, however, have not yet been widely adopted in biomarker discovery based on patient survival data, which are key clinical outcomes or endpoints in cancer studies.
Results: In this paper, we present a new R package, Xsurv, as an integrated solution that applies two modern GBDT training frameworks, namely XGB and LGB, to the modeling of censored survival outcomes. Based on a comprehensive set of simulations, we benchmark the new approaches against traditional methods, including the stepwise Cox regression model and the original gradient boosting function implemented in the package gbm. We also demonstrate the application of Xsurv in analyzing a melanoma methylation dataset. Together, these results suggest that Xsurv is a useful and computationally viable tool for screening a large number of prognostic candidate biomarkers, which may facilitate cancer translational and clinical research.
Availability: Xsurv is freely available as an R package at: https://github.com/topycyao/Xsurv
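Since the Xsurv interface itself is not shown in the abstract, the sketch below illustrates the underlying XGBoost Cox objective it builds on, using plain xgboost with simulated data; for "survival:cox", right-censored observations are encoded as negative times in the label. All names and settings here are illustrative assumptions rather than the Xsurv API.

```r
## Plain-xgboost sketch of gradient-boosted Cox regression for censored
## survival outcomes (not the Xsurv interface; simulated data).
library(xgboost)

set.seed(1)
n <- 300; p <- 50
x <- matrix(rnorm(n * p), n, p)                    # e.g. methylation features
time   <- rexp(n, rate = exp(0.5 * x[, 1]))
status <- rbinom(n, 1, 0.7)                        # 1 = event, 0 = censored
label  <- ifelse(status == 1, time, -time)         # negative = right-censored

dtrain <- xgb.DMatrix(data = x, label = label)
params <- list(objective = "survival:cox", eval_metric = "cox-nloglik",
               eta = 0.05, max_depth = 3)
fit <- xgb.train(params = params, data = dtrain, nrounds = 200)

## Feature importance as a simple screen for prognostic candidate biomarkers.
head(xgb.importance(model = fit))
```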


2020
Vol 10 (1)
Author(s):
Javier Fernández-López
M. Teresa Telleria
Margarita Dueñas
Mara Laguna-Castro
Klaus Schliep
...  

Abstract
The use of different sources of evidence has been recommended for species delimitation analyses aiming to resolve taxonomic issues. In this study, we use a maximum likelihood framework to combine morphological and molecular traits to study the case of Xylodon australis (Hymenochaetales, Basidiomycota) using the locate.yeti function from the phytools R package. Xylodon australis has been considered a single species distributed across Australia, New Zealand and Patagonia. Multi-locus phylogenetic analyses were conducted to unmask the actual diversity under the X. australis name as well as its kinship relations with its relatives. To assess the taxonomic position of each clade, the locate.yeti function was used to place the X. australis type material, for which no molecular data were available, in the molecular phylogeny using continuous morphological traits. Two different species were distinguished under the X. australis name, one from Australia–New Zealand and the other from Patagonia. In addition, a close relationship with Xylodon lenis, a species from Southeast Asia, was confirmed for the Patagonian clade. We discuss the implications of our results for the biogeographical history of this genus, and we evaluate the potential of this method for use with historical collections for which molecular data are not available.
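A hedged sketch of the placement step described above follows: locate.yeti() from phytools attaches a taxon lacking molecular data to a molecular phylogeny using continuous traits. The tree, the trait matrix and the exact argument usage are illustrative and should be checked against the phytools documentation.

```r
## Sketch of placing a sequence-less type specimen into a molecular phylogeny
## from continuous (morphological) traits with phytools::locate.yeti
## (simulated tree and traits; argument usage is an assumption to verify).
library(phytools)

set.seed(1)
tree   <- pbtree(n = 12)                 # stand-in for the molecular phylogeny
traits <- fastBM(tree, nsim = 4)         # 4 continuous characters for the tips

## Add measurements for a "type" specimen with no sequence data; its row name
## does not occur among the tree's tip labels, which is how the taxon to be
## placed is identified.
traits <- rbind(traits, type = colMeans(traits) + rnorm(4, sd = 0.1))

placed <- locate.yeti(tree, traits, method = "ML")
plot(placed)
```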


2019
Author(s):
Donald Salami
Carla Alexandra Sousa
Maria do Rosário Oliveira Martins
César Capinha

Abstract
The geographical spread of dengue is a global public health concern. It is largely mediated by the importation of dengue from endemic to non-endemic areas via the increasing connectivity of the global air transport network. The dynamic nature and intrinsic heterogeneity of the air transport network make it challenging to predict dengue importation.
Here, we explore the capabilities of state-of-the-art machine learning algorithms to predict dengue importation. We trained four machine learning classification algorithms using six years of historical dengue importation data for 21 countries in Europe, together with connectivity indices mediating importation and air transport network centrality measures. Predictive performance of the classifiers was evaluated using the area under the receiver operating characteristic curve, sensitivity, and specificity measures. Finally, we applied practical model-agnostic methods to provide an in-depth explanation of our optimal model’s predictions on a global and local scale.
Our best-performing model achieved high predictive accuracy, with an area under the receiver operating characteristic score of 0.94 and a maximized sensitivity score of 0.88. The predictor variables identified as most important were the source country’s dengue incidence rate, population size, and volume of air passengers. Network centrality measures, describing the positioning of European countries within the air travel network, were also influential to the predictions.
We demonstrated the high predictive performance of a machine learning model in predicting dengue importation and the utility of model-agnostic methods for offering a comprehensive understanding of the reasons behind the predictions. Similar approaches can be utilized in the development of an operational early warning surveillance system for dengue importation.
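To make the evaluation step above concrete, the sketch below fits a simple importation classifier on simulated data and scores it with the area under the receiver operating characteristic curve plus sensitivity and specificity at the Youden-optimal threshold, using pROC; the predictors, model and data are illustrative stand-ins for the study's feature set and algorithms.

```r
## Illustrative classifier evaluation with AUC, sensitivity and specificity
## (simulated data; the study itself compares several ML algorithms).
library(pROC)

set.seed(1)
n <- 500
dat <- data.frame(
  incidence_source  = rexp(n, 1),          # dengue incidence in source country
  air_passengers    = rlnorm(n, 8, 1),     # passenger volume on the route
  degree_centrality = runif(n)             # destination's network centrality
)
lin <- 0.8 * dat$incidence_source + 0.3 * as.numeric(scale(log(dat$air_passengers)))
dat$imported <- rbinom(n, 1, plogis(lin - 1))

fit  <- glm(imported ~ ., family = binomial, data = dat)
prob <- predict(fit, type = "response")

roc_obj <- roc(dat$imported, prob, quiet = TRUE)
auc(roc_obj)
coords(roc_obj, x = "best",                 # Youden-optimal threshold
       ret = c("threshold", "sensitivity", "specificity"))
```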


2020
Author(s):
Ben J. Brintz
Benjamin Haaland
Joel Howard
Dennis L. Chao
Joshua L. Proctor
...  

Abstract
Traditional clinical prediction models focus on parameters of the individual patient. For infectious diseases, sources external to the patient, including characteristics of prior patients and seasonal factors, may improve predictive performance. We describe the development of a predictive model that integrates multiple sources of data in a principled statistical framework using a post-test odds formulation. Our method enables electronic real-time updating and flexibility, such that components can be included or excluded according to data availability. We apply this method to the prediction of the etiology of pediatric diarrhea, where “pre-test” epidemiologic data may be highly informative. Diarrhea has a high burden in low-resource settings, and antibiotics are often over-prescribed. We demonstrate that our integrative method outperforms traditional prediction in accurately identifying cases with a viral etiology, and show that its clinical application, especially when used with an additional diagnostic test, could result in a 61% reduction in inappropriately prescribed antibiotics.
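The post-test odds formulation mentioned above can be written out in a few lines: pre-test (epidemiologic) odds are multiplied by the likelihood ratio contributed by each additional information source, assuming conditional independence. The numbers below are purely illustrative.

```r
## Worked sketch of the post-test odds update (illustrative numbers only).
post_test_prob <- function(pre_test_prob, likelihood_ratios) {
  pre_odds  <- pre_test_prob / (1 - pre_test_prob)
  post_odds <- pre_odds * prod(likelihood_ratios)   # assumes conditional independence
  post_odds / (1 + post_odds)
}

## e.g. a seasonal pre-test probability of a viral etiology of 0.45, updated by
## a clinical-presentation component (LR = 2.1) and a diagnostic test (LR = 4.0)
post_test_prob(0.45, c(2.1, 4.0))   # ~0.87
```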

