Survival Analysis on Rare Events Using Group-Regularized Multi-Response Cox Regression

Author(s):  
Ruilin Li ◽  
Yosuke Tanigawa ◽  
Johanne M Justesen ◽  
Jonathan Taylor ◽  
Trevor Hastie ◽  
...  

Abstract Motivation: The prediction performance of the Cox proportional hazards model suffers when there are only a few uncensored events in the training data. Results: We propose a Sparse-Group regularized Cox regression method to improve prediction performance on large-scale and high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare event response. This scenario is common in the UK Biobank (Sudlow et al., 2015) dataset, where records for a large number of common and less prevalent diseases of the same set of individuals are available. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. (2020). Availability: https://github.com/rivas-lab/multisnpnet-Cox Supplementary information: Supplementary data are available at Bioinformatics online.
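The proximal step that a proximal gradient method would apply for a sparse-group penalty can be sketched as follows. The abstract does not state the exact penalty, so this assumes the standard sparse-group lasso form lam1 * sum|b| + lam2 * sum over groups of the Euclidean norm, with each group holding one predictor's coefficients across all responses. The names `soft_threshold` and `prox_sparse_group` are hypothetical and are not taken from the multisnpnet-Cox package.

```python
import math


def soft_threshold(x, t):
    """Shrink the scalar x toward zero by t (proximal map of t * |x|)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def prox_sparse_group(B, step, lam1, lam2):
    """Proximal map of step * (lam1 * L1 + lam2 * group-L2) applied row by row.

    B is a list of rows; each row is assumed to hold the coefficients of one
    predictor across all survival responses (one group per predictor).
    """
    result = []
    for row in B:
        # Step 1: elementwise soft-thresholding handles the L1 part.
        shrunk = [soft_threshold(b, step * lam1) for b in row]
        # Step 2: group soft-thresholding handles the group-L2 part.
        norm = math.sqrt(sum(v * v for v in shrunk))
        if norm <= step * lam2:
            # The whole predictor is dropped from every response.
            result.append([0.0] * len(row))
        else:
            scale = 1.0 - step * lam2 / norm
            result.append([scale * v for v in shrunk])
    return result


if __name__ == "__main__":
    coefficients = [[3.0, -4.0], [0.5, 0.2]]
    print(prox_sparse_group(coefficients, step=1.0, lam1=0.0, lam2=1.0))
```

Under this assumed penalty, a predictor whose group norm falls below the threshold is zeroed for all responses at once, which is one way a rare-event response could borrow variable selection from a response with many events.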

2020 ◽  
Author(s):  
Ruilin Li ◽  
Yosuke Tanigawa ◽  
Johanne M. Justesen ◽  
Jonathan Taylor ◽  
Trevor Hastie ◽  
...  

Abstract We propose a Sparse-Group regularized Cox regression method to improve prediction performance on large-scale and high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare event response. This scenario is common in the UK Biobank (Sudlow et al. 2015) dataset, where records for a large number of common and rare diseases of the same set of individuals are available. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. (2019). We provide a software implementation of the proposed method and demonstrate its efficacy through simulations and applications to UK Biobank data.


Author(s):  
Jun Huang ◽  
Linchuan Xu ◽  
Jing Wang ◽  
Lei Feng ◽  
Kenji Yamanishi

Existing multi-label learning (MLL) approaches mainly assume all the labels are observed and construct classification models with a fixed set of target labels (known labels). However, in some real applications, multiple latent labels may exist outside this set and hide in the data, especially for large-scale data sets. Discovering and exploring the latent labels hidden in the data may not only find interesting knowledge but also help us to build a more robust learning model. In this paper, a novel approach named DLCL (i.e., Discovering Latent Class Labels for MLL) is proposed which can not only discover the latent labels in the training data but also predict new instances with the latent and known labels simultaneously. Extensive experiments show a competitive performance of DLCL against other state-of-the-art MLL approaches.


2012 ◽  
Vol 490-495 ◽  
pp. 460-464 ◽  
Author(s):  
Xiao Dan Zhu ◽  
Jin Song Su ◽  
Qing Feng Wu ◽  
Huai Lin Dong

The Naive Bayes classification algorithm is a simple and effective classification algorithm. Most research on traditional Naive Bayes classification focuses on improving the classification algorithm and ignores the selection of training data, which has a great effect on the performance of the classifier. This paper therefore proposes a method to optimize the selection of training data. With this method, noisy instances in the training data are eliminated using a user-defined effectiveness threshold, improving the performance of the classifier. Experimental results on large-scale data show that our approach significantly outperforms the baseline classifier.


Blood ◽  
2018 ◽  
Vol 132 (Supplement 1) ◽  
pp. 664-664 ◽  
Author(s):  
Robert K. Hills ◽  
Rosemary Gale ◽  
David C. Linch ◽  
Brian J.P Huntly ◽  
Elli Papaemmanuil ◽  
...  

Abstract Introduction: The increasing delineation of acute myeloid leukemia (AML) has identified a number of genetic mutations which may be amenable to targeted therapies. However, such mutations typically only occur in a minority of patients, and this relative paucity presents challenges in drug development. Even for more common mutations such as FLT3 ITD, randomised trials can take many years to complete, and there is the issue of how to deal with patients who are tested but not eligible. Earlier phase trials therefore tend to be single arm studies, and often recruit in the relapsed/refractory population, where eligibility is known up front, and it is possible to obtain an early read out for efficacy. Such is the case for the recent evaluations of enasidenib and ivosidenib in IDH1/2 mutated patients. However, with single-arm studies there is a need to contextualise results. We therefore looked at outcomes from the United Kingdom NCRI trials of AML for patients with IDH1/IDH2 mutations who were relapsed or refractory to therapy. Methods: A database search identified patients within the UK NCRI AML trials with an IDH1/IDH2 mutation, who had received intensive induction and who were either: in second relapse, relapsed post-transplant, refractory to two courses of induction, or who relapsed within 1 year of remission. Outcomes were measured from the point of eligibility: patients who were multiply eligible were included only once, at their first point of eligibility. The primary outcome was overall survival, with achievement of complete remission, with or without peripheral count recovery, as the secondary outcome. Cox regression analysis was used to identify prognostic factors within the cohort of patients. Cytogenetics were evaluated using the MRC classification. Results: A total of 757 patients were identified with IDH1/2 mutation (IDH1 alone n=247; IDH2 alone n=504, both n=6).
Of these, 211 patients satisfied the relapsed/refractory criteria (IDH1 alone n=81; IDH2 alone n=128; both IDH1/2 n=2; refractory n=28; relapsed post SCT n=34; relapsed within 1 year with no SCT n=138; second relapse n=11 - Table). Median age was 54 years (range 22-77); 51% were male; and 95% of patients had intermediate risk cytogenetics. Remissions were achieved in 43/211 patients (20%; refractory 50%; relapsed post SCT 15%; relapsed within 1 year 17%; second relapse 9% - Table). Patients with IDH1 mutations had a remission rate of 23%; for IDH2 mutated patients, the rate was 18%. Median survival was 4.4 months for IDH1 mutated patients, and 6.6 months for IDH2 mutations; 2-year survival was 17% and 21%, respectively. Split by age, median survival was 4.0 and 9.4 months respectively (2-year survival 19%; 27%) for patients aged <60; in patients aged 60 or over, median survival was 5.2 and 2.9 months (2-year survival 13%; 8%). In multivariable analyses no presenting factor was significantly associated with survival among IDH1 patients. In particular, there was no significant difference in survival by age or between the four different eligibility groups. By contrast, among IDH2 patients, patients in second relapse had the worst survival, followed by those relapsing post transplant, those relapsing within 1 year, and those with disease refractory to two courses of therapy (p=0.001); older patients had significantly worse survival (p=0.004 for age older or younger than 60). Conclusions: These results give context to the recent findings in single arm studies of ivosidenib for relapsed/refractory IDH1 mutated patients, and enasidenib for patients harbouring an IDH2 mutation. In the two studies reported, median survival was respectively 8.8 and 9.3 months, compared to 4.4 and 6.6 months in a younger group of patients identified from the UK NCRI AML trials treated with a variety of therapies.
In both monotherapy trials the median survival was extended; however, reported one-year survival was not greatly improved (enasidenib 1-year survival 39% vs 34% for the NCRI cohort; ivosidenib, approximately 35% vs 32%). The difference in survival for IDH2 mutated patients in the NCRI cohort, by age and route to eligibility, indicates that the interpretation of the results of single arm studies, in a heterogeneous condition such as AML, is fraught with difficulties. Ideally the magnitude of benefit should be assessed using randomised data from large scale collaborations and platform trials. Table: Outcomes for IDH1/IDH2 mutated relapsed/refractory patients in the UK NCRI AML trials. Disclosures Hills: Daiichi Sankyo: Consultancy, Honoraria. Russell: Daiichi Sankyo: Consultancy; Jazz Pharma: Speakers Bureau; Pfizer: Consultancy, Honoraria, Speakers Bureau.


2019 ◽  
Author(s):  
Zachary B. Abrams ◽  
Caitlin E. Coombes ◽  
Suli Li ◽  
Kevin R. Coombes

Abstract Summary: Unsupervised data analysis in many scientific disciplines is based on calculating distances between observations and finding ways to visualize those distances. These kinds of unsupervised analyses help researchers uncover patterns in large-scale data sets. However, researchers can select from a vast number of different distance metrics, each designed to highlight different aspects of different data types. There are also numerous visualization methods with their own strengths and weaknesses. To help researchers perform unsupervised analyses, we developed the Mercator R package. Mercator enables users to see important patterns in their data by generating multiple visualizations using different standard algorithms, making it particularly easy to compare and contrast the results arising from different metrics. By allowing users to select the distance metric that best fits their needs, Mercator helps researchers perform unsupervised analyses that use pattern identification through computation and visual inspection. Availability and Implementation: Mercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html). Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Muhammad Firmansyah Kasim ◽  
D. Watson-Parris ◽  
L. Deaconu ◽  
S. Oliver ◽  
P. Hatfield ◽  
...  

Abstract Computer simulations are invaluable tools for scientific discovery. However, accurate simulations are often slow to execute, which limits their applicability to extensive parameter exploration, large-scale data analysis, and uncertainty quantification. A promising route to accelerate simulations by building fast emulators with machine learning requires large training datasets, which can be prohibitively expensive to obtain with slow simulations. Here we present a method based on neural architecture search to build accurate emulators even with a limited amount of training data. The method successfully emulates simulations in 10 scientific cases including astrophysics, climate science, biogeochemistry, high energy density physics, fusion energy, and seismology, using the same super-architecture, algorithm, and hyperparameters. Our approach also inherently provides emulator uncertainty estimation, adding further confidence in their use. We anticipate this work will accelerate research involving expensive simulations, allow more extensive parameter exploration, and enable new, previously unfeasible computational discovery.


2019 ◽  
Author(s):  
Junyang Qian ◽  
Yosuke Tanigawa ◽  
Wenfei Du ◽  
Matthew Aguirre ◽  
Chris Chang ◽  
...  

Abstract The UK Biobank (Bycroft et al., 2018) is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with GWAS, have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso (Tibshirani, 1996), since its first proposal in statistics, has proved to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet (Friedman et al., 2010a) and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and Cox model, and also extends to the elastic net with ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve superior predictive performance on quantitative and qualitative traits including height, body mass index, asthma and high cholesterol. Author Summary: With the advent and evolution of large-scale and comprehensive biobanks, unprecedented opportunities arise for researchers to further uncover the complex landscape of human genetics. One major direction that attracts long-standing interest is the investigation of the relationships between genotypes and phenotypes.
This includes, but is not limited to, the identification of genotypes that are significantly associated with the phenotypes, and the prediction of phenotypic values based on the genotypic information. Genome-wide association studies (GWAS) are a very powerful and widely used framework for the former task, having produced a number of very impactful discoveries. However, when it comes to the latter, their performance is fairly limited by their univariate nature. To address this, multiple regression methods have been suggested to fill the gap. That said, challenges emerge as both the dimension and the size of datasets become large. In this paper, we present a novel computational framework that enables us to efficiently solve the entire lasso or elastic-net solution path on large-scale and ultrahigh-dimensional data, and therefore perform simultaneous variable selection and prediction. Our approach can build on any existing lasso solver for small or moderate-sized problems, scale it up to a big-data solution, and incorporate other extensions easily. We provide a package snpnet that extends the glmnet package in R and optimizes for large phenotype-genotype data. On the UK Biobank, we observe improved prediction performance on height, body mass index (BMI), asthma and high cholesterol by the lasso over other univariate and multiple regression methods. That said, the scope of our approach goes beyond genetic studies. It can be applied to general sparse regression problems and can build scalable solutions for a variety of distribution families based on existing solvers.
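A screening approach that fits the lasso on a subset of variables needs a way to confirm that the left-out variables really belong at zero. The abstract does not spell out the steps of BASIL, so the following is only a sketch of the standard lasso optimality (KKT) check that such a strategy could rely on. It assumes the objective (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1, and the name `kkt_violations` is hypothetical, not part of snpnet or glmnet.

```python
def kkt_violations(X, residual, active, lam):
    """Return indices of inactive features that violate the lasso KKT condition.

    X is a list of n rows with p features each, residual is y - X @ b for the
    current fit, active is the set of feature indices already in the model,
    and lam is the L1 penalty level. An inactive feature j is consistent with
    a zero coefficient only if |x_j . residual| / n <= lam.
    """
    n = len(residual)
    p = len(X[0])
    violations = []
    for j in range(p):
        if j in active:
            continue  # active features are handled by the solver itself
        score = sum(X[i][j] * residual[i] for i in range(n)) / n
        if abs(score) > lam:
            violations.append(j)
    return violations


if __name__ == "__main__":
    X = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
    residual = [1.0, -1.0, 0.5, 0.0]
    print(kkt_violations(X, residual, active={0}, lam=0.1))
```

In a batch scheme, any violating indices would be added to the working set and the fit repeated; an empty list means the subset solution also solves the full problem at that penalty level under the stated objective.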


2018 ◽  
Vol 37 (2) ◽  
pp. 121-124 ◽  
Author(s):  
Helen Prior ◽  
Paul Baldrick ◽  
Lolke de Haan ◽  
Noel Downes ◽  
Keith Jones ◽  
...  

As part of the safety assessment of new drugs, the use of two species (a rodent and a nonrodent) for regulatory toxicology studies is the typical approach taken for small molecules. For biologics, species selection is dictated by pharmacological relevance, and single species toxicology packages (typically using the nonhuman primate) are common. The UK National Centre for the Replacement, Refinement, and Reduction of Animals in Research and the Association of the British Pharmaceutical Industry are collaborating on a project to review the utility of two species in regulatory toxicology studies, with the aim to explore whether there are wider circumstances when data from a single species could be sufficient to enable safe progression in humans. An international working group consisting of 37 representatives from pharmaceutical and biotechnology companies, contract research organizations, academia, and regulatory bodies is coordinating a large-scale data sharing exercise to examine the potential for changes in current practice to reduce the number of species used for nonclinical safety testing at different stages of development. The challenge will be to determine whether two species toxicology adds significant value or whether in some instances data from a single species are sufficient (across a broader range of molecules than is currently the case) without compromising human safety.


2021 ◽  
Author(s):  
Tobias Greisager Rehfeldt ◽  
Konrad Krawczyk ◽  
Mathias Bøgebjerg ◽  
Veit Schwämmle ◽  
Richard Röttger

Abstract Motivation: Liquid-chromatography mass-spectrometry (LC-MS) is the established standard for analyzing the proteome in biological samples by identification and quantification of thousands of proteins. Machine learning (ML) promises to considerably improve the analysis of the resulting data; however, there is yet to be any tool that mediates the path from raw data to modern ML applications. More specifically, ML applications are currently hampered by three major limitations: (1) absence of balanced training data with large sample size; (2) unclear definition of sufficiently information-rich data representations for e.g. peptide identification; (3) lack of benchmarking of ML methods on specific LC-MS problems. Results: We created the MS2AI pipeline that automates the process of gathering vast quantities of mass spectrometry (MS) data for large scale ML applications. The software retrieves raw data from either in-house sources or from the proteomics identifications database, PRIDE. Subsequently, the raw data is stored in a standardized format amenable for ML encompassing MS1/MS2 spectra and peptide identifications. This tool bridges the gap between MS and AI, and to this effect we also present an ML application in the form of a convolutional neural network for the identification of oxidized peptides. Availability: An open source implementation of the software can be found freely available for non-commercial use at https://gitlab.com/roettgerlab/ Supplementary information: Supplementary data are available at Bioinformatics online.

