Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models

2020 ◽  
Vol 2020 (1) ◽  
Author(s):  
Yijun Xiao ◽  
Ting Yan ◽  
Huiming Zhang ◽  
Yuanyuan Zhang

Abstract We study the nonasymptotic properties of a general norm-penalized estimator, which includes the Lasso, weighted Lasso, and group Lasso as special cases, for sparse high-dimensional misspecified Cox models with time-dependent covariates. Under suitable conditions on the true regression coefficients and the random covariates, we provide oracle inequalities for the prediction and estimation error based on the group sparsity of the true coefficient vector. The nonasymptotic oracle inequalities show that the penalized estimator provides a good sparse approximation of the true model and enables the selection of a few meaningful structural variables from the set of features.
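As a concrete illustration (not the authors' estimator), the two ingredients of a proximal-gradient fit for such a penalized Cox model can be sketched in NumPy: the gradient of the negative log partial likelihood and the proximal operator of the weighted group-lasso penalty. The function names, the no-ties assumption, and the group weights below are illustrative assumptions.

```python
# Illustrative sketch, not the paper's estimator: the two building blocks a
# proximal-gradient loop would alternate for a weighted-group-lasso Cox model.
# Assumes no tied event times; names and weights are hypothetical.
import numpy as np

def cox_neg_loglik_grad(X, time, event, beta):
    """Gradient of the averaged negative Cox log partial likelihood."""
    order = np.argsort(-time)                 # descending time: cumsums = risk sets
    Xs, es = X[order], event[order].astype(float)
    w = np.exp(Xs @ beta)
    cum_w = np.cumsum(w)                      # sum of exp(eta) over each risk set
    cum_wx = np.cumsum(w[:, None] * Xs, axis=0)
    return -(es[:, None] * (Xs - cum_wx / cum_w[:, None])).sum(axis=0) / len(time)

def weighted_group_prox(beta, groups, weights, lam, step):
    """Proximal operator of the weighted group-lasso penalty."""
    out = np.zeros_like(beta)
    for g, w_g in zip(groups, weights):       # g: index array of one group
        norm = np.linalg.norm(beta[g])
        if norm > 0:
            out[g] = max(0.0, 1.0 - step * lam * w_g / norm) * beta[g]
    return out
```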

2015 ◽  
Vol 2015 ◽  
pp. 1-13 ◽  
Author(s):  
Jin-Jia Wang ◽  
Fang Xue ◽  
Hui Li

Feature extraction and classification of EEG signals are core parts of brain computer interfaces (BCIs). Because the EEG feature vector is high-dimensional, an effective feature selection algorithm has become an integral part of research studies. In this paper, we present a new method based on a wrapped Sparse Group Lasso for channel and feature selection of fused EEG signals. The high-dimensional fused features are first obtained; they include the power spectrum, time-domain statistics, AR model, and wavelet coefficient features extracted from the preprocessed EEG signals. The wrapped channel and feature selection method is then applied, which uses a logistic regression model with a Sparse Group Lasso penalty. The model is fitted on the training data, and parameter estimates are obtained by a modified blockwise coordinate descent and coordinate gradient descent method. The best parameters and feature subset are selected using 10-fold cross-validation. Finally, the test data are classified using the trained model. Compared with existing channel and feature selection methods, results show that the proposed method is more suitable, more stable, and faster for high-dimensional feature fusion. It can simultaneously achieve channel and feature selection with a lower error rate. The test accuracy on data from the international BCI Competition IV reached 84.72%.
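The Sparse Group Lasso penalty combines an elementwise l1 term (feature selection) with a groupwise l2 term (channel selection); its proximal operator factorizes into soft-thresholding followed by groupwise shrinkage, which is the step a blockwise coordinate or proximal solver applies repeatedly. A minimal sketch follows, with the mixing parameter alpha and the group index sets as assumptions:

```python
# Minimal sketch of the sparse-group-lasso proximal step inside such a solver:
# elementwise soft-thresholding (lasso part), then groupwise shrinkage
# (group-lasso part). alpha in [0, 1] mixes the two penalties; groups are
# hypothetical index arrays, e.g. one per EEG channel.
import numpy as np

def sgl_prox(beta, groups, lam, alpha, step):
    b = np.sign(beta) * np.maximum(np.abs(beta) - step * lam * alpha, 0.0)
    out = np.zeros_like(b)
    for g in groups:                          # g: indices of one channel's features
        norm = np.linalg.norm(b[g])
        if norm > 0:
            scale = 1.0 - step * lam * (1.0 - alpha) * np.sqrt(len(g)) / norm
            out[g] = max(0.0, scale) * b[g]
    return out
```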


Biometrika ◽  
2021 ◽  
Author(s):  
Pixu Shi ◽  
Yuchen Zhou ◽  
Anru R Zhang

Abstract In microbiome and genomic studies, the regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the variation in sequencing depth, the classic log-contrast model is often used, where read counts are normalized into compositions. However, zero read counts and the randomness in covariates remain critical issues. In this article, we introduce a surprisingly simple, interpretable, and efficient method for the estimation of compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method corrects for possible overdispersion in the sequencing data while avoiding any subjective imputation of zero read counts. We provide theoretical justifications with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.
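For orientation, the classic log-contrast baseline mentioned above can be written in a few lines; note that it needs a pseudo-count to handle zero reads, which is exactly the subjective imputation the proposed method avoids. The pseudo-count value and the choice of reference taxon below are illustrative assumptions.

```python
# The classic log-contrast baseline referenced above (NOT the proposed method):
# counts are normalized to compositions, zeros handled by a pseudo-count, and
# the zero-sum coefficient constraint absorbed by a reference taxon.
import numpy as np

def log_contrast_design(counts, pseudo=0.5):
    comp = (counts + pseudo) / (counts + pseudo).sum(axis=1, keepdims=True)
    logc = np.log(comp)
    return logc[:, :-1] - logc[:, -1:]        # last taxon used as reference
```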


Author(s):  
Joe Hollinghurst ◽  
Alan Watkins

Introduction
The electronic Frailty Index (eFI) and the Hospital Frailty Risk Score (HFRS) have been developed in primary and secondary care, respectively.

Objectives and Approach
Our objective was to investigate how frailty progresses over time, and to include the progression of frailty in a survival analysis. To do this, we performed a retrospective cohort study using linked data from the Secure Anonymised Information Linkage Databank, comprising 445,771 people aged 65-95 living in Wales (United Kingdom) on 1st January 2010. We calculated frailty, using both the eFI and HFRS, for individuals at quarterly intervals for 8 years, giving a total of 11,702,242 observations.

Results
We created a transition matrix for frailty states determined by the eFI (states: fit, mild, moderate, severe) and HFRS (states: no score, low, intermediate, high), with death as an absorbing state. The matrix revealed that frailty progressed over time, but that on a quarterly basis an individual was most likely to remain in the same state. We calculated Hazard Ratios (HRs) using time-dependent Cox models for mortality, with adjustments for age, gender and deprivation. Independent eFI and HFRS models showed increased risk of mortality as frailty severity increased. A combined eFI and HFRS model revealed that the highest risk was primarily determined by the HFRS, and identified further subgroups of individuals at increased risk of an adverse outcome. For example, with eFI fit and no HFRS score as the reference category, the HRs (95% confidence intervals) for individuals with an eFI of fit, mild, moderate and severe combined with a high HFRS were 18.11 [17.25, 19.02], 20.58 [19.93, 21.24], 21.45 [20.85, 22.07] and 23.04 [22.34, 23.76], respectively.

Conclusion
Frailty was found to vary over time, with progression likely in the 8-year time frame analysed. We refined HR estimates of the eFI and HFRS for mortality by including time-dependent covariates.
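A transition matrix of the kind described can be estimated by simple counting over consecutive quarterly observations, with death kept as an absorbing row. A minimal sketch (illustrative, not the study's code), assuming per-person state sequences are already available:

```python
# Illustrative sketch: empirical quarterly transition matrix over eFI states
# with death as an absorbing state. State labels and input format are assumed.
import numpy as np

STATES = ["fit", "mild", "moderate", "severe", "death"]

def transition_matrix(sequences):
    idx = {s: i for i, s in enumerate(STATES)}
    counts = np.zeros((len(STATES), len(STATES)))
    for seq in sequences:                     # one quarterly state sequence per person
        for a, b in zip(seq, seq[1:]):
            counts[idx[a], idx[b]] += 1
    counts[idx["death"], idx["death"]] += 1   # keep the absorbing row well defined
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)
```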


2016 ◽  
Vol 113 (23) ◽  
pp. E3221-E3230 ◽  
Author(s):  
Hao Wu ◽  
Fabian Paul ◽  
Christoph Wehmeyer ◽  
Frank Noé

We introduce the general transition-based reweighting analysis method (TRAM), a statistically optimal approach to integrate both unbiased and biased molecular dynamics simulations, such as umbrella sampling or replica exchange. TRAM estimates a multiensemble Markov model (MEMM) with full thermodynamic and kinetic information at all ensembles. The approach combines the benefits of Markov state models—clustering of high-dimensional spaces and modeling of complex many-state systems—with those of the multistate Bennett acceptance ratio of exploiting biased or high-temperature ensembles to accelerate rare-event sampling. TRAM does not depend on any rate model in addition to the widely used Markov state model approximation, but uses only fundamental relations such as detailed balance and binless reweighting of configurations between ensembles. Previous methods, including the multistate Bennett acceptance ratio, discrete TRAM, and Markov state models, are special cases and can be derived from the TRAM equations. TRAM is demonstrated by efficiently computing MEMMs in cases where other estimators break down, including the full thermodynamics and rare-event kinetics from high-dimensional simulation data of an all-atom protein–ligand binding model.
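TRAM itself is beyond a few lines, but its Markov-state-model ingredient is easy to sketch: count transitions at a lag time and impose detailed balance, here via the naive count-symmetrization heuristic rather than the maximum-likelihood reversible estimator. Everything below is an illustrative assumption, not the paper's method.

```python
# Sketch of the Markov state model ingredient only (not TRAM): transition
# counts at lag tau, with detailed balance imposed by naive symmetrization.
# Assumes every state is visited at least once in the trajectory.
import numpy as np

def reversible_msm(dtraj, n_states, tau=1):
    C = np.zeros((n_states, n_states))
    for a, b in zip(dtraj[:-tau], dtraj[tau:]):   # discrete trajectory of state ids
        C[a, b] += 1
    C = C + C.T                                    # heuristic detailed-balance fix
    return C / C.sum(axis=1, keepdims=True)        # row-stochastic transition matrix
```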


2012 ◽  
Vol 21 (8) ◽  
pp. 844-850 ◽  
Author(s):  
Ronghui Xu ◽  
Yunjun Luo ◽  
Christina Chambers

2019 ◽  
Vol 116 (12) ◽  
pp. 5451-5460 ◽  
Author(s):  
Jean Barbier ◽  
Florent Krzakala ◽  
Nicolas Macris ◽  
Léo Miolane ◽  
Lenka Zdeborová

Generalized linear models (GLMs) are used in high-dimensional machine learning, statistics, communications, and signal processing. In this paper we analyze GLMs when the data matrix is random, as relevant in problems such as compressed sensing, error-correcting codes, or benchmark models in neural networks. We evaluate the mutual information (or “free entropy”) from which we deduce the Bayes-optimal estimation and generalization errors. Our analysis applies to the high-dimensional limit where both the number of samples and the dimension are large and their ratio is fixed. Nonrigorous predictions for the optimal errors existed for special cases of GLMs, e.g., for the perceptron, in the field of statistical physics based on the so-called replica method. Our present paper rigorously establishes those decades-old conjectures and brings forward their algorithmic interpretation in terms of performance of the generalized approximate message-passing algorithm. Furthermore, we tightly characterize, for many learning problems, regions of parameters for which this algorithm achieves the optimal performance and locate the associated sharp phase transitions separating learnable and nonlearnable regions. We believe that this random version of GLMs can serve as a challenging benchmark for multipurpose algorithms.
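The random-design setting analyzed here is easy to instantiate; for example, the teacher-student perceptron, one of the special cases mentioned, with the sample-to-dimension ratio held fixed. The sizes and seed below are arbitrary illustrative choices.

```python
# Generating a random-design GLM instance (teacher-student perceptron) of the
# kind analyzed: n/d = alpha is fixed while both grow large.
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 1000, 2.0
n = int(alpha * d)
w_star = rng.standard_normal(d)                # teacher weights (Gaussian prior)
X = rng.standard_normal((n, d)) / np.sqrt(d)   # random data matrix
y = np.sign(X @ w_star)                        # perceptron output channel
```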


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Jan Klosa ◽  
Noah Simon ◽  
Pål Olof Westermark ◽  
Volkmar Liebscher ◽  
Dörte Wittenburg

Abstract
Background
Statistical analyses of biological problems in the life sciences often lead to high-dimensional linear models. To solve the corresponding system of equations, penalization approaches are often the methods of choice. They are especially useful in the case of multicollinearity, which appears when the number of explanatory variables exceeds the number of observations or arises for some biological reason. Then the model goodness of fit is penalized by some suitable function of interest. Prominent examples are the lasso, group lasso and sparse-group lasso. Here, we offer a fast and numerically cheap implementation of these operators via proximal gradient descent. The grid search for the penalty parameter is realized by warm starts. The step size between consecutive iterations is determined with backtracking line search. Finally, seagull, the R package presented here, produces complete regularization paths.

Results
Publicly available high-dimensional methylation data are used to compare seagull to the established R package SGL. The results of both packages enabled a precise prediction of biological age from DNA methylation status. Even though the results of seagull and SGL were very similar (R2 > 0.99), seagull computed the solution in a fraction of the time needed by SGL. Additionally, seagull enables the incorporation of weights for each penalized feature.

Conclusions
The following operators for linear regression models are available in seagull: lasso, group lasso, sparse-group lasso and Integrative LASSO with Penalty Factors (IPF-lasso). Thus, seagull is a convenient envelope of lasso variants.
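seagull itself is an R package; as a language-neutral illustration of the optimization scheme it describes, here is a minimal proximal gradient descent with backtracking line search for the plain lasso (the group and sparse-group variants swap in the corresponding proximal operators). The default step size, shrink factor, and tolerance are illustrative assumptions.

```python
# Minimal sketch of proximal gradient descent with backtracking line search
# for the lasso, mirroring the scheme described above; not seagull's code.
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_pgd(X, y, lam, beta0=None, t0=1.0, shrink=0.5, max_iter=1000, tol=1e-8):
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()   # warm-start hook
    f = lambda b: 0.5 * np.sum((y - X @ b) ** 2) / n
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ beta) / n
        t = t0
        while True:                                         # backtracking line search
            nxt = soft(beta - t * grad, t * lam)
            d = nxt - beta
            if f(nxt) <= f(beta) + grad @ d + d @ d / (2 * t):
                break
            t *= shrink
        if np.max(np.abs(nxt - beta)) < tol:
            return nxt
        beta = nxt
    return beta
```

A warm-started regularization path, as seagull produces, would call `lasso_pgd` over a decreasing grid of `lam` values, passing each solution as `beta0` for the next.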

