The Induced Smoothed lasso: A practical framework for hypothesis testing in high dimensional regression

2019 ◽  
Vol 29 (3) ◽  
pp. 765-777 ◽  
Author(s):  
Giovanna Cilluffo ◽  
Gianluca Sottile ◽  
Stefania La Grutta ◽  
Vito MR Muggeo

This paper focuses on hypothesis testing in lasso regression, when one is interested in judging the statistical significance of the regression coefficients in a regression equation involving many covariates. To obtain reliable p-values, we propose a new lasso-type estimator relying on the idea of induced smoothing, which allows one to obtain an appropriate covariance matrix and Wald statistic relatively easily. Simulation experiments reveal that our approach performs well when contrasted with recent inferential tools in the lasso framework. Two real data analyses are presented to illustrate the proposed framework in practice.
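The Wald machinery the abstract refers to can be illustrated with a minimal sketch, assuming one already has a coefficient estimate and its standard error; the numbers below are hypothetical, and this is the generic Wald test, not the authors' induced-smoothing estimator:

```python
from scipy import stats

def wald_test(beta_hat, se):
    """Wald statistic and two-sided p-value for H0: beta = 0."""
    w = (beta_hat / se) ** 2          # chi-square with 1 df under H0
    p = stats.chi2.sf(w, df=1)
    return w, p

# Hypothetical estimate and standard error for one coefficient
w, p = wald_test(beta_hat=0.8, se=0.3)
print(f"W = {w:.3f}, p = {p:.4f}")
```

The role of the induced-smoothing idea is to supply a well-behaved covariance matrix (hence the `se`) for lasso estimates, for which naive standard errors are unreliable.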

2019 ◽  
Vol 81 (8) ◽  
pp. 535-542
Author(s):  
Robert A. Cooper

Statistical methods are indispensable to the practice of science. But statistical hypothesis testing can seem daunting, with P-values, null hypotheses, and the concept of statistical significance. This article explains the concepts associated with statistical hypothesis testing using the story of “the lady tasting tea,” then walks the reader through an application of the independent-samples t-test using data from Peter and Rosemary Grant's investigations of Darwin's finches. Understanding how scientists use statistics is an important component of scientific literacy, and students should have opportunities to use statistical methods like this in their science classes.
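The independent-samples t-test the article walks through takes only a few lines in practice; the beak-depth measurements below are invented for illustration, not the Grants' actual data:

```python
from scipy import stats

# Hypothetical beak-depth measurements (mm) for two groups of finches
group_a = [8.9, 9.2, 9.6, 9.1, 9.4, 9.0, 9.3]
group_b = [9.8, 10.1, 9.9, 10.4, 10.0, 10.2, 9.7]

# Independent-samples t-test (two-sided, equal variances assumed)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```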


2022 ◽  
Vol 15 (1) ◽  
pp. 32
Author(s):  
Hrishikesh D. Vinod

Quantitative researchers often use Student’s t-test (and its p-values) to claim that a particular regressor is important (statistically significant) for explaining the variation in a response variable. A study is subject to the p-hacking problem when its author relies too much on formal statistical significance while ignoring the size of what is at stake. We suggest reporting estimates using nonlinear kernel regressions and the standardization of all variables to avoid p-hacking. We are filling an essential gap in the literature because p-hacking-related papers do not even mention kernel regressions or standardization. Although our methods have general applicability in all sciences, our illustrations refer to risk management for a cross-section of firms and financial management in macroeconomic time series. We estimate nonlinear, nonparametric kernel regressions for both examples to illustrate the computation of scale-free generalized partial correlation coefficients (GPCCs). We suggest supplementing the usual p-values by “practical significance” revealed by scale-free GPCCs. We show that GPCCs also yield new pseudo regression coefficients to measure each regressor’s relative (nonlinear) contribution in a kernel regression.
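A full GPCC computation is beyond a short sketch, but the two ingredients the abstract emphasizes, standardization and nonparametric kernel regression, can be illustrated with a basic Nadaraya-Watson estimator on synthetic data (the bandwidth here is an ad hoc choice):

```python
import numpy as np

def standardize(v):
    """Center and scale to unit standard deviation."""
    return (v - v.mean()) / v.std()

def nw_kernel_regression(x, y, grid, bandwidth=0.5):
    """Nadaraya-Watson estimator with a Gaussian kernel."""
    weights = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bandwidth) ** 2)
    return (weights * y).sum(axis=1) / weights.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = np.sin(x) + rng.normal(0, 0.2, 200)     # nonlinear relationship

# Standardize both variables, then fit on a small evaluation grid
xs, ys = standardize(x), standardize(y)
fit = nw_kernel_regression(xs, ys, grid=np.array([-1.0, 0.0, 1.0]))
print(fit)
```

Standardizing both variables makes the fitted values scale-free, which is the property the GPCC construction builds on.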


2019 ◽  
Vol 09 (04) ◽  
pp. 2050017
Author(s):  
Zhiqiang Jiang ◽  
Zhensheng Huang ◽  
Guoliang Fan

This paper considers empirical likelihood inference for a high-dimensional partially functional linear model. An empirical log-likelihood ratio statistic is constructed for the regression coefficients of non-functional predictors and proved to be asymptotically normally distributed under some regularity conditions. Moreover, maximum empirical likelihood estimators of the regression coefficients of non-functional predictors are proposed and their asymptotic properties are obtained. Simulation studies are conducted to demonstrate the performance of the proposed procedure and a real data set is analyzed for illustration.
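The paper's statistic targets a partially functional model, but the core empirical-likelihood construction can be sketched for the simplest case, a scalar mean (Owen's formulation, on synthetic data):

```python
import numpy as np
from scipy.optimize import brentq

def el_ratio_stat(x, mu):
    """-2 log empirical likelihood ratio for H0: E[X] = mu.

    Asymptotically chi-square with 1 df; mu must lie strictly
    inside the range of the data.
    """
    z = x - mu
    eps = 1e-10
    # Lagrange multiplier bounds keeping all weights positive
    lo = (-1 + eps) / z.max()
    hi = (-1 + eps) / z.min()
    g = lambda lam: np.sum(z / (1 + lam * z))   # estimating equation
    lam = brentq(g, lo, hi)
    return 2 * np.sum(np.log1p(lam * z))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100)
stat = el_ratio_stat(x, mu=0.0)
print(round(stat, 4))
```

At the sample mean the statistic is zero; it grows as the hypothesized mean moves away from the data, mirroring a log-likelihood ratio.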


2020 ◽  
Vol 21 (S9) ◽  
Author(s):  
Qingyang Zhang ◽  
Thy Dao

Background: Compositional data refer to data that lie on a simplex, which are common in many scientific domains such as genomics, geology and economics. As the components in a composition must sum to one, traditional tests based on unconstrained data become inappropriate, and new statistical methods are needed to analyze this special type of data.
Results: In this paper, we consider the general problem of testing for a compositional difference between K populations. Motivated by microbiome and metagenomics studies, where the data are often over-dispersed and high-dimensional, we formulate a well-posed hypothesis from a Bayesian point of view and suggest a nonparametric test based on inter-point distances to evaluate statistical significance. Unlike most existing tests for compositional data, our method does not rely on any data transformation, sparsity assumption or regularity conditions on the covariance matrix, but directly analyzes the compositions. Simulated data and two real data sets on the human microbiome are used to illustrate the promise of our method.
Conclusions: Our simulation studies and real data applications demonstrate that the proposed test is more sensitive to compositional differences than the mean-based method, especially when the data are over-dispersed or zero-inflated. The proposed test is easy to implement and computationally efficient, facilitating its application to large-scale datasets.
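One concrete instance of an inter-point-distance test is the energy-distance permutation test sketched below. It is a stand-in for, not a reproduction of, the authors' statistic, with Dirichlet draws standing in for compositional data:

```python
import numpy as np

def energy_stat(x, y):
    """Energy distance between two samples (rows = observations)."""
    def mean_dist(a, b):
        return np.mean(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2))
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def perm_test(x, y, n_perm=200, seed=0):
    """Permutation p-value for the energy statistic."""
    rng = np.random.default_rng(seed)
    obs = energy_stat(x, y)
    pooled = np.vstack([x, y])
    n, count = len(x), 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if energy_stat(pooled[idx[:n]], pooled[idx[n:]]) >= obs:
            count += 1
    return obs, (count + 1) / (n_perm + 1)

# Two hypothetical 3-part compositions (rows sum to one)
rng = np.random.default_rng(2)
x = rng.dirichlet([2, 2, 2], size=30)
y = rng.dirichlet([4, 1, 1], size=30)
obs, p = perm_test(x, y)
print(round(p, 3))
```

Because the test works directly on pairwise distances between compositions, no log-ratio transformation or covariance regularity is needed, which is the point the abstract makes.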


Author(s):  
Haoyang Cheng ◽  
Wenquan Cui

Heteroscedasticity often appears in high-dimensional data analysis. In order to achieve a sparse dimension reduction direction for high-dimensional data with heteroscedasticity, we propose a new sparse sufficient dimension reduction method, called Lasso-PQR. From the candidate matrix derived from the principal quantile regression (PQR) method, we construct a new artificial response variable made up of top eigenvectors of the candidate matrix. Then we apply a Lasso regression to obtain sparse dimension reduction directions. For the “large p, small n” case, in which the number of predictors exceeds the sample size, we use principal projection to solve the dimension reduction problem in a lower-dimensional subspace and then project back to the original dimension reduction problem. Theoretical properties of the methodology are established. Comparisons with several existing methods in simulations and real data analysis demonstrate the advantages of our method for high-dimensional data with heteroscedasticity.
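The two-step recipe (artificial response from the top eigenvector of a candidate matrix, then a lasso fit) can be sketched as follows. The candidate matrix `M` below is a toy stand-in, not one derived from principal quantile regression, and the lasso solver is a plain textbook coordinate descent:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso for (1/2)||y - Xb||^2 + n*lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            beta[j] = soft_threshold(X[:, j] @ r_j, n * lam) / col_sq[j]
    return beta

rng = np.random.default_rng(3)
p, n = 5, 100
X = rng.normal(size=(n, p))

# Toy symmetric candidate matrix whose top eigenvector is the target direction
true_dir = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
M = np.outer(true_dir, true_dir) + 0.01 * np.eye(p)
top_vec = np.linalg.eigh(M)[1][:, -1]    # eigenvector of the largest eigenvalue

y_art = X @ top_vec                      # artificial response
beta = lasso_cd(X, y_art, lam=0.05)      # sparse direction estimate
print(np.round(beta, 2))
```

The lasso step zeroes out the coordinates that play no role in the direction, which is what makes the recovered direction sparse and interpretable.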


2020 ◽  
Vol 32 (6) ◽  
pp. 1168-1221
Author(s):  
Masaaki Takada ◽  
Taiji Suzuki ◽  
Hironori Fujisawa

Sparse regularization such as ℓ1 regularization is a powerful and widely used strategy for high-dimensional learning problems. The effectiveness of sparse regularization has been supported both practically and theoretically by several studies. However, one of the biggest issues in sparse regularization is that its performance is quite sensitive to correlations between features. Ordinary ℓ1 regularization selects variables that are correlated with each other under weak regularization, which deteriorates not only the estimation error but also interpretability. In this letter, we propose a new regularization method, the independently interpretable lasso (IILasso), for generalized linear models. Our proposed regularizer suppresses the selection of correlated variables, so that each active variable affects the response independently in the model. Hence, regression coefficients can be interpreted intuitively, and performance is also improved by avoiding overfitting. We analyze the theoretical properties of the IILasso and show that the proposed method is advantageous for sign recovery and achieves an almost minimax optimal convergence rate. Synthetic and real data analyses also indicate the effectiveness of the IILasso.
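The flavor of the regularizer can be conveyed with a sketch of an IILasso-style objective: an ℓ1 term plus a term charging pairs of correlated active variables. The exact weighting in the paper may differ; `alpha` and the use of absolute sample correlations are assumptions of this sketch:

```python
import numpy as np

def iilasso_objective(X, y, beta, lam, alpha=1.0):
    """Squared loss + l1 penalty + a correlation-weighted interaction penalty.

    The interaction term |b|^T R |b| (R = absolute correlations, zero
    diagonal) grows when two correlated variables are both active.
    """
    n = len(y)
    R = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(R, 0.0)
    loss = 0.5 / n * np.sum((y - X @ beta) ** 2)
    b = np.abs(beta)
    return loss + lam * (b.sum() + alpha * b @ R @ b)

# Two nearly identical columns: concentrating weight on one of them
# is cheaper than splitting it across the correlated pair.
rng = np.random.default_rng(4)
x0 = rng.normal(size=200)
X = np.column_stack([x0, x0 + 0.01 * rng.normal(size=200), rng.normal(size=200)])
y = x0
concentrated = iilasso_objective(X, y, np.array([1.0, 0.0, 0.0]), lam=0.1)
split = iilasso_objective(X, y, np.array([0.5, 0.5, 0.0]), lam=0.1)
print(concentrated < split)
```

Both candidate solutions have the same ℓ1 norm, so an ordinary lasso is indifferent between them; the interaction term is what pushes the solver toward one representative per correlated group.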


Author(s):  
Fengyu Zhang ◽  
Claude Hughes

There have been a series of recent discussions and debates on the p-value and statistical significance. These discussions, including publications of more than 40 papers in a special issue of the American Statistician, provide an excellent opportunity to think about some technical measures for practical implementation in grant applications and publications. While several factors have been discussed, it may be the rigor of a study that determines the p-value for reporting study results and for judging consistent replication of research findings. Both p-values and power, which integrate Fisherian and Neyman-Pearson methods, should be used for hypothesis testing. We propose new criteria, which can be implemented without fundamental changes in existing statistics, to reduce false positives and irreplicability of studies that are either inadequately powered or overpowered.
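Power calculations of the kind such criteria require are standard; a sketch for the two-sided two-sample t-test via the noncentral t distribution follows (this is textbook machinery, not the authors' proposed criteria):

```python
import numpy as np
from scipy import stats

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test (Cohen's d effect size)."""
    df = 2 * n_per_group - 2
    nc = effect_size * np.sqrt(n_per_group / 2)     # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability of landing in either rejection region under the alternative
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

# Classic benchmark: medium effect (d = 0.5), 64 per group gives power near 0.80
power = two_sample_power(effect_size=0.5, n_per_group=64)
print(round(power, 3))
```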



Author(s):  
Roberto Benedetti ◽  
Maria Michela Dickson ◽  
Giuseppe Espa ◽  
Francesco Pantalone ◽  
Federica Piersimoni

Balanced sampling is a random method for sample selection whose use is preferable when auxiliary information is available for all units of a population. However, implementing balanced sampling can be a challenging task, due in part to the computational effort required and the need to respect balancing constraints and inclusion probabilities. In the present paper, a new algorithm for selecting balanced samples is proposed. The method is inspired by simulated annealing algorithms, as balanced sample selection can be interpreted as an optimization problem. A set of simulation experiments and an example using real data show the efficiency and accuracy of the proposed algorithm.
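A generic simulated-annealing treatment of sample selection as optimization might look like the sketch below, with equal inclusion probabilities and a Euclidean imbalance measure as simplifying assumptions (this is not the authors' algorithm):

```python
import numpy as np

def sa_balanced_sample(X, n, n_iter=5000, t0=1.0, cooling=0.999, seed=0):
    """Select a size-n sample whose Horvitz-Thompson totals of the auxiliary
    variables X approximate the population totals, via annealed swaps."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    pi = n / N                                   # equal inclusion probabilities
    target = X.sum(axis=0)                       # population totals
    imbalance = lambda idx: np.linalg.norm(X[idx].sum(axis=0) / pi - target)

    idx = rng.choice(N, size=n, replace=False)   # random starting sample
    cost, temp = imbalance(idx), t0
    for _ in range(n_iter):
        # Propose swapping one sampled unit for one outside the sample
        out = rng.integers(n)
        newcomer = rng.choice(np.setdiff1d(np.arange(N), idx))
        new_idx = idx.copy()
        new_idx[out] = newcomer
        new_cost = imbalance(new_idx)
        # Metropolis acceptance: always downhill, sometimes uphill
        if new_cost < cost or rng.random() < np.exp((cost - new_cost) / temp):
            idx, cost = new_idx, new_cost
        temp *= cooling
    return idx, cost

rng = np.random.default_rng(5)
X = rng.normal(10, 2, size=(200, 2))             # two auxiliary variables
idx, cost = sa_balanced_sample(X, n=20)
print(f"final imbalance: {cost:.3f}")
```

Early on the high temperature lets the chain escape poor sample configurations; as it cools, the search settles on a sample whose estimated totals closely match the known population totals.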

