STATISTICAL INFERENCE WITH F-STATISTICS WHEN FITTING SIMPLE MODELS TO HIGH-DIMENSIONAL DATA

2021 ◽  
pp. 1-24
Author(s):  
Hannes Leeb ◽  
Lukas Steinberger

Abstract: We study linear subset regression in the context of the high-dimensional overall model $y = \vartheta + \theta' z + \epsilon$ with univariate response $y$ and a $d$-vector of random regressors $z$, independent of $\epsilon$. Here, “high-dimensional” means that the number $d$ of available explanatory variables is much larger than the number $n$ of observations. We consider simple linear submodels where $y$ is regressed on a set of $p$ regressors given by $x = M'z$, for some $d \times p$ matrix $M$ of full rank $p < n$. The corresponding simple model, that is, $y = \alpha + \beta' x + e$, is usually justified by imposing appropriate restrictions on the unknown parameter $\theta$ in the overall model; otherwise, this simple model can be grossly misspecified in the sense that relevant variables may have been omitted. In this paper, we establish asymptotic validity of the standard F-test on the surrogate parameter $\beta$, in an appropriate sense, even when the simple model is misspecified, that is, without any restrictions on $\theta$ whatsoever and without assuming Gaussian data.
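The setup in this abstract can be illustrated with a minimal sketch: it simulates an overall model with $d > n$, forms the submodel regressors $x = M'z$, and computes the classical F-statistic for $\beta = 0$. The dimensions, the coefficient scaling, the choice of $M$ as a column selector, and the function name `overall_f_statistic` are all assumptions made for illustration; the sketch only computes the statistic and does not reproduce the paper's asymptotic argument.

```python
import numpy as np

def overall_f_statistic(x, y):
    """Classical F-statistic for H0: beta = 0 in y = alpha + beta'x + e."""
    n, p = x.shape
    design = np.column_stack([np.ones(n), x])           # intercept + p regressors
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)   # ordinary least squares
    rss = np.sum((y - design @ coef) ** 2)               # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)                    # total sum of squares
    return ((tss - rss) / p) / (rss / (n - p - 1))

# Illustrative (assumed) dimensions: d much larger than n, and p < n.
rng = np.random.default_rng(0)
n, d, p = 50, 200, 3
z = rng.standard_normal((n, d))                # rows are observations of z
theta = rng.standard_normal(d) / np.sqrt(d)    # unrestricted theta (assumed scaling)
y = 1.0 + z @ theta + rng.standard_normal(n)   # overall model

M = np.eye(d)[:, :p]   # assumed M: selects the first p regressors
x = z @ M              # row-wise version of x = M'z
f_stat = overall_f_statistic(x, y)
```

Under the classical Gaussian, correctly specified theory, `f_stat` would be compared with an F distribution with $p$ and $n - p - 1$ degrees of freedom; here the submodel omits $d - p$ regressors, which is the misspecified situation the abstract addresses.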

Processes ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. 259
Author(s):  
Qilan Ran ◽  
Yedong Song ◽  
Wenli Du ◽  
Wei Du ◽  
Xin Peng

To reduce pollutant emissions from diesel vehicles, complex after-treatment technologies have been proposed, which makes fault detection in diesel engines increasingly difficult. This paper therefore proposes a canonical correlation analysis detection method, based on fault-relevant variables selected by an elitist genetic algorithm, for high-dimensional data-driven fault detection in diesel engines. The proposed method builds a fault detection model from actual operating data, overcoming the limitations of traditional methods that rely only on benchmarks. Canonical correlation analysis is used to extract the strong correlations between variables, from which a residual vector is constructed to detect faults in the diesel engine air and after-treatment system. In particular, the elitist genetic algorithm optimizes the selection of fault-relevant variables to reduce detection redundancy, eliminate additional noise interference, and improve the detection rate for specific faults. Experiments on practical state data from a diesel engine show the feasibility and efficiency of the proposed approach.


1984 ◽  
Vol 13 (1) ◽  
pp. 97-102 ◽  
Author(s):  
Patricia D. Taylor ◽  
William G. Tomek

This study develops a simple model to forecast the basis for corn in a specific region. Improved forecasts can improve hedging decisions. Basis behavior, however, depends on explanatory variables that are themselves difficult to forecast with precision. This limits the usefulness of the basis model, but it does offer some benefit over naive forecasts.


Author(s):  
MA Islam ◽  
SK Paul

The objective of this research is to evaluate people’s perceptions of the vulnerabilities of agriculture and to explore effective adaptation options by identifying the underlying demographic, socio-economic and other relevant variables that influence adaptation strategies in the coastal areas of Bangladesh affected by the sea level rise (SLR) hazard. The study finds that climate change and the SLR it induces are emerging threats to the coastal agriculture of Bangladesh; hence, farmers are applying different adaptation strategies to reduce the vulnerabilities of coastal agriculture. The selection of effective adaptation strategies depends not only on the magnitude, intensity and impacts of climate change and SLR, but also on the perceptions and types of farmer, land, educational level, indigenous knowledge about adaptation, locational advantages, external support, community awareness and the sharing of effective mechanisms among farmers. Effective adaptation strategies combined with high levels of perception have a significant influence on reducing the vulnerabilities of agriculture to the adverse impacts of climate change and SLR. In times of extreme climatic hazards, when a great loss in agriculture hampers the coastal agro-based economy, effective indigenous local adaptation strategies supported by farmer awareness and community co-operation become vital for minimizing the impact of climatic hazards and reducing the vulnerabilities of coastal agriculture. Int. J. Agril. Res. Innov. & Tech. 8 (1): 70-78, June, 2018


2008 ◽  
Vol 15 (04) ◽  
pp. 371-382 ◽  
Author(s):  
M. M. Al-sawalha ◽  
M. S. M. Noorani

This paper brings attention to hyperchaos anti-synchronization between two identical and distinctive hyperchaotic systems using active control theory. Sufficient conditions for achieving anti-synchronization of two high-dimensional hyperchaotic systems are derived based on Lyapunov stability theory, where the controllers are designed by using the sum of relevant variables in the hyperchaotic systems. Numerical results are presented to justify the theoretical analysis.


PLoS ONE ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. e0248046
Author(s):  
Elizabeth Hou ◽  
Earl Lawrence ◽  
Alfred O. Hero

The ensemble Kalman filter (EnKF) is a data assimilation technique that uses an ensemble of models, updated with data, to track the time evolution of a usually non-linear system. It does so by using an empirical approximation to the well-known Kalman filter. However, its performance can suffer when the ensemble size is smaller than the state space, as is often necessary for computationally burdensome models. In this scenario, the empirical estimate of the state covariance is not full rank and is possibly quite noisy. To solve this problem in the high-dimensional regime, we propose a computationally fast and easy-to-implement algorithm called the penalized ensemble Kalman filter (PEnKF). Under certain conditions, it can be theoretically proven that the PEnKF will be accurate (the estimation error will converge to zero) despite having fewer ensemble members than state dimensions. Further, in contrast to localization methods, the proposed approach learns the covariance structure associated with the dynamical system. These theoretical results are supported with simulations of several non-linear and high-dimensional systems.
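To make the rank problem concrete, here is a minimal sketch of the standard stochastic EnKF analysis step (with perturbed observations), not the penalized PEnKF proposed in the abstract. The function name `enkf_update`, the linear observation operator `H`, the noise covariance `R`, and all dimensions are assumptions chosen for illustration.

```python
import numpy as np

def enkf_update(ensemble, obs, H, R, rng):
    """One stochastic EnKF analysis step.

    ensemble: (n_members, n_state), obs: (n_obs,),
    H: (n_obs, n_state) linear observation operator, R: (n_obs, n_obs).
    """
    n_members = ensemble.shape[0]
    anomalies = ensemble - ensemble.mean(axis=0)
    P = anomalies.T @ anomalies / (n_members - 1)    # empirical state covariance
    S = H @ P @ H.T + R                               # innovation covariance
    K = np.linalg.solve(S, H @ P).T                   # Kalman gain P H' S^{-1}
    # Each member assimilates its own noisy copy of the observation.
    noise = rng.multivariate_normal(np.zeros(len(obs)), R, size=n_members)
    innovations = (obs + noise) - ensemble @ H.T
    return ensemble + innovations @ K.T

# Assumed toy setting: 5 ensemble members, 10 state dimensions.
rng = np.random.default_rng(1)
n_members, n_state = 5, 10
prior = rng.standard_normal((n_members, n_state))
H = np.eye(n_state)[:1]        # observe only the first state component
R = np.array([[1e-6]])         # very accurate observation
obs = np.array([2.0])
posterior = enkf_update(prior, obs, H, R, rng)

# The centered ensemble has rank at most n_members - 1, so the
# empirical covariance cannot be full rank when n_members <= n_state.
cov_rank = np.linalg.matrix_rank(prior - prior.mean(axis=0))
```

With fewer members than state dimensions, `cov_rank` is at most `n_members - 1`, which is the rank deficiency the abstract describes as motivation for a penalized covariance estimate.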


2021 ◽  
Author(s):  
Jalmari Tuominen ◽  
Francesco Lomio ◽  
Niku Oksala ◽  
Ari Palomäki ◽  
Jaakko Peltonen ◽  
...  

Abstract. Background and Objective: Emergency Department (ED) overcrowding is a chronic international issue that is associated with adverse treatment outcomes. Accurate forecasts of future service demand would enable intelligent resource allocation that could alleviate the problem. There has been continued academic interest in ED forecasting, but the number of explanatory variables used has been low, limited mainly to calendar and weather variables. In this study we investigate whether the predictive accuracy of next-day arrivals could be enhanced using a high number of potentially relevant explanatory variables, and we document two feature selection processes that aim to identify which subset of variables is associated with the number of next-day arrivals. Methods: We extracted numbers of total daily arrivals from the Tampere University Hospital ED between June 1, 2015 and June 19, 2019. 158 potential explanatory variables were collected from multiple data sources consisting not only of weather and calendar variables but also an extensive list of local public events, numbers of website visits to two hospital domains, numbers of available hospital beds in 33 local hospitals or health centres, and Google Trends searches for the ED. We used two feature selection processes: Simulated Annealing (SA) and Floating Search (FS) with Recursive Least Squares (RLS) and Least Mean Squares (LMS). Performance of these approaches was compared against autoregressive integrated moving average (ARIMA), regression with ARIMA errors (ARIMAX) and Random Forest (RF). Mean Absolute Percentage Error (MAPE) was used as the main error metric. Results: Calendar variables, load of secondary care facilities and local public events were dominant in the identified predictive features. RLS-SA and RLS-FA provided slightly better accuracy compared with ARIMA. ARIMAX was the most accurate model, but the difference between RLS-SA and RLS-FA was not statistically significant. Conclusions: Our study provides new insight into potential underlying factors associated with the number of next-day presentations. It also suggests that the predictive accuracy of next-day arrivals can be increased using a high-dimensional feature selection approach when compared to both univariate and nonfiltered high-dimensional approaches. However, outperforming ARIMAX remains a challenge when working with daily data. Future work should focus on enhancing the feature selection mechanism, investigating its applicability to other domains and identifying other potentially relevant explanatory variables.
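The main error metric named in this abstract, MAPE, is the mean of the absolute forecast errors expressed as a percentage of the observed values. A minimal sketch follows; the daily arrival counts are hypothetical numbers chosen for illustration and are not data from the study.

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent.

    Assumes no observed value is zero, since each error is divided
    by the corresponding observation.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

# Hypothetical daily ED arrivals and two competing forecasts.
arrivals = np.array([100.0, 200.0, 400.0])
model_forecast = np.array([110.0, 180.0, 400.0])
naive_forecast = np.array([150.0, 100.0, 200.0])

model_error = mape(arrivals, model_forecast)   # errors of 10%, 10%, 0%
naive_error = mape(arrivals, naive_forecast)   # errors of 50%, 50%, 50%
```

Because MAPE is scale-free, it allows forecasts for days with very different arrival volumes to be compared on one footing, at the cost of being undefined when an observed count is zero.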


Author(s):  
Verena Zuber ◽  
Korbinian Strimmer

Variable selection is a difficult problem that is particularly challenging in the analysis of high-dimensional genomic data. Here, we introduce the CAR score, a novel and highly effective criterion for variable ranking in linear regression based on Mahalanobis-decorrelation of the explanatory variables. The CAR score provides a canonical ordering that encourages grouping of correlated predictors and down-weights antagonistic variables. It decomposes the proportion of variance explained and it is an intermediate between marginal correlation and the standardized regression coefficient. As a population quantity, any preferred inference scheme can be applied for its estimation. Using simulations, we demonstrate that variable selection by CAR scores is very effective and yields prediction errors and true and false positive rates that compare favorably with modern regression techniques such as elastic net and boosting. We illustrate our approach by analyzing data concerned with diabetes progression and with the effect of aging on gene expression in the human brain. The R package “care” implementing CAR score regression is available from CRAN.
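As a rough illustration of the criterion described above, the CAR scores can be written as $\omega = P^{-1/2} P_{Xy}$, where $P$ is the correlation matrix of the explanatory variables and $P_{Xy}$ the vector of marginal correlations with the response. The sketch below uses plain sample correlations as a plug-in estimate, which is an assumption made here for simplicity and is not necessarily what the “care” package does; the function name `car_scores` and the simulated data are likewise hypothetical.

```python
import numpy as np

def car_scores(X, y):
    """Plug-in CAR scores: Mahalanobis-decorrelated predictors vs. response."""
    R = np.corrcoef(X, rowvar=False)                       # predictor correlations
    r_xy = np.array([np.corrcoef(X[:, j], y)[0, 1]
                     for j in range(X.shape[1])])          # marginal correlations
    eigval, eigvec = np.linalg.eigh(R)                     # R is symmetric
    R_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return R_inv_sqrt @ r_xy

# Simulated data with two correlated predictors (assumed example).
rng = np.random.default_rng(2)
n = 100
X = rng.standard_normal((n, 4))
X[:, 1] = 0.8 * X[:, 0] + 0.6 * X[:, 1]
y = X[:, 0] + X[:, 1] + rng.standard_normal(n)

omega = car_scores(X, y)
ranking = np.argsort(-omega ** 2)   # variables ordered by squared CAR score

# R^2 of the ordinary least squares fit, for comparison.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
r_squared = 1.0 - np.sum((y - design @ coef) ** 2) / np.sum((y - y.mean()) ** 2)
```

The squared scores in `omega` add up to `r_squared`, which is the decomposition of the proportion of variance explained mentioned in the abstract; when the predictors are uncorrelated, the scores reduce to the marginal correlations.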

