Normality Testing of High-Dimensional Data Based on Principle Component and Jarque–Bera Statistics

Stats ◽  
2021 ◽  
Vol 4 (1) ◽  
pp. 216-227
Author(s):  
Yanan Song ◽  
Xuejing Zhao

The testing of high-dimensional normality is an important issue that has been intensively studied in the literature. Such tests depend on the variance–covariance matrix of the sample, and numerous methods have been proposed to reduce its complexity. Principal component analysis (PCA) has been widely used in high dimensions, since it projects high-dimensional data into a lower-dimensional orthogonal space; the normality of the reduced data can then be evaluated by Jarque–Bera (JB) statistics in each principal direction. We propose a combined test statistic, the summation of the one-way JB statistics under the assumed independence of the principal directions, to test the multivariate normality of data in high dimensions. The performance of the proposed method is illustrated by the empirical power on simulated normal and non-normal data. Two real data examples show the validity of the proposed method.
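A minimal Python sketch of the procedure described above, assuming the principal directions are obtained from the SVD of the centered data and treating the retained directions as independent under the null; the function names are illustrative and this is not the authors' reference implementation.

```python
import numpy as np
from scipy import stats

def jb_stat(x):
    # one-way Jarque-Bera statistic: n/6 * (skewness^2 + (kurtosis - 3)^2 / 4)
    n = len(x)
    s = stats.skew(x)
    k = stats.kurtosis(x, fisher=False)   # Pearson kurtosis, equals 3 under normality
    return n / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)

def pca_jb_test(X, k=None):
    """X: (n, p) data matrix; k: number of principal directions kept (default: all p)."""
    n, p = X.shape
    k = p if k is None else k
    Xc = X - X.mean(axis=0)                              # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)    # principal directions
    scores = Xc @ Vt[:k].T                               # projections onto the first k directions
    jb = np.array([jb_stat(scores[:, j]) for j in range(k)])
    jb_sum = jb.sum()                                    # combined statistic: sum of one-way JBs
    # under H0, assuming independent directions, the sum is compared with chi2(2k)
    p_value = stats.chi2.sf(jb_sum, df=2 * k)
    return jb_sum, jb.max(), p_value

rng = np.random.default_rng(0)
jb_sum, jb_max, p_val = pca_jb_test(rng.standard_normal((200, 10)))
print(jb_sum, jb_max, p_val)
```

The maximum of the one-way JB statistics, returned alongside the sum, corresponds to the second combined statistic mentioned in the alternative version of the abstract below.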


Author(s):  
Ya-nan Song ◽  
Xuejing Zhao

The testing of high-dimensional normality is an important issue that has been intensively studied in the literature. It depends on the variance–covariance matrix of the sample, and numerous methods have been proposed to reduce the complexity of this matrix. Principal component analysis (PCA) is widely used because it projects high-dimensional data into a lower-dimensional orthogonal space, where the normality of the reduced data can be evaluated by the Jarque–Bera (JB) statistic in each principal direction. We propose two combined statistics, the summation and the maximum of the one-way JB statistics, under the assumed independence of the principal directions, to test the multivariate normality of data in high dimensions. The performance of the proposed methods is illustrated by the empirical power on simulated normal and non-normal data. Two real examples show the validity of the proposed methods.



2018 ◽  
Vol 8 (2) ◽  
pp. 377-406
Author(s):  
Almog Lahav ◽  
Ronen Talmon ◽  
Yuval Kluger

A fundamental question in data analysis, machine learning and signal processing is how to compare data points. The choice of the distance metric is especially challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. The method was also applied to real gene-expression data for lung adenocarcinomas (lung cancer); using the proposed metric, we found a partition of the subjects into risk groups with good separation between their Kaplan–Meier survival plots.
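The construction below is only a schematic illustration of the idea of exploiting clusters of coordinates when forming a Mahalanobis distance between samples. The use of KMeans for coordinate clustering and the block-covariance choice are assumptions of the sketch, not the authors' algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_mahalanobis(X, n_clusters=5):
    """X: (n_samples, n_features); returns an (n, n) matrix of pairwise distances."""
    Xc = X - X.mean(axis=0)
    # cluster the coordinates by the similarity of their profiles across samples
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(Xc.T)
    # illustrative choice: keep only within-cluster covariances (block structure)
    S = np.cov(Xc, rowvar=False)
    S_block = np.where(labels[:, None] == labels[None, :], S, 0.0)
    S_inv = np.linalg.pinv(S_block)
    # pairwise Mahalanobis distances: d(i, j)^2 = (x_i - x_j)^T S_inv (x_i - x_j)
    diff = Xc[:, None, :] - Xc[None, :, :]
    d2 = np.einsum('ijk,kl,ijl->ij', diff, S_inv, diff)
    return np.sqrt(np.clip(d2, 0.0, None))

# toy usage
rng = np.random.default_rng(0)
D = clustered_mahalanobis(rng.standard_normal((50, 20)))
print(D.shape)
```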



2021 ◽  
pp. 1471082X2110410
Author(s):  
Elena Tuzhilina ◽  
Leonardo Tozzi ◽  
Trevor Hastie

Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA), which imposes an ℓ2 penalty on the CCA coefficients, is widely used in applications with high-dimensional data. One limitation of such regularization is that it ignores any data structure, treating all the features equally, which can be ill-suited for some applications. In this article we introduce several approaches to regularizing CCA that take the underlying data structure into account. In particular, the proposed group regularized canonical correlation analysis (GRCCA) is useful when the variables are correlated in groups. We illustrate some computational strategies to avoid excessive computations with regularized CCA in high dimensions. We demonstrate the application of these methods in our motivating application from neuroscience, as well as in a small simulation example.
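A compact numpy sketch of ridge-regularized CCA (the ℓ2-penalized RCCA referred to above), written as a regularized whitening of each block followed by an SVD of the cross-covariance. The grouping extension (GRCCA) and the computational shortcuts discussed in the article are not shown; the function name and penalty parameters are illustrative.

```python
import numpy as np

def rcca(X, Y, lam_x=0.1, lam_y=0.1):
    """Leading canonical correlation and coefficient vectors for X (n x p) and Y (n x q)."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    # ridge-regularized within-set covariances (the l2 penalty on the CCA coefficients)
    Sxx = Xc.T @ Xc / n + lam_x * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + lam_y * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # whiten each block (Tx.T @ Sxx @ Tx = I) and take the SVD of the cross-covariance
    Tx = np.linalg.inv(np.linalg.cholesky(Sxx)).T
    Ty = np.linalg.inv(np.linalg.cholesky(Syy)).T
    U, s, Vt = np.linalg.svd(Tx.T @ Sxy @ Ty)
    a, b = Tx @ U[:, 0], Ty @ Vt[0]      # leading canonical coefficient vectors
    return s[0], a, b

# toy usage
rng = np.random.default_rng(0)
rho, a, b = rcca(rng.standard_normal((100, 30)), rng.standard_normal((100, 20)))
print(round(rho, 3))
```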



Author(s):  
Haoyang Cheng ◽  
Wenquan Cui

Heteroscedasticity often appears in high-dimensional data analysis. In order to achieve a sparse dimension reduction direction for high-dimensional data with heteroscedasticity, we propose a new sparse sufficient dimension reduction method, called Lasso-PQR. From the candidate matrix derived from the principal quantile regression (PQR) method, we construct a new artificial response variable made up from the top eigenvectors of the candidate matrix. We then apply a Lasso regression to obtain sparse dimension reduction directions. For the "large p, small n" case in which p > n, we use principal projection to solve the dimension reduction problem in a lower-dimensional subspace and project back to the original dimension reduction problem. Theoretical properties of the methodology are established. Through comparisons with several existing methods in simulations and a real data analysis, we demonstrate the advantages of our method for high-dimensional data with heteroscedasticity.
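A schematic sketch of the Lasso step as read from the abstract. The PQR candidate matrix M is assumed to be computed elsewhere (its construction is not reproduced here), the artificial response is taken as the projection of the data onto a top eigenvector of M, and the helper name lasso_pqr_directions is illustrative rather than from the article.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_pqr_directions(X, M, n_directions=1, alpha=0.1):
    """X: (n, p) predictors; M: (p, p) PQR candidate matrix (assumed given)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]            # top eigenvectors of the candidate matrix
    directions = []
    for k in range(n_directions):
        y_art = Xc @ eigvecs[:, order[k]]        # artificial response from the k-th eigenvector
        beta = Lasso(alpha=alpha, fit_intercept=False).fit(Xc, y_art).coef_
        directions.append(beta)                  # sparse dimension reduction direction
    return np.array(directions).T
```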



2013 ◽  
Vol 444-445 ◽  
pp. 604-609
Author(s):  
Guang Hui Fu ◽  
Pan Wang

LASSO is a very useful variable selection method for high-dimensional data, but it possesses neither the oracle property [Fan and Li, 2001] nor the grouping effect [Zou and Hastie, 2005]. In this paper, we first review four improved LASSO-type methods which satisfy the oracle property and/or the grouping effect, and then propose two new ones, called WFEN and WFAEN. Their performance on both simulated and real data sets shows that WFEN and WFAEN are competitive with other LASSO-type methods.
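The WFEN and WFAEN estimators are not defined in this summary, so no attempt is made to reproduce them here. As a baseline reference, the following scikit-learn sketch contrasts the LASSO with the elastic net of Zou and Hastie (2005), which does exhibit the grouping effect on correlated features; the data and penalty settings are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n, p = 100, 50
z = rng.standard_normal((n, 1))
# first three columns are nearly identical (a correlated group); the rest are noise
X = np.hstack([z + 0.01 * rng.standard_normal((n, 3)),
               rng.standard_normal((n, p - 3))])
y = X[:, :3].sum(axis=1) + rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
# LASSO tends to select only part of the correlated group, while the elastic net
# spreads similar coefficients across all three members (the grouping effect).
print("lasso:", np.round(lasso.coef_[:3], 2))
print("enet :", np.round(enet.coef_[:3], 2))
```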



Author(s):  
Guangzhu Yu ◽  
Shihuang Shao ◽  
Bin Luo ◽  
Xianhui Zeng

Existing algorithms for high-utility itemset mining are column enumeration based, adopting an Apriori-like candidate set generation-and-test approach, and thus are inadequate for datasets with high dimensions or long patterns. To solve this problem, this paper proposes a hybrid model and a row enumeration-based algorithm, Inter-transaction, to discover high-utility itemsets from two directions: an existing algorithm can be used to seek short high-utility itemsets from the bottom, while Inter-transaction can be used to seek long high-utility itemsets from the top. Inter-transaction makes full use of the characteristic that there are few common items between or among long transactions. By intersecting relevant transactions, the new algorithm can identify long high-utility itemsets without extending short itemsets step by step. In addition, we also develop new pruning strategies and an optimization technique to improve the performance of Inter-transaction.
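The following is a simplified illustration of the intersection idea only; the hybrid model, the pruning strategies and the optimization technique mentioned in the abstract are omitted, and the data layout (each transaction as an item-to-utility mapping) is an assumption of the sketch.

```python
from itertools import combinations

def long_candidates_by_intersection(transactions, min_utility, min_len=3):
    """transactions: list of dicts mapping item -> utility in that transaction."""
    high_utility = {}
    for t1, t2 in combinations(transactions, 2):
        itemset = frozenset(t1) & frozenset(t2)          # intersect two transactions
        if len(itemset) < min_len or itemset in high_utility:
            continue
        # total utility of the itemset over all transactions that contain it
        util = sum(sum(t[i] for i in itemset)
                   for t in transactions if itemset <= set(t))
        if util >= min_utility:
            high_utility[itemset] = util
    return high_utility

# toy example
ts = [{'a': 5, 'b': 3, 'c': 2, 'd': 1},
      {'a': 4, 'b': 2, 'c': 3, 'e': 6},
      {'a': 1, 'b': 1, 'c': 1, 'd': 2}]
print(long_candidates_by_intersection(ts, min_utility=15))
```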



2017 ◽  
Vol 1 (2) ◽  
pp. 118
Author(s):  
Knavoot Jiamwattanapong ◽  
Samruam Chongcharoen

Modern measurement technology has enabled researchers and statisticians to capture high-dimensional data, and classical statistical inferences, such as the renowned Hotelling's T² test, are no longer valid when the dimension of the data equals or exceeds the sample size. Importantly, when correlations among variables in a dataset exist, taking them into account in the analysis provides more accurate conclusions. In this article, we consider the hypothesis testing problem for two mean vectors in high-dimensional data under a normality assumption. A new test is proposed based on the idea of keeping more information from the sample covariances. The asymptotic null distribution of the test statistic is derived. The simulation results show that the proposed test performs well compared with other competing tests and becomes more powerful as the dimension increases for a given sample size. The proposed test is also illustrated with an analysis of DNA microarray data.


