Normality Testing of High-Dimensional Data Based on Principle Component and Jarque–Bera Statistics

Stats ◽  
2021 ◽  
Vol 4 (1) ◽  
pp. 216-227
Author(s):  
Yanan Song ◽  
Xuejing Zhao

The testing of high-dimensional normality is an important issue that has been intensively studied in the literature. Such tests depend on the variance–covariance matrix of the sample, and numerous methods have been proposed to reduce its complexity. Principal component analysis (PCA) has been widely used in high dimensions, since it projects high-dimensional data into a lower-dimensional orthogonal space; the normality of the reduced data can then be evaluated by Jarque–Bera (JB) statistics in each principal direction. We propose a combined test statistic, the summation of the one-way JB statistics under the assumed independence of the principal directions, to test the multivariate normality of data in high dimensions. The performance of the proposed method is illustrated by the empirical power on simulated normal and non-normal data. Two real data examples show the validity of the proposed method.
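A minimal Python sketch of the procedure described above, assuming the principal directions are obtained from the SVD of the centered data and treating the retained directions as independent under the null; the function names are illustrative and this is not the authors' reference implementation.

```python
import numpy as np
from scipy import stats

def jb_stat(x):
    # one-way Jarque-Bera statistic: n/6 * (skewness^2 + (kurtosis - 3)^2 / 4)
    n = len(x)
    s = stats.skew(x)
    k = stats.kurtosis(x, fisher=False)   # Pearson kurtosis, equals 3 under normality
    return n / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)

def pca_jb_test(X, k=None):
    """X: (n, p) data matrix; k: number of principal directions kept (default: all p)."""
    n, p = X.shape
    k = p if k is None else k
    Xc = X - X.mean(axis=0)                              # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)    # principal directions
    scores = Xc @ Vt[:k].T                               # projections onto the first k directions
    jb = np.array([jb_stat(scores[:, j]) for j in range(k)])
    jb_sum = jb.sum()                                    # combined statistic: sum of one-way JBs
    # under H0, assuming independent directions, the sum is compared with chi2(2k)
    p_value = stats.chi2.sf(jb_sum, df=2 * k)
    return jb_sum, jb.max(), p_value

rng = np.random.default_rng(0)
jb_sum, jb_max, p_val = pca_jb_test(rng.standard_normal((200, 10)))
print(jb_sum, jb_max, p_val)
```

The maximum of the one-way JB statistics, returned alongside the sum, corresponds to the second combined statistic mentioned in the alternative version of the abstract below.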


Author(s):  
Ya-nan Song ◽  
Xuejing Zhao

The testing of high-dimensional normality is an important issue that has been intensively studied in the literature. It depends on the variance–covariance matrix of the sample, and numerous methods have been proposed to reduce the complexity of this matrix. Principal component analysis (PCA) is widely used because it projects high-dimensional data into a lower-dimensional orthogonal space, where the normality of the reduced data can be evaluated by the Jarque–Bera (JB) statistic in each principal direction. We propose two combined statistics, the summation and the maximum of the one-way JB statistics, under the assumed independence of the principal directions, to test the multivariate normality of data in high dimensions. The performance of the proposed methods is illustrated by the empirical power on simulated normal and non-normal data. Two real examples show the validity of the proposed methods.



2018 ◽  
Vol 8 (2) ◽  
pp. 377-406
Author(s):  
Almog Lahav ◽  
Ronen Talmon ◽  
Yuval Kluger

A fundamental question in data analysis, machine learning and signal processing is how to compare data points. The choice of the distance metric is especially challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. The method was also applied to real gene-expression data for lung adenocarcinomas (lung cancer); using the proposed metric, we found a partition of the subjects into risk groups with good separation between their Kaplan–Meier survival plots.
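The construction below is only a schematic illustration of the idea of exploiting clusters of coordinates when forming a Mahalanobis distance between samples. The use of KMeans for coordinate clustering and the block-covariance choice are assumptions of the sketch, not the authors' algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_mahalanobis(X, n_clusters=5):
    """X: (n_samples, n_features); returns an (n, n) matrix of pairwise distances."""
    Xc = X - X.mean(axis=0)
    # cluster the coordinates by the similarity of their profiles across samples
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(Xc.T)
    # illustrative choice: keep only within-cluster covariances (block structure)
    S = np.cov(Xc, rowvar=False)
    S_block = np.where(labels[:, None] == labels[None, :], S, 0.0)
    S_inv = np.linalg.pinv(S_block)
    # pairwise Mahalanobis distances: d(i, j)^2 = (x_i - x_j)^T S_inv (x_i - x_j)
    diff = Xc[:, None, :] - Xc[None, :, :]
    d2 = np.einsum('ijk,kl,ijl->ij', diff, S_inv, diff)
    return np.sqrt(np.clip(d2, 0.0, None))

# toy usage
rng = np.random.default_rng(0)
D = clustered_mahalanobis(rng.standard_normal((50, 20)))
print(D.shape)
```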



2021 ◽  
pp. 1471082X2110410
Author(s):  
Elena Tuzhilina ◽  
Leonardo Tozzi ◽  
Trevor Hastie

Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA), which imposes an ℓ2 penalty on the CCA coefficients, is widely used in applications with high-dimensional data. One limitation of such regularization is that it ignores any data structure, treating all the features equally, which can be ill-suited for some applications. In this article we introduce several approaches to regularizing CCA that take the underlying data structure into account. In particular, the proposed group regularized canonical correlation analysis (GRCCA) is useful when the variables are correlated in groups. We illustrate some computational strategies to avoid excessive computations with regularized CCA in high dimensions. We demonstrate the application of these methods in our motivating application from neuroscience, as well as in a small simulation example.
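A compact numpy sketch of ridge-regularized CCA (the ℓ2-penalized RCCA referred to above), written as a regularized whitening of each block followed by an SVD of the cross-covariance. The grouping extension (GRCCA) and the computational shortcuts discussed in the article are not shown; the function name and penalty parameters are illustrative.

```python
import numpy as np

def rcca(X, Y, lam_x=0.1, lam_y=0.1):
    """Leading canonical correlation and coefficient vectors for X (n x p) and Y (n x q)."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    # ridge-regularized within-set covariances (the l2 penalty on the CCA coefficients)
    Sxx = Xc.T @ Xc / n + lam_x * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + lam_y * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # whiten each block (Tx.T @ Sxx @ Tx = I) and take the SVD of the cross-covariance
    Tx = np.linalg.inv(np.linalg.cholesky(Sxx)).T
    Ty = np.linalg.inv(np.linalg.cholesky(Syy)).T
    U, s, Vt = np.linalg.svd(Tx.T @ Sxy @ Ty)
    a, b = Tx @ U[:, 0], Ty @ Vt[0]      # leading canonical coefficient vectors
    return s[0], a, b

# toy usage
rng = np.random.default_rng(0)
rho, a, b = rcca(rng.standard_normal((100, 30)), rng.standard_normal((100, 20)))
print(round(rho, 3))
```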



Author(s):  
Haoyang Cheng ◽  
Wenquan Cui

Heteroscedasticity often appears in high-dimensional data analysis. In order to achieve a sparse dimension reduction direction for high-dimensional data with heteroscedasticity, we propose a new sparse sufficient dimension reduction method, called Lasso-PQR. From the candidate matrix derived from the principal quantile regression (PQR) method, we construct a new artificial response variable made up from the top eigenvectors of the candidate matrix. We then apply a Lasso regression to obtain sparse dimension reduction directions. For the "large p, small n" case in which p > n, we use principal projection to solve the dimension reduction problem in a lower-dimensional subspace and project back to the original dimension reduction problem. Theoretical properties of the methodology are established. Through comparisons with several existing methods in simulations and a real data analysis, we demonstrate the advantages of our method for high-dimensional data with heteroscedasticity.
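A schematic sketch of the Lasso step as read from the abstract. The PQR candidate matrix M is assumed to be computed elsewhere (its construction is not reproduced here), the artificial response is taken as the projection of the data onto a top eigenvector of M, and the helper name lasso_pqr_directions is illustrative rather than from the article.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_pqr_directions(X, M, n_directions=1, alpha=0.1):
    """X: (n, p) predictors; M: (p, p) PQR candidate matrix (assumed given)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]            # top eigenvectors of the candidate matrix
    directions = []
    for k in range(n_directions):
        y_art = Xc @ eigvecs[:, order[k]]        # artificial response from the k-th eigenvector
        beta = Lasso(alpha=alpha, fit_intercept=False).fit(Xc, y_art).coef_
        directions.append(beta)                  # sparse dimension reduction direction
    return np.array(directions).T
```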



2013 ◽  
Vol 444-445 ◽  
pp. 604-609
Author(s):  
Guang Hui Fu ◽  
Pan Wang

LASSO is a very useful variable selection method for high-dimensional data, but it possesses neither the oracle property [Fan and Li, 2001] nor the grouping effect [Zou and Hastie, 2005]. In this paper, we first review four improved LASSO-type methods which satisfy the oracle property and/or the grouping effect, and then propose two new ones, called WFEN and WFAEN. Their performance on both simulated and real data sets shows that WFEN and WFAEN are competitive with other LASSO-type methods.
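The WFEN and WFAEN estimators are not defined in this summary, so no attempt is made to reproduce them here. As a baseline reference, the following scikit-learn sketch contrasts the LASSO with the elastic net of Zou and Hastie (2005), which does exhibit the grouping effect on correlated features; the data and penalty settings are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n, p = 100, 50
z = rng.standard_normal((n, 1))
# first three columns are nearly identical (a correlated group); the rest are noise
X = np.hstack([z + 0.01 * rng.standard_normal((n, 3)),
               rng.standard_normal((n, p - 3))])
y = X[:, :3].sum(axis=1) + rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
# LASSO tends to select only part of the correlated group, while the elastic net
# spreads similar coefficients across all three members (the grouping effect).
print("lasso:", np.round(lasso.coef_[:3], 2))
print("enet :", np.round(enet.coef_[:3], 2))
```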



Author(s):  
Guangzhu Yu ◽  
Shihuang Shao ◽  
Bin Luo ◽  
Xianhui Zeng

Existing algorithms for high-utility itemset mining are column enumeration based, adopting an Apriori-like candidate set generation-and-test approach, and thus are inadequate for datasets with high dimensions or long patterns. To solve this problem, this paper proposes a hybrid model and a row enumeration-based algorithm, Inter-transaction, to discover high-utility itemsets from two directions: an existing algorithm can be used to seek short high-utility itemsets from the bottom, while Inter-transaction can be used to seek long high-utility itemsets from the top. Inter-transaction makes full use of the characteristic that there are few common items between or among long transactions. By intersecting relevant transactions, the new algorithm can identify long high-utility itemsets without extending short itemsets step by step. In addition, we also develop new pruning strategies and an optimization technique to improve the performance of Inter-transaction.
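The following is a simplified illustration of the intersection idea only; the hybrid model, the pruning strategies and the optimization technique mentioned in the abstract are omitted, and the data layout (each transaction as an item-to-utility mapping) is an assumption of the sketch.

```python
from itertools import combinations

def long_candidates_by_intersection(transactions, min_utility, min_len=3):
    """transactions: list of dicts mapping item -> utility in that transaction."""
    high_utility = {}
    for t1, t2 in combinations(transactions, 2):
        itemset = frozenset(t1) & frozenset(t2)          # intersect two transactions
        if len(itemset) < min_len or itemset in high_utility:
            continue
        # total utility of the itemset over all transactions that contain it
        util = sum(sum(t[i] for i in itemset)
                   for t in transactions if itemset <= set(t))
        if util >= min_utility:
            high_utility[itemset] = util
    return high_utility

# toy example
ts = [{'a': 5, 'b': 3, 'c': 2, 'd': 1},
      {'a': 4, 'b': 2, 'c': 3, 'e': 6},
      {'a': 1, 'b': 1, 'c': 1, 'd': 2}]
print(long_candidates_by_intersection(ts, min_utility=15))
```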



2017 ◽  
Vol 1 (2) ◽  
pp. 118
Author(s):  
Knavoot Jiamwattanapong ◽  
Samruam Chongcharoen

Modern measurement technology has enabled researchers and statisticians to capture high-dimensional data, and classical statistical inferences, such as the renowned Hotelling's T² test, are no longer valid when the dimension of the data equals or exceeds the sample size. Importantly, when correlations among variables in a dataset exist, taking them into account in the analysis provides more accurate conclusions. In this article, we consider the hypothesis testing problem for two mean vectors in high-dimensional data under a normality assumption. A new test is proposed based on the idea of keeping more information from the sample covariances. The asymptotic null distribution of the test statistic is derived. The simulation results show that the proposed test performs well compared with other competing tests and becomes more powerful as the dimension increases for a given sample size. The proposed test is also illustrated with an analysis of DNA microarray data.


