A Two-Sample Test for Mean Vectors in High-Dimensional Data

Modern measurement technology allows researchers and statisticians to capture high-dimensional data, and classical statistical inference procedures, such as the renowned Hotelling’s T² test, are no longer valid when the dimension of the data equals or exceeds the sample size. Importantly, when the variables in a dataset are correlated, accounting for those correlations in the analysis yields more accurate conclusions. In this article, we consider the hypothesis testing problem for two mean vectors in high-dimensional data under an underlying normality assumption. A new test is proposed that retains more information from the sample covariances. The asymptotic null distribution of the test statistic is derived. Simulation results show that the proposed test performs well compared with competing tests and becomes more powerful as the dimension increases for a given sample size. The proposed test is also illustrated with an analysis of DNA microarray data.
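
To illustrate why the classical test breaks down, recall that the two-sample Hotelling statistic is T² = (n1·n2 / (n1 + n2)) (x̄ − ȳ)ᵀ S⁻¹ (x̄ − ȳ), where S is the pooled sample covariance. The Python sketch below (NumPy assumed; the sample sizes, dimension, and seed are arbitrary illustrative choices, not values from the article) shows that S has rank at most n1 + n2 − 2, so it cannot be inverted once the dimension exceeds that bound. It does not implement the proposed test.

```python
import numpy as np


def pooled_covariance(x, y):
    """Pooled sample covariance of two samples (rows are observations)."""
    n1, n2 = x.shape[0], y.shape[0]
    xc = x - x.mean(axis=0)  # center each sample at its own mean
    yc = y - y.mean(axis=0)
    return (xc.T @ xc + yc.T @ yc) / (n1 + n2 - 2)


# Illustrative sizes only: the dimension p is larger than n1 + n2 - 2.
rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n1, n2, p = 10, 12, 50
x = rng.standard_normal((n1, p))
y = rng.standard_normal((n2, p))

s = pooled_covariance(x, y)
rank = np.linalg.matrix_rank(s)

# The p x p matrix S is rank deficient, so S^{-1} in T^2 does not exist.
print(f"dimension p = {p}, rank of pooled covariance = {rank}")
```

Because each centered sample spans at most n − 1 directions, the rank bound holds for any data, which is why high-dimensional tests must replace or regularize S⁻¹.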

Risk of Selection of Irrelevant Features from High-Dimensional Data with Small Sample Size

Springer Proceedings in Mathematics & Statistics - Stochastic Models, Statistics and Their Applications ◽

10.1007/978-3-319-13881-7_44 ◽

2015 ◽

pp. 399-405

Author(s):

Henryk Maciejewski

Keyword(s):

Sample Size ◽

Small Sample Size ◽

High Dimensional Data ◽

Small Sample ◽

High Dimensional ◽

Selection Of

A two-sample test for high-dimensional data with applications to gene-set testing

The Annals of Statistics ◽

10.1214/09-aos716 ◽

2010 ◽

Vol 38 (2) ◽

pp. 808-835 ◽

Cited By ~ 213

Author(s):

Song Xi Chen ◽

Ying-Li Qin

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Gene Set ◽

Sample Test ◽

Gene Set Testing

An Efficient Dimensionality Reduction Approach for Small-sample Size and High-dimensional Data Modeling

Journal of Computers ◽

10.4304/jcp.9.3.576-580 ◽

2014 ◽

Vol 9 (3) ◽

Cited By ~ 5

Author(s):

Xintao Qiu ◽

Dongmei Fu ◽

Zhenduo Fu

Keyword(s):

Dimensionality Reduction ◽

Sample Size ◽

Small Sample Size ◽

High Dimensional Data ◽

Data Modeling ◽

Small Sample ◽

High Dimensional ◽

Reduction Approach

An adaptive spatial-sign-based test for mean vectors of elliptically distributed high-dimensional data

Statistics and Its Interface ◽

10.4310/sii.2019.v12.n1.a9 ◽

2019 ◽

Vol 12 (1) ◽

pp. 93-106

Author(s):

Bu Zhou ◽

Jia Guo ◽

Jianwei Chen ◽

Jin-Ting Zhang

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Mean Vectors

Corrigendum to “A two sample test in high dimensional data” [J. Multivariate Anal. 114 (2013) 349–358]

Journal of Multivariate Analysis ◽

10.1016/j.jmva.2013.04.016 ◽

2013 ◽

Vol 119 ◽

pp. 209

Author(s):

Muni S. Srivastava ◽

Shota Katayama ◽

Yutaka Kano

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Sample Test

Soft Subspace Clustering for High-Dimensional Data

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch276 ◽

2011 ◽

pp. 1810-1814

Author(s):

Liping Jing ◽

Michael K. Ng ◽

Joshua Zhexue Huang

Keyword(s):

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional ◽

Special Treatment ◽

Clustering Methods ◽

Real World Data ◽

Text Data ◽

Data Set ◽

DNA Microarray Data ◽

Text Document

High dimensional data is a phenomenon in real-world data mining applications. Text data is a typical example. In text mining, a text document is viewed as a vector of terms whose dimension is equal to the total number of unique terms in a data set, which is usually in the thousands. High dimensional data occurs in business as well. In retail, for example, to manage supplier relationships effectively, suppliers are often categorized according to their business behaviors (Zhang, Huang, Qian, Xu, & Jing, 2006). The supplier’s behavior data is high dimensional: it contains thousands of attributes describing the supplier’s behaviors, including product items, ordered amounts, order frequencies, product quality and so forth. One more example is DNA microarray data. Clustering high-dimensional data requires special treatment (Swanson, 1990; Jain, Murty, & Flynn, 1999; Cai, He, & Han, 2005; Kontaki, Papadopoulos, & Manolopoulos, 2007), although various methods for clustering are available (Jain & Dubes, 1988). One type of clustering method for high dimensional data is referred to as subspace clustering, which aims at finding clusters in subspaces instead of the entire data space. In subspace clustering, each cluster is a set of objects identified by a subset of dimensions, and different clusters are represented in different subsets of dimensions. Soft subspace clustering considers that different dimensions make different contributions to the identification of objects in a cluster. It represents the importance of a dimension as a weight that can be treated as the degree to which the dimension contributes to the cluster. Soft subspace clustering can find the cluster memberships of objects and identify the subspace of each cluster in the same clustering process.
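
The role of per-cluster dimension weights can be sketched with a single assignment step in Python. This is a minimal illustration under stated assumptions, not the algorithm from the encyclopedia entry: the weighted squared Euclidean distance, the fixed weights, and the names `weighted_sq_distance` and `assign` are all hypothetical choices made for the example, and a full method would also update the centers and weights iteratively.

```python
def weighted_sq_distance(point, center, weights):
    """Squared distance in which each dimension is scaled by its weight."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(point, center, weights))


def assign(points, centers, weights):
    """Assign each point to the cluster with the smallest weighted distance.

    weights[k] is the weight vector of cluster k (hypothetical, fixed here).
    """
    labels = []
    for pt in points:
        dists = [weighted_sq_distance(pt, c, w) for c, w in zip(centers, weights)]
        labels.append(dists.index(min(dists)))
    return labels


# Toy data: cluster 0 is identified mainly by dimension 0,
# cluster 1 mainly by dimension 1.
points = [(0.1, 3.0), (2.0, 5.1)]
centers = [(0.0, 0.0), (1.0, 5.0)]
soft_weights = [(0.95, 0.05), (0.05, 0.95)]
uniform_weights = [(0.5, 0.5), (0.5, 0.5)]  # equivalent to ignoring subspaces

soft_labels = assign(points, centers, soft_weights)
uniform_labels = assign(points, centers, uniform_weights)
print(soft_labels, uniform_labels)
```

In this toy setup the first point matches cluster 0 closely in the dimension that cluster weights most heavily, so the weighted assignment places it there even though the uniformly weighted distance prefers cluster 1.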

A robust Hotelling test statistic for one sample case in high dimensional data

Communications in Statistics - Theory and Methods ◽

10.1080/03610926.2021.1996606 ◽

2021 ◽

pp. 1-15

Author(s):

Hasan Bulut

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Test Statistic ◽

Sample Case

Sample Size Considerations of Prediction-Validation Methods in High-Dimensional Data for Survival Outcomes

Genetic Epidemiology ◽

10.1002/gepi.21721 ◽

2013 ◽

Vol 37 (3) ◽

pp. 276-282 ◽

Cited By ~ 22

Author(s):

Herbert Pang ◽

Sin-Ho Jung

Keyword(s):

Sample Size ◽

High Dimensional Data ◽

High Dimensional ◽

Survival Outcomes ◽

Validation Methods

Robust two-sample test of high-dimensional mean vectors under dependence

Journal of Multivariate Analysis ◽

10.1016/j.jmva.2018.09.013 ◽

2019 ◽

Vol 169 ◽

pp. 312-329

Author(s):

Wei Wang ◽

Nan Lin ◽

Xiang Tang

Keyword(s):

High Dimensional ◽

Sample Test ◽

Mean Vectors

Normality Testing of High-Dimensional Data Based on Principle Component and Jarque–Bera Statistics

Stats ◽

10.3390/stats4010016 ◽

2021 ◽

Vol 4 (1) ◽

pp. 216-227

Author(s):

Yanan Song ◽

Xuejing Zhao

Keyword(s):

High Dimensional Data ◽

Real Data ◽

High Dimensional ◽

Empirical Power ◽

Test Statistic ◽

High Dimensions ◽

Principle Component ◽

Orthogonal Space ◽

Combined Test ◽

Reduced Data

The testing of high-dimensional normality is an important issue and has been intensively studied in the literature. It depends on the variance–covariance matrix of the sample, and numerous methods have been proposed to reduce its complexity. Principal component analysis (PCA) has been widely used in high dimensions, since it can project high-dimensional data into a lower-dimensional orthogonal space. The normality of the reduced data can then be evaluated by Jarque–Bera (JB) statistics in each principal direction. We propose a combined test statistic, the summation of one-way JB statistics based on the independence of the principal directions, to test the multivariate normality of data in high dimensions. The performance of the proposed method is illustrated by its empirical power on simulated normal and non-normal data. Two real data examples show the validity of our proposed method.
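
The idea of summing JB statistics over principal directions can be sketched in Python as follows. This is a rough illustration under assumptions, not the authors' procedure: NumPy is assumed, the number of retained components `k`, the SVD-based PCA, the simulated data, and the function names are hypothetical choices, and the abstract does not say how the summed statistic is calibrated. The only formula used is the standard one, JB = n/6 · (S² + (K − 3)²/4), where S is the sample skewness and K the sample kurtosis.

```python
import numpy as np


def jarque_bera(v):
    """Standard Jarque-Bera statistic of a one-dimensional sample."""
    n = v.size
    c = v - v.mean()
    m2 = np.mean(c ** 2)
    skew = np.mean(c ** 3) / m2 ** 1.5
    kurt = np.mean(c ** 4) / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)


def combined_jb(data, k):
    """Sum of JB statistics of the scores on the first k principal directions."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:k].T
    return sum(jarque_bera(scores[:, j]) for j in range(k))


# Illustrative data only (arbitrary seed and sizes, with p > n).
rng = np.random.default_rng(1)
n, p, k = 100, 200, 3
normal_data = rng.standard_normal((n, p))
# Heavy-tailed rows: each normal row is rescaled by a random factor,
# giving a multivariate t distribution with 3 degrees of freedom.
scale = np.sqrt(3.0 / rng.chisquare(3, size=(n, 1)))
heavy_data = rng.standard_normal((n, p)) * scale

stat_normal = combined_jb(normal_data, k)
stat_heavy = combined_jb(heavy_data, k)
print(stat_normal, stat_heavy)
```

A single JB statistic is asymptotically chi-square with 2 degrees of freedom under normality; large values of the summed statistic therefore point away from normality, with the exact rejection threshold left to the calibration described in the paper itself.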
