A Two-Sample Test for Mean Vectors in High-Dimensional Data

2017 ◽  
Vol 1 (2) ◽  
pp. 118
Author(s):  
Knavoot Jiamwattanapong ◽  
Samruam Chongcharoen

Modern measurement technology has enabled researchers and statisticians to capture high-dimensional data, and classical statistical inferences, such as the renowned Hotelling’s T² test, are no longer valid when the dimension of the data equals or exceeds the sample size. Importantly, when correlations among variables in a dataset exist, taking them into account in the analysis would provide more accurate conclusions. In this article, we consider the hypothesis testing problem for two mean vectors in high-dimensional data under an underlying normality assumption. A new test is proposed based on the idea of keeping more information from the sample covariances. The asymptotic null distribution of the test statistic is derived. The simulation results show that the proposed test performs well compared with other competing tests and becomes more powerful as the dimension increases for a given sample size. The proposed test is also illustrated with an analysis of DNA microarray data.
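To illustrate why the classical test breaks down, the sketch below checks the rank of the pooled sample covariance matrix when the dimension exceeds the total degrees of freedom. Hotelling’s T² needs the inverse of that matrix, which does not exist once it is rank-deficient. The sample sizes, the dimension, and the helper name `pooled_covariance` are illustrative assumptions, not taken from the article, and the proposed test itself is not reproduced here.

```python
import numpy as np


def pooled_covariance(x, y):
    """Pooled sample covariance of two samples (rows are observations)."""
    n1, n2 = x.shape[0], y.shape[0]
    s1 = np.cov(x, rowvar=False)
    s2 = np.cov(y, rowvar=False)
    return ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)


# Illustrative sizes (assumed): dimension p larger than n1 + n2 - 2.
rng = np.random.default_rng(0)
n1, n2, p = 10, 12, 50
x = rng.standard_normal((n1, p))
y = rng.standard_normal((n2, p))

s_pooled = pooled_covariance(x, y)
rank = np.linalg.matrix_rank(s_pooled)

# The rank cannot exceed the total degrees of freedom n1 + n2 - 2,
# so the p x p matrix is singular and cannot be inverted for T^2.
print(f"dimension p = {p}, rank of pooled covariance = {rank}")
print("invertible:", rank == p)
```

Because the pooled covariance is built from only n1 + n2 - 2 independent deviations, its rank is bounded by that number regardless of how the data are distributed.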

2019 ◽  
Vol 12 (1) ◽  
pp. 93-106
Author(s):  
Bu Zhou ◽  
Jia Guo ◽  
Jianwei Chen ◽  
Jin-Ting Zhang

Author(s):  
Liping Jing ◽  
Michael K. Ng ◽  
Joshua Zhexue Huang

High-dimensional data is common in real-world data mining applications. Text data is a typical example. In text mining, a text document is viewed as a vector of terms whose dimension equals the total number of unique terms in the data set, which is usually in the thousands. High-dimensional data occurs in business as well. In retail, for example, to manage supplier relationships effectively, suppliers are often categorized according to their business behaviors (Zhang, Huang, Qian, Xu, & Jing, 2006). The supplier behavior data is high-dimensional, containing thousands of attributes that describe the suppliers’ behaviors, including product items, ordered amounts, order frequencies, product quality and so forth. One more example is DNA microarray data. Clustering high-dimensional data requires special treatment (Swanson, 1990; Jain, Murty, & Flynn, 1999; Cai, He, & Han, 2005; Kontaki, Papadopoulos, & Manolopoulos, 2007), although various methods for clustering are available (Jain & Dubes, 1988). One type of clustering method for high-dimensional data is referred to as subspace clustering, which aims to find clusters in subspaces instead of the entire data space. In subspace clustering, each cluster is a set of objects identified by a subset of dimensions, and different clusters are represented in different subsets of dimensions. Soft subspace clustering considers that different dimensions make different contributions to the identification of objects in a cluster. It represents the importance of a dimension as a weight that can be treated as the degree to which the dimension contributes to the cluster. Soft subspace clustering can find the cluster memberships of objects and identify the subspace of each cluster in the same clustering process.
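The idea of per-cluster dimension weights can be sketched as follows. The abstract does not state a weighting formula, so the rule used here is an assumption chosen for illustration: within each cluster, a dimension's weight is proportional to exp(-dispersion / gamma), so tightly grouped dimensions receive larger weights. The function name `soft_subspace_weights`, the parameter `gamma`, and the toy data are all hypothetical, and the cluster labels are taken as given rather than learned.

```python
import numpy as np


def soft_subspace_weights(x, labels, n_clusters, gamma=1.0):
    """Return an (n_clusters, n_dims) weight matrix whose rows sum to 1.

    Assumed rule: weight is proportional to exp(-dispersion / gamma), where
    dispersion is the within-cluster sum of squared deviations along a
    dimension. Every cluster is assumed to be non-empty.
    """
    n_dims = x.shape[1]
    weights = np.empty((n_clusters, n_dims))
    for k in range(n_clusters):
        members = x[labels == k]
        dispersion = ((members - members.mean(axis=0)) ** 2).sum(axis=0)
        logits = -dispersion / gamma
        logits -= logits.max()  # shift for numerical stability
        w = np.exp(logits)
        weights[k] = w / w.sum()
    return weights


# Toy data (assumed): cluster 0 is tight along dimension 0,
# cluster 1 is tight along dimension 1.
x = np.array([
    [0.0, 0.0], [0.1, 5.0], [0.0, 10.0],
    [0.0, 20.0], [5.0, 20.1], [10.0, 20.0],
])
labels = np.array([0, 0, 0, 1, 1, 1])

weights = soft_subspace_weights(x, labels, n_clusters=2)
print(weights)
```

In a full soft subspace clustering procedure, a weight update of this kind would alternate with reassigning objects to clusters using weighted distances, so that memberships and subspaces are found in the same process.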


2019 ◽  
Vol 169 ◽  
pp. 312-329
Author(s):  
Wei Wang ◽  
Nan Lin ◽  
Xiang Tang

Stats ◽  
2021 ◽  
Vol 4 (1) ◽  
pp. 216-227
Author(s):  
Yanan Song ◽  
Xuejing Zhao

The testing of high-dimensional normality is an important issue and has been intensively studied in the literature. It depends on the variance–covariance matrix of the sample, and numerous methods have been proposed to reduce its complexity. Principal component analysis (PCA) has been widely used in high dimensions, since it can project high-dimensional data into a lower-dimensional orthogonal space. The normality of the reduced data can then be evaluated by Jarque–Bera (JB) statistics in each principal direction. We propose a combined test statistic, the summation of one-way JB statistics based on the independence of the principal directions, to test the multivariate normality of data in high dimensions. The performance of the proposed method is illustrated by the empirical power on simulated normal and non-normal data. Two real data examples show the validity of our proposed method.
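A minimal sketch of the combined statistic described above: project the centered data onto its leading principal directions, compute the Jarque–Bera statistic for each set of scores, and sum the results. The function names, the number of retained components, and the simulated data are assumptions for illustration. The abstract does not give the reference distribution, so none is computed here; if the per-component statistics are treated as independent, a chi-square with two degrees of freedom per component would be a natural candidate.

```python
import numpy as np


def jarque_bera(v):
    """Jarque-Bera statistic n/6 * (S^2 + (K - 3)^2 / 4) for a 1-D sample."""
    n = v.size
    c = v - v.mean()
    m2 = np.mean(c ** 2)
    skew = np.mean(c ** 3) / m2 ** 1.5
    kurt = np.mean(c ** 4) / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)


def summed_jb_on_pcs(x, n_components):
    """Sum of JB statistics over the leading principal-component scores."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    per_component = np.array(
        [jarque_bera(scores[:, j]) for j in range(n_components)]
    )
    return per_component.sum(), per_component


# Simulated normal data (assumed sizes): 200 observations, 30 variables.
rng = np.random.default_rng(1)
x = rng.standard_normal((200, 30))
total, per_component = summed_jb_on_pcs(x, n_components=5)
print("per-component JB:", per_component)
print("summed JB:", total)
```

Large values of the summed statistic would point away from normality along at least one of the retained directions.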

