Statistical analysis of very high-dimensional data sets of hierarchically structured binary variables with missing data: An application to Marine Corps readiness evaluations

1985 ◽  
Vol 32 (3) ◽  
pp. 467-490 ◽  
Author(s):  
S. Zacks ◽  
W. H. Marlow ◽  
S. S. Brier

2018 ◽  
Vol 8 (2) ◽  
pp. 377-406
Author(s):  
Almog Lahav ◽  
Ronen Talmon ◽  
Yuval Kluger

Abstract A fundamental question in data analysis, machine learning and signal processing is how to compare data points. The choice of the distance metric is particularly challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. We also applied our method to real gene expression data for lung adenocarcinomas (lung cancer). Using the proposed metric, we found a partition of subjects into risk groups with good separation between their Kaplan–Meier survival plots.
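
As a rough illustration of the distance this abstract builds on, the sketch below computes plain pairwise Mahalanobis distances from the ordinary sample covariance. It is not the authors' construction, which estimates the covariance from clusters of coordinates. The function name `mahalanobis_distances` and the `ridge` regularizer are assumptions introduced here for illustration.

```python
import numpy as np

def mahalanobis_distances(X, ridge=1e-6):
    """Pairwise Mahalanobis distances between the rows of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    # Sample covariance of the coordinates. The ridge term is an assumption
    # added here to keep the matrix invertible when features are collinear.
    cov = np.cov(X, rowvar=False) + ridge * np.eye(X.shape[1])
    cov_inv = np.linalg.inv(cov)
    # diff[i, j] is the vector x_i - x_j.
    diff = X[:, None, :] - X[None, :, :]
    # Quadratic form (x_i - x_j)^T cov^{-1} (x_i - x_j) for every pair.
    sq = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    # Clip tiny negative values caused by floating-point error.
    return np.sqrt(np.maximum(sq, 0.0))
```

With `ridge=0.0` these distances are unchanged by any invertible linear map of the coordinates, which is the linear counterpart of the recovery property the abstract describes for nonlinear transformations.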


Author(s):  
Nguyen Thanh Tung ◽  
Joshua Zhexue Huang ◽  
Imran Khan ◽  
Mark Junjie Li ◽  
Graham Williams

2020 ◽  
Vol 25 (4) ◽  
pp. 1376-1391
Author(s):  
Liangfu Lu ◽  
Wenbo Wang ◽  
Zhiyuan Tan

Abstract The Parallel Coordinates Plot (PCP) is a popular technique for exploring high-dimensional data, and researchers often use it to analyze and mine data. However, as data volumes grow, visual clutter and loss of clarity become two of the main challenges in parallel coordinates plots. Although the Arc Coordinates Plot (ACP) is a popular approach to these challenges, few optimizations and improvements have been made to it. In this paper, we make three main contributions to state-of-the-art PCP methods. The first improves the visual method itself. The other two improve perceptual scalability when the scale or dimensionality of the data becomes large in some mobile and wireless practical applications. 1) We present an improved visualization method based on ACP, termed the double arc coordinates plot (DACP). It not only reduces the visual clutter in ACP, but also uses a dimension-based bundling method with further optimization to deal with the issues of the conventional PCP. 2) To reduce the clutter caused by the order of the axes and reveal patterns hidden in the data sets, we propose our first dimensional reordering method, a contribution-based method in DACP built on the singular value decomposition (SVD) algorithm. The approach computes an importance score for each attribute (dimension) of the data using SVD and visualizes the dimensions from left to right in DACP according to that score. 3) We also propose a similarity-based reordering method, which combines a nonlinear correlation coefficient with the SVD algorithm. To measure the correlation between two dimensions and explain how they interact, this method relies on nonlinear correlation information measurements: we mainly use mutual information to calculate the partial similarity of dimensions in high-dimensional data visualization, and SVD to measure the global data. Lastly, we use five case scenarios to evaluate the effectiveness of DACP. The results show that our approaches not only visualize multivariate data sets well, but also effectively alleviate the visual clutter of the conventional PCP, giving users a better visual experience.
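
The contribution-based reordering can be sketched as follows. The abstract does not give the exact scoring formula, so the score used here (squared loadings on the leading singular vectors, weighted by squared singular values) is an assumption, as are the function name `svd_dimension_order` and the `n_components` parameter.

```python
import numpy as np

def svd_dimension_order(X, n_components=2):
    """Rank the columns (dimensions) of X by an SVD-based importance score."""
    X = np.asarray(X, dtype=float)
    # Standardize so no dimension dominates purely through its units.
    # Assumes no column is constant (nonzero standard deviation).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    k = min(n_components, len(s))
    # Hypothetical score: squared loadings on the top-k right singular
    # vectors, weighted by the corresponding squared singular values.
    scores = (s[:k, None] ** 2 * vt[:k] ** 2).sum(axis=0)
    # Highest-scoring dimension first, i.e. leftmost axis in the plot.
    order = np.argsort(-scores)
    return order, scores
```

The returned `order` would then give the left-to-right axis arrangement; the similarity-based variant described in the abstract would additionally need a mutual information estimate between pairs of dimensions, which this sketch leaves out.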


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Michele Allegra ◽  
Elena Facco ◽  
Francesco Denti ◽  
Alessandro Laio ◽  
Antonietta Mira

Abstract One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
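
To make the notion of intrinsic dimension concrete, here is a minimal sketch of a single global ID estimate based on the ratio of each point's second to first nearest-neighbour distance. The abstract does not name its estimator, so this two-nearest-neighbour ratio approach is a stand-in chosen for illustration, and it returns one value for the whole data set rather than the local, segmented IDs the abstract describes. The function name is hypothetical.

```python
import numpy as np

def two_nn_intrinsic_dimension(X):
    """Global intrinsic dimension estimate from nearest-neighbour distance ratios."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (O(n^2) memory, fine for a small sketch).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    dist = np.sqrt(sq)
    np.fill_diagonal(dist, np.inf)  # ignore self-distances
    # First and second nearest-neighbour distances for every point.
    nearest = np.sort(dist, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    # Assumes no duplicate points, so r1 > 0.
    mu = r2 / r1
    # If mu follows a Pareto law with exponent d, the maximum-likelihood
    # estimate of d is n / sum(log(mu)).
    return len(mu) / np.log(mu).sum()
```

For points sampled on a two-dimensional plane embedded in a higher-dimensional space, the estimate should come out close to 2 regardless of the embedding dimension.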


2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Douglas M. Hawkins ◽  
Edgard M. Maboudou-Tchao

Classification and prediction problems using spectral data lead to high-dimensional data sets. Spectral data are, however, different from most other high-dimensional data sets in that information usually varies smoothly with wavelength, suggesting that fitted models should also vary smoothly with wavelength. Functional data analysis, widely used in the analysis of spectral data, meets this objective by changing perspective from the raw spectra to approximations using smooth basis functions. This paper explores linear regression and linear discriminant analysis fitted directly to the spectral data, imposing penalties on the values and roughness of the fitted coefficients, and shows by example that this can lead to better fits than existing standard methodologies.
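
A minimal sketch of the regression case, assuming the two penalties are a ridge term on the coefficient values and a squared second-difference term on their roughness, with no intercept. The abstract does not state the exact penalty forms, so these choices, the function name, and the parameters `lam_value` and `lam_rough` are assumptions.

```python
import numpy as np

def smooth_penalized_regression(X, y, lam_value=1.0, lam_rough=1.0):
    """Least squares with penalties on coefficient size and roughness.

    Minimizes ||y - X b||^2 + lam_value * ||b||^2 + lam_rough * ||D b||^2,
    where D takes second differences of b along the wavelength axis.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    # Second-difference operator: (D b)[j] = b[j] - 2 b[j+1] + b[j+2].
    D = np.diff(np.eye(p), n=2, axis=0)
    penalty = lam_value * np.eye(p) + lam_rough * D.T @ D
    # Closed-form solution of the penalized normal equations.
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)
```

Increasing `lam_rough` pulls neighbouring coefficients toward a locally linear profile, which matches the expectation that the fitted model varies smoothly with wavelength; both penalty weights would normally be chosen by cross-validation.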

