scholarly journals Using Penalized Regression with Parallel Coordinates for Visualization of Significance in High Dimensional Data

Author(s):  
Shengwen Wang ◽  
Yi Yang ◽  
Jih-Sheng Chang ◽  
Fang-Pang Lin
2020 ◽  
Vol 25 (4) ◽  
pp. 1376-1391
Author(s):  
Liangfu Lu ◽  
Wenbo Wang ◽  
Zhiyuan Tan

AbstractThe Parallel Coordinates Plot (PCP) is a popular technique for the exploration of high-dimensional data. In many cases, researchers apply it as an effective method to analyze and mine data. However, when today’s data volume is getting larger, visual clutter and data clarity become two of the main challenges in parallel coordinates plot. Although Arc Coordinates Plot (ACP) is a popular approach to address these challenges, few optimization and improvement have been made on it. In this paper, we do three main contributions on the state-of-the-art PCP methods. One approach is the improvement of visual method itself. The other two approaches are mainly on the improvement of perceptual scalability when the scale or the dimensions of the data turn to be large in some mobile and wireless practical applications. 1) We present an improved visualization method based on ACP, termed as double arc coordinates plot (DACP). It not only reduces the visual clutter in ACP, but use a dimension-based bundling method with further optimization to deals with the issues of the conventional parallel coordinates plot (PCP). 2)To reduce the clutter caused by the order of the axes and reveal patterns that hidden in the data sets, we propose our first dimensional reordering method, a contribution-based method in DACP, which is based on the singular value decomposition (SVD) algorithm. The approach computes the importance score of attributes (dimensions) of the data using SVD and visualize the dimensions from left to right in DACP according the score in SVD. 3) Moreover, a similarity-based method, which is based on the combination of nonlinear correlation coefficient and SVD algorithm, is proposed as well in the paper. To measure the correlation between two dimensions and explains how the two dimensions interact with each other, we propose a reordering method based on non-linear correlation information measurements. We mainly use mutual information to calculate the partial similarity of dimensions in high-dimensional data visualization, and SVD is used to measure global data. Lastly, we use five case scenarios to evaluate the effectiveness of DACP, and the results show that our approaches not only do well in visualizing multivariate dataset, but also effectively alleviate the visual clutter in the conventional PCP, which bring users a better visual experience.


2020 ◽  
Vol 17 (2) ◽  
pp. 0550
Author(s):  
Ali Hameed Yousef ◽  
Omar Abdulmohsin Ali

         The issue of penalized regression model has received considerable critical attention to variable selection. It plays an essential role in dealing with high dimensional data. Arctangent denoted by the Atan penalty has been used in both estimation and variable selection as an efficient method recently. However, the Atan penalty is very sensitive to outliers in response to variables or heavy-tailed error distribution. While the least absolute deviation is a good method to get robustness in regression estimation. The specific objective of this research is to propose a robust Atan estimator from combining these two ideas at once. Simulation experiments and real data applications show that the proposed LAD-Atan estimator has superior performance compared with other estimators.  


Author(s):  
Stacey Winham ◽  
Chong Wang ◽  
Alison A Motsinger-Reif

Recently, the amount of high-dimensional data has exploded, creating new analytical challenges for human genetics. Furthermore, much evidence suggests that common complex diseases may be due to complex etiologies such as gene-gene interactions, which are difficult to identify in high-dimensional data using traditional statistical approaches. Data-mining approaches are gaining popularity for variable selection in association studies, and one of the most commonly used methods to evaluate potential gene-gene interactions is Multifactor Dimensionality Reduction (MDR). Additionally, a number of penalized regression techniques, such as Lasso, are gaining popularity within the statistical community and are now being applied to association studies, including extensions for interactions. In this study, we compare the performance of MDR, the traditional lasso with L1 penalty (TL1), and the group lasso for categorical data with group-wise L1 penalty (GL1) to detect gene-gene interactions through a broad range of simulations.We find that each method has both advantages and disadvantages, and relative performance is context dependent. TL1 frequently over-fits, identifying false positive as well as true positive loci. MDR has higher power for epistatic models that exhibit independent main effects; for both Lasso methods, main effects tend to dominate. For purely epistatic models, GL1 has the best performance for lower minor allele frequencies, but MDR performs best for higher frequencies. These results provide guidance of when each approach might be best suited for detecting and characterizing interactions with different mechanisms.


Sign in / Sign up

Export Citation Format

Share Document