A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis

Biostatistics ◽  
2009 ◽  
Vol 10 (3) ◽  
pp. 515-534 ◽  
Author(s):  
D. M. Witten ◽  
R. Tibshirani ◽  
T. Hastie
1997 ◽  
Vol 25 ◽  
pp. 347-352 ◽  
Author(s):  
Chris Derksen ◽  
Kevin Misurak ◽  
Ellsworth Ledrew ◽  
Joe Piwowar ◽  
Barry Goodison

The stochastic relationships between terrestrial snow water equivalent (SWE) and measures of the atmospheric circulation were investigated for the Canadian Prairies and the American Great Plains for the winter of 1988. Snow-cover extent, derived from EASE-grid SSM/I satellite data, and gridded atmospheric data from the National Meteorological Center were averaged at five-day intervals. Principal components analysis (PCA) was performed for the time series of SSM/I snow-cover imagery as well as for 700 mb geopotential height and temperature, 500 mb height and 700–500 mb thickness. Canonical correlation analysis of the derived principal component weights was used to identify relationships between atmospheric variables and SWE. Results of the PCA indicate that a high degree of variance in upper air variables (>75%) can be explained by the first three principal components, while the first three SWE components account for over 90% of the variance in the original data. Results of the canonical correlation analysis show positive relationships between snow-cover accumulation and a meridional pressure distribution pattern, while snow ablation is linked to a zonal atmospheric pressure pattern.
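The "variance explained by the first three principal components" quoted above can be computed from the singular values of the centered data matrix. The following is a minimal sketch on synthetic data, not the authors' processing chain: the array shapes, the three planted patterns and the function name `explained_variance_ratio` are assumptions made for illustration.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component.

    X has one row per time step and one column per grid cell (assumed layout).
    """
    Xc = X - X.mean(axis=0)                   # center every column
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    var = s ** 2                              # proportional to component variances
    return var / var.sum()

# Synthetic stand-in for a (time x grid-cell) field: three strong spatial
# patterns plus weak noise. These numbers are illustrative, not from the paper.
rng = np.random.default_rng(0)
n_time, n_cells = 30, 200
patterns = rng.normal(size=(3, n_cells))
scores = rng.normal(size=(n_time, 3)) * np.array([5.0, 3.0, 2.0])
X = scores @ patterns + 0.1 * rng.normal(size=(n_time, n_cells))

ratio = explained_variance_ratio(X)
top3 = ratio[:3].sum()
print(f"first three components explain {top3:.1%} of the variance")
```

Because the synthetic field is built from three dominant patterns, nearly all of its variance falls on the first three components; real satellite or reanalysis fields would generally spread variance more widely.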


2019 ◽  
Author(s):  
Theodoulos Rodosthenous ◽  
Vahid Shahrezaei ◽  
Marina Evangelou

Abstract Motivation Recent developments in technology have enabled researchers to collect multiple OMICS datasets for the same individuals. The conventional approach for understanding the relationships between the collected datasets and the complex trait of interest would be through the analysis of each OMIC dataset separately from the rest, or to test for associations between the OMICS datasets. In this work we show that integrating multiple OMICS datasets together, instead of analysing them separately, improves our understanding of their in-between relationships as well as the predictive accuracy for the tested trait. As OMICS datasets are heterogeneous and high-dimensional (p >> n), integrating them can be done through Sparse Canonical Correlation Analysis (sCCA), which penalises the canonical variables to produce sparse latent variables while achieving maximal correlation between the datasets. Over the last years, a number of approaches for implementing sCCA have been proposed; they differ in their objective functions, in the iterative algorithm used to obtain the sparse latent variables, and in the assumptions they make about the original datasets. Results Through a comparative study we have explored the performance of the conventional CCA proposed by Parkhomenko et al. [2009], penalised matrix decomposition CCA proposed by Witten and Tibshirani [2009] and its extension proposed by Suo et al. [2017]. The aforementioned methods were modified to allow for different penalty functions. Although sCCA is an unsupervised learning approach for understanding the in-between relationships, we have recast the problem as a supervised learning one and investigated how the computed latent variables can be used for predicting complex traits. The approaches were extended to allow for multiple (more than two) datasets where the trait was included as one of the input datasets. Both ways have shown improvement over conventional predictive models that include one or multiple datasets.


2018 ◽  
Author(s):  
Brielin C Brown ◽  
Nicolas L. Bray ◽  
Lior Pachter

Abstract Population structure in genotype data has been extensively studied, and is revealed by looking at the principal components of the genotype matrix. However, no similar analysis of population structure in gene expression data has been conducted, in part because a naïve principal components analysis of the gene expression matrix does not cluster by population. We identify a linear projection that reveals population structure in gene expression data. Our approach relies on the coupling of the principal components of genotype to the principal components of gene expression via canonical correlation analysis. Furthermore, we analyze the variance of each gene within the projection matrix to determine which genes significantly influence the projection. We identify thousands of significant genes, and show that a number of the top genes have been implicated in diseases that disproportionately impact African Americans. Author Summary High dimensional, multi-modal genomics datasets are becoming increasingly common, which warrants investigation into analysis techniques that can reveal structure in the data without over-fitting. Here, we show that the coupling of principal component analysis to canonical correlation analysis offers an efficient approach to exploratory analysis of this kind of data. We apply this method to the GEUVADIS dataset of genotype and gene expression values of European and Yoruban individuals, finding as-of-yet unstudied population structure in the gene expression values. Moreover, many of the top genes identified by our method have been previously implicated in diseases that disproportionately impact African Americans.
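The coupling of principal components from two data matrices through canonical correlation analysis can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: the matrices `G` and `E` are synthetic stand-ins sharing one latent factor, the choice of five components per block is arbitrary, and the helper names `pca_scores` and `cca` are hypothetical. CCA is computed here with the standard QR-plus-SVD formulation.

```python
import numpy as np

def pca_scores(X, k):
    """Scores of the first k principal components (rows = samples)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]

def cca(A, B):
    """Canonical correlations and unit-norm canonical variates via QR + SVD."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    Qa, _ = np.linalg.qr(A)            # orthonormal basis of each block
    Qb, _ = np.linalg.qr(B)
    U, rho, Vt = np.linalg.svd(Qa.T @ Qb)
    return rho, Qa @ U, Qb @ Vt.T      # correlations, variates of A, variates of B

# Synthetic stand-ins: both blocks carry the same latent sample-level factor z.
rng = np.random.default_rng(1)
n = 100
z = rng.normal(size=n)
G = np.outer(z, rng.normal(size=50)) + 0.5 * rng.normal(size=(n, 50))
E = np.outer(z, rng.normal(size=40)) + 0.5 * rng.normal(size=(n, 40))

g_pcs = pca_scores(G, 5)               # reduce each block first
e_pcs = pca_scores(E, 5)
rho, g_var, e_var = cca(g_pcs, e_pcs)  # then couple the reduced blocks
print("leading canonical correlation:", round(float(rho[0]), 3))
```

Reducing each block to a few components before CCA keeps the number of variables well below the number of samples, which is what makes the canonical correlations estimable when the raw matrices have far more columns than rows.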


2020 ◽  
Vol 36 (17) ◽  
pp. 4616-4625 ◽  
Author(s):  
Theodoulos Rodosthenous ◽  
Vahid Shahrezaei ◽  
Marina Evangelou

Abstract Motivation Recent developments in technology have enabled researchers to collect multiple OMICS datasets for the same individuals. The conventional approach for understanding the relationships between the collected datasets and the complex trait of interest would be through the analysis of each OMIC dataset separately from the rest, or to test for associations between the OMICS datasets. In this work we show that integrating multiple OMICS datasets together, instead of analysing them separately, improves our understanding of their in-between relationships as well as the predictive accuracy for the tested trait. Several approaches have been proposed for the integration of heterogeneous and high-dimensional (p≫n) data, such as OMICS. The sparse variant of canonical correlation analysis (CCA) is a promising one that penalizes the canonical variables to produce sparse latent variables while achieving maximal correlation between the datasets. Over the last years, a number of approaches for implementing sparse CCA (sCCA) have been proposed; they differ in their objective functions, in the iterative algorithm used to obtain the sparse latent variables, and in the assumptions they make about the original datasets. Results Through a comparative study we have explored the performance of the conventional CCA proposed by Parkhomenko et al., penalized matrix decomposition CCA proposed by Witten and Tibshirani and its extension proposed by Suo et al. The aforementioned methods were modified to allow for different penalty functions. Although sCCA is an unsupervised learning approach for understanding the in-between relationships, we have recast the problem as a supervised learning one and investigated how the computed latent variables can be used for predicting complex traits. The approaches were extended to allow for multiple (more than two) datasets where the trait was included as one of the input datasets. Both ways have shown improvement over conventional predictive models that include one or multiple datasets. Availability and implementation https://github.com/theorod93/sCCA. Supplementary information Supplementary data are available at Bioinformatics online.
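The idea of penalizing canonical weights to obtain sparse latent variables can be illustrated with a rank-one alternating update built on soft-thresholding. The sketch below is a simplification and should not be read as any of the compared implementations: instead of searching for a threshold that meets an L1 bound, it uses a fixed threshold set to a fraction of the largest entry, and the names `sparse_cca_rank1`, `frac_u` and `frac_v` are hypothetical. It also assumes no column of the input is constant.

```python
import numpy as np

def soft_threshold(a, delta):
    """Shrink every entry toward zero by delta; small entries become exactly 0."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def unit(a):
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a

def standardize(X):
    # Assumes every column has nonzero variance.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def sparse_cca_rank1(X, Y, frac_u=0.3, frac_v=0.3, n_iter=100, tol=1e-8):
    """One pair of sparse canonical weight vectors (simplified sketch)."""
    K = standardize(X).T @ standardize(Y)            # cross-product matrix
    v = np.linalg.svd(K, full_matrices=False)[2][0]  # start from leading singular vector
    u = np.zeros(K.shape[0])
    for _ in range(n_iter):
        a = K @ v
        u_new = unit(soft_threshold(a, frac_u * np.abs(a).max()))
        b = K.T @ u_new
        v_new = unit(soft_threshold(b, frac_v * np.abs(b).max()))
        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v

# Synthetic p >> n example: only the first few columns of each block carry signal.
rng = np.random.default_rng(2)
n, p, q = 60, 100, 80
z = rng.normal(size=n)
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))
X[:, :5] += 2.0 * z[:, None]
Y[:, :4] += 2.0 * z[:, None]

u, v = sparse_cca_rank1(X, Y)
corr = np.corrcoef(standardize(X) @ u, standardize(Y) @ v)[0, 1]
print("nonzero weights in u:", int(np.count_nonzero(u)), "of", p)
print("correlation of latent variables:", round(float(corr), 3))
```

Larger values of `frac_u` and `frac_v` zero out more weights; in practice such tuning parameters would typically be chosen by cross-validation or permutation rather than fixed in advance.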


1983 ◽  
Vol 61 (6) ◽  
pp. 1637-1646 ◽  
Author(s):  
J. W. Sheard ◽  
Dorothy W. Geale

Vegetation–environment relationships are defined with the aid of principal-components analysis and canonical correlation analysis. In both the uplands and lowlands a moisture gradient, determined by measuring gravimetric moisture and indicated by organic carbon, is the most important environmental influence on the vegetation. In the uplands this gradient is also associated with snow depth (drifting) and in the lowlands with conductivity. The second environmental gradient in the uplands is associated with depth to permafrost and its soil textural correlates. Thus soil texture, independent of its effect on soil moisture status, influences the distribution of plant communities. In the lowlands the second environmental gradient is less clear but is also associated with depth to permafrost and, in addition, elevation and CaCO3 equivalent. Canonical correlation analysis shows that the components extracted by principal-components analysis of the vegetation data did not conform to the important trends of variation in the environmental data. Principal-components analysis is nevertheless an essential means of data reduction prior to the application of canonical correlation. The statistical model used in the study has potential advantages over the independent use of either principal-components analysis or canonical correlation.

