Fast Principal Component Analysis of Large-Scale Genome-Wide Data

2014 ◽  
Author(s):  
Gad Abraham ◽  
Michael Inouye

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, to detect population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years, and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers accuracy identical to that of existing tools in extracting the top principal components, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential, as traditional approaches will not scale adequately. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
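The abstract does not spell out the randomized algorithm, so the following is only a minimal sketch of one common randomized PCA scheme (random projection, a few power iterations, then an exact SVD of a small matrix). It is not flashpca's code; the function name `randomized_pca` and the parameters `n_oversamples` and `n_iter` are assumptions chosen for illustration.

```python
import numpy as np

def randomized_pca(X, n_components=2, n_oversamples=10, n_iter=5, seed=0):
    """Approximate the top principal components of X (samples x features).

    Illustrative sketch only; names and defaults are hypothetical.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                      # centre each feature (e.g. each SNP)
    k = min(n_components + n_oversamples, min(Xc.shape))

    # Project the data onto k random directions and orthonormalise the result.
    omega = rng.standard_normal((Xc.shape[1], k))
    Q, _ = np.linalg.qr(Xc @ omega)

    # Power iterations sharpen the separation between leading and trailing
    # singular values; QR after each product keeps the basis well conditioned.
    for _ in range(n_iter):
        Z, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Z)

    # Exact SVD of the small k x features matrix, then map back to sample space.
    B = Q.T @ Xc
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small

    scores = U[:, :n_components] * S[:n_components]  # principal component scores
    return scores, S[:n_components], Vt[:n_components]
```

The expensive operations are products of the data matrix with thin matrices of width `k`, which is why schemes of this kind can be much cheaper than a full eigen-decomposition when only a few components are needed.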

2019 ◽  
Vol 35 (19) ◽  
pp. 3679-3683 ◽  
Author(s):  
Aritra Bose ◽  
Vassilis Kalantzis ◽  
Eugenia-Maria Kontopoulou ◽  
Mai Elkady ◽  
Peristera Paschou ◽  
...  

Motivation: Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets become increasingly large, traditional approaches based on loading the entire dataset into system memory (Random Access Memory) become impractical, and out-of-core implementations are the only viable alternative. Results: We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and operates successfully even on commodity hardware with a system memory of just a few gigabytes. Moreover, TeraPCA has minimal dependencies on external libraries and requires only a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires <5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task. Availability and implementation: Source code and documentation are both available at https://github.com/aritra90/TeraPCA. Supplementary information: Supplementary data are available at Bioinformatics online.
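To illustrate the out-of-core idea, here is a minimal Python sketch of subspace iteration that only ever holds one block of rows in memory. It is not TeraPCA's C++ implementation; the function `blockwise_subspace_iteration` and the `get_blocks` callable (which could, for instance, read slices of a memory-mapped file) are hypothetical, and the blocks are assumed to be centred already.

```python
import numpy as np

def blockwise_subspace_iteration(get_blocks, n_features, k=4, n_iter=10, seed=0):
    """Estimate the top-k singular values and right singular vectors of a
    matrix that is only available as a stream of row blocks.

    get_blocks: hypothetical callable returning an iterable of 2-D arrays
                (rows x n_features); assumed to be centred already.
    """
    rng = np.random.default_rng(seed)
    # Random orthonormal starting basis in feature (marker) space.
    Q, _ = np.linalg.qr(rng.standard_normal((n_features, k)))

    for _ in range(n_iter):
        # Accumulate A^T (A Q) one block at a time; A is never held in full.
        Y = np.zeros((n_features, k))
        for block in get_blocks():
            Y += block.T @ (block @ Q)
        Q, _ = np.linalg.qr(Y)

    # Rayleigh-Ritz step: small k x k projected problem Q^T A^T A Q.
    T = np.zeros((k, k))
    for block in get_blocks():
        BQ = block @ Q
        T += BQ.T @ BQ
    evals, W = np.linalg.eigh(T)
    order = np.argsort(evals)[::-1]               # largest eigenvalue first
    singular_values = np.sqrt(np.maximum(evals[order], 0.0))
    V = Q @ W[:, order]                           # approximate right singular vectors
    return singular_values, V
```

Principal component scores can then be obtained with one more pass over the blocks, multiplying each block by `V`. Memory use is governed by the block size and by `n_features * k`, not by the number of individuals.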


2014 ◽  
Vol 94 (5) ◽  
pp. 662-676 ◽  
Author(s):  
Hugues Aschard ◽  
Bjarni J. Vilhjálmsson ◽  
Nicolas Greliche ◽  
Pierre-Emmanuel Morange ◽  
David-Alexandre Trégouët ◽  
...  

Entropy ◽  
2019 ◽  
Vol 21 (6) ◽  
pp. 548 ◽  
Author(s):  
Yuqing Sun ◽  
Jun Niu

Hydrological regionalization is a useful step in hydrological modeling and prediction. The regionalization is not always straightforward, however, due to the lack of long-term hydrological data and the complex multi-scale variability features embedded in the data. This study examines the multiscale soil moisture variability of simulated data, obtained on a grid-cell basis from a large-scale hydrological model, and clusters the grid-cell-based soil moisture data using wavelet-based multiscale entropy and principal component analysis, over the Xijiang River basin in South China, for the period 2002–2010. The effective regionalization, for 169 grid cells with a spatial resolution of 0.5° × 0.5°, produced homogeneous groups based on the pattern of wavelet-based entropy information. Four distinct modes explain 80.14% of the total embedded variability of the transformed wavelet power across different timescales. Moreover, the possible implications of the regionalization results for local hydrological applications, such as parameter estimation for an ungauged catchment and designing a uniform prediction strategy for a sub-area in a large-scale basin, are discussed.
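A statement such as "four modes explain about 80% of the variability" comes from the cumulative explained-variance ratio of the principal components. The sketch below shows that calculation in general terms; the function names and the 0.8 default are assumptions, and the code does not reproduce the study's data or its wavelet transform.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component of X
    (rows = observations, e.g. grid cells; columns = features)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2                       # proportional to component variances
    return var / var.sum()

def modes_needed(ratios, target=0.8):
    """Smallest number of leading components whose cumulative ratio
    reaches `target` (hypothetical helper)."""
    cumulative = np.cumsum(ratios)
    # First index at which the cumulative ratio is >= target, as a count.
    n = int(np.searchsorted(cumulative, target)) + 1
    return min(n, len(ratios))         # guard against rounding at target ~ 1.0
```

For example, with made-up ratios `[0.4, 0.2, 0.15, 0.1, 0.1, 0.05]`, the cumulative sums first reach 0.8 at the fourth component, so `modes_needed` returns 4.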


2019 ◽  
Vol 116 (42) ◽  
pp. 21262-21267 ◽  
Author(s):  
Kenji Yano ◽  
Yoichi Morinaka ◽  
Fanmiao Wang ◽  
Peng Huang ◽  
Sayaka Takehara ◽  
...  

Elucidation of the genetic control of rice architecture is crucial due to the global demand for high crop yields. Rice architecture is a complex trait affected by plant height, tillering, and panicle morphology. In this study, principal component analysis (PCA) on 8 typical traits related to plant architecture revealed that the first principal component (PC), PC1, provided the most information on traits that determine rice architecture. A genome-wide association study (GWAS) using PC1 as a dependent variable was used to isolate a gene encoding rice SPINDLY (OsSPY), which activates the gibberellin (GA) signal suppression protein SLR1. The effect of GA signaling on the regulation of rice architecture was confirmed in 9 types of isogenic plants with different levels of GA responsiveness. Further population genetics analysis demonstrated that the functional allele of OsSPY associated with semidwarfism and small panicles was selected in the process of rice breeding. In summary, the use of PCA in GWAS will aid in uncovering genes involved in traits with complex characteristics.
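The idea of using PC1 as a GWAS phenotype can be sketched as follows: standardise the trait matrix, take the first principal component scores as a composite phenotype, and regress those scores on each marker's genotype. This is a simplified illustration under stated assumptions (no covariates, no kinship or population-structure correction, hypothetical function names), not the analysis pipeline used in the study.

```python
import numpy as np

def first_pc(traits):
    """Return PC1 scores, PC1 loadings and the fraction of variance PC1
    explains, for a matrix of traits (rows = plants, columns = traits)."""
    # Standardise so traits measured in different units contribute equally.
    Z = (traits - traits.mean(axis=0)) / traits.std(axis=0, ddof=1)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    return U[:, 0] * S[0], Vt[0], float(explained[0])

def marker_effect(genotype, phenotype):
    """Least-squares slope of phenotype on a single marker's genotype
    (coded e.g. 0/1/2); a bare single-marker test statistic."""
    g = genotype - genotype.mean()
    return float(g @ (phenotype - phenotype.mean()) / (g @ g))
```

In a genome scan, `marker_effect` (or a proper mixed-model test) would be applied to the PC1 scores once per marker. The sign of a principal component is arbitrary, so only the magnitude and consistency of effects carry meaning.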

