pcadapt: An R Package to Perform Genome Scans for Selection Based on Principal Component Analysis

2016 ◽  
Author(s):  
Keurcien Luu ◽  
Eric Bazin ◽  
Michael G. B. Blum

Abstract
The R package pcadapt performs genome scans to detect genes under selection based on population genomic data. It assumes that candidate markers are outliers with respect to how they are related to population structure. Because population structure is ascertained with principal component analysis, the package is fast and works with large-scale data. It can handle missing data and pooled sequencing data. In contrast to population-based approaches, the package handles admixed individuals and does not require grouping individuals into populations. Since its first release, pcadapt has evolved in terms of both statistical approach and software implementation. We present results obtained with the robust Mahalanobis distance, a new statistic for genome scans available in version 2.0 and later of the package. When hierarchical population structure occurs, the Mahalanobis distance is more powerful than the communality statistic that was implemented in the first version of the package. Using simulated data, we compare pcadapt to other software for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find that the proportion of false discoveries is close to the nominal false discovery rate, set at 10%, with the exception of BayeScan, which generates 40% false discoveries. We also find that the power of BayeScan is severely impacted by the presence of admixed individuals, whereas pcadapt is not. Finally, we find that pcadapt and hapflk are the most powerful tools in scenarios of population divergence and range expansion. Because pcadapt handles next-generation sequencing data, it is a valuable tool for data analysis in molecular ecology.
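The core idea, regressing each SNP on the leading principal components and flagging SNPs whose coefficient vectors are Mahalanobis outliers, can be sketched with NumPy and SciPy on simulated data (pcadapt itself is an R package and uses a robust covariance estimator; a plain covariance and toy genotypes are used here for brevity):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy genotype matrix: 100 individuals x 500 SNPs coded 0/1/2 (simulated)
G = rng.integers(0, 3, size=(100, 500)).astype(float)

# Scale each SNP, then take the K leading PCs of the individuals
X = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-9)
K = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :K] * S[:K]                 # PC scores, one row per individual

# Regress every SNP on the PC scores; each SNP gets a K-vector of coefficients
B, *_ = np.linalg.lstsq(scores, X, rcond=None)
Z = B.T                                   # shape (n_snps, K)

# Mahalanobis distance of each coefficient vector; under neutrality the
# distances are approximately chi-squared with K degrees of freedom
Zc = Z - Z.mean(axis=0)
D2 = np.einsum('ij,jk,ik->i', Zc, np.linalg.inv(np.cov(Z, rowvar=False)), Zc)
pvals = stats.chi2.sf(D2, df=K)
```

SNPs with the smallest p-values would be the selection candidates; pcadapt additionally converts these into q-values to control the false discovery rate.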

2017 ◽  
Author(s):  
Kridsadakorn Chaichoompu ◽  
Fentaw Abegaz Yazew ◽  
Sissades Tongsima ◽  
Philip James Shaw ◽  
Anavaj Sakuntabhai ◽  
...  

Abstract
Background
Resolving population genetic structure is challenging, especially when dealing with closely related or geographically confined populations. Although Principal Component Analysis (PCA)-based methods and genomic variation with single nucleotide polymorphisms (SNPs) are widely used to describe shared genetic ancestry, improvements can be made, especially when fine-scale population structure is the target.
Results
This work presents an R package called IPCAPS, which uses SNP information to resolve possibly fine-scale population structure. The IPCAPS routines are built on the iterative pruning Principal Component Analysis (ipPCA) framework, which systematically assigns individuals to genetically similar subgroups. In each iteration, our tool is able to detect and eliminate outliers, thereby avoiding severe misclassification errors.
Conclusions
IPCAPS supports different measurement scales for the variables used to identify substructure; hence, panels of gene expression and methylation data can be accommodated as well. The tool can also be applied in patient sub-phenotyping contexts. IPCAPS is developed in R and is freely available from bio3.giga.ulg.ac.be/ipcaps
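The iterative-pruning idea can be illustrated in Python on simulated data: project onto the top PCs, attempt a two-way split, and recurse on each subgroup. This is only a minimal sketch of the recursion scheme, with k-means in place of IPCAPS's clustering machinery and a crude balance check in place of its outlier trimming and statistical stopping rule:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def ippca_split(X, min_size=20, depth=0, max_depth=3):
    """One illustrative pass of iterative-pruning PCA: project onto the
    top two PCs, try a 2-way split, and recurse on each subgroup."""
    Xc = X - X.mean(axis=0)
    if len(Xc) < 2 * min_size or depth >= max_depth:
        return [np.arange(len(Xc))]
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    _, labels = kmeans2(U[:, :2] * S[:2], 2, minit='++', seed=1)
    if np.bincount(labels, minlength=2).min() < min_size:
        return [np.arange(len(Xc))]       # split too unbalanced: stop here
    groups = []
    for g in (0, 1):
        idx = np.where(labels == g)[0]
        for sub in ippca_split(X[idx], min_size, depth + 1, max_depth):
            groups.append(idx[sub])       # map back to this level's indices
    return groups

# Two simulated populations separated in all 30 dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 30)), rng.normal(3, 1, (50, 30))])
groups = ippca_split(X)
```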


2020 ◽  
Author(s):  
Jonas Meisner ◽  
Anders Albrechtsen

Abstract
Accurate inference of population structure is important in many studies of population genetics. In this paper we present HaploNet, a novel method for performing dimensionality reduction and clustering of genetic data. The method is based on local clustering of phased haplotypes using neural networks, applied to whole-genome sequencing or genotype data. By utilizing a Gaussian mixture prior in a variational autoencoder framework, we are able to learn a low-dimensional latent space in which we cluster haplotypes along the genome in a highly scalable manner. We demonstrate that we can use encodings of the latent space to infer global population structure using principal component analysis with haplotype information. Additionally, we derive an expectation-maximization algorithm for estimating ancestry proportions based on the haplotype clustering and the neural networks in a likelihood framework. Using different examples of sequencing data, we demonstrate that our approach is better at distinguishing closely related populations than standard principal component analysis and admixture analysis. We show that HaploNet performs similarly to ChromoPainter for principal component analysis while being much faster and allowing for unsupervised clustering.
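The overall pipeline shape, cluster haplotypes window by window along the genome, then run PCA on the concatenated cluster encodings, can be sketched on toy data. Note the hedge: HaploNet clusters each window with a Gaussian-mixture variational autoencoder, whereas plain k-means stands in for that step here, and the haplotypes are random:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(2)
# Toy phased haplotypes: 200 haplotypes x 400 biallelic SNPs (simulated)
H = rng.integers(0, 2, size=(200, 400)).astype(float)

window, k = 50, 4
encodings = []
for start in range(0, H.shape[1], window):
    # HaploNet clusters each window with a neural network; plain k-means
    # stands in for that step in this sketch
    _, labels = kmeans2(H[:, start:start + window], k, minit='++', seed=2)
    encodings.append(np.eye(k)[labels])   # one-hot cluster encoding
E = np.hstack(encodings)                  # (200, n_windows * k)

# Global population structure: PCA on the concatenated window encodings
Ec = E - E.mean(axis=0)
U, S, _ = np.linalg.svd(Ec, full_matrices=False)
pcs = U[:, :2] * S[:2]
```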


2018 ◽  
Author(s):  
Jonas Meisner ◽  
Anders Albrechtsen

Abstract
We present two methods for inferring population structure and admixture proportions in low-depth next-generation sequencing data. Inference of population structure is essential in both population genetics and association studies and is often performed using principal component analysis or clustering-based approaches. Next-generation sequencing methods provide large amounts of genetic data but are associated with statistical uncertainty, especially for low-depth sequencing data. Models can account for this uncertainty by working directly on the genotype likelihoods of the unobserved genotypes. We propose a method for inferring population structure through principal component analysis in an iterative approach of estimating individual allele frequencies, and we demonstrate improved accuracy in samples with low and variable sequencing depth for both simulated and real datasets. We also use the estimated individual allele frequencies in a fast non-negative matrix factorization method to estimate admixture proportions. Both methods have been implemented in the PCAngsd framework available at http://www.popgen.dk/software/.
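The iterative loop described above can be sketched in a few lines: compute posterior mean genotypes from genotype likelihoods under Hardy-Weinberg priors, then refresh the individual allele frequencies from a rank-K PCA reconstruction. This is a schematic of the idea only, with randomly generated likelihoods and a fixed iteration count rather than PCAngsd's actual initialization and convergence criteria:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, K = 80, 300, 2
# Toy genotype likelihoods P(data | genotype) for genotypes 0/1/2,
# drawn at random purely to exercise the iteration
GL = rng.dirichlet(np.ones(3), size=(n, m))

Pi = np.full((n, m), 0.25)            # initial individual allele frequencies
for _ in range(5):
    # E-step: posterior mean genotype under HWE priors from current Pi
    prior = np.stack([(1 - Pi) ** 2, 2 * Pi * (1 - Pi), Pi ** 2], axis=-1)
    post = GL * prior
    post /= post.sum(axis=-1, keepdims=True)
    E = post @ np.array([0.0, 1.0, 2.0])          # expected genotypes
    # M-step: a rank-K reconstruction of E updates the individual
    # allele frequencies (the central trick of the approach)
    mu = E.mean(axis=0)
    U, S, Vt = np.linalg.svd(E - mu, full_matrices=False)
    Pi = np.clip(((U[:, :K] * S[:K]) @ Vt[:K] + mu) / 2.0, 1e-4, 1 - 1e-4)
```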


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jonas Meisner ◽  
Anders Albrechtsen ◽  
Kristian Hanghøj

Abstract
Background
Identification of selection signatures between populations is often an important part of a population genetic study. With high-throughput DNA sequencing, leveraging larger sample sizes of populations with similar ancestries has become increasingly common. This has led to the need for methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units, which is why existing methods rely on principal component analysis for inference of selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data.
Materials and methods
We have extended two principal component analysis based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure.
Results
Here, we present two selection statistics which we have implemented in the framework. These methods account for genotype uncertainty, opening the opportunity to conduct selection scans in continuous populations from low- and/or variable-coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high-quality called genotypes.
Conclusion
We show that selection scans of low-coverage sequencing data of populations with similar ancestry perform on par with those obtained from high-quality genotype data. Moreover, we demonstrate that they outperform selection statistics obtained from called genotypes from low-coverage sequencing data without the need for ad hoc filtering.
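One classical PC-based selection statistic that such methods build on, the squared SNP loading on a principal component rescaled to an approximate chi-squared distribution, can be sketched on simulated called genotypes (the genotype-likelihood extension that is this paper's contribution is not attempted here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Toy called genotypes; the paper's point is to replace these with
# genotype likelihoods, which this sketch does not do
G = rng.integers(0, 3, size=(100, 1000)).astype(float)
X = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-9)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
# Selection statistic for PC 1: a SNP's squared loading, rescaled so
# that neutral SNPs are approximately chi-squared with 1 df
m = X.shape[1]
d = m * Vt[0] ** 2
p = stats.chi2.sf(d, df=1)
```

Because the loading vector has unit norm, the statistics average exactly 1 across SNPs; outlier SNPs with large loadings on a differentiation axis yield small p-values.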


Complexity ◽  
2019 ◽  
Vol 2019 ◽  
pp. 1-13 ◽  
Author(s):  
Yue Hu ◽  
Jin-Xing Liu ◽  
Ying-Lian Gao ◽  
Sheng-Jun Li ◽  
Juan Wang

In the big data era, sequencing technology has produced large amounts of biological sequencing data. Different views of cancer genome data provide sufficient complementary information to explore genetic activity. The identification of differentially expressed genes from multiview cancer gene data is of great importance in cancer diagnosis and treatment. In this paper, we propose a novel method for identifying differentially expressed genes based on tensor robust principal component analysis (TRPCA), which extends the matrix method to the processing of multiway data. To identify differentially expressed genes, we proceed as follows. First, multiview data containing cancer gene expression data from different sources are prepared. Second, the original tensor is decomposed into the sum of a low-rank tensor and a sparse tensor using TRPCA. Third, the differentially expressed genes are treated as sparse perturbation signals and identified from the sparse tensor. Fourth, the differentially expressed genes are evaluated using the Gene Ontology and GeneCards tools. The validity of the TRPCA method was tested on two sets of multiview data. The experimental results showed that our method is superior to representative methods in both efficiency and accuracy.
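The low-rank-plus-sparse decomposition at the heart of this approach is easiest to see in the matrix case. The sketch below is a simple inexact-ALM scheme for matrix robust PCA on simulated data; TRPCA applies the same decomposition to a 3-way tensor through the tensor SVD, which is omitted here:

```python
import numpy as np

def rpca(M, n_iter=60):
    """Matrix robust PCA: M ~ L + S with L low-rank and S sparse,
    solved by alternating soft-thresholding with a dual update."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))
    mu = 0.25 * m * n / np.abs(M).sum()
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(n_iter):
        # Low-rank update: singular-value soft-thresholding
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Sparse update: entrywise soft-thresholding
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Y += mu * (M - L - S)
        mu *= 1.2                          # gradually tighten the penalty
    return L, S

# Recover a rank-2 signal corrupted by sparse spikes (simulated data)
rng = np.random.default_rng(5)
L0 = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 60))
S0 = np.zeros((60, 60))
S0[rng.random((60, 60)) < 0.05] = 10.0
L, S = rpca(L0 + S0)
```

In the gene-expression setting, the recovered sparse component plays the role of the "sparse perturbation signals" from which differentially expressed genes are identified.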


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0242954
Author(s):  
Tomokazu Konishi

Coronaviruses and influenza viruses have similarities and differences. To compare them comprehensively, their genome sequencing data were examined by principal component analysis. Coronaviruses showed fewer variations than a subclass of influenza viruses. In addition, differences among coronaviruses that infect a variety of hosts were also small. These characteristics may have facilitated the infection of different hosts. Although many of the coronaviruses were conserved, those found repeatedly among humans showed annual changes. If SARS-CoV-2 changes its genome as the influenza H type does, it will spread repeatedly every few years. In addition, the coronavirus family has many other candidates for new pandemics.
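The basic preprocessing step, turning aligned genome sequences into a numeric matrix so that PCA can measure variation, can be sketched with a generic one-hot encoding on toy sequences (the paper uses its own PCA formulation; this is only the standard encoding idea):

```python
import numpy as np

# Toy aligned sequences: rows are genomes, columns are alignment sites
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "TCGTACGT"]
alpha = "ACGT"
idx = np.array([[alpha.index(c) for c in s] for s in seqs])
X = np.eye(len(alpha))[idx].reshape(len(seqs), -1).astype(float)

# PCA of the one-hot matrix; the spread of the scores reflects how much
# the genomes vary (the basis of the coronavirus/influenza comparison)
Xc = X - X.mean(axis=0)
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
pcs = U[:, :2] * S[:2]
```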


2019 ◽  
Author(s):  
Philippe Boileau ◽  
Nima S. Hejazi ◽  
Sandrine Dudoit

Abstract
Motivation
Statistical analyses of high-throughput sequencing data have reshaped the biological sciences. In spite of myriad advances, recovering interpretable biological signal from data corrupted by technical noise remains a prevalent open problem. Several classes of procedures, among them classical dimensionality reduction techniques and others incorporating subject-matter knowledge, have provided effective advances; however, no procedure currently satisfies the dual objectives of recovering stable and relevant features simultaneously.
Results
Inspired by recent proposals for making use of control data in the removal of unwanted variation, we propose a variant of principal component analysis, sparse contrastive principal component analysis, that extracts sparse, stable, interpretable, and relevant biological signal. The new methodology is compared to competing dimensionality reduction approaches through a simulation study as well as via analyses of several publicly available protein expression, microarray gene expression, and single-cell transcriptome sequencing datasets.
Availability
A free and open-source software implementation of the methodology, the scPCA R package, is made available via the Bioconductor Project. Code for all analyses presented in the paper is also available via GitHub.
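The contrastive-PCA idea underlying this work, taking the leading eigenvectors of the difference between target and background covariance matrices and sparsifying the loadings, can be sketched as follows. The contrast weight `gamma` and threshold `thresh` are fixed by hand in this sketch, whereas scPCA selects such hyperparameters by a data-driven criterion:

```python
import numpy as np

def sparse_cpca(target, background, gamma=1.0, thresh=0.1, k=2):
    """Sketch of sparse contrastive PCA: leading eigenvectors of
    cov(target) - gamma * cov(background), with soft-thresholded
    loadings as a crude sparsity step."""
    Ct = np.cov(target, rowvar=False)
    Cb = np.cov(background, rowvar=False)
    evals, evecs = np.linalg.eigh(Ct - gamma * Cb)
    V = evecs[:, ::-1][:, :k]                 # top-k contrastive directions
    V = np.sign(V) * np.maximum(np.abs(V) - thresh, 0.0)   # sparsify
    norms = np.linalg.norm(V, axis=0)
    V = V / np.where(norms > 0, norms, 1.0)   # renormalize nonzero columns
    return (target - target.mean(axis=0)) @ V, V

# Toy target (signal + noise) and background (noise only) data
rng = np.random.default_rng(6)
target = rng.normal(size=(100, 20))
background = rng.normal(size=(100, 20))
scores, V = sparse_cpca(target, background)
```

Directions of variation shared with the background (technical noise) are subtracted out before the eigendecomposition, which is what lets the method keep biologically relevant signal.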

