IPCAPS: an R package for iterative pruning to capture population structure

Abstract: Antimicrobial resistance (AMR) in bacteria has been a global threat to public health for decades. A well-known driving force for the emergence, evolution and dissemination of genetic AMR determinants in bacterial populations is horizontal gene transfer, which is frequently mediated by mobile genetic elements (MGEs). Some MGEs can capture, maintain, and rearrange multiple AMR genes in a donor bacterium before moving them into recipients, giving rise to a phenomenon called horizontal gene co-transfer (HGcoT). This physical linkage or co-localisation between mobile AMR genes is of particular concern because it facilitates rapid dissemination of multidrug resistance within and across bacterial populations, providing opportunities for co-selection of AMR genes and limiting our therapeutic options. The study of HGcoT can benefit from large-scale whole-genome sequencing (WGS) data; however, the great majority of published studies of HGcoT consider only simple co-occurrence measures, which can be confounded by strong bacterial population structure due to clonal reproduction, leading to spurious associations. To address this issue, we present GeneMates, an R package implementing a network approach to the identification of HGcoT using WGS data. The package enables users to test for associations between the presence-absence of bacterial genes using univariate linear mixed models that control for population structure based on core-genome variation. Furthermore, when physical distances between genes of interest are measurable in bacterial genomes, users can evaluate distance consistency to further support their inference of putative horizontally co-transferred genes, whose co-occurrence cannot be completely explained by the population structure. We demonstrate how this package can be used to identify co-transferred AMR genes and recover known MGEs from Escherichia coli and Salmonella Typhimurium WGS data. GeneMates is accessible at github.com/wanyuac/GeneMates.

GeneMates: an R package for detecting horizontal gene co-transfer between bacteria using gene-gene associations controlled for population structure

BMC Genomics ◽

10.1186/s12864-020-07019-6 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Yu Wan ◽

Ryan R. Wick ◽

Justin Zobel ◽

Danielle J. Ingle ◽

Michael Inouye ◽

...

Keyword(s):

Population Structure ◽

Population Level ◽

R Package ◽

Whole Genome Sequencing Data ◽

Nucleotide Polymorphisms ◽

Network Approach ◽

Sequencing Data ◽

Association Tests ◽

Novel Approach ◽

Antimicrobial Resistance Genes

Abstract: Background: Horizontal gene transfer contributes to bacterial evolution by mobilising genes across various taxonomical boundaries. It is frequently mediated by mobile genetic elements (MGEs), which may capture, maintain, and rearrange mobile genes and co-mobilise them between bacteria, causing horizontal gene co-transfer (HGcoT). This physical linkage between mobile genes poses a great threat to public health as it facilitates dissemination and co-selection of clinically important genes amongst bacteria. Although the rapid accumulation of bacterial whole-genome sequencing (WGS) data since the 2000s enables the study of HGcoT at the population level, results based on genetic co-occurrence counts and simple association tests are usually confounded by bacterial population structure when the sampled bacteria belong to the same species, leading to spurious conclusions. Results: We have developed a network approach to explore WGS data for evidence of intraspecies HGcoT and have implemented it in the R package GeneMates (github.com/wanyuac/GeneMates). The package takes as input an allelic presence-absence matrix of genes of interest and a matrix of core-genome single-nucleotide polymorphisms, performs association tests with linear mixed models controlled for population structure, produces a network of significantly associated alleles, and identifies clusters within the network as plausible co-transferred alleles. GeneMates users may choose to score the consistency of allelic physical distances measured in genome assemblies using a novel approach we have developed, and overlay the scores on the network for further evidence of HGcoT. Validation studies of GeneMates on known acquired antimicrobial resistance genes in Escherichia coli and Salmonella Typhimurium show advantages of our network approach over simple association analysis: (1) distinguishing between allelic co-occurrence driven by HGcoT and that driven by clonal reproduction, (2) evaluating the effects of population structure on allelic co-occurrence, and (3) directly linking allele clusters in the network to MGEs when physical distances are incorporated. Conclusion: GeneMates offers an effective approach to the detection of intraspecies HGcoT using WGS data.

pophelper: an R package and web app to analyse and visualize population structure

Molecular Ecology Resources ◽

10.1111/1755-0998.12509 ◽

2016 ◽

Vol 17 (1) ◽

pp. 27-32 ◽

Cited By ~ 238

Author(s):

R. M. Francis

Keyword(s):

Population Structure ◽

R Package ◽

Web App

Pcadapt: An R Package to Perform Genome Scans for Selection Based on Principal Component Analysis

10.1101/056135 ◽

2016 ◽

Cited By ~ 6

Author(s):

Keurcien Luu ◽

Eric Bazin ◽

Michael G. B. Blum

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Mahalanobis Distance ◽

Principal Component ◽

R Package ◽

Component Analysis ◽

Population Divergence ◽

Sequencing Data ◽

Genome Scans ◽

False Discoveries

Abstract: The R package pcadapt performs genome scans to detect genes under selection based on population genomic data. It assumes that candidate markers are outliers with respect to how they are related to population structure. Because population structure is ascertained with principal component analysis, the package is fast and works with large-scale data. It can handle missing data and pooled sequencing data. In contrast to population-based approaches, the package handles admixed individuals and does not require grouping individuals into populations. Since its first release, pcadapt has evolved both in terms of statistical approach and software implementation. We present results obtained with robust Mahalanobis distance, which is a new statistic for genome scans available in version 2.0 and later of the package. When hierarchical population structure occurs, Mahalanobis distance is more powerful than the communality statistic that was implemented in the first version of the package. Using simulated data, we compare pcadapt to other software for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find that the proportion of false discoveries is close to the nominal false discovery rate of 10%, with the exception of BayeScan, which generates 40% false discoveries. We also find that the power of BayeScan is severely impacted by the presence of admixed individuals, whereas pcadapt is not impacted. Last, we find that pcadapt and hapflk are the most powerful software in scenarios of population divergence and range expansion. Because pcadapt handles next-generation sequencing data, it is a valuable tool for data analysis in molecular ecology.
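
The outlier idea in this abstract can be illustrated with a small Python sketch. It assumes each marker has already been summarised by two scores relating it to the first two principal components (the input values below are made up), and it uses the classical sample mean and covariance rather than the robust estimates the abstract refers to. The function name is hypothetical and is not part of the pcadapt API.

```python
def mahalanobis_squared_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean.

    Assumption: classical (non-robust) mean and covariance are used here;
    the abstract describes a robust variant of this statistic.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Unbiased sample covariance matrix entries.
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0.0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        distances.append((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical per-marker scores; the last marker is deliberately atypical.
marker_scores = [(0.1, -0.2), (-0.3, 0.4), (0.5, 0.1),
                 (-0.2, -0.5), (0.0, 0.3), (6.0, 5.0)]
distances = mahalanobis_squared_2d(marker_scores)
candidate = max(range(len(distances)), key=distances.__getitem__)
```

Markers with the largest distances would be the candidates in such a scan. With only six points the classical estimates are themselves pulled towards the atypical marker, which is one motivation for a robust version.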

Summix: A method for detecting and adjusting for population structure in genetic summary data

10.1101/2021.02.03.429446 ◽

2021 ◽

Author(s):

IS Arriaga-MacKenzie ◽

G Matesi ◽

S Chen ◽

A Ronco ◽

KM Marker ◽

...

Keyword(s):

Population Structure ◽

South Asian ◽

R Package ◽

Individual Level ◽

Level Data ◽

Causal Variants ◽

High Utility ◽

Ancestry Proportions ◽

Summary Data ◽

Reference Samples

Abstract: Publicly available genetic summary data have high utility in research and the clinic, including prioritizing putative causal variants, polygenic scoring, and leveraging common controls. However, summarizing individual-level data can mask population structure, resulting in confounding, reduced power, and incorrect prioritization of putative causal variants. This limits the utility of publicly available data, especially for understudied or admixed populations where additional research and resources are most needed. While several methods exist to estimate ancestry in individual-level data, methods to estimate ancestry proportions in summary data are lacking. Here, we present Summix, a method to efficiently deconvolute ancestry and provide ancestry-adjusted allele frequencies (AFs) from summary data. Using continental reference ancestries, namely African (AFR), Non-Finnish European (EUR), East Asian (EAS), Indigenous American (IAM), and South Asian (SAS), we obtain accurate and precise estimates (within 0.1%) for all simulation scenarios. We apply Summix to gnomAD v2.1 exome and genome groups and subgroups, finding heterogeneous continental ancestry for several groups, including African/African American (∼84% AFR, ∼14% EUR) and American/Latinx (∼4% AFR, ∼5% EAS, ∼43% EUR, ∼46% IAM). Compared to the unadjusted gnomAD AFs, Summix’s ancestry-adjusted AFs more closely match the respective African and Latinx reference samples. Even on modern, dense panels of summary statistics, Summix yields results in seconds, allowing for estimation of confidence intervals via block bootstrap. Given an accompanying R package, Summix increases the utility and equity of public genetic resources, empowering novel research opportunities.
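
Estimating ancestry proportions from summary data can be pictured as a mixture problem: find non-negative weights that sum to one and make a weighted combination of reference allele frequencies match the observed ones. The Python sketch below is one way to set this up under stated assumptions. It uses a least-squares objective solved by projected gradient descent, which is a choice made for this illustration and not necessarily the optimiser used by Summix. All function names and allele frequencies are hypothetical.

```python
def project_to_simplex(values):
    """Euclidean projection onto {p : p_k >= 0, sum(p) = 1}."""
    ordered = sorted(values, reverse=True)
    cumulative, theta = 0.0, 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        shift = (cumulative - 1.0) / count
        if value - shift > 0.0:
            theta = shift
    return [max(v - theta, 0.0) for v in values]


def estimate_ancestry_proportions(observed_af, reference_af, iterations=2000):
    """Least-squares mixture weights for observed allele frequencies.

    observed_af: one allele frequency per SNP.
    reference_af: one row per SNP, one column per reference group.
    """
    n_groups = len(reference_af[0])
    squared_norm = sum(x * x for row in reference_af for x in row)
    if squared_norm == 0.0:
        raise ValueError("reference allele frequencies are all zero")
    # Step size below 1/L, where L bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * squared_norm)
    proportions = [1.0 / n_groups] * n_groups
    for _ in range(iterations):
        residuals = [sum(p * r for p, r in zip(proportions, row)) - obs
                     for row, obs in zip(reference_af, observed_af)]
        gradient = [2.0 * sum(res * row[k] for res, row in zip(residuals, reference_af))
                    for k in range(n_groups)]
        proportions = project_to_simplex(
            [p - step * g for p, g in zip(proportions, gradient)])
    return proportions


# Made-up reference frequencies for two groups at five SNPs, and an
# observed sample constructed as a 70% / 30% mixture of the two.
reference = [[0.1, 0.8], [0.5, 0.2], [0.9, 0.4], [0.3, 0.6], [0.7, 0.1]]
observed = [0.7 * a + 0.3 * b for a, b in reference]
estimated = estimate_ancestry_proportions(observed, reference)
```

The simplex projection keeps every iterate a valid set of proportions, so the result can be read directly as ancestry fractions.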

FSTruct: An Fst-based tool for measuring ancestry variation in inference of population structure

10.1101/2021.09.24.461741 ◽

2021 ◽

Author(s):

Maike L Morrison ◽

Nicolas Alcala ◽

Noah A Rosenberg

Keyword(s):

Population Structure ◽

Dirichlet Distribution ◽

Clustering Algorithms ◽

Genetic Data ◽

R Package ◽

Bootstrap Test ◽

Individual Level ◽

Time Periods ◽

Frequency Vectors ◽

Membership Vector

In model-based inference of population structure from individual-level genetic data, individuals are assigned membership coefficients in a series of statistical clusters generated by clustering algorithms. Distinct patterns of variability in membership coefficients can be produced for different groups of individuals, for example, representing different predefined populations, sampling sites, or time periods. Such variability can be difficult to capture in a single numerical value; membership coefficient vectors are multivariate and potentially incommensurable across groups, as the number of clusters over which individuals are distributed can vary among groups of interest. Further, two groups might share few clusters in common, so that membership coefficient vectors are concentrated on different clusters. We introduce a method for measuring the variability of membership coefficients of individuals in a predefined group, making use of an analogy between variability across individuals in membership coefficient vectors and variation across populations in allele frequency vectors. We show that in a model in which membership coefficient vectors in a population follow a Dirichlet distribution, the measure increases linearly with a parameter describing the variance of a specified component of the membership vector. We apply the approach, which makes use of a normalized Fst statistic, to data on inferred population structure in three example scenarios. We also introduce a bootstrap test for equivalence of two or more groups in their level of membership coefficient variability. Our methods are implemented in the R package FSTruct.
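
The analogy described in this abstract can be sketched in Python by treating each individual's membership vector the way a population's allele frequency vector is treated when computing Fst. The sketch below computes the unnormalized ratio (H_T - H_S) / H_T. It omits the normalization that the abstract mentions, and the function name is hypothetical rather than taken from the FSTruct package.

```python
def membership_fst(q_matrix):
    """Unnormalized Fst across individuals' membership coefficient vectors.

    q_matrix: one row per individual, one column per cluster; each row is
    assumed to be non-negative and to sum to 1.
    """
    n_individuals = len(q_matrix)
    n_clusters = len(q_matrix[0])
    mean_membership = [sum(row[k] for row in q_matrix) / n_individuals
                       for k in range(n_clusters)]
    # "Total" heterozygosity of the group-average membership vector.
    h_total = 1.0 - sum(m * m for m in mean_membership)
    # Average "within-individual" heterozygosity.
    h_within = sum(1.0 - sum(q * q for q in row) for row in q_matrix) / n_individuals
    if h_total == 0.0:
        # Every individual sits entirely in the same single cluster.
        return 0.0
    return (h_total - h_within) / h_total


identical = membership_fst([[0.5, 0.5], [0.5, 0.5]])   # no variability
disjoint = membership_fst([[1.0, 0.0], [0.0, 1.0]])    # maximal variability
```

A group whose members all have the same membership vector gives 0, and a group split between individuals wholly assigned to different clusters gives 1, which matches the intuition of low versus high variability.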

ipADMIXTURE: R package for inferring sub-population clusters based on genetic admixture

10.1101/2020.03.21.001206 ◽

2020 ◽

Author(s):

Chainarong Amornbunchornvej ◽

Pongsakorn Wangkumhang ◽

Sissades Tongsima

Keyword(s):

Population Structure ◽

R Package ◽

Structure Study ◽

Genetic Population Structure ◽

Genetic Admixture ◽

Admixture Analysis ◽

Genetic Population ◽

Minimum Number ◽

User Friendly

Abstract: ipADMIXTURE is an R package to infer clusters and their phylogeny based on Q matrices of genetic admixture analysis. It is the first software of its kind to infer not only clusters, but also the hierarchy of sub-populations with respect to the minimum number of ancestors that split any pair of clusters apart. Since the inputs of the package, Q matrices, can be obtained from well-known software (ADMIXTURE, STRUCTURE, etc.) and are mandatory information in genetic population structure studies, our package has the potential to help scientists and researchers find deeper explanations of admixture analysis in their studies. Our package comes with a user-friendly interface to make the software accessible to everyone.

IPCAPS: an R package for iterative pruning to capture population structure

10.1101/186874 ◽

2017 ◽

Cited By ~ 3

Author(s):

Kridsadakorn Chaichoompu ◽

Fentaw Abegaz Yazew ◽

Sissades Tongsima ◽

Philip James Shaw ◽

Anavaj Sakuntabhai ◽

...

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Principal Component ◽

R Package ◽

Component Analysis ◽

Genomic Variation ◽

Fine Scale ◽

Nucleotide Polymorphisms ◽

Measurement Scales ◽

Scale Population

Abstract: Background: Resolving population genetic structure is challenging, especially when dealing with closely related or geographically confined populations. Although Principal Component Analysis (PCA)-based methods and genomic variation with single nucleotide polymorphisms (SNPs) are widely used to describe shared genetic ancestry, improvements can be made, especially when fine-scale population structure is the target. Results: This work presents an R package called IPCAPS, which uses SNP information for resolving possibly fine-scale population structure. The IPCAPS routines are built on the iterative pruning Principal Component Analysis (ipPCA) framework that systematically assigns individuals to genetically similar subgroups. In each iteration, our tool is able to detect and eliminate outliers, thereby avoiding severe misclassification errors. Conclusions: IPCAPS supports different measurement scales for variables used to identify substructure. Hence, panels of gene expression and methylation data can be accommodated as well. The tool can also be applied in patient sub-phenotyping contexts. IPCAPS is developed in R and is freely available from bio3.giga.ulg.ac.be/ipcaps
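
The general shape of iterative pruning, repeatedly running PCA on a group and splitting it until no further substructure is found, can be shown with a toy Python sketch. Everything specific here is an assumption made for illustration: the split along the leading principal component, the gap-based stopping rule, and the `min_size` and `gap_ratio` parameters are hypothetical stand-ins, not the clustering or stopping criteria used by ipPCA or IPCAPS. Outlier detection is not covered.

```python
import math


def leading_pc_scores(rows, iterations=100):
    """Scores of each row on the leading principal component (power iteration)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    direction = [float(j + 1) for j in range(m)]  # arbitrary non-zero start
    norm = math.sqrt(sum(x * x for x in direction))
    direction = [x / norm for x in direction]
    for _ in range(iterations):
        scores = [sum(c[j] * direction[j] for j in range(m)) for c in centered]
        updated = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in updated))
        if norm == 0.0:
            break
        direction = [x / norm for x in updated]
    return [sum(c[j] * direction[j] for j in range(m)) for c in centered]


def iterative_pruning(rows, indices=None, min_size=4, gap_ratio=0.5):
    """Recursively split individuals into subgroups; returns lists of row indices."""
    if indices is None:
        indices = list(range(len(rows)))
    if len(indices) < min_size:
        return [indices]
    scores = leading_pc_scores([rows[i] for i in indices])
    order = sorted(range(len(indices)), key=lambda k: scores[k])
    ranked = [scores[k] for k in order]
    score_range = ranked[-1] - ranked[0]
    if score_range == 0.0:
        return [indices]
    gaps = [ranked[k + 1] - ranked[k] for k in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    # Hypothetical stopping rule: split only if one gap dominates the range.
    if gaps[cut] / score_range <= gap_ratio:
        return [indices]
    low = [indices[k] for k in order[:cut + 1]]
    high = [indices[k] for k in order[cut + 1:]]
    return (iterative_pruning(rows, low, min_size, gap_ratio)
            + iterative_pruning(rows, high, min_size, gap_ratio))


# Made-up data: three well-separated groups of four individuals each.
positions = [0, 0.1, 0.2, 0.3, 5, 5.1, 5.2, 5.3, 20, 20.1, 20.2, 20.3]
data = [(t, 2 * t) for t in positions]
groups = iterative_pruning(data)
```

The first PCA separates only the most distant group. The remaining two groups are resolved when PCA is rerun on the pooled remainder, which is the point of iterating rather than clustering once.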
