pophelper: an R package and web app to analyse and visualize population structure

Background: With the development of continuous glucose monitoring systems (CGMS), detailed glycemic data are now available for analysis. Yet analysis of this data-rich information can be formidable. The power of CGMS-derived data lies in its characterization of glycemic variability. In contrast, many standard glycemic measures like hemoglobin A1c (HbA1c) and self-monitored blood glucose inadequately describe glycemic variability and run the risk of bias toward overreporting hyperglycemia. Methods that adjust for this bias are often overlooked in clinical research due to difficulty of computation and lack of accessible analysis tools. Methods: In response, we have developed a new R package rGV, which calculates a suite of 16 glycemic variability metrics when provided a single individual’s CGM data. rGV is versatile and robust; it is capable of handling data of many formats from many sensor types. We also created a companion R Shiny web app that provides these glycemic variability analysis tools without prior knowledge of R coding. We analyzed the statistical reliability of all the glycemic variability metrics included in rGV and illustrate the clinical utility of rGV by analyzing CGM data from three studies. Results: In subjects without diabetes, greater glycemic variability was associated with higher HbA1c values. In patients with type 2 diabetes mellitus (T2DM), we found that high glucose is the primary driver of glycemic variability. In patients with type 1 diabetes (T1DM), we found that naltrexone use may potentially reduce glycemic variability. Conclusions: We present a new R package and accompanying web app to facilitate quick and easy computation of a suite of glycemic variability metrics.
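As a rough illustration of what a glycemic variability metric computes, the sketch below derives the standard deviation and coefficient of variation from one individual's glucose readings. These two statistics are common variability summaries chosen here for illustration. The abstract does not list which 16 metrics rGV implements, and the function name and readings are hypothetical.

```python
from statistics import mean, stdev


def glycemic_variability(glucose_mg_dl):
    """Return mean, SD, and coefficient of variation (%) for one CGM trace."""
    if len(glucose_mg_dl) < 2:
        raise ValueError("need at least two readings")
    avg = mean(glucose_mg_dl)
    sd = stdev(glucose_mg_dl)  # sample standard deviation (n - 1 denominator)
    return {"mean": avg, "sd": sd, "cv_percent": 100.0 * sd / avg}


# Hypothetical readings in mg/dL, not taken from any study in the abstract.
readings = [100, 140, 90, 180, 110, 130]
summary = glycemic_variability(readings)
print(summary)
```

The coefficient of variation scales the spread by the mean, so traces with different average glucose levels can be compared on one scale.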

GeneMates: an R package for Detecting Horizontal Gene Co-transfer between Bacteria Using Gene-gene Associations Controlled for Population Structure

10.1101/2020.02.29.970970 ◽

2020 ◽

Author(s):

Yu Wan ◽

Ryan R. Wick ◽

Justin Zobel ◽

Danielle J. Ingle ◽

Michael Inouye ◽

...

Keyword(s):

Population Structure ◽

Large Scale ◽

R Package ◽

Bacterial Genomes ◽

Bacterial Populations ◽

Gene Associations ◽

Physical Linkage ◽

Rapid Dissemination ◽

Bacterial Genes ◽

Global Threat

Antimicrobial resistance (AMR) in bacteria has been a global threat to public health for decades. A well-known driving force for the emergence, evolution and dissemination of genetic AMR determinants in bacterial populations is horizontal gene transfer, which is frequently mediated by mobile genetic elements (MGEs). Some MGEs can capture, maintain, and rearrange multiple AMR genes in a donor bacterium before moving them into recipients, giving rise to a phenomenon called horizontal gene co-transfer (HGcoT). This physical linkage or co-localisation between mobile AMR genes is of particular concern because it facilitates rapid dissemination of multidrug resistance within and across bacterial populations, providing opportunities for co-selection of AMR genes and limiting our therapeutic options. The study of HGcoT can benefit from large-scale whole-genome sequencing (WGS) data; however, most published studies of HGcoT consider only simple co-occurrence measures, which can be confounded by strong bacterial population structure due to clonal reproduction, leading to spurious associations. To address this issue, we present GeneMates, an R package implementing a network approach to identification of HGcoT using WGS data. The package enables users to test for associations between presence-absence of bacterial genes using univariate linear mixed models controlling for population structure based on core-genome variation. Furthermore, when physical distances between genes of interest are measurable in bacterial genomes, users can evaluate distance consistency to further support their inference of putative horizontally co-transferred genes, whose co-occurrence cannot be completely explained by the population structure. We demonstrate how this package can be used to identify co-transferred AMR genes and recover known MGEs from Escherichia coli and Salmonella Typhimurium WGS data. GeneMates is accessible at github.com/wanyuac/GeneMates.

SampleSizePlanner: A Tool to Estimate and Justify Sample Size for Two-Group Studies

10.31222/osf.io/rm9dn ◽

2021 ◽

Author(s):

Marton Kovacs ◽

Don van Ravenzwaaij ◽

Rink Hoekstra ◽

Balazs Aczel

Keyword(s):

Sample Size ◽

R Package ◽

Statistical Technique ◽

Important Decision ◽

Decision Points ◽

Web App ◽

Study Designs

Planning sample size often requires researchers to identify a statistical technique and to make several choices during their calculations. Currently, there is a lack of clear guidelines for researchers to find and use the applicable procedure. In the present tutorial, we introduce a web app and R package that offer nine different procedures to determine and justify the sample size for independent two-group study designs. The application highlights the most important decision points for each procedure and suggests example justifications for them. The resulting sample size report can serve as a template for preregistrations and manuscripts.

GeneMates: an R package for detecting horizontal gene co-transfer between bacteria using gene-gene associations controlled for population structure

BMC Genomics ◽

10.1186/s12864-020-07019-6 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Yu Wan ◽

Ryan R. Wick ◽

Justin Zobel ◽

Danielle J. Ingle ◽

Michael Inouye ◽

...

Keyword(s):

Population Structure ◽

Population Level ◽

R Package ◽

Whole Genome Sequencing Data ◽

Nucleotide Polymorphisms ◽

Network Approach ◽

Sequencing Data ◽

Association Tests ◽

Novel Approach ◽

Antimicrobial Resistance Genes

Background: Horizontal gene transfer contributes to bacterial evolution through mobilising genes across various taxonomical boundaries. It is frequently mediated by mobile genetic elements (MGEs), which may capture, maintain, and rearrange mobile genes and co-mobilise them between bacteria, causing horizontal gene co-transfer (HGcoT). This physical linkage between mobile genes poses a great threat to public health as it facilitates dissemination and co-selection of clinically important genes amongst bacteria. Although rapid accumulation of bacterial whole-genome sequencing data since the 2000s enables study of HGcoT at the population level, results based on genetic co-occurrence counts and simple association tests are usually confounded by bacterial population structure when sampled bacteria belong to the same species, leading to spurious conclusions. Results: We have developed a network approach to explore WGS data for evidence of intraspecies HGcoT and have implemented it in the R package GeneMates (github.com/wanyuac/GeneMates). The package takes as input an allelic presence-absence matrix of genes of interest and a matrix of core-genome single-nucleotide polymorphisms, performs association tests with linear mixed models controlled for population structure, produces a network of significantly associated alleles, and identifies clusters within the network as plausible co-transferred alleles. GeneMates users may choose to score consistency of allelic physical distances measured in genome assemblies using a novel approach we have developed and overlay the scores on the network for further evidence of HGcoT. Validation studies of GeneMates on known acquired antimicrobial resistance genes in Escherichia coli and Salmonella Typhimurium show advantages of our network approach over simple association analysis: (1) distinguishing between allelic co-occurrence driven by HGcoT and that driven by clonal reproduction, (2) evaluating effects of population structure on allelic co-occurrence, and (3) directly linking allele clusters in the network to MGEs when physical distances are incorporated. Conclusion: GeneMates offers an effective approach to detection of intraspecies HGcoT using WGS data.
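The final step described above, grouping significantly associated alleles into clusters, can be sketched as finding connected components in an undirected network. This is a minimal stand-in, assuming that a cluster is simply a connected component. The abstract does not say how GeneMates defines clusters, and the allele names and function below are hypothetical.

```python
from collections import defaultdict


def allele_clusters(edges):
    """Group alleles into connected components of an undirected network."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collects every allele reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical pairs of alleles whose association passed a significance test.
significant_pairs = [
    ("alleleA", "alleleB"),
    ("alleleB", "alleleC"),
    ("alleleD", "alleleE"),
]
clusters = allele_clusters(significant_pairs)
print(clusters)
```

Each resulting group is a candidate set of co-transferred alleles that would still need supporting evidence, such as the physical distance scores mentioned in the abstract.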

PheWAS-ME: a web-app for interactive exploration of multimorbidity patterns in PheWAS

Bioinformatics ◽

10.1093/bioinformatics/btaa870 ◽

2020 ◽

Author(s):

Nick Strayer ◽

Jana K Shirey-Rice ◽

Yu Shyr ◽

Joshua C Denny ◽

Jill M Pulley ◽

...

Keyword(s):

Genetic Variant ◽

Statistical Tests ◽

R Package ◽

Supplementary Information ◽

Health Records ◽

Individual Level ◽

Level Data ◽

Phenotype Data ◽

Tests Of Association ◽

Web App

Summary: Electronic health records (EHRs) linked with a DNA biobank provide unprecedented opportunities for biomedical research in precision medicine. The phenome-wide association study (PheWAS) is a widely used technique for the evaluation of relationships between genetic variants and a large collection of clinical phenotypes recorded in EHRs. PheWAS analyses are typically presented as static tables and charts of summary statistics obtained from statistical tests of association between a genetic variant and individual phenotypes. Comorbidities are common and typically lead to complex, multivariate gene–disease association signals that are challenging to interpret. Discovering and interrogating multimorbidity patterns and their influence in PheWAS is difficult and time-consuming. We present PheWAS-ME: an interactive dashboard to visualize individual-level genotype and phenotype data side-by-side with PheWAS analysis results, allowing researchers to explore multimorbidity patterns and their associations with a genetic variant of interest. We expect this application to enrich PheWAS analyses by illuminating clinical multimorbidity patterns present in the data. Availability and implementation: A demo PheWAS-ME application is publicly available at https://prod.tbilab.org/phewas_me/. Sample datasets are provided for exploration with the option to upload custom PheWAS results and corresponding individual-level data. Online versions of the appendices are available at https://prod.tbilab.org/phewas_me_info/. The source code is available as an R package on GitHub (https://github.com/tbilab/multimorbidity_explorer). Supplementary information: Supplementary data are available at Bioinformatics online.

Pcadapt: An R Package to Perform Genome Scans for Selection Based on Principal Component Analysis

10.1101/056135 ◽

2016 ◽

Cited By ~ 6

Author(s):

Keurcien Luu ◽

Eric Bazin ◽

Michael G. B. Blum

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Mahalanobis Distance ◽

Principal Component ◽

R Package ◽

Component Analysis ◽

Population Divergence ◽

Sequencing Data ◽

Genome Scans ◽

False Discoveries

The R package pcadapt performs genome scans to detect genes under selection based on population genomic data. It assumes that candidate markers are outliers with respect to how they are related to population structure. Because population structure is ascertained with principal component analysis, the package is fast and works with large-scale data. It can handle missing data and pooled sequencing data. In contrast to population-based approaches, the package handles admixed individuals and does not require grouping individuals into populations. Since its first release, pcadapt has evolved both in terms of statistical approach and software implementation. We present results obtained with robust Mahalanobis distance, which is a new statistic for genome scans available in the 2.0 and later versions of the package. When hierarchical population structure occurs, Mahalanobis distance is more powerful than the communality statistic that was implemented in the first version of the package. Using simulated data, we compare pcadapt to other software for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find that the proportion of false discoveries is around a nominal false discovery rate set at 10%, with the exception of BayeScan, which generates 40% false discoveries. We also find that the power of BayeScan is severely impacted by the presence of admixed individuals whereas pcadapt is not impacted. Last, we find that pcadapt and hapflk are the most powerful software in scenarios of population divergence and range expansion. Because pcadapt handles next-generation sequencing data, it is a valuable tool for data analysis in molecular ecology.
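To make the outlier idea concrete, the sketch below computes a squared Mahalanobis distance for each marker from a pair of per-marker scores on two principal components. It uses the classical sample mean and covariance, not the robust estimates the abstract refers to, and it is not pcadapt's implementation. The scores and function name are hypothetical.

```python
from statistics import mean


def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] with n - 1 denominator.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form d^T S^{-1} d using the closed-form 2x2 inverse.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical per-marker scores on two components; the last marker is extreme.
scores = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2), (0.0, 0.3), (-0.1, -0.4), (3.0, 2.5)]
d2 = mahalanobis_sq(scores)
most_extreme = max(range(len(d2)), key=d2.__getitem__)
print(most_extreme, d2)
```

Because the classical covariance is itself inflated by extreme markers, a robust estimate of location and scatter is generally preferable for genome scans, which is consistent with the abstract's emphasis on the robust variant.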

Summix: A method for detecting and adjusting for population structure in genetic summary data

10.1101/2021.02.03.429446 ◽

2021 ◽

Author(s):

IS Arriaga-MacKenzie ◽

G Matesi ◽

S Chen ◽

A Ronco ◽

KM Marker ◽

...

Keyword(s):

Population Structure ◽

South Asian ◽

R Package ◽

Individual Level ◽

Level Data ◽

Causal Variants ◽

High Utility ◽

Ancestry Proportions ◽

Summary Data ◽

Reference Samples

Publicly available genetic summary data have high utility in research and the clinic including prioritizing putative causal variants, polygenic scoring, and leveraging common controls. However, summarizing individual-level data can mask population structure resulting in confounding, reduced power, and incorrect prioritization of putative causal variants. This limits the utility of publicly available data, especially for understudied or admixed populations where additional research and resources are most needed. While several methods exist to estimate ancestry in individual-level data, methods to estimate ancestry proportions in summary data are lacking. Here, we present Summix, a method to efficiently deconvolute ancestry and provide ancestry-adjusted allele frequencies from summary data. Using continental reference ancestry, African (AFR), Non-Finnish European (EUR), East Asian (EAS), Indigenous American (IAM), South Asian (SAS), we obtain accurate and precise estimates (within 0.1%) for all simulation scenarios. We apply Summix to gnomAD v2.1 exome and genome groups and subgroups finding heterogeneous continental ancestry for several groups including African/African American (∼84% AFR, ∼14% EUR) and American/Latinx (∼4% AFR, ∼5% EAS, ∼43% EUR, ∼46% IAM). Compared to the unadjusted gnomAD AFs, Summix’s ancestry-adjusted AFs more closely match respective African and Latinx reference samples. Even on modern, dense panels of summary statistics, Summix yields results in seconds allowing for estimation of confidence intervals via block bootstrap. Given an accompanying R package, Summix increases the utility and equity of public genetic resources, empowering novel research opportunities.
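One way to picture ancestry deconvolution from summary data is as a mixture fit: observed allele frequencies are modelled as a weighted average of reference-group frequencies, and the weights are estimated. The sketch below assumes a least-squares objective and only two reference groups so that the estimate has a closed form. The abstract does not state Summix's objective function, and all frequencies and names here are hypothetical.

```python
def estimate_proportion(observed, ref_a, ref_b):
    """Least-squares share of reference group A in a two-group mixture.

    Models observed[i] as share * ref_a[i] + (1 - share) * ref_b[i] and
    returns the share in [0, 1] that minimises the squared error.
    """
    num = sum((o - b) * (a - b) for o, a, b in zip(observed, ref_a, ref_b))
    den = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    if den == 0:
        raise ValueError("reference groups have identical allele frequencies")
    # Clamping gives the constrained optimum because the objective is a
    # convex quadratic in a single variable.
    return min(1.0, max(0.0, num / den))


# Hypothetical reference allele frequencies at four variants.
ref_a = [0.10, 0.50, 0.80, 0.30]
ref_b = [0.40, 0.20, 0.60, 0.90]
true_share = 0.7
observed = [true_share * a + (1 - true_share) * b for a, b in zip(ref_a, ref_b)]

estimate = estimate_proportion(observed, ref_a, ref_b)
print(round(estimate, 6))
```

With more than two reference groups, as in the five continental ancestries named in the abstract, the same idea requires a constrained optimiser over proportions that are non-negative and sum to one.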

How to automatically document data with the codebook package to facilitate data re-use

10.31234/osf.io/5qc6h ◽

2018 ◽

Cited By ~ 1

Author(s):

Ruben C. Arslan

Keyword(s):

Scientific Community ◽

R Package ◽

Research Waste ◽

Web App ◽

Psychological Scales ◽

Data Documentation ◽

Machine Readable ◽

Basic Standards ◽

Existing Data ◽

Over Time

Data documentation in psychology lags behind not only many other disciplines, but also basic standards of usefulness. Psychological scientists often prefer to invest the time and effort necessary to document existing data well into other duties such as writing and collecting more data. Codebooks therefore tend to be unstandardised and stored in proprietary formats, and are rarely properly indexed in search engines. This means that rich datasets are sometimes used only once—by their creators—and left to disappear into oblivion; even if they can find them, researchers are unlikely to publish analyses based on existing datasets if they cannot be confident they understand them well enough. My codebook package makes it easier to generate rich metadata in human- and machine-readable codebooks. By using metadata from existing sources and by automating some tedious tasks such as documenting psychological scales and reliabilities, summarising descriptives, and identifying missingness patterns, I aim to encourage researchers to use the package for their own or their team's benefit. The codebook R package and web app make it possible to generate rich codebooks in a few minutes and just three clicks. Over time, this could lead to psychological data becoming findable, accessible, interoperable, and reusable, and to reduced research waste, thereby benefiting the scientific community as a whole.

FSTruct: An Fst-based tool for measuring ancestry variation in inference of population structure

10.1101/2021.09.24.461741 ◽

2021 ◽

Author(s):

Maike L Morrison ◽

Nicolas Alcala ◽

Noah A Rosenberg

Keyword(s):

Population Structure ◽

Dirichlet Distribution ◽

Clustering Algorithms ◽

Genetic Data ◽

R Package ◽

Bootstrap Test ◽

Individual Level ◽

Time Periods ◽

Frequency Vectors ◽

Membership Vector

In model-based inference of population structure from individual-level genetic data, individuals are assigned membership coefficients in a series of statistical clusters generated by clustering algorithms. Distinct patterns of variability in membership coefficients can be produced for different groups of individuals, for example, representing different predefined populations, sampling sites, or time periods. Such variability can be difficult to capture in a single numerical value; membership coefficient vectors are multivariate and potentially incommensurable across groups, as the number of clusters over which individuals are distributed can vary among groups of interest. Further, two groups might share few clusters in common, so that membership coefficient vectors are concentrated on different clusters. We introduce a method for measuring the variability of membership coefficients of individuals in a predefined group, making use of an analogy between variability across individuals in membership coefficient vectors and variation across populations in allele frequency vectors. We show that in a model in which membership coefficient vectors in a population follow a Dirichlet distribution, the measure increases linearly with a parameter describing the variance of a specified component of the membership vector. We apply the approach, which makes use of a normalized Fst statistic, to data on inferred population structure in three example scenarios. We also introduce a bootstrap test for equivalence of two or more groups in their level of membership coefficient variability. Our methods are implemented in the R package FSTruct.
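The analogy in the abstract can be sketched by treating each individual's membership vector as if it were a population's allele frequency vector and computing a standard multiallelic Fst across individuals. The version below is unnormalized: it omits the division by the maximal attainable value that the abstract's normalized statistic implies, and it is not FSTruct's implementation. The Q matrices and function name are hypothetical.

```python
def membership_fst(q_matrix):
    """Unnormalized Fst across individuals' membership coefficient vectors.

    Each row holds one individual's membership coefficients, which are
    non-negative and sum to 1 across clusters.
    """
    n = len(q_matrix)
    k = len(q_matrix[0])
    mean_q = [sum(row[j] for row in q_matrix) / n for j in range(k)]
    # Mean within-individual sum of squared coefficients.
    mean_sq_within = sum(sum(v * v for v in row) for row in q_matrix) / n
    # Sum of squared mean coefficients across the whole group.
    sq_total = sum(m * m for m in mean_q)
    if sq_total >= 1.0:
        return 0.0  # every individual sits entirely in the same cluster
    return (mean_sq_within - sq_total) / (1.0 - sq_total)


# Hypothetical Q matrices for two groups of individuals over two clusters.
identical = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
divergent = [[1.0, 0.0], [0.0, 1.0]]

fst_identical = membership_fst(identical)
fst_divergent = membership_fst(divergent)
print(fst_identical, fst_divergent)
```

A value of 0 means all individuals share the same membership vector, while larger values indicate more variability among individuals within the group.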

ipADMIXTURE: R package for inferring sub-population clusters based on genetic admixture

10.1101/2020.03.21.001206 ◽

2020 ◽

Author(s):

Chainarong Amornbunchornvej ◽

Pongsakorn Wangkumhang ◽

Sissades Tongsima

Keyword(s):

Population Structure ◽

R Package ◽

Structure Study ◽

Genetic Population Structure ◽

Genetic Admixture ◽

Admixture Analysis ◽

Genetic Population ◽

Minimum Number ◽

User Friendly

ipADMIXTURE is an R package to infer clusters and their phylogeny based on Q matrices of genetic admixture analysis. It is the first software of its kind to infer not only clusters, but also the hierarchy of sub-populations with respect to the minimum number of ancestors that split any pair of clusters apart. Since the inputs of the package, Q matrices, can be obtained from well-known software (ADMIXTURE, STRUCTURE, etc.) and are mandatory information in genetic population structure studies, our package has the potential to help researchers find deeper explanations of admixture analysis in their studies. Our package comes with a user-friendly interface to make the software accessible for everyone.
