FSTruct: An Fst-based tool for measuring ancestry variation in inference of population structure

2021 ◽  
Author(s):  
Maike L Morrison ◽  
Nicolas Alcala ◽  
Noah A Rosenberg

In model-based inference of population structure from individual-level genetic data, individuals are assigned membership coefficients in a series of statistical clusters generated by clustering algorithms. Distinct patterns of variability in membership coefficients can be produced for different groups of individuals, for example, representing different predefined populations, sampling sites, or time periods. Such variability can be difficult to capture in a single numerical value; membership coefficient vectors are multivariate and potentially incommensurable across groups, as the number of clusters over which individuals are distributed can vary among groups of interest. Further, two groups might share few clusters in common, so that membership coefficient vectors are concentrated on different clusters. We introduce a method for measuring the variability of membership coefficients of individuals in a predefined group, making use of an analogy between variability across individuals in membership coefficient vectors and variation across populations in allele frequency vectors. We show that in a model in which membership coefficient vectors in a population follow a Dirichlet distribution, the measure increases linearly with a parameter describing the variance of a specified component of the membership vector. We apply the approach, which makes use of a normalized Fst statistic, to data on inferred population structure in three example scenarios. We also introduce a bootstrap test for equivalence of two or more groups in their level of membership coefficient variability. Our methods are implemented in the R package FSTruct.
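The analogy described in this abstract, in which individuals play the role of populations and membership coefficients play the role of allele frequencies, can be illustrated with a minimal Python sketch. It computes only the plain, unnormalized Fst of a membership matrix. The normalization that the abstract mentions is not reproduced here, and the function name `fst_membership` is a hypothetical label, not part of the FSTruct package.

```python
def fst_membership(Q):
    """Unnormalized Fst of a membership matrix.

    Q is a list of membership vectors, one per individual; each vector is
    non-negative and sums to 1. Individuals are treated like populations
    and clusters like alleles (an illustrative assumption).
    """
    n = len(Q)
    k = len(Q[0])
    # Mean membership in each cluster across individuals.
    mean_q = [sum(row[j] for row in Q) / n for j in range(k)]
    # Average within-individual "heterozygosity": 1 - sum_k q_ik^2.
    h_s = 1.0 - sum(sum(x * x for x in row) for row in Q) / n
    # "Heterozygosity" of the mean membership vector.
    h_t = 1.0 - sum(m * m for m in mean_q)
    if h_t == 0.0:
        # Every individual lies entirely in the same cluster; Fst is
        # undefined, and 0.0 is returned here as a convention.
        return 0.0
    return (h_t - h_s) / h_t


# Identical membership vectors: no variability among individuals.
same = fst_membership([[0.5, 0.5], [0.5, 0.5]])
# Each individual fully assigned to a different cluster: maximal variability.
split = fst_membership([[1.0, 0.0], [0.0, 1.0]])
```

Under these assumptions, a group whose individuals all share the same membership vector scores 0, and a group whose individuals are each fully assigned to different clusters scores 1.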

2021 ◽  
Author(s):  
IS Arriaga-MacKenzie ◽  
G Matesi ◽  
S Chen ◽  
A Ronco ◽  
KM Marker ◽  
...  

Abstract Publicly available genetic summary data have high utility in research and the clinic including prioritizing putative causal variants, polygenic scoring, and leveraging common controls. However, summarizing individual-level data can mask population structure resulting in confounding, reduced power, and incorrect prioritization of putative causal variants. This limits the utility of publicly available data, especially for understudied or admixed populations where additional research and resources are most needed. While several methods exist to estimate ancestry in individual-level data, methods to estimate ancestry proportions in summary data are lacking. Here, we present Summix, a method to efficiently deconvolute ancestry and provide ancestry-adjusted allele frequencies from summary data. Using continental reference ancestry, African (AFR), Non-Finnish European (EUR), East Asian (EAS), Indigenous American (IAM), South Asian (SAS), we obtain accurate and precise estimates (within 0.1%) for all simulation scenarios. We apply Summix to gnomAD v2.1 exome and genome groups and subgroups finding heterogeneous continental ancestry for several groups including African/African American (∼84% AFR, ∼14% EUR) and American/Latinx (∼4% AFR, ∼5% EAS, ∼43% EUR, ∼46% IAM). Compared to the unadjusted gnomAD AFs, Summix’s ancestry-adjusted AFs more closely match respective African and Latinx reference samples. Even on modern, dense panels of summary statistics, Summix yields results in seconds allowing for estimation of confidence intervals via block bootstrap. Given an accompanying R package, Summix increases the utility and equity of public genetic resources, empowering novel research opportunities.
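One way to picture ancestry deconvolution from summary data is as a constrained least-squares fit: observed allele frequencies are modelled as a mixture of reference-ancestry allele frequencies, with proportions that are non-negative and sum to 1. The Python sketch below assumes that formulation and solves it by projected gradient descent. The abstract does not state Summix's actual objective or optimizer, so this is a generic stand-in, and the names `project_to_simplex` and `estimate_proportions`, as well as the toy data, are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def estimate_proportions(observed, reference, n_iter=2000):
    """Fit mixture proportions by projected gradient descent.

    observed:  list of observed allele frequencies, one per SNP.
    reference: list of rows, one per SNP; each row holds the allele
               frequency in each reference ancestry (not all zero).
    """
    n_snps = len(observed)
    k = len(reference[0])
    # Conservative step size from the squared Frobenius norm of the reference.
    frob = sum(r * r for row in reference for r in row)
    step = 1.0 / (2.0 * frob)
    p = [1.0 / k] * k  # start from equal proportions
    for _ in range(n_iter):
        resid = [
            sum(p[j] * reference[i][j] for j in range(k)) - observed[i]
            for i in range(n_snps)
        ]
        grad = [
            2.0 * sum(resid[i] * reference[i][j] for i in range(n_snps))
            for j in range(k)
        ]
        p = project_to_simplex([p[j] - step * grad[j] for j in range(k)])
    return p


# Toy data (invented for illustration): 4 SNPs, 2 reference ancestries.
reference = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.4], [0.3, 0.7]]
truth = [0.7, 0.3]
observed = [sum(t * r for t, r in zip(truth, row)) for row in reference]
estimate = estimate_proportions(observed, reference)
```

On this noise-free toy input the estimate should recover proportions close to `truth`; real summary data would add sampling noise and far more SNPs.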


Author(s):  
Andrea Peña-Malavera ◽  
Cecilia Bruno ◽  
Elmer Fernandez ◽  
Monica Balzarini

Abstract Identifying population genetic structure (PGS) is crucial for breeding and conservation. Several clustering algorithms are available to identify the underlying PGS to be used with genetic data of maize genotypes. In this work, six methods to identify PGS from unlinked molecular marker data were compared using simulated and experimental data consisting of multilocus-biallelic genotypes. Datasets were delineated under different biological scenarios characterized by three levels of genetic divergence among populations (low, medium, and high


2017 ◽  
Author(s):  
Josine Min ◽  
Gibran Hemani ◽  
George Davey Smith ◽  
Caroline Relton ◽  
Matthew Suderman

Abstract Background Technological advances in high throughput DNA methylation microarrays have allowed dramatic growth of a new branch of epigenetic epidemiology. DNA methylation datasets are growing ever larger in terms of the number of samples profiled, the extent of genome coverage, and the number of studies being meta-analysed. Novel computational solutions are required to efficiently handle these data. Methods We have developed meffil, an R package designed to quality control, normalize and perform epigenome-wide association studies (EWAS) efficiently on large samples of Illumina Infinium HumanMethylation450 and MethylationEPIC BeadChip microarrays. We tested meffil by applying it to 6000 450k microarrays generated from blood collected for two different datasets, Accessible Resource for Integrative Epigenomic Studies (ARIES) and The Genetics of Overweight Young Adults (GOYA) study. Results A complete reimplementation of functional normalization minimizes computational memory requirements to 5% of that required by other R packages, without increasing running time. Incorporating fixed and random effects alongside functional normalization, and automated estimation of functional normalisation parameters reduces technical variation in DNA methylation levels, thus reducing false positive associations and improving power. We also demonstrate that the ability to normalize datasets distributed across physically different locations without sharing any biologically-based individual-level data may reduce heterogeneity in meta-analyses of epigenome-wide association studies. However, we show that when batch is perfectly confounded with cases and controls, functional normalization is unable to prevent spurious associations. Conclusions meffil is available online (https://github.com/perishky/meffil/) along with tutorials covering typical use cases.


1997 ◽  
Vol 48 (3) ◽  
pp. 235 ◽  
Author(s):  
Dean R. Jerry ◽  
David J. Woodland

Genetic data were collected from eight allopatric populations of the common freshwater catfish, Tandanus tandanus. Catfish sampled from the New South Wales (NSW) mid-northern coastal rivers of the Bellinger, Macleay, Hastings and Manning exhibited fixed allelic differences from T. tandanus from the type locality (Namoi River) at four enzymatic loci (GPI-1*, EST*, UMB-1* and UMB-2*), suggesting that, collectively, catfish from these four river systems constitute an undescribed species of Tandanus. Catfish from the northern coastal rivers of NSW (Tweed, Richmond and Clarence) displayed a complex pattern of population structure that was not fully resolved by the present study. More work is needed on the complex assemblage of populations of eel-tailed catfish in the eastern coastal drainages of Australia.


2020 ◽  
Vol 36 (20) ◽  
pp. 5027-5036 ◽  
Author(s):  
Mingzhou Song ◽  
Hua Zhong

Abstract Motivation Chromosomal patterning of gene expression in cancer can arise from aneuploidy, genome disorganization or abnormal DNA methylation. To map such patterns, we introduce a weighted univariate clustering algorithm to guarantee linear runtime, optimality and reproducibility. Results We present the chromosome clustering method, establish its optimality and runtime and evaluate its performance. It uses dynamic programming enhanced with an algorithm to reduce search-space in-place to decrease runtime overhead. Using the method, we delineated outstanding genomic zones in 17 human cancer types. We identified strong continuity in dysregulation polarity—dominance by either up- or downregulated genes in a zone—along chromosomes in all cancer types. Significantly polarized dysregulation zones specific to cancer types are found, offering potential diagnostic biomarkers. Unreported previously, a total of 109 loci with conserved dysregulation polarity across cancer types give insights into pan-cancer mechanisms. Efficient chromosomal clustering opens a window to characterize molecular patterns in cancer genome and beyond. Availability and implementation Weighted univariate clustering algorithms are implemented within the R package ‘Ckmeans.1d.dp’ (4.0.0 or above), freely available at https://cran.r-project.org/package=Ckmeans.1d.dp. Supplementary information Supplementary data are available at Bioinformatics online.
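Optimal weighted clustering in one dimension by dynamic programming can be sketched in a few lines of Python. The version below is the plain quadratic-time recurrence, written only to show the idea. It is not the linear-runtime algorithm described in the abstract and does not call the Ckmeans.1d.dp package; the function name `weighted_kmeans_1d` is hypothetical, and the sketch assumes positive weights and `k` no larger than the number of points.

```python
def weighted_kmeans_1d(xs, ws, k):
    """Optimal weighted k-means in 1D via O(k * n^2) dynamic programming.

    Returns (clusters, total weighted within-cluster sum of squares).
    """
    pts = sorted(zip(xs, ws))
    n = len(pts)
    # Prefix sums of w, w*x and w*x^2 give any segment cost in O(1).
    W, S, Q = [0.0], [0.0], [0.0]
    for x, w in pts:
        W.append(W[-1] + w)
        S.append(S[-1] + w * x)
        Q.append(Q[-1] + w * x * x)

    def cost(i, j):
        # Weighted sum of squared deviations for sorted points i..j-1.
        w = W[j] - W[i]
        if w <= 0:
            return 0.0
        s = S[j] - S[i]
        return max((Q[j] - Q[i]) - s * s / w, 0.0)

    inf = float("inf")
    # D[m][j]: best cost of splitting the first j points into m clusters.
    D = [[inf] * (n + 1) for _ in range(k + 1)]
    B = [[0] * (n + 1) for _ in range(k + 1)]
    D[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = D[m - 1][i] + cost(i, j)
                if c < D[m][j]:
                    D[m][j] = c
                    B[m][j] = i  # start index of the last cluster
    # Backtrack the cluster boundaries from the last cluster to the first.
    bounds = []
    j = n
    for m in range(k, 0, -1):
        i = B[m][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    clusters = [[x for x, _ in pts[i:j]] for i, j in bounds]
    return clusters, D[k][n]


clusters, total = weighted_kmeans_1d(
    [1, 2, 3, 10, 11, 12], [1, 1, 1, 1, 1, 1], 2
)
```

Because one-dimensional points can be sorted, every optimal cluster is a contiguous run, which is what lets a dynamic program find the exact optimum instead of a heuristic local one.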


Author(s):  
Jie Dong ◽  
Min-Feng Zhu ◽  
Yong-Huan Yun ◽  
Ai-Ping Lu ◽  
Ting-Jun Hou ◽  
...  

Abstract Background With the increasing development of biotechnology and information technology, publicly available data in chemistry and biology are undergoing explosive growth. The wealth of information in these resources needs to be extracted and then transformed into useful knowledge by various data mining methods. However, a main computational challenge is how to effectively represent or encode the molecular objects under investigation, such as chemicals, proteins, DNAs and even complicated interactions, when data mining methods are employed. To further explore these complicated data, an integrated toolkit to represent different types of molecular objects and support various data mining algorithms is urgently needed. Results We developed a freely available R/CRAN package, called BioMedR, for molecular representations of chemicals, proteins, DNAs and pairwise samples of their interactions. The current version of BioMedR can calculate 293 molecular descriptors and 13 kinds of molecular fingerprints for small molecules, 9920 protein descriptors based on protein sequences and six types of generalized scale-based descriptors for proteochemometric modeling, more than 6000 DNA descriptors from nucleotide sequences and six types of interaction descriptors using three different combining strategies. Moreover, this package implements five similarity calculation methods and four powerful clustering algorithms as well as several useful auxiliary tools, aiming to build an integrated analysis pipeline for data acquisition, data checking, descriptor calculation and data modeling. Conclusion BioMedR provides a comprehensive and uniform R package to link up different representations of molecular objects with each other and will benefit cheminformatics/bioinformatics and other biomedical users. It is available at: https://CRAN.R-project.org/package=BioMedR and https://github.com/wind22zhu/BioMedR/.


2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
Abdelilah Et-taleby ◽  
Mohammed Boussetta ◽  
Mohamed Benslimane

Clustering or grouping is among the most important image processing methods that aim to split an image into different groups. Many clustering algorithms have been proposed in the literature, and the K-means algorithm is considered among the simplest and most widely used to classify an image into many regions. In this context, the main objective of this work is to detect and precisely locate the damaged area in photovoltaic (PV) fields based on the clustering of a thermal image through the K-means algorithm. The clustering quality depends on the number of clusters chosen; hence, the elbow, the average silhouette, and NbClust R package methods are used to find the optimal number K. The simulations carried out show that the K-means algorithm allows faults in PV panels to be detected precisely. The best result is obtained with three clusters, as suggested by the elbow method.


Author(s):  
Xiaofan Lu ◽  
Jialin Meng ◽  
Yujie Zhou ◽  
Liyun Jiang ◽  
Fangrong Yan

Abstract Summary Stratification of cancer patients into distinct molecular subgroups based on multi-omics data is an important issue in the context of precision medicine. Here, we present MOVICS, an R package for multi-omics integration and visualization in cancer subtyping. MOVICS provides a unified interface for 10 state-of-the-art multi-omics integrative clustering algorithms, and incorporates the most commonly used downstream analyses in cancer subtyping research, including characterization and comparison of identified subtypes from multiple perspectives, and verification of subtypes in external cohorts using two model-free approaches for multiclass prediction. MOVICS also creates feature-rich customizable visualizations with minimal effort. By analysing two published breast cancer cohorts, we show that MOVICS can serve a wide range of users and assist cancer therapy by moving away from the ‘one-size-fits-all’ approach to patient care. Availability and implementation MOVICS package and online tutorial are freely available at https://github.com/xlucpu/MOVICS. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 12 (1) ◽  
Author(s):  
M. Inês Neves ◽  
Joanne P. Webster ◽  
Martin Walker

Abstract Background Sibship reconstruction is a form of parentage analysis that can be used to identify the number of helminth parental genotypes infecting individual hosts using genetic data on only their offspring. This has the potential to be used for estimating individual worm burdens when adult parasites are otherwise inaccessible, the case for many of the most globally important human helminthiases and neglected tropical diseases. Yet methods of inferring worm burdens from sibship reconstruction data on numbers of unique parental genotypes are lacking, limiting the method’s scope of application. Results We developed a novel statistical method for estimating female worm burdens from data on the number of unique female parental genotypes derived from sibship reconstruction. We illustrate the approach using genotypic data on Schistosoma mansoni (miracidial) offspring collected from schoolchildren in Tanzania. We show how the bias and precision of worm burden estimates critically depend on the number of sampled offspring and we discuss strategies for obtaining sufficient sample sizes and for incorporating judiciously formulated prior information to improve the accuracy of estimates. Conclusions This work provides a novel approach for estimating individual-level worm burdens using genetic data on helminth offspring. This represents a step towards a wider scope of application of parentage analysis techniques. We discuss how the method could be used to assist in the interpretation of monitoring and evaluation data collected during mass drug administration programmes targeting human helminthiases and to help resolve outstanding questions on key population biological processes that govern the transmission dynamics of these neglected tropical diseases.


2006 ◽  
Vol 84 (4) ◽  
pp. 573-582 ◽  
Author(s):  
J.K. Young ◽  
W.F. Andelt ◽  
P.A. Terletzky ◽  
J.A. Shivik

Most ecological studies of coyotes are of short duration and studies are generally never repeated, thus the opportunity to compare changes in coyote ( Canis latrans Say, 1823) ecology over time is rare. We compared coyote home ranges, activity patterns, age, and diet at the Welder Wildlife Refuge in south Texas between 1978–1979 and 2003–2004 (25 years later). The Minta index of overlap between 1978 and 2003 home ranges was 51.7 ± 7.0 (n = 7), much greater than the Minta index value based on randomized tests (28.7 ± 8.6), indicating similar spatial patterns between time periods. The Minta index was 12.3 ± 6.2 (n = 7) for core areas, whereas the Minta index value based on randomized tests was 4.0 ± 3.0. Although overall diets were similar between 1978 and 2003, we detected some differences in prey species consumed. Activity patterns were similar between the two study periods, with peaks in movement occurring around sunrise and sunset. There was no difference in the mean age between the two populations (P = 0.44, n = 68, t[66] = 2.00). Our findings suggest that population features, such as home-range position and age structure, are similar between extended time periods, while individual-level patterns, such as the prey species consumed and distribution of locations within a home range, are dynamic and may reflect changes in the local environment.

