Visualizing Population Structure with Variational Autoencoders

Visualizing population structure with variational autoencoders

G3 Genes|Genomes|Genetics, 2021, Vol 11 (1). DOI: 10.1093/g3journal/jkaa036

Author(s): C J Battey, Gabrielle C Coffing, Andrew D Kern

Keyword(s): Population Structure, Population Genetic, Input Data, Command Line, Anopheles Mosquitoes, Global Geometry, Latent Space, Population Genetic Variation, Machine Learning Models, Better Than

Dimensionality reduction is a common tool for visualization and inference of population structure from genotypes, but popular methods either return too many dimensions for easy plotting (PCA) or fail to preserve global geometry (t-SNE and UMAP). Here we explore the utility of variational autoencoders (VAEs), generative machine learning models in which a pair of neural networks seek to first compress and then recreate the input data, for visualizing population genetic variation. VAEs incorporate nonlinear relationships, allow users to define the dimensionality of the latent space, and in our tests preserve global geometry better than t-SNE and UMAP. Our implementation, which we call popvae, is available as a command-line Python program at github.com/kr-colab/popvae. The approach yields latent embeddings that capture subtle aspects of population structure in humans and Anopheles mosquitoes, and can generate artificial genotypes characteristic of a given sample or population.
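
For orientation, the compress-then-recreate objective that a standard VAE maximizes can be written as below. This is the generic textbook form and is not claimed to be the exact loss used in popvae; the symbols x (a genotype vector), z (its latent coordinates), q_φ (the encoder network), p_θ (the decoder network), and p(z) (the latent prior) are notation introduced here for illustration.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right]
  - D_{\mathrm{KL}}\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

The first term rewards accurate reconstruction of the input genotypes, and the second keeps the encoder's latent distribution close to the prior. Setting the dimension of z to two is what would make the latent space directly plottable.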

Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure

Proceedings of the National Academy of Sciences, 2015, Vol 112 (26), pp. E3441-E3450. DOI: 10.1073/pnas.1412301112. Cited By ~ 6

Author(s): David Mimno, David M. Blei, Barbara E. Engelhardt

Keyword(s): Population Structure, Genetic Variation, Population Genetic, Association Studies, Model Fit, Widespread Application, Posterior Predictive Checks, Genomic Studies, Population Genetic Variation

Admixture models are a ubiquitous approach to capture latent population structure in genetic samples. Despite the widespread application of admixture models, little thought has been devoted to the quality of the model fit or the accuracy of the estimates of parameters of interest for a particular study. Here we develop methods for validating admixture models based on posterior predictive checks (PPCs), a Bayesian method for assessing the quality of fit of a statistical model to a specific dataset. We develop PPCs for five population-level statistics of interest: within-population genetic variation, background linkage disequilibrium, number of ancestral populations, between-population genetic variation, and the downstream use of admixture parameters to correct for population structure in association studies. Using PPCs, we evaluate the quality of the admixture model fit to four qualitatively different population genetic datasets: the population reference sample (POPRES) European individuals, the HapMap phase 3 individuals, continental Indians, and African American individuals. We found that the same model fitted to different genomic studies resulted in highly study-specific results when evaluated using PPCs, illustrating the utility of PPCs for model-based analyses in large genomic studies.

Certifai: A Toolkit for Building Trust in AI Systems

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020. DOI: 10.24963/ijcai.2020/759

Author(s): Jette Henderson, Shubham Sharma, Alan Gee, Valeri Alexiev, Steve Draper, ...

Keyword(s): Input Data, Synthetic Data, Command Line, Tabular Data, Command Line Interface, Building Trust, Compliance Officers, Data Points, Evaluation Dataset, Machine Learning Models

As more companies and governments build and use machine learning models to automate decisions, there is an ever-growing need to monitor and evaluate these models' behavior once they are deployed. Our team at CognitiveScale has developed a toolkit called Cortex Certifai to answer this need. Cortex Certifai is a framework that assesses aspects of robustness, fairness, and interpretability of any classification or regression model trained on tabular data, without requiring access to its internal workings. Additionally, Cortex Certifai allows users to compare models along these different axes and only requires 1) query access to the model and 2) an “evaluation” dataset. At its foundation, Cortex Certifai generates counterfactual explanations, which are synthetic data points close to input data points but differing in terms of model prediction. The tool then harnesses characteristics of these counterfactual explanations to analyze different aspects of the supplied model and delivers evaluations relevant to a variety of different stakeholders (e.g., model developers, risk analysts, compliance officers). Cortex Certifai can be configured and executed using a command-line interface (CLI), within Jupyter notebooks, or on the cloud, and the results are recorded in JSON files and can be visualized in an interactive console. Using these reports, stakeholders can understand, monitor, and build trust in their AI systems. In this paper, we provide a brief overview of a demonstration of Cortex Certifai's capabilities.

Reproductive system and population structure in two Hedysarum subspecies. I. Genetic variation within and between populations

Genome, 1991, Vol 34 (3), pp. 396-406. DOI: 10.1139/g91-061. Cited By ~ 9

Author(s): Hedi Baatout, Daniel Combes, Mohamed Marrakchi

Keyword(s): Population Structure, Genetic Variation, Genetic Variability, Population Genetic, Allozyme Variation, Wild Populations, Population Means, Quantitative Characters, Different Populations, Population Genetic Variation

Several samples of wild populations of two subspecies of the genus Hedysarum (H. spinosissimum subspecies capitatum, an outcrosser, and H. spinosissimum subspecies euspinosissimum, a selfer) were examined with respect to variability of 25 quantitative characters and allozyme variation at 13 loci. The amount of phenotypic and genetic variation within and among populations was documented. For most of the 25 quantitative characters, the differences between population means and between the total variances of the populations were higher in the selfer than in the outbreeder. Significant among-population genetic variation was found for nearly all characters in the two subspecies, but the outbreeder had higher within-population variability than the selfer with heterogeneity among characters. However, allozyme variation at 13 loci in about the same number of populations showed higher levels of genetic variability in the outcrossing subspecies capitatum compared with the selfing subspecies euspinosissimum, based on measures of mean number of alleles per locus, mean proportion of polymorphic loci, and mean heterozygosity. Therefore, H. spinosissimum subsp. capitatum appeared to be highly polymorphic in contrast to the greater monomorphism within populations of H. spinosissimum subsp. euspinosissimum. The genetic affinities of different populations of a subspecies are uniformly high, with Nei's genetic identity ranging from 0.983 to 0.997 in the selfing subspecies euspinosissimum and from 0.922 to 1.000 in the outcrossing subspecies capitatum. Key words: Hedysarum, genetic variation, populations, electrophoresis.
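
As a rough illustration of the statistic quoted above, Nei's genetic identity between two populations can be computed from allele frequencies as sketched below. The function name, the input layout (one list of allele frequencies per locus, alleles in the same order in both populations), and the example frequencies are assumptions made for this sketch; they are not data from the study, and no sample-size correction is applied.

```python
import math


def nei_identity(freqs_x, freqs_y):
    """Nei's genetic identity I = Jxy / sqrt(Jx * Jy), pooled over loci.

    freqs_x, freqs_y: lists of loci; each locus is a list of allele
    frequencies, with alleles listed in the same order in both populations.
    """
    jxy = jx = jy = 0.0
    for locus_x, locus_y in zip(freqs_x, freqs_y):
        jxy += sum(a * b for a, b in zip(locus_x, locus_y))
        jx += sum(a * a for a in locus_x)
        jy += sum(b * b for b in locus_y)
    return jxy / math.sqrt(jx * jy)


# Hypothetical allele frequencies at two loci in two populations.
pop_a = [[0.9, 0.1], [1.0, 0.0]]
pop_b = [[0.8, 0.2], [1.0, 0.0]]
print(round(nei_identity(pop_a, pop_b), 3))
```

Identical frequency profiles give an identity of 1, and populations sharing no alleles give 0, so values such as 0.983 to 0.997 indicate nearly indistinguishable allele frequencies.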

Identification of Anti-cancer Peptides Based on Multi-classifier System

Combinatorial Chemistry & High Throughput Screening, 2020, Vol 22 (10), pp. 694-704. DOI: 10.2174/1386207322666191203141102. Cited By ~ 2

Author(s): Wanben Zhong, Bineng Zhong, Hongbo Zhang, Ziyi Chen, Yan Chen

Keyword(s): Machine Learning, Side Effect, Learning Models, Normal Cells, Classifier System, Prediction Rate, Anti Cancer, Feature Information, Machine Learning Models, Better Than

Aim and Objective: Cancer is one of the deadliest diseases, taking the lives of millions every year. Traditional methods of treating cancer are expensive and toxic to normal cells. Fortunately, anti-cancer peptides (ACPs) can eliminate this side effect. However, the identification and development of new anti... Materials and Methods: In our study, a multi-classifier system was used, combined with multiple machine learning models, to predict anti-cancer peptides. The individual learners are built from different feature information and algorithms, and form a multi-classifier system by voting. Results and Conclusion: The experiments show that the overall prediction rate of each individual learner is above 80% and the overall accuracy of the multi-classifier system for anti-cancer peptide prediction can reach 95.93%, which is better than the existing prediction model.
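
The voting step can be sketched as simple majority voting over the learners' labels. The abstract does not say which voting rule the authors used, so hard majority voting is an assumption here, as are the function name and the toy predictions (1 = ACP, 0 = non-ACP).

```python
from collections import Counter


def ensemble_predict(learner_outputs):
    """Combine per-learner label lists into one label per sample by majority vote.

    learner_outputs: one list of predicted labels per learner, all the same
    length. An odd number of learners avoids ties for binary labels.
    """
    combined = []
    for votes in zip(*learner_outputs):
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions from three learners on three peptides.
learner_outputs = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
print(ensemble_predict(learner_outputs))
```

Voting of this kind can exceed the accuracy of every individual learner when the learners make partly independent errors, which is consistent with the gap the abstract reports between single learners and the combined system.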

Genome-Wide Single Nucleotide Polymorphisms are Robust in Resolving Fine-Scale Population Genetic Structure of the Small Brown Planthopper, Laodelphax striatellus (Fallén) (Hemiptera: Delphacidae)

Journal of Economic Entomology, 2019, Vol 112 (5), pp. 2362-2368. DOI: 10.1093/jee/toz145

Author(s): Yan Liu, Lei Chen, Xing-Zhi Duan, Dian-Shu Zhao, Jing-Tao Sun, ...

Keyword(s): Population Structure, Discriminant Analysis, Genetic Structure, Population Genetic, Brown Planthopper, Fine Scale, Nucleotide Polymorphisms, Single Nucleotide, Small Brown Planthopper, Scale Population

Deciphering genetic structure and inferring migration routes of insects with high migratory ability have been challenging, due to weak genetic differentiation and limited resolution offered by traditional genotyping methods. Here, we tested the ability of double digest restriction-site associated DNA sequencing (ddRADseq)-based single nucleotide polymorphisms (SNPs) to reveal population structure relative to 13 microsatellite markers, using four small brown planthopper populations as subjects. Using ddRADseq, we identified 230,000 RAD loci and 5,535 SNP sites, which were present in at least 80% of individuals across the four populations with a minimum sequencing depth of 10. Our results show that this large SNP panel is more powerful than traditional microsatellite markers in revealing fine-scale population structure among the small brown planthopper populations. In contrast to the mixed population structure suggested by microsatellites, discriminant analysis of principal components (DAPC) of the SNP dataset clearly separated the individuals into four geographic populations. Our results also suggest that DAPC is more powerful than principal component analysis (PCA) in resolving the population genetic structure of highly migratory taxa, probably due to the advantages of DAPC in using more genetic variation and the discriminant analysis function. Together, these results point to ddRADseq being a promising approach for population genetic and migration studies of small brown planthopper.

Allozyme variation and an estimate of the inbreeding coefficient in the coffee berry borer, Hypothenemus hampei (Coleoptera: Scolytidae)

Bulletin of Entomological Research, 1995, Vol 85 (1), pp. 21-28. DOI: 10.1017/s000748530005197x. Cited By ~ 9

Author(s): Philippe Borsa, D. Pierre Gingerich

Keyword(s): Genetic Diversity, Population Structure, Population Genetic, New Caledonia, Allozyme Variation, Inbreeding Coefficient, Hypothenemus Hampei, Coffee Berry, Biological Features, High Degree

Seven presumed Mendelian enzyme loci (Est-2, Est-3, Gpi, Idh-1, Idh-2, Mdh-2 and Mpi) were characterized and tested for polymorphism in coffee berry borers, Hypothenemus hampei (Ferrari), sampled in Côte d'Ivoire, Mexico and New Caledonia. The average genetic diversity was H = 0.080. Two loci, Mdh-2 and Mpi, were polymorphic, and thus usable as genetic markers. The population structure of H. hampei was analysed using Weir & Cockerham's estimators of Wright's F-statistics. A high degree of inbreeding (f = 0.298) characterized the elementary geographic sampling unit, the coffee field. The estimate of gene flow between fields within a country was Nm = 10.6 and that between countries was Nm = 2. The population genetic structure in H. hampei could be related to its known population biological features and history.
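
Gene-flow estimates of this kind are commonly obtained from F-statistics through Wright's island-model approximation, Fst ≈ 1 / (4Nm + 1). The abstract does not state which formula the authors used, so the sketch below is an assumption about the general approach, and the function names are illustrative only.

```python
def nm_from_fst(fst):
    """Effective migrants per generation under Wright's island model.

    Inverts Fst ~= 1 / (4 * Nm + 1); only defined for 0 < Fst < 1.
    """
    if not 0.0 < fst < 1.0:
        raise ValueError("fst must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 4.0


def fst_from_nm(nm):
    """Expected Fst for a given Nm under the same approximation."""
    return 1.0 / (4.0 * nm + 1.0)


# Differentiation levels implied by the two Nm values quoted in the abstract.
for nm in (10.6, 2.0):
    print(nm, round(fst_from_nm(nm), 3))
```

Under this approximation a larger Nm corresponds to a smaller Fst, so the within-country estimate implies weaker differentiation among fields than the between-country estimate does among countries.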

Long-Term Experimental Evolution in Escherichia coli. VI. Environmental Constraints on Adaptation and Divergence

Genetics, 1997, Vol 146 (2), pp. 471-479. DOI: 10.1093/genetics/146.2.471. Cited By ~ 2

Author(s): Michael Travisano

Keyword(s): Escherichia Coli, Genetic Variation, Experimental Evolution, Population Genetic, Environmental Constraints, Asymmetric Pattern, Nutrient Environment, Mean Fitness, Population Genetic Variation

The effect of environment on adaptation and divergence was examined in two sets of populations of Escherichia coli selected for 1000 generations in either maltose- or glucose-limited media. Twelve replicate populations selected in maltose-limited medium improved in fitness in the selected environment, by an average of 22.5%. Statistically significant among-population genetic variation for fitness was observed during the course of the propagation, but this variation was small relative to the fitness improvement. Mean fitness in a novel nutrient environment, glucose-limited medium, improved to the same extent as in the selected environment, with no statistically significant among-population genetic variation. In contrast, 12 replicate populations previously selected for 1000 generations in glucose-limited medium showed no improvement, as a group, in fitness in maltose-limited medium and substantial genetic variation. This asymmetric pattern of correlated responses suggests that small changes in the environment can have profound effects on adaptation and divergence.

Population genetic structure of a major reef-building coral species Acropora downingi in northeastern Arabian Peninsula

Coral Reefs, 2021. DOI: 10.1007/s00338-021-02158-y

Author(s): Felipe Torquato, Jessica Bouwmeester, Pedro Range, Alyssa Marshell, Mark A. Priest, ...

Keyword(s): Population Structure, Genetic Study, Persian Gulf, Population Genetic, Coral Species, Arabian Peninsula, Ecological Speciation, Population Genetic Study, Sea Of Oman, Neutral Markers

Current seawater temperatures around the northeastern Arabian Peninsula resemble future global forecasts, as temperatures > 35 °C are commonly observed in summer. With the more fundamental aim of understanding the structure of wild populations in extreme environmental conditions, we conducted a population genetic study of a widespread, regional endemic table coral species, Acropora downingi, across the northeastern Arabian Peninsula. A total of 63 samples were collected in the southern Arabian/Persian Gulf (Abu Dhabi and Qatar) and the Sea of Oman (northeastern Oman). Using RAD-seq techniques, we described the population structure of A. downingi across the study area. Pairwise G’st and distance-based analyses using neutral markers displayed two distinct genetic clusters: one represented by Arabian/Persian Gulf individuals, and the other by Sea of Oman individuals. Nevertheless, a model-based method applied to the genetic data suggested a panmictic population encompassing both seas. Hypotheses to explain the distinctiveness of phylogeographic subregions in the northeastern Arabian Peninsula rely on either (1) bottleneck events due to successive mass coral bleaching, (2) recent founder effect, (3) ecological speciation due to the large spatial gradients in physical conditions, or (4) the combination of seascape features, ocean circulation and larval traits. Neutral markers indicated a slightly structured population of A. downingi, which excludes the ecological speciation hypothesis. Future studies across a broader range of organisms are required to furnish evidence for existing hypotheses explaining the population structure observed in the study area. Though this is the most thermally tolerant acroporid species worldwide, A. downingi corals in the Arabian/Persian Gulf have undergone major mortality events over the past three decades. Therefore, the present genetic study has important implications for understanding patterns and processes of differentiation in this group, whose populations may be pushed to extinction as the Arabian/Persian Gulf warms.

Artificial neural network-based seismic detector

Bulletin of the Seismological Society of America, 1995, Vol 85 (1), pp. 308-319. DOI: 10.1785/bssa0850010308. Cited By ~ 6

Author(s): Jin Wang, Ta-Liang Teng

Keyword(s): Neural Network, Artificial Neural Network, Input Data, Seismic Event, Signal To Noise Ratio, Type B, Earthquake Detection, Seismic Event Detection, Artificial Neural, Better Than

An artificial neural network-based pattern classification system is applied to seismic event detection. We have designed two types of Artificial Neural Detector (AND) for real-time earthquake detection. Type A artificial neural detector (AND-A) uses the recursive STA/LTA time series as input data, and type B (AND-B) uses moving window spectrograms as input data to detect earthquake signals. The two ANDs are trained under supervised learning by using a set of seismic recordings, and then the trained ANDs are applied to another set of recordings for testing. Results show that the accuracy of the artificial neural network-based seismic detectors is better than that of the conventional algorithms solely based on the STA/LTA threshold. This is especially true for signals with either low signal-to-noise ratio or spikelike noises.
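
For context, the conventional baseline mentioned above can be sketched as a recursive short-term-average over long-term-average (STA/LTA) ratio with a fixed threshold. This illustrates the threshold detector only, not the neural detectors, and the window lengths, threshold, warm-up handling, and synthetic trace are all assumptions chosen for the example.

```python
def recursive_sta_lta(samples, n_sta, n_lta):
    """Recursive STA/LTA ratio of signal energy, one value per sample.

    Both averages are exponential moving averages of x**2. Ratios during
    the first n_lta samples are set to 0 to suppress the start-up transient.
    """
    c_sta = 1.0 / n_sta
    c_lta = 1.0 / n_lta
    sta = 0.0
    lta = 1e-12  # tiny positive start value avoids division by zero
    ratios = []
    for i, x in enumerate(samples):
        energy = x * x
        sta = c_sta * energy + (1.0 - c_sta) * sta
        lta = c_lta * energy + (1.0 - c_lta) * lta
        ratios.append(sta / lta if i >= n_lta else 0.0)
    return ratios


def detect(ratios, threshold):
    """Indices of samples whose STA/LTA ratio exceeds the threshold."""
    return [i for i, r in enumerate(ratios) if r > threshold]


# Synthetic trace: low-amplitude noise with a burst starting at sample 200.
trace = [0.1] * 200 + [5.0] * 20 + [0.1] * 200
triggers = detect(recursive_sta_lta(trace, n_sta=5, n_lta=50), threshold=3.0)
print(triggers[0] if triggers else None)
```

A fixed threshold of this kind trades missed weak events against false triggers on spikes, which is the regime where the abstract reports the neural detectors doing better.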
