Visualizing population structure with variational autoencoders

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
C J Battey ◽  
Gabrielle C Coffing ◽  
Andrew D Kern

Abstract Dimensionality reduction is a common tool for visualization and inference of population structure from genotypes, but popular methods either return too many dimensions for easy plotting (PCA) or fail to preserve global geometry (t-SNE and UMAP). Here we explore the utility of variational autoencoders (VAEs)—generative machine learning models in which a pair of neural networks seek to first compress and then recreate the input data—for visualizing population genetic variation. VAEs incorporate nonlinear relationships, allow users to define the dimensionality of the latent space, and in our tests preserve global geometry better than t-SNE and UMAP. Our implementation, which we call popvae, is available as a command-line python program at github.com/kr-colab/popvae. The approach yields latent embeddings that capture subtle aspects of population structure in humans and Anopheles mosquitoes, and can generate artificial genotypes characteristic of a given sample or population.
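
The compress-then-recreate objective described in this abstract can be sketched as the usual VAE loss: a reconstruction term plus a KL penalty that pulls the latent distribution toward a standard normal. The following is a minimal illustration for a single sample, not popvae's actual code. The function names, the Bernoulli (cross-entropy) reconstruction term, and the scaling of genotypes to [0, 1] are assumptions made for the example.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, I), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def reconstruction_loss(x, x_hat, eps=1e-7):
    """Binary cross-entropy between scaled genotypes x and decoder output x_hat.

    Assumes both are lists of values in [0, 1] (hypothetical encoding).
    """
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)  # keep log() finite
        total -= xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def vae_loss(x, x_hat, mu, log_var):
    """Negative evidence lower bound for one sample: reconstruction + KL."""
    return reconstruction_loss(x, x_hat) + kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.3, -0.1], [0.0, 0.0]  # a 2-D latent space, convenient for plotting
    z = reparameterize(mu, log_var, rng)
    print(len(z), vae_loss([0.0, 0.5, 1.0], [0.1, 0.5, 0.9], mu, log_var))
```

Setting the latent dimensionality to two, as in the toy call above, is what makes the encoder output directly plottable.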


2015 ◽  
Vol 112 (26) ◽  
pp. E3441-E3450 ◽  
Author(s):  
David Mimno ◽  
David M. Blei ◽  
Barbara E. Engelhardt

Admixture models are a ubiquitous approach to capture latent population structure in genetic samples. Despite the widespread application of admixture models, little thought has been devoted to the quality of the model fit or the accuracy of the estimates of parameters of interest for a particular study. Here we develop methods for validating admixture models based on posterior predictive checks (PPCs), a Bayesian method for assessing the quality of fit of a statistical model to a specific dataset. We develop PPCs for five population-level statistics of interest: within-population genetic variation, background linkage disequilibrium, number of ancestral populations, between-population genetic variation, and the downstream use of admixture parameters to correct for population structure in association studies. Using PPCs, we evaluate the quality of the admixture model fit to four qualitatively different population genetic datasets: the population reference sample (POPRES) European individuals, the HapMap phase 3 individuals, continental Indians, and African American individuals. We found that the same model fitted to different genomic studies resulted in highly study-specific results when evaluated using PPCs, illustrating the utility of PPCs for model-based analyses in large genomic studies.
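
The general shape of a posterior predictive check can be sketched in a few lines: simulate replicated datasets from posterior draws, compute a discrepancy statistic on each, and ask how often the replicates are at least as extreme as the observed data. This is a generic illustration under stated assumptions, not the authors' implementation. The function name, the toy Bernoulli allele model, and the choice of discrepancy are all hypothetical.

```python
import random


def posterior_predictive_pvalue(observed, posterior_draws, simulate, discrepancy, rng):
    """Fraction of replicated datasets whose discrepancy is >= the observed one.

    simulate(theta, n, rng) must return a replicated dataset of size n;
    discrepancy(data) must return a single number.
    """
    observed_stat = discrepancy(observed)
    exceed = 0
    for theta in posterior_draws:
        replicated = simulate(theta, len(observed), rng)
        if discrepancy(replicated) >= observed_stat:
            exceed += 1
    return exceed / len(posterior_draws)


if __name__ == "__main__":
    rng = random.Random(1)

    # Toy model (assumption): each observation is one allele copy, 1 with frequency theta.
    def simulate(theta, n, rng):
        return [1 if rng.random() < theta else 0 for _ in range(n)]

    def allele_count(data):
        return sum(data)

    observed = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
    posterior_draws = [rng.betavariate(8, 4) for _ in range(1000)]  # stand-in posterior
    print(posterior_predictive_pvalue(observed, posterior_draws, simulate, allele_count, rng))
```

A value near 0 or 1 suggests the fitted model rarely reproduces the observed statistic, which is the kind of misfit a PPC is designed to flag.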


Author(s):  
Jette Henderson ◽  
Shubham Sharma ◽  
Alan Gee ◽  
Valeri Alexiev ◽  
Steve Draper ◽  
...  

As more companies and governments build and use machine learning models to automate decisions, there is an ever-growing need to monitor and evaluate these models' behavior once they are deployed. Our team at CognitiveScale has developed a toolkit called Cortex Certifai to answer this need. Cortex Certifai is a framework that assesses aspects of robustness, fairness, and interpretability of any classification or regression model trained on tabular data, without requiring access to its internal workings. Additionally, Cortex Certifai allows users to compare models along these different axes and only requires 1) query access to the model and 2) an “evaluation” dataset. At its foundation, Cortex Certifai generates counterfactual explanations, which are synthetic data points close to input data points but differing in terms of model prediction. The tool then harnesses characteristics of these counterfactual explanations to analyze different aspects of the supplied model and delivers evaluations relevant to a variety of different stakeholders (e.g., model developers, risk analysts, compliance officers). Cortex Certifai can be configured and executed using a command-line interface (CLI), within jupyter notebooks, or on the cloud, and the results are recorded in JSON files and can be visualized in an interactive console. Using these reports, stakeholders can understand, monitor, and build trust in their AI systems. In this paper, we provide a brief overview of a demonstration of Cortex Certifai's capabilities.


Genome ◽  
1991 ◽  
Vol 34 (3) ◽  
pp. 396-406 ◽  
Author(s):  
Hedi Baatout ◽  
Daniel Combes ◽  
Mohamed Marrakchi

Several samples of wild populations of two subspecies of the genus Hedysarum (H. spinosissimum subspecies capitatum, an outcrosser, and H. spinosissimum subspecies euspinosissimum, a selfer) were examined with respect to variability of 25 quantitative characters and allozyme variation at 13 loci. The amount of phenotypic and genetic variation within and among populations was documented. For most of the 25 quantitative characters, the differences between population means and between the total variances of the populations were higher in the selfer than in the outbreeder. Significant among-population genetic variation was found for nearly all characters in the two subspecies, but the outbreeder had higher within-population variability than the selfer with heterogeneity among characters. However, allozyme variation at 13 loci in about the same number of populations showed higher levels of genetic variability in the outcrossing subspecies capitatum compared with the selfing subspecies euspinosissimum, based on measures of mean number of alleles per locus, mean proportion of polymorphic loci, and mean heterozygosity. Therefore, H. spinosissimum subsp. capitatum appeared to be highly polymorphic in contrast to the greater monomorphism within populations of H. spinosissimum subsp. euspinosissimum. The genetic affinities of different populations of a subspecies are uniformly high, with Nei's genetic identity ranging from 0.983 to 0.997 in the selfing subspecies euspinosissimum and from 0.922 to 1.000 in the outcrossing subspecies capitatum. Key words: Hedysarum, genetic variation, populations, electrophoresis.
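
Nei's genetic identity, the statistic quoted at the end of this abstract, is the normalized product of allele frequencies between two populations, averaged over loci: I = J_xy / sqrt(J_x * J_y). A minimal sketch follows. The input layout (one list of allele frequencies per locus, alleles in the same order in both populations) and the example frequencies are assumptions for illustration and are not data from the study.

```python
import math


def nei_identity(pop_x, pop_y):
    """Nei's (1972) genetic identity between two populations.

    pop_x and pop_y are lists of loci; each locus is a list of allele
    frequencies, with alleles listed in the same order in both populations.
    """
    n_loci = len(pop_x)
    j_x = sum(sum(p * p for p in locus) for locus in pop_x) / n_loci
    j_y = sum(sum(q * q for q in locus) for locus in pop_y) / n_loci
    j_xy = sum(
        sum(p * q for p, q in zip(locus_x, locus_y))
        for locus_x, locus_y in zip(pop_x, pop_y)
    ) / n_loci
    return j_xy / math.sqrt(j_x * j_y)


if __name__ == "__main__":
    # Made-up frequencies at two loci, for illustration only.
    population_a = [[0.9, 0.1], [0.5, 0.5]]
    population_b = [[0.8, 0.2], [0.6, 0.4]]
    print(nei_identity(population_a, population_b))
```

Identity is 1 when the two populations share identical allele frequencies and 0 when they share no alleles.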


2020 ◽  
Vol 22 (10) ◽  
pp. 694-704 ◽  
Author(s):  
Wanben Zhong ◽  
Bineng Zhong ◽  
Hongbo Zhang ◽  
Ziyi Chen ◽  
Yan Chen

Aim and Objective: Cancer is one of the deadliest diseases, taking the lives of millions every year. Traditional methods of treating cancer are expensive and toxic to normal cells. Fortunately, anti-cancer peptides (ACPs) can eliminate this side effect. However, the identification and development of new anti Materials and Methods: In our study, a multi-classifier system, combining multiple machine learning models, was used to predict anti-cancer peptides. The individual learners are built from different feature information and algorithms, and form a multi-classifier system by voting. Results and Conclusion: The experiments show that the overall prediction rate of each individual learner is above 80% and that the overall accuracy of the multi-classifier system for anti-cancer peptide prediction can reach 95.93%, which is better than the existing prediction model.
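
Combining individual learners by voting can be sketched as a simple majority vote over their predicted labels. The abstract does not specify the voting rule, so plain (unweighted) majority voting is an assumption here, and the three toy "learners" below are placeholder rules invented for the example, not real ACP predictors.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def ensemble_predict(learners, sample):
    """Ask every learner for a label and combine the answers by majority vote."""
    votes = [learner(sample) for learner in learners]
    return majority_vote(votes)


if __name__ == "__main__":
    # Hypothetical stand-in learners operating on a peptide sequence string.
    learners = [
        lambda seq: 1 if seq.count("K") >= 2 else 0,
        lambda seq: 1 if len(seq) < 30 else 0,
        lambda seq: 1 if "L" in seq else 0,
    ]
    print(ensemble_predict(learners, "GKLKAAL"))
```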


2019 ◽  
Vol 112 (5) ◽  
pp. 2362-2368
Author(s):  
Yan Liu ◽  
Lei Chen ◽  
Xing-Zhi Duan ◽  
Dian-Shu Zhao ◽  
Jing-Tao Sun ◽  
...  

Abstract Deciphering genetic structure and inferring migration routes of insects with high migratory ability have been challenging, due to weak genetic differentiation and limited resolution offered by traditional genotyping methods. Here, we tested the ability of double digest restriction-site associated DNA sequencing (ddRADseq)-based single nucleotide polymorphisms (SNPs) to reveal population structure, relative to 13 microsatellite markers, using four small brown planthopper populations as subjects. Using ddRADseq, we identified 230,000 RAD loci and 5,535 SNP sites, which were present in at least 80% of individuals across the four populations with a minimum sequencing depth of 10. Our results show that this large SNP panel is more powerful than traditional microsatellite markers in revealing fine-scale population structure among the small brown planthopper populations. In contrast to the mixed population structure suggested by microsatellites, discriminant analysis of principal components (DAPC) of the SNP dataset clearly separated the individuals into four geographic populations. Our results also suggest that DAPC is more powerful than principal component analysis (PCA) in resolving the population genetic structure of highly migratory taxa, probably due to the advantages of DAPC in using more genetic variation and the discriminant analysis function. Together, these results point to ddRADseq being a promising approach for population genetic and migration studies of the small brown planthopper.


1995 ◽  
Vol 85 (1) ◽  
pp. 21-28 ◽  
Author(s):  
Philippe Borsa ◽  
D. Pierre Gingerich

Abstract Seven presumed Mendelian enzyme loci (Est-2, Est-3, Gpi, Idh-1, Idh-2, Mdh-2 and Mpi) were characterized and tested for polymorphism in coffee berry borers, Hypothenemus hampei (Ferrari), sampled in Côte d'Ivoire, Mexico and New Caledonia. The average genetic diversity was H = 0.080. Two loci, Mdh-2 and Mpi, were polymorphic, and thus usable as genetic markers. The population structure of H. hampei was analysed using Weir & Cockerham's estimators of Wright's F-statistics. A high degree of inbreeding (f = 0.298) characterized the elementary geographic sampling unit, the coffee field. The estimate of gene flow between fields within a country was Nm = 10.6 and that between countries was Nm = 2. The population genetic structure in H. hampei could be related to its known population biological features and history.
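
Gene-flow estimates of this kind are commonly derived from F-statistics through Wright's island-model approximation, Fst ≈ 1 / (4Nm + 1). The abstract does not state which conversion was used, so applying this formula here is an assumption, and the function names are hypothetical. Under that approximation, Nm = 2 corresponds to Fst = 1/9.

```python
def nm_from_fst(fst):
    """Effective migrants per generation under Wright's island model.

    Uses Nm = (1 - Fst) / (4 * Fst); valid only for 0 < Fst < 1.
    """
    if not 0.0 < fst < 1.0:
        raise ValueError("Fst must lie strictly between 0 and 1")
    return (1.0 - fst) / (4.0 * fst)


def fst_from_nm(nm):
    """Inverse relation: Fst = 1 / (4 * Nm + 1)."""
    if nm < 0.0:
        raise ValueError("Nm must be non-negative")
    return 1.0 / (4.0 * nm + 1.0)


if __name__ == "__main__":
    for nm in (10.6, 2.0):  # the two Nm values quoted in the abstract
        print(nm, fst_from_nm(nm))
```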


Genetics ◽  
1997 ◽  
Vol 146 (2) ◽  
pp. 471-479 ◽  
Author(s):  
Michael Travisano

The effect of environment on adaptation and divergence was examined in two sets of populations of Escherichia coli selected for 1000 generations in either maltose- or glucose-limited media. Twelve replicate populations selected in maltose-limited medium improved in fitness in the selected environment, by an average of 22.5%. Statistically significant among-population genetic variation for fitness was observed during the course of the propagation, but this variation was small relative to the fitness improvement. Mean fitness in a novel nutrient environment, glucose-limited medium, improved to the same extent as in the selected environment, with no statistically significant among-population genetic variation. In contrast, 12 replicate populations previously selected for 1000 generations in glucose-limited medium showed no improvement, as a group, in fitness in maltose-limited medium and substantial genetic variation. This asymmetric pattern of correlated responses suggests that small changes in the environment can have profound effects on adaptation and divergence.


Coral Reefs ◽  
2021 ◽  
Author(s):  
Felipe Torquato ◽  
Jessica Bouwmeester ◽  
Pedro Range ◽  
Alyssa Marshell ◽  
Mark A. Priest ◽  
...  

Abstract Current seawater temperatures around the northeastern Arabian Peninsula resemble future global forecasts, as temperatures > 35 °C are commonly observed in summer. To better understand the structure of wild populations under extreme environmental conditions, we conducted a population genetic study of a widespread, regionally endemic table coral species, Acropora downingi, across the northeastern Arabian Peninsula. A total of 63 samples were collected in the southern Arabian/Persian Gulf (Abu Dhabi and Qatar) and the Sea of Oman (northeastern Oman). Using RAD-seq techniques, we described the population structure of A. downingi across the study area. Pairwise G’st and distance-based analyses using neutral markers displayed two distinct genetic clusters: one represented by Arabian/Persian Gulf individuals, and the other by Sea of Oman individuals. Nevertheless, a model-based method applied to the genetic data suggested a panmictic population encompassing both seas. Hypotheses to explain the distinctiveness of phylogeographic subregions in the northeastern Arabian Peninsula rely on either (1) bottleneck events due to successive mass coral bleaching, (2) recent founder effect, (3) ecological speciation due to the large spatial gradients in physical conditions, or (4) the combination of seascape features, ocean circulation and larval traits. Neutral markers indicated a slightly structured population of A. downingi, which excludes the ecological speciation hypothesis. Future studies across a broader range of organisms are required to furnish evidence for existing hypotheses explaining the population structure observed in the study area. Though this is the most thermally tolerant acroporid species worldwide, A. downingi corals in the Arabian/Persian Gulf have undergone major mortality events over the past three decades. Therefore, the present genetic study has important implications for understanding patterns and processes of differentiation in this group, whose populations may be pushed to extinction as the Arabian/Persian Gulf warms.


1995 ◽  
Vol 85 (1) ◽  
pp. 308-319 ◽  
Author(s):  
Jin Wang ◽  
Ta-Liang Teng

Abstract An artificial neural network-based pattern classification system is applied to seismic event detection. We have designed two types of Artificial Neural Detector (AND) for real-time earthquake detection. Type A artificial neural detector (AND-A) uses the recursive STA/LTA time series as input data, and type B (AND-B) uses moving window spectrograms as input data to detect earthquake signals. The two ANDs are trained under supervised learning by using a set of seismic recordings, and then the trained ANDs are applied to another set of recordings for testing. Results show that the accuracy of the artificial neural network-based seismic detectors is better than that of the conventional algorithms solely based on the STA/LTA threshold. This is especially true for signals with either low signal-to-noise ratio or spike-like noise.
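
The conventional baseline mentioned here, a threshold on the recursive STA/LTA ratio, can be sketched as two exponentially weighted averages of signal energy: a short-term average (STA) that reacts quickly and a long-term average (LTA) that tracks background noise. This is a generic illustration, not the authors' detector. The window lengths, the threshold, and the synthetic trace are assumptions chosen for the example.

```python
def recursive_sta_lta(samples, n_sta, n_lta):
    """Recursive STA/LTA ratio of squared amplitudes.

    n_sta and n_lta are the short and long window lengths in samples.
    The ratio is reported as 0 for the first n_lta samples, while the
    long-term average is still settling.
    """
    c_sta = 1.0 / n_sta
    c_lta = 1.0 / n_lta
    sta = 0.0
    lta = 1e-99  # tiny positive start value avoids division by zero
    ratios = []
    for i, x in enumerate(samples):
        energy = x * x
        sta = c_sta * energy + (1.0 - c_sta) * sta
        lta = c_lta * energy + (1.0 - c_lta) * lta
        ratios.append(sta / lta if i >= n_lta else 0.0)
    return ratios


def trigger_indices(ratios, threshold):
    """Indices at which the STA/LTA ratio exceeds the detection threshold."""
    return [i for i, r in enumerate(ratios) if r > threshold]


if __name__ == "__main__":
    # Synthetic trace (assumption): low-level noise, a burst, then noise again.
    trace = [0.1] * 200 + [5.0] * 20 + [0.1] * 50
    ratios = recursive_sta_lta(trace, n_sta=5, n_lta=50)
    print(trigger_indices(ratios, threshold=3.0)[:5])
```

The AND-A detector described in the abstract takes a time series like `ratios` as its input rather than thresholding it directly.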

