Haplotype and Population Structure Inference using Neural Networks in Whole-Genome Sequencing Data

2020
Author(s):
Jonas Meisner
Anders Albrechtsen

Abstract: Accurate inference of population structure is important in many studies of population genetics. In this paper, we present HaploNet, a novel method for performing dimensionality reduction and clustering in genetic data. The method is based on local clustering of phased haplotypes, from whole-genome sequencing or genotype data, using neural networks. By utilizing a Gaussian mixture prior in a variational autoencoder framework, we are able to learn a low-dimensional latent space in which we cluster haplotypes along the genome in a highly scalable manner. We demonstrate that we can use encodings of the latent space to infer global population structure using principal component analysis with haplotype information. Additionally, we derive an expectation-maximization algorithm for estimating ancestry proportions based on the haplotype clustering and the neural networks in a likelihood framework. Using different examples of sequencing data, we demonstrate that our approach is better at distinguishing closely related populations than standard principal component analysis and admixture analysis. We show that HaploNet performs similarly to ChromoPainter for principal component analysis while being much faster and allowing for unsupervised clustering.

PLoS ONE
2020
Vol 15 (12)
pp. e0242954
Author(s):  
Tomokazu Konishi

Coronaviruses and influenza viruses have similarities and differences. In order to comprehensively compare them, their genome sequencing data were examined by principal component analysis. Coronaviruses had fewer variations than a subclass of influenza viruses. In addition, differences among coronaviruses that infect a variety of hosts were also small. These characteristics may have facilitated the infection of different hosts. Although many of the coronaviruses were conservative, those repeatedly found among humans showed annual changes. If SARS-CoV-2 changes its genome like the Influenza H type, it will repeatedly spread every few years. In addition, the coronavirus family has many other candidates for new pandemics.


2020
Vol 47 (12)
pp. 9995-10003
Author(s):  
Agnieszka Kaczmarczyk-Ziemba

Abstract: The freshwater true bug Aphelocheirus aestivalis (Aphelocheiridae) is widely distributed in Europe but occurs rather locally and often in isolated populations. Moreover, it is threatened with extinction in parts of its range. Unfortunately, little is known about its genetic diversity and population structure due to the lack of molecular tools for this species. To overcome this limitation, whole-genome sequencing was performed on the Illumina MiSeq platform to identify polymorphic microsatellite markers for A. aestivalis. The paired-end reads obtained were processed and overlapped into 2,378,426 sequences, and a subset of 267 sequences containing microsatellite motifs was then used for in silico primer design. Finally, 56 microsatellite markers were determined, of which 34 were polymorphic. Analyses performed in two samples (collected from the Drawa and Gowienica rivers, respectively) showed that the number of alleles per locus ranged from 2 to 21, and the observed and expected heterozygosity varied from 0 to 0.933 and from 0.064 to 0.931, respectively. The microsatellite markers developed in the present study provide the scientific community with new tools for studying A. aestivalis population dynamics. The assessment of its genetic diversity and population structure will provide important data that can be used in population management and conservation efforts, elucidating the broad- and fine-scale population genetic structure of A. aestivalis.
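The observed and expected heterozygosity values reported above are per-locus summary statistics. As a minimal sketch of how such values are commonly computed (the abstract does not say which estimator or software was used, so the plain Hardy-Weinberg formula below is an assumption, and the genotypes are hypothetical):

```python
from collections import Counter


def heterozygosity(genotypes):
    """Return (observed, expected) heterozygosity for one locus.

    genotypes: list of (allele_a, allele_b) pairs, one pair per diploid individual.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("need at least one genotype")
    # Observed heterozygosity: fraction of individuals carrying two different alleles.
    observed = sum(1 for a, b in genotypes if a != b) / n
    # Allele counts pooled over both chromosome copies of every individual.
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = 2 * n
    # Expected heterozygosity under Hardy-Weinberg: 1 minus the sum of squared frequencies.
    expected = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return observed, expected


# Hypothetical microsatellite genotypes (allele sizes in bp) for four individuals.
example = [(152, 152), (152, 156), (156, 156), (152, 156)]
ho, he = heterozygosity(example)
print(ho, he)
```

Published studies often apply a small-sample correction to the expected value; that correction is deliberately left out here to keep the formula visible.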


2016
Author(s):
Keurcien Luu
Eric Bazin
Michael G. B. Blum

Abstract: The R package pcadapt performs genome scans to detect genes under selection based on population genomic data. It assumes that candidate markers are outliers with respect to how they are related to population structure. Because population structure is ascertained with principal component analysis, the package is fast and works with large-scale data. It can handle missing data and pooled sequencing data. In contrast to population-based approaches, the package handles admixed individuals and does not require grouping individuals into populations. Since its first release, pcadapt has evolved both in terms of statistical approach and software implementation. We present results obtained with robust Mahalanobis distance, which is a new statistic for genome scans available in the 2.0 and later versions of the package. When hierarchical population structure occurs, Mahalanobis distance is more powerful than the communality statistic that was implemented in the first version of the package. Using simulated data, we compare pcadapt to other software for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find that the proportion of false discoveries is around the nominal false discovery rate of 10%, with the exception of BayeScan, which generates 40% false discoveries. We also find that the power of BayeScan is severely impacted by the presence of admixed individuals, whereas pcadapt is not impacted. Last, we find that pcadapt and hapflk are the most powerful software packages in scenarios of population divergence and range expansion. Because pcadapt handles next-generation sequencing data, it is a valuable tool for data analysis in molecular ecology.
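The idea of scoring markers by a Mahalanobis distance on their relationship to the principal components can be illustrated with a small Python sketch. This is not pcadapt's implementation: the function name `mahalanobis_scan` is hypothetical, the input is simulated, and the sketch uses the ordinary mean and covariance where the abstract describes a robust estimate.

```python
import numpy as np


def mahalanobis_scan(genotypes, k):
    """Score each SNP by the Mahalanobis distance of its PC-regression z-scores.

    genotypes: array of shape (n_individuals, n_snps); k: number of PCs kept.
    NOTE: uses the plain (non-robust) mean and covariance, as a simplification.
    """
    n = genotypes.shape[0]
    # Centre and scale every SNP column; constant columns keep a scale of 1.
    G = genotypes - genotypes.mean(axis=0)
    sd = G.std(axis=0)
    sd[sd == 0] = 1.0
    G = G / sd
    # PCA via SVD: the first k left singular vectors are orthonormal PC scores.
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    scores = U[:, :k]
    # Because the scores are orthonormal, the regression coefficients are U^T G.
    beta = scores.T @ G
    resid = G - scores @ beta
    sigma = np.sqrt((resid ** 2).sum(axis=0) / (n - k - 1))
    sigma[sigma == 0] = 1.0
    Z = (beta / sigma).T  # one row of k z-scores per SNP
    # Squared Mahalanobis distance of each row from the centre of all rows.
    centred = Z - Z.mean(axis=0)
    cov = np.atleast_2d(np.cov(Z, rowvar=False))
    inv_cov = np.linalg.pinv(cov)
    return np.einsum("ij,jk,ik->i", centred, inv_cov, centred)


# Simulated genotypes (0, 1 or 2 copies of an allele) for 40 individuals and 200 SNPs.
rng = np.random.default_rng(0)
simulated = rng.integers(0, 3, size=(40, 200)).astype(float)
d2 = mahalanobis_scan(simulated, k=2)
print(d2.shape)
```

SNPs with unusually large distances would be the candidate outliers; turning the distances into p-values and controlling the false discovery rate is outside the scope of this sketch.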


2020
Author(s):  
T. Konishi

Abstract: The coronavirus and the influenza virus have similarities and differences. In order to comprehensively compare them, their genome sequencing data were examined by principal component analysis. Variations in coronavirus were smaller than those in a subclass of the influenza virus. In addition, differences among coronaviruses in a variety of hosts were small. These characteristics may have facilitated the infection of different hosts. Although many of the coronaviruses were more conservative, those repeatedly found among humans showed annual changes. If SARS-CoV-2 changes its genome like the Influenza H type, it will repeatedly spread every few years. In addition, the coronavirus family has many other candidates for subsequent pandemics.

One Sentence Summary: The genome data of coronavirus were compared to influenza virus, to investigate its spreading mechanism and future status. Coronavirus would repeatedly spread every few years. In addition, the coronavirus family has many other candidates for subsequent pandemics.


2018
Author(s):
Jonas Meisner
Anders Albrechtsen

Abstract: We present two methods for inferring population structure and admixture proportions in low-depth next-generation sequencing data. Inference of population structure is essential in both population genetics and association studies and is often performed using principal component analysis or clustering-based approaches. Next-generation sequencing methods provide large amounts of genetic data but are associated with statistical uncertainty, especially for low-depth sequencing data. Models can account for this uncertainty by working directly on genotype likelihoods of the unobserved genotypes. We propose a method for inferring population structure through principal component analysis in an iterative approach of estimating individual allele frequencies, where we demonstrate improved accuracy in samples with low and variable sequencing depth for both simulated and real datasets. We also use the estimated individual allele frequencies in a fast non-negative matrix factorization method to estimate admixture proportions. Both methods have been implemented in the PCAngsd framework available at http://www.popgen.dk/software/.
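To make "working directly on genotype likelihoods" concrete, the sketch below combines genotype likelihoods with an allele-frequency prior to get a posterior expected genotype (dosage). It is a simplified illustration rather than the PCAngsd code: the function name `expected_dosage` is hypothetical, the prior assumes Hardy-Weinberg proportions, and the allele frequency is taken as given instead of being estimated iteratively per individual.

```python
def expected_dosage(likelihoods, freq):
    """Posterior mean genotype (0 to 2) for one individual at one site.

    likelihoods: (L0, L1, L2), the probability of the sequencing reads given
                 0, 1 or 2 copies of the allele.
    freq: allele frequency used to build the genotype prior (assumed known).
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must lie in [0, 1]")
    # Genotype prior under Hardy-Weinberg proportions.
    prior = ((1.0 - freq) ** 2, 2.0 * freq * (1.0 - freq), freq ** 2)
    # Bayes' rule: the posterior is proportional to likelihood times prior.
    weights = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(weights)
    if total == 0:
        raise ValueError("likelihoods and prior have no overlapping support")
    posterior = [w / total for w in weights]
    # The expected dosage is the posterior-weighted genotype value.
    return sum(g * p for g, p in enumerate(posterior))


# Uninformative reads: the dosage falls back to the prior mean, 2 * freq.
print(expected_dosage((1.0, 1.0, 1.0), 0.5))
# Reads that only support the homozygous reference genotype.
print(expected_dosage((1.0, 0.0, 0.0), 0.3))
```

With little sequencing data the likelihoods are nearly flat and the prior dominates, which is why the quality of the allele-frequency estimate matters most for low-depth samples.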


2021
Vol 9 (8)
pp. 1585
Author(s):
Ana C. Reis
Liliana C. M. Salvador
Suelee Robbe-Austerman
Rogério Tenreiro
Ana Botelho
...

Classical molecular analyses of Mycobacterium bovis based on spoligotyping and Variable Number Tandem Repeat (MIRU-VNTR) brought the first insights into the epidemiology of animal tuberculosis (TB) in Portugal, showing high genotypic diversity of circulating strains that mostly cluster within the European 2 clonal complex. Previous surveillance provided valuable information on the prevalence and spatial occurrence of TB and highlighted prevalent genotypes in areas where livestock and wild ungulates are sympatric. However, links at the wildlife–livestock interfaces were established mainly via classical genotype associations. Here, we apply whole genome sequencing (WGS) to cattle, red deer and wild boar isolates to reconstruct the M. bovis population structure in a multi-host, multi-region disease system and to explore links at a fine genomic scale between M. bovis from wildlife hosts and cattle. Whole genome sequences of 44 representative M. bovis isolates, obtained between 2003 and 2015 from three TB hotspots, were compared through single nucleotide polymorphism (SNP) variant calling analyses. Consistent with previous results combining classical genotyping with Bayesian population admixture modelling, SNP-based phylogenies support the branching of this M. bovis population into five genetic clades, three with apparent geographic specificities, as well as the establishment of an SNP catalogue specific to each clade, which may be explored in the future as phylogenetic markers. The core genome alignment of SNPs was integrated within a spatiotemporal metadata framework to further structure this M. bovis population by host species and TB hotspots, providing a baseline for network analyses in different epidemiological and disease control contexts. WGS of M. bovis isolates from Portugal is reported for the first time in this pilot study, refining the spatiotemporal context of TB at the wildlife–livestock interface and providing further support for the key role of red deer and wild boar in disease maintenance. The SNP diversity observed within this dataset supports the natural circulation of M. bovis over a long time period, as well as multiple introduction events of the pathogen in this Iberian multi-host system.

