Haplotype and Population Structure Inference using Neural Networks in Whole-Genome Sequencing Data

Population Structure ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Principal Component ◽

Component Analysis ◽

Whole Genome ◽

Sequencing Data ◽

Latent Space

Accurate inference of population structure is important in many studies of population genetics. In this paper, we present HaploNet, a novel method for performing dimensionality reduction and clustering in genetic data. The method is based on local clustering of phased haplotypes using neural networks from whole-genome sequencing or genotype data. By utilizing a Gaussian mixture prior in a variational autoencoder framework, we are able to learn a low-dimensional latent space in which we cluster haplotypes along the genome in a highly scalable manner. We demonstrate that we can use encodings of the latent space to infer global population structure using principal component analysis with haplotype information. Additionally, we derive an expectation-maximization algorithm for estimating ancestry proportions based on the haplotype clustering and the neural networks in a likelihood framework. Using different examples of sequencing data, we demonstrate that our approach is better at distinguishing closely related populations than standard principal component analysis and admixture analysis. We show that HaploNet performs similarly to ChromoPainter for principal component analysis while being much faster and allowing for unsupervised clustering.
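The expectation-maximization idea for ancestry proportions can be illustrated with a generic sketch. This is not HaploNet's algorithm: it assumes a hypothetical matrix `lik` of fixed likelihoods, one row per genomic window and one column per ancestral cluster, and estimates only the mixing proportions for a single individual. The function name and parameters are illustrative.

```python
import numpy as np

def em_mixture_proportions(lik, n_iter=100, tol=1e-8):
    """Estimate mixing proportions q from fixed likelihoods lik[w, k].

    Assumes every row of lik has at least one positive entry.
    """
    lik = np.asarray(lik, dtype=float)
    n_clusters = lik.shape[1]
    q = np.full(n_clusters, 1.0 / n_clusters)  # start from uniform proportions
    for _ in range(n_iter):
        # E-step: posterior responsibility of each cluster for each window
        weighted = lik * q
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: proportions are the average responsibilities
        q_new = resp.mean(axis=0)
        converged = np.max(np.abs(q_new - q)) < tol
        q = q_new
        if converged:
            break
    return q

# Toy input: three windows fit cluster 0 only, one window fits cluster 1 only
toy_lik = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
proportions = em_mixture_proportions(toy_lik)
```

In a full admixture model the cluster likelihoods would themselves depend on estimated parameters; here they are held fixed to keep the update rule visible.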

Detecting the Population Structure and Scanning for Signatures of Selection in Horses (Equus caballus) From Whole-Genome Sequencing Data

Evolutionary Bioinformatics ◽

10.1177/1176934318775106 ◽

2018 ◽

Vol 14 ◽

pp. 117693431877510 ◽

Cited By ~ 9

Author(s):

Cheng Zhang ◽

Pan Ni ◽

Hafiz Ishfaq Ahmad ◽

M Gemingguli ◽

A Baizilaitibei ◽

...

Keyword(s):

Population Structure ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Equus Caballus ◽

Whole Genome ◽

Sequencing Data ◽

Signatures Of Selection

Principal component analysis of coronaviruses reveals their diversity and seasonal and pandemic potential

PLoS ONE ◽

10.1371/journal.pone.0242954 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0242954

Author(s):

Tomokazu Konishi

Keyword(s):

Genome Sequencing ◽

Principal Component ◽

Component Analysis ◽

Influenza Viruses ◽

Sequencing Data ◽

Similarities And Differences ◽

Annual Changes

Coronaviruses and influenza viruses have similarities and differences. In order to comprehensively compare them, their genome sequencing data were examined by principal component analysis. Coronaviruses had fewer variations than a subclass of influenza viruses. In addition, differences among coronaviruses that infect a variety of hosts were also small. These characteristics may have facilitated the infection of different hosts. Although many of the coronaviruses were conservative, those repeatedly found among humans showed annual changes. If SARS-CoV-2 changes its genome like the Influenza H type, it will repeatedly spread every few years. In addition, the coronavirus family has many other candidates for new pandemics.

Relating Phage Genomes to Helicobacter pylori Population Structure: General Steps Using Whole-Genome Sequencing Data

International Journal of Molecular Sciences ◽

10.3390/ijms19071831 ◽

2018 ◽

Vol 19 (7) ◽

pp. 1831 ◽

Cited By ~ 4

Author(s):

Filipa Vale ◽

Philippe Lehours

Keyword(s):

Helicobacter Pylori ◽

Population Structure ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome ◽

Sequencing Data

Rapid development of 56 novel microsatellite markers for the benthic freshwater bug Aphelocheirus aestivalis using Illumina paired-end sequencing data and M13-tailed primers

Molecular Biology Reports ◽

10.1007/s11033-020-05974-7 ◽

2020 ◽

Vol 47 (12) ◽

pp. 9995-10003

Author(s):

Agnieszka Kaczmarczyk-Ziemba

Keyword(s):

Genetic Diversity ◽

Population Structure ◽

Microsatellite Markers ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Rapid Development ◽

Whole Genome ◽

Sequencing Data ◽

Isolated Populations ◽

Two Samples

The freshwater true bug Aphelocheirus aestivalis (Aphelocheiridae) is widely distributed in Europe but occurs rather locally and often in isolated populations. Moreover, it is threatened with extinction in parts of its range. Unfortunately, little is known about its genetic diversity and population structure due to the lack of molecular tools for this species. Thus, to overcome these limitations, whole-genome sequencing was performed to identify polymorphic microsatellite markers for A. aestivalis. The sequencing was performed on the Illumina MiSeq platform. The paired-end reads obtained were processed and overlapped into 2,378,426 sequences, and the subset of 267 sequences containing microsatellite motifs was then used for in silico primer design. Finally, 56 microsatellite markers were determined, and 34 of them were polymorphic. Analyses performed in two samples (collected from the Drawa and Gowienica rivers, respectively) showed that the number of alleles per locus ranged from 2 to 21, and the observed and expected heterozygosity varied from 0 to 0.933 and 0.064 to 0.931, respectively. The microsatellite markers developed in the present study provide the scientific community with new tools to study A. aestivalis population dynamics. The assessment of its genetic diversity and population structure will provide important data that can be used in population management and conservation efforts, elucidating the broad- and fine-scale population genetic structure of A. aestivalis.
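The per-locus statistics reported above (number of alleles, observed and expected heterozygosity) can be computed from diploid genotypes as in the following minimal sketch. The abstract does not say which estimator the authors used; this version assumes the simple expected heterozygosity 1 - sum(p_i^2) without a small-sample correction, and the allele sizes in `locus` are made up for illustration.

```python
from collections import Counter

def observed_heterozygosity(genotypes):
    """Fraction of individuals carrying two different alleles at a locus."""
    het = sum(1 for a, b in genotypes if a != b)
    return het / len(genotypes)

def expected_heterozygosity(genotypes):
    """1 - sum(p_i^2) over allele frequencies p_i (no small-sample correction)."""
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def allele_count(genotypes):
    """Number of distinct alleles observed at a locus."""
    return len({allele for pair in genotypes for allele in pair})

# Hypothetical microsatellite genotypes (allele sizes in base pairs)
locus = [(150, 150), (150, 154), (154, 158), (150, 150)]
ho = observed_heterozygosity(locus)
he = expected_heterozygosity(locus)
n_alleles = allele_count(locus)
```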

Pcadapt: An R Package to Perform Genome Scans for Selection Based on Principal Component Analysis

10.1101/056135 ◽

2016 ◽

Cited By ~ 6

Author(s):

Keurcien Luu ◽

Eric Bazin ◽

Michael G. B. Blum

Keyword(s):

Population Structure ◽

Mahalanobis Distance ◽

Principal Component ◽

R Package ◽

Component Analysis ◽

Population Divergence ◽

Sequencing Data ◽

Genome Scans ◽

False Discoveries

The R package pcadapt performs genome scans to detect genes under selection based on population genomic data. It assumes that candidate markers are outliers with respect to how they are related to population structure. Because population structure is ascertained with principal component analysis, the package is fast and works with large-scale data. It can handle missing data and pooled sequencing data. In contrast to population-based approaches, the package handles admixed individuals and does not require grouping individuals into populations. Since its first release, pcadapt has evolved both in terms of statistical approach and software implementation. We present results obtained with robust Mahalanobis distance, which is a new statistic for genome scans available in the 2.0 and later versions of the package. When hierarchical population structure occurs, Mahalanobis distance is more powerful than the communality statistic that was implemented in the first version of the package. Using simulated data, we compare pcadapt to other software for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find that the proportion of false discoveries is around a nominal false discovery rate set at 10%, with the exception of BayeScan, which generates 40% false discoveries. We also find that the power of BayeScan is severely impacted by the presence of admixed individuals whereas pcadapt is not impacted. Last, we find that pcadapt and hapflk are the most powerful software in scenarios of population divergence and range expansion. Because pcadapt handles next-generation sequencing data, it is a valuable tool for data analysis in molecular ecology.
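The idea of scoring each marker by how unusually it relates to population structure can be sketched in a few lines. This is an illustration under simplifying assumptions, not the pcadapt implementation: it regresses each standardized SNP on the top principal component scores, turns the coefficients into z-scores, and computes an ordinary Mahalanobis distance with the plain sample covariance, whereas the package described above uses a robust variant. The function name and the degrees-of-freedom choice are assumptions.

```python
import numpy as np

def pca_scan_mahalanobis(genotypes, n_components=2):
    """Squared Mahalanobis distance per SNP from PC-regression z-scores.

    genotypes: array of shape (n_individuals, n_snps).
    """
    G = np.asarray(genotypes, dtype=float)
    G = G - G.mean(axis=0)
    sd = G.std(axis=0)
    sd[sd == 0] = 1.0            # leave monomorphic SNPs as all-zero columns
    G = G / sd
    # Principal component scores: left singular vectors (orthonormal columns)
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    scores = U[:, :n_components]
    # Least-squares coefficients; orthonormal scores make this a plain product
    beta = scores.T @ G                      # shape (K, n_snps)
    resid = G - scores @ beta
    dof = G.shape[0] - n_components - 1      # assumed residual degrees of freedom
    sigma = np.sqrt((resid ** 2).sum(axis=0) / dof)
    sigma[sigma == 0] = np.inf               # perfectly fitted SNPs get z = 0
    z = (beta / sigma).T                     # shape (n_snps, K)
    centered = z - z.mean(axis=0)
    cov = np.atleast_2d(np.cov(z, rowvar=False))
    inv_cov = np.linalg.inv(cov)
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
```

Markers with the largest distances would be the candidates for follow-up; converting the distances to p-values and controlling the false discovery rate are separate steps not shown here.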

Coronavirus, as a source of pandemic pathogens

10.1101/2020.04.26.063032 ◽

2020 ◽

Author(s):

T. Konishi

Keyword(s):

Influenza Virus ◽

Genome Sequencing ◽

Principal Component ◽

Component Analysis ◽

Sequencing Data ◽

Genome Data ◽

Similarities And Differences ◽

Annual Changes ◽

Future Status

The coronavirus and the influenza virus have similarities and differences. In order to comprehensively compare them, their genome sequencing data were examined by principal component analysis. Variations in coronavirus were smaller than those in a subclass of the influenza virus. In addition, differences among coronaviruses in a variety of hosts were small. These characteristics may have facilitated the infection of different hosts. Although many of the coronaviruses were more conservative, those repeatedly found among humans showed annual changes. If SARS-CoV-2 changes its genome like the Influenza H type, it will repeatedly spread every few years. In addition, the coronavirus family has many other candidates for subsequent pandemics.

One Sentence Summary: The genome data of coronavirus were compared to influenza virus, to investigate its spreading mechanism and future status. Coronavirus would repeatedly spread every few years. In addition, the coronavirus family has many other candidates for subsequent pandemics.

Inferring Population Structure and Admixture Proportions in Low Depth NGS Data

10.1101/302463 ◽

2018 ◽

Cited By ~ 5

Author(s):

Jonas Meisner ◽

Anders Albrechtsen

Keyword(s):

Population Structure ◽

Next Generation Sequencing ◽

Principal Component ◽

Component Analysis ◽

Allele Frequencies ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

We here present two methods for inferring population structure and admixture proportions in low-depth next-generation sequencing data. Inference of population structure is essential in both population genetics and association studies and is often performed using principal component analysis or clustering-based approaches. Next-generation sequencing methods provide large amounts of genetic data but are associated with statistical uncertainty, especially for low-depth sequencing data. Models can account for this uncertainty by working directly on genotype likelihoods of the unobserved genotypes. We propose a method for inferring population structure through principal component analysis in an iterative approach of estimating individual allele frequencies, where we demonstrate improved accuracy in samples with low and variable sequencing depth for both simulated and real datasets. We also use the estimated individual allele frequencies in a fast non-negative matrix factorization method to estimate admixture proportions. Both methods have been implemented in the PCAngsd framework available at http://www.popgen.dk/software/.
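Working on genotype likelihoods rather than called genotypes can be illustrated with a small sketch. It is not the PCAngsd implementation: it only shows one standard building block, the posterior expected genotype obtained by combining genotype likelihoods with a Hardy-Weinberg prior built from an allele frequency. The function name is hypothetical, and `freq` may be a population frequency or a per-individual frequency.

```python
import numpy as np

def expected_genotype(gl, freq):
    """Posterior expected genotype (dosage) from genotype likelihoods.

    gl: array of shape (..., 3) with likelihoods for genotypes 0, 1, 2.
    freq: allele frequency, scalar or array matching gl's leading shape.
    """
    gl = np.asarray(gl, dtype=float)
    f = np.asarray(freq, dtype=float)[..., None]
    # Hardy-Weinberg prior over genotypes 0, 1, 2
    prior = np.concatenate([(1 - f) ** 2, 2 * f * (1 - f), f ** 2], axis=-1)
    post = gl * prior
    post /= post.sum(axis=-1, keepdims=True)
    return post @ np.array([0.0, 1.0, 2.0])

# Uninformative likelihoods fall back on the prior mean, 2 * freq
dosage_flat = expected_genotype([1.0, 1.0, 1.0], 0.3)
# Informative likelihoods pull the estimate toward the supported genotype
dosage_informed = expected_genotype([0.01, 0.2, 0.9], 0.3)
```

With low sequencing depth the likelihoods are nearly flat, so the estimate depends heavily on the frequency used in the prior, which is why refining those frequencies iteratively matters.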

From whole genome sequencing data toward a simple genotyping tool: application to the animal pathogen Mycobacterium bovis

10.26226/morressier.56d5ba2ad462b80296c965c0 ◽

2016 ◽

Author(s):

Lorraine Michelet

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Mycobacterium Bovis ◽

Whole Genome ◽

Sequencing Data

Plasmids or no plasmids? A comparison between the agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

Whole Genome Sequencing Refines Knowledge on the Population Structure of Mycobacterium bovis from a Multi-Host Tuberculosis System

Microorganisms ◽

10.3390/microorganisms9081585 ◽

2021 ◽

Vol 9 (8) ◽

pp. 1585

Author(s):

Ana C. Reis ◽

Liliana C. M. Salvador ◽

Suelee Robbe-Austerman ◽

Rogério Tenreiro ◽

Ana Botelho ◽

...

Keyword(s):

Population Structure ◽

Whole Genome Sequencing ◽

Wild Boar ◽

Genome Sequencing ◽

Mycobacterium Bovis ◽

Red Deer ◽

Variable Number Tandem Repeat ◽

Variant Calling ◽

Whole Genome ◽

Network Analyses

Classical molecular analyses of Mycobacterium bovis based on spoligotyping and Variable Number Tandem Repeat (MIRU-VNTR) brought the first insights into the epidemiology of animal tuberculosis (TB) in Portugal, showing high genotypic diversity of circulating strains that mostly cluster within the European 2 clonal complex. Previous surveillance provided valuable information on the prevalence and spatial occurrence of TB and highlighted prevalent genotypes in areas where livestock and wild ungulates are sympatric. However, links at the wildlife–livestock interfaces were established mainly via classical genotype associations. Here, we apply whole genome sequencing (WGS) to cattle, red deer and wild boar isolates to reconstruct the M. bovis population structure in a multi-host, multi-region disease system and to explore links at a fine genomic scale between M. bovis from wildlife hosts and cattle. Whole genome sequences of 44 representative M. bovis isolates, obtained between 2003 and 2015 from three TB hotspots, were compared through single nucleotide polymorphism (SNP) variant calling analyses. Consistent with previous results combining classical genotyping with Bayesian population admixture modelling, SNP-based phylogenies support the branching of this M. bovis population into five genetic clades, three with apparent geographic specificities, as well as the establishment of an SNP catalogue specific to each clade, which may be explored in the future as phylogenetic markers. The core genome alignment of SNPs was integrated within a spatiotemporal metadata framework to further structure this M. bovis population by host species and TB hotspots, providing a baseline for network analyses in different epidemiological and disease control contexts. WGS of M. bovis isolates from Portugal is reported for the first time in this pilot study, refining the spatiotemporal context of TB at the wildlife–livestock interface and providing further support to the key role of red deer and wild boar on disease maintenance. The SNP diversity observed within this dataset supports the natural circulation of M. bovis for a long time period, as well as multiple introduction events of the pathogen in this Iberian multi-host system.