VCF2PopTree: a one-click client-side software to construct population phylogeny from genome-wide SNPs

10.7287/peerj.preprints.27682 ◽

2019 ◽

Author(s):

Sankar Subramanian ◽

Umayal Ramasamy ◽

David Chen

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Web Applications ◽

Third Party ◽

Genotype Data ◽

Whole Genome ◽

Genome Data ◽

Genome Wide ◽

Software Programs ◽

Computationally Intensive

Over the past decades, a number of software programs have been developed to deduce the phylogenetic relationships between populations. However, these programs are not suited to large-scale whole-genome data. Recently, a few standalone or web applications have been developed to handle genome-wide data, but they were computationally intensive, depended on third-party software, or required significant time and resources on a web server. In the post-genomic era, the falling cost of sequencing and analysis allows researchers to obtain bioinformatically processed, high-quality, publication-ready whole-genome data for many individuals in a population from next-generation sequencing companies. Such genotype data are typically presented in the Variant Call Format (VCF), and no simple software is available that uses these data to construct the phylogeny of populations in a short time. To address this limitation, we have developed a one-click, user-friendly software, VCF2PopTree, that uses genome-wide SNPs to construct and display phylogenetic trees in seconds to minutes. For example, it reads a 1 GB VCF file and draws a tree in less than 5 minutes. VCF2PopTree accepts genotype data from a local machine, constructs a tree using the UPGMA and Neighbour-Joining algorithms, and displays it in a web browser. It also produces a pairwise-diversity matrix in the MEGA and PHYLIP file formats, as well as trees in the Newick format, which can be used directly by other popular phylogenetic software programs. The software, including the source code, a test VCF input file and short documentation, is available at: https://github.com/sansubs/vcf2pop.
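To make the idea of a pairwise matrix built from VCF genotypes concrete, here is a minimal Python sketch. It is not the VCF2PopTree source code: the function names `parse_genotypes` and `pairwise_distance`, the demo records, and the distance measure (mean per-site difference in alternate-allele count, divided by two for diploid calls) are all assumptions made for illustration. The abstract does not state which diversity measure the tool uses.

```python
from itertools import combinations


def parse_genotypes(vcf_text):
    """Return (sample_names, sites); each site lists alt-allele counts per sample.

    Missing calls such as "./." are stored as None.
    """
    samples, sites = [], []
    for line in vcf_text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the FORMAT column
            continue
        counts = []
        for call in fields[9:]:
            # Assumes GT is the first FORMAT key, as in the demo records below.
            alleles = call.split(":")[0].replace("|", "/").split("/")
            if "." in alleles:
                counts.append(None)
            else:
                counts.append(sum(1 for a in alleles if a != "0"))
        sites.append(counts)
    return samples, sites


def pairwise_distance(samples, sites):
    """Mean per-site distance for every sample pair (illustrative measure only).

    Assumes diploid calls: the per-site distance is |alt_i - alt_j| / 2.
    Sites where either sample is missing are skipped for that pair.
    """
    dist = {}
    for i, j in combinations(range(len(samples)), 2):
        total, n = 0.0, 0
        for counts in sites:
            a, b = counts[i], counts[j]
            if a is None or b is None:
                continue
            total += abs(a - b) / 2
            n += 1
        dist[(samples[i], samples[j])] = total / n if n else float("nan")
    return dist


# Hypothetical three-sample VCF, built with explicit tab separators.
DEMO_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "S1", "S2", "S3"]),
    "\t".join(["1", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0/0", "0/1", "1/1"]),
    "\t".join(["1", "200", ".", "C", "T", ".", "PASS", ".", "GT", "0/0", "0/0", "1/1"]),
    "\t".join(["1", "300", ".", "G", "A", ".", "PASS", ".", "GT", "0/1", "./.", "1/1"]),
])

names, site_counts = parse_genotypes(DEMO_VCF)
for pair, value in pairwise_distance(names, site_counts).items():
    print(pair, round(value, 4))
```

The sketch works on individual samples. A population-level matrix would additionally need a mapping from samples to populations and an average over the sample pairs between each pair of populations.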


VCF2PopTree: a one-click client-side software to construct population phylogeny from genome-wide SNPs

10.7287/peerj.preprints.27682v1 ◽

2019 ◽

Author(s):

Sankar Subramanian ◽

Umayal Ramasamy ◽

David Chen

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Web Applications ◽

Third Party ◽

Genotype Data ◽

Whole Genome ◽

Genome Data ◽

Genome Wide ◽

Software Programs ◽

Computationally Intensive

Over the past decades, a number of software programs have been developed to deduce the phylogenetic relationships between populations. However, these programs are not suited to large-scale whole-genome data. Recently, a few standalone or web applications have been developed to handle genome-wide data, but they were computationally intensive, depended on third-party software, or required significant time and resources on a web server. In the post-genomic era, the falling cost of sequencing and analysis allows researchers to obtain bioinformatically processed, high-quality, publication-ready whole-genome data for many individuals in a population from next-generation sequencing companies. Such genotype data are typically presented in the Variant Call Format (VCF), and no simple software is available that uses these data to construct the phylogeny of populations in a short time. To address this limitation, we have developed a one-click, user-friendly software, VCF2PopTree, that uses genome-wide SNPs to construct and display phylogenetic trees in seconds to minutes. For example, it reads a 1 GB VCF file and draws a tree in less than 5 minutes. VCF2PopTree accepts genotype data from a local machine, constructs a tree using the UPGMA and Neighbour-Joining algorithms, and displays it in a web browser. It also produces a pairwise-diversity matrix in the MEGA and PHYLIP file formats, as well as trees in the Newick format, which can be used directly by other popular phylogenetic software programs. The software, including the source code, a test VCF input file and short documentation, is available at: http://sankarsubramanian.net/dat/index.html.


VCF2PopTree: a client-side software to construct population phylogeny from genome-wide SNPs

PeerJ ◽

10.7717/peerj.8213 ◽

2019 ◽

Vol 7 ◽

pp. e8213 ◽

Cited By ~ 5

Author(s):

Sankar Subramanian ◽

Umayal Ramasamy ◽

David Chen

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Web Applications ◽

Third Party ◽

Genotype Data ◽

Whole Genome ◽

Genome Data ◽

Genome Wide ◽

Software Programs ◽

Computationally Intensive

Over the past decades, a number of software programs have been developed to infer phylogenetic relationships between populations. However, most of these programs use alignments of gene sequences to build phylogenies. Recently, many standalone or web applications have been developed to handle large-scale whole-genome data, but they are computationally intensive, depend on third-party software, or require significant time and resources on a web server. In the post-genomic era, the falling cost of sequencing and analysis allows researchers to obtain bioinformatically processed, high-quality, publication-ready whole-genome data for many individuals in a population from next-generation sequencing companies. Such genotype data are typically presented in the Variant Call Format (VCF), and no simple software is available that directly uses this data format to construct the phylogeny of populations in a short time. To address this limitation, we have developed a user-friendly software, VCF2PopTree, that uses genome-wide SNPs to construct and display phylogenetic trees in seconds to minutes. For example, it reads a VCF file containing 4 million SNPs and draws a tree in less than 30 seconds. VCF2PopTree accepts genotype data from a local machine, constructs a tree using the UPGMA and Neighbour-Joining algorithms, and displays it in a web browser. It also produces a pairwise-diversity matrix in the MEGA and PHYLIP file formats, as well as trees in the Newick format, which can be used directly by other popular phylogenetic software programs. The software, including the source code, a test VCF file and documentation, is available at: https://github.com/sansubs/vcf2pop.
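The step from a distance matrix to a Newick tree can be sketched with a textbook UPGMA implementation. This is a generic illustration in Python, not the code used by VCF2PopTree. The function name `upgma_newick`, the input layout (a dict keyed by label pairs in the order of `names`), and the four-population distances are hypothetical.

```python
from itertools import combinations


def upgma_newick(names, dist):
    """Build a UPGMA tree and return it as a Newick string.

    names: list of leaf labels.
    dist:  dict mapping (names[i], names[j]) with i < j to a distance.
    """
    # Active clusters: id -> (newick_subtree, number_of_leaves, node_height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller_id, larger_id)
    d = {(i, j): dist[(names[i], names[j])]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)

    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), d_ab = min(d.items(), key=lambda kv: kv[1])
        sub_a, n_a, h_a = clusters.pop(a)
        sub_b, n_b, h_b = clusters.pop(b)
        height = d_ab / 2  # UPGMA places the new node at half the distance
        newick = (f"({sub_a}:{height - h_a:.4f},"
                  f"{sub_b}:{height - h_b:.4f})")

        # Distance from the merged cluster to every remaining cluster is the
        # size-weighted average of the two old distances.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)

        # Drop distances involving the merged clusters, then add the new ones.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = (newick, n_a + n_b, height)
        next_id += 1

    (tree, _, _), = clusters.values()
    return tree + ";"


# Hypothetical distances between four populations.
demo_names = ["A", "B", "C", "D"]
demo_dist = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("A", "D"): 10.0,
    ("B", "C"): 6.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
print(upgma_newick(demo_names, demo_dist))
# → (D:5.0000,(C:3.0000,(A:1.0000,B:1.0000):2.0000):2.0000);
```

UPGMA assumes a constant rate of change along all lineages and therefore returns a rooted, ultrametric tree. Neighbour-Joining, the other algorithm named in the abstract, drops that assumption and is not shown here.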


A curated dataset of modern and ancient high-coverage shotgun human genomes

Scientific Data ◽

10.1038/s41597-021-00980-1 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Pierpaolo Maisano Delser ◽

Eppie R. Jones ◽

Anahit Hovhannisyan ◽

Lara Cassidy ◽

Ron Pinhasi ◽

...

Keyword(s):

Sequence Data ◽

Whole Genome ◽

Reference Dataset ◽

High Coverage ◽

Sample Distribution ◽

Human Samples ◽

Human Genomes ◽

Genome Wide ◽

Genome Wide Data ◽

Computationally Intensive

Over the last few years, genome-wide data for a large number of ancient human samples have been collected. Whilst datasets of captured SNPs have been collated, high coverage shotgun genomes (which are relatively few but allow certain types of analyses not possible with ascertained captured SNPs) have to be reprocessed by individual groups from raw reads. This task is computationally intensive. Here, we release a dataset including 35 whole-genome sequenced samples, previously published and distributed worldwide, together with the genetic pipeline used to process them. The dataset contains 72,041,355 sites called across 19 ancient and 16 modern individuals and includes sequence data from four previously published ancient samples which we sequenced to higher coverage (10–18x). Such a resource will allow researchers to analyse their new samples with the same genetic pipeline and directly compare them to the reference dataset without re-processing published samples. Moreover, this dataset can be easily expanded to increase the sample distribution both across time and space.


halSynteny: a fast, easy-to-use conserved synteny block construction method for multiple whole-genome alignments

GigaScience ◽

10.1093/gigascience/giaa047 ◽

2020 ◽

Vol 9 (6) ◽

Author(s):

Ksenia Krasheninnikova ◽

Mark Diekhans ◽

Joel Armstrong ◽

Aleksei Dievskii ◽

Benedict Paten ◽

...

Keyword(s):

Large Scale ◽

Pairwise Alignment ◽

Synteny Block ◽

Rapid Identification ◽

Full Genome Sequence ◽

Whole Genome ◽

Full Genome ◽

Genome Data ◽

Multiple Alignments ◽

Binary Format

Background: Large-scale sequencing projects provide high-quality full-genome data that can be used for reconstruction of chromosomal exchanges and rearrangements that disrupt conserved syntenic blocks. The highest resolution of cross-species homology can be obtained on the basis of whole-genome, reference-free alignments. Very large multiple alignments of full-genome sequences stored in a binary format demand an accurate and efficient computational approach for synteny block production. Findings: halSynteny performs efficient processing of pairwise alignment blocks for any pair of genomes in the alignment. The tool is part of the HAL comparative genomics suite and is targeted to build synteny blocks for multi-hundred–way, reference-free vertebrate alignments built with the Cactus system. Conclusions: halSynteny enables accurate and rapid identification of synteny in multiple full-genome alignments. The method is implemented in C++11 as a component of the halTools software and released under the MIT license. The package is available at https://github.com/ComparativeGenomicsToolkit/hal/.


Population-level genome-wide STR typing in Plasmodium species reveals higher resolution population structure and genetic diversity relative to SNP typing

10.1101/2021.05.19.444768 ◽

2021 ◽

Author(s):

Jiru Han ◽

Jacob E Munro ◽

Anthony Kocoski ◽

Alyssa E Barry ◽

Melanie Bahlo

Keyword(s):

Genetic Diversity ◽

Large Scale ◽

Tandem Repeats ◽

Plasmodium Species ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Genome Wide ◽

Field Samples

Short tandem repeats (STRs) are highly informative genetic markers that have been used extensively in population genetics analysis. They are an important source of genetic diversity and can also have functional impact. Despite the availability of bioinformatic methods that permit large-scale genome-wide genotyping of STRs from whole genome sequencing data, they have not previously been applied to sequencing data from large collections of malaria parasite field samples. Here, we have genotyped STRs using HipSTR in published whole-genome sequence data from more than 3,000 Plasmodium falciparum and 174 Plasmodium vivax samples collected across the globe. High levels of noise and variability in the resultant callset necessitated the development of a novel method for quality control of STR genotype calls. A set of high-quality STR loci (6,768 from P. falciparum and 3,496 from P. vivax) was used to study Plasmodium genetic diversity, population structures and genomic signatures of selection, and these were compared to genome-wide single nucleotide polymorphism (SNP) genotyping data. In addition, genome-wide information about genetic variation and other characteristics of STRs in P. falciparum and P. vivax has been made available in an interactive web-based R Shiny application, PlasmoSTR (https://github.com/bahlolab/PlasmoSTR).


Seave: a comprehensive web platform for storing and interrogating human genomic variation

10.1101/258061 ◽

2018 ◽

Cited By ~ 3

Author(s):

Velimir Gayevskiy ◽

Tony Roscioli ◽

Marcel E Dinger ◽

Mark J Cowley

Keyword(s):

Cloud Computing ◽

Large Scale ◽

Variant Calling ◽

Genomic Variation ◽

Whole Genome ◽

Genome Data ◽

Pathogenicity Prediction ◽

Data Scaling ◽

Human Genomic ◽

Web Platform

Capability for genome sequencing and variant calling has increased dramatically, enabling large scale genomic interrogation of human disease. However, discovery is hindered by the current limitations in genomic interpretation, which remains a complicated and disjointed process. We introduce Seave, a web platform that enables variants to be easily filtered and annotated with in silico pathogenicity prediction scores and annotations from popular disease databases. Seave stores genomic variation of all types and sizes, and allows filtering for specific inheritance patterns, quality values, allele frequencies and gene lists. Seave is open source and deployable locally, or on a cloud computing provider, and works readily with gene panel, exome and whole genome data, scaling from single labs to multi-institution scale.


A curated dataset of modern and ancient high-coverage shotgun human genomes

10.1101/2020.10.27.351692 ◽

2020 ◽

Author(s):

Pierpaolo Maisano Delser ◽

Eppie R. Jones ◽

Anahit Hovhannisyan ◽

Lara Cassidy ◽

Ron Pinhasi ◽

...

Keyword(s):

Sequence Data ◽

Whole Genome ◽

Reference Dataset ◽

High Coverage ◽

Sample Distribution ◽

Human Samples ◽

Human Genomes ◽

Genome Wide ◽

Genome Wide Data ◽

Computationally Intensive

Over the last few years, genome-wide data for a large number of ancient human samples have been collected. Whilst datasets of captured SNPs have been collated, high coverage shotgun genomes (which are relatively few but allow certain types of analyses not possible with ascertained captured SNPs) have to be reprocessed by individual groups from raw reads. This task is computationally intensive. Here, we release a dataset including 34 whole-genome sequenced samples, previously published and distributed worldwide, together with the genetic pipeline used to process them. The dataset contains 73,435,604 sites called across 18 ancient and 16 modern individuals and includes sequence data from four previously published ancient samples which we sequenced to higher coverage (10-18x). Such a resource will allow researchers to analyse their new samples with the same genetic pipeline and directly compare them to the reference dataset without re-processing published samples. Moreover, this dataset can be easily expanded to increase the sample distribution both across time and space.


Epidemic Clostridioides difficile Ribotype 027 Lineages: Comparisons of Texas Versus Worldwide Strains

Open Forum Infectious Diseases ◽

10.1093/ofid/ofz013 ◽

2019 ◽

Vol 6 (2) ◽

Cited By ~ 7

Author(s):

Bradley T Endres ◽

Khurshida Begum ◽

Hua Sun ◽

Seth T Walk ◽

Ali Memariani ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Phylogenetic Trees ◽

Large Scale ◽

The United States ◽

Whole Genome Sequence ◽

Snp Analysis ◽

Whole Genome ◽

Ribotype 027 ◽

Clostridioides Difficile

Background: The epidemic Clostridioides difficile ribotype 027 strain resulted from the dissemination of 2 separate fluoroquinolone-resistant lineages: FQR1 and FQR2. Both lineages were reported to originate in North America; however, confirmatory large-scale investigations of C difficile ribotype 027 epidemiology using whole genome sequencing have not been undertaken in the United States. Methods: Whole genome sequencing and single-nucleotide polymorphism (SNP) analysis were performed on 76 clinical ribotype 027 isolates obtained from hospitalized patients in Texas with C difficile infection and compared with 32 previously sequenced worldwide strains. Maximum-likelihood phylogeny based on a set of core genome SNPs was used to construct phylogenetic trees investigating strain macro- and microevolution. Bayesian phylogenetic and phylogeographic analyses were used to incorporate temporal and geographic variables with the SNP strain analysis. Results: Whole genome sequence analysis identified 2841 SNPs including 900 nonsynonymous mutations, 1404 synonymous substitutions, and 537 intergenic changes. Phylogenetic analysis separated the strains into 2 prominent groups, which grossly differed by 28 SNPs: the FQR1 and FQR2 lineages. Five isolates were identified as pre-epidemic strains. Phylogeny demonstrated unique clustering and resistance genes in Texas strains indicating that spatiotemporal bias has defined the microevolution of ribotype 027 genetics. Conclusions: Clostridioides difficile ribotype 027 lineages emerged earlier than previously reported, coinciding with increased use of fluoroquinolones. Both FQR1 and FQR2 ribotype 027 epidemic lineages are present in Texas, but they have evolved geographically to represent region-specific public health threats.


Evaluating the use of ABBA-BABA statistics to locate introgressed loci

10.1101/001347 ◽

2013 ◽

Cited By ~ 17

Author(s):

Simon H. Martin ◽

John W. Davey ◽

Chris D. Jiggins

Keyword(s):

Population Structure ◽

Population Size ◽

Effective Population Size ◽

Ancestral Population ◽

Whole Genome ◽

Effective Population ◽

Genome Data ◽

Genome Wide ◽

A Genome ◽

Genomic Regions

Several methods have been proposed to test for introgression across genomes. One method tests for a genome-wide excess of shared derived alleles between taxa using Patterson's D statistic, but does not establish which loci show such an excess or whether the excess is due to introgression or ancestral population structure. Several recent studies have extended the use of D by applying the statistic to small genomic regions, rather than genome-wide. Here, we use simulations and whole genome data from Heliconius butterflies to investigate the behavior of D in small genomic regions. We find that D is unreliable in this situation as it gives inflated values when effective population size is low, causing D outliers to cluster in genomic regions of reduced diversity. As an alternative, we propose a related statistic f̂d, a modified version of a statistic originally developed to estimate the genome-wide fraction of admixture. f̂d is not subject to the same biases as D, and is better at identifying introgressed loci. Finally, we show that both D and f̂d outliers tend to cluster in regions of low absolute divergence (dXY), which can confound a recently proposed test for differentiating introgression from shared ancestral variation at individual loci.
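The two statistics named in this abstract can be sketched from derived-allele frequencies. The Python example below uses the standard frequency-based ABBA and BABA site patterns for populations P1, P2, P3 and an outgroup O. For f̂d it sets the donor frequency at each site to the higher of the P2 and P3 frequencies, which is the usual form of the denominator. The function names and the example frequencies are hypothetical, and the code is a sketch rather than the authors' implementation.

```python
def abba_baba_sums(p1, p2, p3, po):
    """Sum frequency-based ABBA and BABA pattern weights over all sites.

    Each argument is a list of derived-allele frequencies, one per site.
    """
    abba = sum((1 - a) * b * c * (1 - o) for a, b, c, o in zip(p1, p2, p3, po))
    baba = sum(a * (1 - b) * c * (1 - o) for a, b, c, o in zip(p1, p2, p3, po))
    return abba, baba


def patterson_d(p1, p2, p3, po):
    """Patterson's D: (ABBA - BABA) / (ABBA + BABA).

    Undefined (division by zero) when no site carries either pattern.
    """
    abba, baba = abba_baba_sums(p1, p2, p3, po)
    return (abba - baba) / (abba + baba)


def f_d(p1, p2, p3, po):
    """f_d: observed ABBA - BABA scaled by its value under maximal sharing.

    The denominator replaces P2 and P3 at each site with a donor population
    whose frequency is the higher of the two.
    """
    abba, baba = abba_baba_sums(p1, p2, p3, po)
    donor = [max(b, c) for b, c in zip(p2, p3)]
    abba_d, baba_d = abba_baba_sums(p1, donor, donor, po)
    return (abba - baba) / (abba_d - baba_d)


# Hypothetical derived-allele frequencies at a single site.
p1, p2, p3, po = [0.1], [0.5], [0.8], [0.0]
print(round(patterson_d(p1, p2, p3, po), 4))
print(round(f_d(p1, p2, p3, po), 4))
```

In practice both statistics are computed over windows of many sites, and windows in which a denominator is zero have to be skipped. The sketch does not handle that case.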
