The Practical Haplotype Graph, a platform for storing and using pangenomes for imputation

2021 ◽  
Author(s):  
Peter Bradbury ◽  
Terry Casstevens ◽  
Sarah E Jensen ◽  
Lynn C Johnson ◽  
Zachary R Miller ◽  
...  

Motivation: Pangenomes provide novel insights for population and quantitative genetics, genomics, and breeding that are not available from studying a single reference genome. A species is better represented by a pangenome, or collection of genomes. Unfortunately, managing and using pangenomes for genomically diverse species is computationally and practically challenging. We developed a trellis graph representation anchored to the reference genome that represents most pangenomes well and can be used to impute complete genomes from low-density sequence or variant data. Results: The Practical Haplotype Graph (PHG) is a pangenome pipeline, database (PostgreSQL & SQLite), data model (Java, Kotlin, or R), and Breeding API (BrAPI) web service. The PHG has already been able to accurately represent diversity in four major crops including maize, one of the most genomically diverse species, with up to 1000-fold data compression. Using simulated data, we show that, even at 0.1X coverage, with appropriate reads and sequence alignment, imputation results in extremely accurate haplotype reconstruction. The PHG is a platform and environment for the understanding and application of genomic diversity. Availability: All resources listed here are freely available. The PHG Docker image used to generate the simulation results is available at https://hub.docker.com/ as maizegenetics/phg:0.0.27. PHG source code is at https://bitbucket.org/bucklerlab/practicalhaplotypegraph/src/master/. The code used for the analysis of simulated data is at https://bitbucket.org/bucklerlab/phg-manuscript/src/master/. The PHG database of NAM parent haplotypes is in the CyVerse data store (https://de.cyverse.org/de/) and named /iplant/home/shared/panzea/panGenome/PHG_db_maize/phg_v5Assemblies_20200608.db.

GigaScience ◽  
2020 ◽  
Vol 9 (2) ◽  
Author(s):  
Stephen J Bush ◽  
Dona Foster ◽  
David W Eyre ◽  
Emily L Clark ◽  
Nicola De Maio ◽  
...  

Abstract Background Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella. Results We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis. Conclusions The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.
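The two quantities this abstract relates can be sketched in a few lines. This is a minimal illustration, not the study's evaluation code. It assumes the standard definitions of precision and sensitivity from true-positive, false-positive, and false-negative SNP counts, and the Mash distance formula D = -(1/k) ln(2j / (1 + j)) for a Jaccard index j between k-mer sketches. The function names, the example counts, and the default k = 21 are hypothetical choices made for this sketch.

```python
import math


def precision_sensitivity(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, sensitivity) for a set of SNP calls.

    tp: called SNPs that are real; fp: called SNPs that are not real;
    fn: real SNPs that were not called.
    """
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or sensitivity is undefined for these counts")
    return tp / (tp + fp), tp / (tp + fn)


def mash_distance(jaccard: float, k: int = 21) -> float:
    """Convert a k-mer Jaccard index into a Mash distance.

    Assumes D = -(1/k) * ln(2j / (1 + j)); k = 21 is an illustrative default.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard index must lie in [0, 1]")
    if jaccard == 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    return max(0.0, -math.log(2.0 * jaccard / (1.0 + jaccard)) / k)


if __name__ == "__main__":
    precision, sensitivity = precision_sensitivity(tp=90, fp=10, fn=30)
    print(f"precision={precision:.2f} sensitivity={sensitivity:.2f}")
    print(f"Mash distance at j=0.5: {mash_distance(0.5):.4f}")
```

Under these definitions, a lower Jaccard index between the reads and the reference gives a larger Mash distance, which is the axis along which the abstract reports falling sensitivity and precision.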


2019 ◽  
Author(s):  
Stephen J. Bush ◽  
Dona Foster ◽  
David W. Eyre ◽  
Emily L. Clark ◽  
Nicola De Maio ◽  
...  

Abstract Background Accurately identifying SNPs from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 41 SNP-calling pipelines using simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella. Results We evaluated the performance of 41 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli, but less dominant for clonal species such as Mycobacterium tuberculosis. Conclusions The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. However, across the full range of (divergent) genomes, among the consistently highest-performing pipelines was Snippy.


Animals ◽  
2021 ◽  
Vol 11 (3) ◽  
pp. 904
Author(s):  
Saif ur Rehman ◽  
Faiz-ul Hassan ◽  
Xier Luo ◽  
Zhipeng Li ◽  
Qingyou Liu

The buffalo was domesticated around 3000–6000 years ago and has substantial economic significance as a meat, dairy, and draught animal. Until recently, the buffalo lacked a well-annotated, de novo assembled reference genome. Exploring the genetic architecture of a species is essential to understanding the biology that helps to manage its genetic variability, which is ultimately used for selective breeding and genomic selection. Morphological and molecular data have revealed that the swamp buffalo population has strong geographical genomic diversity with low gene flow but strong phenotypic consistency, while the river buffalo population has higher phenotypic diversity with a weak phylogeographic structure. The recent availability of a high-quality reference genome and genotyping marker panels has invigorated many genome-based studies on evolutionary history, genetic diversity, functional elements, and performance traits. Increasing molecular knowledge, combined with selective breeding, should pave the way for genetic improvement in the climatic resilience, disease resistance, and production performance of water buffalo populations globally.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Christopher Alan Smith

Abstract The basidiomycete fungus Lentinula novae-zelandiae is endemic to New Zealand and is a sister taxon to Lentinula edodes, the second most cultivated mushroom in the world. To explore the biology of this organism, a high-quality chromosome-level reference genome of L. novae-zelandiae was produced. Macrosyntenic comparisons between the genome assembly of L. novae-zelandiae, L. edodes, and a set of three genome assemblies of diverse species from the Agaricomycota reveal a high degree of macrosyntenic restructuring within L. edodes, consistent with a signal of domestication. These results show that L. edodes has undergone significant genomic change during the course of its evolutionary history, likely as a result of its cultivation and domestication over the last 1000 years.


2015 ◽  
Vol 14 ◽  
pp. CIN.S26470 ◽  
Author(s):  
Richard P. Finney ◽  
Qing-Rong Chen ◽  
Cu V. Nguyen ◽  
Chih Hao Hsu ◽  
Chunhua Yan ◽  
...  

The name Alview is a contraction of the term Alignment Viewer. Alview is a software tool, compiled to native architecture, for visualizing the alignment of sequencing data. Inputs are files of short-read sequences aligned to a reference genome in the SAM/BAM format and files containing reference genome data. Outputs are visualizations of these aligned short reads. Alview is written in portable C with optional graphical user interface (GUI) code written in C, C++, and Objective-C. The application can run in three different ways: as a web server, as a command line tool, or as a native GUI program. Alview is compatible with Microsoft Windows, Linux, and Apple OS X. It is available as a web demo at https://cgwb.nci.nih.gov/cgi-bin/alview. The source code and Windows/Mac/Linux executables are available via https://github.com/NCIP/alview.


2021 ◽  
Vol 4 ◽  
pp. 1-5
Author(s):  
Bárbara Cubillos ◽  
Ángela Ortíz ◽  
Germán Aguilera ◽  
Sergio Rozas ◽  
Claudio Reyes ◽  
...  

Abstract. The digital cartographic coverage at 1:25,000 scale that the Military Geographic Institute is creating follows international standards, so that it constitutes a standardized and interoperable tool for the various areas of activity in Chile. In this context, the ISO TC 211 standards and the TDS (Topographic Data Store) data model developed by the National Geospatial-Intelligence Agency (NGA) are being used. Beyond adopting these standards, efforts have been aimed from an early stage at determining the quality of this product, starting with the study of a methodology to measure positional accuracy. The method defined conforms to the NSSDA test: it uses points measured in the field specifically for this control and eliminates out-of-range points under Chauvenet's criterion. Finally, the positional accuracy is declared in the metadata.
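The two statistical steps named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the Institute's procedure. Chauvenet's criterion is applied in a single pass to one error component, rejecting a point when the expected number of equally extreme observations, n · P(|Z| ≥ z), falls below 0.5. The NSSDA horizontal accuracy at 95% confidence is taken as 1.7308 × RMSE_r, which assumes that the x and y errors have similar magnitude. The function names and sample values are hypothetical.

```python
import math
from statistics import mean, stdev


def chauvenet_keep(errors: list[float]) -> list[bool]:
    """Flag which error values survive one pass of Chauvenet's criterion."""
    n = len(errors)
    if n < 2:
        return [True] * n
    mu, sigma = mean(errors), stdev(errors)
    if sigma == 0.0:
        return [True] * n
    keep = []
    for e in errors:
        z = abs(e - mu) / sigma
        # Two-sided tail probability of a standard normal beyond |z|.
        tail = math.erfc(z / math.sqrt(2.0))
        keep.append(n * tail >= 0.5)
    return keep


def nssda_horizontal_accuracy(dx: list[float], dy: list[float]) -> float:
    """Horizontal accuracy at 95% confidence, assuming RMSE_x ~ RMSE_y."""
    if len(dx) != len(dy) or not dx:
        raise ValueError("dx and dy must be non-empty and of equal length")
    rmse_r = math.sqrt(mean([x * x + y * y for x, y in zip(dx, dy)]))
    return 1.7308 * rmse_r


if __name__ == "__main__":
    # Hypothetical easting errors in metres; the last value is a blunder.
    dx = [0.1, -0.2, 0.15, -0.1, 0.05, 0.0, -0.05, 0.12, -0.15, 5.0]
    print(chauvenet_keep(dx))
    print(nssda_horizontal_accuracy([3.0], [4.0]))
```

In practice the rejection step would be run before the accuracy statistic, so that blunders among the field-measured check points do not inflate the declared accuracy.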


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Heng Li ◽  
Xiaowen Feng ◽  
Chong Chu

Abstract The recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinate of the linear reference genome. We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.


2008 ◽  
Vol 367 ◽  
pp. 71-78
Author(s):  
P.T. Moe ◽  
Yawar Abbas Khan ◽  
Henry Sigvart Valberg ◽  
Sigurd Støren

The article outlines a scientific approach for testing constitutive relations for the aluminum extrusion process. By comparing measurements of ram force, container friction, die face pressure, and outlet temperature during rod extrusion with corresponding simulated data, inferences can in principle be drawn about the validity of the models. The paper indicates that simulation results from the 2D ALMA2π program are in fair agreement with measurements during extrusion of AA6060, but more work needs to be done to control thermal conditions during extrusion.


2014 ◽  
Vol 908 ◽  
pp. 425-428
Author(s):  
Hao Chen ◽  
Fu Li Chen ◽  
Yong Ying Zhu ◽  
Ming Zeng ◽  
Chen Sun ◽  
...  

Using a dissolved conservative substance as a tracer, a convection-diffusion numerical model of water exchange in the bay is established. The water exchange simulation results yield the half-exchange period of each region. Based on the hydrological data, the model results show the tidal current field and the half-exchange period, which are used to analyze the water exchange properties and the convective transport of pollutants in Pulandian Bay. The numerical simulation results provide a scientific basis and basic data for sea area construction and environmental protection.
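As a rough illustration of how a half-exchange period might be read off a simulated tracer record, the sketch below assumes the common definition: the time at which the concentration of the conservative tracer in a region first drops to half its initial value. It interpolates linearly between model output times. The function name and the sample series are hypothetical, and the abstract does not state which definition the authors used.

```python
from typing import Optional, Sequence


def half_exchange_period(times: Sequence[float],
                         conc: Sequence[float]) -> Optional[float]:
    """Return the first time the tracer falls to half its initial value.

    times: model output times, increasing (for example, in days).
    conc:  region-averaged tracer concentration at each output time.
    Returns None if the record never reaches half the initial value.
    """
    if len(times) != len(conc) or len(times) < 2:
        raise ValueError("need two or more matching time/concentration samples")
    target = 0.5 * conc[0]
    for i in range(1, len(conc)):
        c_prev, c_now = conc[i - 1], conc[i]
        if c_prev > target >= c_now:
            # Linear interpolation inside the interval that brackets the target.
            frac = (c_prev - target) / (c_prev - c_now)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None


if __name__ == "__main__":
    days = [0.0, 10.0, 20.0, 30.0]
    tracer = [1.0, 0.8, 0.4, 0.2]
    print(half_exchange_period(days, tracer))
```

Applying the same function to each region's concentration series would give one half-exchange period per region, matching the per-region result the abstract describes.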


Author(s):  
Laura M. Carroll ◽  
Martin Wiedmann

Abstract Cereulide-producing members of Bacillus cereus sensu lato (B. cereus s.l.) Group III, also known as “emetic B. cereus”, possess cereulide synthetase, a plasmid-encoded, non-ribosomal peptide synthetase encoded by the ces gene cluster. Despite the documented risks that cereulide-producing strains pose to public health, the level of genomic diversity encompassed by “emetic B. cereus” has never been evaluated at a whole-genome scale. Here, we employ a phylogenomic approach to characterize Group III B. cereus s.l. genomes which possess ces (ces-positive) alongside their closely related ces-negative counterparts to (i) assess the genomic diversity encompassed by “emetic B. cereus”, and (ii) identify potential ces loss and/or gain events within the evolutionary history of the high-risk and medically relevant sequence type (ST) 26 lineage often associated with emetic foodborne illness. Using all publicly available ces-positive Group III B. cereus s.l. genomes and the ces-negative genomes interspersed among them (n = 150), we show that “emetic B. cereus” is not clonal; rather, multiple lineages within Group III harbor cereulide-producing strains, all of which share a common ancestor incapable of producing cereulide (posterior probability [PP] 0.86-0.89). The ST 26 common ancestor was predicted to have emerged as ces-negative (PP 0.60-0.93) circa 1904 (95% highest posterior density [HPD] interval 1837.1-1957.8) and first acquired the ability to produce cereulide before 1931 (95% HPD 1893.2-1959.0). Three subsequent ces loss events within ST 26 were observed, including among isolates responsible for B. cereus s.l. toxicoinfection (i.e., “diarrheal” illness). Importance “B. cereus” is responsible for thousands of cases of foodborne disease each year worldwide, causing two distinct forms of illness: (i) intoxication via cereulide (i.e., “emetic” syndrome) or (ii) toxicoinfection via multiple enterotoxins (i.e., “diarrheal” syndrome). Here, we show that “emetic B. cereus” is not a clonal, homogenous unit that resulted from a single cereulide synthetase gain event followed by subsequent proliferation; rather, cereulide synthetase acquisition and loss is a dynamic, ongoing process that occurs across lineages, allowing some Group III B. cereus s.l. populations to oscillate between diarrheal and emetic foodborne pathogen over the course of their evolutionary histories. We also highlight the care that must be taken when selecting a reference genome for whole-genome sequencing-based investigation of emetic B. cereus s.l. outbreaks, as some reference genome selections can lead to a confounding loss of resolution and potentially hinder epidemiological investigations.

