A Graph Extension of the Positional Burrows-Wheeler Transform and its Applications

2016 ◽  
Author(s):  
Adam M. Novak ◽  
Erik Garrison ◽  
Benedict Paten

Abstract: We present a generalization of the Positional Burrows-Wheeler Transform (PBWT) to genome graphs, which we call the gPBWT. A genome graph is a collapsed representation of a set of genomes described as a graph. In a genome graph, a haplotype corresponds to a restricted form of walk. The gPBWT is a compressible representation of a set of these graph-encoded haplotypes that allows for efficient subhaplotype match queries. We give efficient algorithms for gPBWT construction and query operations. We describe our implementation, showing the compression and search of 1000 Genomes data. As a demonstration, we use the gPBWT to quickly count the number of haplotypes consistent with random walks in a genome graph, and with the paths taken by mapped reads; results suggest that haplotype consistency information can be practically incorporated into graph-based read mappers.
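To make the "haplotypes consistent with a walk" query concrete, here is a minimal brute-force sketch. It is not the gPBWT and does not reflect the authors' implementation: it assumes each haplotype is simply a list of hypothetical integer node IDs, ignores node orientation, and scans every haplotype, which is exactly the linear cost a compressed index is meant to avoid.

```python
from typing import List, Sequence


def contains_walk(haplotype: Sequence[int], walk: Sequence[int]) -> bool:
    """Return True if `walk` occurs as a contiguous sub-walk of `haplotype`."""
    n, m = len(haplotype), len(walk)
    if m == 0:
        return True  # the empty walk is consistent with every haplotype
    return any(list(haplotype[i:i + m]) == list(walk) for i in range(n - m + 1))


def count_consistent(haplotypes: List[Sequence[int]], walk: Sequence[int]) -> int:
    """Count haplotypes that contain `walk` (naive baseline, not an index)."""
    return sum(1 for h in haplotypes if contains_walk(h, walk))


# Toy graph: node IDs are illustrative only.
haplotypes = [
    [1, 2, 4, 5],
    [1, 3, 4, 5],
    [1, 2, 4, 6],
]
two_four = count_consistent(haplotypes, [2, 4])
three_four_six = count_consistent(haplotypes, [3, 4, 6])
```

A walk with a count of zero, such as `[3, 4, 6]` here, is a recombination that no stored haplotype supports; that is the kind of signal the abstract suggests a read mapper could use.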


2013 ◽  
Vol 42 (D1) ◽  
pp. D903-D909 ◽  
Author(s):  
Marc Pybus ◽  
Giovanni M. Dall’Olio ◽  
Pierre Luisi ◽  
Manu Uzkudun ◽  
Angel Carreño-Torres ◽  
...  


2021 ◽  
Author(s):  
Martin Hunt ◽  
Brice Letcher ◽  
Kerri M Malone ◽  
Giang Nguyen ◽  
Michael B Hall ◽  
...  

Short-read variant calling for bacterial genomics is a mature field, and there are many widely used software tools. Different underlying approaches (e.g. pileup, local or global assembly, paired-read use, haplotype use) lend each tool different strengths, especially when considering non-SNP (single nucleotide polymorphism) variation or potentially distant reference genomes. It would therefore be valuable to be able to integrate the results from multiple variant callers, using a robust statistical approach to "adjudicate" at loci where there is disagreement between callers. To this end, we present a tool, Minos, for variant adjudication by mapping reads to a genome graph of variant calls. Minos allows users to combine output from multiple variant callers without loss of precision. Minos also addresses a second problem of joint genotyping SNPs and indels in bacterial cohorts, which can also be framed as an adjudication problem. We benchmark on 62 samples from 3 species (Mycobacterium tuberculosis, Staphylococcus aureus, Klebsiella pneumoniae) and an outbreak of 385 M. tuberculosis samples. Finally, we joint genotype a large M. tuberculosis cohort (N≈15k) for which the rifampicin phenotype is known. We build a map of non-synonymous variants in the RRDR (rifampicin resistance determining region) of the rpoB gene and extend current knowledge relating RRDR SNPs to heterogeneity in rifampicin resistance levels. We replicate this finding in a second M. tuberculosis cohort (N≈13k). Minos is released under the MIT license, available at https://github.com/iqbal-lab-org/minos.



Author(s):  
Jouni Sirén ◽  
Erik Garrison ◽  
Adam M Novak ◽  
Benedict Paten ◽  
Richard Durbin

Abstract
Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes.
Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
Availability and implementation: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2.
Supplementary information: Supplementary data are available at Bioinformatics online.



2019 ◽  
Vol 9 (1) ◽  
Author(s):  
C. Mary Schooling ◽  
Glen D. Johnson ◽  
Jean Grassman

Abstract: Lead is pervasive, although lead exposure has fallen in response to public health efforts. Observationally, lead is positively associated with cardiovascular disease and hypertension. We used separate-sample instrumental variable analysis with genetic instruments (Mendelian randomization) based on 13 single nucleotide polymorphisms (SNPs), from a genome-wide association study, strongly (p-value < 5 × 10⁻⁶) and independently associated with blood lead. These SNPs were applied to a large extensively genotyped coronary artery disease (CAD) study (cases = <76014, controls = <264785) largely based on CARDIoGRAMplusC4D 1000 Genomes and the UK Biobank SOFT CAD, to the UK Biobank (n = 361,194) for blood pressure and to the DIAGRAM 1000 Genomes diabetes case (n = 26,676)-control (n = 132,532) study. SNP-specific Wald estimates were combined using inverse variance weighting, MR-Egger and MR-PRESSO. Genetically instrumented blood lead was not associated with CAD (odds ratio (OR) 1.01 per effect size of log transformed blood lead, 95% confidence interval (CI) 0.97, 1.05), blood pressure (systolic −0.18 mmHg, 95% CI −0.44 to 0.08 and diastolic −0.03 mmHg, 95% CI −0.09 to 0.15) or diabetes (OR 0.98, 95% CI 0.92 to 1.03) using MR-PRESSO estimates corrected for an outlier SNP (rs550057) from the highly pleiotropic gene ABO. Exogenous lead may have different effects from endogenous lead; nevertheless, this study raises questions about the role of blood lead in CAD.
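The step "SNP-specific Wald estimates were combined using inverse variance weighting" can be illustrated with a short sketch. It assumes the standard fixed-effect form: each SNP's Wald ratio is the SNP-outcome effect divided by the SNP-exposure effect, with a first-order standard error that ignores uncertainty in the exposure effect. The function name and the input numbers are made up for illustration and are not taken from the study; MR-Egger and MR-PRESSO are not shown.

```python
import math
from typing import Sequence, Tuple


def ivw_estimate(
    beta_exposure: Sequence[float],
    beta_outcome: Sequence[float],
    se_outcome: Sequence[float],
) -> Tuple[float, float]:
    """Fixed-effect inverse-variance-weighted combination of Wald ratios.

    Returns (pooled estimate, pooled standard error).
    """
    num = 0.0
    den = 0.0
    for bx, by, sy in zip(beta_exposure, beta_outcome, se_outcome):
        wald = by / bx           # per-SNP causal estimate
        se = sy / abs(bx)        # first-order SE of the Wald ratio
        weight = 1.0 / se ** 2   # inverse-variance weight
        num += weight * wald
        den += weight
    return num / den, math.sqrt(1.0 / den)


# Illustrative inputs only (two hypothetical SNPs).
estimate, std_err = ivw_estimate(
    beta_exposure=[0.1, 0.2],
    beta_outcome=[0.02, 0.02],
    se_outcome=[0.01, 0.01],
)
ci_low = estimate - 1.96 * std_err
ci_high = estimate + 1.96 * std_err
```

For a binary outcome such as CAD, the inputs would be log odds ratios, and the pooled estimate and interval would be exponentiated to report an OR with its 95% CI.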



2016 ◽  
Author(s):  
Cathal Seoighe ◽  
Aylwyn Scally

Abstract: The rate of germline mutation varies widely between species but little is known about the extent of variation in the germline mutation rate between individuals of the same species. Here we demonstrate that an allele that increases the rate of germline mutation can result in a distinctive signature in the genomic region linked to the affected locus, characterized by a number of haplotypes with a locally high proportion of derived alleles, against a background of haplotypes carrying a typical proportion of derived alleles. We searched for this signature in human haplotype data from phase 3 of the 1000 Genomes Project and report a number of candidate mutator loci, several of which are located close to or within genes involved in DNA repair or the DNA damage response. To investigate whether mutator alleles remained active at any of these loci, we used de novo mutation counts from human parent-offspring trios in the 1000 Genomes and Genome of the Netherlands cohorts, looking for an elevated number of de novo mutations in the offspring of parents carrying a candidate mutator haplotype at each of these loci. We found some support for two of the candidate loci, including one locus just upstream of the BRSK2 gene, which is expressed in the testis and has been reported to be involved in the response to DNA damage.
Author Summary: Each time a genome is replicated there is the possibility of error resulting in the incorporation of an incorrect base or bases in the genome sequence. When these errors occur in cells that lead to the production of gametes they can be incorporated into the germline. Such germline mutations are the basis of evolutionary change; however, to date there has been little attempt to quantify the extent of genetic variation in human populations in the rate at which they occur. This is particularly important because new spontaneous mutations are thought to make an important contribution to many human diseases.
Here we present a new way to identify genetic loci that may be associated with an elevated rate of germline mutation and report the application of this method to data from a large number of human genomes, generated by the 1000 Genomes Project. Several of the candidate loci we report are in or near genes involved in DNA repair and some were supported by direct measurement of the mutation rate obtained from parent-offspring trios.



PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12502
Author(s):  
Nikita Moshkov ◽  
Aleksandr Smetanin ◽  
Tatiana V. Tatarinova

Summary: We developed PyLAE, a new tool for determining local ancestry along a genome using whole-genome sequencing data or high-density genotyping experiments. PyLAE can process an arbitrarily large number of ancestral populations (with or without an informative prior). Since PyLAE does not involve estimating many parameters, it can process thousands of genomes within a day. PyLAE can run on phased or unphased genomic data. We have shown how PyLAE can be applied to the identification of differentially enriched pathways between populations. The local ancestry approach results in higher enrichment scores compared to whole-genome approaches. We benchmarked PyLAE using the 1000 Genomes dataset, comparing the aggregated predictions with the global admixture results and the current gold standard program RFMix. Computational efficiency, minimal requirements for data pre-processing, straightforward presentation of results, and ease of installation make PyLAE a valuable tool to study admixed populations.
Availability and implementation: The source code and installation manual are available at https://github.com/smetam/pylae.



F1000Research ◽  
2018 ◽  
Vol 7 ◽  
pp. 1391
Author(s):  
Evan Biederstedt ◽  
Jeffrey C. Oliver ◽  
Nancy F. Hansen ◽  
Aarti Jajoo ◽  
Nathan Dunn ◽  
...  

Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. Here we present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.



1991 ◽  
Vol 01 (03) ◽  
pp. 227-241 ◽  
Author(s):  
ARI RAPPOPORT

We introduce the Extended Convex Differences Tree (ECDT) representation for n-dimensional polyhedra. The ECDT is simultaneously a representation and an efficiency scheme. A set is represented by a tree. Every node holds a convex bound to the set which it represents (not necessarily the convex hull). The union of the sets represented recursively by the children is the set difference between the parent's convex bound and the set the parent represents. The fact that a node holds a bound to its set is useful for avoidance of unnecessary computations. This bound is convex, permitting efficient algorithms. The ECDT uses convex differences and is therefore able to look at concave areas as being convex. The ECDT can be viewed as an extension to the Convex Differences Tree scheme, without its drawbacks, or as a restricted form of CSG. We show how Boolean operations are performed directly on the ECDT and how a CSG tree is converted to ECDT form. Various geometric operations on the ECDT are detailed, including point membership classification, slicing by a hyper-plane, and boundary evaluation.
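The tree definition above implies a simple recursive rule for point membership: a point belongs to a node's set if it lies inside the node's convex bound and in none of the sets represented by the children. The sketch below illustrates only that rule. It assumes axis-aligned boxes as the convex bounds (the paper allows general convex polyhedra), treats boundaries as closed, and uses a hypothetical `Node` class that is not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Node:
    """A tree node whose convex bound is an axis-aligned box (an assumption)."""
    lo: Sequence[float]                      # lower corner of the bound
    hi: Sequence[float]                      # upper corner of the bound
    children: List["Node"] = field(default_factory=list)

    def in_bound(self, p: Sequence[float]) -> bool:
        # Closed box test; a cheap early rejection for points far from the set.
        return all(l <= x <= h for l, x, h in zip(self.lo, p, self.hi))

    def contains(self, p: Sequence[float]) -> bool:
        # The node's set is its bound minus the union of its children's sets.
        if not self.in_bound(p):
            return False
        return not any(child.contains(p) for child in self.children)


# A 4x4 square with a 2x2 hole, and a 1x1 island inside the hole.
island = Node(lo=(1.5, 1.5), hi=(2.5, 2.5))
hole = Node(lo=(1.0, 1.0), hi=(3.0, 3.0), children=[island])
shape = Node(lo=(0.0, 0.0), hi=(4.0, 4.0), children=[hole])

in_solid_part = shape.contains((0.5, 0.5))
in_hole = shape.contains((2.0, 1.2))
in_island = shape.contains((2.0, 2.0))
outside = shape.contains((5.0, 5.0))
```

Because the bound test comes first, a query outside a node's bound never descends into its subtree, which is the "avoidance of unnecessary computations" the abstract refers to.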



BIOPHYSICS ◽  
2018 ◽  
Vol 63 (3) ◽  
pp. 311-317 ◽  
Author(s):  
S. N. Petrov ◽  
L. A. Uroshlev ◽  
A. S. Kasyanov ◽  
V. Yu. Makeev


2017 ◽  
Author(s):  
Adam M. Novak ◽  
Glenn Hickey ◽  
Erik Garrison ◽  
Sean Blum ◽  
Abram Connelly ◽  
...  

Abstract: There is increasing recognition that a single, monoploid reference genome is a poor universal reference structure for human genetics, because it represents only a tiny fraction of human variation. Adding this missing variation results in a structure that can be described as a mathematical graph: a genome graph. We demonstrate that, in comparison to the existing reference genome (GRCh38), genome graphs can substantially improve the fractions of reads that map uniquely and perfectly. Furthermore, we show that this fundamental simplification of read mapping transforms the variant calling problem from one in which many non-reference variants must be discovered de-novo to one in which the vast majority of variants are simply re-identified within the graph. Using standard benchmarks as well as a novel reference-free evaluation, we show that a simplistic variant calling procedure on a genome graph can already call variants at least as well as, and in many cases better than, a state-of-the-art method on the linear human reference genome. We anticipate that graph-based references will supplant linear references in humans and in other applications where cohorts of sequenced individuals are available.


