A Natural Encoding of Genetic Variation in a Burrows-Wheeler Transform to Enable Mapping and Genome Inference

Abstract Motivation The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. Results We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. Availability and implementation Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. Supplementary information Supplementary data are available at Bioinformatics online.

An improved encoding of genetic variation in a Burrows-Wheeler transform

10.1101/658716 ◽

2019 ◽

Author(s):

Thomas Büchler ◽

Enno Ohlebusch

Keyword(s):

Genetic Variation ◽

Copy Number ◽

Reference Genome ◽

Search Algorithm ◽

The Other ◽

Read Mapping ◽

Marked Chromosome ◽

Number Variation ◽

Burrows Wheeler Transform ◽

Multiple Variants

Abstract Motivation In resequencing experiments, a high-throughput sequencer produces DNA-fragments (called reads) and each read is then mapped to the locus in a reference genome at which it fits best. Currently dominant read mappers (Li and Durbin, 2009; Langmead and Salzberg, 2012) are based on the Burrows-Wheeler transform (BWT). A read can be mapped correctly if it is similar enough to a substring of the reference genome. However, since the reference genome does not represent all known variations, read mapping tends to be biased towards the reference and mapping errors may thus occur. To cope with this problem, Huang et al. (2013) encoded SNPs in a BWT by the IUPAC nucleotide code (Cornish-Bowden, 1985). In a different approach, Maciuca et al. (2016) provided a ‘natural encoding’ of SNPs and other genetic variations in a BWT. However, their encoding resulted in a significantly increased alphabet size (the modified alphabet can have millions of new symbols, which usually implies a loss of efficiency). Moreover, the two approaches do not handle all known kinds of variation. Results In this article, we propose a method that is able to encode many kinds of genetic variation (SNPs, MNPs, indels, duplications, transpositions, inversions, and copy-number variation) in a BWT. It takes the best of both worlds: SNPs are encoded by the IUPAC nucleotide code as in (Huang et al., 2013) and the encoding of the other kinds of genetic variation relies on the idea introduced in (Maciuca et al., 2016). In contrast to Maciuca et al. (2016), however, we use only one additional symbol. This symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the ‘marked chromosome’. We show how the backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it can cope with the genetic variation encoded in the BWT. We implemented our method and compared it to BWBBLE (Huang et al., 2013) and gramtools (Maciuca et al., 2016). Availability https://www.uni-ulm.de/in/theo/research/seqana/ Contact: [email protected]
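
For orientation, the plain backward search that the abstract refers to can be sketched as follows. This is a minimal toy version of the standard algorithm on an unmodified BWT, not the authors' variation-aware modification. The function names (`build_fm_index`, `backward_search`) and the naive suffix-array construction are assumptions made for illustration only.

```python
from collections import Counter


def build_fm_index(reference):
    """Build a toy FM-index (suffix array, C table, Occ table) for a DNA string."""
    text = reference + "$"  # "$" is a sentinel that sorts before A, C, G, T
    # Naive suffix sorting; acceptable only for short illustrative inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (index -1 wraps to "$").
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[c]: number of characters in the text strictly smaller than c.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for sym in occ:
            occ[sym][i + 1] = occ[sym][i] + (sym == ch)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return the sorted start positions of exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval [lo, hi)
    for ch in reversed(pattern):  # process the pattern right to left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:  # interval became empty: no occurrence
            return []
    return sorted(sa[i] for i in range(lo, hi))
```

Each step narrows the suffix-array interval using only the C and Occ tables, which is why the cost of rank queries on the BWT matters for any encoding that enlarges the alphabet.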

A natural encoding of genetic variation in a Burrows-Wheeler Transform to enable mapping and genome inference

10.1101/059170 ◽

2016 ◽

Cited By ~ 6

Author(s):

Sorina Maciuca ◽

Carlos del Ojo Elias ◽

Gil McVean ◽

Zamin Iqbal

Keyword(s):

Genetic Variation ◽

Human Genome ◽

Genetic Variants ◽

Reference Genome ◽

Exact Matching ◽

Performance Impact ◽

Alphabet Size ◽

The Cost ◽

Burrows Wheeler Transform

Abstract We show how positional markers can be used to encode genetic variation within a Burrows-Wheeler Transform (BWT), and use this to construct a generalisation of the traditional “reference genome”, incorporating known variation within a species. Our goal is to support the inference of the closest mosaic of previously known sequences to the genome(s) under analysis. Our scheme results in an increased alphabet size, and by using a wavelet tree encoding of the BWT we reduce the performance impact on rank operations. We give a specialised form of the backward search that allows variation-aware exact matching. We implement this, and demonstrate the cost of constructing an index of the whole human genome with 8 million genetic variants is 25GB of RAM. We also show that inferring a closer reference can close large kilobase-scale coverage gaps in P. falciparum.
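
To illustrate why a wavelet tree helps when the alphabet grows, the sketch below answers rank queries over an integer alphabet in a number of steps proportional to the logarithm of the alphabet range. It is a simplified pointer-based version that stores prefix counts in plain Python lists instead of succinct bitvectors. The class name `WaveletTree` and the integer symbol coding used in the test data are assumptions for illustration, not the layout used by the authors' implementation.

```python
class WaveletTree:
    """Toy wavelet tree over a non-empty sequence of integers."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return  # leaf (single symbol) or empty subtree
        mid = (lo + hi) // 2
        # prefix[i]: how many of the first i symbols are routed to the left child.
        self.prefix = [0]
        for s in seq:
            self.prefix.append(self.prefix[-1] + (s <= mid))
        self.left = WaveletTree([s for s in seq if s <= mid], lo, mid)
        self.right = WaveletTree([s for s in seq if s > mid], mid + 1, hi)

    def rank(self, symbol, i):
        """Count occurrences of symbol among the first i elements."""
        if symbol < self.lo or symbol > self.hi:
            return 0
        node = self
        while node.lo != node.hi:
            if node.left is None:  # empty subtree: symbol never occurs here
                return 0
            mid = (node.lo + node.hi) // 2
            if symbol <= mid:
                i = node.prefix[i]  # map position into the left child
                node = node.left
            else:
                i = i - node.prefix[i]  # map position into the right child
                node = node.right
        return i
```

With this structure the rank cost depends on the tree depth rather than on the number of distinct symbols, which is the property the abstract relies on when the alphabet is extended by marker symbols.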

An improved encoding of genetic variation in a Burrows–Wheeler transform

Bioinformatics ◽

10.1093/bioinformatics/btz782 ◽

2019 ◽

Author(s):

Thomas Büchler ◽

Enno Ohlebusch

Keyword(s):

Genetic Variation ◽

Reference Genome ◽

Search Algorithm ◽

International Workshop ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Short Read Alignment ◽

Marked Chromosome ◽

Number Variation ◽

Burrows Wheeler Transform

Abstract Motivation In resequencing experiments, a high-throughput sequencer produces DNA-fragments (called reads) and each read is then mapped to the locus in a reference genome at which it fits best. Currently dominant read mappers are based on the Burrows–Wheeler transform (BWT). A read can be mapped correctly if it is similar enough to a substring of the reference genome. However, since the reference genome does not represent all known variations, read mapping tends to be biased towards the reference and mapping errors may thus occur. To cope with this problem, Huang et al. encoded single nucleotide polymorphisms (SNPs) in a BWT by the International Union of Pure and Applied Chemistry (IUPAC) nucleotide code. In a different approach, Maciuca et al. provided a ‘natural encoding’ of SNPs and other genetic variations in a BWT. However, their encoding resulted in a significantly increased alphabet size (the modified alphabet can have millions of new symbols, which usually implies a loss of efficiency). Moreover, the two approaches do not handle all known kinds of variation. Results In this article, we propose a method that is able to encode many kinds of genetic variation (SNPs, multi-nucleotide polymorphisms, insertions or deletions, duplications, transpositions, inversions and copy-number variation) in a BWT. It takes the best of both worlds: SNPs are encoded by the IUPAC nucleotide code as in Huang et al. (2013, Short read alignment with populations of genomes. Bioinformatics, 29, i361–i370) and the encoding of the other kinds of genetic variation relies on the idea introduced in Maciuca et al. (2016, A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In: Proceedings of the 16th International Workshop on Algorithms in Bioinformatics, Volume 9838 of Lecture Notes in Computer Science, pp. 222–233. Springer). In contrast to Maciuca et al., however, we use only one additional symbol. 
This symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the ‘marked chromosome’. We show how the backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it can cope with the genetic variation encoded in the BWT. We implemented our method and compared it with BWBBLE and gramtools. Availability and implementation https://www.uni-ulm.de/in/theo/research/seqana/. Contact [email protected]
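
The effect of encoding SNPs with IUPAC codes can be sketched with a naive scan, shown below. It only illustrates the matching semantics (a read base matches a reference position if it belongs to the set denoted by the IUPAC code there); it is not the BWT-based search used by BWBBLE or by the method described in the abstract. The helper names `encode_snps` and `find_matches` and the SNP input format are hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of bases -> IUPAC code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def encode_snps(reference, snps):
    """Replace each SNP position with the IUPAC code covering all its alleles.

    snps maps a 0-based position to a string of alternative bases
    (hypothetical input format chosen for this sketch).
    """
    seq = list(reference)
    for pos, alts in snps.items():
        alleles = frozenset(alts) | {reference[pos]}
        seq[pos] = CODE_FOR[alleles]
    return "".join(seq)


def find_matches(read, encoded):
    """Return start positions where every read base is allowed by the code."""
    hits = []
    for start in range(len(encoded) - len(read) + 1):
        window = encoded[start:start + len(read)]
        if all(base in IUPAC[code] for base, code in zip(read, window)):
            hits.append(start)
    return hits
```

For instance, a G/A SNP at position 2 of `ACGTACGT` yields `ACRTACGT`, against which both `ACGT` and `ACAT` match at position 0.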
