Distance indexing and seed clustering in sequence graphs

Abstract. Motivation: Graph representations of genomes are capable of expressing more genetic variation and can therefore better represent a population than standard linear genomes. However, due to the greater complexity of genome graphs relative to linear genomes, some functions that are trivial on linear genomes become much more difficult in genome graphs. Calculating distance is one such function that is simple in a linear genome but complicated in a graph context. In read mapping algorithms such distance calculations are fundamental to determining if seed alignments could belong to the same mapping. Results: We have developed an algorithm for quickly calculating the minimum distance between positions on a sequence graph using a minimum distance index. We have also developed an algorithm that uses the distance index to cluster seeds on a graph. We demonstrate that our implementations of these algorithms are efficient and practical to use for a new generation of mapping algorithms based upon genome graphs. Availability and implementation: Our algorithms have been implemented as part of the vg toolkit and are available at https://github.com/vgteam/vg.

Distance Indexing and Seed Clustering in Sequence Graphs

10.1101/2019.12.20.884924 ◽

2019 ◽

Author(s):

Xian Chang ◽

Jordan Eizenga ◽

Adam M. Novak ◽

Jouni Sirén ◽

Benedict Paten

Keyword(s):

Genetic Variation ◽

Minimum Distance ◽

Clustering Algorithms ◽

Read Mapping ◽

Mapping Algorithms ◽

Graph Representations ◽

Sequence Graph ◽

Standard Linear ◽

The Cost ◽

Linear Genomes

Abstract: Graph representations of genomes are capable of expressing more genetic variation and can therefore better represent a population than standard linear genomes. However, due to the greater complexity of genome graphs relative to linear genomes, some functions that are trivial on linear genomes become more difficult in genome graphs. Calculating distance is one such function that is simple in a linear genome but much more complicated in a graph context. In read mapping algorithms, distance calculations are commonly used in a clustering step to determine if seed alignments could belong to the same mapping. Clustering algorithms are a bottleneck for some mapping algorithms due to the cost of repeated distance calculations. We have developed an algorithm for quickly calculating the minimum distance between positions on a sequence graph using a minimum distance index. We have also developed an algorithm that uses the distance index to cluster seeds on a graph. We demonstrate that our implementations of these algorithms are efficient and practical to use for mapping algorithms.
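To make the problem in this abstract concrete, here is a minimal Python sketch of the naive baseline: a minimum distance between two graph positions computed with Dijkstra's algorithm, followed by clustering of seeds whose pairwise minimum distance falls within a limit. It is not the authors' minimum distance index or the vg implementation. The graph encoding, the function names `min_distance` and `cluster_seeds`, and the toy graph are all assumptions made for illustration, and the sketch treats the graph as directed with no reverse-strand traversal.

```python
import heapq
from itertools import combinations

# Hypothetical toy encoding: node id -> (sequence length, list of successor ids).
# A position is a (node id, offset) pair. Strand/orientation is ignored here.

def min_distance(graph, start, end):
    """Fewest bases to step from `start` to `end`, or None if unreachable."""
    (a, a_off), (b, b_off) = start, end
    if a == b and b_off >= a_off:
        return b_off - a_off
    # Dijkstra over "distance from start to the first base of node v".
    best = {}
    heap = [(graph[a][0] - a_off, s) for s in graph[a][1]]
    heapq.heapify(heap)
    while heap:
        d, v = heapq.heappop(heap)
        if v in best:
            continue
        best[v] = d
        if v == b:
            return d + b_off
        for w in graph[v][1]:
            if w not in best:
                heapq.heappush(heap, (d + graph[v][0], w))
    return None

def cluster_seeds(graph, seeds, limit):
    """Group seeds linked by a chain of pairs within `limit` (union-find)."""
    parent = list(range(len(seeds)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute-force all pairs: this repeated distance computation is the
    # cost that a precomputed distance index is meant to avoid.
    for i, j in combinations(range(len(seeds)), 2):
        dists = [d for d in (min_distance(graph, seeds[i], seeds[j]),
                             min_distance(graph, seeds[j], seeds[i]))
                 if d is not None]
        if dists and min(dists) <= limit:
            parent[find(i)] = find(j)

    clusters = {}
    for i, seed in enumerate(seeds):
        clusters.setdefault(find(i), []).append(seed)
    return sorted(clusters.values())

# Toy graph with one SNP-like bubble: 1 -> {2, 3} -> 4 -> 5.
toy_graph = {1: (4, [2, 3]), 2: (1, [4]), 3: (1, [4]), 4: (5, [5]), 5: (20, [])}
toy_seeds = [(1, 0), (2, 0), (4, 1), (5, 15)]
toy_clusters = cluster_seeds(toy_graph, toy_seeds, limit=6)
```

Every pair of seeds triggers two graph searches in this version, which illustrates why repeated distance calculations become a bottleneck as the number of seeds grows.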

A Flow Procedure for the Linearization of Genome Sequence Graphs

10.1101/101501 ◽

2017 ◽

Cited By ~ 2

Author(s):

David Haussler ◽

Maciej Smuga-Otto ◽

Benedict Paten ◽

Adam M Novak ◽

Sergei Nikitin ◽

...

Keyword(s):

Genetic Variation ◽

Genome Sequence ◽

Graph Representation ◽

Graph Visualization ◽

Human Genetic Variation ◽

A Genome ◽

Sequence Graph ◽

Reference Genomes ◽

Single Graph ◽

Reference Human Genome

Abstract: Efforts to incorporate human genetic variation into the reference human genome have converged on the idea of a graph representation of genetic variation within a species, a genome sequence graph. A sequence graph represents a set of individual haploid reference genomes as paths in a single graph. When that set of reference genomes is sufficiently diverse, the sequence graph implicitly contains all frequent human genetic variations, including translocations, inversions, deletions, and insertions. In representing a set of genomes as a sequence graph one encounters certain challenges. One of the most important is the problem of graph linearization, essential both for efficiency of storage and access, as well as for natural graph visualization and compatibility with other tools. The goal of graph linearization is to order nodes of the graph in such a way that operations such as access, traversal and visualization are as efficient and effective as possible. A new algorithm for the linearization of sequence graphs, called the flow procedure, is proposed in this paper. Comparative experimental evaluation of the flow procedure against other algorithms shows that it outperforms its rivals in the metrics most relevant to sequence graphs.

Teaser: Individualized benchmarking and optimization of read mapping results for NGS data

10.1101/025858 ◽

2015 ◽

Author(s):

Moritz Smolka ◽

Philipp Rescheneder ◽

Michael C Schatz ◽

Arndt von Haeseler ◽

Fritz J Sedlazeck

Keyword(s):

Quantitative Evaluation ◽

Model Organisms ◽

Read Mapping ◽

Mapping Algorithms ◽

A Genome ◽

Mapping Process ◽

Ngs Data

Mapping reads to a genome remains challenging, especially for non-model organisms with poorer quality assemblies, or for organisms with higher rates of mutations. While most research has focused on speeding up the mapping process, little attention has been paid to optimizing the choice of mapper and parameters for a user’s dataset. Here we present Teaser, which assists in these choices through rapid automated benchmarking of different mappers and parameter settings for individualized data. Within minutes, Teaser completes a quantitative evaluation of an ensemble of mapping algorithms and parameters. Using Teaser, we demonstrate how Bowtie2 can be optimized for different data.

Variation graph toolkit improves read mapping by representing genetic variation in the reference

Nature Biotechnology ◽

10.1038/nbt.4227 ◽

2018 ◽

Vol 36 (9) ◽

pp. 875-879 ◽

Cited By ~ 150

Author(s):

Erik Garrison ◽

Jouni Sirén ◽

Adam M Novak ◽

Glenn Hickey ◽

Jordan M Eizenga ◽

...

Keyword(s):

Genetic Variation ◽

Read Mapping

Using syncmers improves long-read mapping

10.1101/2022.01.10.475696 ◽

2022 ◽

Author(s):

David Pellow ◽

Abhinav Dutta ◽

Ron Shamir

Keyword(s):

Read Mapping ◽

Mapping Algorithms ◽

Sequencing Errors ◽

Sequence Identity ◽

Memory Efficiency ◽

Long Reads ◽

Long Read ◽

Cancer Tumor ◽

Generation Sequencing ◽

Unmapped Reads

As sequencing datasets keep growing larger, time and memory efficiency of read mapping are becoming more critical. Many clever algorithms and data structures were used to develop mapping tools for next generation sequencing, and in the last few years also for third generation long reads. A key idea in mapping algorithms is to sketch sequences with their minimizers. Recently, syncmers were introduced as an alternative sketching method that is more robust to mutations and sequencing errors. Here we introduce parameterized syncmer schemes, and provide a theoretical analysis for multi-parameter schemes. By combining these schemes with downsampling or minimizers we can achieve any desired compression and window guarantee. We introduced syncmer schemes into the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long read data from a variety of genomes, the syncmer-based algorithms reduced unmapped reads by 20-60% at high compression while using less memory. The advantage of syncmer-based mapping was even more pronounced at lower sequence identity. At sequence identity of 65-75% and medium compression, syncmer mappers had 50-60% fewer unmapped reads, and ∼ 10% fewer of the reads that did map were incorrectly mapped. We conclude that syncmer schemes improve mapping under higher error and mutation rates. This situation happens, for example, when the high error rate of long reads is compounded by a high mutation rate in a cancer tumor, or due to differences between strains of viruses or bacteria.
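As an illustration of the sketching idea mentioned in this abstract, the following Python sketch selects closed syncmers: k-mers whose smallest s-mer sits at the first or last s-mer position. It is a simplified example, not the parameterized schemes of the paper and not the minimap2 or Winnowmap2 code. The function name `closed_syncmers` is hypothetical, s-mers are ordered lexicographically rather than by a hash, and a tie counts as a match if the first or last s-mer equals the minimum; all three are assumptions made to keep the example short.

```python
def closed_syncmers(seq, k, s):
    """Return (position, k-mer) pairs whose smallest s-mer is first or last.

    Assumptions for illustration: lexicographic s-mer order (real tools
    typically hash s-mers) and ties accepted at either end.
    """
    if not 0 < s <= k:
        raise ValueError("require 0 < s <= k")
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        smallest = min(smers)
        # Selection depends only on the k-mer itself, not on its neighbours.
        if smers[0] == smallest or smers[-1] == smallest:
            selected.append((i, kmer))
    return selected

example_syncmers = closed_syncmers("GATTACAGA", k=5, s=2)
```

Because the decision for each k-mer uses only that k-mer's own s-mers, a mutation outside the k-mer cannot change whether it is selected, which is the context-independence that distinguishes syncmers from window-based minimizers.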

Read Mapping Algorithms for Single Molecule Sequencing Data

Lecture Notes in Computer Science - Algorithms in Bioinformatics ◽

10.1007/978-3-540-87361-7_4 ◽

2008 ◽

pp. 38-49 ◽

Cited By ~ 2

Author(s):

Vladimir Yanovsky ◽

Stephen M. Rumble ◽

Michael Brudno

Keyword(s):

Single Molecule ◽

Sequencing Data ◽

Read Mapping ◽

Mapping Algorithms ◽

Single Molecule Sequencing

An improved encoding of genetic variation in a Burrows-Wheeler transform

10.1101/658716 ◽

2019 ◽

Author(s):

Thomas Büchler ◽

Enno Ohlebusch

Keyword(s):

Genetic Variation ◽

Copy Number ◽

Reference Genome ◽

Search Algorithm ◽

The Other ◽

Read Mapping ◽

Marked Chromosome ◽

Number Variation ◽

Burrows Wheeler Transform ◽

Multiple Variants

Abstract. Motivation: In resequencing experiments, a high-throughput sequencer produces DNA-fragments (called reads) and each read is then mapped to the locus in a reference genome at which it fits best. Currently dominant read mappers (Li and Durbin, 2009; Langmead and Salzberg, 2012) are based on the Burrows-Wheeler transform (BWT). A read can be mapped correctly if it is similar enough to a substring of the reference genome. However, since the reference genome does not represent all known variations, read mapping tends to be biased towards the reference and mapping errors may thus occur. To cope with this problem, Huang et al. (2013) encoded SNPs in a BWT by the IUPAC nucleotide code (Cornish-Bowden, 1985). In a different approach, Maciuca et al. (2016) provided a ‘natural encoding’ of SNPs and other genetic variations in a BWT. However, their encoding resulted in a significantly increased alphabet size (the modified alphabet can have millions of new symbols, which usually implies a loss of efficiency). Moreover, the two approaches do not handle all known kinds of variation. Results: In this article, we propose a method that is able to encode many kinds of genetic variation (SNPs, MNPs, indels, duplications, transpositions, inversions, and copy-number variation) in a BWT. It takes the best of both worlds: SNPs are encoded by the IUPAC nucleotide code as in (Huang et al., 2013) and the encoding of the other kinds of genetic variation relies on the idea introduced in (Maciuca et al., 2016). In contrast to Maciuca et al. (2016), however, we use only one additional symbol. This symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the ‘marked chromosome’. We show how the backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it can cope with the genetic variation encoded in the BWT. We implemented our method and compared it to BWBBLE (Huang et al., 2013) and gramtools (Maciuca et al., 2016). Availability: https://www.uni-ulm.de/in/theo/research/seqana/ Contact: [email protected]
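For readers unfamiliar with the backward search this abstract builds on, here is a minimal Python sketch of the standard, unmodified version on a plain BWT. It does not implement the variant-aware modification the paper proposes, nor the IUPAC or marker-symbol encodings. The function names `bwt_index` and `count_occurrences` are hypothetical, the suffix array is built by naive sorting, and ranks are computed with `str.count`, choices that are only reasonable for toy inputs.

```python
def bwt_index(text):
    """Build the BWT of text + '$' and the C table (naive construction)."""
    text += "$"  # sentinel, assumed absent from the input and smallest in order
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] with i == 0 wraps to the sentinel, as required.
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch]: number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return bwt, c_table

def count_occurrences(bwt, c_table, pattern):
    """Count exact matches of pattern by backward search."""
    lo, hi = 0, len(bwt)  # half-open suffix-array interval
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        # Rank queries via str.count; real indexes use sampled rank structures.
        lo = c_table[ch] + bwt.count(ch, 0, lo)
        hi = c_table[ch] + bwt.count(ch, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo

example_bwt, example_c = bwt_index("ACGTACGTAC")
```

The pattern is consumed right to left, and each step narrows the interval of suffixes that start with the part of the pattern read so far.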

A variant selection framework for genome graphs

10.1101/2021.02.02.429378 ◽

2021 ◽

Author(s):

Chirag Jain ◽

Neda Tavakoli ◽

Srinivas Aluru

Keyword(s):

Human Chromosome ◽

Chromosome 1 ◽

Mathematical Framework ◽

Variant Selection ◽

Read Mapping ◽

Graph Representations ◽

Graph Size ◽

Long Read ◽

Selection Framework ◽

Reference Bias

Abstract. Motivation: Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping. Results: In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants (SNPs, indels), and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed. We classify the computational complexity of these problems and provide efficient algorithms along with their software implementation when feasible. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to short and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% of SNPs and 73% of indel structural variants can be safely excluded from the human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis. Implementation: https://github.com/at-cg/[email protected], [email protected], [email protected]

Capturing variation in metagenomic assembly graphs with MetaCortex

10.1101/2021.07.23.453484 ◽

2021 ◽

Author(s):

Samuel Martin ◽

Martin Ayling ◽

Livia Patrono ◽

Mario Caccamo ◽

Pablo Murcia ◽

...

Keyword(s):

Species Diversity ◽

De Novo ◽

Local Variation ◽

Strain Level ◽

Assembly Algorithm ◽

Sequence Graph ◽

Viral Communities ◽

Standard Linear ◽

Multiple Species ◽

Metagenomic Assembly

The assembly of contiguous sequence from metagenomic samples presents a particular challenge, due to the presence of multiple species, often closely related, at varying levels of abundance. Capturing diversity within species, for example viral haplotypes, or bacterial strain-level diversity, is even more challenging. We present MetaCortex, a metagenome assembler based on data structures from the Cortex de novo assembler. MetaCortex captures intra-species diversity by searching for signatures of local variation along assembled sequences in the underlying assembly graph and outputting these sequences in sequence graph format. MetaCortex also implements a novel assembly algorithm for representing intra-species diversity in standard linear format. We show that MetaCortex produces accurate assemblies with higher genome coverage and contiguity than other popular metagenomic assemblers on mock viral communities with high levels of strain level diversity, and on simulated communities containing simulated strains. We also show that accuracy can be increased further by using the sequence graph produced by MetaCortex to create highly accurate single contig sequences.

Molecular identification and genetic variation studies in economically important cephalopods at Beypore Fishing Harbour (Kozhikode), South West coast of India

Notulae Scientia Biologicae ◽

10.15835/nsb13110862 ◽

2021 ◽

Vol 13 (1) ◽

pp. 10862

Author(s):

Alex LINCY ◽

Madhavankonath K. ANIL ◽

Muthusamy THANGARAJ ◽

Jean J. JOSE

Keyword(s):

Genetic Variation ◽

Minimum Distance ◽

Universal Primers ◽

Sustainable Utilization ◽

South West ◽

Boot Strap ◽

South West Coast ◽

Mean Size ◽

Sepioteuthis Lessoniana ◽

Biological Entities

Cephalopods are an ecologically and economically important marine group worldwide. Biodiversity description is essential for sustainable utilization of natural resources and to characterize biological entities for conservation. DNA barcoding is an effective tool for identifying organisms at species level and has been widely used to delineate several ambiguous species. In this study, a partial sequence of the mitochondrial cytochrome c oxidase 1 (CO1) gene with a mean size of 680 bp was amplified using universal primers. In total, 13 cephalopod individuals comprising three species were barcoded and their genetic variation was analysed. The maximum A+T content (67.60%) was recorded in Cistopus indicus and the minimum (63.70%) in Sepioteuthis lessoniana. The maximum K2P distance (0.268) was found between the genera Cistopus and Sepioteuthis, whereas the minimum distance (0.188) was observed between Uroteuthis and Sepia. The neighbour-joining tree revealed three distinct clades representing Loliginidae, Sepiidae and Octopodidae with high bootstrap values. However, Sepioteuthis lessoniana shows a bifurcated branch, which may be due to co-occurring cryptic species; to date, this species is treated as the Sepioteuthis lessoniana complex.
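The K2P (Kimura two-parameter) distance reported in this abstract can be computed from two aligned sequences as d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P is the proportion of transitions and Q the proportion of transversions. The Python sketch below is a minimal illustration of that formula only. The function name `k2p_distance` and the example sequences are made up for this example, sites containing anything other than A, C, G or T are skipped as an assumed convention, and the study's own software and data are not reproduced here.

```python
import math

PURINES = {"A", "G"}

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Assumption for illustration: sites with gaps or ambiguity codes are
    skipped, and the sequences are already aligned to equal length.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    # Transition: purine<->purine or pyrimidine<->pyrimidine substitution.
    transitions = sum(1 for a, b in pairs
                      if a != b and (a in PURINES) == (b in PURINES))
    # Transversion: purine<->pyrimidine substitution.
    transversions = sum(1 for a, b in pairs
                        if a != b and (a in PURINES) != (b in PURINES))
    p = transitions / len(pairs)
    q = transversions / len(pairs)
    # math.log raises ValueError if the sequences are too divergent
    # for the model (argument <= 0).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

# Made-up 10-site alignment with one transition (A->G) and one transversion (A->C).
example_distance = k2p_distance("AAAAAAAAAA", "GCAAAAAAAA")
```

For small distances the value is close to the raw proportion of differing sites; the correction grows as substitutions accumulate and multiple hits at the same site become likely.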
