An Approximate de Bruijn Graph Approach to Multiple Local Alignment and Motif Discovery in Protein Sequences

Abstract: Next-generation sequencing (NGS) produces large volumes of short sequences with broad applications. Noise from sequencing errors has led to the development of several correction methods. The main correction paradigm expects high (30-40X and above), uniform coverage in order to infer from the reads a reference set of subsequences that is then used for correction. In practice, the most accurate methods use k-mer spectrum techniques to obtain a set of reference k-mers. However, when correcting NGS datasets with uneven coverage, such as RNA-seq data, this paradigm tends to mistake rare variants for errors. It may therefore discard them or alter them to match highly covered sequences, which loses information and may introduce bias. In this paper we present two new contributions to cope with this situation. First, we show that, starting from non-uniform sequencing coverage, a de Bruijn graph can be cleaned of most errors while preserving biological variability. Second, we demonstrate that reads can be efficiently corrected via local alignment on the paths of the cleaned de Bruijn graph. We implemented the described method in a tool dubbed BCT and evaluated its results on RNA-seq and metagenomic data. We show that combining the graph cleaning strategy with the mapping strategy saves more rare k-mers, resulting in a more conservative correction than previous methods. BCT is also better able to take advantage of the signal in high-depth datasets. We suggest that BCT, which scales to large metagenomic datasets and also corrects shallow single-cell RNA-seq data, can be a general corrector for non-uniform data. Availability: BCT is open source and available at github.com/Malfoy/BCT under the Affero GPL License.
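The k-mer spectrum paradigm that the abstract criticises can be illustrated with a minimal Python sketch. This is not BCT's algorithm: the function names, the toy reads, and the global `min_abundance` threshold are assumptions made only for illustration. The sketch counts k-mers, keeps those above the threshold, and builds a de Bruijn graph from the survivors.

```python
from collections import Counter, defaultdict


def kmer_spectrum(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_abundance):
    """Keep k-mers seen at least min_abundance times (one global threshold)."""
    return {kmer for kmer, count in counts.items() if count >= min_abundance}


def de_bruijn_graph(kmers):
    """Map each (k-1)-mer prefix to the sorted (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# Hypothetical toy data: three copies of one read and a single rare variant.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
counts = kmer_spectrum(reads, k=4)
solid = solid_kmers(counts, min_abundance=2)
graph = de_bruijn_graph(solid)
```

With these toy reads, the k-mers that occur only in the rare variant (`CGTT`, `GTTC`) fall below the threshold and disappear from the graph, which is the kind of information loss under uneven coverage that the abstract describes.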

Detecting High Scoring Local Alignments in Pangenome Graphs

Bioinformatics ◽

10.1093/bioinformatics/btab077 ◽

2021 ◽

Author(s):

Tizian Schulz ◽

Roland Wittler ◽

Sven Rahmann ◽

Faraz Hach ◽

Jens Stoye

Keyword(s):

Sequence Similarity ◽

Query Sequence ◽

Heuristic Method ◽

Supplementary Information ◽

De Bruijn Graph ◽

Local Alignment ◽

Memory Usage ◽

Sequence Comparisons ◽

De Bruijn Graphs ◽

De Bruijn

Abstract: Motivation: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. Results: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. Availability: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. Supplementary information: Supplementary data are available at Bioinformatics online.
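For readers unfamiliar with the notion of a maximum scoring local alignment, the following Python sketch computes the classic Smith-Waterman score between two plain sequences. It is not the graph-based heuristic of the abstract: the function name and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions chosen for illustration, and the target is a linear string instead of a compacted colored de Bruijn graph.

```python
def smith_waterman_score(query, target, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score using a linear gap penalty."""
    previous = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        current = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            substitution = match if query[i - 1] == target[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # align the two characters
                previous[j] + gap,                # gap in the target
                current[j - 1] + gap,             # gap in the query
            )
            best = max(best, current[j])
        previous = current
    return best


identical = smith_waterman_score("ACGT", "ACGT")          # 4
unrelated = smith_waterman_score("AAAA", "TTTT")          # 0
embedded = smith_waterman_score("TTACGTTT", "GGACGTGG")   # 4
```

Resetting negative cells to zero is what makes the alignment local: the shared `ACGT` core scores the same whether or not it is surrounded by unrelated flanking sequence.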

MEME, MAST, and Meta-MEME: New Tools for Motif Discovery in Protein Sequences

Pattern Discovery in Biomolecular Data ◽

10.1093/oso/9780195119404.003.0008 ◽

1999 ◽

Author(s):

Timothy L. Bailey

Keyword(s):

Motif Discovery ◽

Sequence Data ◽

Protein Sequences ◽

Software Tool ◽

Biological Information ◽

Local Alignment ◽

Learning Tools ◽

Protein Sequence Data ◽

Search Tool ◽

And Function

We are in the midst of an explosive increase in the number of DNA and protein sequences available for study, as various genome projects come on line. This wealth of information offers important opportunities for understanding many biological processes and developing new plant and animal models, and ultimately drugs, for human diseases, in addition to other applications of modern biotechnology. Unfortunately, sequences are accumulating at a pace that strains present methods for extracting significant biological information from them. A consequence of this explosion in the sequence databases is that there is much interest and effort in developing tools that can efficiently and automatically extract the relevant biological information in sequence data and make it available for use in biology and medicine. In this chapter, we describe one such method that we have developed based on algorithms from artificial intelligence research. We call this software tool MEME (Multiple Expectation-maximization for Motif Elicitation). It has the attractive property that it is an “unsupervised” discovery tool: it can identify motifs, such as regulatory sites in DNA and functional domains in proteins, from large or small groups of unaligned sequences. As we show below, motifs are a rich source of information about a dataset; they can be used to discover other homologs in a database, to identify protein subsets that contain one or more motifs, and to provide information for mutagenesis studies to elucidate structure and function in the protein family as well as its evolution. Learning tools are used to extract higher level biological patterns from lower level DNA and protein sequence data. In contrast, search tools such as BLAST (Basic Local Alignment Search Tool) take a given higher level pattern and find all items in a database that possess the pattern. 
Searching for items that have a certain pattern is a problem intrinsically easier than discovering what the pattern is from items that possess it. The patterns considered here are motifs, which for DNA data can be subsequences that interact with transcription factors, polymerases, and other proteins.
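The "search" half of the learning-versus-search distinction can be sketched in a few lines of Python. The example below is not MEME, MAST, or any part of their interfaces: it assumes that a motif has already been learned, represents it as a log-odds position weight matrix over a uniform background, and scans a sequence for the best-scoring window. The function names, the pseudocount, and the toy sites are all hypothetical.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Log-odds position weight matrix from equal-length aligned sites.

    Assumes a uniform background distribution over the alphabet.
    """
    width = len(sites[0])
    background = 1.0 / len(alphabet)
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for position in range(width):
        column = [site[position] for site in sites]
        pwm.append({
            letter: math.log2((column.count(letter) + pseudocount) / total / background)
            for letter in alphabet
        })
    return pwm


def best_match(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    scored = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Hypothetical aligned motif instances and a sequence to search.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)
score, start = best_match(pwm, "GGCGTATAATGCG")
```

Discovering the motif from unaligned sequences, which is the harder "learning" problem the chapter addresses, would require estimating both the matrix and the site positions at once, for example by expectation-maximization; that step is deliberately left out of this sketch.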

Detecting High Scoring Local Alignments in Pangenome Graphs

10.1101/2020.09.03.280958 ◽

2020 ◽

Author(s):

Tizian Schulz ◽

Roland Wittler ◽

Sven Rahmann ◽

Faraz Hach ◽

Jens Stoye

Keyword(s):

Sequence Similarity ◽

Query Sequence ◽

Heuristic Method ◽

De Bruijn Graph ◽

Local Alignment ◽

Memory Usage ◽

Sequence Comparisons ◽

De Bruijn Graphs ◽

De Bruijn ◽

Colored De Bruijn Graph

Abstract: Motivation: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. Results: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome.

Improving the efficiency of de Bruijn graph construction using compact universal hitting sets

Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics ◽

10.1145/3459930.3469520 ◽

2021 ◽

Author(s):

Yael Ben-Ari ◽

Dan Flomin ◽

Lianrong Pu ◽

Yaron Orenstein ◽

Ron Shamir

Keyword(s):

De Bruijn Graph ◽

Hitting Sets ◽

De Bruijn

Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00182-9 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Kingshuk Mukherjee ◽

Massimiliano Rossi ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Single Molecule ◽

De Bruijn Graph ◽

Anabas Testudineus ◽

E Coli ◽

Genome Wide ◽

A Genome ◽

De Bruijn ◽

Optical Maps ◽

Definition Of ◽

Numeric Representation

Abstract: Genome-wide optical maps are high-resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single-molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly available non-proprietary method for assembly and one proprietary software package that is available via an executable. Furthermore, the publicly available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006), follows the overlap-layout-consensus (OLC) paradigm and is therefore unable to scale to relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics’ Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method ran successfully on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) ran successfully only on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory, and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/Rmapper.
