Spaced Seed Data Structures for De Novo Assembly

2015
Vol 2015
pp. 1-8
Author(s):  
Inanç Birol,
Justin Chu,
Hamid Mohamadi,
Shaun D. Jackman,
Karthika Raghavan,
...  

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Although longer subsequences unlock longer genomic features for assembly, the associated increase in compute resources limits the practicality of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address the memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates the subsequences of a seed pair, sequence specificity increases with gap length. Further, we note that Bloom filters are well suited to implicitly storing spaced seeds while remaining tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and in read error correction.
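
As a concrete illustration of the last two points, the sketch below stores paired spaced seeds (two subsequences separated by a fixed gap) in a small Bloom filter. It is a hypothetical rendering, not the authors' implementation: the filter size, hash scheme and seed layout are placeholder choices. The key property it demonstrates is that a sequencing error falling inside the gap leaves the seed unchanged.

```cpp
// Hypothetical sketch: paired spaced seeds stored in a Bloom filter.
// Sizes, hash functions and the seed layout are illustrative placeholders.
#include <bitset>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// FNV-1a hash; the basis parameter lets us derive two independent values.
static uint64_t fnv1a(const std::string& s, uint64_t h) {
    for (char c : s) { h ^= static_cast<unsigned char>(c); h *= 1099511628211ULL; }
    return h;
}

class BloomFilter {
    static constexpr size_t kBits = 1 << 20;  // filter size in bits
    static constexpr int kHashes = 4;         // hash functions per key
    std::bitset<kBits> bits_;
    // Double hashing: h_i = h1 + i*h2 approximates k independent hashes.
    size_t idx(const std::string& key, int i) const {
        uint64_t h1 = fnv1a(key, 14695981039346656037ULL);
        uint64_t h2 = fnv1a(key, 0x9e3779b97f4a7c15ULL) | 1;
        return (h1 + static_cast<uint64_t>(i) * h2) % kBits;
    }
public:
    void insert(const std::string& key) {
        for (int i = 0; i < kHashes; ++i) bits_.set(idx(key, i));
    }
    bool maybeContains(const std::string& key) const {  // false positives possible
        for (int i = 0; i < kHashes; ++i)
            if (!bits_[idx(key, i)]) return false;
        return true;
    }
};

// All paired spaced seeds of a read: [s bases][gap][s bases].
std::vector<std::string> spacedSeeds(const std::string& read, size_t s, size_t gap) {
    std::vector<std::string> seeds;
    for (size_t p = 0; p + 2 * s + gap <= read.size(); ++p)
        seeds.push_back(read.substr(p, s) + "-" + read.substr(p + s + gap, s));
    return seeds;
}

int main() {
    BloomFilter bf;
    // The gap bases ("AAAAAA" here) never enter the seed, so errors there are tolerated.
    for (const auto& seed : spacedSeeds("ACGTAAAAAAACGT", 4, 6)) bf.insert(seed);
    std::cout << std::boolalpha << bf.maybeContains("ACGT-ACGT") << '\n';  // true
}
```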

2020
Vol 21 (1)
Author(s):
Ming-Feng Hsieh,
Chin Lung Lu,
Chuan Yi Tang

Abstract Background Next-generation sequencing technologies revolutionized genomics by producing high-throughput reads at low cost, and this progress has prompted the recent development of de novo assemblers. Multiple assembly methods based on the de Bruijn graph have been shown to be efficient for Illumina reads. However, the sequencing errors generated by the sequencer complicate de novo assembly and degrade the quality of downstream genomic research. Results In this paper, we develop a de Bruijn graph assembler, called Clover (clustering-oriented de novo assembler), that utilizes a novel k-mer clustering approach drawn from the overlap-layout-consensus concept to deal with the sequencing errors generated by the Illumina platform. We evaluate Clover's performance against several de Bruijn graph assemblers (ABySS, SOAPdenovo, SPAdes and Velvet), overlap-layout-consensus assemblers (Bambus2, CABOG and MSR-CA) and a string graph assembler (SGA) on three datasets (Staphylococcus aureus, Rhodobacter sphaeroides and human chromosome 14). The results show that Clover achieves superior assembly quality in terms of corrected N50 and E-size while remaining competitive in run time with all assemblers except SOAPdenovo. In addition, Clover was used in the sequencing projects of the bacterial genomes Acinetobacter baumannii TYTH-1 and Morganella morganii KT. Conclusions The clustering-based approach of Clover, which integrates the flexibility of the overlap-layout-consensus approach with the efficiency of the de Bruijn graph method, has high potential for de novo assembly. Clover is freely available as open source software from https://oz.nthu.edu.tw/~d9562563/src.html.
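
To make the error-handling idea concrete, here is a generic sketch of folding rare k-mers into a frequent single-mismatch neighbour, one common way de Bruijn graph assemblers absorb sequencing errors. It is explicitly not Clover's clustering algorithm; the function names and the coverage threshold are invented for illustration.

```cpp
// Generic sketch: fold rare k-mers into a frequent 1-mismatch neighbour.
// Illustrates DBG error handling in general, NOT Clover's algorithm;
// countKmers/foldIntoNeighbour and the threshold are invented names.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

using KmerCounts = std::unordered_map<std::string, long>;

KmerCounts countKmers(const std::vector<std::string>& reads, size_t k) {
    KmerCounts counts;
    for (const auto& r : reads)
        for (size_t i = 0; i + k <= r.size(); ++i) ++counts[r.substr(i, k)];
    return counts;
}

// Attribute a rare k-mer to a frequent neighbour differing at one base.
bool foldIntoNeighbour(KmerCounts& counts, const std::string& kmer, long minCov) {
    for (size_t i = 0; i < kmer.size(); ++i)
        for (char b : {'A', 'C', 'G', 'T'}) {
            if (b == kmer[i]) continue;
            std::string cand = kmer;
            cand[i] = b;
            auto it = counts.find(cand);
            if (it != counts.end() && it->second >= minCov) {
                it->second += counts.at(kmer);  // merge the erroneous count
                counts.erase(kmer);
                return true;
            }
        }
    return false;
}

int main() {
    std::vector<std::string> reads = {"ACGTACGT", "ACGTACGT", "ACGAACGT"};  // one error
    auto counts = countKmers(reads, 5);
    for (auto it = counts.begin(); it != counts.end(); ) {
        std::string kmer = (it++)->first;  // advance before a possible erase
        if (counts.at(kmer) == 1) foldIntoNeighbour(counts, kmer, 2);
    }
    for (const auto& [kmer, c] : counts) std::cout << kmer << ' ' << c << '\n';
}
```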


Author(s):  
Borja Freire,
Susana Ladra,
Jose R Paramá,
Leena Salmela

Abstract Motivation RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. Results We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo, but viaDBG is also able to retrieve low-abundance quasispecies, which are often missed by PEHaplo. Availability and implementation viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. Supplementary information Supplementary data are available at Bioinformatics online.
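
The paired de Bruijn graph mentioned above can be pictured with a toy sketch: each node couples a k-mer with the k-mer found a fixed distance d downstream, so identical k-mers from different genomic contexts stay separate. This is a simplified, hypothetical rendering; it assumes one exact pairing distance, whereas real paired-end data has variable insert sizes that tools like viaDBG must tolerate.

```cpp
// Toy paired de Bruijn graph: nodes are (k-mer, k-mer at distance d) pairs.
// Simplified illustration with a fixed exact distance.
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Bimer = std::pair<std::string, std::string>;  // (k-mer, partner k-mer)
using PairedDBG = std::map<Bimer, std::set<Bimer>>;

PairedDBG build(const std::vector<std::string>& seqs, size_t k, size_t d) {
    PairedDBG adj;
    for (const auto& s : seqs)
        for (size_t i = 0; i + 1 + d + k <= s.size(); ++i) {
            Bimer a{s.substr(i, k), s.substr(i + d, k)};          // node at i
            Bimer b{s.substr(i + 1, k), s.substr(i + 1 + d, k)};  // node at i+1
            adj[a].insert(b);  // edge between consecutive positions
        }
    return adj;
}

int main() {
    // "ACG" occurs at two positions but receives different partner k-mers,
    // so the paired graph keeps the repeat copies apart.
    for (const auto& [node, outs] : build({"ACGTTACGTTGCA"}, 3, 4))
        for (const auto& nxt : outs)
            std::cout << node.first << '|' << node.second << " -> "
                      << nxt.first << '|' << nxt.second << '\n';
}
```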


2015
Author(s):  
Matthew D MacManes

Motivation: The correction of sequencing errors contained in Illumina reads derived from genomic DNA is a common pre-processing step in many de novo genome assembly pipelines, and has been shown to improve the quality of resultant assemblies. In contrast, the correction of errors in transcriptome sequence data is much less common, but can potentially yield similar improvements in mapping and assembly quality. This manuscript evaluates several popular read-correction tools' ability to correct sequence errors common in transcriptome-derived Illumina reads. Results: I evaluated the efficacy of correcting transcriptome-derived sequencing reads using several metrics across a variety of sequencing depths. This evaluation demonstrates a complex relationship between the quality of the correction, the depth of sequencing, and hardware availability, which results in variable recommendations depending on the goals of the experiment, tolerance for false positives, and depth of coverage. Overall, read error correction is an important step in read quality control, and should become a standard part of analytical pipelines. Availability: Results are non-deterministically repeatable using AMI:ami-3dae4956 (MacManes EC 2015) and the Makefile available here: https://goo.gl/oVIuE0


2020
Vol 117 (29)
pp. 16961-16968
Author(s):  
Justin Chu,
Hamid Mohamadi,
Emre Erhan,
Jeffery Tse,
Readman Chiu,
...  

Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines, primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also constrained the design and length of practical spaced seeds, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multi-index Bloom filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. Available within BioBloom Tools, we illustrate the utility of miBF in two use cases: read binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read binning than sequence alignment-based methods, while also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.
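
The core idea, attaching target identifiers to Bloom filter positions so that a hit reports which reference it came from, can be sketched as below. This toy version is an assumption-laden simplification: the published miBF couples the bit vector with a rank-compressed ID array and handles collisions through saturation, none of which is modelled here.

```cpp
// Toy sketch of a multi-index Bloom filter: target IDs attached to filter
// slots so a query reports which reference a seed came from.
#include <array>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

static uint64_t fnv1a(const std::string& s, uint64_t h) {
    for (char c : s) { h ^= static_cast<unsigned char>(c); h *= 1099511628211ULL; }
    return h;
}

class ToyMiBF {
    static constexpr size_t kSlots = 1 << 16;
    static constexpr int kHashes = 3;
    std::vector<uint8_t> ids_ = std::vector<uint8_t>(kSlots, 0);  // 0 = empty
    size_t idx(const std::string& key, int i) const {
        uint64_t h1 = fnv1a(key, 14695981039346656037ULL);
        uint64_t h2 = fnv1a(key, 0x9e3779b97f4a7c15ULL) | 1;
        return (h1 + static_cast<uint64_t>(i) * h2) % kSlots;
    }
public:
    void insert(const std::string& seed, uint8_t id) {
        for (int i = 0; i < kHashes; ++i) {
            uint8_t& slot = ids_[idx(seed, i)];
            if (slot == 0) slot = id;  // first writer wins in this toy version
        }
    }
    // Majority vote across hash positions; 0 means "no confident target".
    uint8_t classify(const std::string& seed) const {
        std::array<int, 256> votes{};
        for (int i = 0; i < kHashes; ++i) ++votes[ids_[idx(seed, i)]];
        int best = 0, bestVotes = 0;
        for (int id = 1; id < 256; ++id)
            if (votes[id] > bestVotes) { best = id; bestVotes = votes[id]; }
        return 2 * bestVotes > kHashes ? static_cast<uint8_t>(best) : 0;
    }
};

int main() {
    ToyMiBF mibf;
    mibf.insert("ACGTACGT", 1);  // spaced seed from reference 1
    mibf.insert("TTTTCCCC", 2);  // spaced seed from reference 2
    std::cout << int(mibf.classify("ACGTACGT")) << ' '
              << int(mibf.classify("GGGGGGGG")) << '\n';  // expect: 1 0
}
```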


2017
Author(s):
Anthony Bolger,
Alisandra Denton,
Marie Bolger,
Björn Usadel

Abstract Recent massive growth in the production of sequencing data necessitates matching improvements in bioinformatics tools to effectively utilize it. Existing tools suffer from limitations in both scalability and applicability which are inherent to their underlying algorithms and data structures. We identify the key requirements for the ideal data structure for sequence analyses: it should be informationally lossless, locally updatable, and memory efficient; requirements which are not met by the data structures underlying the major assembly strategies, Overlap Layout Consensus and De Bruijn Graphs. We therefore propose a new data structure, the LOGAN graph, which is based on a memory-efficient Sparse De Bruijn Graph with routing information. Innovations in storing routing information and careful implementation allow sequence datasets for Escherichia coli (4.6 Mbp, 117x coverage), Arabidopsis thaliana (135 Mbp, 17.5x coverage) and Solanum pennellii (1.2 Gbp, 47x coverage) to be loaded into memory on a desktop computer in seconds, minutes, and hours respectively. Memory consumption is competitive with state-of-the-art alternatives, while the reads are losslessly represented in an indexed and updatable form. Both second- and third-generation sequencing reads are supported. Thus, the LOGAN graph is positioned to be the backbone for major breakthroughs in sequence analysis such as integrated hybrid assembly, assembly of exceptionally large and repetitive genomes, and assembly and representation of pan-genomes.
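
A rough sketch of what a sparse de Bruijn graph with routing information can look like: only every s-th k-mer of a read becomes a node, and each node records which reads visit it and at what offset, which keeps the representation lossless and updatable. This is an assumed simplification for illustration only; the LOGAN graph's actual encoding of routing information is far more compact.

```cpp
// Sketch: sparse de Bruijn graph with routing information. Only every
// s-th k-mer is a node; each node remembers (read id, offset) visits,
// so the original reads remain recoverable from the index.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Visit { int read; size_t pos; };  // routing info: read id + offset

using SparseDBG = std::unordered_map<std::string, std::vector<Visit>>;

SparseDBG build(const std::vector<std::string>& reads, size_t k, size_t s) {
    SparseDBG g;
    for (int r = 0; r < static_cast<int>(reads.size()); ++r)
        for (size_t i = 0; i + k <= reads[r].size(); i += s)  // sparsity step s
            g[reads[r].substr(i, k)].push_back({r, i});
    return g;
}

int main() {
    auto g = build({"ACGTACGTTGCA", "CGTTGCAACGTA"}, 4, 2);
    for (const auto& [kmer, visits] : g) {
        std::cout << kmer << ':';
        for (auto v : visits) std::cout << " (read " << v.read << ", pos " << v.pos << ')';
        std::cout << '\n';
    }
}
```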


Author(s):  
Felix Kallenborn,
Andreas Hildebrandt,
Bertil Schmidt

Abstract Motivation Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. Results We present CARE—an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. Availability and implementation CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. Supplementary information Supplementary data are available at Bioinformatics online.
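
The minhashing step can be pictured with a generic sketch: a read's signature is the minimum hash of its k-mers under several differently seeded hash functions, and reads sharing signature entries become candidate neighbours for alignment. This illustrates the general technique CARE builds on, not CARE itself; the hash family (FNV with varied bases), k = 5 and 8 signature entries are placeholder choices.

```cpp
// Sketch of minhashing for read similarity search. The fraction of
// matching signature entries estimates the k-mer Jaccard similarity.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <string>
#include <vector>

static uint64_t fnv1a(const std::string& s, uint64_t h) {
    for (char c : s) { h ^= static_cast<unsigned char>(c); h *= 1099511628211ULL; }
    return h;
}

std::vector<uint64_t> minhash(const std::string& read, size_t k, int numHashes) {
    std::vector<uint64_t> sig(numHashes, std::numeric_limits<uint64_t>::max());
    for (size_t i = 0; i + k <= read.size(); ++i) {
        const std::string kmer = read.substr(i, k);
        for (int h = 0; h < numHashes; ++h)  // vary the basis to seed each function
            sig[h] = std::min(sig[h], fnv1a(kmer, 14695981039346656037ULL + 2 * h + 1));
    }
    return sig;
}

int main() {
    auto a = minhash("ACGTACGTTGCAACGT", 5, 8);
    auto b = minhash("ACGTACGATGCAACGT", 5, 8);  // one substitution
    int shared = 0;
    for (int h = 0; h < 8; ++h) shared += (a[h] == b[h]);
    std::cout << shared << "/8 signature entries match\n";
}
```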


2018
Author(s):
Justin Chu,
Hamid Mohamadi,
Emre Erhan,
Jeffery Tse,
Readman Chiu,
...  

Abstract Alignment-free classification of sequences against collections of sequences has enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines. Such indexes were originally hash-table based, and much work has been done to reduce the memory requirements of k-mer indexing through probabilistic strategies. These efforts have led to low-memory, highly efficient indexes, which nevertheless often lack sensitivity in the face of sequencing errors or polymorphism because they are k-mer based. To address this, we designed a new memory-efficient data structure, called a multi-index Bloom filter, that can tolerate mismatches using multiple spaced seeds. Implemented as part of BioBloom Tools, we demonstrate our algorithm in two applications: read binning for targeted assembly and taxonomic read assignment. Our tool shows higher sensitivity and specificity for read binning than BWA-MEM in an order of magnitude less time. For taxonomic classification, we show higher sensitivity than CLARK-S in an order of magnitude less time while using half the memory.


2021
Vol 13 (4)
pp. 559
Author(s):
Milto Miltiadou,
Neill D. F. Campbell,
Darren Cosker,
Michael G. Grant

In this paper, we investigate the performance of six data structures for managing voxelised full-waveform airborne LiDAR data during 3D polygonal model creation. While full-waveform LiDAR data have been available for over a decade, extraction of peak points remains the most widely used approach to interpreting them. The increased information stored within the waveform data makes interpretation and handling difficult; it is therefore important to research which data structures are most appropriate for storing and interpreting the data. The data structures are tested in terms of time efficiency and memory consumption at run time, and are the following: (1) a 1D-Array, which guarantees coherent memory allocation; (2) Voxel Hashing, which uses a hash table for storing the intensity values; (3) an Octree; (4) Integral Volumes, which allow finding the sum of any cuboid area in constant time; (5) Octree Max/Min, which is an upgraded octree; and (6) the Integral Octree, which is proposed here as an attempt to combine the benefits of octrees and Integral Volumes. We show that Integral Volumes is the most time-efficient data structure, but it requires the most memory allocation. Furthermore, 1D-Array and Integral Volumes require the allocation of coherent space in memory, including the empty voxels, while Voxel Hashing and the octree-related data structures do not allocate memory for empty voxels and, as shown in the tests conducted, therefore allocate less memory overall. To sum up, there is a need to investigate how LiDAR data are stored in memory; each tested data structure has different benefits and downsides, so each application should be examined individually.
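
To illustrate item (4), here is a minimal sketch of an integral volume, i.e. a 3D prefix-sum table: after a single build pass that stores the sum of all voxels below each index, any cuboid sum follows by inclusion-exclusion in constant time. The dense (X+1)×(Y+1)×(Z+1) table also makes the paper's memory caveat visible, since empty voxels are allocated too. Grid sizes and values here are arbitrary demo data.

```cpp
// Sketch of Integral Volumes: S[x][y][z] holds the sum of all voxels with
// indices below (x,y,z); any cuboid sum is then O(1) by inclusion-exclusion.
#include <iostream>
#include <vector>

using Grid = std::vector<std::vector<std::vector<double>>>;

Grid integralVolume(const Grid& v) {
    size_t X = v.size(), Y = v[0].size(), Z = v[0][0].size();
    Grid S(X + 1, std::vector<std::vector<double>>(Y + 1, std::vector<double>(Z + 1, 0.0)));
    for (size_t x = 1; x <= X; ++x)
        for (size_t y = 1; y <= Y; ++y)
            for (size_t z = 1; z <= Z; ++z)
                S[x][y][z] = v[x-1][y-1][z-1]
                           + S[x-1][y][z] + S[x][y-1][z] + S[x][y][z-1]
                           - S[x-1][y-1][z] - S[x-1][y][z-1] - S[x][y-1][z-1]
                           + S[x-1][y-1][z-1];
    return S;
}

// Sum of voxels in [x0,x1) x [y0,y1) x [z0,z1), constant time after the build.
double cuboidSum(const Grid& S, size_t x0, size_t y0, size_t z0,
                 size_t x1, size_t y1, size_t z1) {
    return S[x1][y1][z1] - S[x0][y1][z1] - S[x1][y0][z1] - S[x1][y1][z0]
         + S[x0][y0][z1] + S[x0][y1][z0] + S[x1][y0][z0] - S[x0][y0][z0];
}

int main() {
    Grid v(4, std::vector<std::vector<double>>(4, std::vector<double>(4, 1.0)));
    auto S = integralVolume(v);
    std::cout << cuboidSum(S, 1, 1, 1, 3, 3, 3) << '\n';  // 2*2*2 voxels => 8
}
```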


2018
Vol 18 (3-4)
pp. 470-483
Author(s):
Gregory J. Duck,
Joxan Jaffar,
Roland H. C. Yap

Abstract Malformed data structures can lead to runtime errors such as arbitrary memory access or corruption. Despite this, reasoning over data-structure properties for low-level heap-manipulating programs remains challenging. In this paper we present a constraint-based program analysis that checks data-structure integrity, with respect to given target data-structure properties, as the heap is manipulated by the program. Our approach is to automatically generate a solver for the properties using the type definitions from the target program. The generated solver is implemented using a Constraint Handling Rules (CHR) extension of built-in heap, integer and equality solvers. A key property of our program analysis is that the target data-structure properties are shape neutral, i.e., the analysis does not check for properties relating to a given data-structure graph shape, such as doubly-linked lists versus trees. Nevertheless, the analysis can detect errors in a wide range of data-structure manipulating programs, including those that use lists, trees, DAGs, graphs, etc. We present an implementation that uses the Satisfiability Modulo Constraint Handling Rules (SMCHR) system. Experimental results show that our approach works well for real-world C programs.
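
The flavour of a shape-neutral integrity property can be conveyed with a small runtime sketch: the check below does not care whether nodes form a list, tree or DAG, only that every reachable pointer targets a live node. This is purely illustrative and invented for this summary; the paper's analysis is constraint-based and generates its solver from type definitions rather than walking the heap at run time.

```cpp
// Toy shape-neutral integrity check: every reachable pointer must target
// a node registered at allocation time, regardless of graph shape.
#include <iostream>
#include <unordered_set>
#include <vector>

struct Node { int value; Node* left; Node* right; };

std::unordered_set<Node*> liveNodes;  // registry maintained at allocation sites

Node* makeNode(int v, Node* l = nullptr, Node* r = nullptr) {
    Node* n = new Node{v, l, r};
    liveNodes.insert(n);
    return n;
}

// Walk reachable nodes; flag any pointer that escapes the registry.
bool checkIntegrity(Node* root) {
    std::unordered_set<Node*> seen;
    std::vector<Node*> stack{root};
    while (!stack.empty()) {
        Node* n = stack.back();
        stack.pop_back();
        if (!n || seen.count(n)) continue;
        if (!liveNodes.count(n)) return false;  // dangling or foreign pointer
        seen.insert(n);
        stack.push_back(n->left);
        stack.push_back(n->right);
    }
    return true;
}

int main() {
    Node* root = makeNode(1, makeNode(2), makeNode(3));
    std::cout << checkIntegrity(root) << '\n';         // 1: structure intact
    root->left->left = reinterpret_cast<Node*>(0x42);  // corrupt a pointer
    std::cout << checkIntegrity(root) << '\n';         // 0: malformed heap
}
```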


Author(s):  
Sudeep Sarkar,
Dmitry Goldgof

There is a growing need for expertise both in image analysis and in software engineering. To date, these two areas have been taught separately in the undergraduate computer and information science curriculum. However, we have found that an introduction to image analysis can be easily integrated into data-structure courses without detracting from the original goal of teaching data structures. Some image processing tasks offer a natural way to introduce basic data structures such as arrays, queues, stacks, trees and hash tables. Not only does this integrated strategy expose students to image-related manipulations at an early stage of the curriculum, but it also imparts cohesiveness to the data-structure assignments and brings them closer to real life. In this paper we present a set of programming assignments that integrates undergraduate data-structure education with image processing tasks. These assignments can be incorporated into existing data-structure courses with low time and software overheads. We have used these assignment sets three times: once in a 10-week data-structure course at the University of California, Santa Barbara, and twice in 15-week courses at the University of South Florida, Tampa.
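
As an example of the kind of assignment this pairing naturally produces (a hypothetical one, not taken from the paper's assignment set), connected-component labelling treats the image as a 2D array while the region growing exercises a queue:

```cpp
// Connected-component labelling of a binary image via BFS: the image is a
// natural 2D array exercise, and the growing frontier is a queue.
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

using Image = std::vector<std::vector<int>>;

// Label 4-connected foreground regions; labels start at 2 to stay
// distinct from background (0) and unlabelled foreground (1).
int labelComponents(Image& img) {
    int rows = img.size(), cols = img[0].size(), nextLabel = 2;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            if (img[r][c] != 1) continue;
            std::queue<std::pair<int, int>> q;  // frontier of the region
            q.push({r, c});
            img[r][c] = nextLabel;
            while (!q.empty()) {
                auto [y, x] = q.front();
                q.pop();
                const int dy[] = {-1, 1, 0, 0}, dx[] = {0, 0, -1, 1};
                for (int d = 0; d < 4; ++d) {
                    int ny = y + dy[d], nx = x + dx[d];
                    if (ny >= 0 && ny < rows && nx >= 0 && nx < cols && img[ny][nx] == 1) {
                        img[ny][nx] = nextLabel;
                        q.push({ny, nx});
                    }
                }
            }
            ++nextLabel;
        }
    return nextLabel - 2;  // number of components found
}

int main() {
    Image img = {{1,1,0,0}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1}};
    std::cout << labelComponents(img) << " components\n";  // expect 3
}
```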

