spaced seeds
Recently Published Documents


Total documents: 66 (five years: 15)
H-index: 15 (five years: 1)

2021
Author(s): Valeriy Titarenko, Sofya Titarenko

Abstract Background: Technical progress in computational hardware allows researchers to use new approaches for sequence alignment problems. A standard procedure is usually based on pre-aligning short subsequences followed by proper comparison of neighbouring parts. For this purpose, index files are created that store all subsequences (or numbers associated with them) and their positions within a reference sequence. Index files designed on subsequences of 32–64 symbols for a human reference genome can now be easily stored without any compression even on a budget computer. The main goal now is to choose a combination of symbols (a spaced seed) that will tolerate various mismatches between reference and given sequences. An ideal spaced seed should allow us to find all such positions (full sensitivity). By increasing the seed’s weight by one we usually reduce the number of candidate positions fourfold. At the same time, longer seeds also reduce the number of signatures to be checked. Results: Several algorithms to assist seed generation are presented. The first one allows us to find all permitted spaced seeds iteratively. The results obtained with the algorithm show specific patterns of the seeds of the highest weight. Among the best seeds, there are periodic seeds with a simple relation between the period of a seed, its length and the length of a read. The second algorithm generates blocks for periodic seeds. A list of blocks is found for blocks of up to 50 symbols and up to 9 mismatches. The third algorithm uses those lists to find spaced seeds for reads of an arbitrary length. Conclusions: Lists of long high-weight spaced seeds are found and available in Supplementary Materials. The seeds are best in terms of weights compared to seeds from other papers and can usually be applied to shorter reads. Code for all algorithms is available at https://github.com/vtman/PerFSeeB.
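
The idea of full sensitivity described in this abstract can be illustrated with a brute-force check. This is a minimal sketch, not the iterative algorithm from the paper: the names `is_lossless` and `expected_random_hits` are hypothetical, seeds are written as strings of `1` (compared position) and `0` (ignored position), only substitutions are considered, and the fourfold estimate assumes uniformly random bases.

```python
from itertools import combinations


def is_lossless(seed: str, read_len: int, mismatches: int) -> bool:
    """Return True if, for every placement of `mismatches` substitutions in a
    read of length `read_len`, at least one shift of the seed avoids them all.

    Brute force over all mismatch placements: only practical for small inputs.
    """
    ones = [i for i, c in enumerate(seed) if c == "1"]
    # One bitmask per shift, marking the read positions the seed compares.
    masks = [
        sum(1 << (shift + i) for i in ones)
        for shift in range(read_len - len(seed) + 1)
    ]
    for positions in combinations(range(read_len), mismatches):
        errors = sum(1 << p for p in positions)
        # If every shift touches at least one mismatch, the read is missed.
        if all(mask & errors for mask in masks):
            return False
    return True


def expected_random_hits(reference_len: int, weight: int) -> float:
    """Rough expected number of chance hits for a seed of the given weight,
    assuming a 4-letter alphabet with independent, uniform bases."""
    return reference_len / 4 ** weight


# A contiguous seed of weight 3 tolerates 2 mismatches in reads of length 9
# but not in reads of length 8 (mismatches at positions 2 and 5 block it).
print(is_lossless("111", 9, 2))
print(is_lossless("111", 8, 2))
# Raising the weight by one divides the chance-hit estimate by four.
print(expected_random_hits(4 ** 5, 4) / expected_random_hits(4 ** 5, 5))
```

The exhaustive loop grows combinatorially with read length and mismatch count, which is one reason dedicated seed-generation algorithms such as those in the paper are needed for realistic read lengths.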


2021
Author(s): Valeriy Titarenko, Sofya Titarenko

Motivation: Technical progress in computer hardware has made it possible to access and process large amounts of data even on budget workstations. Therefore, new or existing alignment algorithms may use large index files to increase performance. Spaced seeds with large weights reduce the number of possible locations of a read within a reference sequence. Optimal patterns for spaced seeds may guarantee that reads are aligned even with several substitutions. Results: For reads of 64–200 bp, periodic spaced seeds of weight 32, 40, 48, 56 and 64 are found that guarantee to locate all positions within a reference sequence for a specified number of point mutations. SIMD instructions to convert masked reads into 64-, 80-, 96-, 112- and 128-bit numbers are provided. Availability: C code to generate spaced seeds and find optimal SIMD instructions for them is freely available under the MIT license at https://github.com/vtman/VSTseed.
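
The bit widths quoted in this abstract are twice the seed weights, which is consistent with storing each selected base in 2 bits. The following scalar sketch shows that conversion in plain Python; it is not the SIMD code from the repository, and the name `signature`, the `offset` parameter and the A/C/G/T to 0/1/2/3 encoding are assumptions made for illustration.

```python
# Assumed 2-bit encoding of nucleotides (illustrative, not taken from the paper).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def signature(read: str, seed: str, offset: int = 0) -> int:
    """Pack the bases of `read` selected by the '1' positions of `seed`
    (placed at `offset`) into an integer, 2 bits per base.

    Raises KeyError for symbols other than A, C, G, T and IndexError if the
    seed runs past the end of the read.
    """
    value = 0
    for i, flag in enumerate(seed):
        if flag == "1":
            value = (value << 2) | CODE[read[offset + i]]
    return value


# Seed 1011 keeps A, G, T from ACGT: bits 00 10 11.
print(signature("ACGT", "1011"))
# A seed of weight w never needs more than 2 * w bits.
print(signature("TTTT", "1111").bit_length())
```

Under this encoding a seed of weight 32 yields a value that fits in 64 bits, and a seed of weight 64 one that fits in 128 bits.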


2021
Author(s): Lucian Ilie, Silvana Ilie, Shima Khoshraftar, Anahita Mansouri Bigvand

Background: DNA oligonucleotides are a very useful tool in biology. The best algorithms for designing good DNA oligonucleotides filter out unsuitable regions using a seeding approach. Determining the quality of the seeds is crucial for the performance of these algorithms. Results: We present a sound framework for evaluating the quality of seeds for oligonucleotide design. The F-score is used to measure the accuracy of each seed. A number of natural candidates are tested: contiguous (BLAST-like), spaced, transitions-constrained, and multiple spaced seeds. Multiple spaced seeds are the best, with more seeds providing better accuracy. Single spaced and transition seeds are very close whereas, as expected, contiguous seeds come last. Increased accuracy comes at the price of reduced efficiency. An exception is that single spaced and transitions-constrained seeds are both more accurate and more efficient than contiguous ones. Conclusions: Our work confirms another application where multiple spaced seeds perform the best. It will be useful in improving the algorithms for oligonucleotide design.
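
For reference, the F-score mentioned in this abstract is commonly defined from precision and recall. The sketch below uses that standard definition; whether the paper uses the balanced form (beta = 1) and how it counts true and false positives for seeds are assumptions here, and the function name `f_score` is hypothetical.

```python
def f_score(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Standard F-beta score from true positives, false positives and
    false negatives. beta = 1 gives the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no correct predictions: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for one seed: 8 correct calls, 2 spurious, 2 missed.
print(f_score(tp=8, fp=2, fn=2))
```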


2021
Author(s): Silvana Ilie

Background: The most frequently used tools in bioinformatics are those searching for similarities, or local alignments, between biological sequences. Since the exact dynamic programming algorithm is quadratic, linear-time heuristics such as BLAST are used. Spaced seeds are much more sensitive than the consecutive seed of BLAST, and using several seeds represents the current state of the art in approximate search for biological sequences. The most important aspect is computing highly sensitive seeds. Since the problem seems hard, heuristic algorithms are used. The leading software in the common Bernoulli model is the SpEED program. Findings: SpEED uses a hill climbing method based on the overlap complexity heuristic. We propose a new algorithm for this heuristic that improves its speed by over one order of magnitude. We use the new implementation to compute improved seeds for several software programs. We also compute multiple seeds of the same weight as MegaBLAST that greatly improve its sensitivity. Conclusion: Multiple spaced seeds are being successfully used in bioinformatics software programs. Enabling researchers to compute high-quality seeds very quickly will help expand the range of their applications.
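
To make the overlap complexity heuristic concrete, here is a naive sketch. It assumes the commonly cited definition: for each relative shift of two seeds, count the aligned pairs of `1` positions (sigma) and add 2 to the power sigma, then sum over all pairs of seeds in the set, including each seed with itself. That definition, and the function names, are assumptions; this is the direct quadratic computation, not the faster algorithm proposed in the paper.

```python
def pair_overlap_complexity(s1: str, s2: str) -> int:
    """Sum of 2**sigma over all shifts of s2 against s1, where sigma is the
    number of '1' positions of s2 that land on '1' positions of s1."""
    ones1 = {i for i, c in enumerate(s1) if c == "1"}
    ones2 = [j for j, c in enumerate(s2) if c == "1"]
    total = 0
    # Shifts range over every placement with at least one overlapping column.
    for shift in range(1 - len(s2), len(s1)):
        sigma = sum(1 for j in ones2 if j + shift in ones1)
        total += 2 ** sigma
    return total


def overlap_complexity(seeds: list[str]) -> int:
    """Overlap complexity of a seed set: sum over all pairs (i <= j)."""
    return sum(
        pair_overlap_complexity(seeds[i], seeds[j])
        for i in range(len(seeds))
        for j in range(i, len(seeds))
    )


# A hill climbing search would repeatedly modify seeds to lower this value.
print(overlap_complexity(["1101", "10011"]))
```

Lower overlap complexity is used as an inexpensive proxy for higher sensitivity, so a search can compare candidate seed sets without computing sensitivity exactly.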


2021, Vol 22 (1)
Author(s): Angana Chakraborty, Burkhard Morgenstern, Sanghamitra Bandyopadhyay

Abstract Background: The advancement of SMRT technology has opened new opportunities for genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. Results: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as a SAM file, if it is required for any downstream processing. Conclusions: S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced context is especially suitable for extracting distant similarities. The variable-length spaced seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered as a prominent direction towards alignment-free sequence analysis.


Author(s): Arnab Mallik, Lucian Ilie

Abstract Motivation: Sequence similarity is the most frequently used procedure in biological research, as proved by the widely used BLAST program. The consecutive seed used by BLAST can be dramatically improved by considering multiple spaced seeds. Finding the best seeds is a hard problem and much effort went into developing heuristic algorithms and software for designing highly sensitive spaced seeds. Results: We introduce a new algorithm and software, ALeS, that produces more sensitive seeds than the current state-of-the-art programs, as shown by extensive testing. We also accurately estimate the sensitivity of a seed, enabling its computation for arbitrary seeds. Availability: The source code is freely available at github.com/lucian-ilie/ALeS. Supplementary information: Supplementary data are available at Bioinformatics online.

