scholarly journals Gene Sequence Assembly Algorithm Model Based on the DBG Strategy and Its Application

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Haihe Shi ◽  
Gang Wu

With the continuous development of sequencing technology, the amount of bioinformatics data has increased geometrically, and the massive amount of bioinformatics data puts forward more stringent requirements for sequence assembly problems. The sequence assembly algorithm based on DBG (De Bruijn graph) strategy is a key algorithm in bioinformatics, which is widely used in the domain of gene sequence assembly. Current research on the domain of sequence assembly always focuses on optimization of specific steps to a specific algorithm and lack of research on domain-level high-abstract algorithm frameworks. To some extent, it leads to the redundancy of the sequence assembly algorithm, and some problems may be caused by the artificial selection algorithm. This paper analyzes the domain of DBGSA and establishes a feature model of this domain. Based on the production programming method, the DBGSA algorithm component is interactively designed. With the support of the PAR platform, the DBGSA algorithm component library is formally implemented, and furthermore, the DBGSA component library is used to assemble the specific algorithm. This research adds domain-level research to the domain of sequence assembly and implements the DBGSA component library, which can assemble specific sequence assembly algorithms, ensuring the efficiency of algorithm development and the reliability of assembly generation algorithms. At the same time, it also provides a valuable reference for solving problems in the domain of sequence assembly.

2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Kanak Mahadik ◽  
Christopher Wright ◽  
Milind Kulkarni ◽  
Saurabh Bagchi ◽  
Somali Chaterji

Abstract Remarkable advancements in high-throughput gene sequencing technologies have led to an exponential growth in the number of sequenced genomes. However, unavailability of highly parallel and scalable de novo assembly algorithms have hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBG). However, this process of sequentially iterating from small to large k-values slows down the process of assembly. In this paper, we propose ScalaDBG, which metamorphoses this sequential process, building DBGs for each distinct k-value in parallel. We develop an innovative mechanism to “patch” a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism, by both scaling up on all cores of a node, and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG completes assembling the genome faster than IDBA-UD, but with similar accuracy on a variety of datasets (6.8X faster for one of the most complex genome in our dataset).


Computer ◽  
2002 ◽  
Vol 35 (7) ◽  
pp. 47-54 ◽  
Author(s):  
M. Pop ◽  
S.L. Salzberg ◽  
M. Shumway

2011 ◽  
Vol 11 (1) ◽  
pp. 25-37 ◽  
Author(s):  
Z. Li ◽  
Y. Chen ◽  
D. Mu ◽  
J. Yuan ◽  
Y. Shi ◽  
...  

Author(s):  
Gaifang Dong ◽  
Xueliang Fu ◽  
Honghui Li ◽  
Xu Pan

Sequence Assembly is one of the important topics in bioinformatics research. Sequence assembly algorithm has always met the problems of poor assembling precision and low efficiency. In view of these two problems, this paper designs and implements a precise assembling algorithm under the strategy of finding the source of reads based on the MapReduce (SA-BR-MR) and Eulerian path algorithm. Computational results show that SA-BR-MR is more accurate than other algorithms. At the same time, SA-BR-MR calculates 54 sequences which are randomly selected from animals, plants and microorganisms with base lengths from hundreds to tens of thousands from NCBI. All matching rates of the 54 sequences are 100%. For each species, the algorithm summarizes the range of [Formula: see text] which makes the matching rates to be 100%. In order to verify the range of [Formula: see text] value of hepatitis C virus (HCV) and related variants, the randomly selected eight HCV variants are calculated. The results verify the correctness of [Formula: see text] range of hepatitis C and related variants from NCBI. The experiment results provide the basis for sequencing of other variants of the HCV. In addition, Spark platform is a new computing platform based on memory computation, which is featured by high efficiency and suitable for iterative calculation. Therefore, this paper designs and implements sequence assembling algorithm based on the Spark platform under the strategy of finding the source of reads (SA-BR-Spark). In comparison with SA-BR-MR, SA-BR-Spark shows a superior computational speed.


Genomics ◽  
1993 ◽  
Vol 16 (1) ◽  
pp. 286-288 ◽  
Author(s):  
Michael L. Engle ◽  
Christian Burks

2012 ◽  
Vol 04 (02) ◽  
pp. 1250019 ◽  
Author(s):  
VAMSI KUNDETI ◽  
SANGUTHEVAR RAJASEKARAN ◽  
HEIU DINH

Sequence assembly from short reads is an important problem in biology. It is known that solving the sequence assembly problem exactly on a bi-directed de Bruijn graph or a string graph is intractable. However, finding a shortest double stranded DNA string (SDDNA) containing all the k-long words in the reads seems to be a good heuristic to get close to the original genome. This problem is equivalent to finding a cyclic Chinese Postman (CP) walk on the underlying unweighted bi-directed de Bruijn graph built from the reads. The Chinese Postman walk Problem (CPP) is solved by reducing it to a general bi-directed flow on this graph which runs in O(|E|2 log 2(|V|)) time. In this paper we show that the cyclic CPP on bi-directed graphs can be solved without reducing it to bi-directed flow. We present a Θ(p(|V| + |E|) log (|V|) + (d max p)3) time algorithm to solve the cyclic CPP on a weighted bi-directed de Bruijn graph, where p = max {|{v|d in (v) - d out (v) > 0}|, |{v|d in (v) - d out (v) < 0}|} and d max = max {|d in (v) - d out (v)}. Our algorithm performs asymptotically better than the bi-directed flow algorithm when the number of imbalanced nodes p is much less than the nodes in the bi-directed graph. From our experimental results on various datasets, we have noticed that the value of p/|V| lies between 0.08% and 0.13% with 95% probability. Many practical bi-directed de Bruijn graphs do not have cyclic CP walks. In such cases it is not clear how the bi-directed flow can be useful in identifying contigs. Our algorithm can handle such situations and identify maximal bi-directed sub-graphs that have CP walks. A Θ(p(|V| + |E|)) time heuristic algorithm based on these ideas has been implemented for the SDDNA problem. This algorithm was tested on short reads from a plant genome and achieves an approximation ratio of at most 1.0134. We also present a Θ((|V| + |E|) log (V)) time algorithm for the single source shortest path problem on bi-directed de Bruijn graphs, which may be of independent interest.


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Xiaoqian Jing ◽  
Haihe Shi

Unsigned reverse genome rearrangement is an important part of bioinformatics research, which is widely used in biological similarity and homology analysis, revealing biological inheritance, variation, and evolution. Branch and bound, simulated annealing, and other algorithms in unsigned reverse genome rearrangement algorithm are rare in practical application because of their huge time and space consumption, and greedy algorithms are mostly used at present. By deeply analyzing the domain of unsigned reverse genome rearrangement algorithm based on greedy strategy (unsigned reverse genome rearrangement algorithm (URGRA) based on greedy strategy), the domain features are modeled, and the URGRA algorithm components are interactively designed according to the production programming method. With the support of the PAR platform, the algorithm component library of the URGRA is formally realized, and the concrete algorithm is generated by assembly, which improves the reliability of the assembly algorithm.


Sign in / Sign up

Export Citation Format

Share Document