Gene Sequence Assembly Algorithm Model Based on the DBG Strategy and Its Application

With the continuous development of sequencing technology, the volume of bioinformatics data has grown geometrically, and this growth places more stringent requirements on sequence assembly. The sequence assembly algorithm based on the De Bruijn graph (DBG) strategy (DBGSA) is a key algorithm in bioinformatics and is widely used in gene sequence assembly. Current research on sequence assembly mostly focuses on optimizing specific steps of a specific algorithm and lacks work on highly abstract, domain-level algorithm frameworks. To some extent, this leads to redundancy among sequence assembly algorithms, and manual algorithm selection may introduce further problems. This paper analyzes the DBGSA domain and establishes a feature model of the domain. Based on the production programming method, the DBGSA algorithm components are interactively designed. With the support of the PAR platform, the DBGSA algorithm component library is formally implemented, and the component library is then used to assemble specific algorithms. This research adds domain-level work to the field of sequence assembly and implements a DBGSA component library that can assemble specific sequence assembly algorithms, which supports both the efficiency of algorithm development and the reliability of the assembled algorithms. It also provides a valuable reference for solving other problems in the domain of sequence assembly.

De Bruijn Graph-Based Whole-Genomic Sequence Assembly Algorithms and Applications

2013 IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing ◽

10.1109/greencom-ithings-cpscom.2013.393 ◽

2013 ◽

Author(s):

Xiaojun Kang ◽

Shanyu Tang ◽

Yongge Ma ◽

Ruixiang Liu ◽

Yaping Wang

Keyword(s):

Genomic Sequence ◽

Sequence Assembly ◽

De Bruijn Graph ◽

De Bruijn ◽

Assembly Algorithms ◽

Genomic Sequence Assembly

Scalable Genome Assembly through Parallel de Bruijn Graph Construction for Multiple k-mers

Scientific Reports ◽

10.1038/s41598-019-51284-9 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Kanak Mahadik ◽

Christopher Wright ◽

Milind Kulkarni ◽

Saurabh Bagchi ◽

Somali Chaterji

Keyword(s):

De Novo ◽

De Bruijn Graph ◽

High Quality ◽

De Bruijn Graphs ◽

Sequencing Technologies ◽

De Bruijn ◽

Similar Accuracy ◽

Valued Graph ◽

Assembly Algorithms ◽

Level Parallelism

Remarkable advancements in high-throughput gene sequencing technologies have led to exponential growth in the number of sequenced genomes. However, the unavailability of highly parallel and scalable de novo assembly algorithms has hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBGs). However, sequentially iterating from small to large k-values slows down assembly. In this paper, we propose ScalaDBG, which transforms this sequential process by building DBGs for each distinct k-value in parallel. We develop an innovative mechanism to “patch” a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism by both scaling up on all cores of a node and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG assembles genomes faster than IDBA-UD with similar accuracy on a variety of datasets (6.8X faster for one of the most complex genomes in our dataset).
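
The idea of building one de Bruijn graph per k-value independently can be illustrated with a minimal Python sketch. This is not ScalaDBG's implementation: the function names `build_dbg` and `build_dbgs_for_ks` are hypothetical, the graph is a plain dictionary, the "patching" step is omitted, and a thread pool is used only to show that the per-k builds do not depend on each other (it is not a claim about speedup).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_dbg(reads, k):
    """Return a de Bruijn graph as {(k-1)-mer: set of successor (k-1)-mers}."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


def build_dbgs_for_ks(reads, ks):
    """Build one graph per k value; each build is independent of the others."""
    with ThreadPoolExecutor() as pool:
        graphs = pool.map(lambda k: build_dbg(reads, k), ks)
        return dict(zip(ks, graphs))


# Toy input (hypothetical reads, chosen only for illustration).
reads = ["ACGTAC", "GTACGA"]
graphs = build_dbgs_for_ks(reads, [3, 4])
```

Here `graphs[3]` and `graphs[4]` hold the graphs for k = 3 and k = 4. Smaller k gives a more connected graph with more branching; larger k resolves more repeats but needs longer overlaps, which is why assemblers of this kind combine several k-values.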

Genome sequence assembly: algorithms and issues

Computer ◽

10.1109/mc.2002.1016901 ◽

2002 ◽

Vol 35 (7) ◽

pp. 47-54 ◽

Cited By ~ 41

Author(s):

M. Pop ◽

S.L. Salzberg ◽

M. Shumway

Keyword(s):

Genome Sequence ◽

Sequence Assembly ◽

Genome Sequence Assembly ◽

Assembly Algorithms

An accurate DNA sequence assembly algorithm based on MapReduce

Journal of Computational Methods in Sciences and Engineering ◽

10.3233/jcm-160635 ◽

2016 ◽

Vol 16 (3) ◽

pp. 519-526 ◽

Cited By ~ 1

Author(s):

Gaifang Dong ◽

Xueliang Fu ◽

Honghui Li

Keyword(s):

Dna Sequence ◽

Sequence Assembly ◽

Assembly Algorithm ◽

Dna Sequence Assembly

Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph

Briefings in Functional Genomics ◽

10.1093/bfgp/elr035 ◽

2011 ◽

Vol 11 (1) ◽

pp. 25-37 ◽

Cited By ~ 111

Author(s):

Z. Li ◽

Y. Chen ◽

D. Mu ◽

J. Yuan ◽

Y. Shi ◽

...

Keyword(s):

De Bruijn Graph ◽

De Bruijn ◽

Assembly Algorithms

An Accurate Sequence Assembly Algorithm for Livestock, Plants and Microorganism Based on Spark

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001417500240 ◽

2017 ◽

Vol 31 (08) ◽

pp. 1750024

Author(s):

Gaifang Dong ◽

Xueliang Fu ◽

Honghui Li ◽

Xu Pan

Keyword(s):

Hepatitis C Virus ◽

Hepatitis C ◽

High Efficiency ◽

Sequence Assembly ◽

Computational Results ◽

Assembly Algorithm ◽

Computing Platform ◽

Iterative Calculation ◽

Computational Speed ◽

Low Efficiency

Sequence assembly is an important topic in bioinformatics research. Sequence assembly algorithms have long suffered from poor assembly precision and low efficiency. To address these two problems, this paper designs and implements a precise assembly algorithm that uses the strategy of finding the source of reads, based on MapReduce and the Eulerian path algorithm (SA-BR-MR). Computational results show that SA-BR-MR is more accurate than other algorithms. SA-BR-MR is also applied to 54 sequences randomly selected from NCBI, drawn from animals, plants and microorganisms, with lengths ranging from hundreds to tens of thousands of bases. The matching rate is 100% for all 54 sequences. For each species, the algorithm summarizes the range of [Formula: see text] for which the matching rate is 100%. To verify the range of [Formula: see text] for hepatitis C virus (HCV) and related variants, eight randomly selected HCV variants are assembled. The results confirm the [Formula: see text] range for hepatitis C and related variants from NCBI, and provide a basis for sequencing other HCV variants. In addition, Spark is a newer, memory-based computing platform that offers high efficiency and is well suited to iterative calculation. This paper therefore also designs and implements a sequence assembly algorithm on the Spark platform using the same strategy of finding the source of reads (SA-BR-Spark). Compared with SA-BR-MR, SA-BR-Spark achieves superior computational speed.
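
The Eulerian path step mentioned in the abstract can be sketched on a single machine as follows. This is a minimal illustration using Hierholzer's algorithm, not the SA-BR-MR or SA-BR-Spark implementation; the function name `eulerian_path_assembly` is hypothetical, and the sketch assumes the k-mers are error-free and that an Eulerian path exists.

```python
from collections import defaultdict


def eulerian_path_assembly(kmers):
    """Spell a sequence by walking every k-mer (edge) exactly once."""
    graph = defaultdict(list)   # prefix -> list of suffixes (multigraph)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # Start at the node with one extra outgoing edge, if any; else anywhere.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: emit node
    path.reverse()

    # First node in full, then the last character of each following node.
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy input (hypothetical 3-mers, chosen only for illustration).
kmers = ["ACG", "CGT", "GTT", "TTA"]
sequence = eulerian_path_assembly(kmers)
```

Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix, so a walk that uses every edge once spells a sequence containing every k-mer. When repeats create several valid Eulerian paths, the result is not unique, which is one reason the choice of k matters for the matching rate.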

Artificially Generated Data Sets for Testing DNA Sequence Assembly Algorithms

Genomics ◽

10.1006/geno.1993.1180 ◽

1993 ◽

Vol 16 (1) ◽

pp. 286-288 ◽

Cited By ~ 35

Author(s):

Michael L. Engle ◽

Christian Burks

Keyword(s):

Dna Sequence ◽

Sequence Assembly ◽

Data Sets ◽

Dna Sequence Assembly ◽

Assembly Algorithms

An Experimentally Derived Data Set Constructed for Testing Large-Scale DNA Sequence Assembly Algorithms

Genomics ◽

10.1006/geno.1993.1123 ◽

1993 ◽

Vol 15 (3) ◽

pp. 673-676 ◽

Cited By ~ 17

Author(s):

Donald Seto ◽

Ben F. Koop ◽

Leroy Hood

Keyword(s):

Dna Sequence ◽

Large Scale ◽

Sequence Assembly ◽

Data Set ◽

Dna Sequence Assembly ◽

Assembly Algorithms ◽

Derived Data

AN EFFICIENT ALGORITHM FOR CHINESE POSTMAN WALK ON BI-DIRECTED DE BRUIJN GRAPHS

Discrete Mathematics Algorithms and Applications ◽

10.1142/s179383091250019x ◽

2012 ◽

Vol 04 (02) ◽

pp. 1250019 ◽

Cited By ~ 1

Author(s):

VAMSI KUNDETI ◽

SANGUTHEVAR RAJASEKARAN ◽

HEIU DINH

Keyword(s):

Time Algorithm ◽

Sequence Assembly ◽

Plant Genome ◽

De Bruijn Graph ◽

Directed Flow ◽

De Bruijn Graphs ◽

Short Reads ◽

String Graph ◽

Chinese Postman ◽

De Bruijn

Sequence assembly from short reads is an important problem in biology. It is known that solving the sequence assembly problem exactly on a bi-directed de Bruijn graph or a string graph is intractable. However, finding a shortest double stranded DNA string (SDDNA) containing all the k-long words in the reads seems to be a good heuristic to get close to the original genome. This problem is equivalent to finding a cyclic Chinese Postman (CP) walk on the underlying unweighted bi-directed de Bruijn graph built from the reads. The Chinese Postman walk Problem (CPP) is solved by reducing it to a general bi-directed flow on this graph which runs in $O(|E|^2 \log^2(|V|))$ time. In this paper we show that the cyclic CPP on bi-directed graphs can be solved without reducing it to bi-directed flow. We present a $\Theta(p(|V| + |E|) \log(|V|) + (d_{\max} p)^3)$ time algorithm to solve the cyclic CPP on a weighted bi-directed de Bruijn graph, where $p = \max\{|\{v \mid d_{in}(v) - d_{out}(v) > 0\}|, |\{v \mid d_{in}(v) - d_{out}(v) < 0\}|\}$ and $d_{\max} = \max\{|d_{in}(v) - d_{out}(v)|\}$. Our algorithm performs asymptotically better than the bi-directed flow algorithm when the number of imbalanced nodes $p$ is much less than the number of nodes in the bi-directed graph. From our experimental results on various datasets, we have noticed that the value of $p/|V|$ lies between 0.08% and 0.13% with 95% probability. Many practical bi-directed de Bruijn graphs do not have cyclic CP walks. In such cases it is not clear how the bi-directed flow can be useful in identifying contigs. Our algorithm can handle such situations and identify maximal bi-directed sub-graphs that have CP walks. A $\Theta(p(|V| + |E|))$ time heuristic algorithm based on these ideas has been implemented for the SDDNA problem. This algorithm was tested on short reads from a plant genome and achieves an approximation ratio of at most 1.0134. We also present a $\Theta((|V| + |E|) \log(|V|))$ time algorithm for the single source shortest path problem on bi-directed de Bruijn graphs, which may be of independent interest.
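
The two quantities that drive the stated running time, the number of imbalanced nodes p and the largest imbalance d_max, can be computed directly from an edge list. The sketch below is a simplification: it uses an ordinary directed multigraph rather than the bi-directed de Bruijn graph of the paper, and the function name `imbalance_stats` is hypothetical.

```python
from collections import Counter


def imbalance_stats(edges):
    """Return (p, d_max) for a directed multigraph given as (u, v) pairs."""
    diff = Counter()  # d_in(v) - d_out(v) per node
    for u, v in edges:
        diff[u] -= 1  # u gains an outgoing edge
        diff[v] += 1  # v gains an incoming edge
    surplus = sum(1 for d in diff.values() if d > 0)  # more in than out
    deficit = sum(1 for d in diff.values() if d < 0)  # more out than in
    p = max(surplus, deficit)
    d_max = max((abs(d) for d in diff.values()), default=0)
    return p, d_max


# Toy input (hypothetical edges, chosen only for illustration).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
p, d_max = imbalance_stats(edges)
```

In this toy graph, C has three incoming edges and one outgoing edge, while A and D each have one more outgoing than incoming edge, so p = 2 and d_max = 2. A graph with p = 0 is balanced and already admits a closed walk over all edges without duplicating any of them, provided it is connected.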

A New Implementation of Genome Rearrangement Problem

Journal of Healthcare Engineering ◽

10.1155/2021/6692775 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Xiaoqian Jing ◽

Haihe Shi

Keyword(s):

Simulated Annealing ◽

Branch And Bound ◽

Genome Rearrangement ◽

Greedy Algorithms ◽

Practical Application ◽

Component Library ◽

Greedy Strategy ◽

Assembly Algorithm ◽

Rearrangement Algorithm ◽

Biological Similarity

Unsigned reverse genome rearrangement is an important part of bioinformatics research. It is widely used in biological similarity and homology analysis and helps reveal biological inheritance, variation, and evolution. Algorithms such as branch and bound and simulated annealing are rarely used for unsigned reverse genome rearrangement in practice because of their large time and space costs, so greedy algorithms are mostly used at present. This paper analyzes in depth the domain of the unsigned reverse genome rearrangement algorithm (URGRA) based on the greedy strategy, models the domain features, and interactively designs the URGRA algorithm components according to the production programming method. With the support of the PAR platform, the URGRA algorithm component library is formally implemented, and concrete algorithms are generated by assembling components, which improves the reliability of the assembled algorithms.
