String Graph: Latest Research Papers

Technologies for next-generation sequencing (NGS) have stimulated an exponential rise in high-throughput sequencing projects and resulted in the development of new read-assembly algorithms. A drastic reduction in the costs of generating short reads on the genomes of new organisms is attributable to recent advances in NGS technologies such as Ion Torrent, Illumina, and PacBio. Genome research has led to the creation of high-quality reference genomes for several organisms, and de novo assembly is a key initiative that has facilitated gene discovery and other studies. More powerful analytical algorithms are needed to handle the increasing amount of sequence data. We make a thorough comparison of the de novo assembly algorithms so that new users can clearly understand them: overlap-layout-consensus, de Bruijn graph, string graph, and hybrid approaches. We also address the computational efficiency of each algorithm, the challenges faced by the assembly tools used, and the impact of repeats. Our results compare the relative performance of the different assemblers and other related assembly differences with and without the reference genome. We hope that this analysis will further the application of de novo sequencing and help the future growth of assembly algorithms.

Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) ◽

10.1109/ipdps49936.2021.00060 ◽

2021 ◽

Author(s):

Giulia Guidi ◽

Oguz Selvitopi ◽

Marquita Ellis ◽

Leonid Oliker ◽

Katherine Yelick ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

De Novo Genome Assembly ◽

Transitive Reduction ◽

String Graph

A Sharp Threshold Phenomenon in String Graphs

Discrete & Computational Geometry ◽

10.1007/s00454-021-00279-3 ◽

2021 ◽

Author(s):

István Tomon

Assembling Long Accurate Reads Using de Bruijn Graphs

10.1101/2020.12.10.420448 ◽

2020 ◽

Author(s):

Anton Bankevich ◽

Andrey Bzikadze ◽

Mikhail Kolmogorov ◽

Pavel A. Pevzner

Keyword(s):

Human Genome ◽

De Bruijn Graph ◽

High Fidelity ◽

High Quality ◽

De Bruijn Graphs ◽

String Graph ◽

De Bruijn ◽

Genome Assembler ◽

Large Genomes

Abstract:

Although the de Bruijn graphs represent the basis of many genome assemblers, it remains unclear how to construct these graphs for large genomes and large k-mer sizes. This algorithmic challenge has become particularly important with the emergence of long and accurate high-fidelity (HiFi) reads that were recently utilized to generate a semi-manual telomere-to-telomere assembly of the human genome using the alternative string graph assembly approach. To enable fully automated high-quality HiFi assemblies of various genomes, we developed an efficient jumboDB algorithm for constructing the de Bruijn graph for large genomes and large k-mer sizes and the LJA genome assembler that error-corrects HiFi reads and uses jumboDB to construct the de Bruijn graph on the error-corrected reads. Since the de Bruijn graph constructed for a fixed k-mer size is typically either too tangled or too fragmented, LJA uses a new concept of a multiplex de Bruijn graph with varying k-mer sizes. We demonstrate that LJA produces contiguous assemblies of complex repetitive regions in genomes including automated assemblies of various highly-repetitive human centromeres.
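
The trade-off described in this abstract (a fixed k gives a graph that is either too tangled or too fragmented) can be illustrated with a toy example. The sketch below is a plain textbook de Bruijn construction, not jumboDB or the multiplex graph used by LJA; the function names and the two sample reads are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Return the set of edges (prefix (k-1)-mer, suffix (k-1)-mer), one per distinct k-mer."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges


def branching_nodes(edges):
    """Nodes with more than one successor: a rough proxy for a tangled graph."""
    successors = defaultdict(set)
    for u, v in edges:
        successors[u].add(v)
    return sorted(u for u, vs in successors.items() if len(vs) > 1)


def component_count(edges):
    """Number of weakly connected components: a rough proxy for fragmentation."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, count = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return count


# Two hypothetical reads from the toy sequence ACGTTCGTA, overlapping by 3 bases.
reads = ["ACGTTC", "TTCGTA"]
for k in (3, 5):
    edges = de_bruijn_edges(reads, k)
    print(k, branching_nodes(edges), component_count(edges))
```

With k = 3 the repeated CGT creates a branch at node GT while the graph stays connected; with k = 5 the branch disappears, but the reads no longer share a full 5-mer, so the graph splits into two components.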

Clover: a clustering-oriented de novo assembler for Illumina sequences

BMC Bioinformatics ◽

10.1186/s12859-020-03788-9 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Ming-Feng Hsieh ◽

Chin Lung Lu ◽

Chuan Yi Tang

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Low Cost ◽

De Bruijn Graph ◽

Illumina Platform ◽

Sequencing Errors ◽

Sequencing Technologies ◽

String Graph ◽

Clustering Approach ◽

De Bruijn

Abstract:

Background: Next-generation sequencing technologies revolutionized genomics by producing high-throughput reads at low cost, and this progress has prompted the recent development of de novo assemblers. Multiple assembly methods based on the de Bruijn graph have been shown to be efficient for Illumina reads. However, the sequencing errors generated by the sequencer complicate de novo assembly and affect the quality of downstream genomic research.

Results: In this paper, we develop a de Bruijn assembler, called Clover (clustering-oriented de novo assembler), that utilizes a novel k-mer clustering approach from the overlap-layout-consensus concept to deal with the sequencing errors generated by the Illumina platform. We further evaluate Clover’s performance against several de Bruijn graph assemblers (ABySS, SOAPdenovo, SPAdes and Velvet), overlap-layout-consensus assemblers (Bambus2, CABOG and MSR-CA) and a string graph assembler (SGA) on three datasets (Staphylococcus aureus, Rhodobacter sphaeroides and human chromosome 14). The results show that Clover achieves superior assembly quality in terms of corrected N50 and E-size while remaining competitive in run time with every assembler except SOAPdenovo. In addition, Clover was involved in the sequencing projects of bacterial genomes Acinetobacter baumannii TYTH-1 and Morganella morganii KT.

Conclusions: The novel clustering-based approach of Clover, which integrates the flexibility of the overlap-layout-consensus approach and the efficiency of the de Bruijn graph method, has high potential for de novo assembly. Clover is freely available as open source software from https://oz.nthu.edu.tw/~d9562563/src.html.
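
For readers unfamiliar with the N50 metric named in the results, a minimal sketch of the standard definition follows: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. This is plain N50 only; the corrected N50 reported for Clover additionally breaks contigs at misassemblies before computing the statistic, which is not modelled here. The function name and the sample lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


print(n50([80, 70, 50, 40, 30, 20]))  # 80 + 70 = 150 covers half of 290, so N50 is 70
```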

SOF: An Efficient String Graph Construction Algorithm

2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm47256.2019.8983393 ◽

2019 ◽

Author(s):

S. M. Iqbal Morshed ◽

Shibu Yooseph

Keyword(s):

Construction Algorithm ◽

String Graph

Using Apache Spark on genome assembly for scalable overlap-graph reduction

Human Genomics ◽

10.1186/s40246-019-0227-1 ◽

2019 ◽

Vol 13 (S1) ◽

Cited By ~ 1

Author(s):

Alexander J. Paul ◽

Dylan Lawrence ◽

Myoungkyu Song ◽

Seung-Hwan Lim ◽

Chongle Pan ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Time Frame ◽

Apache Spark ◽

Reference Sequence ◽

Graph Reduction ◽

De Novo Genome Assembly ◽

String Graph ◽

Edge Graph ◽

Generation Sequencing

Abstract:

Background: De novo genome assembly is a technique that builds the genome of a specimen using overlaps of genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds by their overlaps. The quality of a de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly of species, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been proposed. However, these techniques require large amounts of computer memory when very large overlap graphs are resolved, and the computation is challenging to parallelize.

Results: To address these limitations, we propose an innovative algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction using Apache Spark. SORA’s implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently reduces the number of edges along paths in enormous graphs by using the scalable features of the graph processing libraries provided by Apache Spark, GraphX and GraphFrames.

Conclusions: We shared the algorithms and the experimental results at our project website, https://github.com/BioHPC/SORA. We evaluated SORA with human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed mid-to-small size graphs on a single workstation within a short time frame. Overall, SORA scaled linearly in simulations as the number of computing instances increased.
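
One of the reduction steps commonly applied to overlap graphs is transitive edge reduction: an edge u -> w is redundant when some read v gives u -> v and v -> w. The sketch below is a single-machine illustration in plain Python, not SORA's Spark implementation. It assumes an acyclic graph stored as a dictionary of successor sets, and it ignores overlap lengths and read orientation, which a real string graph reduction has to check. The function name and the sample graph are hypothetical.

```python
def reduce_transitive_edges(graph):
    """Return a copy of graph (node -> set of successors) without transitive edges.

    An edge u -> w is dropped when there is a v with u -> v and v -> w.
    Decisions are made against the original graph, so the result does not
    depend on iteration order.
    """
    reduced = {u: set(successors) for u, successors in graph.items()}
    for u, successors in graph.items():
        for v in successors:
            if v == u:
                continue  # ignore self-loops
            for w in graph.get(v, ()):
                if w != v and w in successors:
                    reduced[u].discard(w)
    return reduced


# Hypothetical overlap graph for four reads laid out in the order A, B, C, D.
overlaps = {
    "A": {"B", "C", "D"},
    "B": {"C", "D"},
    "C": {"D"},
    "D": set(),
}
print(reduce_transitive_edges(overlaps))
```

In this example only the chain A -> B -> C -> D survives, which is the kind of edge-count reduction the abstract refers to.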

Accelerating Sequence Alignment to Graphs

10.1101/651638 ◽

2019 ◽

Cited By ~ 6

Author(s):

Chirag Jain ◽

Alexander Dilthey ◽

Sanchit Misra ◽

Haowen Zhang ◽

Srinivas Aluru

Keyword(s):

Dna Sequences ◽

Query Sequence ◽

Dynamic Programming Algorithm ◽

Reference Sequence ◽

Programming Algorithm ◽

Peak Performance ◽

Task Parallelism ◽

Sequencing Data ◽

Link Type ◽

String Graph

Abstract:

Aligning DNA sequences to an annotated reference is a key step for genotyping in biology. Recent scientific studies have demonstrated improved inference by aligning reads to a variation graph, i.e., a reference sequence augmented with known genetic variations. Given a variation graph in the form of a directed acyclic string graph, the sequence to graph alignment problem seeks to find the best matching path in the graph for an input query sequence. Solving this problem exactly using a sequential dynamic programming algorithm takes quadratic time in terms of the graph size and query length, making it difficult to scale to high throughput DNA sequencing data. In this work, we propose the first parallel algorithm for computing sequence to graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations. We take advantage of the available inter-task parallelism, and provide a novel blocked approach to compute the score matrix while ensuring high memory locality. Using a 48-core Intel Xeon Skylake processor, the proposed algorithm achieves peak performance of 317 billion cell updates per second (GCUPS), and demonstrates near linear weak and strong scaling on up to 48 cores. It delivers significant performance gains compared to existing algorithms, and results in run-time reduction from multiple days to three hours for the problem of optimally aligning high coverage long (PacBio/ONT) or short (Illumina) DNA reads to an MHC human variation graph containing 10 million vertices.

Availability: The implementation of our algorithm is available at https://github.com/ParBLiSS/PaSGAL. Data sets used for evaluation are accessible using https://alurulab.cc.gatech.edu/PaSGAL.
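
The sequential dynamic programming baseline mentioned in this abstract can be sketched in a few lines. The version below is an assumption-laden simplification, not the PaSGAL algorithm: it uses unit-cost edit distance rather than an alignment scoring scheme, each vertex carries a single character, vertices are assumed to be listed in topological order, and the path may start and end at any vertex. The function name, parameters and sample graph are hypothetical. Each vertex fills one row of len(query) + 1 cells from its predecessors' rows, which is where the graph size times query length cost comes from.

```python
def align_to_dag(labels, preds, query):
    """Minimum edit distance between query and any path in a character-labelled DAG.

    labels[v] is the character on vertex v; preds[v] lists predecessors of v,
    all of which must have smaller indices (topological order).
    """
    m = len(query)
    start = list(range(m + 1))  # empty path: query[:j] costs j insertions
    rows = []
    best = m  # aligning the whole query against the empty path
    for v, ch in enumerate(labels):
        # A path may begin at v, so the virtual start row is always a candidate.
        prev_rows = [start] + [rows[p] for p in preds[v]]
        row = [0] * (m + 1)
        row[0] = min(r[0] for r in prev_rows) + 1
        for j in range(1, m + 1):
            match = min(r[j - 1] for r in prev_rows) + (ch != query[j - 1])
            skip_vertex = min(r[j] for r in prev_rows) + 1
            skip_query = row[j - 1] + 1
            row[j] = min(match, skip_vertex, skip_query)
        rows.append(row)
        best = min(best, row[m])
    return best


# Hypothetical variation graph spelling AC[G|T]A: vertex 1 branches to G or T.
labels = ["A", "C", "G", "T", "A"]
preds = [[], [0], [1], [1], [2, 3]]
print(align_to_dag(labels, preds, "ACTA"))  # 0: the path A-C-T-A matches exactly
```

Rows of vertices that do not depend on each other, such as the G and T branches here, could be filled independently, which hints at the parallelism the paper exploits.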

GraphSeq: Accelerating String Graph Construction for De Novo Assembly on Spark

10.1101/321729 ◽

2018 ◽

Author(s):

Chung-Tsai Su ◽

Ming-Tai Chang ◽

Yun-Chian Cheng ◽

Yun-Lung Li ◽

Yao-Ting Wang

Keyword(s):

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Data Representation ◽

Important Application ◽

Supplementary Information ◽

De Novo Genome Assembly ◽

String Graph ◽

Computing Framework ◽

Variant Identification

Abstract:

Summary: De novo genome assembly is an important application for both uncharacterized genome assembly and variant identification in a reference-unbiased way. In comparison with the de Bruijn graph, the string graph is a lossless data representation for de novo assembly. However, string graph construction is computationally intensive. We propose GraphSeq to accelerate string graph construction by leveraging a distributed computing framework.

Availability and Implementation: GraphSeq is implemented with Scala on Spark and freely available at https://www.atgenomix.com/blog/graphseq.

Supplementary information: Supplementary data are available at Bioinformatics online.
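
To show why string graph construction is expensive, the sketch below builds the underlying overlap graph by brute force: every ordered pair of reads is checked for a suffix-prefix overlap, which is quadratic in the number of reads. This is an illustration only, not GraphSeq's distributed method. It assumes error-free reads on a single strand and considers only proper overlaps, so contained reads are not handled. The function names, the `min_len` parameter and the sample reads are hypothetical.

```python
def overlap_length(a, b, min_len):
    """Length of the longest proper suffix of a that is a prefix of b, or 0 if below min_len."""
    for n in range(min(len(a), len(b)) - 1, max(min_len, 1) - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len):
    """Map (i, j) -> overlap length for every ordered pair of distinct reads that overlap."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_length(a, b, min_len)
                if n:
                    edges[(i, j)] = n
    return edges


# Three hypothetical reads tiling the toy sequence ACGTTCGATT.
reads = ["ACGTTC", "GTTCGA", "TCGATT"]
print(overlap_graph(reads, min_len=3))
```

Lowering `min_len` to 2 adds the short edge between the first and third read, which is exactly the kind of transitive edge that string graph construction later removes.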
