Using Apache Spark on genome assembly for scalable overlap-graph reduction

2019 ◽  
Vol 13 (S1) ◽  
Author(s):  
Alexander J. Paul ◽  
Dylan Lawrence ◽  
Myoungkyu Song ◽  
Seung-Hwan Lim ◽  
Chongle Pan ◽  
...  

Abstract

Background: De novo genome assembly is a technique that builds the genome of a specimen from overlaps between genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds using these overlaps. The quality of a de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly of species, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been proposed. However, these techniques require large amounts of computer memory when very large overlap graphs are resolved, and the computation is challenging to parallelize.

Results: To address these limitations, we propose an algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction using Apache Spark. SORA's implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently compacts the edges along very long graph paths by adapting the scalable features of the graph processing libraries available for Apache Spark, GraphX and GraphFrames.

Conclusions: We shared the algorithms and the experimental results at our project website, https://github.com/BioHPC/SORA. We evaluated SORA with human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed small to mid-size graphs on a single workstation within a short time frame. Overall, SORA achieved linear scaling in simulations as the number of computing instances increased.
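To make the idea of overlap-graph reduction concrete, the following is a minimal single-machine sketch in plain Python. It is not SORA's implementation and does not use Spark, GraphX or GraphFrames. The function names `transitive_reduction` and `compact_paths` are hypothetical, and the sketch assumes a simplified graph in which each read is a node and each overlap is an unweighted directed edge. Real string-graph reduction also takes overlap lengths and read orientation into account.

```python
from collections import defaultdict


def transitive_reduction(edges):
    """Drop edge (a, c) whenever edges (a, b) and (b, c) both exist.

    Simplifying assumption: overlaps are unweighted, so any two-hop
    path makes the direct edge redundant.
    """
    succ = defaultdict(set)
    for a, b in edges:
        succ[a].add(b)

    redundant = set()
    for a, nbrs in succ.items():
        for b in nbrs:
            # succ.get avoids inserting new keys while iterating.
            for c in succ.get(b, ()):
                if c != b and c in nbrs:
                    redundant.add((a, c))
    return sorted(set(edges) - redundant)


def compact_paths(edges):
    """Collapse unbranched chains of nodes into single paths.

    A node is 'inner' when it has exactly one incoming and one outgoing
    edge. Cycles made only of inner nodes are not reported.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)
    nodes = sorted(set(succ) | set(pred))

    def is_inner(n):
        return len(pred[n]) == 1 and len(succ[n]) == 1

    paths = []
    for start in nodes:
        if is_inner(start):
            continue
        for nxt in succ[start]:
            path = [start, nxt]
            while is_inner(nxt):
                nxt = succ[nxt][0]
                path.append(nxt)
            paths.append(path)
    return paths


# Hypothetical toy graph: four reads with two transitive overlaps.
overlaps = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C"), ("B", "D")]
reduced = transitive_reduction(overlaps)
contig_paths = compact_paths(reduced)
```

In this toy graph the edges A→C and B→D are implied by two-hop paths and are removed, after which the remaining chain collapses into one path. A distributed version would need to express the same neighbor lookups as joins or message passing over partitioned edge lists, which is the part the abstract attributes to Spark's graph libraries.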

BMC Genomics ◽  
2015 ◽  
Vol 16 (Suppl 12) ◽  
pp. S9 ◽  
Author(s):  
Chih-Hao Fang ◽  
Yu-Jung Chang ◽  
Wei-Chun Chung ◽  
Ping-Heng Hsieh ◽  
Chung-Yen Lin ◽  
...  

PLoS ONE ◽  
2013 ◽  
Vol 8 (4) ◽  
pp. e62856 ◽  
Author(s):  
Yen-Chun Chen ◽  
Tsunglin Liu ◽  
Chun-Hui Yu ◽  
Tzen-Yuh Chiang ◽  
Chi-Chuan Hwang

2018 ◽  
Author(s):  
Chung-Tsai Su ◽  
Ming-Tai Chang ◽  
Yun-Chian Cheng ◽  
Yun-Lung Li ◽  
Yao-Ting Wang

Abstract

Summary: De novo genome assembly is important both for assembling uncharacterized genomes and for identifying variants in a reference-unbiased way. Compared with the de Bruijn graph, the string graph is a lossless data representation for de novo assembly. However, string graph construction is computationally intensive. We propose GraphSeq to accelerate string graph construction by leveraging a distributed computing framework.

Availability and Implementation: GraphSeq is implemented in Scala on Spark and is freely available at https://www.atgenomix.com/blog/graphseq.

Supplementary information: Supplementary data are available at Bioinformatics online.
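As an illustration of why string graph construction is costly, here is a brute-force sketch of the overlap step in plain Python. It is not GraphSeq's method and does not use Spark. The names `overlap` and `build_overlap_graph` are hypothetical, and the sketch assumes exact suffix-prefix matches on a single strand with no sequencing errors and no contained reads.

```python
from itertools import permutations


def overlap(a, b, min_len):
    """Return the length of the longest suffix of `a` that equals a
    prefix of `b`, or 0 if no such match has at least `min_len` bases."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_len):
    """Compare every ordered pair of reads and keep the overlaps.

    The result maps (i, j) to the overlap length between reads[i] and
    reads[j]. The pairwise comparison is quadratic in the number of
    reads, which is what makes this step expensive at genome scale.
    """
    graph = {}
    for i, j in permutations(range(len(reads)), 2):
        k = overlap(reads[i], reads[j], min_len)
        if k > 0:
            graph[(i, j)] = k
    return graph


# Hypothetical toy reads, chosen so that consecutive reads share 4 bases.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = build_overlap_graph(reads, min_len=3)
```

Practical tools typically avoid the all-pairs comparison by indexing suffixes or k-mers, and a distributed framework can then partition that index across workers.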

