Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads
Motivation: The third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its error rates are estimated in the range of 15-40%, much higher than the previous generation (approximately 1%). Fundamental tasks such as genome assembly and variant calling require us to obtain high quality sequences from these long erroneous sequences. Results: In this paper we describe a versatile and efficient linear complexity consensus algorithm Sparc that builds a sparse k-mer graph using a collection of sequences from the same genomic region. The heaviest path approximates the most likely genome sequence (consensus) and is sought through a sparsity-induced reweighted graph. Experiments show that our algorithm can efficiently provide high-quality consensus sequences with error rate <0.5% using both PacBio and Oxford Nanopore sequencing technologies. Compared with the existing approaches, Sparc calculates the consensus with higher accuracy, uses 80% less memory, and is 5x faster, approximately. Availability: The source code is available for download at http://sourceforge.net/p/sparc-consensus/code/ and a testing dataset is available: https://www.dropbox.com/sh/trng8vdaeqywx1e/AAASJesLVAJZcbORkU9f4LuBa?dl=0 (Please copy the link to a browser to access if directly clicking the link fails)