Fast and sensitive protein sequence homology searches using hierarchical cluster BLAST
Abstract

The throughput of DNA sequencing continues to increase, allowing researchers to analyze genomes of interest at greater depth. An unintended consequence of this data deluge is the increased cost of analyzing these datasets. As a result, genome and metagenome annotation pipelines are left with a few options: (i) search against smaller reference databases, (ii) use faster, but less sensitive, algorithms to assess sequence similarities, or (iii) invest in computing hardware specifically designed to improve BLAST searches, such as GPGPU systems and/or large CPU-rich clusters.

We present a pipeline that improves the speed of amino acid sequence homology searches with a minimal decrease in sensitivity and specificity by searching against hierarchical clusters. Briefly, the pipeline requires two homology searches: the first search is against a clustered version of the database, and the second is against the sequences belonging to clusters with a hit from the first search. We tested this method using two assembled viral metagenomes and three databases (Swiss-Prot, Metagenomes Online, and UniRef100). Hierarchical cluster homology searching proved to be 12 times faster than BLASTp and produced alignments that were nearly identical to those of BLASTp (precision=0.99; recall=0.97). This approach is ideal when searching large collections of sequences against large databases.
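The two-stage search described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `similarity` function is a toy k-mer Jaccard score standing in for a BLASTp alignment, and all names (`search`, `hierarchical_search`, `precision_recall`), the thresholds, and the example sequences are hypothetical choices made only for illustration.

```python
def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Toy stand-in for an alignment score: Jaccard index of k-mer sets.

    Assumption: a real pipeline would run BLASTp here instead.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def search(query, targets, threshold):
    """Return the ids of targets whose similarity to the query meets the threshold."""
    return [tid for tid, seq in targets.items()
            if similarity(query, seq) >= threshold]


def hierarchical_search(query, representatives, members, database,
                        rep_threshold, member_threshold):
    """Two-stage search: cluster representatives first, then cluster members."""
    # Stage 1: search only the clustered (representative) database.
    hit_clusters = search(query, representatives, rep_threshold)
    # Stage 2: search only sequences that belong to clusters with a hit.
    candidates = {sid: database[sid]
                  for cid in hit_clusters
                  for sid in members[cid]}
    return sorted(search(query, candidates, member_threshold))


def precision_recall(predicted, reference):
    """Compare the hits of the fast search against those of a full search."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(reference) if reference else 1.0
    return precision, recall


# Hypothetical toy database with two clusters, A and B.
database = {
    "a1": "MKTAYIAKQR",
    "a2": "MKTAYIAKQL",
    "b1": "GSHMLEDPVD",
    "b2": "GSHMLEDPAD",
}
representatives = {"A": database["a1"], "B": database["b1"]}
members = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
query = "MKTAYIAKQR"

fast_hits = hierarchical_search(query, representatives, members, database,
                                rep_threshold=0.3, member_threshold=0.5)
full_hits = sorted(search(query, database, 0.5))
```

In this toy case the first stage compares the query with two representatives instead of four sequences, and the second stage compares it only with the members of cluster A. The saving grows with the number of clusters that receive no hit. The first-stage threshold is deliberately looser than the second so that a cluster is not discarded merely because its representative is a weaker match than one of its members; hits lost at that step are what reduce recall relative to a full search.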