TaxaSE: Exploiting evolutionary conservation within 16S rDNA sequences for enhanced taxonomic annotation
Amplicon-based taxonomic analysis, which determines the presence of microbial taxa in different environments on the basis of marker gene annotations, often uses percentage identity as the main metric of sequence similarity against reference databases. The resulting annotations are then used to study the distribution of biodiversity and the response of microbial communities to environmental conditions. However, the 16S rRNA gene displays varying degrees of sequence conservation along its length, and percentage identity does not fully utilize this information. Additionally, the prevalent use of Operational Taxonomic Units (OTUs) is not without its own issues and may reduce the annotation capability of the system. Hence, a new approach to taxonomic annotation is needed. Here we introduce a taxonomic annotation pipeline, TaxaSE, which uses Shannon entropy to quantify evolutionary conservation within 16S rDNA sequences for enhanced taxonomic annotation. Furthermore, the system can annotate individual sequences, which improves fine-grained taxonomic annotation. We present both an in silico comparison of the new similarity metric with percentage identity and a comparison with the popular QIIME pipeline. The results demonstrate that the new similarity metric achieves better performance, especially at lower taxonomic levels. Furthermore, the pipeline is able to extract more fine-grained taxonomic annotations than QIIME. These results demonstrate the effectiveness of the new pipeline and highlight the need to shift away from both percentage identity and OTU-based approaches in ecological projects.
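To make the idea of entropy-based conservation concrete, the following minimal Python sketch computes the Shannon entropy of each column of a toy alignment and uses it to down-weight variable positions when scoring a query against a reference. The abstract does not give the exact formula used by TaxaSE, so the weighting scheme (one minus normalized entropy), the function names, and the gap-free toy sequences are all illustrative assumptions rather than the pipeline's actual implementation.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the characters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # H = sum over symbols of p * log2(1 / p); a fully conserved column gives 0.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def alignment_entropies(aligned_seqs):
    """Per-column entropy for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned_seqs)]


def percent_identity(query, reference):
    """Plain percentage identity: every position counts equally."""
    matches = sum(q == r for q, r in zip(query, reference))
    return 100.0 * matches / len(query)


def weighted_similarity(query, reference, entropies, max_entropy=2.0):
    """Identity with each position weighted by its conservation.

    Assumption (not from the paper): weight = 1 - H / max_entropy, where
    max_entropy = 2 bits is the maximum for four nucleotides without gaps.
    """
    weights = [max(0.0, 1.0 - h / max_entropy) for h in entropies]
    total = sum(weights)
    matched = sum(w for q, r, w in zip(query, reference, weights) if q == r)
    return 100.0 * matched / total if total else 0.0


# Hypothetical toy alignment: columns 0-2 and 5 are conserved, 3-4 are variable.
alignment = ["ACGTAC", "ACGTTC", "ACGAGC", "ACGCCC"]
entropies = alignment_entropies(alignment)

query = "ACGTAC"
ref_variable_mismatch = "ACGTTC"   # differs from the query at variable column 4
ref_conserved_mismatch = "ATGTAC"  # differs from the query at conserved column 1

for ref in (ref_variable_mismatch, ref_conserved_mismatch):
    print(ref, percent_identity(query, ref),
          weighted_similarity(query, ref, entropies))
```

In this toy case both references have the same percentage identity to the query (five matches out of six positions), but the weighted score separates them: the mismatch in the fully variable column carries no weight, whereas the mismatch in a conserved column lowers the score to roughly 76.5%. This is the kind of positional information that plain percentage identity discards.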