Phylogenetic measures of indel rate variation among the HIV-1 group M subtypes
The transmission and pathogenesis of human immunodeficiency virus type 1 (HIV-1) are disproportionately influenced by evolution in the five variable regions of the virus surface envelope glycoprotein (gp120). Insertions and deletions (indels) are a significant source of evolutionary change in these regions. However, the influx of indels relative to nucleotide substitutions has not yet been quantified through a comparative analysis of HIV-1 sequence data. Here we develop and report results from a phylogenetic method to estimate indel rates for the gp120 variable regions across five major subtypes and two circulating recombinant forms (CRFs) of HIV-1 group M. We processed over 26,000 published HIV-1 gp120 sequences, from which we extracted 6,605 sequences for phylogenetic analysis. In brief, our method employs maximum likelihood to reconstruct phylogenies scaled in time and fits a Poisson model to the observed distribution of indels between closely related pairs of sequences in the tree (cherries). The rate estimates ranged from 3.0e-5 to 1.5e-3 indels/nt/year and varied significantly among variable regions and subtypes. Indel rates were significantly lower in the region encoding variable loop V3, and also lower for HIV-1 subtype B relative to other subtypes. We also found that variable loops V1, V2 and V4 tended to accumulate significantly longer indels. Further, we observed that the nucleotide composition of indel sequences differed significantly from that of the flanking sequence in HIV-1 gp120. Indels affected potential N-linked glycosylation sites substantially more often in V1 and V2 than expected by chance, which is consistent with positive selection on glycosylation patterns within these regions of gp120. These results represent the first comprehensive measures of indel rates in HIV-1 gp120 across multiple subtypes and CRFs, and identify novel and unexpected patterns for further research in the molecular evolution of HIV-1.
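To illustrate the kind of calculation that a Poisson model over cherries implies, the sketch below estimates a single indel rate in indels/nt/year. It is a simplified illustration and not necessarily the likelihood used in this study: it assumes that the indel count in each cherry is Poisson distributed with mean equal to the rate times the region length times the time separating the two tips. The `Cherry` record, its fields, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Cherry:
    """A pair of sibling tips in a time-scaled tree (hypothetical record type)."""
    indel_count: int    # indels observed between the two sequences in one region
    years_apart: float  # sum of the two tip branch lengths, in years


def poisson_indel_rate(cherries: Iterable[Cherry], region_length_nt: int) -> float:
    """Maximum likelihood indel rate (indels/nt/year) under a Poisson model.

    Assumes indel_count_i ~ Poisson(rate * region_length_nt * years_apart_i).
    Under that assumption the estimate is the total number of indels divided
    by the total exposure in nucleotide-years.
    """
    cherries = list(cherries)
    total_indels = sum(c.indel_count for c in cherries)
    exposure = region_length_nt * sum(c.years_apart for c in cherries)
    if exposure <= 0:
        raise ValueError("total exposure (nt * years) must be positive")
    return total_indels / exposure


# Toy data, invented purely for illustration.
toy_cherries = [Cherry(1, 2.0), Cherry(0, 3.0), Cherry(2, 5.0)]
toy_rate = poisson_indel_rate(toy_cherries, region_length_nt=100)
```

With these toy values, three indels over 100 nt and 10 years of combined divergence time give a rate of 3 / 1000 indels/nt/year. In practice each variable region and subtype would be fitted separately, and the region length would vary among sequences.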