Uncovering non-random sequence patterns within intrinsically disordered proteins
Abstract
Intrinsically disordered proteins / regions (IDPs / IDRs) pose unique challenges for deriving sequence-function relationships from multiple sequence alignments. These challenges arise from variations in sequence lengths, similarities, and identities across orthologs. Recent computational efforts have demonstrated the utility of comparing large numbers of distinct sequence features as a strategy to identify conserved sequence-function relationships in IDPs / IDRs. Inspired by these efforts, and by biophysical studies that have established the importance of binary patterning features in IDPs / IDRs, we present here a computational method, NARDINI (Non-random Arrangement of Residues in Disordered Regions Inferred using Numerical Intermixing), to uncover truly non-random binary patterns within disordered proteins / regions. Binary patterns refer to the linear clustering or dispersion of specific residues or residue types with respect to all other residues or specific types of residues. Our approach neither uses nor requires sequence alignments. Instead, for each IDR, we generate an ensemble of scrambled sequences and use this ensemble to set expectations, from a composition-specific null model, for the patterning parameters of interest. We annotate each IDR in terms of pattern-specific z-score matrices by computing how specific patterns deviate from the null model. The z-scores help in identifying the non-random linear sequence patterns within an IDR. We tested the accuracy of NARDINI-derived z-scores by assessing their ability to identify sequence patterns that have been identified as determinants of sequence-function relationships in specific IDPs / IDRs.
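The scramble-based null model described above can be illustrated with a minimal sketch. It is not the NARDINI implementation: the patterning score used here (the variance of the local fraction of a residue group across sliding windows) is a simple stand-in chosen for illustration, and the function names, window size, and number of scrambles are all assumptions. The sketch computes a single z-score for one residue group against all other residues; a z-score matrix would repeat the same comparison over pairs of residue types.

```python
import random
import statistics


def segregation(seq, group, window=5):
    """Toy patterning score (an illustrative stand-in, not NARDINI's parameter).

    Returns the mean squared deviation of the local fraction of `group`
    residues, measured in sliding windows, from the overall fraction.
    Larger values indicate stronger linear clustering of the group.
    """
    flags = [1 if aa in group else 0 for aa in seq]
    n = len(flags)
    if n < window:
        raise ValueError("sequence is shorter than the window")
    overall = sum(flags) / n
    local = [sum(flags[i:i + window]) / window for i in range(n - window + 1)]
    return sum((f - overall) ** 2 for f in local) / len(local)


def pattern_zscore(seq, group, n_scrambles=1000, window=5, seed=0):
    """z-score of the observed score against a composition-matched null.

    The null ensemble is built by shuffling the residues of `seq`, so every
    scramble has exactly the same amino acid composition as the input.
    """
    rng = random.Random(seed)
    observed = segregation(seq, group, window)
    residues = list(seq)
    null_scores = []
    for _ in range(n_scrambles):
        rng.shuffle(residues)  # in-place scramble, composition preserved
        null_scores.append(segregation(residues, group, window))
    mu = statistics.fmean(null_scores)
    sigma = statistics.pstdev(null_scores)
    if sigma == 0:
        return 0.0  # no spread in the null (e.g. group absent from seq)
    return (observed - mu) / sigma


if __name__ == "__main__":
    blocky = "K" * 10 + "E" * 10   # lysines fully clustered
    mixed = "KE" * 10              # lysines evenly dispersed
    print(pattern_zscore(blocky, "K"))
    print(pattern_zscore(mixed, "K"))
```

Under this toy score, a positive z-score means the group is more clustered than expected for its composition and a negative z-score means it is more evenly dispersed. Because the null is generated from the sequence itself, no alignment or ortholog set is needed.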