pKAI: a fast and interpretable deep learning approach for accurate electrostatics-driven pKa predictions
Abstract The pKa values of ionizable residues influence protein folding, stability and biological function. The pKa in bulk water is known for all residues; in a protein environment, however, it can be significantly affected by confinement and electrostatics. Existing computational methods to estimate pKa shifts rely on theoretical approximations and lengthy computations. Furthermore, the number of experimentally determined pKa values is still very limited, hindering the development of faster machine learning-based methods. In this work, we use a data set of 6 million pKa shifts (determined by PypKa, a continuum electrostatics method) to train deep learning models that are shown to rival the physics-based predictor. On ~750 experimentally determined data points, our model displays the best accuracy and is the only one that breaks the 1 pK unit RMSE barrier on this considerably difficult test set. Although trained using a very simplified view of the surroundings of the titratable group (namely, atom types and distances to other titratable groups within a given radius), the models are shown to assign proper electrostatic charges to chemical groups, to preserve the known correlation between solvent exposure and pKa shift magnitude, and to grasp the importance of close interactions, including hydrogen bonds. Inference is more than 1000 times faster than with physics-based methods, especially for large proteins. By combining speed, accuracy and a reasonable understanding of the theoretical basis for pKa shifts, our models provide a game-changing solution for fast estimations of macroscopic pKa from ensembles of microscopic (pKhalf) values as well as for many downstream applications such as molecular docking and constant-pH molecular dynamics simulations.