Self-Supervised Representation Learning of Protein Tertiary Structures (PtsRep): Protein Engineering as A Case Study
AbstractIn recent years, deep learning has been increasingly used to decipher the relationships among protein sequence, structure, and function. Thus far deep learning of proteins has mostly utilized protein primary sequence information, while the vast amount of protein tertiary structural information remains unused. In this study, we devised a self-supervised representation learning framework to extract the fundamental features of unlabeled protein tertiary structures (PtsRep), and the embedded representations were transferred to two commonly recognized protein engineering tasks, protein stability and GFP fluorescence prediction. On both tasks, PtsRep significantly outperformed the two benchmark methods (UniRep and TAPE-BERT), which are based on protein primary sequences. Protein clustering analyses demonstrated that PtsRep can capture the structural signals in proteins. PtsRep reveals an avenue for general protein structural representation learning, and for exploring protein structural space for protein engineering and drug design.