Sequence-based prediction of protein phase separation into disordered condensates using machine learning
Several proteins implicated in neurodegenerative disorders (e.g. Alzheimer's and Parkinson's disease) have been shown to undergo liquid-liquid phase separation (LLPS). In this work we build a predictor of whether a given protein undergoes LLPS. We used protein sequences for which the answer was already known: those that undergo LLPS formed the positive set, and those that do not formed the negative set. From the amino-acid sequences we identified variables relevant to LLPS, e.g. the number of amino acids, the length of the best pairings, and the average register shifts. Using these variables we built a number of scoring functions, each an analytic function of these variables, and we also combined some scores already existing in the literature. We considered a total of 43636 protein sequences, of which only 121 were positive. We applied logistic regression and performed cross-validation, in which 25% of the data were used as the training set and performance was tested on the remaining 75%. During training, we used the Simplex algorithm to maximize the area under the curve (AUC) in receiver operating characteristic (ROC) space for each of the scores we defined. The optimised parameters were then used to evaluate the AUC on the test set to assess predictive performance. The best-performing score was selected as the model for predicting whether a protein chain undergoes phase separation.
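The train/optimise/test procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "Simplex" refers to the Nelder-Mead downhill simplex method, uses a hypothetical two-parameter linear score in place of the actual scoring functions, and runs on synthetic, class-imbalanced data rather than the 43636 sequences. All function and variable names are illustrative.

```python
import random

def auc(scores, labels):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_split(labels, train_frac, rng):
    """Split indices into train/test while keeping the class ratio in both parts."""
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = max(1, round(train_frac * len(idx)))
        train += idx[:cut]
        test += idx[cut:]
    return train, test

def nelder_mead(f, x0, step=0.5, iters=100):
    """Minimal Nelder-Mead (downhill simplex) minimiser; assumed meaning of 'Simplex'."""
    n = len(x0)
    simplex = [list(x0)] + [
        [x0[j] + (step if j == i else 0.0) for j in range(n)] for i in range(n)
    ]
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:  # shrink every vertex halfway towards the best one
                simplex = [best] + [
                    [b + 0.5 * (pj - b) for b, pj in zip(best, p)] for p in simplex[1:]
                ]
    return min(simplex, key=f)

# Synthetic stand-in data: few positives, many negatives, two made-up features.
rng = random.Random(0)
labels = [1] * 80 + [0] * 800
features = [(rng.gauss(2.0 if y else 0.0, 1.0), rng.gauss(0.0, 1.0)) for y in labels]

# 25% of the data for training, 75% for testing, as in the text.
train_idx, test_idx = stratified_split(labels, 0.25, rng)

def score(w, x):
    """Hypothetical scoring function: a linear combination of the two features."""
    return w[0] * x[0] + w[1] * x[1]

def auc_on(idx, w):
    return auc([score(w, features[i]) for i in idx], [labels[i] for i in idx])

# Maximise training AUC by minimising its negative, then evaluate on the test set.
w0 = [1.0, 1.0]
w_opt = nelder_mead(lambda w: -auc_on(train_idx, w), w0)
train_auc = auc_on(train_idx, w_opt)
test_auc = auc_on(test_idx, w_opt)
print(f"train AUC = {train_auc:.3f}, test AUC = {test_auc:.3f}")
```

A derivative-free optimiser is a natural fit here because AUC depends only on the ranking of the scores and is therefore piecewise constant in the parameters, so gradient-based methods have nothing to follow. Stratifying the split matters with so few positives: an unstratified 25% sample could contain very few positive examples.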