Background:
Thrombin is the central protease of the vertebrate blood coagulation
cascade and is closely linked to cardiovascular diseases. The inhibitory constant (Ki) is the most
important property of a thrombin inhibitor.
Method:
This study aimed to predict the Ki values of thrombin inhibitors from a large data
set using machine learning methods. Because machine learning can uncover non-intuitive regularities in
high-dimensional data sets, it is well suited to building effective predictive models. A total
of 6554 descriptors were collected for each compound, and an efficient descriptor selection method
was used to identify the appropriate descriptors. Four methods, namely multiple linear
regression (MLR), K-Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT), and
Support Vector Machine (SVM), were used to build prediction models from the selected
descriptors.
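As an illustration of one of the four model types named above, the following is a minimal sketch of K-Nearest Neighbors regression: the predicted value for a compound is the mean target value of its k closest training compounds in descriptor space. It is not the study's implementation. The function name `knn_predict`, the Euclidean distance metric, the choice k=3, and the toy descriptor and target values are all assumptions made for this example.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a value for `query` as the mean target of its k nearest training points."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training points by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


# Toy data: two made-up descriptors per compound and made-up activity values.
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [6.0, 7.0, 8.0, 3.0]

# The three nearest neighbours of (0.2, 0.2) are the first three compounds.
prediction = knn_predict(train_X, train_y, (0.2, 0.2), k=3)
```

In practice, descriptors with different scales would usually be standardized before computing distances, since otherwise the largest-valued descriptor dominates the neighbour ranking.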
Results:
The SVM model performed best among these methods, with R² = 0.84 and MSE = 0.55 for the
training set and R² = 0.83 and MSE = 0.56 for the test set. Several validation methods, such as the
y-randomization test and applicability domain evaluation, were used to assess the robustness and
generalization ability of the model. The final model shows excellent stability and predictive ability
and can be used for rapid estimation of the inhibitory constant, which is helpful for
designing novel thrombin inhibitors.
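The two reported metrics and the idea behind the y-randomization test can be sketched as follows. In y-randomization, the target values are shuffled so that any real descriptor-activity relationship is destroyed, and the model is refitted; a sound model should score far worse on shuffled targets than on the real ones. This sketch is illustrative only: a one-descriptor least-squares line stands in for the study's SVM, the data are synthetic, the scores are in-sample, and all function names, the number of rounds, and the random seeds are assumptions.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least squares for y = a*x + b (a stand-in for the real model)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fitted_r2(x, y):
    """Fit the stand-in model and return its in-sample R² on the same data."""
    slope, intercept = fit_line(x, y)
    return r2(y, [slope * xi + intercept for xi in x])


def y_randomization(x, y, n_rounds=100, seed=0):
    """Refit on shuffled targets n_rounds times and collect the R² of each fit."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y)
        rng.shuffle(shuffled)  # break the link between descriptor and target
        scores.append(fitted_r2(x, shuffled))
    return scores


# Synthetic data: one descriptor with a linear relationship plus Gaussian noise.
x = [float(i) for i in range(20)]
noise = random.Random(1)
y = [0.5 * xi + 2.0 + noise.gauss(0.0, 0.3) for xi in x]

true_r2 = fitted_r2(x, y)
shuffled_scores = y_randomization(x, y)
mean_shuffled_r2 = sum(shuffled_scores) / len(shuffled_scores)
```

A large gap between `true_r2` and `mean_shuffled_r2` suggests the fit reflects a genuine relationship rather than chance correlation. A full validation would compute the scores on held-out data and with the actual model.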