Using BERT to identify drug-target interactions from whole PubMed
Abstract

Background: Drug-target interactions (DTIs) are critical for drug repurposing and for elucidating drug mechanisms, and are manually curated in large databases such as ChEMBL, BindingDB, DrugBank and DrugTargetCommons. However, the ~0.1 million articles from which these data are drawn likely constitute only a fraction of all PubMed articles containing experimentally determined DTIs. Finding such articles and extracting the experimental information is a challenging task, and there is a pressing need for systematic approaches to assist the curation of DTIs. To this end, we propose a Bidirectional Encoder Representations from Transformers (BERT) model to identify such articles. Because DTI data depend intimately on the type of assay used to generate them, we also aimed to incorporate functions to predict the assay format.

Results: Our method identified ~2.1 million articles (along with drug and protein information) not previously included in public DTI databases. Using 10-fold cross-validation, we obtained ~99% accuracy for identifying articles containing quantitative drug-target profiles. Accuracy for predicting the assay format is ~90%, which leaves room for improvement in future studies.

Conclusion: The BERT model in this study is robust, and the proposed pipeline can be used to identify previously overlooked articles containing quantitative DTIs. Overall, our method represents a significant advance in machine-assisted DTI extraction and curation, and we expect it to be a useful addition to drug mechanism discovery and repurposing.
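The evaluation setup described above (10-fold cross-validated classification of articles as containing quantitative DTI data or not) can be sketched as follows. This is a minimal illustration only: a TF-IDF plus logistic-regression baseline stands in for the fine-tuned BERT model (which requires pretrained weights), and the example abstracts and labels are synthetic, not the paper's data or code.

```python
# Hypothetical sketch of 10-fold cross-validated article classification.
# A TF-IDF + logistic-regression pipeline stands in for BERT; all
# "abstracts" below are synthetic placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Label 1 = abstract reports quantitative DTI data, 0 = it does not.
texts = [
    "The compound inhibited EGFR with an IC50 of 12 nM in a kinase assay.",
    "A binding affinity Kd of 3.4 uM was measured by SPR against BRD4.",
    "We review the history of drug discovery in the twentieth century.",
    "This editorial discusses funding policy for biomedical research.",
] * 10  # replicate so every fold contains both classes
labels = [1, 1, 0, 0] * 10

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, texts, labels, cv=cv, scoring="accuracy")
print(f"mean 10-fold accuracy: {scores.mean():.2f}")
```

In practice one would replace the stand-in classifier with a fine-tuned transformer and real curated labels; the cross-validation scaffolding stays the same. The assay-format task would be the analogous multi-class variant.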