Using BERT to identify drug-target interactions from whole PubMed
Abstract

Background: Drug-target interactions (DTIs) are critical for drug repurposing and for elucidating drug mechanisms, and are manually curated in large databases such as ChEMBL, BindingDB, DrugBank and DrugTargetCommons. However, the ~0.1 million articles from which these data are drawn likely constitute only a fraction of all PubMed articles containing experimentally determined DTIs. Finding such articles and extracting the experimental information is a challenging task, and there is a pressing need for systematic approaches to assist the curation of DTIs. To this end, we propose a Bidirectional Encoder Representations from Transformers (BERT) model to identify such articles. Because DTI data depend intimately on the type of assay used to generate them, we also aimed to incorporate functions to predict the assay format.

Results: Our method identified ~2.1 million articles (along with drug and protein information) not previously included in public DTI databases. Using 10-fold cross-validation, we obtained ~99% accuracy for identifying articles containing quantitative drug-target profiles. Accuracy for predicting the assay format is ~90%, which leaves room for improvement in future studies.

Conclusion: The BERT model in this study is robust, and the proposed pipeline can be used to identify previously overlooked articles containing quantitative DTIs. Overall, our method represents a significant advance in machine-assisted DTI extraction and curation, and we expect it to be a useful addition to drug mechanism discovery and repurposing.
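The evaluation setup described above (10-fold cross-validated classification of articles as containing quantitative DTI data or not) can be sketched as follows. This is a minimal illustration only: a TF-IDF plus logistic-regression baseline stands in for the fine-tuned BERT model (which requires pretrained weights), and the example abstracts and labels are synthetic, not the paper's data or code.

```python
# Hypothetical sketch of 10-fold cross-validated article classification.
# A TF-IDF + logistic-regression pipeline stands in for BERT; all
# "abstracts" below are synthetic placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Label 1 = abstract reports quantitative DTI data, 0 = it does not.
texts = [
    "The compound inhibited EGFR with an IC50 of 12 nM in a kinase assay.",
    "A binding affinity Kd of 3.4 uM was measured by SPR against BRD4.",
    "We review the history of drug discovery in the twentieth century.",
    "This editorial discusses funding policy for biomedical research.",
] * 10  # replicate so every fold contains both classes
labels = [1, 1, 0, 0] * 10

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, texts, labels, cv=cv, scoring="accuracy")
print(f"mean 10-fold accuracy: {scores.mean():.2f}")
```

In practice one would replace the stand-in classifier with a fine-tuned transformer and real curated labels; the cross-validation scaffolding stays the same. The assay-format task would be the analogous multi-class variant.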