Multiclass Classifier for P-Glycoprotein Substrates, Inhibitors, and Non-Active Compounds

Molecules ◽  
2019 ◽  
Vol 24 (10) ◽  
pp. 2006 ◽  
Author(s):  
Liadys Mora Lagares ◽  
Nikola Minovski ◽  
Marjana Novič

P-glycoprotein (P-gp) is a transmembrane protein that actively transports a wide variety of chemically diverse compounds out of the cell. It is highly associated with the ADMET (absorption, distribution, metabolism, excretion and toxicity) properties of drugs and drug candidates and contributes to decreasing toxicity by eliminating compounds from cells, thereby preventing intracellular accumulation. Therefore, in drug discovery and toxicological assessment it is advisable to check whether a compound under development could be transported by P-gp. In this study, an in silico multiclass classification model capable of predicting the probability that a compound interacts with P-gp was developed using a counter-propagation artificial neural network (CP ANN) based on a set of 2D molecular descriptors and an extensive dataset of 2512 compounds (1178 P-gp inhibitors, 477 P-gp substrates and 857 P-gp non-active compounds). The model provided good classification performance, producing non-error rate (NER) values of 0.93 for the training set and 0.85 for the test set, while the average precision (AvPr) was 0.93 for the training set and 0.87 for the test set. An external validation set of 385 compounds was used to challenge the model’s performance. On the external validation set, both NER and AvPr were 0.70. We believe that this in silico classifier could be effectively used as a reliable virtual screening tool for identifying potential P-gp ligands.
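The two indices reported above can be illustrated with a minimal sketch. It assumes the usual chemometrics definitions, which the abstract itself does not spell out: NER as the mean of the per-class sensitivities and AvPr as the mean of the per-class precisions. The function name and the confusion matrix are hypothetical and are not data from the paper.

```python
def ner_and_avpr(confusion):
    """confusion[i][j] = number of class-i compounds predicted as class j."""
    n = len(confusion)
    sensitivities, precisions = [], []
    for k in range(n):
        true_k = sum(confusion[k])                        # row total: actual class k
        pred_k = sum(confusion[i][k] for i in range(n))   # column total: predicted as k
        sensitivities.append(confusion[k][k] / true_k if true_k else 0.0)
        precisions.append(confusion[k][k] / pred_k if pred_k else 0.0)
    # Assumed definitions: NER = mean sensitivity, AvPr = mean precision.
    return sum(sensitivities) / n, sum(precisions) / n


# Hypothetical 3-class matrix (inhibitor, substrate, non-active); illustration only.
example = [
    [90, 5, 5],
    [10, 80, 10],
    [0, 10, 90],
]
ner, avpr = ner_and_avpr(example)
print(f"NER = {ner:.3f}, AvPr = {avpr:.3f}")
```

Because both indices average over classes rather than over compounds, they are not dominated by the largest class, which matters for a dataset as unbalanced as the one described (1178, 477 and 857 compounds).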

Author(s):  
Zhuo Lu ◽  
Jin Chen ◽  
JiongYi Yan ◽  
QiaoMing Liu ◽  
Fang Li ◽  
...  

Background: Colon cancer is one of the most common cancers worldwide and has a poor prognosis. Through analysis of transcriptome and clinical data of colon cancer, an immune gene-set signature was identified by single-sample gene set enrichment analysis (ssGSEA) scoring to predict patient survival and discover new therapeutic targets. Objective: To study the role of the immune gene-set signature in colon cancer. Methods: First, RNA-Seq data and clinical follow-up information were downloaded from The Cancer Genome Atlas (TCGA). Immune-related gene sets were collected from the ImmPort database. Genes and immunological pathways related to prognosis were screened in the training set and integrated for feature selection using random forest. The immune gene-related prognosis model was verified in the entire TCGA test set and a GEO validation set and compared with immune cell scores and matrix score. Results: A total of 1650 prognostic genes and 13 immunological pathways were identified. These genes and pathways are closely related to the development of tumors. A 13-immune gene-set signature was established, which is an independent prognostic factor for patients with colon cancer. Risk stratification of samples could be carried out in the training set, test set and external validation set. The AUC for five-year survival in the training set and validation set was greater than 0.6. Immunosuppression occurred in high-risk samples. Compared with published models, the Riskscore had a better predictive effect. Conclusion: This study constructed a 13-immune gene-set signature as a new prognostic marker to predict the survival of patients with colon cancer, and provided new diagnostic/prognostic biomarkers and therapeutic targets for colon cancer.


2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Random sampling is traditionally used to divide datasets. The problem, however, is that the measured performance of the model depends on how the training and test sets are divided. Therefore, in this study, we proposed an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate train/test sets using the R-value-based sampling method. We evaluated the similarity between the distribution of each candidate and that of the whole dataset, and the candidate with the smallest distribution difference was selected as the final train/test set. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more suitable training and test sets than previous sampling methods, including random and non-random sampling.
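The selection step described above (generate many candidate splits, keep the one whose distributions are closest to the whole dataset) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: candidates are drawn by plain random shuffling rather than R-value-based sampling, only a single numeric feature is compared, and similarity is an L1 distance between normalized histograms. All names are hypothetical.

```python
import random


def histogram(values, lo, hi, bins=5):
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0          # avoid zero width for constant data
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def pick_split(values, test_fraction=0.3, n_candidates=200, seed=0):
    """Return (difference, train_idx, test_idx) for the best candidate split."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    whole = histogram(values, lo, hi)
    n_test = max(1, round(len(values) * test_fraction))
    best = None
    for _ in range(n_candidates):
        idx = list(range(len(values)))
        rng.shuffle(idx)                     # assumption: plain random candidates
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        diff = 0.0
        for part in (train_idx, test_idx):
            h = histogram([values[i] for i in part], lo, hi)
            diff += sum(abs(a - b) for a, b in zip(h, whole))
        if best is None or diff < best[0]:
            best = (diff, sorted(train_idx), sorted(test_idx))
    return best


# Hypothetical one-feature dataset of 50 samples.
values = [float(i % 10) for i in range(50)]
diff, train_idx, test_idx = pick_split(values)
print(len(train_idx), len(test_idx), round(diff, 3))
```

A fuller version would repeat the comparison for every feature and could weight each feature by its importance, which is closer to what the abstract describes.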


2021 ◽  
Author(s):  
Xiaobo Wen ◽  
Biao Zhao ◽  
Meifang Yuan ◽  
Jinzhi Li ◽  
Mengzhen Sun ◽  
...  

Abstract Objectives: To explore the performance of the Multi-scale Fusion Attention U-net (MSFA-U-net) in thyroid gland segmentation on CT localization images for radiotherapy. Methods: CT localization images for radiotherapy of 80 patients with breast cancer or head and neck tumors were selected; label images were manually delineated by experienced radiologists. The data set was randomly divided into a training set (n=60), a validation set (n=10), and a test set (n=10). Data expansion was performed in the training set, and the performance of the MSFA-U-net model was evaluated using the Dice similarity coefficient (DSC), Jaccard similarity coefficient (JSC), positive predictive value (PPV), sensitivity (SE), and Hausdorff distance (HD). Results: With the MSFA-U-net model, the DSC, JSC, PPV, SE, and HD of the segmented thyroid gland in the test set were 0.8967±0.0935, 0.8219±0.1115, 0.9065±0.0940, 0.8979±0.1104, and 2.3922±0.5423, respectively. Compared with U-net, HR-net, and Attention U-net, MSFA-U-net increased DSC by 0.052, 0.0376, and 0.0346, respectively; increased JSC by 0.0569, 0.0805, and 0.0433, respectively; increased SE by 0.0361, 0.1091, and 0.0831, respectively; and decreased HD by 0.208, 0.1952, and 0.0548, respectively. The test set images showed that the thyroid edges segmented by the MSFA-U-net model were closer to the standard thyroid delineated by the experts than those segmented by the other three models. Moreover, the edges were smoother, resistance to noise interference was stronger, and oversegmentation and undersegmentation were reduced. Conclusion: The MSFA-U-net model can meet basic clinical requirements and improve the efficiency of physicians' clinical work.
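The overlap indicators named above have standard definitions in terms of true positive (TP), false positive (FP) and false negative (FN) pixel counts. The sketch below computes DSC, JSC, PPV and SE for two flattened binary masks; the Hausdorff distance is omitted because it needs boundary coordinates. The function name and the toy masks are hypothetical.

```python
def overlap_metrics(pred, truth):
    """pred and truth are equal-length flat sequences of 0/1 mask pixels.

    Assumes both masks contain at least one foreground pixel, so that
    no denominator is zero.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),   # Dice similarity coefficient
        "JSC": tp / (tp + fp + fn),           # Jaccard similarity coefficient
        "PPV": tp / (tp + fp),                # positive predictive value
        "SE": tp / (tp + fn),                 # sensitivity
    }


# Toy 6-pixel masks, for illustration only.
pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
metrics = overlap_metrics(pred, truth)
print(metrics)
```

For any pair of masks the two overlap scores are linked by DSC = 2·JSC / (1 + JSC), so they always rank segmentations in the same order.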


In this paper, the authors present an effort to increase the applicability domain (AD) by retraining models using a database of 701 highly dissimilar molecules with anti-tyrosinase activity and 728 drugs with other uses. Atom-based linear indices and best-subset linear discriminant analysis (LDA) were used to develop individual classification models. Eighteen individual classification-based QSAR models for tyrosinase inhibitory activity were obtained, with global accuracy ranging from 88.15% to 91.60% in the training set and Matthews correlation coefficients (C) ranging from 0.76 to 0.82. In the external validation set, global classification accuracy was above 85.99% and C was above 0.72. All individual models were validated and fulfilled the OECD principles. A brief analysis of the AD for the training set of 478 compounds and the new active compounds included in the retraining was carried out. Various assembled multiclassifier systems containing the eighteen models were obtained using different selection criteria, which makes it possible to select the best strategy for a particular problem. The assembled multiclassifier systems were also used to estimate the potency of the compounds identified as active, using eighteen potency models validated according to the OECD principles.
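The abstract reports global accuracy alongside the Matthews correlation coefficient (C). As a minimal sketch of how the two relate for a binary active/inactive classifier, the standard formulas are shown below; the counts are hypothetical and are not taken from the paper.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally reported as 0 when any marginal total is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all compounds classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for 100 actives and 100 inactives.
tp, tn, fp, fn = 90, 85, 15, 10
mcc = matthews_cc(tp, tn, fp, fn)
acc = accuracy(tp, tn, fp, fn)
print(f"accuracy = {acc:.3f}, C = {mcc:.3f}")
```

C ranges from -1 to 1 and equals 0 for chance-level prediction, so it stays informative when the active and inactive classes differ in size, whereas accuracy alone can look high for a classifier that favors the majority class.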


2019 ◽  
Vol 31 (5) ◽  
pp. 665-673 ◽  
Author(s):  
Maud Menard ◽  
Alexis Lecoindre ◽  
Jean-Luc Cadoré ◽  
Michèle Chevallier ◽  
Aurélie Pagnon ◽  
...  

Accurate staging of hepatic fibrosis (HF) is important for treatment and prognosis of canine chronic hepatitis. HF scores are used in human medicine to indirectly stage and monitor HF, decreasing the need for liver biopsy. We developed a canine HF score to screen for moderate or greater HF. We included 96 dogs in our study, including 5 healthy dogs. A liver biopsy for histologic examination and a biochemistry profile were performed on all dogs. The dogs were randomly split into a training set of 58 dogs and a validation set of 38 dogs. A HF score that included alanine aminotransferase, alkaline phosphatase, total bilirubin, potassium, and gamma-glutamyl transferase was developed in the training set. Model performance was confirmed using the internal validation set, and was similar to the performance in the training set. The overall sensitivity and specificity for the study group were 80% and 70% respectively, with an area under the curve of 0.80 (0.71–0.90). This HF score could be used for indirect diagnosis of canine HF when biochemistry panels are performed on the Konelab 30i (Thermo Scientific), using reagents as in our study. External validation is required to determine if the score is sufficiently robust to utilize biochemical results measured in other laboratories with different instruments and methodologies.


Cells ◽  
2019 ◽  
Vol 8 (10) ◽  
pp. 1286 ◽  
Author(s):  
Onat Kadioglu ◽  
Thomas Efferth

P-glycoprotein (P-gp) is an important determinant of multidrug resistance (MDR) because its overexpression is associated with increased efflux of various established chemotherapy drugs in many clinically resistant and refractory tumors. This leads to insufficient therapeutic targeting of tumor populations, representing a major drawback of cancer chemotherapy. Therefore, P-gp is a target for pharmacological inhibitors to overcome MDR. In the present study, we utilized machine learning strategies to establish a model for P-gp modulators to predict whether a given compound would behave as a substrate or an inhibitor of P-gp. Random forest feature selection algorithm-based leave-one-out random sampling was used. Testing the model with an external validation set revealed high performance scores. A P-gp modulator list of compounds from the ChEMBL database was used to test the performance, and predictions from both substrate and inhibitor classes were selected for the last step of validation with molecular docking. Predicted substrates revealed docking poses similar to that of doxorubicin, and predicted inhibitors revealed docking poses similar to that of the known P-gp inhibitor elacridar, implying the validity of the predictions. We conclude that the machine-learning approach introduced in this investigation may serve as a tool for the rapid detection of P-gp substrates and inhibitors in large chemical libraries.


2017 ◽  
Vol 35 (15_suppl) ◽  
pp. e15575-e15575
Author(s):  
Brice Jabo ◽  
John W. Morgan ◽  
Mayada A. Aljehani ◽  
Matthew J Selleck ◽  
Albert Y. Lin

e15575 Background: Gastric cancer (GC) mortality remains high, with a 5-year survival of 30 percent. For patients with resectable GC, mortality varies depending on both patient and tumor characteristics. The current study sought to develop a web-based prognostic model to assist patients and health care providers in decision making regarding either surgery only or adjuvant chemoradiotherapy (CRT). Methods: California SEER data were used, and records, including demographic, pathologic, and treatment information, were retrieved for 2,583 patients diagnosed with stage IB to III GC and treated with either surgery only or adjuvant CRT from 2006 to 2013. Purposeful selection using a Cox regression model was used to identify important mortality predictors. Additionally, with simple random sampling, 70% of the data were assigned to the training set and the remaining 30% were assigned to the test set. A generalized boosted classification model was then trained using the training set and validated using the test set. Area under the curve (AUC) of the receiver operating characteristic (ROC), sensitivity, specificity and accuracy were determined for 5- and 10-year mortality. Results: The median survival was 33 months for patients in the training set and 32 months for the test set. Predictors included in the model were age, ethnicity (Asian/other, Hispanic, non-Hispanic black and non-Hispanic white), T-stage, histology (intestinal, diffuse and other), presence of signet ring (yes/no), proximal location (yes/no), lymph node ratio, and CRT following surgery (yes/no). Validation of the model on the test set showed AUC, sensitivity, specificity and accuracy of 0.78 (95% CI = 0.75, 0.82), 0.75, 0.65 and 0.70 for 5-year survival and 0.77 (95% CI = 0.74, 0.80), 0.79, 0.55 and 0.70 for 10-year survival.
Conclusions: The proposed web-based prognostic tool, using readily available patient and tumor characteristics, provides validated and personalized prognostic information to aid clinicians and patients in the GC adjuvant treatment decision process. [Table: see text]
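The four test-set measures reported above can be computed from predicted risk scores and observed outcomes. The sketch below is illustrative only: it uses the pairwise (Mann-Whitney) form of the ROC AUC and a fixed 0.5 decision threshold, both of which are assumptions, and the scores and labels are made up.

```python
def roc_auc(scores, labels):
    """Probability that a random positive case scores above a random negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0   # ties count half
    return wins / (len(pos) * len(neg))


def threshold_metrics(scores, labels, threshold=0.5):
    """Sensitivity, specificity and accuracy at a fixed score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(labels)


# Hypothetical predicted mortality risks and outcomes (1 = died within the horizon).
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
sens, spec, acc = threshold_metrics(scores, labels)
print(f"AUC = {auc:.3f}, SE = {sens:.3f}, SP = {spec:.3f}, accuracy = {acc:.3f}")
```

AUC is independent of the threshold, while sensitivity and specificity trade off against each other as the threshold moves, which is one way a model can keep a similar AUC at two horizons yet show a different sensitivity/specificity balance.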


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e15718-e15718
Author(s):  
Shuichi Mitsunaga ◽  
Shogo Nomura ◽  
Kazuo Hara ◽  
Yukiko Takayama ◽  
Makoto Ueno ◽  
...  

e15718 Background: The diagnostic value of serum microRNAs (miRNA) in a highly sensitive microarray for pancreatobiliary cancer (PBca) has been demonstrated. This study attempted to build and validate a signature comprising multiple serum miRNA markers for discriminating PBca from healthy controls. Methods: A multicenter prospective study on the diagnostic performance of serum miRNAs was conducted. Patients (pts) with treatment-naïve PBca and healthy participants aged ≥60 years were enrolled. Clinical data and sera were collected. The target population was randomly divided into training and validation cohorts at an allocation ratio of 2:1. Twenty-nine serum miRNA markers from the microarray data were analyzed. For every combination of the markers, Fisher’s linear discriminant analysis was performed, and the resulting sensitivity, specificity and AUC of the ROC curve for discriminating PBca from healthy controls were calculated. Marker combinations with a sensitivity/specificity (SN/SP) of ≥80%/90% and a high AUC in comparison with the AUC of CA19-9 were defined as diagnostic miRNA signatures and were selected in the training cohort. Next, the signatures that showed good reproducibility in the validation cohort were retained. As an independent external cohort, PBca pts and healthy participants with pooled frozen sera were enrolled, and the identified miRNA signatures were further validated. Results: A total of 546 participants (80 healthy and 223 PBca in the training set, 40 healthy and 104 PBca in the validation set, 49 healthy and 50 PBca in the external validation set) were analyzed in this study. Four serum miRNA combinations were identified as diagnostic miRNA signatures. In the training set, four miRNA signatures, consisting of 10 miRNAs, were developed. For the best-performing miRNA signature, the SN/SP and AUC in the validation and external validation cohorts were 84/90% and 0.95 (CA19-9: 73/95% and 0.88) and 84/90% and 0.93 (CA19-9: 80/94% and 0.87), respectively. Conclusions: The diagnostic serum miRNA signatures for PBca were identified in this study.


2020 ◽  
Author(s):  
Ruyi Zhang ◽  
Mei Xu ◽  
Xiangxiang Liu ◽  
Miao Wang ◽  
Qiang Jia ◽  
...  

Abstract Objectives: To develop a clinically predictive nomogram model that can maximize patients’ net benefit in predicting the prognosis of patients with thyroid carcinoma, based on the 8th edition of the AJCC Cancer Staging Manual. Methods: We selected 134,962 thyroid carcinoma patients diagnosed between 2004 and 2015 from the SEER database with details of the 8th edition of the AJCC Cancer Staging Manual and separated those patients into two datasets randomly. The first dataset, the training set, was used to build the nomogram model, accounting for 80% (94,474 cases), and the second dataset, the validation set, was used for external validation, accounting for 20% (40,488 cases). We then evaluated the model's clinical usefulness by decision curve analysis (DCA) and its accuracy by calculating the AUC, the C-index and a calibration plot. Results: Decision curve analysis showed that the final prediction model could maximize patients’ net benefit. In the training set and validation set, Harrell’s concordance indexes were 0.9450 and 0.9421, respectively. Sensitivity and specificity at the three predicted time points (12 months, 36 months and 60 months) were above 0.80 in both datasets, except for the sensitivity at the 60-month time point in the validation set, which was 0.7662. The AUCs at the three predicted time points were 0.9562, 0.9273 and 0.9009, respectively, for the training set and 0.9645, 0.9329 and 0.8894, respectively, for the validation set. The calibration plot also showed that the nomogram model was well calibrated. Conclusion: The final nomogram model provided both excellent accuracy and clinical usefulness and should be able to predict patients’ survival probability visually and accurately.
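Harrell’s concordance index, reported above for both datasets, measures how often the patient who dies earlier was also assigned the higher predicted risk. The following is a minimal quadratic-time sketch with made-up follow-up data; it counts tied risk scores as half-concordant and ignores tied event times, which is one common convention and an assumption here.

```python
def harrell_c(times, events, risk):
    """times: follow-up time; events: 1 = death observed, 0 = censored;
    risk: predicted risk score, higher meaning worse prognosis."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical follow-up data for four patients (months).
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = harrell_c(times, events, risk)
print(f"C-index = {c_index:.2f}")
```

In this toy example five pairs are comparable and four of them are ordered correctly, giving 0.80; a value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.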


Author(s):  
Ade Nurhopipah ◽  
Uswatun Hasanah

The performance of classification models in machine learning is influenced by many factors, one of which is the dataset splitting method. To avoid overfitting, it is important to apply a suitable dataset splitting strategy. This study presents a comparison of four dataset splitting techniques, namely Random Sub-sampling Validation (RSV), k-Fold Cross Validation (k-FCV), Bootstrap Validation (BV) and Moralis Lima Martin Validation (MLMV). The comparison is carried out for face classification on CCTV images using a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM), and is applied to two image datasets. The results are reviewed using model accuracy on the training set, validation set and test set, as well as the bias and variance of the model. The experiment shows that the k-FCV technique has more stable performance and provides high accuracy on the training set as well as good generalization on the validation set and test set. Meanwhile, data splitting using the MLMV technique performs worse than the other three techniques since it yields lower accuracy. This technique also shows higher bias and variance values and builds overfitting models, especially when it is applied to the validation set.
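Of the four techniques compared above, k-fold cross-validation is the simplest to sketch: the samples are shuffled once, divided into k folds, and each fold serves once as the validation set while the remaining folds form the training set. The helper below is a hypothetical stdlib-only illustration, not the code used in the study.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]     # k roughly equal-sized folds
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val


# Ten samples, five folds: every sample is validated exactly once.
splits = list(k_fold_indices(10, k=5))
for fold, (train, val) in enumerate(splits):
    print(fold, sorted(val))
```

Averaging a metric over the k validation folds uses every sample for both training and validation, which is consistent with the more stable performance the abstract attributes to k-FCV.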

