Multiclass Classifier for P-Glycoprotein Substrates, Inhibitors, and Non-Active Compounds

P-glycoprotein (P-gp) is a transmembrane protein that actively transports a wide variety of chemically diverse compounds out of the cell. It strongly influences the ADMET (absorption, distribution, metabolism, excretion and toxicity) properties of drugs and drug candidates and helps decrease toxicity by eliminating compounds from cells, thereby preventing intracellular accumulation. In drug discovery and toxicological assessment it is therefore advisable to check whether a compound under development could be transported by P-gp. In this study, an in silico multiclass classification model that predicts the probability that a compound interacts with P-gp was developed using a counter-propagation artificial neural network (CP ANN), a set of 2D molecular descriptors, and an extensive dataset of 2512 compounds (1178 P-gp inhibitors, 477 P-gp substrates and 857 P-gp non-active compounds). The model showed good classification performance, with non-error rate (NER) values of 0.93 for the training set and 0.85 for the test set, and average precision (AvPr) values of 0.93 for the training set and 0.87 for the test set. An external validation set of 385 compounds was used to challenge the model’s performance; on this set, both NER and AvPr were 0.70. We believe that this in silico classifier could be used effectively as a reliable virtual screening tool for identifying potential P-gp ligands.
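To make the two reported indices concrete, the sketch below computes them from a multiclass confusion matrix. It assumes the common chemometrics definitions (NER as the mean of the per-class sensitivities, AvPr as the mean of the per-class precisions); the abstract does not spell these out, and the function name `class_metrics` and the toy matrix are hypothetical.

```python
def class_metrics(confusion):
    """Return (NER, AvPr) for a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    Assumed definitions: NER = mean per-class sensitivity,
    AvPr = mean per-class precision.
    """
    n = len(confusion)
    sensitivities = []
    precisions = []
    for k in range(n):
        true_k = sum(confusion[k])                       # row total: samples truly in class k
        pred_k = sum(confusion[i][k] for i in range(n))  # column total: samples predicted as k
        hit = confusion[k][k]
        sensitivities.append(hit / true_k if true_k else 0.0)
        precisions.append(hit / pred_k if pred_k else 0.0)
    return sum(sensitivities) / n, sum(precisions) / n


# Hypothetical 3-class example (rows/columns: inhibitor, substrate, non-active).
toy = [[8, 1, 1],
       [2, 6, 2],
       [0, 0, 10]]
ner, avpr = class_metrics(toy)
print(round(ner, 3), round(avpr, 3))
```

Because both indices average over classes rather than over samples, a small class such as the 477 substrates weighs as much as the larger inhibitor class.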

A 13-immune gene set signature for prediction of colon cancer prognosis

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207323666200930104744 ◽

2020 ◽

Vol 23 ◽

Author(s):

Zhuo Lu ◽

Jin Chen ◽

JiongYi Yan ◽

QiaoMing Liu ◽

Fang Li ◽

...

Keyword(s):

Colon Cancer ◽

External Validation ◽

Therapeutic Targets ◽

The Cancer Genome Atlas ◽

Cancer Prognosis ◽

Immune Gene ◽

Training Set ◽

Test Set ◽

Gene Set ◽

Validation Set

Background: Colon cancer is one of the most common cancers worldwide and has a poor prognosis. Through analysis of transcriptome and clinical data of colon cancer, an immune gene-set signature was identified by single-sample gene set enrichment analysis (ssGSEA) scoring to predict patient survival and discover new therapeutic targets. Objective: To study the role of the immune gene-set signature in colon cancer. Methods: First, RNA-Seq data and clinical follow-up information were downloaded from The Cancer Genome Atlas (TCGA). Immune-related gene sets were collected from the ImmPort database. Genes and immunological pathways related to prognosis were screened in the training set and integrated for feature selection using random forest. The immune gene-related prognosis model was verified in the entire TCGA test set and a GEO validation set and compared with immune cell scores and matrix score. Results: 1650 prognostic genes and 13 immunological pathways were identified. These genes and pathways are closely related to the development of tumors. A 13-immune gene-set signature was established, which is an independent prognostic factor for patients with colon cancer. Risk stratification of samples could be carried out in the training set, test set and external validation set. The AUC of five-year survival in the training set and validation set is greater than 0.6. Immunosuppression occurs in high-risk samples. Compared with published models, the Riskscore has a better prediction effect. Conclusion: This study constructed a 13-immune gene-set signature as a new prognostic marker to predict the survival of patients with colon cancer, and provided new diagnostic/prognostic biomarkers and therapeutic targets for colon cancer.

Feature-Weighted Sampling for Proper Evaluation of Classification Models

Applied Sciences ◽

10.3390/app11052039 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2039

Author(s):

Hyunseok Shin ◽

Sejong Oh

Keyword(s):

Random Sampling ◽

Sampling Method ◽

Classification Model ◽

Training Set ◽

Test Set ◽

Feature Importance ◽

Proper Training ◽

Machine Learning Applications ◽

Test Sets ◽

The Given

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Random sampling is traditionally used to divide datasets. The problem, however, is that the measured performance of the model depends on how the training and test sets are divided. Therefore, in this study, we propose an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate train/test splits using the R-value-based sampling method. We then evaluated how closely the distribution of each candidate matched that of the whole dataset, and the candidate with the smallest distribution difference was selected as the final train/test split. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more suitable training and test sets than previous sampling methods, including random and non-random sampling.
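The selection step described in this abstract can be sketched roughly as follows. This is only an illustration under simplifying assumptions: candidates are produced by plain random shuffles rather than the R-value-based sampling the authors used, the distribution difference is an unweighted sum of per-feature histogram distances (no feature-importance weighting), and all function names are hypothetical.

```python
import random


def histogram(values, lo, hi, bins):
    """Normalised histogram of values over [lo, hi] with a fixed number of bins."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def distribution_gap(subset, full, bins=5):
    """Sum over features of the L1 distance between subset and full-data histograms."""
    gap = 0.0
    for f in range(len(full[0])):
        col_full = [row[f] for row in full]
        col_sub = [row[f] for row in subset]
        lo, hi = min(col_full), max(col_full)
        h_full = histogram(col_full, lo, hi, bins)
        h_sub = histogram(col_sub, lo, hi, bins)
        gap += sum(abs(a - b) for a, b in zip(h_full, h_sub))
    return gap


def best_split(data, test_fraction=0.3, candidates=50, seed=0):
    """Try several random splits and keep the one closest to the whole dataset."""
    rng = random.Random(seed)
    n_test = max(1, int(len(data) * test_fraction))
    best = None
    for _ in range(candidates):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        score = distribution_gap(train, data) + distribution_gap(test, data)
        if best is None or score < best[0]:
            best = (score, sorted(train_idx), sorted(test_idx))
    return best


# Hypothetical data: 40 samples with two numeric features.
gen = random.Random(1)
data = [[gen.gauss(0, 1), gen.uniform(0, 10)] for _ in range(40)]
score, train_idx, test_idx = best_split(data)
```

With a fixed seed, increasing `candidates` can only lower (or keep) the best score, which is the motivation for generating many candidates before choosing one.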

Application of Multi-Scale Fusion Attention U-Net to Segment the Thyroid Gland on CT Localization Images for Radiotherapy

10.21203/rs.3.rs-949323/v1 ◽

2021 ◽

Author(s):

Xiaobo Wen ◽

Biao Zhao ◽

Meifang Yuan ◽

Jinzhi Li ◽

Mengzhen Sun ◽

...

Keyword(s):

Thyroid Gland ◽

Clinical Work ◽

Similarity Coefficient ◽

Dice Similarity Coefficient ◽

Training Set ◽

Data Set ◽

Test Set ◽

Noise Interference ◽

Multi Scale ◽

Validation Set

Abstract Objectives: To explore the performance of Multi-scale Fusion Attention U-net (MSFA-U-net) in thyroid gland segmentation on CT localization images for radiotherapy. Methods: CT localization images for radiotherapy of 80 patients with breast cancer or head and neck tumors were selected; label images were manually delineated by experienced radiologists. The data set was randomly divided into the training set (n=60), the validation set (n=10), and the test set (n=10). Data expansion was performed in the training set, and the performance of the MSFA-U-net model was evaluated using the Dice similarity coefficient (DSC), Jaccard similarity coefficient (JSC), positive predictive value (PPV), sensitivity (SE), and Hausdorff distance (HD). Results: With the MSFA-U-net model, the DSC, JSC, PPV, SE, and HD of the segmented thyroid gland in the test set were 0.8967±0.0935, 0.8219±0.1115, 0.9065±0.0940, 0.8979±0.1104, and 2.3922±0.5423, respectively. Compared with U-net, HR-net, and Attention U-net, MSFA-U-net increased DSC by 0.052, 0.0376, and 0.0346, respectively; increased JSC by 0.0569, 0.0805, and 0.0433, respectively; increased SE by 0.0361, 0.1091, and 0.0831, respectively; and decreased HD by 0.208, 0.1952, and 0.0548, respectively. The test set image results showed that the thyroid edges segmented by the MSFA-U-net model were closer to the standard thyroid delineated by the experts than those segmented by the other three models. Moreover, the edges were smoother, resistance to noise interference was stronger, and oversegmentation and undersegmentation were reduced. Conclusion: The MSFA-U-net model can meet basic clinical requirements and improve the efficiency of physicians' clinical work.
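For readers unfamiliar with the overlap indices named above, the following minimal sketch computes DSC, JSC, PPV and SE from two binary masks using their standard definitions. The flat 0/1 lists and the function name `overlap_metrics` are illustrative assumptions, both masks are assumed to contain at least one foreground pixel, and Hausdorff distance is left out because it needs pixel coordinates.

```python
def overlap_metrics(pred, truth):
    """Overlap metrics for two equal-length flat sequences of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)      # predicted and true foreground
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)  # predicted foreground only
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)  # missed foreground
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),  # Dice similarity coefficient
        "JSC": tp / (tp + fp + fn),          # Jaccard similarity coefficient
        "PPV": tp / (tp + fp),               # positive predictive value
        "SE": tp / (tp + fn),                # sensitivity
    }


# Hypothetical 8-pixel masks: 3 overlapping pixels, 1 false positive, 1 false negative.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
m = overlap_metrics(pred, truth)
print(m)  # DSC 0.75, JSC 0.6, PPV 0.75, SE 0.75
```

For a single pair of masks the two overlap indices are tied by JSC = DSC / (2 − DSC); averages over several images, as reported in the abstract, need not satisfy this exactly.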

Retrained Classification of Tyrosinase Inhibitors and “In Silico” Potency Estimation by Using Atom-Type Linear Indices

Methodologies and Applications for Chemoinformatics and Chemical Engineering ◽

10.4018/978-1-4666-4010-8.ch021 ◽

2013 ◽

pp. 322-427

Keyword(s):

External Validation ◽

Correlation Coefficients ◽

Classification Models ◽

Training Set ◽

Linear Discriminant ◽

Oecd Principles ◽

Qsar Models ◽

Validation Set ◽

Global Accuracy

In this paper, the authors present an effort to widen the applicability domain (AD) by retraining models on a database of 701 highly dissimilar molecules with anti-tyrosinase activity and 728 drugs with other uses. Atom-based linear indices and best-subset linear discriminant analysis (LDA) were used to develop individual classification models. Eighteen individual classification-based QSAR models for tyrosinase inhibitory activity were obtained, with global accuracy ranging from 88.15% to 91.60% in the training set and Matthews correlation coefficients (C) ranging from 0.76 to 0.82. In the external validation set, global classification accuracy was above 85.99% and C was above 0.72. All individual models were validated and fulfilled the OECD principles. A brief analysis of the AD for the training set of 478 compounds and the new active compounds included in the retraining was carried out. Various multiclassifier systems assembled from the eighteen models using different selection criteria were obtained, making it possible to select the best strategy for a particular problem. These multiclassifier systems were also used to estimate the potency of the compounds identified as active, using eighteen potency models validated according to the OECD principles.
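The two figures of merit quoted in this abstract, global accuracy and the Matthews correlation coefficient, can be computed from a binary confusion table as in the sketch below. The formulas are the standard ones; the counts in the example are hypothetical and are not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all compounds classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when a marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 100 actives and 100 inactives.
acc = accuracy(tp=90, tn=85, fp=15, fn=10)
mcc = matthews_cc(tp=90, tn=85, fp=15, fn=10)
print(acc, round(mcc, 3))  # 0.875 0.751
```

Unlike accuracy, the coefficient stays near zero for a classifier that simply predicts the majority class, which is why it is often reported alongside accuracy for active/inactive datasets.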

Use of serum biomarkers in staging of canine hepatic fibrosis

Journal of Veterinary Diagnostic Investigation ◽

10.1177/1040638719866881 ◽

2019 ◽

Vol 31 (5) ◽

pp. 665-673 ◽

Cited By ~ 1

Author(s):

Maud Menard ◽

Alexis Lecoindre ◽

Jean-Luc Cadoré ◽

Michèle Chevallier ◽

Aurélie Pagnon ◽

...

Keyword(s):

Liver Biopsy ◽

Hepatic Fibrosis ◽

External Validation ◽

Model Performance ◽

Area Under The Curve ◽

Gamma Glutamyl Transferase ◽

Training Set ◽

Internal Validation ◽

Validation Set ◽

Glutamyl Transferase

Accurate staging of hepatic fibrosis (HF) is important for treatment and prognosis of canine chronic hepatitis. HF scores are used in human medicine to indirectly stage and monitor HF, decreasing the need for liver biopsy. We developed a canine HF score to screen for moderate or greater HF. We included 96 dogs in our study, including 5 healthy dogs. A liver biopsy for histologic examination and a biochemistry profile were performed on all dogs. The dogs were randomly split into a training set of 58 dogs and a validation set of 38 dogs. An HF score that included alanine aminotransferase, alkaline phosphatase, total bilirubin, potassium, and gamma-glutamyl transferase was developed in the training set. Model performance was confirmed using the internal validation set and was similar to the performance in the training set. The overall sensitivity and specificity for the study group were 80% and 70%, respectively, with an area under the curve of 0.80 (0.71–0.90). This HF score could be used for indirect diagnosis of canine HF when biochemistry panels are performed on the Konelab 30i (Thermo Scientific), using reagents as in our study. External validation is required to determine whether the score is sufficiently robust to be used with biochemical results measured in other laboratories with different instruments and methodologies.

A Machine Learning-Based Prediction Platform for P-Glycoprotein Modulators and Its Validation by Molecular Docking

Cells ◽

10.3390/cells8101286 ◽

2019 ◽

Vol 8 (10) ◽

pp. 1286 ◽

Cited By ~ 1

Author(s):

Onat Kadioglu ◽

Thomas Efferth

Keyword(s):

Machine Learning ◽

Molecular Docking ◽

Learning Strategies ◽

High Performance ◽

External Validation ◽

Major Drawback ◽

Chemotherapy Drugs ◽

P Glycoprotein ◽

Validation Set ◽

Leave One Out

P-glycoprotein (P-gp) is an important determinant of multidrug resistance (MDR) because its overexpression is associated with increased efflux of various established chemotherapy drugs in many clinically resistant and refractory tumors. This leads to insufficient therapeutic targeting of tumor populations, representing a major drawback of cancer chemotherapy. Therefore, P-gp is a target for pharmacological inhibitors to overcome MDR. In the present study, we utilized machine learning strategies to establish a model for P-gp modulators to predict whether a given compound would behave as a substrate or an inhibitor of P-gp. Random forest feature selection algorithm-based leave-one-out random sampling was used. Testing the model with an external validation set revealed high performance scores. A P-gp modulator list of compounds from the ChEMBL database was used to test the performance, and predictions from both the substrate and inhibitor classes were selected for the last step of validation with molecular docking. Predicted substrates showed docking poses similar to that of doxorubicin, and predicted inhibitors showed docking poses similar to that of the known P-gp inhibitor elacridar, supporting the validity of the predictions. We conclude that the machine-learning approach introduced in this investigation may serve as a tool for the rapid detection of P-gp substrates and inhibitors in large chemical libraries.

A personalized, web-based prognostic tool for resectable gastric cancer.

Journal of Clinical Oncology ◽

10.1200/jco.2017.35.15_suppl.e15575 ◽

2017 ◽

Vol 35 (15_suppl) ◽

pp. e15575-e15575

Author(s):

Brice Jabo ◽

John W. Morgan ◽

Mayada A. Aljehani ◽

Matthew J Selleck ◽

Albert Y. Lin

Keyword(s):

Gastric Cancer ◽

Health Care Providers ◽

Classification Model ◽

Adjuvant Chemoradiotherapy ◽

Care Providers ◽

Training Set ◽

Test Set ◽

Web Based ◽

Prognostic Tool ◽

Sensitivity Specificity

e15575 Background: Gastric cancer (GC) mortality remains high, with a 5-year survival of 30 percent. For patients with resectable GC, mortality varies depending on both patient and tumor characteristics. The current study sought to develop a web-based prognostic model to assist patients and health care providers in deciding between surgery only and adjuvant chemoradiotherapy (CRT). Methods: California SEER data were used; records, including demographic, pathologic, and treatment information, were retrieved for 2,583 patients diagnosed with stage IB to III GC and treated with either surgery only or adjuvant CRT from 2006 to 2013. Purposeful selection using a Cox regression model was used to identify important mortality predictors. With simple random sampling, 70% of the data were assigned to the training set and the remaining 30% to the test set. A generalized boosted classification model was trained using the training set and validated using the test set. Area under the curve (AUC) of the receiver operating characteristic (ROC), sensitivity, specificity and accuracy were determined for 5- and 10-year mortality. Results: The median survival was 33 months for patients in the training set and 32 months for those in the test set. Predictors included in the model were age, ethnicity (Asian/other, Hispanic, non-Hispanic black and non-Hispanic white), T-stage, histology (intestinal, diffuse and other), presence of signet ring (yes/no), proximal location (yes/no), lymph node ratio, and CRT following surgery (yes/no). Validation of the model on the test set gave AUC, sensitivity, specificity and accuracy of 0.78 (95% CI = 0.75, 0.82), 0.75, 0.65 and 0.70 for 5-year survival and 0.77 (95% CI = 0.74, 0.80), 0.79, 0.55 and 0.70 for 10-year survival. Conclusions: The proposed web-based prognostic tool, which uses readily available patient and tumor characteristics, provides validated and personalized prognostic information to aid clinicians and patients in the GC adjuvant treatment decision process. [Table: see text]

Serum microRNA signatures for the detection of pancreatobiliary cancer.

Journal of Clinical Oncology ◽

10.1200/jco.2019.37.15_suppl.e15718 ◽

2019 ◽

Vol 37 (15_suppl) ◽

pp. e15718-e15718

Author(s):

Shuichi Mitsunaga ◽

Shogo Nomura ◽

Kazuo Hara ◽

Yukiko Takayama ◽

Makoto Ueno ◽

...

Keyword(s):

Validation Cohort ◽

External Validation ◽

Diagnostic Value ◽

Healthy Controls ◽

Mirna Signature ◽

Training Set ◽

Linear Discriminant ◽

Serum Mirna ◽

Validation Set ◽

Sensitivity Specificity

e15718 Background: The diagnostic value of serum microRNAs (miRNA) in a highly sensitive microarray for pancreatobiliary cancer (PBca) has been demonstrated. This study attempted to build and validate a signature composed of multiple serum miRNA markers for discriminating PBca from healthy controls. Methods: A multicenter prospective study on the diagnostic performance of serum miRNAs was conducted. Patients (pts) with treatment-naïve PBca and healthy participants aged ≥60 years were enrolled. Clinical data and sera were collected. The target population was randomly divided into a training cohort and a validation cohort with an allocation ratio of 2:1. Twenty-nine serum miRNA markers from the microarray data were analyzed. Using any combination of the markers, Fisher’s linear discriminant analysis was performed, and the resulting sensitivity, specificity and AUC of the ROC curve for discriminating PBca from healthy controls were calculated for each combination. Marker combinations with a sensitivity/specificity (SN/SP) of ≥80%/90% and a high AUC in comparison with the AUC of CA19-9 were defined as diagnostic miRNA signatures and were selected in the training cohort. Next, the signatures were screened for good reproducibility in the validation cohort. As an independent external cohort, PBca pts and healthy participants with pooled frozen sera were enrolled, and the identified miRNA signatures were further validated. Results: A total of 546 participants (80 healthy and 223 PBca in the training set, 40 healthy and 104 PBca in the validation set, 49 healthy and 50 PBca in the external validation set) were analyzed in this study. Four serum miRNA combinations were identified as diagnostic miRNA signatures. In the training set, four miRNA signatures, consisting of 10 miRNAs, were developed. For the best-performing miRNA signature, the SN/SP and AUC in the validation and external validation cohorts were 84/90% and 0.95 (CA19-9: 73/95% and 0.88) and 84/90% and 0.93 (CA19-9: 80/94% and 0.87), respectively. Conclusions: The diagnostic serum miRNA signatures for PBca were identified in this study.

Establishment and Validation of a Clinically Predictive Nomogram Model for Thyroid Carcinoma Patients

10.21203/rs.3.rs-123528/v1 ◽

2020 ◽

Author(s):

Ruyi Zhang ◽

Mei Xu ◽

Xiangxiang Liu ◽

Miao Wang ◽

Qiang Jia ◽

...

Keyword(s):

Thyroid Carcinoma ◽

External Validation ◽

Cancer Staging ◽

Curve Analysis ◽

Calibration Plot ◽

Net Benefit ◽

Training Set ◽

Decision Curve Analysis ◽

Validation Set ◽

Good Calibration

Abstract Objectives: To develop a clinically predictive nomogram model that can maximize patients’ net benefit in predicting the prognosis of patients with thyroid carcinoma, based on the 8th edition of the AJCC Cancer Staging Manual. Methods: We selected 134,962 thyroid carcinoma patients diagnosed between 2004 and 2015 from the SEER database with details of the 8th edition of the AJCC Cancer Staging Manual and randomly separated those patients into two datasets. The first dataset, the training set, was used to build the nomogram model and accounted for 80% (94,474 cases); the second dataset, the validation set, was used for external validation and accounted for 20% (40,488 cases). We then evaluated the model's clinical utility by decision curve analysis (DCA) and its accuracy by calculating the AUC and C-index and by using a calibration plot. Results: Decision curve analysis showed that the final prediction model could maximize patients’ net benefit. In the training set and validation set, Harrell’s concordance indexes were 0.9450 and 0.9421, respectively. Sensitivity and specificity at the three predicted time points (12 months, 36 months and 60 months) were all above 0.80 in both datasets, except for sensitivity at the 60-month time point in the validation set, which was 0.7662. AUCs at the three predicted time points were 0.9562, 0.9273 and 0.9009, respectively, for the training set and 0.9645, 0.9329 and 0.8894, respectively, for the validation set. The calibration plot also showed that the nomogram model was well calibrated. Conclusion: The final nomogram model provided both excellent accuracy and clinical utility and should be able to predict patients’ survival probability visually and accurately.
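As a rough illustration of the concordance index reported above, the sketch below counts concordant pairs in right-censored survival data. It is a simplified version of Harrell's C: pairs with tied survival times are ignored and ties in predicted risk count as half. The function name and the four-patient example are hypothetical.

```python
def harrell_c(times, events, risks):
    """Simplified Harrell's concordance index.

    times  : observed follow-up time per patient
    events : 1 if the event (death) was observed, 0 if censored
    risks  : predicted risk score, higher meaning worse expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had the event strictly before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event correctly given higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tie in predicted risk
    return concordant / comparable


# Hypothetical cohort: the third patient is censored at 12 months.
c_index = harrell_c(times=[5, 8, 12, 20],
                    events=[1, 1, 0, 1],
                    risks=[0.9, 0.4, 0.6, 0.1])
print(c_index)  # 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk, which puts the reported values of about 0.94 in context.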

Dataset Splitting Techniques Comparison For Face Classification on CCTV Images

IJCCS (Indonesian Journal of Computing and Cybernetics Systems) ◽

10.22146/ijccs.58092 ◽

2020 ◽

Vol 14 (4) ◽

pp. 341

Author(s):

Ade Nurhopipah ◽

Uswatun Hasanah

Keyword(s):

Splitting Method ◽

Machine Learning Algorithms ◽

Support Vector ◽

Training Set ◽

Test Set ◽

Face Classification ◽

Lower Accuracy ◽

Svm Algorithm ◽

Stable Performance ◽

Validation Set

The performance of classification models in machine learning is influenced by many factors, one of which is the dataset splitting method. To avoid overfitting, it is important to apply a suitable dataset splitting strategy. This study presents a comparison of four dataset splitting techniques, namely Random Sub-sampling Validation (RSV), k-Fold Cross Validation (k-FCV), Bootstrap Validation (BV) and Moralis Lima Martin Validation (MLMV). The comparison is carried out for face classification on CCTV images using a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM), on two image datasets. The results are reviewed using model accuracy on the training set, validation set and test set, as well as the bias and variance of the model. The experiment shows that the k-FCV technique has more stable performance and provides high accuracy on the training set as well as good generalization on the validation set and test set. Meanwhile, data splitting using the MLMV technique performs worse than the other three techniques, since it yields lower accuracy. This technique also shows higher bias and variance values and produces overfitted models, especially when it is applied to the validation set.
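Of the four techniques compared, k-fold cross-validation is the simplest to sketch. The generator below is a minimal, library-free illustration of how the index sets are formed; the function name `k_fold_indices` and the sample counts are hypothetical, and it is not the code used in the study.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # shuffle once so folds are random but reproducible
    folds = [idx[i::k] for i in range(k)]     # k nearly equal, non-overlapping folds
    for i in range(k):
        val = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val


# Every sample is used for validation exactly once across the 5 folds.
splits = list(k_fold_indices(n_samples=10, k=5))
for train, val in splits:
    print(sorted(val), len(train))
```

Because each sample contributes to validation exactly once, the averaged score depends less on any single split than a one-off random sub-sampling does, which is consistent with the more stable performance the abstract reports for k-FCV.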
