The Empirical Comparison of Machine Learning Algorithm for the Class Imbalanced Problem in Conformational Epitope Prediction

2021 ◽  
Vol 9 (1) ◽  
pp. 131
Author(s):  
Binti Solihah ◽  
Azhari Azhari ◽  
Aina Musdholifah

A conformational epitope is a part of a protein-based vaccine. It is challenging to identify experimentally, so computational models have been developed to support identification. However, class imbalance is one of the constraints on achieving optimal performance in conformational B-cell epitope prediction. In this paper, we compare several conformational B-cell epitope prediction models built with non-ensemble and ensemble approaches. To build the non-ensemble models, a sampling method (random undersampling, SMOTE, or cluster-based undersampling) is combined with a decision tree or an SVM. A random forest model and several variants of the bagging method are used to construct the ensemble models. The models are validated with 10-fold cross-validation. The experimental results show that the combination of cluster-based undersampling and a decision tree outperformed the other sampling methods when combined with the non-ensemble and the ensemble methods. This study provides a baseline for improving existing models that deal with class imbalance in conformational epitope prediction.
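
As an illustration of the simplest of the three sampling methods named above, the following is a minimal sketch of random undersampling in plain Python. The function name, the toy data, and the 1 = epitope / 0 = non-epitope label coding are assumptions made for this example; it is not the authors' implementation.

```python
import random

def random_undersample(samples, labels, minority_label=1, seed=0):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Keep as many majority samples as there are minority samples.
    kept_majority = rng.sample(majority, k=min(len(minority), len(majority)))
    keep = sorted(minority + kept_majority)
    return [samples[i] for i in keep], [labels[i] for i in keep]

# Toy data (hypothetical): 2 epitope residues (label 1) and 8 non-epitope residues (label 0).
X = [[float(i)] for i in range(10)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

X_bal, y_bal = random_undersample(X, y)
print(y_bal.count(1), y_bal.count(0))
```

In a cross-validated setup, resampling of this kind is normally applied to the training folds only, so that each held-out fold keeps the original class ratio.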

2020 ◽  
Vol 6 ◽  
pp. e275
Author(s):  
Binti Solihah ◽  
Azhari Azhari ◽  
Aina Musdholifah

Background: A conformational B-cell epitope is one of the main components of vaccine design. It contains separate segments in its sequence, which are spatially close in the antigen chain. The availability of Ag-Ab complex data in the Protein Data Bank allows for the development of predictive methods. Several epitope prediction models have been developed, including learning-based methods. However, the performance of these models is still not optimal. The main problem in learning-based prediction models is class imbalance. Methods: This study proposes CluSMOTE, which is a combination of a cluster-based undersampling method and the Synthetic Minority Oversampling Technique. The approach is used to generate additional sample data to ensure that the conformational epitope dataset is balanced. The Hierarchical DBSCAN algorithm is used to identify the clusters in the majority class. Randomly selected data are taken from each cluster, considering the oversampling degree, and combined with the minority class data. The balanced data are used as the training dataset to develop a conformational epitope prediction model. Furthermore, two binary classification methods, Support Vector Machine and Decision Tree, are used separately to develop the prediction model and to evaluate the performance of CluSMOTE in predicting conformational B-cell epitopes. The experiment is focused on determining the best parameters for CluSMOTE. Two independent datasets are used to compare the proposed prediction model with state-of-the-art methods. The first and second datasets represent general protein antigens and glycoprotein antigens, respectively. Results: The experimental results show that CluSMOTE Decision Tree outperformed the Support Vector Machine in terms of AUC and G-mean. The mean AUC values of CluSMOTE Decision Tree on the Kringelum and SEPPA 3 test sets are 0.83 and 0.766, respectively. This shows that CluSMOTE Decision Tree is better than other methods for general protein antigens, though comparable with SEPPA 3 for glycoprotein antigens.
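
The two ingredients described in the Methods, sampling from every majority-class cluster and SMOTE-style interpolation in the minority class, can be sketched as follows. This is a simplified illustration under stated assumptions, not the published CluSMOTE code: the cluster labels are assumed to be precomputed (for example by a hierarchical DBSCAN implementation), the interpolation uses only the single nearest neighbour rather than k neighbours, and all function names and data are hypothetical.

```python
import random

def cluster_undersample(majority, cluster_ids, n_keep, rng):
    """Draw majority samples from every cluster, roughly in proportion to cluster size."""
    clusters = {}
    for point, cid in zip(majority, cluster_ids):
        clusters.setdefault(cid, []).append(point)
    kept = []
    for members in clusters.values():
        # Take at least one sample per cluster so no region of the majority class is lost.
        share = max(1, round(n_keep * len(members) / len(majority)))
        kept.extend(rng.sample(members, min(share, len(members))))
    return kept

def smote_like(minority, n_new, rng):
    """Create synthetic points between a minority point and its nearest minority neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbour = min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        t = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

rng = random.Random(0)
# Hypothetical two-feature data; cluster_ids are assumed to come from a clustering step.
majority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0],
            [5.1, 5.0], [5.0, 5.1], [9.0, 0.0], [9.1, 0.1]]
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
minority = [[2.0, 2.0], [2.5, 2.0], [2.0, 2.5]]

synthetic = smote_like(minority, n_new=1, rng=rng)
# Assumption: the undersampling target follows the oversampled minority size.
kept = cluster_undersample(majority, cluster_ids, n_keep=len(minority) + len(synthetic), rng=rng)
train_X = kept + minority + synthetic
train_y = [0] * len(kept) + [1] * (len(minority) + len(synthetic))
```

Because each cluster contributes at least one sample and shares are rounded, the number of retained majority samples only approximates `n_keep`.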


2019 ◽  
Vol 8 (2) ◽  
pp. 2463-2468

Learning from class-imbalanced data has become a challenging issue in the machine learning community, as classification algorithms are generally designed to work with balanced datasets. Several methods are available to tackle this issue, among which the resampling techniques, undersampling and oversampling, are the more flexible and versatile. This paper introduces a new concept for undersampling based on the Center of Gravity principle, which helps to reduce the excess instances of the majority class. This work is suited to binary class problems. The proposed technique, CoGBUS, overcomes the class imbalance problem and gives the best results in the study. We use F-score, G-mean and ROC to evaluate the performance of the method.
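
To show why F-score and G-mean are preferred over plain accuracy on imbalanced data, here is a minimal sketch that computes all three from binary confusion-matrix counts. The standard definitions are assumed (the abstract does not spell them out), and the function name and example counts are hypothetical.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Return accuracy, F-score and G-mean from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on the positive (minority) class
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on the negative (majority) class
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f_score = 2 * precision * sensitivity / denom if denom else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"accuracy": accuracy, "f_score": f_score, "g_mean": g_mean}

# A classifier that labels every instance as the majority class on a 5:95 dataset.
trivial = imbalance_metrics(tp=0, fn=5, fp=0, tn=95)
print(trivial)
```

The trivial classifier reaches 95% accuracy while its F-score and G-mean are both zero, which is the behaviour these two measures are meant to expose.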


2021 ◽  
Vol 1 (4) ◽  
pp. 268-280
Author(s):  
Bamanga Mahmud Ahmad ◽  
Ahmadu Asabe Sandra ◽  
Musa Yusuf Malgwi ◽  
Dahiru I. Sajoh

Machine learning techniques are commonly used in clinical decision support systems for the identification and prediction of different diseases. Heart disease is the leading cause of death for both men and women around the world, and the heart is one of the essential organs of the human body; heart disease is therefore one of the most critical concerns in the medical domain. Several researchers have developed intelligent medical devices to support these systems and to enhance the ability to diagnose and predict heart diseases. However, few studies have looked at the capabilities of ensemble methods in developing a heart disease detection and prediction model. In this study, the researchers assessed how to use an ensemble model, which offers more stable performance than a single base learning algorithm and leads to better results than other heart disease prediction models. Patient heart disease records were extracted from the University of California, Irvine (UCI) Machine Learning Repository. To achieve the aim of this study, the researchers developed a meta-algorithm. According to the experimental results, the ensemble model is a superior solution in terms of predictive accuracy and reliability of diagnostic output. An ensemble heart disease prediction model is also presented in this work as a valuable, cost-effective, and timely predictive option with a user-friendly graphical user interface that is scalable and expandable. Based on the findings, the researchers suggest that Bagging, which achieved a high prediction probability score, is the best ensemble classifier to adopt as the extended algorithm for heart disease prediction.
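
Bagging, the ensemble method recommended in this abstract, trains each base learner on a bootstrap resample of the training set and combines the learners by voting. The sketch below illustrates that idea in plain Python. The abstract does not name the base learner, so the one-feature decision stump, the function names, and the single-feature toy data are all assumptions made for this example.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-feature threshold rule (a decision stump) by exhaustive search."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for above in (0, 1):  # label predicted when the feature exceeds the threshold
                preds = [above if row[j] > threshold else 1 - above for row in X]
                errors = sum(p != t for p, t in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, above)
    _, j, threshold, above = best
    return lambda row: above if row[j] > threshold else 1 - above

def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagging(models, row):
    """Combine the base learners by majority vote."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical single-feature records: low values are healthy (0), high values diseased (1).
X = [[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

models = fit_bagging(X, y)
print(predict_bagging(models, [0.5]), predict_bagging(models, [9.5]))
```

Averaging over resamples is what gives the more stable performance the abstract attributes to ensembles: an individual stump trained on an unlucky resample can be wrong, but the vote is rarely affected.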


2009 ◽  
Vol 12 (9) ◽  
pp. 31-37
Author(s):  
Vinh Ngoc Tran ◽  
Quy Cam Vo ◽  
Thuoc Linh Tran

Although discontinuous epitopes make up 90% of all B-cell epitopes, most B-cell epitope prediction methods today focus on continuous epitopes because of the difficulty of developing methods to predict discontinuous ones. To support the development of a vaccine against the H5N1 virus, we have been studying in silico prediction of continuous T- and B-cell epitopes as well as discontinuous B-cell epitopes on H5N1 viral antigens. In this study, using homology modeling, we generated structures of the matrix protein of the H5N1 virus and predicted discontinuous B-cell epitopes. 60 out of 72 predicted residues agreed with those reported by the CEP (Conformational Epitope Prediction) method. All predicted amino acid residues were hydrophilic, polar, electrically charged and located on the surface of the antigen structures.


2014 ◽  
Vol 15 (1) ◽  
Author(s):  
Yuh-Jyh Hu ◽  
Shun-Chien Lin ◽  
Yu-Lung Lin ◽  
Kuan-Hui Lin ◽  
Shun-Ning You

2020 ◽  
Vol 38 (15_suppl) ◽  
pp. e14064-e14064
Author(s):  
Daniel Margalski ◽  
Thomas Lycan ◽  
Suraj Rajendran ◽  
Umit Topaloglu

Background: The ubiquitous implementation of immunotherapy has significantly improved outcomes in the treatment of cancer patients; however, once-rare adverse events from these therapies have increased in lockstep. We now face an increased burden of identification on providers with limited experience in the diagnosis of irAEs. We use machine learning to develop prediction models that will aid providers in identifying patients at high risk of developing irAEs, as well as for multiple downstream applications. Methods: We manually extracted progress notes from 462 patients with non-small cell lung cancer treated with immunotherapy who had known irAEs, with a focus on pneumonitis, colitis, and rash, the most common symptomatic irAEs. Labels were applied by clinician review to train the machine learning algorithm to identify the predictive signals at the earliest possible stage of recognition. As a standard Natural Language Processing step, we cleaned the notes to standardize punctuation, numbers and special characters. Next, we created a word embedding matrix utilizing word2vec as well as Google News Vector. Finally, we implemented a Convolutional Neural Network (CNN) on the Microsoft Azure Databricks platform. Due to class imbalance, we deployed a Synthetic Minority Over-sampling Technique algorithm as a correction. We prioritized F1 score in the analysis given the heterogeneity of the data, but will present accuracy, precision and recall as well. Results: We trained our CNN for 10 epochs, resulting in an F1 of 0.428, accuracy of 0.895, precision of 0.75 and recall of 0.3. There was no significant difference in results between the word embedding matrices. Conclusions: Using machine learning, we created an algorithm for irAE prediction that was accurate but lacked recall. This will serve as the foundation for implementations including the creation of a clinical decision support tool to guide focused and appropriate treatment of the unique toxicity of irAEs. Although informative as a starting point, this model had a final F1 score that was lower than expected, presumably due to the class imbalance of the input data and the temporal nature of progress notes, which limits the utility of a CNN. Future iterations of the algorithm will include supplementary documentation and implement recurrent neural networks with long short-term memory architecture to address these limitations.
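
The reported F1 can be checked against the reported precision (P = 0.75) and recall (R = 0.3), assuming the standard definition of F1 as their harmonic mean; the abstract does not state the formula used.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.75 \times 0.3}{0.75 + 0.3}
    = \frac{0.45}{1.05}
    \approx 0.4286
```

This agrees with the reported 0.428 to within rounding or truncation of the third decimal. The gap between an accuracy of 0.895 and a recall of 0.3 is the pattern expected when the negative class dominates the data.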


2021 ◽  
Vol 17 ◽  
Author(s):  
Jingyu Lee ◽  
Myeong-Sang Yu ◽  
Dokyun Na

Background: Drug-induced liver injury (DILI) is a leading cause of drug failure, accounting for nearly 20% of drug withdrawal. Thus, there has been a great demand for in silico DILI prediction models for successful drug discovery. To date, various models have been developed for DILI prediction; however, building an accurate model for practical use in drug discovery remains challenging. Methods: We constructed an ensemble model composed of three high-performance DILI prediction models to utilize the unique advantage of each machine learning algorithm. Results: The ensemble model exhibited high predictive performance, with an area under the curve of 0.88, sensitivity of 0.83, specificity of 0.77, F1-score of 0.82, and accuracy of 0.80. When a test dataset collected from the literature was used to compare the performance of our model with publicly available DILI prediction models, our model achieved an accuracy of 0.77, sensitivity of 0.82, specificity of 0.72, and F1-score of 0.79, which were higher than those of the other DILI prediction models. As many published DILI prediction models are not available for public access, which hinders in silico drug discovery, we made our DILI prediction model publicly accessible (http://ssbio.cau.ac.kr/software/dili/). Conclusion: We expect that our ensemble model may facilitate advancements in drug discovery by providing a highly predictive model and reducing the drug withdrawal rate.
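
The abstract does not say how the three base models are combined. One common choice, shown here purely as an assumption, is soft voting: average the positive-class probabilities of the models and threshold the mean. The function name, the threshold of 0.5, and the probabilities are hypothetical.

```python
def soft_vote(probabilities, threshold=0.5):
    """Average per-model positive-class probabilities and threshold the mean."""
    mean = sum(probabilities) / len(probabilities)
    label = 1 if mean >= threshold else 0
    return label, mean

# Hypothetical outputs of three base models for two compounds.
label_a, mean_a = soft_vote([0.9, 0.6, 0.3])  # two of three models lean positive
label_b, mean_b = soft_vote([0.2, 0.4, 0.6])  # two of three models lean negative
print(label_a, label_b)
```

Raising or lowering `threshold` trades sensitivity against specificity, the two measures reported alongside accuracy and F1-score above.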

