Combined weighted multi-objective optimizer for instance reduction in two-class imbalanced data problem

2020 ◽  
Vol 90 ◽  
pp. 103500 ◽  
Author(s):  
Javad Hamidzadeh ◽  
Niloufar Kashefi ◽  
Mona Moradi
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Tianjun Li ◽  
Long Chen ◽  
Min Gan

Abstract Background Mass spectra are usually acquired from the Liquid Chromatography-Mass Spectrometry (LC-MS) analysis for isotope labeled proteomics experiments. In such experiments, the mass profiles of labeled (heavy) and unlabeled (light) peptide pairs are represented by isotope clusters (2D or 3D) that provide valuable information about the studied biological samples in different conditions. The core task of quality control in quantitative LC-MS experiment is to filter out low-quality peptides with questionable profiles. The commonly used methods for this problem are the classification approaches. However, the data imbalance problems in previous control methods are often ignored or mishandled. In this study, we introduced a quality control framework based on the extreme gradient boosting machine (XGBoost), and carefully addressed the imbalanced data problem in this framework. Results In the XGBoost based framework, we suggest the application of the Synthetic minority over-sampling technique (SMOTE) to re-balance data and use the balanced data to train the boosted trees as the classifier. Then the classifier is applied to other data for the peptide quality assessment. Experimental results show that our proposed framework increases the reliability of peptide heavy-light ratio estimation significantly. Conclusions Our results indicate that this framework is a powerful method for the peptide quality assessment. For the feature extraction part, the extracted ion chromatogram (XIC) based features contribute to the peptide quality assessment. To solve the imbalanced data problem, SMOTE brings a much better classification performance. Finally, the XGBoost is capable for the peptide quality control. Overall, our proposed framework provides reliable results for the further proteomics studies.


2019 ◽  
Vol 9 (20) ◽  
pp. 4216 ◽  
Author(s):  
Zhen Chen ◽  
Xiaoyan Han ◽  
Chengwei Fan ◽  
Zirun He ◽  
Xueneng Su ◽  
...  

In recent years, machine learning methods have shown the great potential for real-time transient stability status prediction (TSSP) application. However, most existing studies overlook the imbalanced data problem in TSSP. To address this issue, a novel data segmentation-based ensemble classification (DSEC) method for TSSP is proposed in this paper. Firstly, the effects of the imbalanced data problem on the decision boundary and classification performance of TSSP are investigated in detail. Then, a three-step DSEC method is presented. In the first step, the data segmentation strategy is utilized for dividing the stable samples into multiple non-overlapping stable subsets, ensuring that the samples in each stable subset are not more than the unstable ones, then each stable subset is combined with the unstable set into a training subset. For the second step, an AdaBoost classifier is built based on each training subset. In the final step, decision values from each AdaBoost classifier are aggregated for determining the transient stability status. The experiments are conducted on the Northeast Power Coordinating Council 140-bus system and the simulation results indicate that the proposed approach can significantly improve the classification performance of TSSP with imbalanced data.


Author(s):  
Sajad Emamipour ◽  
Rasoul Sali ◽  
Zahra Yousefi

This article describes how class imbalance learning has attracted great attention in recent years as many real world domain applications suffer from this problem. Imbalanced class distribution occurs when the number of training examples for one class far surpasses the training examples of the other class often the one that is of more interest. This problem may produce an important deterioration of the classifier performance, in particular with patterns belonging to the less represented classes. Toward this end, the authors developed a hybrid model to address the class imbalance learning with focus on binary class problems. This model combines benefits of the ensemble classifiers with a multi objective feature selection technique to achieve higher classification performance. The authors' model also proposes non-dominated sets of features. Then they evaluate the performance of the proposed model by comparing its results with notable algorithms for solving imbalanced data problem. Finally, the authors utilize the proposed model in medical domain of predicting life expectancy in post-operative of thoracic surgery patients.


2020 ◽  
Vol 7 (6) ◽  
pp. 1221
Author(s):  
Nabila Sekar Ramadhanti ◽  
Wisnu Ananta Kusuma ◽  
Annisa Annisa

<p>Data tidak seimbang menjadi salah satu masalah yang muncul pada masalah prediksi atau klasifikasi. Penelitian ini memfokuskan untuk mengatasi masalah data tidak seimbang pada prediksi <em>drug-target interaction</em> (interaksi senyawa-protein). Ada banyak protein target dan senyawa obat yang terdapat pada basis data interaksi senyawa-protein yang belum divalidasi interaksinya secara eksperimen. Belum diketahuinya interaksi antar senyawa dan target tersebut membuat proporsi antara data yang diketahui interaksinya dan yang belum dikethui menjadi tidak seimbang. Data interaksi yang sangat tidak seimbang dapat menyebabkan hasil prediksi menjadi bias. Terdapat banyak cara untuk mengatasi data tidak seimbang ini, namun pada penelitian ini diimplementasikan metode yang menggabungkan <em>Biased Support Vector Machine</em> (BSVM), <em>oversampling, </em>dan <em>undersampling</em> dengan <em>Ensemble Support Vector Machine</em> (SVM). Penelitian ini mengeksplorasi efek sampling yang digabungkan dalam metode tersebut pada data interaksi senyawa-protein. Metode ini sudah diuji pada dataset <em>Nuclear Receptor,</em> <em>G-Protein Coupled Receptor</em> dan <em>Ion Channel </em>dengan rasio ketidakseimbangannya sebesar 14.6%, 32.36%, dan 28.2%. Hasil pengujian dengan menggunakan ketiga dataset tersebut menunjukkan nilai <em>area under curve</em> (AUC) secara berturut-turut sebesar 63.4%, 71.4%, 61.3% dan F-measure sebesar 54%, 60.7% dan 39%. Nilai akurasi dari metode yang digunakan masih terbilang cukup baik, walaupun nilai tersebut lebih kecil dari metode SVM tanpa perlakuan apapun. Nilai tersebut <em>bias</em> karena nilai AUC dan F-measure ternyata lebih kecil. Hal ini membuktikan bahwa metode yang diusulkan dapat menurunkan tingkat bias pada data tidak seimbang yang diuji dan meningkatkan nilai AUC dan f-measure sekitar 5%-20%.</p><p> </p><p><em><strong>Abstract</strong></em></p><p><em>Imbalanced data </em><em>has been one of the problems that arise in processing data. This research is focusing on handling imbalanced data problem for </em><em>drug-target</em><em> </em><em>(compound-protein) interaction data. There are many target protein and drug compound existed in compound-protein interaction databases, which many interactions are not validated yet by experiment. This unknown</em><em> interaction led drug target interaction to become imbalanced data. A really imbalanced data may cause bias to prediction result. There are many ways of handling imbalanced data, but this research implemented some methods such as BSVM, oversampling, undersampling with SVM ensemble. These method already solve the imbalanced data problem on other kind of data like image data. This research is focusing on exploration of effect on the sampling that used in these method for </em><em>compound-protein</em><em> interaction data. This method had been tested on </em><em>compound-protein</em><em> interaction Nuclear Receptor, GPCR</em> <em>and Ion Channel with 14.6%, 32.36% and 28.2% of imbalance ratio. The evaluation result using these three dataset show the value of AUC respectively 63.4%, 71.4%, 61.3% and F-measure of 54%, 60.7% and 39%. The score from this method is quite good, even though the score of accuracy and precision is smaller than the SVM. The value is bias because the AUC and F-measure score is smaller. This proves that the proposed method could reduce the bias rate in the evaluated imbalanced data and increase AUC and f-measure score from 5% to 20%.</em></p><p><em><strong><br /></strong></em></p>


Sign in / Sign up

Export Citation Format

Share Document