A Hybrid GAN Based Approach to Solve Imbalanced Data Problem in Recommendation Systems

IEEE Access ◽  
2022 ◽  
pp. 1-1
Author(s):  
Wafa Shafqat ◽  
Yung-Cheol Byun
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Tianjun Li ◽  
Long Chen ◽  
Min Gan

Abstract Background: Mass spectra are usually acquired by Liquid Chromatography-Mass Spectrometry (LC-MS) analysis in isotope-labeled proteomics experiments. In such experiments, the mass profiles of labeled (heavy) and unlabeled (light) peptide pairs are represented by isotope clusters (2D or 3D) that provide valuable information about the studied biological samples under different conditions. The core task of quality control in a quantitative LC-MS experiment is to filter out low-quality peptides with questionable profiles. Classification approaches are commonly used for this task. However, previous quality control methods often ignore or mishandle the data imbalance problem. In this study, we introduce a quality control framework based on the extreme gradient boosting machine (XGBoost) and carefully address the imbalanced data problem within it. Results: In the XGBoost-based framework, we suggest applying the synthetic minority over-sampling technique (SMOTE) to re-balance the data and using the balanced data to train the boosted trees as the classifier. The classifier is then applied to other data for peptide quality assessment. Experimental results show that the proposed framework significantly increases the reliability of peptide heavy-light ratio estimation. Conclusions: Our results indicate that this framework is a powerful method for peptide quality assessment. For feature extraction, the extracted ion chromatogram (XIC) based features contribute to the peptide quality assessment. For the imbalanced data problem, SMOTE yields much better classification performance. Finally, XGBoost is well suited to peptide quality control. Overall, the proposed framework provides reliable results for further proteomics studies.
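The re-balancing step named in the abstract can be illustrated with a minimal sketch of SMOTE's core idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a stdlib-only illustration, not the authors' implementation. The function names, the two-feature points, and the choice of which class is the minority are all hypothetical.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Needs at least two minority samples. Illustrative only.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

def rebalance(majority, minority, k=5, seed=0):
    """Pad the minority class with synthetic points until both classes match in size."""
    n_new = max(0, len(majority) - len(minority))
    return minority + smote(minority, n_new, k, seed)

# Hypothetical two-feature data: 3 minority samples versus 10 majority samples.
low_quality = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3)]
high_quality = [(1.0 + 0.1 * i, 1.0) for i in range(10)]
balanced_minority = rebalance(high_quality, low_quality, k=2)
print(len(balanced_minority), len(high_quality))
```

In practice the balanced set would then be passed to a boosted-tree classifier. That step is omitted here because it depends on an external library.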


2019 ◽  
Vol 9 (20) ◽  
pp. 4216 ◽  
Author(s):  
Zhen Chen ◽  
Xiaoyan Han ◽  
Chengwei Fan ◽  
Zirun He ◽  
Xueneng Su ◽  
...  

In recent years, machine learning methods have shown great potential for real-time transient stability status prediction (TSSP). However, most existing studies overlook the imbalanced data problem in TSSP. To address this issue, this paper proposes a novel data segmentation-based ensemble classification (DSEC) method for TSSP. First, the effects of the imbalanced data problem on the decision boundary and classification performance of TSSP are investigated in detail. Then, a three-step DSEC method is presented. In the first step, a data segmentation strategy divides the stable samples into multiple non-overlapping stable subsets, ensuring that each stable subset contains no more samples than the unstable set; each stable subset is then combined with the unstable set to form a training subset. In the second step, an AdaBoost classifier is built on each training subset. In the final step, the decision values from all AdaBoost classifiers are aggregated to determine the transient stability status. Experiments on the Northeast Power Coordinating Council 140-bus system show that the proposed approach can significantly improve the classification performance of TSSP with imbalanced data.
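The three steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: a nearest-centroid decision function stands in for each AdaBoost classifier, decision values are aggregated by a plain sum (the abstract does not say how), and all names and data points are hypothetical.

```python
import math

def segment_stable(stable, n_unstable):
    """Step 1: split stable samples into disjoint subsets, each no larger than n_unstable."""
    n_subsets = math.ceil(len(stable) / n_unstable)
    # Round-robin slicing keeps the subsets non-overlapping and near-equal in size.
    return [stable[i::n_subsets] for i in range(n_subsets)]

def train_centroid_classifier(stable_subset, unstable):
    """Step 2 stand-in (NOT AdaBoost): returns a decision function for one training subset.

    A positive decision value means the input is closer to the stable centroid.
    """
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    c_stable, c_unstable = centroid(stable_subset), centroid(unstable)
    def decision(x):
        return math.dist(x, c_unstable) - math.dist(x, c_stable)
    return decision

def dsec_predict(classifiers, x):
    """Step 3: aggregate decision values (summed here, as an assumption)."""
    total = sum(clf(x) for clf in classifiers)
    return "stable" if total >= 0 else "unstable"

# Hypothetical two-feature data: 10 stable samples versus 3 unstable samples.
stable = [(1.0 + 0.01 * i, 1.0) for i in range(10)]
unstable = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
subsets = segment_stable(stable, len(unstable))
classifiers = [train_centroid_classifier(s, unstable) for s in subsets]
print([len(s) for s in subsets])
print(dsec_predict(classifiers, (0.9, 0.9)), dsec_predict(classifiers, (0.05, 0.05)))
```

Because every classifier sees all unstable samples but only a slice of the stable ones, each training subset is balanced or tilted toward the minority class, while the ensemble as a whole still uses every stable sample.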


2020 ◽  
Vol 7 (6) ◽  
pp. 1221
Author(s):  
Nabila Sekar Ramadhanti ◽  
Wisnu Ananta Kusuma ◽  
Annisa Annisa

Imbalanced data is one of the problems that arise in prediction and classification tasks. This research focuses on handling the imbalanced data problem in drug-target (compound-protein) interaction prediction. Compound-protein interaction databases contain many target proteins and drug compounds whose interactions have not yet been validated experimentally. Because these interactions are unknown, the proportion of pairs with known interactions to pairs with unknown interactions is imbalanced, and highly imbalanced interaction data can bias the prediction results. There are many ways to handle imbalanced data; this research implements a method that combines Biased Support Vector Machine (BSVM), oversampling, and undersampling with an ensemble of Support Vector Machines (SVM). These methods have already addressed the imbalanced data problem on other kinds of data, such as image data. This research explores the effect of the sampling used in this combined method on compound-protein interaction data. The method was tested on the Nuclear Receptor, G-Protein Coupled Receptor (GPCR), and Ion Channel datasets, with imbalance ratios of 14.6%, 32.36%, and 28.2%. On these three datasets the area under the curve (AUC) was 63.4%, 71.4%, and 61.3%, and the F-measure was 54%, 60.7%, and 39%, respectively. The accuracy of the method is still fairly good, although its accuracy and precision are lower than those of an SVM without any treatment. That higher accuracy is biased, however, because the plain SVM's AUC and F-measure are lower. This shows that the proposed method can reduce the bias on the evaluated imbalanced data and increase AUC and F-measure by about 5% to 20%.
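The abstract's point that accuracy can be biased on imbalanced data, while the F-measure is not, can be checked with a small worked example. The labels and predictions below are hypothetical and are not taken from the study's datasets.

```python
def accuracy_and_f_measure(y_true, y_pred):
    """Return (accuracy, F-measure) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f_measure

# Hypothetical data: 10 positive pairs and 90 negative pairs.
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100                               # ignores the minority class
rebalanced = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 78      # finds 8 of the 10 positives

acc_naive, f_naive = accuracy_and_f_measure(y_true, always_negative)
acc_rebal, f_rebal = accuracy_and_f_measure(y_true, rebalanced)
print(acc_naive, f_naive)
print(acc_rebal, f_rebal)
```

The classifier that never predicts the minority class scores the higher accuracy (0.90 against 0.86) but an F-measure of zero, which is the kind of bias the abstract describes.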


2021 ◽  
Author(s):  
Danqing Hu ◽  
Shaolei Li ◽  
Huilong Duan ◽  
Nan Wu ◽  
Xudong Lu

BACKGROUND: Lung cancer is the leading cause of cancer death worldwide. Prognostic prediction plays a vital role in the decision-making process for postoperative non-small cell lung cancer (NSCLC) patients. However, the high imbalance ratio of prognostic data limits the development of effective prognostic prediction models. OBJECTIVE: In this study, we present a novel approach, ensemble learning with active sampling (ELAS), to tackle the imbalanced data problem in NSCLC prognostic prediction. METHODS: ELAS first applies an active sampling mechanism to query the most informative samples for the current base classifier and then uses those samples to update the base classifier, giving it a new perspective. This training process is repeated until not enough samples are queried. Next, an internal development set is used to evaluate the base classifiers, and those with the best performance are integrated into the ensemble model. In addition, we set up multiple initial training sets and internal development sets to ensure the stability and generalization of the model. RESULTS: We verify the effectiveness of ELAS on a real clinical dataset containing 1,848 postoperative NSCLC patients. Experimental results show that ELAS achieves an average AUC of 0.732 across all 6 prognostic tasks and obtains improvements of 2.3%, 1.4%, 1.3%, 1.5%, and 4.4% over SVM, AdaBoost, Bagging, SMOTE, and TomekLinks, respectively. CONCLUSIONS: We conclude that ELAS can effectively alleviate the imbalanced data problem in NSCLC prognostic prediction and shows good potential for future postoperative NSCLC prognostic prediction.
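The training loop described in METHODS can be sketched very roughly as below. The abstract does not define "most informative", the base classifier, or the selection metric, so everything concrete here is an assumption: a one-feature threshold classifier, uncertainty sampling (querying pool samples nearest the current boundary), balanced accuracy on the development set, and a vote among the top classifiers. All names and data are hypothetical.

```python
def fit_threshold(samples):
    """Base classifier stand-in: midpoint between the two class means of a 1-D feature."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, x):
    return 1 if x >= threshold else 0

def balanced_accuracy(threshold, dev):
    pos = [x for x, y in dev if y == 1]
    neg = [x for x, y in dev if y == 0]
    tpr = sum(predict(threshold, x) == 1 for x in pos) / len(pos)
    tnr = sum(predict(threshold, x) == 0 for x in neg) / len(neg)
    return (tpr + tnr) / 2

def elas_sketch(initial, pool, dev, batch=2, margin=1.0, top_n=2):
    """Grow the training set by active sampling, then keep the best classifiers."""
    train, pool = list(initial), list(pool)
    thresholds = [fit_threshold(train)]
    while True:
        t = thresholds[-1]
        # Assumed query rule: pool samples within `margin` of the boundary, nearest first.
        near = sorted((s for s in pool if abs(s[0] - t) <= margin),
                      key=lambda s: abs(s[0] - t))
        queried = near[:batch]
        if len(queried) < batch:  # stop once too few samples are queried
            break
        train += queried
        pool = [s for s in pool if s not in queried]
        thresholds.append(fit_threshold(train))  # updated base classifier
    # Keep the classifiers that score best on the internal development set.
    return sorted(thresholds, key=lambda th: balanced_accuracy(th, dev), reverse=True)[:top_n]

def ensemble_predict(best, x):
    """Predict 1 when at least half of the selected classifiers vote 1."""
    return 1 if 2 * sum(predict(th, x) for th in best) >= len(best) else 0

initial = [(0.0, 0), (10.0, 1)]
pool = [(4.5, 0), (5.5, 0), (6.0, 0), (6.5, 0), (7.5, 1), (2.0, 0), (3.0, 0)]
dev = [(6.8, 0), (5.0, 0), (3.0, 0), (7.2, 1), (9.0, 1)]
best = elas_sketch(initial, pool, dev)
print(len(best), ensemble_predict(best, 9.0), ensemble_predict(best, 3.0))
```

The multiple initial training sets and development sets mentioned in the abstract would correspond to running this loop several times with different splits. That repetition is left out to keep the sketch short.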

