Combined weighted multi-objective optimizer for instance reduction in two-class imbalanced data problem

Abstract

Background: Mass spectra are usually acquired through Liquid Chromatography-Mass Spectrometry (LC-MS) analysis in isotope-labeled proteomics experiments. In such experiments, the mass profiles of labeled (heavy) and unlabeled (light) peptide pairs are represented by isotope clusters (2D or 3D) that provide valuable information about the studied biological samples under different conditions. The core task of quality control in a quantitative LC-MS experiment is to filter out low-quality peptides with questionable profiles. Classification approaches are the methods most commonly used for this task. However, previous quality control methods often ignore or mishandle the data imbalance problem. In this study, we introduce a quality control framework based on the extreme gradient boosting machine (XGBoost) and carefully address the imbalanced data problem within it.

Results: In the XGBoost-based framework, we suggest applying the synthetic minority over-sampling technique (SMOTE) to re-balance the data and using the balanced data to train the boosted trees as the classifier. The classifier is then applied to other data for peptide quality assessment. Experimental results show that the proposed framework significantly increases the reliability of peptide heavy-light ratio estimation.

Conclusions: Our results indicate that this framework is a powerful method for peptide quality assessment. For feature extraction, the extracted ion chromatogram (XIC) based features contribute to the peptide quality assessment. For the imbalanced data problem, SMOTE yields much better classification performance. Finally, XGBoost is well suited to peptide quality control. Overall, the proposed framework provides reliable results for further proteomics studies.
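The re-balancing step described in the abstract above can be illustrated with a minimal sketch of the SMOTE idea: each synthetic minority sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, the parameter `k`, and the toy points are assumptions made for illustration; this is not the authors' implementation, and the XGBoost training step is not shown.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (illustrative only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: 4 minority points and 12 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i), float(i % 5)) for i in range(3, 15)]

# Generate enough synthetic minority points to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

In this sketch the balanced minority set has the same size as the majority set, and a classifier would then be trained on the combined data. Because the synthetic points are interpolations, they always stay inside the region spanned by the original minority samples.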

Addressing Imbalanced Data Problem with Generative Adversarial Network For Intrusion Detection

2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI) ◽ 10.1109/iri49571.2020.00012 ◽ 2020

Author(s): Ibrahim Yilmaz ◽ Rahat Masum ◽ Ambareen Siraj

Keyword(s): Intrusion Detection ◽ Imbalanced Data ◽ Generative Adversarial Network ◽ Adversarial Network ◽ Data Problem

A Data Segmentation-Based Ensemble Classification Method for Power System Transient Stability Status Prediction with Imbalanced Data

Applied Sciences ◽ 10.3390/app9204216 ◽ 2019 ◽ Vol 9 (20) ◽ pp. 4216 ◽ Cited By ~ 2

Author(s): Zhen Chen ◽ Xiaoyan Han ◽ Chengwei Fan ◽ Zirun He ◽ Xueneng Su ◽ ...

Keyword(s): Transient Stability ◽ Imbalanced Data ◽ Classification Performance ◽ Ensemble Classification ◽ Segmentation Strategy ◽ Data Segmentation ◽ Unstable Set ◽ Data Problem ◽ Adaboost Classifier ◽ Training Subset

In recent years, machine learning methods have shown great potential for real-time transient stability status prediction (TSSP). However, most existing studies overlook the imbalanced data problem in TSSP. To address this issue, this paper proposes a novel data segmentation-based ensemble classification (DSEC) method for TSSP. First, the effects of the imbalanced data problem on the decision boundary and classification performance of TSSP are investigated in detail. Then, a three-step DSEC method is presented. In the first step, a data segmentation strategy divides the stable samples into multiple non-overlapping stable subsets so that no stable subset contains more samples than the unstable set; each stable subset is then combined with the unstable set to form a training subset. In the second step, an AdaBoost classifier is built on each training subset. In the final step, the decision values from the AdaBoost classifiers are aggregated to determine the transient stability status. Experiments on the Northeast Power Coordinating Council 140-bus system indicate that the proposed approach can significantly improve the classification performance of TSSP with imbalanced data.
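The three steps described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a nearest-centroid decision function stands in for the AdaBoost classifier, the decision values are aggregated by a plain sum (the abstract does not say how they are combined), and all function names and sample points are hypothetical.

```python
import math


def segment(stable, n_unstable):
    """Step 1: split the stable samples into non-overlapping subsets,
    each holding at most n_unstable samples."""
    n_subsets = math.ceil(len(stable) / n_unstable)
    return [stable[i::n_subsets] for i in range(n_subsets)]


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def train_stand_in(stable_subset, unstable):
    """Step 2 (stand-in for AdaBoost): return a decision function whose
    value is positive when x lies closer to the unstable centroid."""
    cs, cu = centroid(stable_subset), centroid(unstable)
    return lambda x: math.dist(x, cs) - math.dist(x, cu)


def dsec_predict(models, x):
    """Step 3: aggregate the decision values (here: a simple sum)."""
    score = sum(model(x) for model in models)
    return "unstable" if score > 0 else "stable"


# Hypothetical toy data: 20 stable samples and 6 unstable samples.
stable = [(float(i % 4), float(i // 4)) for i in range(20)]
unstable = [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
            (9.5, 9.5), (8.0, 9.0), (9.0, 9.0)]

subsets = segment(stable, len(unstable))
models = [train_stand_in(subset, unstable) for subset in subsets]
```

With 20 stable and 6 unstable samples, the segmentation produces four stable subsets of five samples each, so every training subset pairs the full unstable set with a stable subset of no greater size.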

Classification of imbalanced data sets using Multi Objective Genetic Programming

2015 International Conference on Computer Communication and Informatics (ICCCI) ◽ 10.1109/iccci.2015.7218125 ◽ 2015 ◽ Cited By ~ 5

Author(s): Hardik H. Maheta ◽ Vipul K. Dabhi

Keyword(s): Genetic Programming ◽ Imbalanced Data ◽ Data Sets ◽ Imbalanced Data Sets ◽ Multi Objective

A Multi-Objective Ensemble Method for Class Imbalance Learning

International Journal of Big Data and Analytics in Healthcare ◽ 10.4018/ijbdah.2017010102 ◽ 2017 ◽ Vol 2 (1) ◽ pp. 16-34

Author(s): Sajad Emamipour ◽ Rasoul Sali ◽ Zahra Yousefi

Keyword(s): Class Imbalance ◽ Imbalanced Data ◽ Classification Performance ◽ Ensemble Classifiers ◽ Feature Selection Technique ◽ Multi Objective ◽ Proposed Model ◽ Training Examples ◽ Imbalance Learning ◽ Class Imbalance Learning

Class imbalance learning has attracted great attention in recent years because many real-world applications suffer from this problem. An imbalanced class distribution occurs when the number of training examples for one class far surpasses the number for the other class, which is often the one of greater interest. This problem can seriously degrade classifier performance, particularly on patterns belonging to the less represented class. To address it, the authors developed a hybrid model for class imbalance learning that focuses on binary class problems. The model combines the benefits of ensemble classifiers with a multi-objective feature selection technique to achieve higher classification performance, and it also proposes non-dominated sets of features. The authors evaluate the proposed model by comparing its results with notable algorithms for the imbalanced data problem. Finally, they apply the model to a medical task: predicting post-operative life expectancy in thoracic surgery patients.
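The notion of a non-dominated set of features mentioned in the abstract above can be illustrated with a small Pareto filter. The two objectives used here (classification error and number of features, both minimized), the candidate subsets, and their scores are all hypothetical; the abstract does not state which objectives the authors optimize.

```python
def dominates(a, b):
    """a dominates b if a is no worse on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(candidates):
    """Keep the candidates that no other candidate dominates."""
    return {
        name: objectives
        for name, objectives in candidates.items()
        if not any(dominates(other, objectives)
                   for other in candidates.values())
    }


# Hypothetical feature subsets mapped to (error rate, number of features).
candidates = {
    ("f1",): (0.30, 1),
    ("f1", "f2"): (0.22, 2),
    ("f2", "f3"): (0.25, 2),
    ("f1", "f2", "f3"): (0.22, 3),
    ("f1", "f2", "f3", "f4"): (0.18, 4),
}

front = non_dominated(candidates)
```

Here `("f2", "f3")` and `("f1", "f2", "f3")` are both dominated by `("f1", "f2")`, which has an error at least as low with no more features. The three remaining subsets form the non-dominated set: each trades a lower error against a larger number of features.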

Feature Selection Approach for Solving Imbalanced Data Problem in Single Nucleotide Polymorphism Discovery

Journal of Physics Conference Series ◽ 10.1088/1742-6596/1566/1/012035 ◽ 2020 ◽ Vol 1566 ◽ pp. 012035

Author(s): R Nurhasanah ◽ L S Hasibuan ◽ W A Kusuma

Keyword(s): Single Nucleotide Polymorphism ◽ Feature Selection ◽ Imbalanced Data ◽ Single Nucleotide Polymorphism Discovery ◽ Nucleotide Polymorphism ◽ Single Nucleotide ◽ Polymorphism Discovery ◽ Selection Approach ◽ Data Problem ◽ Feature Selection Approach

Fully Bayesian Analysis of Relevance Vector Machine Classification with Probit Link Function for Imbalanced Data Problem

IEEE Access ◽ 10.1109/access.2021.3052935 ◽ 2021 ◽ pp. 1-1

Author(s): Wenyang Wang ◽ Dongchu Sun ◽ Peng Shao ◽ Haibo Kuang ◽ Cong Sui

Keyword(s): Bayesian Analysis ◽ Imbalanced Data ◽ Relevance Vector Machine ◽ Link Function ◽ Data Problem ◽ Fully Bayesian ◽ Probit Link

Optimasi Data Tidak Seimbang pada Interaksi Drug Target dengan Sampling dan Ensemble Support Vector Machine

Jurnal Teknologi Informasi dan Ilmu Komputer ◽ 10.25126/jtiik.2020762857 ◽ 2020 ◽ Vol 7 (6) ◽ pp. 1221

Author(s): Nabila Sekar Ramadhanti ◽ Wisnu Ananta Kusuma ◽ Annisa Annisa

Keyword(s): Support Vector Machine ◽ Nuclear Receptor ◽ Protein Interaction ◽ Drug Target ◽ Imbalanced Data ◽ Support Vector ◽ Interaction Data ◽ Target Interaction ◽ Data Problem ◽ F Measure

Imbalanced data is one of the problems that arises in prediction and classification tasks. This research focuses on handling the imbalanced data problem in drug-target (compound-protein) interaction prediction. Compound-protein interaction databases contain many target proteins and drug compounds whose interactions have not yet been validated experimentally. Because these interactions are unknown, the proportion of known to unknown interactions is imbalanced, and highly imbalanced interaction data can bias the prediction results. There are many ways of handling imbalanced data; this research implements a method that combines Biased Support Vector Machine (BSVM), oversampling, and undersampling with an ensemble Support Vector Machine (SVM). These methods have already been used to address the imbalanced data problem on other kinds of data, such as image data. This research explores the effect of the sampling used in the method on compound-protein interaction data. The method was tested on the Nuclear Receptor, G-Protein Coupled Receptor (GPCR), and Ion Channel datasets, with imbalance ratios of 14.6%, 32.36%, and 28.2%. On these three datasets, the area under the curve (AUC) was 63.4%, 71.4%, and 61.3%, and the F-measure was 54%, 60.7%, and 39%, respectively. The accuracy of the method is still fairly good, although its accuracy and precision are lower than those of SVM without any treatment. That higher score is biased, however, because the corresponding AUC and F-measure turn out to be lower. This shows that the proposed method can reduce the bias on the evaluated imbalanced data and increase the AUC and F-measure by about 5% to 20%.
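The remark in the abstract above that a high accuracy can be biased while the F-measure stays low can be reproduced on a small worked example. The labels and predictions below are hypothetical and are not taken from the paper's datasets; they only show how the two metrics diverge when one class is rare.

```python
def scores(y_true, y_pred):
    """Return (accuracy, precision, recall, F-measure) for binary labels,
    where 1 marks the minority (positive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_measure


# Hypothetical test set: 5 positives (minority) and 95 negatives.
y_true = [1] * 5 + [0] * 95

# A classifier that ignores the minority class entirely.
always_negative = [0] * 100

# A classifier that finds 4 of 5 positives at the cost of 6 false alarms.
rebalanced = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

acc_a, _, _, f_a = scores(y_true, always_negative)
acc_b, _, _, f_b = scores(y_true, rebalanced)
```

The classifier that always predicts the majority class reaches an accuracy of 0.95 with an F-measure of 0, while the second classifier has a slightly lower accuracy of 0.93 but an F-measure of about 0.53. This is why accuracy alone is a misleading score on imbalanced data.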

Old and New Data Mining Topics: Imbalanced Data Problem, Privacy-Preserving Data Mining and Data Mining by Image Processing

Journal of Japan Society for Fuzzy Theory and Intelligent Informatics ◽ 10.3156/jsoft.32.1_9 ◽ 2020 ◽ Vol 32 (1) ◽ pp. 9-12

Author(s): Ayahiko NIIMI

Keyword(s): Data Mining ◽ Image Processing ◽ Imbalanced Data ◽ Privacy Preserving ◽ Privacy Preserving Data Mining ◽ Data Problem

Multi-objective Evolutionary Undersampling Algorithm for Imbalanced Data Classification

Quality control of imbalanced mass spectra from isotopic labeling experiments
