Multi-Label Classification with PSO based Synthetic Minority Over-Sampling Technique (Psosmote) for Imbalanced Samples

Learning from unbalanced data has recently become a prominent problem in several applications. Because multi-label classification is an evolving data mining task, learning from unbalanced multi-label data is also being examined. However, the available SMOTE-based algorithms use the same sampling rate for every instance of the minority class, which leads to sub-optimal performance. To deal with this problem, a new Particle Swarm Optimization based SMOTE (PSOSMOTE) algorithm is proposed. PSOSMOTE employs different sampling rates for different minority class instances and searches for the optimal combination of sampling rates for classifying unbalanced datasets. A Bayesian technique is then combined with Random Forest for multi-label classification (BARF-MLC) to address the inherent label dependencies among samples. The approach is considered alongside classifiers such as ML-FOREST, Predictive Clustering Trees (PCT), and Hierarchy of Multi Label Classifier (HOMER), using metrics including precision, recall, F-measure, accuracy, and error rate.
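The idea of giving each minority instance its own sampling rate can be illustrated with a minimal sketch. This is not the authors' PSOSMOTE: the function name `smote_per_instance`, the toy data, and the hand-picked `rates` vector are assumptions made for illustration. In a PSO-based variant, a vector like `rates` would presumably be what the swarm searches over.

```python
import random

def smote_per_instance(minority, rates, k=2, seed=0):
    """SMOTE-style interpolation where rates[i] is the number of
    synthetic samples generated from minority[i] (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for i, (x, n_new) in enumerate(zip(minority, rates)):
        # k nearest minority neighbours of x by squared Euclidean distance
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbours = others[:k]
        for _ in range(n_new):
            nb = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Toy minority class: three clustered points and one isolated point.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
# Illustrative per-instance rates: oversample the first point most,
# and do not oversample the isolated point at all.
rates = [3, 1, 1, 0]
synthetic = smote_per_instance(minority, rates)
```

With a uniform rate, every entry of `rates` would be identical; the per-instance version lets an optimizer decide which minority points deserve more synthetic neighbours.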

Integration of Naive Bayes with the SMOTE Sampling Technique to Handle Unbalanced Data (INTEGRASI NAIVE BAYES DENGAN TEKNIK SAMPLING SMOTE UNTUK MENANGANI DATA TIDAK SEIMBANG)

NUANSA INFORMATIKA ◽

10.25134/nuansa.v14i1.2411 ◽

2020 ◽

Vol 14 (1) ◽

pp. 34

Author(s):

Nina Sulistiyowati ◽

Mohamad Jajuli

Keyword(s):

Machine Learning ◽

Data Mining ◽

Sampling Technique ◽

Unbalanced Data ◽

Classification Algorithms ◽

Customer Data ◽

Naïve Bayes ◽

Almost All ◽

Loan Amount

Classification of data with unbalanced classes is a major problem in machine learning and data mining. On unbalanced data, almost all classification algorithms produce much higher accuracy for the majority class than for the minority class. This research applies the Synthetic Minority Over-sampling Technique (SMOTE) to unbalanced credit customer data from the Rawamerta Teachers Cooperative. The methodology follows SEMMA, with the stages Sample, Explore, Modify, Model, and Assess. The Sample stage selected credit customer data of the Rawamerta Teachers Cooperative for 2015-2017, 878 records in total, with the attributes income, total deposits, loan amount, duration of installments, services, installments, and credit status. The Explore stage found that the "current" class is the majority class with 813 records, while the "traffic" class is the minority class with 65 records, which shows an imbalance between the two classes. The Modify stage applies SMOTE at 500%. The Model stage classifies using Naïve Bayes. Naïve Bayes with SMOTE classified 1131 records correctly and 72 incorrectly, while without SMOTE 818 records were classified correctly and 60 incorrectly.

Keywords: Naïve Bayes, SMOTE, unbalanced data
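The counts in this abstract can be checked with a few lines of arithmetic. The sketch below assumes that "500% SMOTE" means five synthetic records per original minority record; that reading is consistent with the reported totals, since 813 + 65 + 325 = 1203 = 1131 + 72.

```python
# Class sizes reported in the abstract
n_majority, n_minority = 813, 65
smote_percent = 500  # assumed to mean 5 synthetic records per minority record

n_synthetic = n_minority * smote_percent // 100   # 325 synthetic records
n_minority_after = n_minority + n_synthetic       # 390 minority records
n_total_after = n_majority + n_minority_after     # 1203 records after SMOTE

# Overall accuracy from the reported correct / incorrect counts
acc_with_smote = 1131 / (1131 + 72)
acc_without_smote = 818 / (818 + 60)
print(f"with SMOTE: {acc_with_smote:.1%}, without SMOTE: {acc_without_smote:.1%}")
```

Overall accuracy comes out at roughly 94.0% with SMOTE against 93.2% without it, so the gain in overall accuracy is under one percentage point. The abstract does not give per-class figures, so the effect on the minority class cannot be derived from these numbers alone.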

Classification of Unbalanced Data Based on RSM and Binomial Distribution

Fuzzy Systems and Data Mining VI - Frontiers in Artificial Intelligence and Applications ◽

10.3233/faia200684 ◽

2020 ◽

Author(s):

Rong Li ◽

Wei-Bai Zhou

Keyword(s):

Dimension Reduction ◽

Binomial Distribution ◽

Classification Accuracy ◽

Classification Algorithm ◽

Unbalanced Data ◽

Minority Class ◽

Model Interpretation ◽

High Classification Accuracy ◽

Random Part

With extremely unbalanced data, traditional classification algorithms produce very unbalanced results: most samples are assigned to the majority class, so accuracy on the minority classes drops. In this paper, we propose a classification algorithm for unbalanced data based on RSM and binomial undersampling. Each base classifier is trained on a random part of the features chosen by RSM rather than on all of them, which reduces the dimensionality and indirectly raises the relative weight of the minority class samples. Using RSM to reduce dimensionality in this way alleviates the problem of having too few minority class samples, and it can also identify the important attributes, which makes the model interpretable. Experiments show that the algorithm achieves high classification accuracy and good interpretability on unbalanced data.
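A rough sketch of how the two ingredients could fit together is given below. It assumes that RSM refers to the random subspace method and that "binomial undersampling" means keeping each majority sample independently with a fixed probability, so that the number kept follows a binomial distribution. Both readings, and all function names, are assumptions; the abstract does not spell out the procedure.

```python
import random

def binomial_undersample(majority, n_minority, rng):
    """Keep each majority row independently with probability
    n_minority / len(majority), so the kept count is binomial."""
    p = min(1.0, n_minority / len(majority))
    return [row for row in majority if rng.random() < p]

def random_subspace(n_features, subspace_size, rng):
    """Pick a random subset of feature indices for one base classifier."""
    return sorted(rng.sample(range(n_features), subspace_size))

def make_training_views(majority, minority, n_classifiers, subspace_size, seed=0):
    """Build one reduced, rebalanced training set per base classifier."""
    rng = random.Random(seed)
    n_features = len(minority[0])
    views = []
    for _ in range(n_classifiers):
        features = random_subspace(n_features, subspace_size, rng)
        kept = binomial_undersample(majority, len(minority), rng)
        views.append({
            "features": features,
            "majority": [[row[j] for j in features] for row in kept],
            "minority": [[row[j] for j in features] for row in minority],
        })
    return views

# Toy data: 100 majority rows and 10 minority rows with 4 features each.
majority = [[i, i + 1, i + 2, i + 3] for i in range(100)]
minority = [[-i, -i - 1, -i - 2, -i - 3] for i in range(10)]
views = make_training_views(majority, minority, n_classifiers=3, subspace_size=2)
```

Each view would then be passed to its own base classifier, and the predictions combined by voting. Counting how often a feature appears in well-performing views is one plausible route to the attribute importance the abstract mentions.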

Feature Selection in Classification of Blood Sugar Disease Using Particle Swarm Optimization (PSO) on C4.5 Algorithm

Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) ◽

10.29207/resti.v4i3.1881 ◽

2020 ◽

Vol 4 (3) ◽

pp. 569-575

Author(s):

Dwi Meylitasari Tarigan ◽

Dian Palupi Rini ◽

Samsuryadi

Keyword(s):

Data Mining ◽

Particle Swarm Optimization ◽

Blood Sugar ◽

Blood Sugar Level ◽

Public Awareness ◽

Particle Swarm ◽

Data Mining Technique ◽

Swarm Optimization ◽

C4.5 Algorithm

Diabetes Mellitus (DM) is a disease caused by blood sugar levels rising above the maximum limit. Food with uncontrolled sugar content can cause a drastic increase in blood sugar level. Efforts are needed to raise public awareness of controlling blood sugar and of the risks of elevated blood sugar, so that preventive and early detection measures can be determined. Data mining is one application of information technology in the health sector that is widely used to support decisions in predicting and diagnosing diseases. This research aims to optimize feature selection for classification with the C4.5 algorithm using Particle Swarm Optimization (PSO) to detect blood sugar level in patients. The dataset describes the effect of physical activity on blood sugar level at H. Abdul Manan Simatupang Kisaran Regional Public Hospital and consists of 42 records with 10 attributes. The results show that PSO increased the accuracy of C4.5 from 86% to 95% and the AUC value from 0.917 to 0.950. PSO selected 7 of the 10 attributes for predicting blood sugar level. The authors conclude that C4.5 combined with PSO can provide the best accuracy for detecting blood sugar levels.
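PSO-based feature selection is commonly implemented as a binary PSO in which each particle is a 0/1 mask over the attributes. The sketch below shows that general technique under stated assumptions: `toy_fitness` stands in for the cross-validated C4.5 accuracy that the study would use, the `USEFUL` set is invented, and the swarm parameters are arbitrary. None of these come from the paper.

```python
import math
import random

USEFUL = {0, 2, 3, 5, 6, 8, 9}  # hypothetical: 7 informative attributes out of 10

def toy_fitness(mask):
    """Stand-in for classifier accuracy: reward useful attributes, penalise the rest."""
    hits = sum(1 for j, bit in enumerate(mask) if bit and j in USEFUL)
    noise = sum(1 for j, bit in enumerate(mask) if bit and j not in USEFUL)
    return hits - 0.5 * noise

def binary_pso(fitness, n_features=10, n_particles=8, n_iter=30, seed=0):
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration weights (arbitrary choices)
    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(n_features):
                vel[i][j] = (w * vel[i][j]
                             + c1 * rng.random() * (pbest[i][j] - pos[i][j])
                             + c2 * rng.random() * (gbest[j] - pos[i][j]))
                # A sigmoid turns the velocity into the probability of selecting attribute j
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))
                pos[i][j] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

best_mask, best_fit = binary_pso(toy_fitness)
selected = [j for j, bit in enumerate(best_mask) if bit]
```

In a real experiment the fitness call would train and evaluate a decision tree on the masked attributes, which is far more expensive than this toy function, so the swarm size and iteration count matter.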

Unbalanced Sequential Data Classification using Extreme Outlier Elimination and Sampling Techniques

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch005 ◽

2012 ◽

pp. 83-93 ◽

Cited By ~ 1

Author(s):

T. Maruthi Padmaja ◽

Raju S. Bapi ◽

P. Radha Krishna

Keyword(s):

Sampling Technique ◽

Sequential Data ◽

Minority Class ◽

Class Prediction ◽

Sequence Patterns ◽

Nearest Neighbour Classifier ◽

Minority Regions ◽

Extreme Outlier ◽

F Measure ◽

Hybrid Sampling

Predicting minority class sequence patterns from noisy and unbalanced sequential datasets is a challenging task. To solve this problem, we propose a new approach called extreme outlier elimination and hybrid sampling technique. We use the k Reverse Nearest Neighbors (kRNNs) concept as a data cleaning method for eliminating extreme outliers in minority regions. A hybrid sampling technique, which combines SMOTE to oversample the minority class sequences with random undersampling of the majority class sequences, is used to improve minority class prediction. The method was evaluated in terms of minority class precision, recall, and F-measure on a synthetically simulated, highly overlapped sequential dataset named Hill-Valley. We conducted the experiments with a k-Nearest Neighbour classifier and compared the performance of our approach against a simple hybrid sampling technique. Results indicate that our approach does not sacrifice one class in favor of the other, but produces high predictions for both fraud and non-fraud classes.
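The reverse-nearest-neighbour idea can be sketched briefly: a point's k reverse nearest neighbours are the points that list it among their own k nearest neighbours, and a point that nobody lists is a candidate outlier. The threshold `min_reverse`, the use of Euclidean distance on fixed-length vectors, and the function names are assumptions made for illustration; the chapter's exact criterion for an "extreme" outlier is not stated in the abstract.

```python
def k_nearest(points, i, k):
    """Indices of the k points closest to points[i] (squared Euclidean distance)."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
    )
    return order[:k]

def reverse_neighbour_counts(points, k):
    """counts[j] = number of points that have points[j] among their k nearest neighbours."""
    counts = [0] * len(points)
    for i in range(len(points)):
        for j in k_nearest(points, i, k):
            counts[j] += 1
    return counts

def drop_extreme_outliers(points, k=2, min_reverse=1):
    """Remove points with fewer than min_reverse reverse nearest neighbours."""
    counts = reverse_neighbour_counts(points, k)
    return [p for p, c in zip(points, counts) if c >= min_reverse]

# Toy minority region: a tight cluster plus one far-away point.
minority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [9.0, 9.0]]
cleaned = drop_extreme_outliers(minority)
```

Here the far-away point has neighbours of its own, but no other point counts it as a neighbour, so it is removed before any oversampling step would interpolate towards it.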

An Uncertainty-Based Model for Optimized Multi-Label Classification

Advances in Computational Intelligence and Robotics - Handbook of Research on Swarm Intelligence in Engineering ◽

10.4018/978-1-4666-8291-7.ch002 ◽

2015 ◽

pp. 40-73

Author(s):

J. Anuradha ◽

B. K. Tripathy

Keyword(s):

Data Mining ◽

Particle Swarm Optimization ◽

Rough Set ◽

Fuzzy Model ◽

Pso Algorithm ◽

Data Sets ◽

Swarm Optimization ◽

Weak Points ◽

The Individual

The data used in real-world applications are uncertain and vague. Several models have been put forth to handle such data efficiently. Individual models have been found to have both strong and weak points, so efforts have been made to combine them into hybrid models that capitalize on the strong points of their constituents. In 1990, Dubois and Prade combined rough sets and fuzzy sets to develop two models, of which the rough fuzzy model is a popular one, used in many fields to handle uncertainty-based data sets. Particle Swarm Optimization (PSO) combined with the rough fuzzy model is expected to produce optimized solutions. Multi-label classification, in the context of data mining, deals with situations where an object or a set of objects can be assigned to multiple classes. In this chapter, the authors present a rough fuzzy PSO algorithm that classifies multi-label data sets, and establish its efficiency and superiority through experimental analysis.

A Novel Synthetic Over-Sampling Technique for Imbalanced Classification of Gene Expressions Using Autoencoders and Swarm Optimization

AI 2018: Advances in Artificial Intelligence - Lecture Notes in Computer Science ◽

10.1007/978-3-030-03991-2_55 ◽

2018 ◽

pp. 603-615

Author(s):

Maisa Daoud ◽

Michael Mayo

Keyword(s):

Sampling Technique ◽

Gene Expressions ◽

Swarm Optimization ◽

Imbalanced Classification

Classification of Quranic topics based on imbalanced classification

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v22.i2.pp678-687 ◽

2021 ◽

Vol 22 (2) ◽

pp. 678

Author(s):

Bassam Sulaiman Arkok ◽

Akram Mohammed Zeki

Keyword(s):

Matthews Correlation Coefficient ◽

Sampling Technique ◽

Poor Performance ◽

Classification Performance ◽

Classification Techniques ◽

Imbalanced Classification ◽

Traditional Classification ◽

Imbalanced Classes ◽

F Measure

Imbalanced classification techniques have been applied widely in the field of data mining. They are used to classify imbalanced classes, that is, classes with unequal numbers of samples. The problem with imbalanced classes is that classification performance favors the class with more samples, while the class with few samples is classified poorly. This problem can occur in Qur'anic classification because topics have different numbers of verses. Many studies have classified Qur'anic verses using traditional classification, but none has classified Qur'anic topics using imbalanced classification techniques. This paper therefore applies the synthetic minority over-sampling technique (SMOTE), random over-sampling (ROS), and random under-sampling (RUS) to classify imbalanced Qur'anic topics. Several metrics are used to evaluate the experimental results: sensitivity/recall, specificity, overall accuracy, F-measure, G-mean, and the Matthews correlation coefficient (MCC). The results show that Qur'anic classification performance improved when imbalanced classification techniques were applied.
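All of the metrics listed here can be computed from a binary confusion matrix. The sketch below uses the standard definitions; the counts in the example are invented for illustration and do not come from the paper.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Standard binary metrics, with the minority class treated as positive.
    Assumes no denominator is zero (every class is present and predicted)."""
    sensitivity = tp / (tp + fn)           # recall on the minority class
    specificity = tn / (tn + fp)           # recall on the majority class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "g_mean": g_mean,
        "mcc": mcc,
    }

# Hypothetical confusion matrix: 10 minority and 90 majority samples.
m = imbalance_metrics(tp=8, fn=2, fp=5, tn=85)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this example accuracy is 0.93 while the F-measure is about 0.70 and MCC about 0.66, which shows why accuracy alone is a weak summary on imbalanced classes.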

Conversion of adverse data corpus to shrewd output using sampling metrics

Visual Computing for Industry Biomedicine and Art ◽

10.1186/s42492-020-00055-9 ◽

2020 ◽

Vol 3 (1) ◽

Cited By ~ 2

Author(s):

Shahzad Ashraf ◽

Sehrish Saleem ◽

Tauqeer Ahmed ◽

Zeeshan Aslam ◽

Durr Muhammad

Keyword(s):

Learning Algorithm ◽

Class Imbalance ◽

High Accuracy ◽

The Other ◽

Educational Context ◽

Imbalanced Dataset ◽

Minority Class ◽

Confusion Matrices ◽

F Measure

In an imbalanced dataset, at least one class is heavily outnumbered by the others. A machine learning algorithm (classifier) trained on an imbalanced dataset predicts the frequently occurring majority class more often than the rarely occurring minority classes. Training on an imbalanced dataset poses challenges for classifiers; however, applying suitable techniques to reduce class imbalance can enhance classifier performance. In this study, we consider an imbalanced dataset from an educational context. We first examine the shortcomings of classifying an imbalanced dataset. We then apply data-level algorithms for class balancing and compare the performance of classifiers. Performance is measured using quantities derived from the confusion matrices, such as accuracy, precision, recall, and F-measure. The results show that classification on an imbalanced dataset may produce high accuracy but low precision and recall for the minority class. The analysis confirms that both undersampling and oversampling are effective for balancing datasets, with oversampling performing better.

ImbTree: Minority Class Sensitive Weighted Decision Tree for Classification of Unbalanced Data

International Journal of Intelligent Systems and Applications in Engineering ◽

10.18201/ijisae.2021473633 ◽

2021 ◽

Vol 9 (4) ◽

pp. 152-158

Author(s):

Pratikkumar A. Barot ◽

Harikrishna B. Jethva

Keyword(s):

Decision Tree ◽

Unbalanced Data ◽

Minority Class

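Improving the Performance of the SVM Method Using KNN Imputation and K-Means-SMOTE for Classifying Student Graduation at Universitas Bumigora (Peningkatan Kinerja Metode SVM Menggunakan Metode KNN Imputasi dan K-Means-Smote untuk Klasifikasi Kelulusan Mahasiswa Universitas Bumigora)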

Jurnal Teknologi Informasi dan Ilmu Komputer ◽

10.25126/jtiik.2021843428 ◽

2021 ◽

Vol 8 (4) ◽

pp. 713

Author(s):

Hairani Hairani

Keyword(s):

Data Mining ◽

Missing Values ◽

Decision Makers ◽

Drop Out ◽

Student Graduation ◽

Last Stage ◽

Sensitivity Specificity ◽

Incoming Students ◽

F Measure

One of the main problems of Universitas Bumigora is that the ratio between incoming students and students graduating on time is not balanced, which will lower its accreditation assessment in the future. One of the assessment indicators in the accreditation process is the student graduation ratio. Student graduation data are stored in the campus database but have not been fully utilized. These data can reveal the patterns of students who do or do not graduate on time, which helps to minimize the number of students who drop out. In addition, decision makers can more easily make early policies to help students who are at risk of dropping out or of not graduating on time. The solution offered in this research is to use data mining techniques, specifically the SVM method. The purpose of this study is to improve the performance of the SVM method for classifying student graduation at Universitas Bumigora using KNN Imputation (KNNI) and K-Means-SMOTE. The research consists of several stages: collection of student graduation data, pre-processing (handling missing values with KNNI and handling class imbalance with K-Means-SMOTE), and classification with the SVM method. The last stage is testing SVM performance based on accuracy, sensitivity, specificity, and F-measure. In the tests carried out, the integration of KNNI, K-Means-SMOTE, and SVM achieved an accuracy of 83.9%, sensitivity of 81.3%, specificity of 86.6%, and F-measure of 83.5%. The use of KNNI and K-Means-SMOTE can improve the performance of the SVM method on accuracy, sensitivity, specificity, and F-measure.
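The first pre-processing stage, KNN imputation, can be sketched in a few lines. This is a simplified illustration of the general technique rather than the paper's implementation: it draws neighbours only from rows with no missing values, uses unweighted means, and the toy rows are invented.

```python
def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k nearest complete rows,
    where distance is measured on the columns observed in the incomplete row."""
    complete = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]
        nearest = sorted(
            complete,
            key=lambda c: sum((row[j] - c[j]) ** 2 for j in observed),
        )[:k]
        imputed.append([
            v if v is not None else sum(c[j] for c in nearest) / len(nearest)
            for j, v in enumerate(row)
        ])
    return imputed

# Toy records with three numeric attributes; the last one has a missing value.
rows = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.2], [9.0, 9.0, 9.0], [1.0, 2.0, None]]
filled = knn_impute(rows)
```

The missing value is filled from the two similar rows rather than from the distant one, giving about 3.1. The imputed table would then go to the class-balancing step and finally to the classifier.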
