Conversion of adverse data corpus to shrewd output using sampling metrics

Author(s):  
Shahzad Ashraf ◽  
Sehrish Saleem ◽  
Tauqeer Ahmed ◽  
Zeeshan Aslam ◽  
Durr Muhammad

Abstract
An imbalanced dataset is one in which at least one class is heavily outnumbered by the others. A machine learning algorithm (classifier) trained with an imbalanced dataset predicts the majority class (frequently occurring) more often than the minority classes (rarely occurring). Training with an imbalanced dataset poses challenges for classifiers; however, applying suitable techniques for reducing class imbalance can enhance classifiers’ performance. In this study, we consider an imbalanced dataset from an educational context. Initially, we examine the shortcomings of classifying an imbalanced dataset. Then, we apply data-level algorithms for class balancing and compare the performance of classifiers. The performance of the classifiers is measured using the underlying information in their confusion matrices, such as accuracy, precision, recall, and F-measure. The results show that classification with an imbalanced dataset may produce high accuracy but low precision and recall for the minority class. The analysis confirms that undersampling and oversampling are effective for balancing datasets, but the latter dominates.
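The high-accuracy/low-recall trade-off described in the abstract can be seen directly from a binary confusion matrix. A minimal sketch with illustrative, made-up counts, computing the metrics the abstract names:

```python
# Illustrative counts: 20 minority (positive) and 980 majority (negative)
# instances, with most minority instances misclassified.
tp, fn = 5, 15    # minority instances: correctly / incorrectly classified
fp, tn = 10, 970  # majority instances: incorrectly / correctly classified

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

# Accuracy looks excellent (0.975) even though minority-class
# precision (0.333) and recall (0.250) are poor.
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F={f_measure:.3f}")
```

With these counts, accuracy is 97.5% while minority recall is only 25%, which is exactly why accuracy alone is a misleading metric on imbalanced data.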

Classification is a supervised learning task that categorizes items into groups on the basis of class labels. Algorithms are trained with labeled datasets to accomplish the task of classification, and in this process the dataset plays an important role. If the instances of one class (the majority class) far outnumber the instances of another class (the minority class), so that a classifier finds it hard to learn the characteristics of the minority class, the dataset is termed an imbalanced dataset. Such datasets raise the problem of biased prediction or misclassification in the real world: models trained on them may achieve very high accuracy during training but, being unfamiliar with minority class instances, fail to predict the minority class. A survey of various techniques proposed by researchers for handling imbalanced data is presented, and a comparison of the techniques based on F-measure is identified and discussed.
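The two simplest data-level remedies covered by surveys of this kind are random oversampling (duplicate minority instances) and random undersampling (discard majority instances). A minimal sketch; the function names and toy data are hypothetical, not from any surveyed paper:

```python
import random

def random_oversample(minority, majority, seed=0):
    """Duplicate randomly chosen minority instances until both classes match."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def random_undersample(minority, majority, seed=0):
    """Discard randomly chosen majority instances until both classes match."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))

minority = [("pos", i) for i in range(10)]
majority = [("neg", i) for i in range(90)]

over_min, over_maj = random_oversample(minority, majority)
under_min, under_maj = random_undersample(minority, majority)
print(len(over_min), len(over_maj))    # 90 90
print(len(under_min), len(under_maj))  # 10 10
```

Oversampling keeps all the data but risks overfitting to duplicated minority points; undersampling avoids duplication but throws information away, which is why synthetic methods such as SMOTE were developed.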


Author(s):  
Sreeja N. K.

Learning a classifier from imbalanced data is one of the most challenging research problems. Data imbalance occurs when the number of instances belonging to one class is much less than the number of instances belonging to the other class. A standard classifier is biased towards the majority class and therefore misclassifies the minority class instances. Minority class instances may be regarded as rare events or unusual patterns that could potentially have a negative impact on society; therefore, detection of such events is considered significant. This chapter proposes a FireWorks-based Hybrid ReSampling (FWHRS) algorithm to resample imbalanced data. It is used with a Weighted Pattern Matching based classifier (PMC+) for classification. FWHRS-PMC+ was evaluated on 44 imbalanced binary datasets. Experiments reveal that FWHRS-PMC+ is effective in the classification of imbalanced data. Empirical results were validated using non-parametric statistical tests.


Author(s):  
S. Priya ◽  
R. Annie Uthra

Abstract
In present times, data science has become popular for supporting and improving the decision-making process. With the broad applicability of data streaming, class imbalance and concept drift have become crucial learning problems. The advent of deep learning (DL) models has proved useful for the classification of concept drift in data streaming applications. This paper presents an effective class imbalance with concept drift detection (CIDD) approach using Adadelta optimizer-based deep neural networks (ADODNN), named the CIDD-ADODNN model, for the classification of highly imbalanced streaming data. The presented model involves four processes, namely preprocessing, class imbalance handling, concept drift detection, and classification. The proposed model uses the adaptive synthetic (ADASYN) technique for handling class imbalance, which utilizes a weighted distribution for diverse minority class examples based on their level of difficulty in learning. Next, a drift detection technique called adaptive sliding window (ADWIN) is employed to detect the existence of concept drift. The ADODNN model is then utilized for the classification process. To increase the classifier performance of the DNN model, ADO-based hyperparameter tuning takes place to determine the optimal parameters of the DNN model. The performance of the presented model is evaluated using three streaming datasets, namely the intrusion detection (NSL-KDDCup) dataset, the Spam dataset, and the Chess dataset. A detailed comparative results analysis verified the superior performance of the presented model, which obtained a maximum accuracy of 0.9592, 0.9320, and 0.7646 on the applied KDDCup, Spam, and Chess datasets, respectively.
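ADASYN's weighted distribution works by giving each minority instance a "difficulty" score, the fraction of majority points among its k nearest neighbours, and allotting synthetic samples in proportion to it. A toy 1-D sketch of that allocation step only (the function name and sample values are hypothetical, not the authors' implementation):

```python
def adasyn_counts(minority, majority, k=5, beta=1.0):
    """How many synthetic points ADASYN would allot per minority
    instance: harder-to-learn instances (those with more majority-class
    neighbours) receive more synthetics. 1-D features for brevity."""
    data = [(x, 0) for x in minority] + [(x, 1) for x in majority]
    G = (len(majority) - len(minority)) * beta  # total synthetics to create
    ratios = []
    for x in minority:
        # k nearest neighbours, skipping the point itself (distance 0)
        neigh = sorted(data, key=lambda p: abs(p[0] - x))[1:k + 1]
        ratios.append(sum(label for _, label in neigh) / k)
    total = sum(ratios) or 1.0
    return [round(r / total * G) for r in ratios]

# The isolated minority point at 5.0 sits inside the majority cluster,
# so it is "harder" and receives the most synthetics.
counts = adasyn_counts([0.0, 0.3, 5.0],
                       [3.5, 4.0, 4.5, 5.2, 5.5, 6.0, 6.5, 7.0, 7.5],
                       k=3)
print(counts)  # [2, 2, 3]
```

The actual ADASYN algorithm then generates each synthetic by SMOTE-style interpolation toward a minority neighbour; this sketch stops at the per-instance allocation, which is the part the abstract highlights.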


2020 ◽  
Vol 8 (2) ◽  
pp. 89-93 ◽  
Author(s):  
Hairani Hairani ◽  
Khurniawan Eko Saputro ◽  
Sofiansyah Fadli

The occurrence of an imbalanced class in a dataset causes the classification results to tend toward the class with the largest amount of data (the majority class). A sampling method is needed to balance the minority class (positive class) so that the class distribution becomes balanced, leading to better classification results. This study was conducted to overcome the imbalanced class problem on the Pima Indians diabetes dataset using k-means-SMOTE. The dataset has 268 instances of the positive class (minority class) and 500 instances of the negative class (majority class). The classification was done by comparing C4.5, SVM, and naïve Bayes while implementing k-means-SMOTE in data sampling. Using k-means-SMOTE, the SVM classification method achieved the highest accuracy and sensitivity of 82% and 77%, respectively, while the naïve Bayes method produced the highest specificity of 89%.
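At the core of SMOTE (and of k-means-SMOTE, which mainly changes where neighbours are drawn from by clustering first) is linear interpolation between a minority instance and one of its minority-class neighbours. A minimal sketch of that interpolation; the helper name is hypothetical:

```python
import random

def smote_synthetic(x, neighbor, rng):
    """One SMOTE synthetic sample: a random point on the line segment
    between a minority instance and a same-class neighbour."""
    gap = rng.random()  # in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
synth = smote_synthetic([1.0, 2.0], [3.0, 4.0], rng)
# every coordinate of the synthetic lies between the two parents
print(synth)
```

Because the synthetic point is strictly between two real minority instances, SMOTE adds variety to the minority region instead of duplicating points, which is the advantage over plain random oversampling.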


Author(s):  
Ali Jebelli ◽  
Rafiq Ahmad

<p>Agricultural products, as essential commodities, are among the most sought-after items in superstores. Barcodes are usually utilized to classify and regulate the price of products such as ornamental flowers in such stores. However, the use of barcodes on fragile agricultural products such as ornamental flowers can damage them and shorten their lifespan. Moreover, barcoding is time-consuming and costly and may lead to the production of massive waste, damage to the environment, and the admittance of chemical materials into food products that can affect human health. Consequently, we aimed to design a classifier robot that recognizes ornamental flowers from product images at different times and under different surrounding conditions. It can increase the speed and accuracy of distinguishing and classifying the products, lower the pricing time, and increase product lifetime, since the products no longer need to be moved or repositioned. Based on the datasets stored in the robot's database, it is possible to identify and introduce a product in different colors and shapes. Because a standard, small database tailored to the robot's needs is prepared, the robot can be trained in a short time (less than five minutes) without the need for an Internet connection or a large hard drive to store the data. Furthermore, by dividing each input photo into ten different sections, the system can detect decorative flowers very quickly, with a high accuracy of 97%, simultaneously in several different images and under different conditions, angles, and environments, even alongside other objects such as vases, without the need for a separate detection system.</p>


2019 ◽  
Vol 8 (4) ◽  
pp. 4039-4042

Recently, learning from unbalanced data has emerged as a predominant problem in several applications, and since multilabel classification is an evolving data mining task, learning from unbalanced multilabel data is being examined. However, the available SMOTE-based algorithms make use of the same sampling rate for every instance of the minority class, which leads to sub-optimal performance. To deal with this problem, a new Particle Swarm Optimization based SMOTE (PSOSMOTE) algorithm is proposed. The PSOSMOTE algorithm employs diverse sampling rates for multiple minority class instances and finds the fusion of optimal sampling rates to deal with the classification of unbalanced datasets. Then, a Bayesian technique is combined with random forest for multilabel classification (BARF-MLC) to address the inherent label dependencies among samples; it is compared with classifiers such as ML-FOREST, Predictive Clustering Trees (PCT), and the Hierarchy of Multi Label Classifiers (HOMER) using metrics including precision, recall, F-measure, accuracy, and error rate.


Author(s):  
Hartono Hartono ◽  
Erianto Ongko ◽  
Yeni Risyani

<span>A classification process can contain class imbalance problems: besides the uneven distribution of instances, which causes poor performance, overlapping also causes performance degradation. This paper proposes a method that combines feature selection and the hybrid approach redefinition (HAR) method to handle class imbalance and overlapping for multi-class imbalanced data. HAR is a hybrid ensemble method for handling the class imbalance problem. The main contribution of this work is to produce a new method that can overcome the problems of class imbalance and overlapping in the multi-class imbalance problem. This method must give better results in terms of classifier performance and overlap degrees in multi-class problems. This is achieved by improving an ensemble learning algorithm and a preprocessing technique in HAR <span>using minimizing overlapping selection under SMOTE (MOSS). MOSS is known as a very popular feature selection method for handling overlapping. To validate the accuracy of the proposed method, this research uses augmented R-value, mean AUC, mean F-measure, mean G-mean, and mean precision. The performance of the model is evaluated against the hybrid method (MBP+CGE), a popular method for handling class imbalance and overlapping for multi-class imbalanced data. It is found that the proposed method is superior in classifier performance, as indicated by better mean AUC, F-measure, G-mean, and precision.</span></span>


2020 ◽  
Vol 31 (2) ◽  
pp. 25
Author(s):  
Liqaa M. Shoohi ◽  
Jamila H. Saud

Classification of imbalanced data is an important issue. Many algorithms have been developed for classification, such as back-propagation (BP) neural networks, decision trees, Bayesian networks, etc., and have been used repeatedly in many fields. These algorithms suffer from the problem of imbalanced data, where some classes have far more instances than others. Imbalanced data result in poor performance and a bias toward one class at the expense of the others. In this paper, we propose three techniques based on the over-sampling (O.S.) approach for processing an imbalanced dataset, redistributing it, and converting it into a balanced dataset. These techniques are the Improved Synthetic Minority Over-Sampling Technique (Improved SMOTE), Borderline-SMOTE + Imbalance Ratio (IR), and Adaptive Synthetic Sampling (ADASYN) + IR algorithms. These techniques generate synthetic samples for the minority class to achieve balance between the minority and majority classes and then calculate the IR between the classes. Experimental results show that the Improved SMOTE algorithm outperforms the Borderline-SMOTE + IR and ADASYN + IR algorithms because it achieves a higher balance between the minority and majority classes.
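The imbalance ratio (IR) that these techniques compute is simply the size of the largest class divided by the size of the smallest; a balanced dataset has IR close to 1. A minimal sketch (the function name and counts are illustrative):

```python
def imbalance_ratio(class_counts):
    """IR = instances in the largest class / instances in the smallest class."""
    return max(class_counts.values()) / min(class_counts.values())

print(imbalance_ratio({"negative": 900, "positive": 100}))  # 9.0
print(imbalance_ratio({"negative": 500, "positive": 500}))  # 1.0
```

After resampling, recomputing the IR gives a direct check of how close each technique brought the dataset to balance, which is how the comparison in this paper is framed.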


1977 ◽  
Vol 4 (2) ◽  
pp. 49-50 ◽  
Author(s):  
E. K. Kharadze ◽  
R. A. Bartaya

The observational base of our investigation consists of a two-dimensional MK classification of about 11,000 stars in the 42 Kapteyn Areas situated at galactic latitudes from −17° up to +72°, and 200 Ap and Am stars discovered in the same Kapteyn Areas. The dispersion of the applied objective-prism spectra is 160 Å per mm. The data are of high accuracy, close to the Michigan level, and of a uniformity that makes them reliable. The limit is close to the 12th photographic magnitude. A general conclusion of undoubted importance is stated: the galactic concentration of dwarfs is closer than had been assumed until now; on the other hand, the giants are not so closely concentrated toward the galactic plane as has been accepted.


1997 ◽  
Vol 6 (1) ◽  
pp. 57-62 ◽  
Author(s):  
Wayne O. Olsen ◽  
Terri L. Pratt ◽  
Christopher D. Bauch

Multichannel ABR recordings for 30 otoneurologic patients were reviewed independently by three audiologists to assess interjudge consistency in determining absolute latencies and overall interpretation of ABR results. Four months later, the tracings were reviewed a second time to evaluate intrajudge consistency in interpretation of ABR waveforms. Interjudge agreement in marking latencies for waves I, III, and V within 0.2 ms was on the order of 90% or better. Intrajudge consistency was slightly higher. Only rarely did inter- or intrajudge differences in latency measurements exceed 0.3 ms. Agreement in overall interpretation of ABR results as "normal" or "abnormal" was unanimous for 90% of the patients. Across pairs of judges, the agreement for "normal" and "abnormal" classification of the ABR tracings was 97%. Intrajudge consistency for "normal" and "abnormal" categorization of the ABR results was 100% for one judge, 97% for the other two judges.

