Methodology for eliminating imbalance of image data sets

Author(s):  
Tatyana Biloborodova ◽  
Inna Skarga-Bandurova ◽  
Mark Koverga

The methodology of solving the problem of eliminating class imbalance in image data sets is presented. The proposed methodology includes the stages of image fragment extraction, fragment augmentation, feature extraction, duplication of minority objects, and is based on reinforcement learning technology. The degree of imbalance indicator was used as a measure to determine the imbalance of the data set. An experiment was performed using a set of images of the faces of patients with skin rashes, annotated according to the severity of acne. The main steps of the methodology implementation are considered. The results of the classification showed the feasibility of applying the proposed methodology. The accuracy of classification on test data was 85%, which is 5% higher than the result obtained without the use of the proposed methodology. Key words: class imbalance, unbalanced data set, image fragment extraction, augmentation.

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Jiawei Lian ◽  
Junhong He ◽  
Yun Niu ◽  
Tianze Wang

Purpose The current popular image processing technologies based on convolutional neural network have the characteristics of large computation, high storage cost and low accuracy for tiny defect detection, which is contrary to the high real-time and accuracy, limited computing resources and storage required by industrial applications. Therefore, an improved YOLOv4 named as YOLOv4-Defect is proposed aim to solve the above problems. Design/methodology/approach On the one hand, this study performs multi-dimensional compression processing on the feature extraction network of YOLOv4 to simplify the model and improve the feature extraction ability of the model through knowledge distillation. On the other hand, a prediction scale with more detailed receptive field is added to optimize the model structure, which can improve the detection performance for tiny defects. Findings The effectiveness of the method is verified by public data sets NEU-CLS and DAGM 2007, and the steel ingot data set collected in the actual industrial field. The experimental results demonstrated that the proposed YOLOv4-Defect method can greatly improve the recognition efficiency and accuracy and reduce the size and computation consumption of the model. Originality/value This paper proposed an improved YOLOv4 named as YOLOv4-Defect for the detection of surface defect, which is conducive to application in various industrial scenarios with limited storage and computing resources, and meets the requirements of high real-time and precision.


Author(s):  
Aska E. Mehyadin ◽  
Adnan Mohsin Abdulazeez ◽  
Dathar Abas Hasan ◽  
Jwan N. Saeed

The bird classifier is a system that is equipped with an area machine learning technology and uses a machine learning method to store and classify bird calls. Bird species can be known by recording only the sound of the bird, which will make it easier for the system to manage. The system also provides species classification resources to allow automated species detection from observations that can teach a machine how to recognize whether or classify the species. Non-undesirable noises are filtered out of and sorted into data sets, where each sound is run via a noise suppression filter and a separate classification procedure so that the most useful data set can be easily processed. Mel-frequency cepstral coefficient (MFCC) is used and tested through different algorithms, namely Naïve Bayes, J4.8 and Multilayer perceptron (MLP), to classify bird species. J4.8 has the highest accuracy (78.40%) and is the best. Accuracy and elapsed time are (39.4 seconds).


2020 ◽  
Author(s):  
Ying Bi ◽  
Bing Xue ◽  
Mengjie Zhang

© Springer International Publishing AG, part of Springer Nature 2018. Feature extraction is an essential process for image data dimensionality reduction and classification. However, feature extraction is very difficult and often requires human intervention. Genetic Programming (GP) can achieve automatic feature extraction and image classification but the majority of existing methods extract low-level features from raw images without any image-related operations. Furthermore, the work on the combination of image-related operators/descriptors in GP for feature extraction and image classification is limited. This paper proposes a multi-layer GP approach (MLGP) to performing automatic high-level feature extraction and classification. A new program structure, a new function set including a number of image operators/descriptors and two region detectors, and a new terminal set are designed in this approach. The performance of the proposed method is examined on six different data sets of varying difficulty and compared with five GP based methods and 42 traditional image classification methods. Experimental results show that the proposed method achieves better or comparable performance than these baseline methods. Further analysis on the example programs evolved by the proposed MLGP method reveals the good interpretability of MLGP and gives insight into how this method can effectively extract high-level features for image classification.


2021 ◽  
Author(s):  
◽  
Urvesh Bhowan

<p>In classification,machine learning algorithms can suffer a performance bias when data sets are unbalanced. Binary data sets are unbalanced when one class is represented by only a small number of training examples (called the minority class), while the other class makes up the rest (majority class). In this scenario, the induced classifiers typically have high accuracy on the majority class but poor accuracy on the minority class. As the minority class typically represents the main class-of-interest in many real-world problems, accurately classifying examples from this class can be at least as important as, and in some cases more important than, accurately classifying examples from the majority class. Genetic Programming (GP) is a promising machine learning technique based on the principles of Darwinian evolution to automatically evolve computer programs to solve problems. While GP has shown much success in evolving reliable and accurate classifiers for typical classification tasks with balanced data, GP, like many other learning algorithms, can evolve biased classifiers when data is unbalanced. This is because traditional training criteria such as the overall success rate in the fitness function in GP, can be influenced by the larger number of examples from the majority class.  This thesis proposes a GP approach to classification with unbalanced data. The goal is to develop new internal cost-adjustment techniques in GP to improve classification performances on both the minority class and the majority class. By focusing on internal cost-adjustment within GP rather than the traditional databalancing techniques, the unbalanced data can be used directly or "as is" in the learning process. This removes any dependence on a sampling algorithm to first artificially re-balance the input data prior to the learning process. This thesis shows that by developing a number of new methods in GP, genetic program classifiers with good classification ability on the minority and the majority classes can be evolved. This thesis evaluates these methods on a range of binary benchmark classification tasks with unbalanced data. This thesis demonstrates that unlike tasks with multiple balanced classes where some dynamic (non-static) classification strategies perform significantly better than the simple static classification strategy, either a static or dynamic strategy shows no significant difference in the performance of evolved GP classifiers on these binary tasks. For this reason, the rest of the thesis uses this static classification strategy.  This thesis proposes several new fitness functions in GP to perform cost adjustment between the minority and the majority classes, allowing the unbalanced data sets to be used directly in the learning process without sampling. Using the Area under the Receiver Operating Characteristics (ROC) curve (also known as the AUC) to measure how well a classifier performs on the minority and majority classes, these new fitness functions find genetic program classifiers with high AUC on the tasks on both classes, and with fast GP training times. These GP methods outperform two popular learning algorithms, namely, Naive Bayes and Support Vector Machines on the tasks, particularly when the level of class imbalance is large, where both algorithms show biased classification performances.  This thesis also proposes a multi-objective GP (MOGP) approach which treats the accuracies of the minority and majority classes separately in the learning process. The MOGP approach evolves a good set of trade-off solutions (a Pareto front) in a single run that perform as well as, and in some cases better than, multiple runs of canonical single-objective GP (SGP). In SGP, individual genetic program solutions capture the performance trade-off between the two objectives (minority and majority class accuracy) using an ROC curve; whereas in MOGP, this requirement is delegated to multiple genetic program solutions along the Pareto front.  This thesis also shows how multiple Pareto front classifiers can be combined into an ensemble where individual members vote on the class label. Two ensemble diversity measures are developed in the fitness functions which treat the diversity on both the minority and the majority classes as equally important; otherwise, these measures risk being biased toward the majority class. The evolved ensembles outperform their individual members on the tasks due to good cooperation between members.  This thesis further improves the ensemble performances by developing a GP approach to ensemble selection, to quickly find small groups of individuals that cooperate very well together in the ensemble. The pruned ensembles use much fewer individuals to achieve performances that are as good as larger (unpruned) ensembles, particularly on tasks with high levels of class imbalance, thereby reducing the total time to evaluate the ensemble.</p>


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Gurman Gill ◽  
Reinhard R. Beichel

Dynamic and longitudinal lung CT imaging produce 4D lung image data sets, enabling applications like radiation treatment planning or assessment of response to treatment of lung diseases. In this paper, we present a 4D lung segmentation method that mutually utilizes all individual CT volumes to derive segmentations for each CT data set. Our approach is based on a 3D robust active shape model and extends it to fully utilize 4D lung image data sets. This yields an initial segmentation for the 4D volume, which is then refined by using a 4D optimal surface finding algorithm. The approach was evaluated on a diverse set of 152 CT scans of normal and diseased lungs, consisting of total lung capacity and functional residual capacity scan pairs. In addition, a comparison to a 3D segmentation method and a registration based 4D lung segmentation approach was performed. The proposed 4D method obtained an average Dice coefficient of0.9773±0.0254, which was statistically significantly better (pvalue≪0.001) than the 3D method (0.9659±0.0517). Compared to the registration based 4D method, our method obtained better or similar performance, but was 58.6% faster. Also, the method can be easily expanded to process 4D CT data sets consisting of several volumes.


Author(s):  
Brendan Juba ◽  
Hai S. Le

Practitioners of data mining and machine learning have long observed that the imbalance of classes in a data set negatively impacts the quality of classifiers trained on that data. Numerous techniques for coping with such imbalances have been proposed, but nearly all lack any theoretical grounding. By contrast, the standard theoretical analysis of machine learning admits no dependence on the imbalance of classes at all. The basic theorems of statistical learning establish the number of examples needed to estimate the accuracy of a classifier as a function of its complexity (VC-dimension) and the confidence desired; the class imbalance does not enter these formulas anywhere. In this work, we consider the measures of classifier performance in terms of precision and recall, a measure that is widely suggested as more appropriate to the classification of imbalanced data. We observe that whenever the precision is moderately large, the worse of the precision and recall is within a small constant factor of the accuracy weighted by the class imbalance. A corollary of this observation is that a larger number of examples is necessary and sufficient to address class imbalance, a finding we also illustrate empirically.


2016 ◽  
Vol 12 (4) ◽  
pp. 448-476 ◽  
Author(s):  
Amir Hosein Keyhanipour ◽  
Behzad Moshiri ◽  
Maryam Piroozmand ◽  
Farhad Oroumchian ◽  
Ali Moeini

Purpose Learning to rank algorithms inherently faces many challenges. The most important challenges could be listed as high-dimensionality of the training data, the dynamic nature of Web information resources and lack of click-through data. High dimensionality of the training data affects effectiveness and efficiency of learning algorithms. Besides, most of learning to rank benchmark datasets do not include click-through data as a very rich source of information about the search behavior of users while dealing with the ranked lists of search results. To deal with these limitations, this paper aims to introduce a novel learning to rank algorithm by using a set of complex click-through features in a reinforcement learning (RL) model. These features are calculated from the existing click-through information in the data set or even from data sets without any explicit click-through information. Design/methodology/approach The proposed ranking algorithm (QRC-Rank) applies RL techniques on a set of calculated click-through features. QRC-Rank is as a two-steps process. In the first step, Transformation phase, a compact benchmark data set is created which contains a set of click-through features. These feature are calculated from the original click-through information available in the data set and constitute a compact representation of click-through information. To find most effective click-through feature, a number of scenarios are investigated. The second phase is Model-Generation, in which a RL model is built to rank the documents. This model is created by applying temporal difference learning methods such as Q-Learning and SARSA. Findings The proposed learning to rank method, QRC-rank, is evaluated on WCL2R and LETOR4.0 data sets. Experimental results demonstrate that QRC-Rank outperforms the state-of-the-art learning to rank methods such as SVMRank, RankBoost, ListNet and AdaRank based on the precision and normalized discount cumulative gain evaluation criteria. The use of the click-through features calculated from the training data set is a major contributor to the performance of the system. Originality/value In this paper, we have demonstrated the viability of the proposed features that provide a compact representation for the click through data in a learning to rank application. These compact click-through features are calculated from the original features of the learning to rank benchmark data set. In addition, a Markov Decision Process model is proposed for the learning to rank problem using RL, including the sets of states, actions, rewarding strategy and the transition function.


2019 ◽  
Vol 16 (1) ◽  
pp. 155-178 ◽  
Author(s):  
Kristina Andric ◽  
Damir Kalpic ◽  
Zoran Bohacek

In this paper we investigate the role of sample size and class distribution in credit risk assessments, focusing on real life imbalanced data sets. Choosing the optimal sample is of utmost importance for the quality of predictive models and has become an increasingly important topic with the recent advances in automating lending decision processes and the ever growing richness in data collected by financial institutions. To address the observed research gap, a large-scale experimental evaluation of real-life data sets of different characteristics was performed, using several classification algorithms and performance measures. Results indicate that various factors play a role in determining the optimal class distribution, namely the performance measure, classification algorithm and data set characteristics. The study also provides valuable insight on how to design the training sample to maximize prediction performance and the suitability of using different classification algorithms by assessing their sensitivity to class imbalance and sample size.


2021 ◽  
Vol 5 (1) ◽  
pp. 75-91
Author(s):  
Sri Astuti Thamrin ◽  
Dian Sidik ◽  
Hedi Kuswanto ◽  
Armin Lawi ◽  
Ansariadi Ansariadi

The accuracy of the data class is very important in classification with a machine learning approach. The more accurate the existing data sets and classes, the better the output generated by machine learning. In fact, classification can experience imbalance class data in which each class does not have the same portion of the data set it has. The existence of data imbalance will affect the classification accuracy. One of the easiest ways to correct imbalanced data classes is to balance it. This study aims to explore the problem of data class imbalance in the medium case dataset and to address the imbalance of data classes as well. The Synthetic Minority Over-Sampling Technique (SMOTE) method is used to overcome the problem of class imbalance in obesity status in Indonesia 2013 Basic Health Research (RISKESDAS). The results show that the number of obese class (13.9%) and non-obese class (84.6%). This means that there is an imbalance in the data class with moderate criteria. Moreover, SMOTE with over-sampling 600% can improve the level of minor classes (obesity). As consequence, the classes of obesity status balanced. Therefore, SMOTE technique was better compared to without SMOTE in exploring the obesity status of Indonesia RISKESDAS 2013.


Sign in / Sign up

Export Citation Format

Share Document