Research on Feature Selection for Imbalanced Problem from Fault Diagnosis on Gear

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.466-467.886 ◽

2012 ◽

Vol 466-467 ◽

pp. 886-890

Author(s):

Tian Yu Liu

Keyword(s):

Fault Diagnosis ◽

Class Imbalance ◽

Classification Performance ◽

Class Imbalance Problem ◽

Data Set ◽

Prediction Ability ◽

Imbalance Problem ◽

Gear Fault ◽

Balanced Distribution ◽

Negative Effect

Defects are one of the important factors that cause gear faults, so it is significant to study defect diagnosis technology for gears. The class imbalance problem is encountered in fault diagnosis, and it has a seriously negative effect on the performance of classifiers that assume a balanced distribution of classes. Though it is critical, few previous works have paid attention to the class imbalance problem in the fault diagnosis of gears. In imbalanced problems, some features are redundant or even irrelevant, and these features hurt the generalization performance of learning machines. Here we propose PREE (Prediction Risk based feature selection for EasyEnsemble) to solve the class imbalance problem in the fault diagnosis of gears. Experimental results on UCI data sets and a gear data set show that PREE improves the classification performance and prediction ability on imbalanced data sets.
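The EasyEnsemble idea that PREE builds on can be illustrated with a minimal sketch of its sampling step: the majority class is undersampled several times, and each draw is paired with the full minority class to form a balanced training subset. This is not the PREE algorithm itself (the prediction-risk feature selection and the per-subset classifiers are omitted), and the function name, class labels, and sample counts below are hypothetical.

```python
import random


def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Build n_subsets balanced training sets in the EasyEnsemble style.

    Each set pairs a random majority-class sample (drawn without
    replacement, same size as the minority class) with all minority samples.
    """
    if len(minority) > len(majority):
        raise ValueError("minority class must not be larger than majority class")
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        sampled_majority = rng.sample(majority, len(minority))
        subsets.append((sampled_majority, list(minority)))
    return subsets


# Hypothetical gear data: 95 normal samples, 5 fault samples.
majority = [("normal", i) for i in range(95)]
minority = [("fault", i) for i in range(5)]
subsets = balanced_subsets(majority, minority, n_subsets=4)
```

In a full ensemble, one classifier would be trained on each subset and the predictions combined, so that most majority samples are still seen by at least one learner.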


Image Classifying Based on Cost-Sensitive Layered Cascade Learning

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.701-702.453 ◽

2014 ◽

Vol 701-702 ◽

pp. 453-458

Author(s):

Feng Huang ◽

Yun Liang ◽

Li Huang ◽

Ji Ming Yao ◽

Wen Feng Tian

Keyword(s):

Image Classification ◽

Class Imbalance ◽

Classification Performance ◽

Machine Learning Algorithms ◽

Class Imbalance Problem ◽

Data Set ◽

Misclassification Cost ◽

Imbalance Problem ◽

Specific Category ◽

The Cost

Image classification is an important means of image processing. Traditional research on image classification is usually based on the following assumptions: the goal is overall classification accuracy, samples of different categories have the same importance in the data set, and all misclassifications incur the same cost. Unfortunately, class imbalance and cost sensitivity are ubiquitous in real-world classification: the sample size of a specific category in a data set may be much larger than that of the others, and misclassification costs differ sharply between categories. The high dimension of the feature vector caused by the diverse content of images, and the large gap in complexity between distinguishing different categories of images, are common problems in image classification. Therefore, a single machine learning algorithm is not sufficient for complex image classification with the above characteristics. To address these problems, a layered cascade image classifying method based on cost sensitivity and class imbalance is proposed. A set of cascading learners is built, the inner patterns of images of a specific category are learned in different stages, and a cost function is introduced, so that the method can effectively respond to the cost-sensitive and class-imbalance problems of image classifying. Moreover, the structure of this method is flexible, as the number of cascade layers and the algorithm in every stage can be readjusted based on the business requirements of image classifying. The results of an application to sensitive image classifying for the smart grid indicate that image classifying based on cost-sensitive layered cascade learning obtains better image classification performance than the existing methods.
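The role of unequal misclassification costs can be shown with a small sketch of a cost-sensitive decision rule: instead of choosing the most probable class, choose the class with the lowest expected cost. This illustrates only the general cost-sensitive principle, not the layered cascade method of the paper; the class names, probabilities, and cost values are assumptions made for the example.

```python
def min_cost_class(probs, cost):
    """Return the prediction with the lowest expected misclassification cost.

    probs: dict mapping each class to its estimated probability.
    cost:  cost[true_class][predicted_class], with zero cost on the diagonal.
    """
    classes = list(probs)

    def expected_cost(predicted):
        return sum(probs[true] * cost[true][predicted] for true in classes)

    return min(classes, key=expected_cost)


# Hypothetical classifier output: the image is probably ordinary.
probs = {"sensitive": 0.3, "ordinary": 0.7}

# Assumed costs: missing a sensitive image is 10 times worse than a false alarm.
cost = {
    "sensitive": {"sensitive": 0.0, "ordinary": 10.0},
    "ordinary": {"sensitive": 1.0, "ordinary": 0.0},
}
decision = min_cost_class(probs, cost)

# With equal costs the rule falls back to the most probable class.
equal_cost = {
    "sensitive": {"sensitive": 0.0, "ordinary": 1.0},
    "ordinary": {"sensitive": 1.0, "ordinary": 0.0},
}
baseline = min_cost_class(probs, equal_cost)
```

Under the assumed costs, predicting "sensitive" has expected cost 0.7 and predicting "ordinary" has expected cost 3.0, so the rare but expensive class wins even though it is less probable.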


Experimental Study on Class Imbalance Problem Using an Oil Spill Training Data Set

British Journal of Mathematics & Computer Science ◽

10.9734/bjmcs/2017/32860 ◽

2017 ◽

Vol 21 (5) ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Xi Ouyang ◽

Yuan Chen ◽

Bing Wei

Keyword(s):

Experimental Study ◽

Oil Spill ◽

Class Imbalance ◽

Training Data ◽

Class Imbalance Problem ◽

Data Set ◽

Imbalance Problem


Research on expansion and classification of imbalanced data based on SMOTE algorithm

Scientific Reports ◽

10.1038/s41598-021-03430-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Shujuan Wang ◽

Yuntao Dai ◽

Jihong Shen ◽

Jingxue Xuan

Keyword(s):

Big Data ◽

Class Imbalance ◽

Imbalanced Data ◽

Original Data ◽

Classification Performance ◽

Parameter Selection ◽

Sample Collection ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Sample Points

With the development of artificial intelligence, big data classification technology provides valuable help for research on auxiliary medical diagnosis. However, because conditions differ between sample collections, medical big data is often imbalanced. The class-imbalance problem has been reported as a serious obstacle to the classification performance of many standard learning algorithms. The SMOTE algorithm can be used to generate sample points randomly to improve the imbalance rate, but its application is affected by marginalized generation and blind parameter selection. Focusing on this problem, an improved SMOTE algorithm based on the Normal distribution is proposed in this paper, so that the new sample points are distributed closer to the center of the minority sample with a higher probability, avoiding the marginalization of the expanded data. Experiments show that the classification effect is better when the proposed algorithm, rather than the original SMOTE algorithm, is used to expand the imbalanced Pima, WDBC, WPBC, Ionosphere and Breast-cancer-wisconsin data sets. In addition, the parameter selection of the proposed algorithm is analyzed, and it is found that, in our designed experiments, the classification effect is best when appropriate parameters are selected so that the distribution characteristics of the original data are best maintained.
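For orientation, the original SMOTE step that the paper improves on can be sketched as follows: pick a minority sample, pick one of its k nearest minority neighbors, and place a synthetic point at a uniformly random position on the segment between them. This is a minimal sketch of classic SMOTE only; the Normal-distribution variant proposed in the paper is not reproduced here, and the function names and sample coordinates are hypothetical.

```python
import math
import random


def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda q: math.dist(point, q))[:k]


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority points by classic SMOTE interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        neighbor = rng.choice(nearest_neighbors(x, others, k))
        u = rng.random()  # uniform position along the segment x -> neighbor
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbor)))
    return synthetic


# Hypothetical two-feature minority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = smote(minority, n_new=6)
```

Because the position `u` is drawn uniformly, points near the edge of the minority region are as likely as points near its center, which is the marginalization issue the abstract describes.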


Fake News and Imbalanced Data Perspective

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch011 ◽

2021 ◽

pp. 195-210

Author(s):

Isha Y. Agarwal ◽

Dipti P. Rana

Keyword(s):

Class Imbalance ◽

Imbalanced Data ◽

Point Of View ◽

Quality Data ◽

Data Detection ◽

Fake News ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Balanced Distribution ◽

Media Ecosystem

Fake news has grabbed attention lately. In this chapter, the issue is tackled from the point of view of the collection of quality data (i.e., instances of fake and real news articles on a balanced distribution of subjects). It is predicted that in the near future, fake news will supersede true news. In the media ecosystem this will create a natural imbalance of data. Because of the unbounded scale and imbalance of the data, detection of fake news is challenging. The class imbalance problem in fake news is yet to be explored. The imbalance arises because, in some cases, fake news instances increase more than real news instances. The goal of this chapter is to demonstrate the effect of class imbalance between real and fake news instances on detection using classification models. This work aims to help researchers better resolve the problem by illustrating the precise relationship between the imbalance and its impact on the output of the classifier. In particular, the authors determine that data imbalance and accuracy are inversely proportional to each other.


Predicting Student Academic Performance with Ensemble Classification Method on Imbalanced Educational Data

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d7741.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 1543-1551

Keyword(s):

Academic Performance ◽

Time Management ◽

Class Imbalance ◽

Original Data ◽

Learning Rate ◽

Ensemble Classification ◽

Class Imbalance Problem ◽

Data Set ◽

Imbalance Problem ◽

Original Dataset

Education benefits a person in sustaining his present and future by assuring the goal of life. At present, universities and colleges are mainly focusing on improving the academic performance of their students. Recently, many studies have concentrated on employing machine learning models in higher education to help both teachers and students identify their problems and take remedial measures to improve their performance. Some earlier studies discussed the class imbalance problem but achieved poor prediction outcomes due to the low performance of the classifiers. In this paper, we aim to improve the classification/prediction outcomes with a rule-based ensemble model based on various sampling strategies that address the class imbalance problem. The dataset used for this study was collected from Vignan’s Lara and Vignan’s Nirula institutions, considering various factors such as Attendance Percentage, No of Backlogs, Adjustable Nature, Concentration, Result History, Discipline in class, Usage of Social Media, Degree of Intelligence, Understanding of Subjects, Event Participation, Time Management, Extra Classes, Alternative Learning Skills, Logical Thinking, Bad Habits, Parents Education, Health Condition, Planning for higher studies, Family Support, Time Management, and Aggregate. To evaluate the efficiency, we also compared our original dataset with different benchmark datasets, and the performance measures of the proposed method were tested with various sampling methods using a learning rate parameter ranging between 0.1 and 0.8. Combining the re-sampling method with the proposed method on the original data set achieved maximum precision values at a learning rate of 0.3, with an accuracy of 98.36%. Finally, the obtained results were also compared with several baseline classifiers such as Naïve Bayes, SVM, MLP, KNN, and OneR on the collected original datasets.


Improvement in Detecting the Fate of Covid-19 Patients and Rule-based Analysis to Discover the Most Important Rules Governing their Fate

10.21203/rs.3.rs-515541/v1 ◽

2021 ◽

Author(s):

Sadegh Ilbeigipour ◽

Amir Albadvi ◽

Elham Akhondzadeh Noughabi

Keyword(s):

Sampling Method ◽

Class Imbalance ◽

Supervised Machine Learning ◽

Classification Models ◽

Class Imbalance Problem ◽

Data Set ◽

Rule Based ◽

Imbalance Problem ◽

The World ◽

Range Of Values

The world today faces a new challenge that is unprecedented in the last 100 years. The emergence of a new coronavirus has led to a human catastrophe. The new coronavirus is the cause of the Covid-19 disease, which kills many people in the world every day. Scientists in various fields have been looking for solutions to this problem. In addition to general vaccination, maintaining social distance and hygienic principles are the most well-known strategies to prevent Covid-19 infection. In this research, we examine the symptoms of Covid-19 cases through different supervised machine learning methods. We addressed the class imbalance problem using the SMOTE up-sampling method and then developed several classification models to predict the recovery or death of patients. In addition, we implemented a rule-based technique to identify important symptoms that affect patients' fate and to calculate the ranges of values of these features that lead to the recovery or death of patients. Our results showed that the random forest model, with 94% accuracy, 95.2% sensitivity, 92.7% specificity, 93.2% precision, and 94.2% F-score, outperforms state-of-the-art classification models. Finally, we identified the ten most significant rules in the data set. The rules state that different combinations of 6 features in certain ranges of their values lead to patients' recovery with 90% confidence. In conclusion, the classification results in this study show better performance than recent research. In addition, they can help physicians consider other important factors in improving health services for different groups of Covid-19 patients.
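The evaluation measures quoted above all derive from the four cells of a binary confusion matrix. The sketch below spells out the standard definitions and checks that the reported F-score is consistent with the reported precision and sensitivity. The confusion-matrix counts are hypothetical and are not taken from the study; only the two percentages in `reported_f` come from the abstract.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification measures from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fp=10, tn=80, fn=20)

# Consistency check on the reported figures: the F-score is the harmonic
# mean of precision (93.2%) and sensitivity (95.2%).
reported_f = 2 * 0.932 * 0.952 / (0.932 + 0.952)
```

The harmonic mean of 93.2% and 95.2% rounds to 94.2%, which matches the F-score stated in the abstract.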


Detection of Myocardial Infarction Using ECG and Multi-Scale Feature Concatenate

Sensors ◽

10.3390/s21051906 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1906

Author(s):

Jia-Zheng Jian ◽

Tzong-Rong Ger ◽

Han-Hua Lai ◽

Chi-Ming Ku ◽

Chiung-An Chen ◽

...

Keyword(s):

Myocardial Infarction ◽

Network Structure ◽

Class Imbalance ◽

Class Imbalance Problem ◽

Multi Scale ◽

Imbalance Problem ◽

Average Accuracy ◽

Significant Difference ◽

Electrocardiogram ECG

Diverse computer-aided diagnosis systems based on convolutional neural networks have been applied to automate the detection of myocardial infarction (MI) in electrocardiograms (ECG) for early diagnosis and prevention. However, issues such as overfitting and underfitting have not been taken into account. In other words, it is unclear whether the network structure is too simple or too complex. Toward this end, the proposed models were developed by starting with the simplest structure: a multi-lead features-concatenate narrow network (N-Net) in which only two convolutional layers were included in each lead branch. Additionally, multi-scale features-concatenate networks (MSN-Net) were implemented, in which larger features were extracted by pooling the signals. The best structure was obtained by tuning both the number of filters in the convolutional layers and the number of input signal scales. As a result, the N-Net reached a 95.76% accuracy in the MI detection task, whereas the MSN-Net reached an accuracy of 61.82% in the MI locating task. Both networks give a higher average accuracy, with a significant difference of p < 0.001 evaluated by the U test, compared with the state-of-the-art. The models are also smaller in size and are thus suitable for wearable devices for offline monitoring. In conclusion, testing across simple and complex network structures is indispensable. However, the way of dealing with the class imbalance problem and the quality of the extracted features are yet to be discussed.


A Novel Focal Phi Loss for Power Line Segmentation with Auxiliary Classifier U-Net

Sensors ◽

10.3390/s21082803 ◽

2021 ◽

Vol 21 (8) ◽

pp. 2803

Author(s):

Rabeea Jaffari ◽

Manzoor Ahmed Hashmani ◽

Constantino Carlos Reyes-Aldasoro

Keyword(s):

Loss Function ◽

Class Imbalance ◽

Power Line ◽

Aerial Images ◽

Class Imbalance Problem ◽

Trade Off ◽

Urban Scenes ◽

Imbalance Problem ◽

A Minor ◽

Evaluation Parameters

The segmentation of power lines (PLs) from aerial images is a crucial task for the safe navigation of unmanned aerial vehicles (UAVs) operating at low altitudes. Despite the advances in deep learning-based approaches for PL segmentation, these models are still vulnerable to the class imbalance present in the data. The PLs occupy only a minimal portion (1–5%) of the aerial images as compared to the background region (95–99%). Generally, this class imbalance problem is addressed via the use of PL-specific detectors in conjunction with the popular class balanced cross entropy (BBCE) loss function. However, these PL-specific detectors do not work outside their application areas and a BBCE loss requires hyperparameter tuning for class-wise weights, which is not trivial. Moreover, the BBCE loss results in low dice scores and precision values and thus, fails to achieve an optimal trade-off between dice scores, model accuracy, and precision–recall values. In this work, we propose a generalized focal loss function based on the Matthews correlation coefficient (MCC) or the Phi coefficient to address the class imbalance problem in PL segmentation while utilizing a generic deep segmentation architecture. We evaluate our loss function by improving the vanilla U-Net model with an additional convolutional auxiliary classifier head (ACU-Net) for better learning and faster model convergence. The evaluation of two PL datasets, namely the Mendeley Power Line Dataset and the Power Line Dataset of Urban Scenes (PLDU), where PLs occupy around 1% and 2% of the aerial images area, respectively, reveal that our proposed loss function outperforms the popular BBCE loss by 16% in PL dice scores on both the datasets, 19% in precision and false detection rate (FDR) values for the Mendeley PL dataset and 15% in precision and FDR values for the PLDU with a minor degradation in the accuracy and recall values. 
Moreover, our proposed ACU-Net outperforms the baseline vanilla U-Net for the characteristic evaluation parameters in the range of 1–10% for both the PL datasets. Thus, our proposed loss function with ACU-Net achieves an optimal trade-off for the characteristic evaluation parameters without any bells and whistles. Our code is available at GitHub.
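To make the idea of an MCC-based loss concrete, the following sketch computes a differentiable-style ("soft") confusion matrix from predicted probabilities, derives the Matthews correlation coefficient from it, and raises `1 - MCC` to a focusing power. The form `(1 - MCC) ** gamma`, the default `gamma`, and the function names are assumptions made for illustration; the authors' published code should be consulted for the exact formulation.

```python
import math


def soft_confusion(probs, labels):
    """Soft confusion counts from per-pixel probabilities and 0/1 labels."""
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    tn = sum((1 - p) * (1 - y) for p, y in zip(probs, labels))
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn, eps=1e-12):
    """Matthews correlation coefficient (Phi coefficient)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / (denominator + eps)


def focal_phi_style_loss(probs, labels, gamma=1.5):
    """Assumed focal form: (1 - MCC) ** gamma, clamped at zero."""
    return max(0.0, 1.0 - mcc(*soft_confusion(probs, labels))) ** gamma


# Hypothetical 10-pixel image in which the power line covers 1 pixel.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
loss_perfect = focal_phi_style_loss([float(y) for y in labels], labels)
loss_all_background = focal_phi_style_loss([0.0] * 10, labels)
```

Predicting background everywhere would score 90% pixel accuracy on this example, yet its MCC is 0 and the loss stays at 1, which is why a correlation-based loss is attractive for heavily imbalanced segmentation.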
