Imbalance class problems in data mining: a review

Imbalanced data problems are common in data mining and arise from the skewed nature of the data. They degrade classification performance in machine learning. In such problems the classes are unevenly represented: most specimens belong to one class, while the other class, which is usually the more important one, has far fewer specimens and is misclassified by many classifiers. Significant research has addressed imbalanced data problems using a range of techniques and approaches. In this research, a comprehensive survey is performed to identify the challenges of handling imbalanced class problems during classification with machine learning algorithms. We discuss how classifiers become biased toward the majority class and ignore the minority class. Furthermore, viable solutions and potential future directions for handling these problems are provided.
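
The bias toward the majority class described above can be illustrated with a small sketch. It is not taken from the surveyed paper: the 95/5 class split and the helper names `majority_baseline`, `accuracy`, and `recall` are assumptions made for illustration. A classifier that always predicts the majority label scores high accuracy while never finding a minority example.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a classifier that always predicts the most frequent training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda _example: most_common_label


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` examples that were predicted as `positive`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in relevant) / len(relevant) if relevant else 0.0


# Hypothetical data: 95 majority examples (label 0) and 5 minority examples (label 1).
labels = [0] * 95 + [1] * 5
predict = majority_baseline(labels)
predictions = [predict(None) for _ in labels]

print(accuracy(labels, predictions))            # 0.95
print(recall(labels, predictions, positive=1))  # 0.0
```

This is why accuracy alone is a weak indicator on skewed data, and why class-specific measures such as minority recall are usually reported alongside it.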

An Improved Random Forest Algorithm for Class-Imbalanced Data Classification and its Application in PAD Risk Factors Analysis

The Open Electrical & Electronic Engineering Journal ◽

10.2174/1874129001307010062 ◽

2013 ◽

Vol 7 (1) ◽

pp. 62-70 ◽

Cited By ~ 9

Author(s):

Dengju Yao ◽

Jing Yang ◽

Xiaojuan Zhan

Keyword(s):

Machine Learning ◽

Random Forest ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

Majority Voting ◽

Training Dataset ◽

Random Forest Algorithm ◽

Research Subjects ◽

Minority Class ◽

Imbalanced Data Classification

Classification is one of the important research subjects in machine learning. However, most machine learning algorithms train a classifier on the assumption that each class has roughly the same number of training examples. When a classifier is trained on imbalanced data, its performance declines markedly. To resolve the class-imbalance problem, an improved random forest algorithm based on sampling with replacement is proposed. We randomly extract multiple example subsets with replacement from the majority class, each containing the same number of examples as the minority class dataset. Multiple new training datasets are then constructed by combining each extracted majority subset with the minority class dataset, and a random forest classifier is trained on each of these datasets. For a prediction example, the class is determined by majority voting among the random forest classifiers. Experimental results on five groups of UCI datasets and a real clinical dataset show that the proposed method can handle class-imbalanced data and that the improved random forest algorithm outperforms the original random forest and other methods in the literature.
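
The resampling-and-voting scheme in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a nearest-centroid learner stands in for the random forest to keep the example dependency-free, and the function names, toy data, and parameters (`n_models`, `seed`) are hypothetical.

```python
import random
from collections import Counter


def train_centroid_classifier(examples):
    """Stand-in base learner (NOT a random forest): nearest class centroid.

    `examples` is a list of (feature_vector, label) pairs.
    """
    sums, counts = {}, Counter()
    for features, label in examples:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
    centroids = {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

    def predict(features):
        def sq_dist(center):
            return sum((a - b) ** 2 for a, b in zip(features, center))
        return min(centroids, key=lambda label: sq_dist(centroids[label]))

    return predict


def train_balanced_ensemble(majority, minority, n_models=5, seed=0):
    """Train one base learner per balanced dataset.

    Each dataset is the full minority set plus an equally sized sample
    drawn with replacement from the majority class.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        majority_subset = rng.choices(majority, k=len(minority))
        models.append(train_centroid_classifier(majority_subset + minority))
    return models


def predict_by_vote(models, features):
    """Final class is the majority vote over the ensemble members."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 200 majority points near (0, 0), 10 minority points near (3, 3).
data_rng = random.Random(42)
majority = [([data_rng.gauss(0, 1), data_rng.gauss(0, 1)], 0) for _ in range(200)]
minority = [([data_rng.gauss(3, 1), data_rng.gauss(3, 1)], 1) for _ in range(10)]

ensemble = train_balanced_ensemble(majority, minority, n_models=5, seed=0)
print(predict_by_vote(ensemble, [3.0, 3.0]))
print(predict_by_vote(ensemble, [0.0, 0.0]))
```

An odd `n_models` avoids tied votes in the two-class case. Swapping the stand-in learner for an actual random forest would not change the resampling or voting logic.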

SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features

Applied System Innovation ◽

10.3390/asi4010018 ◽

2021 ◽

Vol 4 (1) ◽

pp. 18

Author(s):

Mimi Mukherjee ◽

Matloob Khushi

Keyword(s):

Machine Learning ◽

Synthetic Data ◽

Machine Learning Algorithms ◽

The Other ◽

Classification Models ◽

Target Class ◽

Minority Class ◽

Novel Method ◽

The Difference ◽

Real World Datasets

Real-world datasets are often heavily skewed, with some classes significantly outnumbered by others. In these situations, machine learning algorithms fail to achieve substantial efficacy when predicting the underrepresented instances. To solve this problem, many variations of the synthetic minority oversampling technique (SMOTE) have been proposed to balance datasets with continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based oversampling technique to balance the data. In this paper, we present a novel minority oversampling method, SMOTE-ENC (SMOTE-Encoded Nominal and Continuous), in which nominal features are encoded as numeric values and the difference between two such numeric values reflects the amount of change of association with the minority class. Our experiments show that classification models using the SMOTE-ENC method offer better prediction than models using SMOTE-NC when the dataset has a substantial number of nominal features and also when there is some association between the categorical features and the target class. Additionally, our proposed method addresses one of the major limitations of the SMOTE-NC algorithm: SMOTE-NC can be applied only to mixed datasets that contain both continuous and nominal features, and cannot function if all the features of the dataset are nominal. Our novel method has been generalized so that it can be applied to both mixed datasets and nominal-only datasets.
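
The abstract does not state the encoding formula, so the following is only a sketch of one plausible way to turn nominal labels into numbers that reflect association with the minority class. It assumes each label is scored as (observed minority count − expected minority count) / expected minority count; the function name and toy data are hypothetical, and the paper should be consulted for the actual SMOTE-ENC definition.

```python
from collections import Counter


def encode_nominal_labels(values, targets, minority_label):
    """Score each nominal label by its association with the minority class.

    Assumed formula (illustrative only): (observed - expected) / expected,
    where `expected` is the minority count the label would have if it were
    independent of the target.
    """
    minority_rate = sum(t == minority_label for t in targets) / len(targets)
    if minority_rate == 0:
        raise ValueError("no minority examples present")
    totals = Counter(values)
    observed = Counter(v for v, t in zip(values, targets) if t == minority_label)
    return {
        label: (observed[label] - totals[label] * minority_rate)
        / (totals[label] * minority_rate)
        for label in totals
    }


# Hypothetical data: 20 rows, 5 of them in the minority class (label 1).
values = ["a"] * 8 + ["b"] * 8 + ["c"] * 4
targets = [1] * 4 + [0] * 4 + [1] * 1 + [0] * 7 + [0] * 4

print(encode_nominal_labels(values, targets, minority_label=1))
# {'a': 1.0, 'b': -0.5, 'c': -1.0}
```

Under this assumed scoring, labels over-represented in the minority class receive positive values and under-represented ones negative values, so numeric distance between two labels tracks how differently they associate with the minority class.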

Machine Learning for Design Optimization of Electromagnetic Devices: Recent Developments and Future Directions

Applied Sciences ◽

10.3390/app11041627 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1627

Author(s):

Yanbin Li ◽

Gang Lei ◽

Gerd Bramerdorfer ◽

Sheng Peng ◽

Xiaodong Sun ◽

...

Keyword(s):

Machine Learning ◽

Design Optimization ◽

Optimization Methods ◽

Machine Learning Algorithms ◽

Cloud Services ◽

Robust Design Optimization ◽

Support Vector ◽

Future Directions ◽

Electromagnetic Devices ◽

Recent Developments

This paper reviews recent developments in design optimization methods for electromagnetic devices, with a focus on machine learning methods. First, recent advances in multi-objective, multidisciplinary, multilevel, topology, fuzzy, and robust design optimization of electromagnetic devices are surveyed. Second, the performance prediction and design optimization of electromagnetic devices based on machine learning algorithms are reviewed, including artificial neural networks, support vector machines, extreme learning machines, random forests, and deep learning. Last, to meet modern requirements for high manufacturing/production quality and lifetime reliability, several promising topics, including the application of cloud services and digital twins, are discussed as future directions for design optimization of electromagnetic devices.

Predicting Student Failure in University Examination using Machine Learning Algorithms

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.e2643.039520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 956-959

Keyword(s):

Machine Learning ◽

Data Mining ◽

Performance Management ◽

Student Performance ◽

Learning Algorithms ◽

Educational Data Mining ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Social Characteristics ◽

Student Failure

Student performance management is one of the key pillars of higher education institutions, since it directly affects students’ career prospects and college rankings. This paper follows the path of learning analytics and educational data mining by applying machine learning techniques to student data in order to identify students who are more likely to fail university examinations, so that the needed interventions can be provided to improve student performance. The paper uses a data mining approach with 10-fold cross-validation to classify students based on predictors drawn from their demographic and social characteristics. It compares five popular machine learning algorithms (Rep Tree, Jrip, Random Forest, Random Tree, and Naive Bayes) on overall classifier accuracy as well as class-specific indicators, namely precision, recall, and F-measure. The results showed that the Rep Tree algorithm outperformed the other machine learning algorithms in classifying students who are more likely to fail the examinations.
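
The evaluation protocol named in this abstract, k-fold cross-validation with class-specific precision, recall, and F-measure, can be sketched in a few lines. This is a generic illustration, not the paper's experimental setup: the helper names, the fold assignment by striding, and the toy "fail"/"pass" labels are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def precision_recall_f1(y_true, y_pred, positive):
    """Class-specific precision, recall, and F-measure for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Fold sizes for a hypothetical dataset of 25 students.
print([len(test) for _, test in k_fold_indices(25, k=10)])
# [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]

# Hypothetical predictions, scored for the "fail" class.
y_true = ["fail", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]
y_pred = ["fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass"]
print(precision_recall_f1(y_true, y_pred, positive="fail"))
```

In a full experiment, each classifier would be trained on every `train` split and scored on the matching `test` split, with the per-fold metrics averaged.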

Dr. Phish: Phishing Website Detector

E3S Web of Conferences ◽

10.1051/e3sconf/202129701032 ◽

2021 ◽

Vol 297 ◽

pp. 01032

Author(s):

Harish Kumar ◽

Anshal Prasad ◽

Ninad Rane ◽

Nilay Tamane ◽

Anjali Yeole

Keyword(s):

Machine Learning ◽

Data Mining ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Cyber Crime ◽

Data Mining Algorithms ◽

Learning Techniques ◽

Mining Algorithms ◽

Host Properties ◽

New Strategies

Phishing is a common attack that tricks credulous people into disclosing their personal information. It is a type of cyber-crime in which fake sites lure victims into giving up sensitive data. This paper deals with methods for detecting phishing websites by analyzing various features of URLs with machine learning techniques. It discusses methods for detecting phishing websites based on lexical features, host properties, and page importance properties. We consider various data mining algorithms for evaluating the features in order to better understand the structure of URLs that spread phishing. To protect end users from visiting these sites, we can try to identify phishing URLs by analyzing their lexical and host-based features. A particular challenge in this domain is that criminals constantly devise new strategies to counter our defense measures. To succeed in this contest, we need machine learning algorithms that continually adapt to new examples and features of phishing URLs.
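
As a rough illustration of what lexical URL features might look like, the sketch below extracts a handful of them with Python's standard `urllib.parse`. The specific features, their names, and the example URLs are assumptions chosen for illustration; the abstract does not list the features the authors used.

```python
import re
from urllib.parse import urlparse

# Matches a dotted-quad host such as 192.168.0.1 (no range check on the octets).
IPV4_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def lexical_features(url):
    """Extract simple, illustrative lexical features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        "host_is_ipv4": bool(IPV4_PATTERN.match(host)),
        "uses_https": parsed.scheme == "https",
        "path_depth": len([part for part in parsed.path.split("/") if part]),
    }


# Hypothetical URLs for demonstration only.
for example_url in ("http://192.168.0.1/login/verify", "https://www.example.com/"):
    print(example_url, lexical_features(example_url))
```

Feature dictionaries like these could then be vectorized and passed to any of the classifiers under evaluation. Host-based properties would need external lookups that are outside the scope of this sketch.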

Lead-based virtual screening and prediction of EGFR inhibitors using PubChem’s database with data mining and machine learning algorithms

10.1021/scimeetings.0c03836 ◽

2020 ◽

Cited By ~ 1

Author(s):

Kedan He

Keyword(s):

Machine Learning ◽

Data Mining ◽

Virtual Screening ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Egfr Inhibitors

Integration of synthetic minority oversampling technique for imbalanced class

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v13.i1.pp102-108 ◽

2019 ◽

Vol 13 (1) ◽

pp. 102

Author(s):

Noviyanti Santoso ◽

Wahyu Wibowo ◽

Hilda Hikmawati

Keyword(s):

Machine Learning ◽

Data Mining ◽

Support Vector Machine ◽

Class Imbalance ◽

Original Data ◽

Support Vector ◽

Classification Methods ◽

Problematic Issue ◽

Imbalanced Class ◽

F Measure

In data mining, class imbalance is a problematic issue for which solutions are still being sought. This is probably because machine learning algorithms are built on the assumption that each class has a balanced number of instances, so when the classes are imbalanced the prediction results may not be appropriate. Several solutions have been offered for class imbalance, including oversampling, undersampling, and the synthetic minority oversampling technique (SMOTE). Both oversampling and undersampling have their disadvantages, so SMOTE is an alternative for overcoming them. Integrating SMOTE into data mining classification methods such as Naive Bayes, Support Vector Machine (SVM), and Random Forest (RF) is expected to improve accuracy. In this research, it was found that the SMOTE data gave better accuracy than the original data. Among the three classification methods used, RF gives the highest average AUC, F-measure, and G-means score.
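
The core idea of SMOTE, creating synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration with hypothetical names, parameters, and data, not the implementation used in the study; established libraries handle details this sketch omits.

```python
import random


def smote_sample(minority, k=3, n_new=5, seed=0):
    """Generate `n_new` synthetic points from a list of minority feature vectors.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared Euclidean distance to `base`.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority examples with two continuous features.
minority_points = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [2.5, 2.5]]
for point in smote_sample(minority_points, k=3, n_new=5, seed=0):
    print(point)
```

Because new points are interpolated rather than duplicated, this differs from plain random oversampling, which repeats existing minority examples.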

Machine Learning for Landslides Prevention: A Survey

10.36227/techrxiv.12546098 ◽

2020 ◽

Author(s):

Zhengjing Ma ◽

Gang Mei

Keyword(s):

Machine Learning ◽

State Of The Art ◽

Human Life ◽

Machine Learning Algorithms ◽

Warning Systems ◽

Negative Effects ◽

Relevant Research ◽

Related Information ◽

Urgent Task ◽

Comprehensive Survey

Landslides are one of the most critical categories of natural disasters worldwide and have severely destructive consequences for human life and the overall economic system. To reduce their negative effects, landslide prevention has become an urgent task, which includes investigating landslide-related information and predicting potential landslides. Machine learning is a state-of-the-art analytics tool that has been widely used in landslide prevention. This paper presents a comprehensive survey of relevant research on machine learning applied to landslide prevention, mainly focusing on (1) landslide detection based on images, (2) landslide susceptibility assessment, and (3) the development of landslide warning systems. Moreover, this paper discusses the current challenges and potential opportunities in the application of machine learning algorithms to landslide prevention.

Construction of Rapid Early Warning and Comprehensive Analysis Models for Urban Waterlogging Based on AutoML and Comparison of the other three Machine Learning Algorithms

Journal of Hydrology ◽

10.1016/j.jhydrol.2021.127367 ◽

2021 ◽

pp. 127367

Author(s):

Yuchen Guo ◽

Lihong Quan ◽

Lili Song ◽

Hao Liang

Keyword(s):

Machine Learning ◽

Early Warning ◽

Learning Algorithms ◽

Comprehensive Analysis ◽

Machine Learning Algorithms ◽

The Other ◽

Analysis Models

Machine Learning for Business Analytics

Advances in Data Mining and Database Management - Challenges and Applications of Data Analytics in Social Perspectives ◽

10.4018/978-1-7998-2566-1.ch013 ◽

2021 ◽

pp. 232-256

Author(s):

Kağan Okatan

Keyword(s):

Machine Learning ◽

Data Mining ◽

Social Media ◽

Big Data ◽

Machine Learning Algorithms ◽

Decision Makers ◽

Business Analytics ◽

Business Intelligence Systems ◽

Long Time ◽

Rules Of The Game

These types of analytics have long answered business questions using the principal methods of investigating data warehouses. Data mining and business intelligence systems in particular help decision makers reach the information they want. Many existing systems are trying to keep up with a phenomenon that has changed the rules of the game in recent years: the undeniable attraction of 'big data'. Evaluating the big data generated by social media in particular is among the most current issues in business analytics, and it demonstrates the importance of integrating machine learning into business analytics. This section introduces the prominent machine learning algorithms that are increasingly used for business analytics and emphasizes their application areas.
