scholarly journals Imbalance class problems in data mining: a review

Author(s):  
Haseeb Ali ◽  
Mohd Najib Mohd Salleh ◽  
Rohmat Saedudin ◽  
Kashif Hussain ◽  
Muhammad Faheem Mushtaq

<span>The imbalanced data problems in data mining are common nowadays, which occur due to skewed nature of data. These problems impact the classification process negatively in machine learning process. In such problems, classes have different ratios of specimens in which a large number of specimens belong to one class and the other class has fewer specimens that is usually an essential class, but unfortunately misclassified by many classifiers. So far, significant research is performed to address the imbalanced data problems by implementing different techniques and approaches. In this research, a comprehensive survey is performed to identify the challenges of handling imbalanced class problems during classification process using machine learning algorithms. We discuss the issues of classifiers which endorse bias for majority class and ignore the minority class. Furthermore, the viable solutions and potential future directions are provided to handle the problems<em>.</em></span>

2013 ◽  
Vol 7 (1) ◽  
pp. 62-70 ◽  
Author(s):  
Dengju Yao ◽  
Jing Yang ◽  
Xiaojuan Zhan

The classification problem is one of the important research subjects in the field of machine learning. However, most machine learning algorithms train a classifier based on the assumption that the number of training examples of classes is almost equal. When a classifier was trained on imbalanced data, the performance of the classifier declined clearly. For resolving the class-imbalanced problem, an improved random forest algorithm was proposed based on sampling with replacement. We extracted multiple example subsets randomly with replacement from majority class, and the example number of extracted example subsets is as the same with minority class example dataset. Then, multiple new training datasets were constructed by combining the each exacted majority example subset and minority class dataset respectively, and multiple random forest classifiers were training on these training dataset. For a prediction example, the class was determined by majority voting of multiple random forest classifiers. The experimental results on five groups UCI datasets and a real clinical dataset show that the proposed method could deal with the class-imbalanced data problem and the improved random forest algorithm outperformed original random forest and other methods in literatures.


2021 ◽  
Vol 4 (1) ◽  
pp. 18
Author(s):  
Mimi Mukherjee ◽  
Matloob Khushi

Real-world datasets are heavily skewed where some classes are significantly outnumbered by the other classes. In these situations, machine learning algorithms fail to achieve substantial efficacy while predicting these underrepresented instances. To solve this problem, many variations of synthetic minority oversampling methods (SMOTE) have been proposed to balance datasets which deal with continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based oversampling technique to balance the data. In this paper, we present a novel minority oversampling method, SMOTE-ENC (SMOTE—Encoded Nominal and Continuous), in which nominal features are encoded as numeric values and the difference between two such numeric values reflects the amount of change of association with the minority class. Our experiments show that classification models using the SMOTE-ENC method offer better prediction than models using SMOTE-NC when the dataset has a substantial number of nominal features and also when there is some association between the categorical features and the target class. Additionally, our proposed method addressed one of the major limitations of the SMOTE-NC algorithm. SMOTE-NC can be applied only on mixed datasets that have features consisting of both continuous and nominal features and cannot function if all the features of the dataset are nominal. Our novel method has been generalized to be applied to both mixed datasets and nominal-only datasets.


2021 ◽  
Vol 11 (4) ◽  
pp. 1627
Author(s):  
Yanbin Li ◽  
Gang Lei ◽  
Gerd Bramerdorfer ◽  
Sheng Peng ◽  
Xiaodong Sun ◽  
...  

This paper reviews the recent developments of design optimization methods for electromagnetic devices, with a focus on machine learning methods. First, the recent advances in multi-objective, multidisciplinary, multilevel, topology, fuzzy, and robust design optimization of electromagnetic devices are overviewed. Second, a review is presented to the performance prediction and design optimization of electromagnetic devices based on the machine learning algorithms, including artificial neural network, support vector machine, extreme learning machine, random forest, and deep learning. Last, to meet modern requirements of high manufacturing/production quality and lifetime reliability, several promising topics, including the application of cloud services and digital twin, are discussed as future directions for design optimization of electromagnetic devices.


Student Performance Management is one of the key pillars of the higher education institutions since it directly impacts the student’s career prospects and college rankings. This paper follows the path of learning analytics and educational data mining by applying machine learning techniques in student data for identifying students who are at the more likely to fail in the university examinations and thus providing needed interventions for improved student performance. The Paper uses data mining approach with 10 fold cross validation to classify students based on predictors which are demographic and social characteristics of the students. This paper compares five popular machine learning algorithms Rep Tree, Jrip, Random Forest, Random Tree, Naive Bayes algorithms based on overall classifier accuracy as well as other class specific indicators i.e. precision, recall, f-measure. Results proved that Rep tree algorithm outperformed other machine learning algorithms in classifying students who are at more likely to fail in the examinations.


2021 ◽  
Vol 297 ◽  
pp. 01032
Author(s):  
Harish Kumar ◽  
Anshal Prasad ◽  
Ninad Rane ◽  
Nilay Tamane ◽  
Anjali Yeole

Phishing is a common attack on credulous people by making them disclose their unique information. It is a type of cyber-crime where false sites allure exploited people to give delicate data. This paper deals with methods for detecting phishing websites by analyzing various features of URLs by Machine learning techniques. This experimentation discusses the methods used for detection of phishing websites based on lexical features, host properties and page importance properties. We consider various data mining algorithms for evaluation of the features in order to get a better understanding of the structure of URLs that spread phishing. To protect end users from visiting these sites, we can try to identify the phishing URLs by analyzing their lexical and host-based features.A particular challenge in this domain is that criminals are constantly making new strategies to counter our defense measures. To succeed in this contest, we need Machine Learning algorithms that continually adapt to new examples and features of phishing URLs.


Author(s):  
Noviyanti Santoso ◽  
Wahyu Wibowo ◽  
Hilda Hikmawati

In the data mining, a class imbalance is a problematic issue to look for the solutions. It probably because machine learning is constructed by using algorithms with assuming the number of instances in each balanced class, so when using a class imbalance, it is possible that the prediction results are not appropriate. They are solutions offered to solve class imbalance issues, including oversampling, undersampling, and synthetic minority oversampling technique (SMOTE). Both oversampling and undersampling have its disadvantages, so SMOTE is an alternative to overcome it. By integrating SMOTE in the data mining classification method such as Naive Bayes, Support Vector Machine (SVM), and Random Forest (RF) is expected to improve the performance of accuracy. In this research, it was found that the data of SMOTE gave better accuracy than the original data. In addition to the three classification methods used, RF gives the highest average AUC, F-measure, and G-means score.


2020 ◽  
Author(s):  
Zhengjing Ma ◽  
Gang Mei

Landslides are one of the most critical categories of natural disasters worldwide and induce severely destructive outcomes to human life and the overall economic system. To reduce its negative effects, landslides prevention has become an urgent task, which includes investigating landslide-related information and predicting potential landslides. Machine learning is a state-of-the-art analytics tool that has been widely used in landslides prevention. This paper presents a comprehensive survey of relevant research on machine learning applied in landslides prevention, mainly focusing on (1) landslides detection based on images, (2) landslides susceptibility assessment, and (3) the development of landslide warning systems. Moreover, this paper discusses the current challenges and potential opportunities in the application of machine learning algorithms for landslides prevention.


Author(s):  
Kağan Okatan

All these types of analytics have been answering business questions for a long time about the principal methods of investigating data warehouses. Especially data mining and business intelligence systems support decision makers to reach the information they want. Many existing systems are trying to keep up with a phenomenon that has changed the rules of the game in recent years. This is undoubtedly the undeniable attraction of 'big data'. In particular, the issue of evaluating the big data generated especially by social media is among the most up-to-date issues of business analytics, and this issue demonstrates the importance of integrating machine learning into business analytics. This section introduces the prominent machine learning algorithms that are increasingly used for business analytics and emphasizes their application areas.


Sign in / Sign up

Export Citation Format

Share Document