Data Level Approach for Multiclass Imbalance Financial Data

2020 · Vol 19

The class imbalance problem is a common real-world issue in which a classifier gives more importance to the majority class and less to the minority class. Under class imbalance, metrics such as error rate or predictive accuracy are not suitable for evaluating classifier performance. One family of methods for handling imbalanced data is resampling. In this paper, three resampling methods (oversampling, under-sampling and hybrid) are applied with different approaches to two imbalanced financial datasets to see the impact of class imbalance ratios on the performance measures of nine classification algorithms. Aiming at better classification performance, the algorithms (Bayes Net, Naive Bayes, J48, Random Forest, Meta-Attribute Selected Classifier, Meta-Classification via Regression, Meta-LogitBoost, Logistic Regression, and Decision Tree) are measured on multiclass imbalanced data from two Canadian banks using Precision, Recall, ROC Area and the Kappa Statistic in the WEKA software. The outcomes of these performance measurements are compared across the three resampling methods. The results give a clear picture of the overall impact of class imbalance on the classification datasets and indicate that the proposed resampling methods can also be used for class imbalance problems.
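
The paper runs its experiments in WEKA; as an illustration of the three resampling strategies it compares, the sketch below uses Python with scikit-learn and imbalanced-learn (an assumed toolchain, not the authors') on a synthetic stand-in for a multiclass imbalanced financial dataset.

```python
# Sketch of the three resampling strategies compared in the paper:
# oversampling, undersampling, and a hybrid method. The paper itself
# uses WEKA; imbalanced-learn is assumed here purely for illustration.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

# Synthetic stand-in for a multiclass imbalanced financial dataset.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.85, 0.10, 0.05], random_state=42)
print("original:", Counter(y))

for name, sampler in [("oversampling", RandomOverSampler(random_state=42)),
                      ("undersampling", RandomUnderSampler(random_state=42)),
                      ("hybrid (SMOTE+Tomek)", SMOTETomek(random_state=42))]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```

RandomOverSampler and RandomUnderSampler rebalance by duplicating or discarding instances, while SMOTETomek is one common hybrid that combines synthetic oversampling with Tomek-link cleaning.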

2020 · pp. 096228022098048
Author(s): Olga Lyashevska, Fiona Malone, Eugene MacCarthy, Jens Fiehler, Jan-Hendrik Buhk, ...

Imbalance between positive and negative outcomes, a so-called class imbalance, is a problem generally found in medical data. Imbalanced data hinder the performance of conventional classification methods, which aim to improve the overall accuracy of the model without accounting for the uneven distribution of the classes. To rectify this, the data can be resampled by oversampling the positive (minority) class until the classes are approximately equally represented. After that, a prediction model such as a gradient boosting algorithm can be fitted with greater confidence. This classification method allows for non-linear relationships and deep interaction effects while focusing on difficult areas by iteratively shifting towards problematic observations. In this study, we demonstrate the application of these methods to medical data and develop a practical framework for evaluating the features that contribute to the probability of stroke.
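
As a minimal sketch of the described pipeline, assuming scikit-learn and imbalanced-learn with synthetic data in place of the study's stroke records: oversample the positive class on the training split only, then fit gradient boosting.

```python
# Sketch: oversample the minority (positive) class, then fit gradient
# boosting. Library choices (scikit-learn / imbalanced-learn) are
# assumptions; the study's actual features and data are not reproduced.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Resample only the training split so the test set keeps its natural imbalance.
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

gbm = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
print("AUC:", roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1]))
```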


Author(s): Sreeja N. K.

Learning a classifier from imbalanced data is one of the most challenging research problems. Data imbalance occurs when the number of instances belonging to one class is much smaller than the number of instances belonging to the other class. A standard classifier is biased towards the majority class and therefore misclassifies minority class instances. Minority class instances may be regarded as rare events or unusual patterns that could potentially have a negative impact on society; therefore, detection of such events is considered significant. This chapter proposes a FireWorks-based Hybrid ReSampling (FWHRS) algorithm to resample imbalanced data. It is used with a Weighted Pattern Matching based classifier (PMC+) for classification. FWHRS-PMC+ was evaluated on 44 imbalanced binary datasets. Experiments reveal that FWHRS-PMC+ is effective in the classification of imbalanced data. Empirical results were validated using non-parametric statistical tests.
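
FWHRS and PMC+ are the authors' own algorithms and are not reproduced here; the sketch below only illustrates the kind of non-parametric validation mentioned, a Wilcoxon signed-rank test over paired per-dataset scores, with placeholder numbers rather than the chapter's results.

```python
# Sketch of non-parametric validation of paired classifier results,
# e.g. per-dataset scores of two methods across imbalanced datasets.
# The scores below are illustrative placeholders, not the chapter's data.
from scipy.stats import wilcoxon

scores_fwhrs_pmc = [0.81, 0.77, 0.92, 0.68, 0.85, 0.74, 0.90, 0.79]
scores_baseline  = [0.76, 0.71, 0.90, 0.61, 0.80, 0.75, 0.84, 0.72]

stat, p = wilcoxon(scores_fwhrs_pmc, scores_baseline)
print(f"Wilcoxon W={stat:.1f}, p={p:.4f}")  # small p -> significant difference
```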


Author(s): Dariusz Brzezinski, Leandro L. Minku, Tomasz Pewinski, Jerzy Stefanowski, Artur Szumaczuk

Abstract Class imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and the presence of unsafe types of examples. Despite often being present in real-world data, the interactions between concept drifts and local data difficulty factors have not yet been investigated in concept drifting data streams. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations on the predictions of representative online classifiers. Experimental results reveal the high influence of the newly considered factors and their local drifts, as well as differences in existing classifiers' reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address the challenges posed by imbalanced data streams.
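
As a minimal, assumption-laden sketch of prequential (test-then-train) evaluation on an imbalanced stream with an abrupt local drift in the minority class: the generator and the SGD classifier below are illustrative choices, not the study's setup.

```python
# Sketch: prequential (test-then-train) evaluation on a synthetic
# imbalanced stream whose minority cluster jumps at t=2500, i.e. a
# local drift affecting only the minority class. Illustrative only.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
clf = SGDClassifier(loss="log_loss", random_state=1)
tp = fn = 0

for t in range(5000):
    y = int(rng.random() < 0.05)              # ~5% minority class
    centre = 3.0 if (y and t < 2500) else (-3.0 if y else 0.0)
    x = rng.normal(centre, 1.0, size=(1, 2))  # minority cluster moves at t=2500

    if t:                                     # test first ...
        pred = int(clf.predict(x)[0])
        if y == 1:
            tp += pred                        # minority hit
            fn += 1 - pred                    # minority miss
    clf.partial_fit(x, [y], classes=[0, 1])   # ... then train

# Accuracy would hide the minority class, so report minority recall instead.
print("prequential minority recall:", tp / max(tp + fn, 1))
```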


2020 · Vol 19 (01) · pp. 2040014
Author(s): Neda Abdelhamid, Arun Padmavathy, David Peebles, Fadi Thabtah, Daymond Goulder-Horobin

Machine learning (ML) is a branch of computer science that is rapidly gaining popularity within the healthcare arena due to its ability to explore large datasets and discover useful patterns that can be interpreted for decision-making and prediction. ML techniques are used for the analysis of clinical parameters and their combinations for prognosis, therapy planning and support, and patient management and wellbeing. In this research, we investigate a crucial problem associated with medical applications such as autism spectrum disorder (ASD): data imbalance, in which instances of one class far outnumber those of the other. In autism diagnosis data, the number of instances linked with the no-ASD class is larger than that of the ASD class, and this may cause performance issues such as models favouring the majority class and undermining the minority class. This research experimentally measures the impact of the class imbalance issue on the performance of different classifiers on real autism datasets when various data imbalance approaches are utilised in the pre-processing phase. We employ oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE), and undersampling with different classifiers including Naive Bayes, RIPPER, C4.5 and Random Forest to measure their impact on the performance of the derived models in terms of area under the curve and other metrics. The results indicate that oversampling techniques are superior to undersampling techniques, at least for the toddlers' autism dataset that we consider, and suggest that further work should look at incorporating sampling techniques with feature selection to generate models that do not overfit the dataset.
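
A hedged sketch of this experimental design follows, with synthetic data standing in for the autism screening datasets and scikit-learn stand-ins for RIPPER and C4.5 (which have no direct scikit-learn equivalents).

```python
# Sketch: compare SMOTE oversampling against simple undersampling for
# several classifiers, scored by AUC. Synthetic data stands in for the
# autism screening datasets; library choices are assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=7)

classifiers = {"NaiveBayes": GaussianNB(),
               "C4.5-like tree": DecisionTreeClassifier(random_state=7),
               "RandomForest": RandomForestClassifier(random_state=7)}
samplers = {"SMOTE": SMOTE(random_state=7),
            "undersampling": RandomUnderSampler(random_state=7)}

for cname, clf in classifiers.items():
    for sname, sampler in samplers.items():
        pipe = Pipeline([("resample", sampler), ("clf", clf)])
        auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{cname:15s} + {sname:13s} AUC = {auc:.3f}")
```

Note that imbalanced-learn's Pipeline applies resampling only inside the training folds of cross-validation, which avoids leaking resampled points into the test folds.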


Author(s): Saman Riaz, Ali Arshad, Licheng Jiao

Software fault prediction is a highly consequential research topic for software quality assurance. Data-driven approaches provide robust mechanisms to deal with software fault prediction; however, the prediction performance of a model depends strongly on the quality of the dataset. Many software datasets suffer from the problem of class imbalance. In this regard, under-sampling is a popular data pre-processing method for dealing with the class imbalance problem, and Easy Ensemble (EE) presents a robust approach to achieve a high classification rate and address the bias towards majority class samples. However, class imbalance is not the only issue that harms the performance of classifiers: noisy examples and irrelevant features may additionally reduce the predictive accuracy of the classifier. In this paper, we propose a two-stage data pre-processing approach which incorporates feature selection and a new Rough set Easy Ensemble scheme. In the feature selection stage, we eliminate irrelevant features with a feature ranking algorithm. In the second stage, we apply a Rough K-nearest-neighbor rule filter (RK) before executing Easy Ensemble (EE), a combination named RKEE for short. RK can remove noisy examples from both the minority and the majority class. Experimental evaluation on real-world software projects, such as the NASA and Eclipse datasets, is performed to demonstrate the effectiveness of our proposed approach. Furthermore, this paper comprehensively investigates the influencing factors in our approach, such as the impact of Rough set theory on the noise filter and the relationship between model performance and imbalance ratio. Comprehensive experiments indicate that the proposed approach shows outstanding performance with significance in terms of area under the curve (AUC).
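
The RK filter is the authors' contribution and has no public implementation; in the sketch below, assuming scikit-learn and imbalanced-learn, an edited-nearest-neighbours filter stands in for it to illustrate the two-stage idea of feature ranking followed by noise filtering and Easy Ensemble.

```python
# Sketch of the two-stage idea: (1) drop low-ranked features, (2) filter
# noisy examples, then train Easy Ensemble. ENN stands in for the authors'
# RK filter; unlike RK, ENN's default cleans only majority-class examples.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.ensemble import EasyEnsembleClassifier

X, y = make_classification(n_samples=1200, n_features=30, n_informative=8,
                           weights=[0.92, 0.08], flip_y=0.05, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

# Stage 1: feature ranking -- keep the 10 highest-scoring features.
selector = SelectKBest(mutual_info_classif, k=10).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# Stage 2: noise filtering (ENN here, in place of RK), then Easy Ensemble.
X_clean, y_clean = EditedNearestNeighbours().fit_resample(X_tr_sel, y_tr)
model = EasyEnsembleClassifier(random_state=3).fit(X_clean, y_clean)

print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te_sel)[:, 1]))
```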


2017 · Vol 42 (2) · pp. 149-176
Author(s): Szymon Wojciechowski, Szymon Wilk

Abstract In this paper we describe the results of an experimental study in which we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers, applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically vary factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the latter factor was the most critical one and that it exacerbated other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, particularly by k-NN classifiers (with 1 or 3 neighbors, denoted 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN.
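
The safe/borderline/rare/outlier categorization is commonly derived from the class makeup of each minority example's five nearest neighbours; below is a sketch of that heuristic, assuming scikit-learn (the paper's own data generator is not reproduced).

```python
# Sketch: label minority examples as safe / borderline / rare / outlier
# from the class makeup of their 5 nearest neighbours -- the common
# heuristic behind the difficulty factors studied in the paper.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           flip_y=0.03, random_state=5)
nn = NearestNeighbors(n_neighbors=6).fit(X)        # 6 = self + 5 neighbours

labels = []
for i in np.where(y == 1)[0]:
    idx = nn.kneighbors(X[i:i + 1], return_distance=False)[0][1:]
    same = int((y[idx] == 1).sum())                # minority neighbours of 5
    labels.append("safe" if same >= 4 else
                  "borderline" if same >= 2 else
                  "rare" if same == 1 else "outlier")

print(Counter(labels))
```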


2020 · Vol 30 (Supplement_5)
Author(s): M Poldrugovac, J E Amuah, H Wei-Randall, P Sidhom, K Morris, ...

Abstract Background Evidence of the impact of public reporting of healthcare performance on quality improvement is not yet sufficient to draw conclusions with certainty, despite the important policy implications. This study explored the impact of implementing public reporting of performance indicators of long-term care facilities in Canada. The objective was to analyse whether improvements can be observed in performance measures after publication. Methods We considered 16 performance indicators in long-term care in Canada, 8 of which are publicly reported at a facility level, while the other 8 are privately reported. We analysed data from the Continuing Care Reporting System managed by the Canadian Institute for Health Information, based on information collected with RAI-MDS 2.0© between the fiscal years 2011 and 2018. A multilevel model was developed to analyse time trends before and after publication, which started in 2015. The analysis was also stratified by key sample characteristics, such as the facilities' jurisdiction, size, urban or rural location, and performance prior to publication. Results Data from 1087 long-term care facilities were included. Among the 8 publicly reported indicators, the trend in the period after publication did not change significantly in 5 cases, improved in 2 cases and worsened in 1 case. Among the 8 privately reported indicators, no change was observed in 7 and worsening was observed in 1. The stratification of the data suggests that for those indicators that were already improving prior to public reporting, there was either no change in trend or a decrease in the rate of improvement after publication. For those indicators that showed a worsening trend prior to public reporting, the contrary was observed. Conclusions Our findings suggest that public reporting of performance data can support change. The trends of performance indicators prior to publication appear to have an impact on whether further change occurs after publication. Key messages Public reporting is likely one of the factors affecting change in performance in long-term care facilities. Public reporting of performance measures in long-term care facilities may support improvements, in particular in cases where improvement was not observed before publication.


Author(s): Grant Duwe

As the use of risk assessments for correctional populations has grown, so has concern that these instruments exacerbate existing racial and ethnic disparities. While much of the attention arising from this concern has focused on how algorithms are designed, relatively little consideration has been given to how risk assessments are used. To this end, the present study tests whether application of the risk principle would help preserve predictive accuracy while, at the same time, mitigating disparities. Using a sample of 9,529 inmates released from Minnesota prisons who had been assessed multiple times during their confinement on a fully automated risk assessment, this study relies on both actual and simulated data to examine the impact of program assignment decisions on changes in risk level from intake to release. The findings showed that while the risk principle was used in practice to some extent, greater adherence to the risk principle would increase reductions in risk levels and minimize the disparities observed at intake. The simulated data further revealed that the most favorable outcomes would be achieved not only by applying the risk principle, but also by expanding program capacity for higher-risk inmates in order to adequately reduce their risk.


Energies · 2021 · Vol 14 (5) · pp. 1253
Author(s): Maja Piesiewicz, Marlena Ciechan-Kujawa, Paweł Kufel

Integrated reports combine financial and non-financial data into a comprehensive report outlining the company's value creation process. Our objective is to assess the completeness of disclosures, a crucial aspect of an integrated report's quality. This study contributes to the examination of integrated reporting by identifying quantitative and qualitative gaps in the application of Integrated Reporting standards, focusing on the energy sector. We conducted the study on 57 published integrated reports of listed companies in Poland. The content of each report was examined for 49 features divided into eight areas. We identify the strengths and weaknesses of current reporting practice and the impact of the company's sector on report quality. We noted that there are significant differences among the areas. The major problems concern implementing the IIRC's framework on the connections between the business model and the organization's strategy, risks, opportunities, and performance. We also noted that the level of specific disclosures may be related to a company's ownership structure. We investigated the significance of differences between companies from the energy and non-energy sectors using statistical methods. The study shows that the completeness of disclosures depends on the sector of operation: companies in the energy sector publish higher-quality integrated reports than companies in other sectors.

